mirror of
https://github.com/AetherDroid/android_kernel_samsung_on5xelte.git
synced 2025-09-05 07:57:45 -04:00
Fixed MTP to work with TWRP
This commit is contained in:
commit
f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions
20
Documentation/x86/00-INDEX
Normal file
20
Documentation/x86/00-INDEX
Normal file
|
@ -0,0 +1,20 @@
|
|||
00-INDEX
|
||||
- this file
|
||||
boot.txt
|
||||
- List of boot protocol versions
|
||||
early-microcode.txt
|
||||
- How to load microcode from an initrd-CPIO archive early to fix CPU issues.
|
||||
earlyprintk.txt
|
||||
- Using earlyprintk with a USB2 debug port key.
|
||||
entry_64.txt
|
||||
- Describe (some of the) kernel entry points for x86.
|
||||
exception-tables.txt
|
||||
- why and how Linux kernel uses exception tables on x86
|
||||
mtrr.txt
|
||||
- how to use x86 Memory Type Range Registers to increase performance
|
||||
pat.txt
|
||||
- Page Attribute Table intro and API
|
||||
usb-legacy-support.txt
|
||||
- how to fix/avoid quirks when using emulated PS/2 mouse/keyboard.
|
||||
zero-page.txt
|
||||
- layout of the first page of memory.
|
1125
Documentation/x86/boot.txt
Normal file
1125
Documentation/x86/boot.txt
Normal file
File diff suppressed because it is too large
Load diff
42
Documentation/x86/early-microcode.txt
Normal file
42
Documentation/x86/early-microcode.txt
Normal file
|
@ -0,0 +1,42 @@
|
|||
Early load microcode
|
||||
====================
|
||||
By Fenghua Yu <fenghua.yu@intel.com>
|
||||
|
||||
Kernel can update microcode in early phase of boot time. Loading microcode early
|
||||
can fix CPU issues before they are observed during kernel boot time.
|
||||
|
||||
Microcode is stored in an initrd file. The microcode is read from the initrd
|
||||
file and loaded to CPUs during boot time.
|
||||
|
||||
The format of the combined initrd image is microcode in cpio format followed by
|
||||
the initrd image (maybe compressed). Kernel parses the combined initrd image
|
||||
during boot time. The microcode file in cpio name space is:
|
||||
on Intel: kernel/x86/microcode/GenuineIntel.bin
|
||||
on AMD : kernel/x86/microcode/AuthenticAMD.bin
|
||||
|
||||
During BSP boot (before SMP starts), if the kernel finds the microcode file in
|
||||
the initrd file, it parses the microcode and saves matching microcode in memory.
|
||||
If matching microcode is found, it will be uploaded in BSP and later on in all
|
||||
APs.
|
||||
|
||||
The cached microcode patch is applied when CPUs resume from a sleep state.
|
||||
|
||||
There are two legacy user space interfaces to load microcode, either through
|
||||
/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
|
||||
in sysfs.
|
||||
|
||||
In addition to these two legacy methods, the early loading method described
|
||||
here is the third method with which microcode can be uploaded to a system's
|
||||
CPUs.
|
||||
|
||||
The following example script shows how to generate a new combined initrd file in
|
||||
/boot/initrd-3.5.0.ucode.img with original microcode microcode.bin and
|
||||
original initrd image /boot/initrd-3.5.0.img.
|
||||
|
||||
mkdir initrd
|
||||
cd initrd
|
||||
mkdir -p kernel/x86/microcode
|
||||
cp ../microcode.bin kernel/x86/microcode/GenuineIntel.bin (or AuthenticAMD.bin)
|
||||
find . | cpio -o -H newc >../ucode.cpio
|
||||
cd ..
|
||||
cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img
|
136
Documentation/x86/earlyprintk.txt
Normal file
136
Documentation/x86/earlyprintk.txt
Normal file
|
@ -0,0 +1,136 @@
|
|||
|
||||
Mini-HOWTO for using the earlyprintk=dbgp boot option with a
|
||||
USB2 Debug port key and a debug cable, on x86 systems.
|
||||
|
||||
You need two computers, the 'USB debug key' special gadget and
|
||||
and two USB cables, connected like this:
|
||||
|
||||
[host/target] <-------> [USB debug key] <-------> [client/console]
|
||||
|
||||
1. There are a number of specific hardware requirements:
|
||||
|
||||
a.) Host/target system needs to have USB debug port capability.
|
||||
|
||||
You can check this capability by looking at a 'Debug port' bit in
|
||||
the lspci -vvv output:
|
||||
|
||||
# lspci -vvv
|
||||
...
|
||||
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI])
|
||||
Subsystem: Lenovo ThinkPad T61
|
||||
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
|
||||
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
|
||||
Latency: 0
|
||||
Interrupt: pin D routed to IRQ 19
|
||||
Region 0: Memory at fe227000 (32-bit, non-prefetchable) [size=1K]
|
||||
Capabilities: [50] Power Management version 2
|
||||
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
|
||||
Status: D0 PME-Enable- DSel=0 DScale=0 PME+
|
||||
Capabilities: [58] Debug port: BAR=1 offset=00a0
|
||||
^^^^^^^^^^^ <==================== [ HERE ]
|
||||
Kernel driver in use: ehci_hcd
|
||||
Kernel modules: ehci-hcd
|
||||
...
|
||||
|
||||
( If your system does not list a debug port capability then you probably
|
||||
won't be able to use the USB debug key. )
|
||||
|
||||
b.) You also need a Netchip USB debug cable/key:
|
||||
|
||||
http://www.plxtech.com/products/NET2000/NET20DC/default.asp
|
||||
|
||||
This is a small blue plastic connector with two USB connections,
|
||||
it draws power from its USB connections.
|
||||
|
||||
c.) You need a second client/console system with a high speed USB 2.0
|
||||
port.
|
||||
|
||||
d.) The Netchip device must be plugged directly into the physical
|
||||
debug port on the "host/target" system. You cannot use a USB hub in
|
||||
between the physical debug port and the "host/target" system.
|
||||
|
||||
The EHCI debug controller is bound to a specific physical USB
|
||||
port and the Netchip device will only work as an early printk
|
||||
device in this port. The EHCI host controllers are electrically
|
||||
wired such that the EHCI debug controller is hooked up to the
|
||||
first physical and there is no way to change this via software.
|
||||
You can find the physical port through experimentation by trying
|
||||
each physical port on the system and rebooting. Or you can try
|
||||
and use lsusb or look at the kernel info messages emitted by the
|
||||
usb stack when you plug a usb device into various ports on the
|
||||
"host/target" system.
|
||||
|
||||
Some hardware vendors do not expose the usb debug port with a
|
||||
physical connector and if you find such a device send a complaint
|
||||
to the hardware vendor, because there is no reason not to wire
|
||||
this port into one of the physically accessible ports.
|
||||
|
||||
e.) It is also important to note, that many versions of the Netchip
|
||||
device require the "client/console" system to be plugged into the
|
||||
right and side of the device (with the product logo facing up and
|
||||
readable left to right). The reason being is that the 5 volt
|
||||
power supply is taken from only one side of the device and it
|
||||
must be the side that does not get rebooted.
|
||||
|
||||
2. Software requirements:
|
||||
|
||||
a.) On the host/target system:
|
||||
|
||||
You need to enable the following kernel config option:
|
||||
|
||||
CONFIG_EARLY_PRINTK_DBGP=y
|
||||
|
||||
And you need to add the boot command line: "earlyprintk=dbgp".
|
||||
(If you are using Grub, append it to the 'kernel' line in
|
||||
/etc/grub.conf)
|
||||
|
||||
On systems with more than one EHCI debug controller you must
|
||||
specify the correct EHCI debug controller number. The ordering
|
||||
comes from the PCI bus enumeration of the EHCI controllers. The
|
||||
default with no number argument is "0" the first EHCI debug
|
||||
controller. To use the second EHCI debug controller, you would
|
||||
use the command line: "earlyprintk=dbgp1"
|
||||
|
||||
NOTE: normally earlyprintk console gets turned off once the
|
||||
regular console is alive - use "earlyprintk=dbgp,keep" to keep
|
||||
this channel open beyond early bootup. This can be useful for
|
||||
debugging crashes under Xorg, etc.
|
||||
|
||||
b.) On the client/console system:
|
||||
|
||||
You should enable the following kernel config option:
|
||||
|
||||
CONFIG_USB_SERIAL_DEBUG=y
|
||||
|
||||
On the next bootup with the modified kernel you should
|
||||
get a /dev/ttyUSBx device(s).
|
||||
|
||||
Now this channel of kernel messages is ready to be used: start
|
||||
your favorite terminal emulator (minicom, etc.) and set
|
||||
it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to
|
||||
see the raw output.
|
||||
|
||||
c.) On Nvidia Southbridge based systems: the kernel will try to probe
|
||||
and find out which port has debug device connected.
|
||||
|
||||
3. Testing that it works fine:
|
||||
|
||||
You can test the output by using earlyprintk=dbgp,keep and provoking
|
||||
kernel messages on the host/target system. You can provoke a harmless
|
||||
kernel message by for example doing:
|
||||
|
||||
echo h > /proc/sysrq-trigger
|
||||
|
||||
On the host/target system you should see this help line in "dmesg" output:
|
||||
|
||||
SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
|
||||
|
||||
On the client/console system do:
|
||||
|
||||
cat /dev/ttyUSB0
|
||||
|
||||
And you should see the help line above displayed shortly after you've
|
||||
provoked it on the host system.
|
||||
|
||||
If it does not work then please ask about it on the linux-kernel@vger.kernel.org
|
||||
mailing list or contact the x86 maintainers.
|
95
Documentation/x86/entry_64.txt
Normal file
95
Documentation/x86/entry_64.txt
Normal file
|
@ -0,0 +1,95 @@
|
|||
This file documents some of the kernel entries in
|
||||
arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from
|
||||
an email from Ingo Molnar:
|
||||
|
||||
http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
|
||||
|
||||
The x86 architecture has quite a few different ways to jump into
|
||||
kernel code. Most of these entry points are registered in
|
||||
arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S
|
||||
and arch/x86/ia32/ia32entry.S.
|
||||
|
||||
The IDT vector assignments are listed in arch/x86/include/irq_vectors.h.
|
||||
|
||||
Some of these entries are:
|
||||
|
||||
- system_call: syscall instruction from 64-bit code.
|
||||
|
||||
- ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall
|
||||
either way.
|
||||
|
||||
- ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit
|
||||
code
|
||||
|
||||
- interrupt: An array of entries. Every IDT vector that doesn't
|
||||
explicitly point somewhere else gets set to the corresponding
|
||||
value in interrupts. These point to a whole array of
|
||||
magically-generated functions that make their way to do_IRQ with
|
||||
the interrupt number as a parameter.
|
||||
|
||||
- APIC interrupts: Various special-purpose interrupts for things
|
||||
like TLB shootdown.
|
||||
|
||||
- Architecturally-defined exceptions like divide_error.
|
||||
|
||||
There are a few complexities here. The different x86-64 entries
|
||||
have different calling conventions. The syscall and sysenter
|
||||
instructions have their own peculiar calling conventions. Some of
|
||||
the IDT entries push an error code onto the stack; others don't.
|
||||
IDT entries using the IST alternative stack mechanism need their own
|
||||
magic to get the stack frames right. (You can find some
|
||||
documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
|
||||
Volume 3, Chapter 6.)
|
||||
|
||||
Dealing with the swapgs instruction is especially tricky. Swapgs
|
||||
toggles whether gs is the kernel gs or the user gs. The swapgs
|
||||
instruction is rather fragile: it must nest perfectly and only in
|
||||
single depth, it should only be used if entering from user mode to
|
||||
kernel mode and then when returning to user-space, and precisely
|
||||
so. If we mess that up even slightly, we crash.
|
||||
|
||||
So when we have a secondary entry, already in kernel mode, we *must
|
||||
not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
|
||||
not switched/swapped yet.
|
||||
|
||||
Now, there's a secondary complication: there's a cheap way to test
|
||||
which mode the CPU is in and an expensive way.
|
||||
|
||||
The cheap way is to pick this info off the entry frame on the kernel
|
||||
stack, from the CS of the ptregs area of the kernel stack:
|
||||
|
||||
xorl %ebx,%ebx
|
||||
testl $3,CS+8(%rsp)
|
||||
je error_kernelspace
|
||||
SWAPGS
|
||||
|
||||
The expensive (paranoid) way is to read back the MSR_GS_BASE value
|
||||
(which is what SWAPGS modifies):
|
||||
|
||||
movl $1,%ebx
|
||||
movl $MSR_GS_BASE,%ecx
|
||||
rdmsr
|
||||
testl %edx,%edx
|
||||
js 1f /* negative -> in kernel */
|
||||
SWAPGS
|
||||
xorl %ebx,%ebx
|
||||
1: ret
|
||||
|
||||
and the whole paranoid non-paranoid macro complexity is about whether
|
||||
to suffer that RDMSR cost.
|
||||
|
||||
If we are at an interrupt or user-trap/gate-alike boundary then we can
|
||||
use the faster check: the stack will be a reliable indicator of
|
||||
whether SWAPGS was already done: if we see that we are a secondary
|
||||
entry interrupting kernel mode execution, then we know that the GS
|
||||
base has already been switched. If it says that we interrupted
|
||||
user-space execution then we must do the SWAPGS.
|
||||
|
||||
But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
|
||||
which might have triggered right after a normal entry wrote CS to the
|
||||
stack but before we executed SWAPGS, then the only safe way to check
|
||||
for GS is the slower method: the RDMSR.
|
||||
|
||||
So we try only to mark those entry methods 'paranoid' that absolutely
|
||||
need the more expensive check for the GS base - and we generate all
|
||||
'normal' entry points with the regular (faster) entry macros.
|
292
Documentation/x86/exception-tables.txt
Normal file
292
Documentation/x86/exception-tables.txt
Normal file
|
@ -0,0 +1,292 @@
|
|||
Kernel level exception handling in Linux
|
||||
Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
|
||||
|
||||
When a process runs in kernel mode, it often has to access user
|
||||
mode memory whose address has been passed by an untrusted program.
|
||||
To protect itself the kernel has to verify this address.
|
||||
|
||||
In older versions of Linux this was done with the
|
||||
int verify_area(int type, const void * addr, unsigned long size)
|
||||
function (which has since been replaced by access_ok()).
|
||||
|
||||
This function verified that the memory area starting at address
|
||||
'addr' and of size 'size' was accessible for the operation specified
|
||||
in type (read or write). To do this, verify_read had to look up the
|
||||
virtual memory area (vma) that contained the address addr. In the
|
||||
normal case (correctly working program), this test was successful.
|
||||
It only failed for a few buggy programs. In some kernel profiling
|
||||
tests, this normally unneeded verification used up a considerable
|
||||
amount of time.
|
||||
|
||||
To overcome this situation, Linus decided to let the virtual memory
|
||||
hardware present in every Linux-capable CPU handle this test.
|
||||
|
||||
How does this work?
|
||||
|
||||
Whenever the kernel tries to access an address that is currently not
|
||||
accessible, the CPU generates a page fault exception and calls the
|
||||
page fault handler
|
||||
|
||||
void do_page_fault(struct pt_regs *regs, unsigned long error_code)
|
||||
|
||||
in arch/x86/mm/fault.c. The parameters on the stack are set up by
|
||||
the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
|
||||
regs is a pointer to the saved registers on the stack, error_code
|
||||
contains a reason code for the exception.
|
||||
|
||||
do_page_fault first obtains the unaccessible address from the CPU
|
||||
control register CR2. If the address is within the virtual address
|
||||
space of the process, the fault probably occurred, because the page
|
||||
was not swapped in, write protected or something similar. However,
|
||||
we are interested in the other case: the address is not valid, there
|
||||
is no vma that contains this address. In this case, the kernel jumps
|
||||
to the bad_area label.
|
||||
|
||||
There it uses the address of the instruction that caused the exception
|
||||
(i.e. regs->eip) to find an address where the execution can continue
|
||||
(fixup). If this search is successful, the fault handler modifies the
|
||||
return address (again regs->eip) and returns. The execution will
|
||||
continue at the address in fixup.
|
||||
|
||||
Where does fixup point to?
|
||||
|
||||
Since we jump to the contents of fixup, fixup obviously points
|
||||
to executable code. This code is hidden inside the user access macros.
|
||||
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
|
||||
as an example. The definition is somewhat hard to follow, so let's peek at
|
||||
the code generated by the preprocessor and the compiler. I selected
|
||||
the get_user call in drivers/char/sysrq.c for a detailed examination.
|
||||
|
||||
The original code in sysrq.c line 587:
|
||||
get_user(c, buf);
|
||||
|
||||
The preprocessor output (edited to become somewhat readable):
|
||||
|
||||
(
|
||||
{
|
||||
long __gu_err = - 14 , __gu_val = 0;
|
||||
const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
|
||||
if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
|
||||
(((sizeof(*(buf))) <= 0xC0000000UL) &&
|
||||
((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
|
||||
do {
|
||||
__gu_err = 0;
|
||||
switch ((sizeof(*(buf)))) {
|
||||
case 1:
|
||||
__asm__ __volatile__(
|
||||
"1: mov" "b" " %2,%" "b" "1\n"
|
||||
"2:\n"
|
||||
".section .fixup,\"ax\"\n"
|
||||
"3: movl %3,%0\n"
|
||||
" xor" "b" " %" "b" "1,%" "b" "1\n"
|
||||
" jmp 2b\n"
|
||||
".section __ex_table,\"a\"\n"
|
||||
" .align 4\n"
|
||||
" .long 1b,3b\n"
|
||||
".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
|
||||
( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
|
||||
break;
|
||||
case 2:
|
||||
__asm__ __volatile__(
|
||||
"1: mov" "w" " %2,%" "w" "1\n"
|
||||
"2:\n"
|
||||
".section .fixup,\"ax\"\n"
|
||||
"3: movl %3,%0\n"
|
||||
" xor" "w" " %" "w" "1,%" "w" "1\n"
|
||||
" jmp 2b\n"
|
||||
".section __ex_table,\"a\"\n"
|
||||
" .align 4\n"
|
||||
" .long 1b,3b\n"
|
||||
".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
|
||||
( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
|
||||
break;
|
||||
case 4:
|
||||
__asm__ __volatile__(
|
||||
"1: mov" "l" " %2,%" "" "1\n"
|
||||
"2:\n"
|
||||
".section .fixup,\"ax\"\n"
|
||||
"3: movl %3,%0\n"
|
||||
" xor" "l" " %" "" "1,%" "" "1\n"
|
||||
" jmp 2b\n"
|
||||
".section __ex_table,\"a\"\n"
|
||||
" .align 4\n" " .long 1b,3b\n"
|
||||
".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
|
||||
( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
|
||||
break;
|
||||
default:
|
||||
(__gu_val) = __get_user_bad();
|
||||
}
|
||||
} while (0) ;
|
||||
((c)) = (__typeof__(*((buf))))__gu_val;
|
||||
__gu_err;
|
||||
}
|
||||
);
|
||||
|
||||
WOW! Black GCC/assembly magic. This is impossible to follow, so let's
|
||||
see what code gcc generates:
|
||||
|
||||
> xorl %edx,%edx
|
||||
> movl current_set,%eax
|
||||
> cmpl $24,788(%eax)
|
||||
> je .L1424
|
||||
> cmpl $-1073741825,64(%esp)
|
||||
> ja .L1423
|
||||
> .L1424:
|
||||
> movl %edx,%eax
|
||||
> movl 64(%esp),%ebx
|
||||
> #APP
|
||||
> 1: movb (%ebx),%dl /* this is the actual user access */
|
||||
> 2:
|
||||
> .section .fixup,"ax"
|
||||
> 3: movl $-14,%eax
|
||||
> xorb %dl,%dl
|
||||
> jmp 2b
|
||||
> .section __ex_table,"a"
|
||||
> .align 4
|
||||
> .long 1b,3b
|
||||
> .text
|
||||
> #NO_APP
|
||||
> .L1423:
|
||||
> movzbl %dl,%esi
|
||||
|
||||
The optimizer does a good job and gives us something we can actually
|
||||
understand. Can we? The actual user access is quite obvious. Thanks
|
||||
to the unified address space we can just access the address in user
|
||||
memory. But what does the .section stuff do?????
|
||||
|
||||
To understand this we have to look at the final kernel:
|
||||
|
||||
> objdump --section-headers vmlinux
|
||||
>
|
||||
> vmlinux: file format elf32-i386
|
||||
>
|
||||
> Sections:
|
||||
> Idx Name Size VMA LMA File off Algn
|
||||
> 0 .text 00098f40 c0100000 c0100000 00001000 2**4
|
||||
> CONTENTS, ALLOC, LOAD, READONLY, CODE
|
||||
> 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0
|
||||
> CONTENTS, ALLOC, LOAD, READONLY, CODE
|
||||
> 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2
|
||||
> CONTENTS, ALLOC, LOAD, READONLY, DATA
|
||||
> 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2
|
||||
> CONTENTS, ALLOC, LOAD, READONLY, DATA
|
||||
> 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4
|
||||
> CONTENTS, ALLOC, LOAD, DATA
|
||||
> 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2
|
||||
> ALLOC
|
||||
> 6 .comment 00000ec4 00000000 00000000 000ba748 2**0
|
||||
> CONTENTS, READONLY
|
||||
> 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0
|
||||
> CONTENTS, READONLY
|
||||
|
||||
There are obviously 2 non standard ELF sections in the generated object
|
||||
file. But first we want to find out what happened to our code in the
|
||||
final kernel executable:
|
||||
|
||||
> objdump --disassemble --section=.text vmlinux
|
||||
>
|
||||
> c017e785 <do_con_write+c1> xorl %edx,%edx
|
||||
> c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax
|
||||
> c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax)
|
||||
> c017e793 <do_con_write+cf> je c017e79f <do_con_write+db>
|
||||
> c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1)
|
||||
> c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3>
|
||||
> c017e79f <do_con_write+db> movl %edx,%eax
|
||||
> c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx
|
||||
> c017e7a5 <do_con_write+e1> movb (%ebx),%dl
|
||||
> c017e7a7 <do_con_write+e3> movzbl %dl,%esi
|
||||
|
||||
The whole user memory access is reduced to 10 x86 machine instructions.
|
||||
The instructions bracketed in the .section directives are no longer
|
||||
in the normal execution path. They are located in a different section
|
||||
of the executable file:
|
||||
|
||||
> objdump --disassemble --section=.fixup vmlinux
|
||||
>
|
||||
> c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
|
||||
> c0199ffa <.fixup+10ba> xorb %dl,%dl
|
||||
> c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
|
||||
|
||||
And finally:
|
||||
> objdump --full-contents --section=__ex_table vmlinux
|
||||
>
|
||||
> c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................
|
||||
> c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
|
||||
> c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
|
||||
|
||||
or in human readable byte order:
|
||||
|
||||
> c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................
|
||||
> c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
|
||||
^^^^^^^^^^^^^^^^^
|
||||
this is the interesting part!
|
||||
> c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................
|
||||
|
||||
What happened? The assembly directives
|
||||
|
||||
.section .fixup,"ax"
|
||||
.section __ex_table,"a"
|
||||
|
||||
told the assembler to move the following code to the specified
|
||||
sections in the ELF object file. So the instructions
|
||||
3: movl $-14,%eax
|
||||
xorb %dl,%dl
|
||||
jmp 2b
|
||||
ended up in the .fixup section of the object file and the addresses
|
||||
.long 1b,3b
|
||||
ended up in the __ex_table section of the object file. 1b and 3b
|
||||
are local labels. The local label 1b (1b stands for next label 1
|
||||
backward) is the address of the instruction that might fault, i.e.
|
||||
in our case the address of the label 1 is c017e7a5:
|
||||
the original assembly code: > 1: movb (%ebx),%dl
|
||||
and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
|
||||
|
||||
The local label 3 (backwards again) is the address of the code to handle
|
||||
the fault, in our case the actual value is c0199ff5:
|
||||
the original assembly code: > 3: movl $-14,%eax
|
||||
and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
|
||||
|
||||
The assembly code
|
||||
> .section __ex_table,"a"
|
||||
> .align 4
|
||||
> .long 1b,3b
|
||||
|
||||
becomes the value pair
|
||||
> c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
|
||||
^this is ^this is
|
||||
1b 3b
|
||||
c017e7a5,c0199ff5 in the exception table of the kernel.
|
||||
|
||||
So, what actually happens if a fault from kernel mode with no suitable
|
||||
vma occurs?
|
||||
|
||||
1.) access to invalid address:
|
||||
> c017e7a5 <do_con_write+e1> movb (%ebx),%dl
|
||||
2.) MMU generates exception
|
||||
3.) CPU calls do_page_fault
|
||||
4.) do page fault calls search_exception_table (regs->eip == c017e7a5);
|
||||
5.) search_exception_table looks up the address c017e7a5 in the
|
||||
exception table (i.e. the contents of the ELF section __ex_table)
|
||||
and returns the address of the associated fault handle code c0199ff5.
|
||||
6.) do_page_fault modifies its own return address to point to the fault
|
||||
handle code and returns.
|
||||
7.) execution continues in the fault handling code.
|
||||
8.) 8a) EAX becomes -EFAULT (== -14)
|
||||
8b) DL becomes zero (the value we "read" from user space)
|
||||
8c) execution continues at local label 2 (address of the
|
||||
instruction immediately after the faulting user access).
|
||||
|
||||
The steps 8a to 8c in a certain way emulate the faulting instruction.
|
||||
|
||||
That's it, mostly. If you look at our example, you might ask why
|
||||
we set EAX to -EFAULT in the exception handler code. Well, the
|
||||
get_user macro actually returns a value: 0, if the user access was
|
||||
successful, -EFAULT on failure. Our original code did not test this
|
||||
return value, however the inline assembly code in get_user tries to
|
||||
return -EFAULT. GCC selected EAX to return this value.
|
||||
|
||||
NOTE:
|
||||
Due to the way that the exception table is built and needs to be ordered,
|
||||
only use exceptions for code in the .text section. Any other section
|
||||
will cause the exception table to not be sorted correctly, and the
|
||||
exceptions will fail.
|
119
Documentation/x86/i386/IO-APIC.txt
Normal file
119
Documentation/x86/i386/IO-APIC.txt
Normal file
|
@ -0,0 +1,119 @@
|
|||
Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC',
|
||||
which is an enhanced interrupt controller. It enables us to route
|
||||
hardware interrupts to multiple CPUs, or to CPU groups. Without an
|
||||
IO-APIC, interrupts from hardware will be delivered only to the
|
||||
CPU which boots the operating system (usually CPU#0).
|
||||
|
||||
Linux supports all variants of compliant SMP boards, including ones with
|
||||
multiple IO-APICs. Multiple IO-APICs are used in high-end servers to
|
||||
distribute IRQ load further.
|
||||
|
||||
There are (a few) known breakages in certain older boards, such bugs are
|
||||
usually worked around by the kernel. If your MP-compliant SMP board does
|
||||
not boot Linux, then consult the linux-smp mailing list archives first.
|
||||
|
||||
If your box boots fine with enabled IO-APIC IRQs, then your
|
||||
/proc/interrupts will look like this one:
|
||||
|
||||
---------------------------->
|
||||
hell:~> cat /proc/interrupts
|
||||
CPU0
|
||||
0: 1360293 IO-APIC-edge timer
|
||||
1: 4 IO-APIC-edge keyboard
|
||||
2: 0 XT-PIC cascade
|
||||
13: 1 XT-PIC fpu
|
||||
14: 1448 IO-APIC-edge ide0
|
||||
16: 28232 IO-APIC-level Intel EtherExpress Pro 10/100 Ethernet
|
||||
17: 51304 IO-APIC-level eth0
|
||||
NMI: 0
|
||||
ERR: 0
|
||||
hell:~>
|
||||
<----------------------------
|
||||
|
||||
Some interrupts are still listed as 'XT PIC', but this is not a problem;
|
||||
none of those IRQ sources is performance-critical.
|
||||
|
||||
|
||||
In the unlikely case that your board does not create a working mp-table,
|
||||
you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This
|
||||
is non-trivial though and cannot be automated. One sample /etc/lilo.conf
|
||||
entry:
|
||||
|
||||
append="pirq=15,11,10"
|
||||
|
||||
The actual numbers depend on your system, on your PCI cards and on their
|
||||
PCI slot position. Usually PCI slots are 'daisy chained' before they are
|
||||
connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4
|
||||
lines):
|
||||
|
||||
,-. ,-. ,-. ,-. ,-.
|
||||
PIRQ4 ----| |-. ,-| |-. ,-| |-. ,-| |--------| |
|
||||
|S| \ / |S| \ / |S| \ / |S| |S|
|
||||
PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l|
|
||||
|o| \/ |o| \/ |o| \/ |o| |o|
|
||||
PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t|
|
||||
|1| /\ |2| /\ |3| /\ |4| |5|
|
||||
PIRQ1 ----| |- `----| |- `----| |- `----| |--------| |
|
||||
`-' `-' `-' `-' `-'
|
||||
|
||||
Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD:
|
||||
|
||||
,-.
|
||||
INTD--| |
|
||||
|S|
|
||||
INTC--|l|
|
||||
|o|
|
||||
INTB--|t|
|
||||
|x|
|
||||
INTA--| |
|
||||
`-'
|
||||
|
||||
These INTA-D PCI IRQs are always 'local to the card', their real meaning
|
||||
depends on which slot they are in. If you look at the daisy chaining diagram,
|
||||
a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of
|
||||
the PCI chipset. Most cards issue INTA, this creates optimal distribution
|
||||
between the PIRQ lines. (distributing IRQ sources properly is not a
|
||||
necessity, PCI IRQs can be shared at will, but it's a good for performance
|
||||
to have non shared interrupts). Slot5 should be used for videocards, they
|
||||
do not use interrupts normally, thus they are not daisy chained either.
|
||||
|
||||
so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in
|
||||
Slot2, then you'll have to specify this pirq= line:
|
||||
|
||||
append="pirq=11,9"
|
||||
|
||||
the following script tries to figure out such a default pirq= line from
|
||||
your PCI configuration:
|
||||
|
||||
echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
|
||||
|
||||
note that this script won't work if you have skipped a few slots or if your
|
||||
board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
|
||||
connected in some strange way). E.g. if in the above case you have your SCSI
|
||||
card (IRQ11) in Slot3, and have Slot1 empty:
|
||||
|
||||
append="pirq=0,9,11"
|
||||
|
||||
[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting)
|
||||
slots.]
|
||||
|
||||
Generally, it's always possible to find out the correct pirq= settings, just
|
||||
permute all IRQ numbers properly ... it will take some time though. An
|
||||
'incorrect' pirq line will cause the booting process to hang, or a device
|
||||
won't function properly (e.g. if it's inserted as a module).
|
||||
|
||||
If you have 2 PCI buses, then you can use up to 8 pirq values, although such
|
||||
boards tend to have a good configuration.
|
||||
|
||||
Be prepared that it might happen that you need some strange pirq line:
|
||||
|
||||
append="pirq=0,0,0,0,0,0,9,11"
|
||||
|
||||
Use smart trial-and-error techniques to find out the correct pirq line ...
|
||||
|
||||
Good luck and mail to linux-smp@vger.kernel.org or
|
||||
linux-kernel@vger.kernel.org if you have any problems that are not covered
|
||||
by this document.
|
||||
|
||||
-- mingo
|
||||
|
305
Documentation/x86/mtrr.txt
Normal file
305
Documentation/x86/mtrr.txt
Normal file
|
@ -0,0 +1,305 @@
|
|||
MTRR (Memory Type Range Register) control
|
||||
3 Jun 1999
|
||||
Richard Gooch
|
||||
<rgooch@atnf.csiro.au>
|
||||
|
||||
On Intel P6 family processors (Pentium Pro, Pentium II and later)
|
||||
the Memory Type Range Registers (MTRRs) may be used to control
|
||||
processor access to memory ranges. This is most useful when you have
|
||||
a video (VGA) card on a PCI or AGP bus. Enabling write-combining
|
||||
allows bus write transfers to be combined into a larger transfer
|
||||
before bursting over the PCI/AGP bus. This can increase performance
|
||||
of image write operations 2.5 times or more.
|
||||
|
||||
The Cyrix 6x86, 6x86MX and M II processors have Address Range
|
||||
Registers (ARRs) which provide a similar functionality to MTRRs. For
|
||||
these, the ARRs are used to emulate the MTRRs.
|
||||
|
||||
The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
|
||||
MTRRs. These are supported. The AMD Athlon family provide 8 Intel
|
||||
style MTRRs.
|
||||
|
||||
The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These
|
||||
are supported.
|
||||
|
||||
The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs.
|
||||
|
||||
The CONFIG_MTRR option creates a /proc/mtrr file which may be used
|
||||
to manipulate your MTRRs. Typically the X server should use
|
||||
this. This should have a reasonably generic interface so that
|
||||
similar control registers on other processors can be easily
|
||||
supported.
|
||||
|
||||
|
||||
There are two interfaces to /proc/mtrr: one is an ASCII interface
|
||||
which allows you to read and write. The other is an ioctl()
|
||||
interface. The ASCII interface is meant for administration. The
|
||||
ioctl() interface is meant for C programs (i.e. the X server). The
|
||||
interfaces are described below, with sample commands and C code.
|
||||
|
||||
===============================================================================
|
||||
Reading MTRRs from the shell:
|
||||
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1
|
||||
reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1
|
||||
===============================================================================
|
||||
Creating MTRRs from the C-shell:
|
||||
# echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr
|
||||
or if you use bash:
|
||||
# echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr
|
||||
|
||||
And the result thereof:
|
||||
% cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 128MB: write-back, count=1
|
||||
reg01: base=0x08000000 ( 128MB), size= 64MB: write-back, count=1
|
||||
reg02: base=0xf8000000 (3968MB), size= 4MB: write-combining, count=1
|
||||
|
||||
This is for video RAM at base address 0xf8000000 and size 4 megabytes. To
|
||||
find out your base address, you need to look at the output of your X
|
||||
server, which tells you where the linear framebuffer address is. A
|
||||
typical line that you may get is:
|
||||
|
||||
(--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000
|
||||
|
||||
Note that you should only use the value from the X server, as it may
|
||||
move the framebuffer base address, so the only value you can trust is
|
||||
that reported by the X server.
|
||||
|
||||
To find out the size of your framebuffer (what, you don't actually
|
||||
know?), the following line will tell you:
|
||||
|
||||
(--) S3: videoram: 4096k
|
||||
|
||||
That's 4 megabytes, which is 0x400000 bytes (in hexadecimal).
|
||||
A patch is being written for XFree86 which will make this automatic:
|
||||
in other words the X server will manipulate /proc/mtrr using the
|
||||
ioctl() interface, so users won't have to do anything. If you use a
|
||||
commercial X server, lobby your vendor to add support for MTRRs.
|
||||
===============================================================================
|
||||
Creating overlapping MTRRs:
|
||||
|
||||
%echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr
|
||||
%echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr
|
||||
|
||||
And the results: cat /proc/mtrr
|
||||
reg00: base=0x00000000 ( 0MB), size= 64MB: write-back, count=1
|
||||
reg01: base=0xfb000000 (4016MB), size= 16MB: write-combining, count=1
|
||||
reg02: base=0xfb000000 (4016MB), size= 4kB: uncachable, count=1
|
||||
|
||||
Some cards (especially Voodoo Graphics boards) need this 4 kB area
|
||||
excluded from the beginning of the region because it is used for
|
||||
registers.
|
||||
|
||||
NOTE: You can only create type=uncachable region, if the first
|
||||
region that you created is type=write-combining.
|
||||
===============================================================================
|
||||
Removing MTRRs from the C-shell:
|
||||
% echo "disable=2" >! /proc/mtrr
|
||||
or using bash:
|
||||
% echo "disable=2" >| /proc/mtrr
|
||||
===============================================================================
|
||||
Reading MTRRs from a C program using ioctl()'s:
|
||||
|
||||
/* mtrr-show.c
|
||||
|
||||
Source file for mtrr-show (example program to show MTRRs using ioctl()'s)
|
||||
|
||||
Copyright (C) 1997-1998 Richard Gooch
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
||||
|
||||
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
|
||||
The postal address is:
|
||||
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
|
||||
*/
|
||||
|
||||
/*
|
||||
This program will use an ioctl() on /proc/mtrr to show the current MTRR
|
||||
settings. This is an alternative to reading /proc/mtrr.
|
||||
|
||||
|
||||
Written by Richard Gooch 17-DEC-1997
|
||||
|
||||
Last updated by Richard Gooch 2-MAY-1998
|
||||
|
||||
|
||||
*/
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <string.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <errno.h>
|
||||
#include <asm/mtrr.h>
|
||||
|
||||
#define TRUE 1
|
||||
#define FALSE 0
|
||||
#define ERRSTRING strerror (errno)
|
||||
|
||||
static char *mtrr_strings[MTRR_NUM_TYPES] =
|
||||
{
|
||||
"uncachable", /* 0 */
|
||||
"write-combining", /* 1 */
|
||||
"?", /* 2 */
|
||||
"?", /* 3 */
|
||||
"write-through", /* 4 */
|
||||
"write-protect", /* 5 */
|
||||
"write-back", /* 6 */
|
||||
};
|
||||
|
||||
int main ()
|
||||
{
|
||||
int fd;
|
||||
struct mtrr_gentry gentry;
|
||||
|
||||
if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 )
|
||||
{
|
||||
if (errno == ENOENT)
|
||||
{
|
||||
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
|
||||
stderr);
|
||||
exit (1);
|
||||
}
|
||||
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (2);
|
||||
}
|
||||
for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0;
|
||||
++gentry.regnum)
|
||||
{
|
||||
if (gentry.size < 1)
|
||||
{
|
||||
fprintf (stderr, "Register: %u disabled\n", gentry.regnum);
|
||||
continue;
|
||||
}
|
||||
fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n",
|
||||
gentry.regnum, gentry.base, gentry.size,
|
||||
mtrr_strings[gentry.type]);
|
||||
}
|
||||
if (errno == EINVAL) exit (0);
|
||||
fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
|
||||
exit (3);
|
||||
} /* End Function main */
|
||||
===============================================================================
|
||||
Creating MTRRs from a C programme using ioctl()'s:
|
||||
|
||||
/* mtrr-add.c
|
||||
|
||||
Source file for mtrr-add (example programme to add an MTRRs using ioctl())
|
||||
|
||||
Copyright (C) 1997-1998 Richard Gooch
|
||||
|
||||
This program is free software; you can redistribute it and/or modify
|
||||
it under the terms of the GNU General Public License as published by
|
||||
the Free Software Foundation; either version 2 of the License, or
|
||||
(at your option) any later version.
|
||||
|
||||
This program is distributed in the hope that it will be useful,
|
||||
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
||||
GNU General Public License for more details.
|
||||
|
||||
You should have received a copy of the GNU General Public License
|
||||
along with this program; if not, write to the Free Software
|
||||
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
||||
|
||||
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
|
||||
The postal address is:
|
||||
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
|
||||
*/
|
||||
|
||||
/*
|
||||
This programme will use an ioctl() on /proc/mtrr to add an entry. The first
|
||||
available mtrr is used. This is an alternative to writing /proc/mtrr.
|
||||
|
||||
|
||||
Written by Richard Gooch 17-DEC-1997
|
||||
|
||||
Last updated by Richard Gooch 2-MAY-1998
|
||||
|
||||
|
||||
*/
|
||||
#include <stdio.h>
|
||||
#include <string.h>
|
||||
#include <stdlib.h>
|
||||
#include <unistd.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/stat.h>
|
||||
#include <fcntl.h>
|
||||
#include <sys/ioctl.h>
|
||||
#include <errno.h>
|
||||
#include <asm/mtrr.h>
|
||||
|
||||
#define TRUE 1
|
||||
#define FALSE 0
|
||||
#define ERRSTRING strerror (errno)
|
||||
|
||||
static char *mtrr_strings[MTRR_NUM_TYPES] =
|
||||
{
|
||||
"uncachable", /* 0 */
|
||||
"write-combining", /* 1 */
|
||||
"?", /* 2 */
|
||||
"?", /* 3 */
|
||||
"write-through", /* 4 */
|
||||
"write-protect", /* 5 */
|
||||
"write-back", /* 6 */
|
||||
};
|
||||
|
||||
int main (int argc, char **argv)
|
||||
{
|
||||
int fd;
|
||||
struct mtrr_sentry sentry;
|
||||
|
||||
if (argc != 4)
|
||||
{
|
||||
fprintf (stderr, "Usage:\tmtrr-add base size type\n");
|
||||
exit (1);
|
||||
}
|
||||
sentry.base = strtoul (argv[1], NULL, 0);
|
||||
sentry.size = strtoul (argv[2], NULL, 0);
|
||||
for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type)
|
||||
{
|
||||
if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break;
|
||||
}
|
||||
if (sentry.type >= MTRR_NUM_TYPES)
|
||||
{
|
||||
fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]);
|
||||
exit (2);
|
||||
}
|
||||
if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 )
|
||||
{
|
||||
if (errno == ENOENT)
|
||||
{
|
||||
fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
|
||||
stderr);
|
||||
exit (3);
|
||||
}
|
||||
fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
|
||||
exit (4);
|
||||
}
|
||||
if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1)
|
||||
{
|
||||
fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
|
||||
exit (5);
|
||||
}
|
||||
fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n");
|
||||
sleep (5);
|
||||
close (fd);
|
||||
fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n",
|
||||
stderr);
|
||||
} /* End Function main */
|
||||
===============================================================================
|
160
Documentation/x86/pat.txt
Normal file
160
Documentation/x86/pat.txt
Normal file
|
@ -0,0 +1,160 @@
|
|||
|
||||
PAT (Page Attribute Table)
|
||||
|
||||
x86 Page Attribute Table (PAT) allows for setting the memory attribute at the
|
||||
page level granularity. PAT is complementary to the MTRR settings which allows
|
||||
for setting of memory types over physical address ranges. However, PAT is
|
||||
more flexible than MTRR due to its capability to set attributes at page level
|
||||
and also due to the fact that there are no hardware limitations on number of
|
||||
such attribute settings allowed. Added flexibility comes with guidelines for
|
||||
not having memory type aliasing for the same physical memory with multiple
|
||||
virtual addresses.
|
||||
|
||||
PAT allows for different types of memory attributes. The most commonly used
|
||||
ones that will be supported at this time are Write-back, Uncached,
|
||||
Write-combined and Uncached Minus.
|
||||
|
||||
|
||||
PAT APIs
|
||||
--------
|
||||
|
||||
There are many different APIs in the kernel that allows setting of memory
|
||||
attributes at the page level. In order to avoid aliasing, these interfaces
|
||||
should be used thoughtfully. Below is a table of interfaces available,
|
||||
their intended usage and their memory attribute relationships. Internally,
|
||||
these APIs use a reserve_memtype()/free_memtype() interface on the physical
|
||||
address range to avoid any aliasing.
|
||||
|
||||
|
||||
-------------------------------------------------------------------
|
||||
API | RAM | ACPI,... | Reserved/Holes |
|
||||
-----------------------|----------|------------|------------------|
|
||||
| | | |
|
||||
ioremap | -- | UC- | UC- |
|
||||
| | | |
|
||||
ioremap_cache | -- | WB | WB |
|
||||
| | | |
|
||||
ioremap_nocache | -- | UC- | UC- |
|
||||
| | | |
|
||||
ioremap_wc | -- | -- | WC |
|
||||
| | | |
|
||||
set_memory_uc | UC- | -- | -- |
|
||||
set_memory_wb | | | |
|
||||
| | | |
|
||||
set_memory_wc | WC | -- | -- |
|
||||
set_memory_wb | | | |
|
||||
| | | |
|
||||
pci sysfs resource | -- | -- | UC- |
|
||||
| | | |
|
||||
pci sysfs resource_wc | -- | -- | WC |
|
||||
is IORESOURCE_PREFETCH| | | |
|
||||
| | | |
|
||||
pci proc | -- | -- | UC- |
|
||||
!PCIIOC_WRITE_COMBINE | | | |
|
||||
| | | |
|
||||
pci proc | -- | -- | WC |
|
||||
PCIIOC_WRITE_COMBINE | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | WB/WC/UC- | WB/WC/UC- |
|
||||
read-write | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | UC- | UC- |
|
||||
mmap SYNC flag | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | WB/WC/UC- | WB/WC/UC- |
|
||||
mmap !SYNC flag | |(from exist-| (from exist- |
|
||||
and | | ing alias)| ing alias) |
|
||||
any alias to this area| | | |
|
||||
| | | |
|
||||
/dev/mem | -- | WB | WB |
|
||||
mmap !SYNC flag | | | |
|
||||
no alias to this area | | | |
|
||||
and | | | |
|
||||
MTRR says WB | | | |
|
||||
| | | |
|
||||
/dev/mem | -- | -- | UC- |
|
||||
mmap !SYNC flag | | | |
|
||||
no alias to this area | | | |
|
||||
and | | | |
|
||||
MTRR says !WB | | | |
|
||||
| | | |
|
||||
-------------------------------------------------------------------
|
||||
|
||||
Advanced APIs for drivers
|
||||
-------------------------
|
||||
A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
|
||||
vm_insert_pfn
|
||||
|
||||
Drivers wanting to export some pages to userspace do it by using mmap
|
||||
interface and a combination of
|
||||
1) pgprot_noncached()
|
||||
2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
|
||||
|
||||
With PAT support, a new API pgprot_writecombine is being added. So, drivers can
|
||||
continue to use the above sequence, with either pgprot_noncached() or
|
||||
pgprot_writecombine() in step 1, followed by step 2.
|
||||
|
||||
In addition, step 2 internally tracks the region as UC or WC in memtype
|
||||
list in order to ensure no conflicting mapping.
|
||||
|
||||
Note that this set of APIs only works with IO (non RAM) regions. If driver
|
||||
wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
|
||||
as step 0 above and also track the usage of those pages and use set_memory_wb()
|
||||
before the page is freed to free pool.
|
||||
|
||||
|
||||
|
||||
Notes:
|
||||
|
||||
-- in the above table mean "Not suggested usage for the API". Some of the --'s
|
||||
are strictly enforced by the kernel. Some others are not really enforced
|
||||
today, but may be enforced in future.
|
||||
|
||||
For ioremap and pci access through /sys or /proc - The actual type returned
|
||||
can be more restrictive, in case of any existing aliasing for that address.
|
||||
For example: If there is an existing uncached mapping, a new ioremap_wc can
|
||||
return uncached mapping in place of write-combine requested.
|
||||
|
||||
set_memory_[uc|wc] and set_memory_wb should be used in pairs, where driver will
|
||||
first make a region uc or wc and switch it back to wb after use.
|
||||
|
||||
Over time writes to /proc/mtrr will be deprecated in favor of using PAT based
|
||||
interfaces. Users writing to /proc/mtrr are suggested to use above interfaces.
|
||||
|
||||
Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
|
||||
types.
|
||||
|
||||
Drivers should use set_memory_[uc|wc] to set access type for RAM ranges.
|
||||
|
||||
|
||||
PAT debugging
|
||||
-------------
|
||||
|
||||
With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by
|
||||
|
||||
# mount -t debugfs debugfs /sys/kernel/debug
|
||||
# cat /sys/kernel/debug/x86/pat_memtype_list
|
||||
PAT memtype list:
|
||||
uncached-minus @ 0x7fadf000-0x7fae0000
|
||||
uncached-minus @ 0x7fb19000-0x7fb1a000
|
||||
uncached-minus @ 0x7fb1a000-0x7fb1b000
|
||||
uncached-minus @ 0x7fb1b000-0x7fb1c000
|
||||
uncached-minus @ 0x7fb1c000-0x7fb1d000
|
||||
uncached-minus @ 0x7fb1d000-0x7fb1e000
|
||||
uncached-minus @ 0x7fb1e000-0x7fb25000
|
||||
uncached-minus @ 0x7fb25000-0x7fb26000
|
||||
uncached-minus @ 0x7fb26000-0x7fb27000
|
||||
uncached-minus @ 0x7fb27000-0x7fb28000
|
||||
uncached-minus @ 0x7fb28000-0x7fb2e000
|
||||
uncached-minus @ 0x7fb2e000-0x7fb2f000
|
||||
uncached-minus @ 0x7fb2f000-0x7fb30000
|
||||
uncached-minus @ 0x7fb31000-0x7fb32000
|
||||
uncached-minus @ 0x80000000-0x90000000
|
||||
|
||||
This list shows physical address ranges and various PAT settings used to
|
||||
access those physical address ranges.
|
||||
|
||||
Another, more verbose way of getting PAT related debug messages is with
|
||||
"debugpat" boot parameter. With this parameter, various debug messages are
|
||||
printed to dmesg log.
|
||||
|
75
Documentation/x86/tlb.txt
Normal file
75
Documentation/x86/tlb.txt
Normal file
|
@ -0,0 +1,75 @@
|
|||
When the kernel unmaps or modified the attributes of a range of
|
||||
memory, it has two choices:
|
||||
1. Flush the entire TLB with a two-instruction sequence. This is
|
||||
a quick operation, but it causes collateral damage: TLB entries
|
||||
from areas other than the one we are trying to flush will be
|
||||
destroyed and must be refilled later, at some cost.
|
||||
2. Use the invlpg instruction to invalidate a single page at a
|
||||
time. This could potentialy cost many more instructions, but
|
||||
it is a much more precise operation, causing no collateral
|
||||
damage to other TLB entries.
|
||||
|
||||
Which method to do depends on a few things:
|
||||
1. The size of the flush being performed. A flush of the entire
|
||||
address space is obviously better performed by flushing the
|
||||
entire TLB than doing 2^48/PAGE_SIZE individual flushes.
|
||||
2. The contents of the TLB. If the TLB is empty, then there will
|
||||
be no collateral damage caused by doing the global flush, and
|
||||
all of the individual flush will have ended up being wasted
|
||||
work.
|
||||
3. The size of the TLB. The larger the TLB, the more collateral
|
||||
damage we do with a full flush. So, the larger the TLB, the
|
||||
more attrative an individual flush looks. Data and
|
||||
instructions have separate TLBs, as do different page sizes.
|
||||
4. The microarchitecture. The TLB has become a multi-level
|
||||
cache on modern CPUs, and the global flushes have become more
|
||||
expensive relative to single-page flushes.
|
||||
|
||||
There is obviously no way the kernel can know all these things,
|
||||
especially the contents of the TLB during a given flush. The
|
||||
sizes of the flush will vary greatly depending on the workload as
|
||||
well. There is essentially no "right" point to choose.
|
||||
|
||||
You may be doing too many individual invalidations if you see the
|
||||
invlpg instruction (or instructions _near_ it) show up high in
|
||||
profiles. If you believe that individual invalidations being
|
||||
called too often, you can lower the tunable:
|
||||
|
||||
/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
|
||||
|
||||
This will cause us to do the global flush for more cases.
|
||||
Lowering it to 0 will disable the use of the individual flushes.
|
||||
Setting it to 1 is a very conservative setting and it should
|
||||
never need to be 0 under normal circumstances.
|
||||
|
||||
Despite the fact that a single individual flush on x86 is
|
||||
guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
|
||||
flushes. THP is treated exactly the same as normal memory.
|
||||
|
||||
You might see invlpg inside of flush_tlb_mm_range() show up in
|
||||
profiles, or you can use the trace_tlb_flush() tracepoints. to
|
||||
determine how long the flush operations are taking.
|
||||
|
||||
Essentially, you are balancing the cycles you spend doing invlpg
|
||||
with the cycles that you spend refilling the TLB later.
|
||||
|
||||
You can measure how expensive TLB refills are by using
|
||||
performance counters and 'perf stat', like this:
|
||||
|
||||
perf stat -e
|
||||
cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
|
||||
cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
|
||||
cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
|
||||
cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
|
||||
cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
|
||||
cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
|
||||
|
||||
That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
|
||||
may have differently-named counters, but they should at least
|
||||
be there in some form. You can use pmu-tools 'ocperf list'
|
||||
(https://github.com/andikleen/pmu-tools) to find the right
|
||||
counters for a given CPU.
|
||||
|
||||
1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
|
||||
says: "One execution of INVLPG is sufficient even for a page
|
||||
with size greater than 4 KBytes."
|
44
Documentation/x86/usb-legacy-support.txt
Normal file
44
Documentation/x86/usb-legacy-support.txt
Normal file
|
@ -0,0 +1,44 @@
|
|||
USB Legacy support
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Vojtech Pavlik <vojtech@suse.cz>, January 2004
|
||||
|
||||
|
||||
Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a
|
||||
feature that allows one to use the USB mouse and keyboard as if they were
|
||||
their classic PS/2 counterparts. This means one can use an USB keyboard to
|
||||
type in LILO for example.
|
||||
|
||||
It has several drawbacks, though:
|
||||
|
||||
1) On some machines, the emulated PS/2 mouse takes over even when no USB
|
||||
mouse is present and a real PS/2 mouse is present. In that case the extra
|
||||
features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
|
||||
not be available.
|
||||
|
||||
2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
|
||||
system crashes, because the SMM BIOS is not expecting to be in PAE mode.
|
||||
The Intel E7505 is a typical machine where this happens.
|
||||
|
||||
3) If AMD64 64-bit mode is enabled, again system crashes often happen,
|
||||
because the SMM BIOS isn't expecting the CPU to be in 64-bit mode. The
|
||||
BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
|
||||
yet.
|
||||
|
||||
Solutions:
|
||||
|
||||
Problem 1) can be solved by loading the USB drivers prior to loading the
|
||||
PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into
|
||||
the kernel unconditionally, this means the USB drivers need to be
|
||||
compiled-in, too.
|
||||
|
||||
Problem 2) can currently only be solved by either disabling HIGHMEM64G
|
||||
in the kernel config or USB Legacy support in the BIOS. A BIOS update
|
||||
could help, but so far no such update exists.
|
||||
|
||||
Problem 3) is usually fixed by a BIOS update. Check the board
|
||||
manufacturers web site. If an update is not available, disable USB
|
||||
Legacy support in the BIOS. If this alone doesn't help, try also adding
|
||||
idle=poll on the kernel command line. The BIOS may be entering the SMM
|
||||
on the HLT instruction as well.
|
||||
|
16
Documentation/x86/x86_64/00-INDEX
Normal file
16
Documentation/x86/x86_64/00-INDEX
Normal file
|
@ -0,0 +1,16 @@
|
|||
00-INDEX
|
||||
- This file
|
||||
boot-options.txt
|
||||
- AMD64-specific boot options.
|
||||
cpu-hotplug-spec
|
||||
- Firmware support for CPU hotplug under Linux/x86-64
|
||||
fake-numa-for-cpusets
|
||||
- Using numa=fake and CPUSets for Resource Management
|
||||
kernel-stacks
|
||||
- Context-specific per-processor interrupt stacks.
|
||||
machinecheck
|
||||
- Configurable sysfs parameters for the x86-64 machine check code.
|
||||
mm.txt
|
||||
- Memory layout of x86-64 (4 level page tables, 46 bits physical).
|
||||
uefi.txt
|
||||
- Booting Linux via Unified Extensible Firmware Interface.
|
284
Documentation/x86/x86_64/boot-options.txt
Normal file
284
Documentation/x86/x86_64/boot-options.txt
Normal file
|
@ -0,0 +1,284 @@
|
|||
AMD64 specific boot options
|
||||
|
||||
There are many others (usually documented in driver documentation), but
|
||||
only the AMD64 specific ones are listed here.
|
||||
|
||||
Machine check
|
||||
|
||||
Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
|
||||
|
||||
mce=off
|
||||
Disable machine check
|
||||
mce=no_cmci
|
||||
Disable CMCI(Corrected Machine Check Interrupt) that
|
||||
Intel processor supports. Usually this disablement is
|
||||
not recommended, but it might be handy if your hardware
|
||||
is misbehaving.
|
||||
Note that you'll get more problems without CMCI than with
|
||||
due to the shared banks, i.e. you might get duplicated
|
||||
error logs.
|
||||
mce=dont_log_ce
|
||||
Don't make logs for corrected errors. All events reported
|
||||
as corrected are silently cleared by OS.
|
||||
This option will be useful if you have no interest in any
|
||||
of corrected errors.
|
||||
mce=ignore_ce
|
||||
Disable features for corrected errors, e.g. polling timer
|
||||
and CMCI. All events reported as corrected are not cleared
|
||||
by OS and remained in its error banks.
|
||||
Usually this disablement is not recommended, however if
|
||||
there is an agent checking/clearing corrected errors
|
||||
(e.g. BIOS or hardware monitoring applications), conflicting
|
||||
with OS's error handling, and you cannot deactivate the agent,
|
||||
then this option will be a help.
|
||||
mce=bootlog
|
||||
Enable logging of machine checks left over from booting.
|
||||
Disabled by default on AMD because some BIOS leave bogus ones.
|
||||
If your BIOS doesn't do that it's a good idea to enable though
|
||||
to make sure you log even machine check events that result
|
||||
in a reboot. On Intel systems it is enabled by default.
|
||||
mce=nobootlog
|
||||
Disable boot machine check logging.
|
||||
mce=tolerancelevel[,monarchtimeout] (number,number)
|
||||
tolerance levels:
|
||||
0: always panic on uncorrected errors, log corrected errors
|
||||
1: panic or SIGBUS on uncorrected errors, log corrected errors
|
||||
2: SIGBUS or log uncorrected errors, log corrected errors
|
||||
3: never panic or SIGBUS, log all errors (for testing only)
|
||||
Default is 1
|
||||
Can be also set using sysfs which is preferable.
|
||||
monarchtimeout:
|
||||
Sets the time in us to wait for other CPUs on machine checks. 0
|
||||
to disable.
|
||||
mce=bios_cmci_threshold
|
||||
Don't overwrite the bios-set CMCI threshold. This boot option
|
||||
prevents Linux from overwriting the CMCI threshold set by the
|
||||
bios. Without this option, Linux always sets the CMCI
|
||||
threshold to 1. Enabling this may make memory predictive failure
|
||||
analysis less effective if the bios sets thresholds for memory
|
||||
errors since we will not see details for all errors.
|
||||
|
||||
nomce (for compatibility with i386): same as mce=off
|
||||
|
||||
Everything else is in sysfs now.
|
||||
|
||||
APICs
|
||||
|
||||
apic Use IO-APIC. Default
|
||||
|
||||
noapic Don't use the IO-APIC.
|
||||
|
||||
disableapic Don't use the local APIC
|
||||
|
||||
nolapic Don't use the local APIC (alias for i386 compatibility)
|
||||
|
||||
pirq=... See Documentation/x86/i386/IO-APIC.txt
|
||||
|
||||
noapictimer Don't set up the APIC timer
|
||||
|
||||
no_timer_check Don't check the IO-APIC timer. This can work around
|
||||
problems with incorrect timer initialization on some boards.
|
||||
apicpmtimer
|
||||
Do APIC timer calibration using the pmtimer. Implies
|
||||
apicmaintimer. Useful when your PIT timer is totally
|
||||
broken.
|
||||
|
||||
Timing
|
||||
|
||||
notsc
|
||||
Don't use the CPU time stamp counter to read the wall time.
|
||||
This can be used to work around timing problems on multiprocessor systems
|
||||
with not properly synchronized CPUs.
|
||||
|
||||
nohpet
|
||||
Don't use the HPET timer.
|
||||
|
||||
Idle loop
|
||||
|
||||
idle=poll
|
||||
Don't do power saving in the idle loop using HLT, but poll for rescheduling
|
||||
event. This will make the CPUs eat a lot more power, but may be useful
|
||||
to get slightly better performance in multiprocessor benchmarks. It also
|
||||
makes some profiling using performance counters more accurate.
|
||||
Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
|
||||
CPUs) this option has no performance advantage over the normal idle loop.
|
||||
It may also interact badly with hyperthreading.
|
||||
|
||||
Rebooting
|
||||
|
||||
reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
|
||||
bios Use the CPU reboot vector for warm reset
|
||||
warm Don't set the cold reboot flag
|
||||
cold Set the cold reboot flag
|
||||
triple Force a triple fault (init)
|
||||
kbd Use the keyboard controller. cold reset (default)
|
||||
acpi Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the
|
||||
ACPI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
efi Use efi reset_system runtime service. If EFI is not configured or the
|
||||
EFI reset does not work, the reboot path attempts the reset using
|
||||
the keyboard controller.
|
||||
|
||||
Using warm reset will be much faster especially on big memory
|
||||
systems because the BIOS will not go through the memory check.
|
||||
Disadvantage is that not all hardware will be completely reinitialized
|
||||
on reboot so there may be boot problems on some systems.
|
||||
|
||||
reboot=force
|
||||
|
||||
Don't stop other CPUs on reboot. This can make reboot more reliable
|
||||
in some cases.
|
||||
|
||||
Non Executable Mappings
|
||||
|
||||
noexec=on|off
|
||||
|
||||
on Enable(default)
|
||||
off Disable
|
||||
|
||||
NUMA
|
||||
|
||||
numa=off Only set up a single NUMA node spanning all memory.
|
||||
|
||||
numa=noacpi Don't parse the SRAT table for NUMA setup
|
||||
|
||||
numa=fake=<size>[MG]
|
||||
If given as a memory unit, fills all system RAM with nodes of
|
||||
size interleaved over physical nodes.
|
||||
|
||||
numa=fake=<N>
|
||||
If given as an integer, fills all system RAM with N fake nodes
|
||||
interleaved over physical nodes.
|
||||
|
||||
ACPI
|
||||
|
||||
acpi=off Don't enable ACPI
|
||||
acpi=ht Use ACPI boot table parsing, but don't enable ACPI
|
||||
interpreter
|
||||
acpi=force Force ACPI on (currently not needed)
|
||||
|
||||
acpi=strict Disable out of spec ACPI workarounds.
|
||||
|
||||
acpi_sci={edge,level,high,low} Set up ACPI SCI interrupt.
|
||||
|
||||
acpi=noirq Don't route interrupts
|
||||
|
||||
acpi=nocmcff Disable firmware first mode for corrected errors. This
|
||||
disables parsing the HEST CMC error source to check if
|
||||
firmware has set the FF flag. This may result in
|
||||
duplicate corrected error reports.
|
||||
|
||||
PCI
|
||||
|
||||
pci=off Don't use PCI
|
||||
pci=conf1 Use conf1 access.
|
||||
pci=conf2 Use conf2 access.
|
||||
pci=rom Assign ROMs.
|
||||
pci=assign-busses Assign busses
|
||||
pci=irqmask=MASK Set PCI interrupt mask to MASK
|
||||
pci=lastbus=NUMBER Scan up to NUMBER busses, no matter what the mptable says.
|
||||
pci=noacpi Don't use ACPI to set up PCI interrupt routing.
|
||||
|
||||
IOMMU (input/output memory management unit)
|
||||
|
||||
Currently four x86-64 PCI-DMA mapping implementations exist:
|
||||
|
||||
1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
|
||||
(e.g. because you have < 3 GB memory).
|
||||
Kernel boot message: "PCI-DMA: Disabling IOMMU"
|
||||
|
||||
2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
|
||||
Kernel boot message: "PCI-DMA: using GART IOMMU"
|
||||
|
||||
3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
|
||||
e.g. if there is no hardware IOMMU in the system and it is need because
|
||||
you have >3GB memory or told the kernel to us it (iommu=soft))
|
||||
Kernel boot message: "PCI-DMA: Using software bounce buffering
|
||||
for IO (SWIOTLB)"
|
||||
|
||||
4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
|
||||
pSeries and xSeries servers. This hardware IOMMU supports DMA address
|
||||
mapping with memory protection, etc.
|
||||
Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
|
||||
|
||||
iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]
|
||||
[,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge]
|
||||
[,noaperture][,calgary]
|
||||
|
||||
General iommu options:
|
||||
off Don't initialize and use any kind of IOMMU.
|
||||
noforce Don't force hardware IOMMU usage when it is not needed.
|
||||
(default).
|
||||
force Force the use of the hardware IOMMU even when it is
|
||||
not actually needed (e.g. because < 3 GB memory).
|
||||
soft Use software bounce buffering (SWIOTLB) (default for
|
||||
Intel machines). This can be used to prevent the usage
|
||||
of an available hardware IOMMU.
|
||||
|
||||
iommu options only relevant to the AMD GART hardware IOMMU:
|
||||
<size> Set the size of the remapping area in bytes.
|
||||
allowed Overwrite iommu off workarounds for specific chipsets.
|
||||
fullflush Flush IOMMU on each allocation (default).
|
||||
nofullflush Don't use IOMMU fullflush.
|
||||
leak Turn on simple iommu leak tracing (only when
|
||||
CONFIG_IOMMU_LEAK is on). Default number of leak pages
|
||||
is 20.
|
||||
memaper[=<order>] Allocate an own aperture over RAM with size 32MB<<order.
|
||||
(default: order=1, i.e. 64MB)
|
||||
merge Do scatter-gather (SG) merging. Implies "force"
|
||||
(experimental).
|
||||
nomerge Don't do scatter-gather (SG) merging.
|
||||
noaperture Ask the IOMMU not to touch the aperture for AGP.
|
||||
forcesac Force single-address cycle (SAC) mode for masks <40bits
|
||||
(experimental).
|
||||
noagp Don't initialize the AGP driver and use full aperture.
|
||||
allowdac Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
|
||||
DAC is used with 32-bit PCI to push a 64-bit address in
|
||||
two cycles. When off all DMA over >4GB is forced through
|
||||
an IOMMU or software bounce buffering.
|
||||
nodac Forbid DAC mode, i.e. DMA >4GB.
|
||||
panic Always panic when IOMMU overflows.
|
||||
calgary Use the Calgary IOMMU if it is available
|
||||
|
||||
iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
|
||||
implementation:
|
||||
swiotlb=<pages>[,force]
|
||||
<pages> Prereserve that many 128K pages for the software IO
|
||||
bounce buffering.
|
||||
force Force all IO through the software TLB.
|
||||
|
||||
Settings for the IBM Calgary hardware IOMMU currently found in IBM
|
||||
pSeries and xSeries machines:
|
||||
|
||||
calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
|
||||
calgary=[translate_empty_slots]
|
||||
calgary=[disable=<PCI bus number>]
|
||||
panic Always panic when IOMMU overflows
|
||||
|
||||
64k,...,8M - Set the size of each PCI slot's translation table
|
||||
when using the Calgary IOMMU. This is the size of the translation
|
||||
table itself in main memory. The smallest table, 64k, covers an IO
|
||||
space of 32MB; the largest, 8MB table, can cover an IO space of
|
||||
4GB. Normally the kernel will make the right choice by itself.
|
||||
|
||||
translate_empty_slots - Enable translation even on slots that have
|
||||
no devices attached to them, in case a device will be hotplugged
|
||||
in the future.
|
||||
|
||||
disable=<PCI bus number> - Disable translation on a given PHB. For
|
||||
example, the built-in graphics adapter resides on the first bridge
|
||||
(PCI bus number 0); if translation (isolation) is enabled on this
|
||||
bridge, X servers that access the hardware directly from user
|
||||
space might stop working. Use this option if you have devices that
|
||||
are accessed from userspace directly on some PCI host bridge.
|
||||
|
||||
Debugging
|
||||
|
||||
kstack=N Print N words from the kernel stack in oops dumps.
|
||||
|
||||
Miscellaneous
|
||||
|
||||
nogbpages
|
||||
Do not use GB pages for kernel direct mappings.
|
||||
gbpages
|
||||
Use GB pages for kernel direct mappings.
|
21
Documentation/x86/x86_64/cpu-hotplug-spec
Normal file
21
Documentation/x86/x86_64/cpu-hotplug-spec
Normal file
|
@ -0,0 +1,21 @@
|
|||
Firmware support for CPU hotplug under Linux/x86-64
|
||||
---------------------------------------------------
|
||||
|
||||
Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
|
||||
know in advance of boot time the maximum number of CPUs that could be plugged
|
||||
into the system. ACPI 3.0 currently has no official way to supply
|
||||
this information from the firmware to the operating system.
|
||||
|
||||
In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
|
||||
ACPI 3.0 specification). ACPI already has the concept of disabled LAPIC
|
||||
objects by setting the Enabled bit in the LAPIC object to zero.
|
||||
|
||||
For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
|
||||
CPU is already available in the MADT. If the CPU is not available yet
|
||||
it should have its LAPIC Enabled bit set to 0. Linux will use the number
|
||||
of disabled LAPICs to compute the maximum number of future CPUs.
|
||||
|
||||
In the worst case the user can overwrite this choice using a command line
|
||||
option (additional_cpus=...), but it is recommended to supply the correct
|
||||
number (or a reasonable approximation of it, with erring towards more not less)
|
||||
in the MADT to avoid manual configuration.
|
67
Documentation/x86/x86_64/fake-numa-for-cpusets
Normal file
67
Documentation/x86/x86_64/fake-numa-for-cpusets
Normal file
|
@ -0,0 +1,67 @@
|
|||
Using numa=fake and CPUSets for Resource Management
|
||||
Written by David Rientjes <rientjes@cs.washington.edu>
|
||||
|
||||
This document describes how the numa=fake x86_64 command-line option can be used
|
||||
in conjunction with cpusets for coarse memory management. Using this feature,
|
||||
you can create fake NUMA nodes that represent contiguous chunks of memory and
|
||||
assign them to cpusets and their attached tasks. This is a way of limiting the
|
||||
amount of system memory that are available to a certain class of tasks.
|
||||
|
||||
For more information on the features of cpusets, see
|
||||
Documentation/cgroups/cpusets.txt.
|
||||
There are a number of different configurations you can use for your needs. For
|
||||
more information on the numa=fake command line option and its various ways of
|
||||
configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
|
||||
|
||||
For the purposes of this introduction, we'll assume a very primitive NUMA
|
||||
emulation setup of "numa=fake=4*512,". This will split our system memory into
|
||||
four equal chunks of 512M each that we can now use to assign to cpusets. As
|
||||
you become more familiar with using this combination for resource control,
|
||||
you'll determine a better setup to minimize the number of nodes you have to deal
|
||||
with.
|
||||
|
||||
A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
|
||||
|
||||
Faking node 0 at 0000000000000000-0000000020000000 (512MB)
|
||||
Faking node 1 at 0000000020000000-0000000040000000 (512MB)
|
||||
Faking node 2 at 0000000040000000-0000000060000000 (512MB)
|
||||
Faking node 3 at 0000000060000000-0000000080000000 (512MB)
|
||||
...
|
||||
On node 0 totalpages: 130975
|
||||
On node 1 totalpages: 131072
|
||||
On node 2 totalpages: 131072
|
||||
On node 3 totalpages: 131072
|
||||
|
||||
Now following the instructions for mounting the cpusets filesystem from
|
||||
Documentation/cgroups/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
|
||||
address spaces) to individual cpusets:
|
||||
|
||||
[root@xroads /]# mkdir exampleset
|
||||
[root@xroads /]# mount -t cpuset none exampleset
|
||||
[root@xroads /]# mkdir exampleset/ddset
|
||||
[root@xroads /]# cd exampleset/ddset
|
||||
[root@xroads /exampleset/ddset]# echo 0-1 > cpus
|
||||
[root@xroads /exampleset/ddset]# echo 0-1 > mems
|
||||
|
||||
Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
|
||||
memory allocations (1G).
|
||||
|
||||
You can now assign tasks to these cpusets to limit the memory resources
|
||||
available to them according to the fake nodes assigned as mems:
|
||||
|
||||
[root@xroads /exampleset/ddset]# echo $$ > tasks
|
||||
[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
|
||||
[1] 13425
|
||||
|
||||
Notice the difference between the system memory usage as reported by
|
||||
/proc/meminfo between the restricted cpuset case above and the unrestricted
|
||||
case (i.e. running the same 'dd' command without assigning it to a fake NUMA
|
||||
cpuset):
|
||||
Unrestricted Restricted
|
||||
MemTotal: 3091900 kB 3091900 kB
|
||||
MemFree: 42113 kB 1513236 kB
|
||||
|
||||
This allows for coarse memory management for the tasks you assign to particular
|
||||
cpusets. Since cpusets can form a hierarchy, you can create some pretty
|
||||
interesting combinations of use-cases for various classes of tasks for your
|
||||
memory management needs.
|
99
Documentation/x86/x86_64/kernel-stacks
Normal file
99
Documentation/x86/x86_64/kernel-stacks
Normal file
|
@ -0,0 +1,99 @@
|
|||
Most of the text from Keith Owens, hacked by AK
|
||||
|
||||
x86_64 page size (PAGE_SIZE) is 4K.
|
||||
|
||||
Like all other architectures, x86_64 has a kernel stack for every
|
||||
active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
|
||||
These stacks contain useful data as long as a thread is alive or a
|
||||
zombie. While the thread is in user space the kernel stack is empty
|
||||
except for the thread_info structure at the bottom.
|
||||
|
||||
In addition to the per thread stacks, there are specialized stacks
|
||||
associated with each CPU. These stacks are only used while the kernel
|
||||
is in control on that CPU; when a CPU returns to user space the
|
||||
specialized stacks contain no useful data. The main CPU stacks are:
|
||||
|
||||
* Interrupt stack. IRQSTACKSIZE
|
||||
|
||||
Used for external hardware interrupts. If this is the first external
|
||||
hardware interrupt (i.e. not a nested hardware interrupt) then the
|
||||
kernel switches from the current task to the interrupt stack. Like
|
||||
the split thread and interrupt stacks on i386, this gives more room
|
||||
for kernel interrupt processing without having to increase the size
|
||||
of every per thread stack.
|
||||
|
||||
The interrupt stack is also used when processing a softirq.
|
||||
|
||||
Switching to the kernel interrupt stack is done by software based on a
|
||||
per CPU interrupt nest counter. This is needed because x86-64 "IST"
|
||||
hardware stacks cannot nest without races.
|
||||
|
||||
x86_64 also has a feature which is not available on i386, the ability
|
||||
to automatically switch to a new stack for designated events such as
|
||||
double fault or NMI, which makes it easier to handle these unusual
|
||||
events on x86_64. This feature is called the Interrupt Stack Table
|
||||
(IST). There can be up to 7 IST entries per CPU. The IST code is an
|
||||
index into the Task State Segment (TSS). The IST entries in the TSS
|
||||
point to dedicated stacks; each stack can be a different size.
|
||||
|
||||
An IST is selected by a non-zero value in the IST field of an
|
||||
interrupt-gate descriptor. When an interrupt occurs and the hardware
|
||||
loads such a descriptor, the hardware automatically sets the new stack
|
||||
pointer based on the IST value, then invokes the interrupt handler. If
|
||||
software wants to allow nested IST interrupts then the handler must
|
||||
adjust the IST values on entry to and exit from the interrupt handler.
|
||||
(This is occasionally done, e.g. for debug exceptions.)
|
||||
|
||||
Events with different IST codes (i.e. with different stacks) can be
|
||||
nested. For example, a debug interrupt can safely be interrupted by an
|
||||
NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
|
||||
pointers on entry to and exit from all IST events, in theory allowing
|
||||
IST events with the same code to be nested. However in most cases, the
|
||||
stack size allocated to an IST assumes no nesting for the same code.
|
||||
If that assumption is ever broken then the stacks will become corrupt.
|
||||
|
||||
The currently assigned IST stacks are :-
|
||||
|
||||
* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for interrupt 12 - Stack Fault Exception (#SS).
|
||||
|
||||
This allows the CPU to recover from invalid stack segments. Rarely
|
||||
happens.
|
||||
|
||||
* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for interrupt 8 - Double Fault Exception (#DF).
|
||||
|
||||
Invoked when handling one exception causes another exception. Happens
|
||||
when the kernel is very confused (e.g. kernel stack pointer corrupt).
|
||||
Using a separate stack allows the kernel to recover from it well enough
|
||||
in many cases to still output an oops.
|
||||
|
||||
* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for non-maskable interrupts (NMI).
|
||||
|
||||
NMI can be delivered at any time, including when the kernel is in the
|
||||
middle of switching stacks. Using IST for NMI events avoids making
|
||||
assumptions about the previous state of the kernel stack.
|
||||
|
||||
* DEBUG_STACK. DEBUG_STKSZ
|
||||
|
||||
Used for hardware debug interrupts (interrupt 1) and for software
|
||||
debug interrupts (INT3).
|
||||
|
||||
When debugging a kernel, debug interrupts (both hardware and
|
||||
software) can occur at any time. Using IST for these interrupts
|
||||
avoids making assumptions about the previous state of the kernel
|
||||
stack.
|
||||
|
||||
* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE).
|
||||
|
||||
Used for interrupt 18 - Machine Check Exception (#MC).
|
||||
|
||||
MCE can be delivered at any time, including when the kernel is in the
|
||||
middle of switching stacks. Using IST for MCE events avoids making
|
||||
assumptions about the previous state of the kernel stack.
|
||||
|
||||
For more details see the Intel IA32 or AMD AMD64 architecture manuals.
|
83
Documentation/x86/x86_64/machinecheck
Normal file
83
Documentation/x86/x86_64/machinecheck
Normal file
|
@ -0,0 +1,83 @@
|
|||
|
||||
Configurable sysfs parameters for the x86-64 machine check code.
|
||||
|
||||
Machine checks report internal hardware error conditions detected
|
||||
by the CPU. Uncorrected errors typically cause a machine check
|
||||
(often with panic), corrected ones cause a machine check log entry.
|
||||
|
||||
Machine checks are organized in banks (normally associated with
|
||||
a hardware subsystem) and subevents in a bank. The exact meaning
|
||||
of the banks and subevent is CPU specific.
|
||||
|
||||
mcelog knows how to decode them.
|
||||
|
||||
When you see the "Machine check errors logged" message in the system
|
||||
log then mcelog should run to collect and decode machine check entries
|
||||
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
|
||||
|
||||
Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
|
||||
(N = CPU number)
|
||||
|
||||
The directory contains some configurable entries:
|
||||
|
||||
Entries:
|
||||
|
||||
bankNctl
|
||||
(N bank number)
|
||||
64bit Hex bitmask enabling/disabling specific subevents for bank N
|
||||
When a bit in the bitmask is zero then the respective
|
||||
subevent will not be reported.
|
||||
By default all events are enabled.
|
||||
Note that BIOS maintain another mask to disable specific events
|
||||
per bank. This is not visible here
|
||||
|
||||
The following entries appear for each CPU, but they are truly shared
|
||||
between all CPUs.
|
||||
|
||||
check_interval
|
||||
How often to poll for corrected machine check errors, in seconds
|
||||
(Note output is hexademical). Default 5 minutes. When the poller
|
||||
finds MCEs it triggers an exponential speedup (poll more often) on
|
||||
the polling interval. When the poller stops finding MCEs, it
|
||||
triggers an exponential backoff (poll less often) on the polling
|
||||
interval. The check_interval variable is both the initial and
|
||||
maximum polling interval. 0 means no polling for corrected machine
|
||||
check errors (but some corrected errors might be still reported
|
||||
in other ways)
|
||||
|
||||
tolerant
|
||||
Tolerance level. When a machine check exception occurs for a non
|
||||
corrected machine check the kernel can take different actions.
|
||||
Since machine check exceptions can happen any time it is sometimes
|
||||
risky for the kernel to kill a process because it defies
|
||||
normal kernel locking rules. The tolerance level configures
|
||||
how hard the kernel tries to recover even at some risk of
|
||||
deadlock. Higher tolerant values trade potentially better uptime
|
||||
with the risk of a crash or even corruption (for tolerant >= 3).
|
||||
|
||||
0: always panic on uncorrected errors, log corrected errors
|
||||
1: panic or SIGBUS on uncorrected errors, log corrected errors
|
||||
2: SIGBUS or log uncorrected errors, log corrected errors
|
||||
3: never panic or SIGBUS, log all errors (for testing only)
|
||||
|
||||
Default: 1
|
||||
|
||||
Note this only makes a difference if the CPU allows recovery
|
||||
from a machine check exception. Current x86 CPUs generally do not.
|
||||
|
||||
trigger
|
||||
Program to run when a machine check event is detected.
|
||||
This is an alternative to running mcelog regularly from cron
|
||||
and allows to detect events faster.
|
||||
monarch_timeout
|
||||
How long to wait for the other CPUs to machine check too on a
|
||||
exception. 0 to disable waiting for other CPUs.
|
||||
Unit: us
|
||||
|
||||
TBD document entries for AMD threshold interrupt configuration
|
||||
|
||||
For more details about the x86 machine check architecture
|
||||
see the Intel and AMD architecture manuals from their developer websites.
|
||||
|
||||
For more details about the architecture see
|
||||
see http://one.firstfloor.org/~andi/mce.pdf
|
40
Documentation/x86/x86_64/mm.txt
Normal file
40
Documentation/x86/x86_64/mm.txt
Normal file
|
@ -0,0 +1,40 @@
|
|||
|
||||
<previous description obsolete, deleted>
|
||||
|
||||
Virtual memory map with 4 level page tables:
|
||||
|
||||
0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
|
||||
hole caused by [48:63] sign extension
|
||||
ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
|
||||
ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
|
||||
ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
|
||||
ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
|
||||
ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
|
||||
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
|
||||
... unused hole ...
|
||||
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
|
||||
... unused hole ...
|
||||
ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0
|
||||
ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
|
||||
ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
|
||||
ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
|
||||
|
||||
The direct mapping covers all memory in the system up to the highest
|
||||
memory address (this means in some cases it can also include PCI memory
|
||||
holes).
|
||||
|
||||
vmalloc space is lazily synchronized into the different PML4 pages of
|
||||
the processes using the page fault handler, with init_level4_pgt as
|
||||
reference.
|
||||
|
||||
Current X86-64 implementations only support 40 bits of address space,
|
||||
but we support up to 46 bits. This expands into MBZ space in the page tables.
|
||||
|
||||
->trampoline_pgd:
|
||||
|
||||
We map EFI runtime services in the aforementioned PGD in the virtual
|
||||
range of 64Gb (arbitrarily set, can be raised if needed)
|
||||
|
||||
0xffffffef00000000 - 0xffffffff00000000
|
||||
|
||||
-Andi Kleen, Jul 2004
|
42
Documentation/x86/x86_64/uefi.txt
Normal file
42
Documentation/x86/x86_64/uefi.txt
Normal file
|
@ -0,0 +1,42 @@
|
|||
General note on [U]EFI x86_64 support
|
||||
-------------------------------------
|
||||
|
||||
The nomenclature EFI and UEFI are used interchangeably in this document.
|
||||
|
||||
Although the tools below are _not_ needed for building the kernel,
|
||||
the needed bootloader support and associated tools for x86_64 platforms
|
||||
with EFI firmware and specifications are listed below.
|
||||
|
||||
1. UEFI specification: http://www.uefi.org
|
||||
|
||||
2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
|
||||
support. Elilo with x86_64 support can be used.
|
||||
|
||||
3. x86_64 platform with EFI/UEFI firmware.
|
||||
|
||||
Mechanics:
|
||||
---------
|
||||
- Build the kernel with the following configuration.
|
||||
CONFIG_FB_EFI=y
|
||||
CONFIG_FRAMEBUFFER_CONSOLE=y
|
||||
If EFI runtime services are expected, the following configuration should
|
||||
be selected.
|
||||
CONFIG_EFI=y
|
||||
CONFIG_EFI_VARS=y or m # optional
|
||||
- Create a VFAT partition on the disk
|
||||
- Copy the following to the VFAT partition:
|
||||
elilo bootloader with x86_64 support, elilo configuration file,
|
||||
kernel image built in first step and corresponding
|
||||
initrd. Instructions on building elilo and its dependencies
|
||||
can be found in the elilo sourceforge project.
|
||||
- Boot to EFI shell and invoke elilo choosing the kernel image built
|
||||
in first step.
|
||||
- If some or all EFI runtime services don't work, you can try following
|
||||
kernel command line parameters to turn off some or all EFI runtime
|
||||
services.
|
||||
noefi turn off all EFI runtime services
|
||||
reboot_type=k turn off EFI reboot runtime service
|
||||
- If the EFI memory map has additional entries not in the E820 map,
|
||||
you can include those entries in the kernels memory map of available
|
||||
physical RAM by using the following kernel command line parameter.
|
||||
add_efi_memmap include EFI memory map of available physical RAM
|
37
Documentation/x86/zero-page.txt
Normal file
37
Documentation/x86/zero-page.txt
Normal file
|
@ -0,0 +1,37 @@
|
|||
The additional fields in struct boot_params as a part of 32-bit boot
|
||||
protocol of kernel. These should be filled by bootloader or 16-bit
|
||||
real-mode setup code of the kernel. References/settings to it mainly
|
||||
are in:
|
||||
|
||||
arch/x86/include/asm/bootparam.h
|
||||
|
||||
|
||||
Offset Proto Name Meaning
|
||||
/Size
|
||||
|
||||
000/040 ALL screen_info Text mode or frame buffer information
|
||||
(struct screen_info)
|
||||
040/014 ALL apm_bios_info APM BIOS information (struct apm_bios_info)
|
||||
058/008 ALL tboot_addr Physical address of tboot shared page
|
||||
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
|
||||
(struct ist_info)
|
||||
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
|
||||
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
|
||||
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table)
|
||||
0B0/010 ALL olpc_ofw_header OLPC's OpenFirmware CIF and friends
|
||||
0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits
|
||||
0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits
|
||||
0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits
|
||||
140/080 ALL edid_info Video mode setup (struct edid_info)
|
||||
1C0/020 ALL efi_info EFI 32 information (struct efi_info)
|
||||
1E0/004 ALL alk_mem_k Alternative mem check, in KB
|
||||
1E4/004 ALL scratch Scratch field for the kernel setup code
|
||||
1E8/001 ALL e820_entries Number of entries in e820_map (below)
|
||||
1E9/001 ALL eddbuf_entries Number of entries in eddbuf (below)
|
||||
1EA/001 ALL edd_mbr_sig_buf_entries Number of entries in edd_mbr_sig_buffer
|
||||
(below)
|
||||
1EF/001 ALL sentinel Used to detect broken bootloaders
|
||||
290/040 ALL edd_mbr_sig_buffer EDD MBR signatures
|
||||
2D0/A00 ALL e820_map E820 memory map table
|
||||
(array of struct e820entry)
|
||||
D00/1EC ALL eddbuf EDD data (array of struct edd_info)
|
Loading…
Add table
Add a link
Reference in a new issue