mirror of
https://github.com/AetherDroid/android_kernel_samsung_on5xelte.git
synced 2025-10-29 15:28:50 +01:00
Fixed MTP to work with TWRP
commit
f6dfaef42e
50820 changed files with 20846062 additions and 0 deletions
14
Documentation/PCI/00-INDEX
Normal file
@@ -0,0 +1,14 @@
00-INDEX
	- this file
MSI-HOWTO.txt
	- the Message Signaled Interrupts (MSI) Driver Guide HOWTO and FAQ.
PCIEBUS-HOWTO.txt
	- a guide describing the PCI Express Port Bus driver
pci-error-recovery.txt
	- info on PCI error recovery
pci-iov-howto.txt
	- the PCI Express I/O Virtualization HOWTO
pci.txt
	- info on the PCI subsystem for device driver authors
pcieaer-howto.txt
	- the PCI Express Advanced Error Reporting Driver Guide HOWTO
596
Documentation/PCI/MSI-HOWTO.txt
Normal file
@@ -0,0 +1,596 @@
		The MSI Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
		10/03/2003
	Revised Feb 12, 2004 by Martine Silbermann
		email: Martine.Silbermann@hp.com
	Revised Jun 25, 2004 by Tom L Nguyen
	Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
		Copyright 2003, 2008 Intel Corporation

1. About this guide

This guide describes the basics of Message Signaled Interrupts (MSIs),
the advantages of using MSI over traditional interrupt mechanisms, how
to change your driver to use MSI or MSI-X and some basic diagnostics to
try if a device doesn't support MSIs.


2. What are MSIs?

A Message Signaled Interrupt is a write from the device to a special
address which causes an interrupt to be received by the CPU.

The MSI capability was first specified in PCI 2.2 and was later enhanced
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
capability was also introduced with PCI 3.0. It supports more interrupts
per device than MSI and allows interrupts to be independently configured.

Devices may support both MSI and MSI-X, but only one can be enabled at
a time.

3. Why use MSIs?

There are three reasons why using MSIs can give an advantage over
traditional pin-based interrupts.

Pin-based PCI interrupts are often shared amongst several devices.
To support this, the kernel must call each interrupt handler associated
with an interrupt, which leads to reduced performance for the system as
a whole. MSIs are never shared, so this problem cannot arise.

When a device writes data to memory, then raises a pin-based interrupt,
it is possible that the interrupt may arrive before all the data has
arrived in memory (this becomes more likely with devices behind PCI-PCI
bridges). In order to ensure that all the data has arrived in memory,
the interrupt handler must read a register on the device which raised
the interrupt. PCI transaction ordering rules require that all the data
arrive in memory before the value may be returned from the register.
Using MSIs avoids this problem as the interrupt-generating write cannot
pass the data writes, so by the time the interrupt is raised, the driver
knows that all the data has arrived in memory.

PCI devices can only support a single pin-based interrupt per function.
Often drivers have to query the device to find out what event has
occurred, slowing down interrupt handling for the common case. With
MSIs, a device can support more interrupts, allowing each interrupt
to be specialised to a different purpose. One possible design gives
infrequent conditions (such as errors) their own interrupt which allows
the driver to handle the normal interrupt handling path more efficiently.
Other possible designs include giving one interrupt to each packet queue
in a network card or each port in a storage controller.

4. How to use MSIs

PCI devices are initialised to use pin-based interrupts. The device
driver has to set up the device to use MSI or MSI-X. Not all machines
support MSIs correctly, and for those machines, the APIs described below
will simply fail and the device will continue to use pin-based interrupts.

4.1 Include kernel support for MSIs

To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
option enabled. This option is only available on some architectures,
and it may depend on some other options also being set. For example,
on x86, you must also enable X86_UP_APIC or SMP in order to see the
CONFIG_PCI_MSI option.

4.2 Using MSI

Most of the hard work is done for the driver in the PCI layer. The driver
simply has to request that the PCI layer set up the MSI capability for
this device.

4.2.1 pci_enable_msi

int pci_enable_msi(struct pci_dev *dev)

A successful call allocates ONE interrupt to the device, regardless
of how many MSIs the device supports. The device is switched from
pin-based interrupt mode to MSI mode. The dev->irq number is changed
to a new number which represents the message signaled interrupt;
consequently, this function should be called before the driver calls
request_irq(), because an MSI is delivered via a vector that is
different from the vector of a pin-based interrupt.

4.2.2 pci_enable_msi_range

int pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec)

This function allows a device driver to request any number of MSI
interrupts within the specified range from 'minvec' to 'maxvec'.

If this function returns a positive number, it indicates the number of
MSI interrupts that have been successfully allocated. In this case
the device is switched from pin-based interrupt mode to MSI mode and
dev->irq is updated to be the lowest of the new interrupts assigned to it.
The other interrupts assigned to the device are in the range dev->irq
to dev->irq + returned value - 1. The device driver can use the returned
number of successfully allocated MSI interrupts to further allocate
and initialize device resources.
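
As a rough illustration of the range rule above, the sketch below maps a
zero-based vector index to its interrupt number. It is plain C with no
kernel dependencies. foo_msi_irq() and its parameters are hypothetical
names: 'base_irq' stands in for dev->irq after a successful call and
'nvec' for the positive value returned by pci_enable_msi_range().

```c
#include <errno.h>

/*
 * Hypothetical helper, not a kernel API: map vector index 'idx' to an
 * interrupt number. 'base_irq' stands in for dev->irq and 'nvec' for
 * the positive return value of pci_enable_msi_range().
 */
static int foo_msi_irq(unsigned int base_irq, int nvec, int idx)
{
	if (nvec <= 0 || idx < 0 || idx >= nvec)
		return -EINVAL;

	/* Interrupts occupy base_irq .. base_irq + nvec - 1, inclusive. */
	return (int)(base_irq + (unsigned int)idx);
}
```

A driver would typically loop over idx from 0 to nvec - 1 and pass each
resulting number to request_irq().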

If this function returns a negative number, it indicates an error and
the driver should not attempt to request any more MSI interrupts for
this device.

This function should be called before the driver calls request_irq(),
because MSI interrupts are delivered via vectors that are different
from the vector of a pin-based interrupt.

It is ideal if drivers can cope with a variable number of MSI interrupts;
there are many reasons why the platform may not be able to provide the
exact number that a driver asks for.

There could be devices that cannot operate with just any number of MSI
interrupts within a range. See chapter 4.3.1.3 for an idea of how to
handle such devices for MSI-X - the same logic applies to MSI.

4.2.2.1 Maximum possible number of MSI interrupts

The typical usage of MSI interrupts is to allocate as many vectors as
possible, likely up to the limit returned by the pci_msi_vec_count()
function:

	static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
	{
		return pci_enable_msi_range(pdev, 1, nvec);
	}

Note that the value of the 'minvec' parameter is 1. As 'minvec' is
inclusive, the value of 0 would be meaningless and could result in error.

Some devices require a minimum number of MSI interrupts.
In this case the function could look like this:

	static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
	{
		return pci_enable_msi_range(pdev, FOO_DRIVER_MINIMUM_NVEC, nvec);
	}

4.2.2.2 Exact number of MSI interrupts

If a driver is unable or unwilling to deal with a variable number of MSI
interrupts, it could request a particular number of interrupts by passing
that number to the pci_enable_msi_range() function as both the 'minvec'
and 'maxvec' parameters:

	static int foo_driver_enable_msi(struct pci_dev *pdev, int nvec)
	{
		return pci_enable_msi_range(pdev, nvec, nvec);
	}

Note that, unlike the pci_enable_msi_exact() function, which could also be
used to enable a particular number of MSI interrupts,
pci_enable_msi_range() returns either a negative errno or 'nvec' (not a
negative errno or 0, as pci_enable_msi_exact() does).

4.2.2.3 Single MSI mode

The most notable example of the request type described above is
enabling the single MSI mode for a device. It could be done by passing
two 1s as 'minvec' and 'maxvec':

	static int foo_driver_enable_single_msi(struct pci_dev *pdev)
	{
		return pci_enable_msi_range(pdev, 1, 1);
	}

Note that, unlike the pci_enable_msi() function, which could also be used
to enable the single MSI mode, pci_enable_msi_range() returns either a
negative errno or 1 (not a negative errno or 0, as pci_enable_msi()
does).

4.2.3 pci_enable_msi_exact

int pci_enable_msi_exact(struct pci_dev *dev, int nvec)

This variation on the pci_enable_msi_range() call allows a device driver
to request exactly 'nvec' MSIs.

If this function returns a negative number, it indicates an error and
the driver should not attempt to request any more MSI interrupts for
this device.

In contrast with the pci_enable_msi_range() function,
pci_enable_msi_exact() returns zero in case of success, which indicates
that the MSI interrupts have been successfully allocated.

4.2.4 pci_disable_msi

void pci_disable_msi(struct pci_dev *dev)

This function should be used to undo the effect of pci_enable_msi_range().
Calling it restores dev->irq to the pin-based interrupt number and frees
the previously allocated MSIs. The interrupts may subsequently be assigned
to another device, so drivers should not cache the value of dev->irq.

Before calling this function, a device driver must always call free_irq()
on any interrupt for which it previously called request_irq().
Failure to do so results in a BUG_ON(), leaving the device with
MSI enabled and thus leaking its vector.

4.2.5 pci_msi_vec_count

int pci_msi_vec_count(struct pci_dev *dev)

This function can be used to retrieve the number of MSI vectors the
device requested (via the Multiple Message Capable register). The MSI
specification only allows the returned value to be a power of two,
up to a maximum of 2^5 (32).

If this function returns a negative number, it indicates the device is
not capable of sending MSIs.

If this function returns a positive number, it indicates the maximum
number of MSI interrupt vectors that could be allocated.
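
To make the power-of-two rule concrete, the sketch below decodes the
Multiple Message Capable field by hand. It assumes the usual layout of
the MSI Message Control word, where the field occupies bits 3:1 and
encodes log2 of the vector count. FOO_MSI_FLAGS_QMASK and
foo_msi_vec_count() are illustrative names. Drivers should call
pci_msi_vec_count() rather than decode the register themselves.

```c
#include <stdint.h>

/* Assumed position of the Multiple Message Capable field (bits 3:1). */
#define FOO_MSI_FLAGS_QMASK	0x000e

/*
 * Hypothetical decoder, not a kernel API: returns the number of vectors
 * the device asks for, or -1 for the two reserved encodings (6 and 7).
 */
static int foo_msi_vec_count(uint16_t msgctl)
{
	unsigned int field = (msgctl & FOO_MSI_FLAGS_QMASK) >> 1;

	if (field > 5)
		return -1;

	return 1 << field;	/* 1, 2, 4, 8, 16 or 32 */
}
```

The result is a natural upper bound for the 'maxvec' argument of
pci_enable_msi_range().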

4.3 Using MSI-X

The MSI-X capability is much more flexible than the MSI capability.
It supports up to 2048 interrupts, each of which can be controlled
independently. To support this flexibility, drivers must use an array of
`struct msix_entry':

	struct msix_entry {
		u16	vector; /* kernel uses to write alloc vector */
		u16	entry; /* driver uses to specify entry */
	};

This allows for the device to use these interrupts in a sparse fashion;
for example, it could use interrupts 3 and 1027 and yet allocate only a
two-element array. The driver is expected to fill in the 'entry' value
in each element of the array to indicate for which entries the kernel
should assign interrupts; it is invalid to fill in two entries with the
same number.
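
The sparse example above can be sketched in plain C as follows. The
structure is a userspace copy of the one shown in this section, and
foo_fill_entries() and foo_has_duplicate_entry() are hypothetical
helpers, not kernel APIs. The duplicate check only illustrates the rule
that no two elements may name the same entry.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace copy of the structure shown above, for illustration only. */
struct msix_entry {
	uint16_t vector;	/* written by the kernel on success */
	uint16_t entry;		/* MSI-X Table index chosen by the driver */
};

/* Hypothetical helper: ask for the sparse table entries 3 and 1027. */
static void foo_fill_entries(struct msix_entry entries[2])
{
	entries[0].entry = 3;
	entries[1].entry = 1027;
	entries[0].vector = 0;
	entries[1].vector = 0;
}

/* Returns 1 if two elements name the same table entry, which is invalid. */
static int foo_has_duplicate_entry(const struct msix_entry *entries, size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (entries[i].entry == entries[j].entry)
				return 1;
	return 0;
}
```

Only the 'entry' members are meaningful before the enable call; the
'vector' members are filled in by the kernel on success.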

4.3.1 pci_enable_msix_range

int pci_enable_msix_range(struct pci_dev *dev, struct msix_entry *entries,
			  int minvec, int maxvec)

Calling this function asks the PCI subsystem to allocate any number of
MSI-X interrupts within the specified range from 'minvec' to 'maxvec'.
The 'entries' argument is a pointer to an array of msix_entry structs
which should be at least 'maxvec' entries in size.

On success, the device is switched into MSI-X mode and the function
returns the number of MSI-X interrupts that have been successfully
allocated. In this case the 'vector' member in entries numbered from
0 to the returned value - 1 is populated with the interrupt number;
the driver should then call request_irq() for each 'vector' that it
decides to use. The device driver is responsible for keeping track of the
interrupts assigned to the MSI-X vectors so it can free them again later.
The device driver can use the returned number of successfully allocated
MSI-X interrupts to further allocate and initialize device resources.

If this function returns a negative number, it indicates an error and
the driver should not attempt to allocate any more MSI-X interrupts for
this device.

This function, in contrast with pci_enable_msi_range(), does not adjust
dev->irq. The device will not generate interrupts for this interrupt
number once MSI-X is enabled.

Device drivers should normally call this function once per device
during the initialization phase.

It is ideal if drivers can cope with a variable number of MSI-X interrupts;
there are many reasons why the platform may not be able to provide the
exact number that a driver asks for.

There could be devices that cannot operate with just any number of MSI-X
interrupts within a range. For example, a network adapter might need
four vectors for each queue it provides, so the number of MSI-X
interrupts allocated should be a multiple of four. In this case
pci_enable_msix_range() cannot be used alone to request MSI-X interrupts
(since it can allocate any number within the range, without any notion of
the multiple of four) and the device driver must implement custom logic
to request the required number of MSI-X interrupts.
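
One possible shape for such custom logic is sketched below. It is not
taken from any driver. The allocation step is passed in as a callback so
that the sketch compiles outside the kernel; in a driver the callback
would presumably wrap pci_enable_msix_exact(). All foo_* names are
hypothetical, and foo_mock_alloc() is only a stand-in platform that can
provide a limited number of vectors.

```c
#include <errno.h>

/*
 * Hypothetical callback: try to allocate exactly 'nvec' vectors and
 * return 0 or a negative errno.
 */
typedef int (*foo_alloc_exact_fn)(void *ctx, int nvec);

/*
 * Request the largest multiple of 'per_queue' vectors within
 * [minvec, maxvec]. Returns the number allocated or a negative errno.
 */
static int foo_enable_per_queue(foo_alloc_exact_fn alloc, void *ctx,
				int minvec, int maxvec, int per_queue)
{
	int nvec, rc;

	if (per_queue <= 0 || minvec <= 0 || maxvec < minvec)
		return -EINVAL;

	/* Start from the largest multiple of 'per_queue' not above 'maxvec'. */
	for (nvec = maxvec - maxvec % per_queue; nvec >= minvec;
	     nvec -= per_queue) {
		rc = alloc(ctx, nvec);
		if (rc == 0)
			return nvec;
		if (rc != -ENOSPC)
			return rc;	/* fatal error: do not retry */
	}

	return -ENOSPC;
}

/* Mock platform for demonstration: only '*limit' vectors are available. */
static int foo_mock_alloc(void *ctx, int nvec)
{
	const int *limit = ctx;

	return nvec <= *limit ? 0 : -ENOSPC;
}
```

As in the power-of-two example in chapter 4.3.1.3, only -ENOSPC is
treated as a reason to retry with a smaller request.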

4.3.1.1 Maximum possible number of MSI-X interrupts

The typical usage of MSI-X interrupts is to allocate as many vectors as
possible, likely up to the limit returned by the pci_msix_vec_count()
function:

	static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
	{
		return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
					     1, nvec);
	}

Note that the value of the 'minvec' parameter is 1. As 'minvec' is
inclusive, the value of 0 would be meaningless and could result in error.

Some devices require a minimum number of MSI-X interrupts.
In this case the function could look like this:

	static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
	{
		return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
					     FOO_DRIVER_MINIMUM_NVEC, nvec);
	}

4.3.1.2 Exact number of MSI-X interrupts

If a driver is unable or unwilling to deal with a variable number of MSI-X
interrupts, it could request a particular number of interrupts by passing
that number to the pci_enable_msix_range() function as both the 'minvec'
and 'maxvec' parameters:

	static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec)
	{
		return pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
					     nvec, nvec);
	}

Note that, unlike the pci_enable_msix_exact() function, which could also
be used to enable a particular number of MSI-X interrupts,
pci_enable_msix_range() returns either a negative errno or 'nvec' (not a
negative errno or 0, as pci_enable_msix_exact() does).

4.3.1.3 Specific requirements to the number of MSI-X interrupts

As noted above, there could be devices that cannot operate with just any
number of MSI-X interrupts within a range. For example, assume a device
that is only capable of sending a number of MSI-X interrupts which is a
power of two. A routine that enables MSI-X mode for such a device might
look like this:

	/*
	 * Assume 'minvec' and 'maxvec' are non-zero
	 */
	static int foo_driver_enable_msix(struct foo_adapter *adapter,
					  int minvec, int maxvec)
	{
		int rc;

		minvec = roundup_pow_of_two(minvec);
		maxvec = rounddown_pow_of_two(maxvec);

		if (minvec > maxvec)
			return -ERANGE;

	retry:
		rc = pci_enable_msix_range(adapter->pdev, adapter->msix_entries,
					   maxvec, maxvec);
		/*
		 * -ENOSPC is the only error code allowed to be analyzed
		 */
		if (rc == -ENOSPC) {
			if (maxvec == 1)
				return -ENOSPC;

			maxvec /= 2;

			if (minvec > maxvec)
				return -ENOSPC;

			goto retry;
		}

		return rc;
	}

Note how the pci_enable_msix_range() return value is analyzed for a
fallback - any error code other than -ENOSPC indicates a fatal error and
should not be retried.

4.3.2 pci_enable_msix_exact

int pci_enable_msix_exact(struct pci_dev *dev,
			  struct msix_entry *entries, int nvec)

This variation on the pci_enable_msix_range() call allows a device driver
to request exactly 'nvec' MSI-Xs.

If this function returns a negative number, it indicates an error and
the driver should not attempt to allocate any more MSI-X interrupts for
this device.

In contrast with the pci_enable_msix_range() function,
pci_enable_msix_exact() returns zero in case of success, which indicates
that the MSI-X interrupts have been successfully allocated.

Another version of a routine that enables MSI-X mode for a device with
specific requirements described in chapter 4.3.1.3 might look like this:

	/*
	 * Assume 'minvec' and 'maxvec' are non-zero
	 */
	static int foo_driver_enable_msix(struct foo_adapter *adapter,
					  int minvec, int maxvec)
	{
		int rc;

		minvec = roundup_pow_of_two(minvec);
		maxvec = rounddown_pow_of_two(maxvec);

		if (minvec > maxvec)
			return -ERANGE;

	retry:
		rc = pci_enable_msix_exact(adapter->pdev,
					   adapter->msix_entries, maxvec);

		/*
		 * -ENOSPC is the only error code allowed to be analyzed
		 */
		if (rc == -ENOSPC) {
			if (maxvec == 1)
				return -ENOSPC;

			maxvec /= 2;

			if (minvec > maxvec)
				return -ENOSPC;

			goto retry;
		} else if (rc < 0) {
			return rc;
		}

		return maxvec;
	}

4.3.3 pci_disable_msix

void pci_disable_msix(struct pci_dev *dev)

This function should be used to undo the effect of pci_enable_msix_range().
It frees the previously allocated MSI-X interrupts. The interrupts may
subsequently be assigned to another device, so drivers should not cache
the value of the 'vector' elements over a call to pci_disable_msix().

Before calling this function, a device driver must always call free_irq()
on any interrupt for which it previously called request_irq().
Failure to do so results in a BUG_ON(), leaving the device with
MSI-X enabled and thus leaking its vector.

4.3.4 The MSI-X Table

The MSI-X capability specifies a BAR and offset within that BAR for the
MSI-X Table. This address is mapped by the PCI subsystem, and should not
be accessed directly by the device driver. If the driver wishes to
mask or unmask an interrupt, it should call disable_irq() / enable_irq().

4.3.5 pci_msix_vec_count

int pci_msix_vec_count(struct pci_dev *dev)

This function can be used to retrieve the number of entries in the
device's MSI-X table.

If this function returns a negative number, it indicates the device is
not capable of sending MSI-Xs.

If this function returns a positive number, it indicates the maximum
number of MSI-X interrupt vectors that could be allocated.

4.4 Handling devices implementing both MSI and MSI-X capabilities

If a device implements both MSI and MSI-X capabilities, it can
run in either MSI mode or MSI-X mode, but not both simultaneously.
This is a requirement of the PCI spec, and it is enforced by the
PCI layer. Calling pci_enable_msi_range() when MSI-X is already
enabled or pci_enable_msix_range() when MSI is already enabled
results in an error. If a device driver wishes to switch between MSI
and MSI-X at runtime, it must first quiesce the device, then switch
it back to pin-interrupt mode, before calling pci_enable_msi_range()
or pci_enable_msix_range() and resuming operation. This is not expected
to be a common operation but may be useful for debugging or testing
during development.

4.5 Considerations when using MSIs

4.5.1 Choosing between MSI-X and MSI

If your device supports both MSI-X and MSI capabilities, you should use
the MSI-X facilities in preference to the MSI facilities. As mentioned
above, MSI-X supports any number of interrupts between 1 and 2048.
In contrast, MSI is restricted to a maximum of 32 interrupts (and
must be a power of two). In addition, the MSI interrupt vectors must
be allocated consecutively, so the system might not be able to allocate
as many vectors for MSI as it could for MSI-X. On some platforms, MSI
interrupts must all be targeted at the same set of CPUs whereas MSI-X
interrupts can all be targeted at different CPUs.

4.5.2 Spinlocks

Most device drivers have a per-device spinlock which is taken in the
interrupt handler. With pin-based interrupts or a single MSI, it is not
necessary to disable interrupts (Linux guarantees the same interrupt will
not be re-entered). If a device uses multiple interrupts, the driver
must disable interrupts while the lock is held. Otherwise, if the device
sends a different interrupt while the lock is held, the driver will
deadlock trying to recursively acquire the spinlock.

There are two solutions. The first is to take the lock with
spin_lock_irqsave() or spin_lock_irq() (see
Documentation/DocBook/kernel-locking). The second is to specify
IRQF_DISABLED to request_irq() so that the kernel runs the entire
interrupt routine with interrupts disabled.

If your MSI interrupt routine does not hold the lock for the whole time
it is running, the first solution may be best. The second solution is
normally preferred as it avoids making two transitions from interrupt
disabled to enabled and back again.

4.6 How to tell whether MSI/MSI-X is enabled on a device

Using 'lspci -v' (as root) may show some devices with "MSI", "Message
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
has an 'Enable' flag which is followed with either "+" (enabled)
or "-" (disabled).
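
For scripted checks, the flag can be picked out of a capability line
with a few lines of C. The sketch below is illustrative only:
foo_msi_enabled() is a hypothetical helper, and the exact wording of
'lspci -v' capability lines may differ between pciutils versions, so
the function relies on nothing more than the 'Enable' flag described
above.

```c
#include <string.h>

/*
 * Hypothetical helper: given one capability line of 'lspci -v' output,
 * return 1 if its Enable flag is "+", 0 if it is "-", and -1 if the
 * line carries no Enable flag at all.
 */
static int foo_msi_enabled(const char *line)
{
	const char *flag = strstr(line, "Enable");

	if (!flag)
		return -1;

	/* The character right after "Enable" is the flag state. */
	switch (flag[strlen("Enable")]) {
	case '+':
		return 1;
	case '-':
		return 0;
	default:
		return -1;
	}
}
```

A line such as "MSI: Enable+ ..." would yield 1 and a line such as
"MSI-X: Enable- ..." would yield 0.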


5. MSI quirks

Several PCI chipsets or devices are known not to support MSIs.
The PCI stack provides three ways to disable MSIs:

1. globally
2. on all devices behind a specific bridge
3. on a single device

5.1. Disabling MSIs globally

Some host chipsets simply don't support MSIs properly. If we're
lucky, the manufacturer knows this and has indicated it in the ACPI
FADT table. In this case, Linux automatically disables MSIs.
Some boards don't include this information in the table and so we have
to detect them ourselves. The complete list of these is found near the
quirk_disable_all_msi() function in drivers/pci/quirks.c.

If you have a board which has problems with MSIs, you can pass pci=nomsi
on the kernel command line to disable MSIs on all devices. It would be
in your best interests to report the problem to linux-pci@vger.kernel.org
including a full 'lspci -v' so we can add the quirks to the kernel.

5.2. Disabling MSIs below a bridge

Some PCI bridges are not able to route MSIs between busses properly.
In this case, MSIs must be disabled on all devices behind the bridge.

Some bridges allow you to enable MSIs by changing some bits in their
PCI configuration space (especially the Hypertransport chipsets such
as the nVidia nForce and Serverworks HT2000). As with host chipsets,
Linux mostly knows about them and automatically enables MSIs if it can.
If you have a bridge unknown to Linux, you can enable
MSIs in configuration space using whatever method you know works, then
enable MSIs on that bridge by doing:

       echo 1 > /sys/bus/pci/devices/$bridge/msi_bus

where $bridge is the PCI address of the bridge you've enabled (e.g.
0000:00:0e.0).

To disable MSIs, echo 0 instead of 1. Changing this value should be
done with caution as it could break interrupt handling for all devices
below this bridge.

Again, please notify linux-pci@vger.kernel.org of any bridges that need
special handling.

5.3. Disabling MSIs on a single device

Some devices are known to have faulty MSI implementations. Usually this
is handled in the individual device driver, but occasionally it's necessary
to handle this with a quirk. Some drivers have an option to disable use
of MSI. While this is a convenient workaround for the driver author,
it is not good practice, and should not be emulated.

5.4. Finding why MSIs are disabled on a device

From the above three sections, you can see that there are many reasons
why MSIs may not be enabled for a given device. Your first step should
be to examine your dmesg carefully to determine whether MSIs are enabled
for your machine. You should also check your .config to be sure you
have enabled CONFIG_PCI_MSI.

Then, 'lspci -t' gives the list of bridges above a device. Reading
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
or disabled (0). If 0 is found in any of the msi_bus files belonging
to bridges between the PCI root and the device, MSIs are disabled.
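
The bridge check above can be automated along the following lines. This
is only a userspace sketch: foo_read_msi_bus() and foo_msi_allowed() are
hypothetical helpers, and the caller is assumed to supply the msi_bus
paths of the bridges between the PCI root and the device (for example
/sys/bus/pci/devices/0000:00:0e.0/msi_bus), as found with 'lspci -t'.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: read an msi_bus attribute file and return 1
 * (MSIs enabled), 0 (disabled) or -1 if the file cannot be read or
 * holds something else.
 */
static int foo_read_msi_bus(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (!f)
		return -1;

	c = fgetc(f);
	fclose(f);

	if (c == '1')
		return 1;
	if (c == '0')
		return 0;
	return -1;
}

/*
 * Returns 0 if any bridge on the path reports 0, otherwise 1.
 * Unreadable files are not counted as disabled in this sketch.
 */
static int foo_msi_allowed(const char *const *paths, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (foo_read_msi_bus(paths[i]) == 0)
			return 0;
	return 1;
}
```

A result of 0 means that at least one bridge above the device has MSIs
disabled, which matches the rule stated above.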

It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_enable_msi_range() or
pci_enable_msix_range().
217
Documentation/PCI/PCIEBUS-HOWTO.txt
Normal file
@@ -0,0 +1,217 @@
		The PCI Express Port Bus Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
			11/03/2004

1. About this guide

This guide describes the basics of the PCI Express Port Bus driver
and provides information on how to enable the service drivers to
register/unregister with the PCI Express Port Bus Driver.

2. Copyright 2004 Intel Corporation

3. What is the PCI Express Port Bus Driver

A PCI Express Port is a logical PCI-PCI Bridge structure. There
are two types of PCI Express Port: the Root Port and the Switch
Port. The Root Port originates a PCI Express link from a PCI Express
Root Complex and the Switch Port connects PCI Express links to
internal logical PCI buses. The Switch Port, which has its secondary
bus representing the switch's internal routing logic, is called the
switch's Upstream Port. The switch's Downstream Port is bridging from
switch's internal routing bus to a bus representing the downstream
PCI Express link from the PCI Express Switch.

A PCI Express Port can provide up to four distinct functions,
referred to in this document as services, depending on its port type.
PCI Express Port's services include native hotplug support (HP),
power management event support (PME), advanced error reporting
support (AER), and virtual channel support (VC). These services may
be handled by a single complex driver or be individually distributed
and handled by corresponding service drivers.

4. Why use the PCI Express Port Bus Driver?

In existing Linux kernels, the Linux Device Driver Model allows a
physical device to be handled by only a single driver. The PCI
Express Port is a PCI-PCI Bridge device with multiple distinct
services. To maintain a clean and simple solution each service
may have its own software service driver. In this case several
service drivers will compete for a single PCI-PCI Bridge device.
For example, if the PCI Express Root Port native hotplug service
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
kernel therefore does not load other service drivers for that Root
Port. In other words, it is impossible to have multiple service
drivers load and run on a PCI-PCI Bridge device simultaneously
using the current driver model.

Enabling multiple service drivers to run simultaneously requires
a PCI Express Port Bus driver, which manages all populated
PCI Express Ports and distributes all provided service requests
to the corresponding service drivers as required. Some key
advantages of using the PCI Express Port Bus driver are listed below:

	- Allow multiple service drivers to run simultaneously on
	  a PCI-PCI Bridge Port device.

	- Allow service drivers to be implemented in an independent
	  staged approach.

	- Allow one service driver to run on multiple PCI-PCI Bridge
	  Port devices.

	- Manage and distribute resources of a PCI-PCI Bridge Port
	  device to requested service drivers.

5. Configuring the PCI Express Port Bus Driver vs. Service Drivers
|
||||
|
||||
5.1 Including the PCI Express Port Bus Driver Support into the Kernel
|
||||
|
||||
Including the PCI Express Port Bus driver depends on whether the PCI
|
||||
Express support is included in the kernel config. The kernel will
|
||||
automatically include the PCI Express Port Bus driver as a kernel
|
||||
driver when the PCI Express support is enabled in the kernel.
|
||||
|
||||
5.2 Enabling Service Driver Support
|
||||
|
||||
PCI device drivers are implemented based on Linux Device Driver Model.
|
||||
All service drivers are PCI device drivers. As discussed above, it is
|
||||
impossible to load any service driver once the kernel has loaded the
|
||||
PCI Express Port Bus Driver. Conforming to the PCI Express Port Bus Driver
Model requires some minimal changes to existing service drivers; these
changes have no impact on the functionality of existing service drivers.
|
||||
|
||||
A service driver is required to use the two APIs shown below to
|
||||
register its service with the PCI Express Port Bus driver (see
|
||||
section 5.2.1 & 5.2.2). It is important that a service driver
|
||||
initializes the pcie_port_service_driver data structure, included in
|
||||
header file /include/linux/pcieport_if.h, before calling these APIs.
|
||||
Failure to do so will result in an identity mismatch, which prevents
|
||||
the PCI Express Port Bus driver from loading a service driver.
|
||||
|
||||
5.2.1 pcie_port_service_register
|
||||
|
||||
int pcie_port_service_register(struct pcie_port_service_driver *new)
|
||||
|
||||
This API replaces the Linux Driver Model's pci_register_driver API. A
|
||||
service driver should always call pcie_port_service_register at
module init. Note that after the service driver is loaded, calls
|
||||
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
|
||||
necessary since these calls are executed by the PCI Express Port Bus driver.
|
||||
|
||||
5.2.2 pcie_port_service_unregister
|
||||
|
||||
void pcie_port_service_unregister(struct pcie_port_service_driver *new)
|
||||
|
||||
pcie_port_service_unregister replaces the Linux Driver Model's
|
||||
pci_unregister_driver. It is always called by the service driver when a
|
||||
module exits.
|
||||
|
||||
5.2.3 Sample Code
|
||||
|
||||
Below is sample service driver code to initialize the port service
|
||||
driver data structure.
|
||||
|
||||
static struct pcie_port_service_id service_id[] = { {
	.vendor = PCI_ANY_ID,
	.device = PCI_ANY_ID,
	.port_type = PCIE_RC_PORT,
	.service_type = PCIE_PORT_SERVICE_AER,
	}, { /* end: all zeroes */ }
};
|
||||
|
||||
static struct pcie_port_service_driver root_aerdrv = {
	.name		= (char *)device_name,
	.id_table	= &service_id[0],

	.probe		= aerdrv_load,
	.remove		= aerdrv_unload,

	.suspend	= aerdrv_suspend,
	.resume		= aerdrv_resume,
};
|
||||
|
||||
Below is sample code for registering/unregistering a service
|
||||
driver.
|
||||
|
||||
static int __init aerdrv_service_init(void)
{
	int retval = 0;

	retval = pcie_port_service_register(&root_aerdrv);
	if (!retval) {
		/*
		 * FIX ME
		 */
	}
	return retval;
}
|
||||
|
||||
static void __exit aerdrv_service_exit(void)
{
	pcie_port_service_unregister(&root_aerdrv);
}
|
||||
|
||||
module_init(aerdrv_service_init);
|
||||
module_exit(aerdrv_service_exit);
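
The id table in the sample above hints at how an identity mismatch could
arise. The following userspace sketch shows one plausible way such a table
entry might be compared against a port service. The struct, the ANY_ID
wildcard and service_id_matches() are assumptions made for illustration;
the matching logic actually used by the port bus driver may differ.

```c
#include <stdint.h>

#define ANY_ID 0xffffffffu	/* wildcard, playing the role of PCI_ANY_ID */

/* Simplified stand-in for the id fields used in the sample id table. */
struct service_id_model {
	uint32_t vendor;
	uint32_t device;
	uint32_t port_type;
	uint32_t service_type;
};

/* Returns 1 when a table entry matches the identity of a port service. */
static int service_id_matches(const struct service_id_model *entry,
			      const struct service_id_model *dev)
{
	if (entry->vendor != ANY_ID && entry->vendor != dev->vendor)
		return 0;
	if (entry->device != ANY_ID && entry->device != dev->device)
		return 0;
	/* Port type and service type are compared exactly in this sketch. */
	return entry->port_type == dev->port_type &&
	       entry->service_type == dev->service_type;
}
```

With wildcard vendor and device fields, as in the sample, only the port
type and service type decide whether the driver is offered a given port.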
|
||||
|
||||
6. Possible Resource Conflicts
|
||||
|
||||
Since all service drivers of a PCI-PCI Bridge Port device are
|
||||
allowed to run simultaneously, resource conflicts are possible. A few
are listed below with proposed solutions.
|
||||
|
||||
6.1 MSI Vector Resource
|
||||
|
||||
The MSI capability structure enables a device software driver to call
|
||||
pci_enable_msi to request MSI based interrupts. Once MSI interrupts
|
||||
are enabled on a device, it stays in this mode until a device driver
|
||||
calls pci_disable_msi to disable MSI interrupts and revert back to
|
||||
INTx emulation mode. Since service drivers of the same PCI-PCI Bridge
|
||||
port share the same physical device, if an individual service driver
|
||||
calls pci_enable_msi/pci_disable_msi it may result in unpredictable
|
||||
behavior. For example, two service drivers run simultaneously on the
|
||||
same physical Root Port. Both service drivers call pci_enable_msi to
|
||||
request MSI based interrupts. A service driver may not know whether
|
||||
any other service drivers have run on this Root Port. If either one
|
||||
of them calls pci_disable_msi, it puts the other service driver
|
||||
in a wrong interrupt mode.
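
The hazard in this example can be reproduced with a toy model. The code
below is a userspace sketch only: shared_port and the svc_* helpers are
hypothetical names that stand for two service drivers toggling the
interrupt mode of the one physical port they share.

```c
/* Interrupt modes of the shared port in this toy model. */
enum irq_mode { MODE_INTX, MODE_MSI };

/* One physical port shared by every service driver bound to it. */
struct shared_port {
	enum irq_mode mode;
};

/* What a service driver must NOT do: flip the mode of the shared port. */
static void svc_enable_msi(struct shared_port *port)
{
	port->mode = MODE_MSI;
}

static void svc_disable_msi(struct shared_port *port)
{
	port->mode = MODE_INTX;
}

/* Returns 1 if a service that set up MSI still sees the mode it expects. */
static int svc_mode_is_consistent(const struct shared_port *port)
{
	return port->mode == MODE_MSI;
}
```

If service A and service B both call svc_enable_msi() and A later calls
svc_disable_msi(), svc_mode_is_consistent() returns 0 for B even though B
never changed anything itself.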
|
||||
|
||||
To avoid this situation, service drivers are not permitted to
switch the interrupt mode on their device. The PCI Express Port Bus driver
|
||||
is responsible for determining the interrupt mode and this should be
|
||||
transparent to service drivers. Service drivers need to know only
|
||||
the vector IRQ assigned to the field irq of struct pcie_device, which
|
||||
is passed in when the PCI Express Port Bus driver probes each service
|
||||
driver. Service drivers should use (struct pcie_device*)dev->irq to
|
||||
call request_irq/free_irq. In addition, the interrupt mode is stored
|
||||
in the field interrupt_mode of struct pcie_device.
|
||||
|
||||
6.2 MSI-X Vector Resources
|
||||
|
||||
Similar to MSI, a device driver for an MSI-X capable device can
call pci_enable_msix to request MSI-X interrupts. Service drivers
are not permitted to switch the interrupt mode on their device. The PCI
|
||||
Express Port Bus driver is responsible for determining the interrupt
|
||||
mode and this should be transparent to service drivers. Any attempt
|
||||
by a service driver to call pci_enable_msix/pci_disable_msix may
result in unpredictable behavior. Service drivers should use
|
||||
(struct pcie_device*)dev->irq and call request_irq/free_irq.
|
||||
|
||||
6.3 PCI Memory/IO Mapped Regions
|
||||
|
||||
Service drivers for PCI Express Power Management (PME), Advanced
|
||||
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
|
||||
PCI configuration space on the PCI Express port. In all cases the
|
||||
registers accessed are independent of each other. This driver model assumes
that all service drivers will be well behaved and not overwrite
other service drivers' configuration settings.
|
||||
|
||||
6.4 PCI Config Registers
|
||||
|
||||
Each service driver runs its PCI config operations on its own
|
||||
capability structure except the PCI Express capability structure, in
|
||||
which the Root Control register and the Device Control register are shared
between PME and AER. This driver model assumes that all service drivers
will be well behaved and not overwrite other service drivers'
|
||||
configuration settings.
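
One way for a service driver to stay well behaved on a shared register is
to touch only its own bits with a read-modify-write. The sketch below
models that on a plain 16-bit value. The helper names are invented for
this example, and the two bit values are assumptions chosen to resemble
the Root Control register layout; check them against the kernel headers
before relying on them.

```c
#include <stdint.h>

/* Assumed bit positions, modelled on the Root Control register. */
#define RTCTL_SECEE	0x0001u	/* system error on correctable error (AER) */
#define RTCTL_PMEIE	0x0008u	/* PME interrupt enable (PME) */

/* Set only the requested bits and preserve everything else. */
static uint16_t reg_set_bits(uint16_t old, uint16_t bits)
{
	return (uint16_t)(old | bits);
}

/* Clear only the requested bits and preserve everything else. */
static uint16_t reg_clear_bits(uint16_t old, uint16_t bits)
{
	return (uint16_t)(old & (uint16_t)~bits);
}
```

A driver that wrote a whole new value instead of using helpers like these
would silently clear the bits owned by the other service.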
|
||||
431
Documentation/PCI/pci-error-recovery.txt
Normal file
|
|
@ -0,0 +1,431 @@
|
|||
|
||||
PCI Error Recovery
|
||||
------------------
|
||||
February 2, 2006
|
||||
|
||||
Current document maintainer:
|
||||
Linas Vepstas <linasvepstas@gmail.com>
|
||||
updated by Richard Lary <rlary@us.ibm.com>
|
||||
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
|
||||
|
||||
|
||||
Many PCI bus controllers are able to detect a variety of hardware
|
||||
PCI errors on the bus, such as parity errors on the data and address
|
||||
busses, as well as SERR and PERR errors. Some of the more advanced
|
||||
chipsets are able to deal with these errors; these include PCI-E chipsets,
|
||||
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
|
||||
pSeries boxes. A typical action taken is to disconnect the affected device,
|
||||
halting all I/O to it. The goal of a disconnection is to avoid system
|
||||
corruption; for example, to halt system memory corruption due to DMA's
|
||||
to "wild" addresses. Typically, a reconnection mechanism is also
|
||||
offered, so that the affected PCI device(s) are reset and put back
|
||||
into working condition. The reset phase requires coordination
|
||||
between the affected device drivers and the PCI controller chip.
|
||||
This document describes a generic API for notifying device drivers
|
||||
of a bus disconnection, and then performing error recovery.
|
||||
This API is currently implemented in the 2.6.16 and later kernels.
|
||||
|
||||
Reporting and recovery is performed in several steps. First, when
|
||||
a PCI hardware error has resulted in a bus disconnect, that event
|
||||
is reported as soon as possible to all affected device drivers,
|
||||
including multiple instances of a device driver on multi-function
|
||||
cards. This allows device drivers to avoid deadlocking in spinloops,
|
||||
waiting for some i/o-space register to change, when it never will.
|
||||
It also gives the drivers a chance to defer incoming I/O as
|
||||
needed.
|
||||
|
||||
Next, recovery is performed in several stages. Most of the complexity
|
||||
is forced by the need to handle multi-function devices, that is,
|
||||
devices that have multiple device drivers associated with them.
|
||||
In the first stage, each driver is allowed to indicate what type
|
||||
of reset it desires, the choices being a simple re-enabling of I/O
|
||||
or requesting a slot reset.
|
||||
|
||||
If any driver requests a slot reset, that is what will be done.
|
||||
|
||||
After a reset and/or a re-enabling of I/O, all drivers are
|
||||
again notified, so that they may then perform any device setup/config
|
||||
that may be required. After these have all completed, a final
|
||||
"resume normal operations" event is sent out.
|
||||
|
||||
The biggest reason for choosing a kernel-based implementation rather
|
||||
than a user-space implementation was the need to deal with bus
|
||||
disconnects of PCI devices attached to storage media, and, in particular,
|
||||
disconnects from devices holding the root file system. If the root
|
||||
file system is disconnected, a user-space mechanism would have to go
|
||||
through a large number of contortions to complete recovery. Almost all
|
||||
of the current Linux file systems are not tolerant of disconnection
|
||||
from/reconnection to their underlying block device. By contrast,
|
||||
bus errors are easy to manage in the device driver. Indeed, most
|
||||
device drivers already handle very similar recovery procedures;
|
||||
for example, the SCSI-generic layer already provides significant
|
||||
mechanisms for dealing with SCSI bus errors and SCSI bus resets.
|
||||
|
||||
|
||||
Detailed Design
|
||||
---------------
|
||||
Design and implementation details below, based on a chain of
|
||||
public email discussions with Ben Herrenschmidt, circa 5 April 2005.
|
||||
|
||||
The error recovery API support is exposed to the driver in the form of
|
||||
a structure of function pointers pointed to by a new field in struct
|
||||
pci_driver. A driver that fails to provide the structure is "non-aware",
|
||||
and the actual recovery steps taken are platform dependent. The
|
||||
arch/powerpc implementation will simulate a PCI hotplug remove/add.
|
||||
|
||||
This structure has the form:
|
||||
	struct pci_error_handlers
	{
		int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
		int (*mmio_enabled)(struct pci_dev *dev);
		int (*link_reset)(struct pci_dev *dev);
		int (*slot_reset)(struct pci_dev *dev);
		void (*resume)(struct pci_dev *dev);
	};
|
||||
|
||||
The possible channel states are:
|
||||
	enum pci_channel_state {
		pci_channel_io_normal,  /* I/O channel is in normal state */
		pci_channel_io_frozen,  /* I/O to channel is blocked */
		pci_channel_io_perm_failure, /* PCI card is dead */
	};
|
||||
|
||||
Possible return values are:
|
||||
	enum pci_ers_result {
		PCI_ERS_RESULT_NONE,        /* no result/none/not supported in device driver */
		PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
		PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
		PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
		PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
	};
|
||||
|
||||
A driver does not have to implement all of these callbacks; however,
|
||||
if it implements any, it must implement error_detected(). If a callback
|
||||
is not implemented, the corresponding feature is considered unsupported.
|
||||
For example, if mmio_enabled() and resume() aren't there, then it
|
||||
is assumed that the driver is not doing any direct recovery and requires
|
||||
a slot reset. If link_reset() is not implemented, the card is assumed to
|
||||
not care about link resets. Typically a driver will want to know about
|
||||
a slot_reset().
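
The rule that error_detected() is mandatory once any callback is provided
can be expressed as a small check. This is a userspace sketch with
stand-in types: error_handlers_model copies only the callback names from
the structure above, and handlers_ok() and the demo_* callbacks are
hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

struct dev_model;	/* opaque stand-in for struct pci_dev */

/* Stand-in that mirrors the callback names of struct pci_error_handlers. */
struct error_handlers_model {
	int  (*error_detected)(struct dev_model *dev, int state);
	int  (*mmio_enabled)(struct dev_model *dev);
	int  (*link_reset)(struct dev_model *dev);
	int  (*slot_reset)(struct dev_model *dev);
	void (*resume)(struct dev_model *dev);
};

/*
 * Returns 1 if the table follows the rule, 0 otherwise.
 * A NULL table models a "non-aware" driver and is accepted.
 */
static int handlers_ok(const struct error_handlers_model *h)
{
	if (!h)
		return 1;
	if (h->mmio_enabled || h->link_reset || h->slot_reset || h->resume)
		return h->error_detected != NULL;
	return 1;
}

/* Placeholder callbacks for experiments. */
static int demo_error_detected(struct dev_model *dev, int state)
{
	(void)dev;
	(void)state;
	return 0;
}

static int demo_slot_reset(struct dev_model *dev)
{
	(void)dev;
	return 0;
}
```

A typical minimal table in this model provides error_detected() and
slot_reset() and leaves the remaining pointers NULL.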
|
||||
|
||||
The actual steps taken by a platform to recover from a PCI error
|
||||
event will be platform-dependent, but will follow the general
|
||||
sequence described below.
|
||||
|
||||
STEP 0: Error Event
|
||||
-------------------
|
||||
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
|
||||
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
|
||||
all writes are ignored.
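
Because reads from an isolated slot return all ones, a driver that polls a
register should bound its loop and treat an all-ones value as a hint that
the slot may be frozen. The sketch below is a userspace model: reg_read_fn,
poll_ready() and the fake accessors are hypothetical, and an all-ones
value can also be legitimate register content on some devices.

```c
#include <stddef.h>
#include <stdint.h>

#define ALL_ONES 0xffffffffu

typedef uint32_t (*reg_read_fn)(void *ctx);	/* hypothetical register accessor */

/*
 * Poll a status register until ready_bit is set.
 * Returns 0 when ready, -1 after max_polls attempts, and -2 if a read
 * returns all ones, which hints that the slot may be isolated.
 */
static int poll_ready(reg_read_fn read_reg, void *ctx,
		      uint32_t ready_bit, int max_polls)
{
	int i;

	for (i = 0; i < max_polls; i++) {
		uint32_t v = read_reg(ctx);

		if (v == ALL_ONES)
			return -2;
		if (v & ready_bit)
			return 0;
	}
	return -1;
}

/* Fake accessor: behaves like an isolated slot. */
static uint32_t read_isolated(void *ctx)
{
	(void)ctx;
	return ALL_ONES;
}

/* Fake accessor: reports ready on the third read. */
static uint32_t read_counting(void *ctx)
{
	int *n = (int *)ctx;

	return (++*n >= 3) ? 0x1u : 0x0u;
}
```

The bounded loop is the point of the sketch: an unbounded spin on a frozen
slot would never see the register change.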
|
||||
|
||||
|
||||
STEP 1: Notification
|
||||
--------------------
|
||||
Platform calls the error_detected() callback on every instance of
|
||||
every driver affected by the error.
|
||||
|
||||
At this point, the device might not be accessible anymore, depending on
|
||||
the platform (the slot will be isolated on powerpc). The driver may
|
||||
already have "noticed" the error because of a failing I/O, but this
|
||||
is the proper "synchronization point", that is, it gives the driver
|
||||
a chance to clean up, waiting for pending stuff (timers, whatever, etc...)
|
||||
to complete; it can take semaphores, schedule, etc... everything but
|
||||
touch the device. Within this function and after it returns, the driver
|
||||
shouldn't do any new IOs. Called in task context. This is sort of a
|
||||
"quiesce" point. See note about interrupts at the end of this doc.
|
||||
|
||||
All drivers participating in this system must implement this call.
|
||||
The driver must return one of the following result codes:
|
||||
- PCI_ERS_RESULT_CAN_RECOVER:
|
||||
Driver returns this if it thinks it might be able to recover
|
||||
the HW by just banging IOs or if it wants to be given
|
||||
a chance to extract some diagnostic information (see
|
||||
mmio_enabled, below).
|
||||
- PCI_ERS_RESULT_NEED_RESET:
|
||||
Driver returns this if it can't recover without a
|
||||
slot reset.
|
||||
- PCI_ERS_RESULT_DISCONNECT:
|
||||
Driver returns this if it doesn't want to recover at all.
|
||||
|
||||
The next step taken will depend on the result codes returned by the
|
||||
drivers.
|
||||
|
||||
If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
|
||||
then the platform should re-enable IOs on the slot (or do nothing in
|
||||
particular, if the platform doesn't isolate slots), and recovery
|
||||
proceeds to STEP 2 (MMIO Enable).
|
||||
|
||||
If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
|
||||
then recovery proceeds to STEP 4 (Slot Reset).
|
||||
|
||||
If the platform is unable to recover the slot, the next step
|
||||
is STEP 6 (Permanent Failure).
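
The decision after STEP 1 can be sketched as a vote over the result codes
returned by all affected drivers. This is a userspace model under stated
assumptions: the ers_result names mirror enum pci_ers_result but the
numeric values are arbitrary, and treating a PCI_ERS_RESULT_DISCONNECT
vote as leading to STEP 6 is a guess made for this sketch, since the text
above leaves that case to the platform.

```c
#include <stddef.h>

/* Names mirror enum pci_ers_result; the values here are arbitrary. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED
};

/* Step numbers as used in this document. */
enum next_step {
	STEP_MMIO_ENABLED = 2,
	STEP_SLOT_RESET = 4,
	STEP_PERM_FAILURE = 6
};

/* Pick the step that follows error_detected() from all driver votes. */
static enum next_step after_error_detected(const enum ers_result *votes,
					   size_t n)
{
	int need_reset = 0, disconnect = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		switch (votes[i]) {
		case ERS_CAN_RECOVER:
			break;
		case ERS_DISCONNECT:
			disconnect = 1;
			break;
		default:
			/* Anything else is treated as a reset request here. */
			need_reset = 1;
			break;
		}
	}
	if (need_reset)
		return STEP_SLOT_RESET;		/* any reset request wins */
	if (disconnect)
		return STEP_PERM_FAILURE;	/* assumption, see above */
	return STEP_MMIO_ENABLED;		/* everyone can recover */
}
```

A single driver asking for a reset is enough to send the whole segment to
STEP 4, which matches the "if any driver requests a slot reset" rule.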
|
||||
|
||||
>>> The current powerpc implementation assumes that a device driver will
|
||||
>>> *not* schedule or semaphore in this routine; the current powerpc
|
||||
>>> implementation uses one kernel thread to notify all devices;
|
||||
>>> thus, if one device sleeps/schedules, all devices are affected.
|
||||
>>> Doing better requires complex multi-threaded logic in the error
|
||||
>>> recovery implementation (e.g. waiting for all notification threads
|
||||
>>> to "join" before proceeding with recovery.) This seems excessively
|
||||
>>> complex and not worth implementing.
|
||||
|
||||
>>> The current powerpc implementation doesn't much care if the device
|
||||
>>> attempts I/O at this point, or not. I/O's will fail, returning
|
||||
>>> a value of 0xff on read, and writes will be dropped. If more than
|
||||
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
|
||||
>>> assumes that the device driver has gone into an infinite loop
|
||||
>>> and prints an error to syslog. A reboot is then required to
|
||||
>>> get the device working again.
|
||||
|
||||
STEP 2: MMIO Enabled
|
||||
--------------------
|
||||
The platform re-enables MMIO to the device (but typically not the
|
||||
DMA), and then calls the mmio_enabled() callback on all affected
|
||||
device drivers.
|
||||
|
||||
This is the "early recovery" call. IOs are allowed again, but DMA is
|
||||
not, with some restrictions. This is NOT a callback for the driver to
|
||||
start operations again, only to peek/poke at the device, extract diagnostic
|
||||
information, if any, and possibly do things like trigger a device-local
|
||||
reset or some such, but not restart operations. This callback is made if
|
||||
all drivers on a segment agree that they can try to recover and if no automatic
|
||||
link reset was performed by the HW. If the platform can't just re-enable IOs
|
||||
without a slot reset or a link reset, it will not call this callback, and
|
||||
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
|
||||
|
||||
>>> The following is proposed; no platform implements this yet:
|
||||
>>> Proposal: All I/O's should be done _synchronously_ from within
|
||||
>>> this callback, errors triggered by them will be returned via
|
||||
>>> the normal pci_check_whatever() API, no new error_detected()
|
||||
>>> callback will be issued due to an error happening here. However,
|
||||
>>> such an error might cause IOs to be re-blocked for the whole
|
||||
>>> segment, and thus invalidate the recovery that other devices
|
||||
>>> on the same segment might have done, forcing the whole segment
|
||||
>>> into one of the next states, that is, link reset or slot reset.
|
||||
|
||||
The driver should return one of the following result codes:
|
||||
- PCI_ERS_RESULT_RECOVERED
|
||||
Driver returns this if it thinks the device is fully
|
||||
functional and thinks it is ready to start
|
||||
normal driver operations again. There is no
|
||||
guarantee that the driver will actually be
|
||||
allowed to proceed, as another driver on the
|
||||
same segment might have failed and thus triggered a
|
||||
slot reset on platforms that support it.
|
||||
|
||||
- PCI_ERS_RESULT_NEED_RESET
|
||||
Driver returns this if it thinks the device is not
|
||||
recoverable in its current state and it needs a slot
|
||||
reset to proceed.
|
||||
|
||||
- PCI_ERS_RESULT_DISCONNECT
|
||||
Same as above. Total failure: no recovery even after
reset, driver dead. (To be defined more precisely)
|
||||
|
||||
The next step taken depends on the results returned by the drivers.
|
||||
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
|
||||
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
|
||||
|
||||
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
|
||||
proceeds to STEP 4 (Slot Reset).
|
||||
|
||||
STEP 3: Link Reset
|
||||
------------------
|
||||
The platform resets the link, and then calls the link_reset() callback
|
||||
on all affected device drivers. This is a PCI-Express specific state
|
||||
and is done whenever a non-fatal error has been detected that can be
|
||||
"solved" by resetting the link. This call informs the driver of the
|
||||
reset and the driver should check to see if the device appears to be
|
||||
in working condition.
|
||||
|
||||
The driver is not supposed to restart normal driver I/O operations
|
||||
at this point. It should limit itself to "probing" the device to
|
||||
check its recoverability status. If all is right, then the platform
|
||||
will call resume() once all drivers have ack'd link_reset().
|
||||
|
||||
Result codes:
|
||||
(identical to STEP 2 (MMIO Enabled))
|
||||
|
||||
The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
|
||||
(Resume Operations).
|
||||
|
||||
>>> The current powerpc implementation does not implement this callback.
|
||||
|
||||
STEP 4: Slot Reset
|
||||
------------------
|
||||
|
||||
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
platform will perform a slot reset on the requesting PCI device(s).
|
||||
The actual steps taken by a platform to perform a slot reset
|
||||
will be platform-dependent. Upon completion of slot reset, the
|
||||
platform will call the device slot_reset() callback.
|
||||
|
||||
Powerpc platforms implement two levels of slot reset:
|
||||
soft reset (default) and fundamental (optional) reset.
|
||||
|
||||
Powerpc soft reset consists of asserting the adapter #RST line and then
|
||||
restoring the PCI BAR's and PCI configuration header to a state
|
||||
that is equivalent to what it would be after a fresh system
|
||||
power-on followed by power-on BIOS/system firmware initialization.
|
||||
Soft reset is also known as hot-reset.
|
||||
|
||||
Powerpc fundamental reset is supported by PCI Express cards only
|
||||
and causes the device's state machines, hardware logic, port states and
configuration registers to be initialized to their default conditions.
|
||||
|
||||
For most PCI devices, a soft reset will be sufficient for recovery.
|
||||
Optional fundamental reset is provided to support a limited number
|
||||
of PCI Express devices for which a soft reset is not sufficient
|
||||
for recovery.
|
||||
|
||||
If the platform supports PCI hotplug, then the reset might be
|
||||
performed by toggling the slot electrical power off/on.
|
||||
|
||||
It is important for the platform to restore the PCI config space
|
||||
to the "fresh poweron" state, rather than the "last state". After
|
||||
a slot reset, the device driver will almost always use its standard
|
||||
device initialization routines, and an unusual config space setup
|
||||
may result in hung devices, kernel panics, or silent data corruption.
|
||||
|
||||
This call gives drivers the chance to re-initialize the hardware
|
||||
(re-download firmware, etc.). At this point, the driver may assume
|
||||
that the card is in a fresh state and is fully functional. The slot
|
||||
is unfrozen and the driver has full access to PCI config space,
|
||||
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
|
||||
will also be available.
|
||||
|
||||
Drivers should not restart normal I/O processing operations
|
||||
at this point. If all device drivers report success on this
|
||||
callback, the platform will call resume() to complete the sequence,
|
||||
and let the driver restart normal I/O processing.
|
||||
|
||||
A driver can still return a critical failure for this function if
|
||||
it can't get the device operational after reset. If the platform
|
||||
previously tried a soft reset, it might now try a hard reset (power
|
||||
cycle) and then call slot_reset() again. If the device still can't
|
||||
be recovered, there is nothing more that can be done; the platform
|
||||
will typically report a "permanent failure" in such a case. The
|
||||
device will be considered "dead" in this case.
|
||||
|
||||
Drivers for multi-function cards will need to coordinate among
|
||||
themselves as to which driver instance will perform any "one-shot"
|
||||
or global device initialization. For example, the Symbios sym53cxx2
|
||||
driver performs device init only from PCI function 0:
|
||||
|
||||
+ if (PCI_FUNC(pdev->devfn) == 0)
|
||||
+ sym_reset_scsi_bus(np, 0);
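
The function number tested in the snippet above is the low three bits of
the devfn value. The following sketch decodes it in userspace. The DEVFN*
macros are local copies written for this example and are intended to
behave like the kernel's PCI_DEVFN, PCI_SLOT and PCI_FUNC;
should_do_global_init() is a hypothetical helper.

```c
/* Local equivalents of the devfn encoding: 5 bits of slot, 3 bits of function. */
#define DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define DEVFN_SLOT(devfn)	(((devfn) >> 3) & 0x1f)
#define DEVFN_FUNC(devfn)	((devfn) & 0x07)

/* Only the driver instance bound to function 0 does the one-shot init. */
static int should_do_global_init(unsigned int devfn)
{
	return DEVFN_FUNC(devfn) == 0;
}
```

Every instance of the driver evaluates the same test, so exactly one of
them performs the global initialization without any extra locking.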
|
||||
|
||||
Result codes:
|
||||
- PCI_ERS_RESULT_DISCONNECT
|
||||
Same as above.
|
||||
|
||||
Drivers for PCI Express cards that require a fundamental reset must
|
||||
set the needs_freset bit in the pci_dev structure in their probe function.
|
||||
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
|
||||
PCI card types:
|
||||
|
||||
+ /* Set EEH reset type to fundamental if required by hba */
|
||||
+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
|
||||
+ pdev->needs_freset = 1;
|
||||
+
|
||||
|
||||
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
|
||||
Failure).
|
||||
|
||||
>>> The current powerpc implementation does not try a power-cycle
|
||||
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
|
||||
>>> However, it probably should.
|
||||
|
||||
|
||||
STEP 5: Resume Operations
|
||||
-------------------------
|
||||
The platform will call the resume() callback on all affected device
|
||||
drivers if all drivers on the segment have returned
|
||||
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
|
||||
The goal of this callback is to tell the driver to restart activity,
|
||||
that everything is back and running. This callback does not return
|
||||
a result code.
|
||||
|
||||
At this point, if a new error happens, the platform will restart
|
||||
a new error recovery sequence.
|
||||
|
||||
STEP 6: Permanent Failure
|
||||
-------------------------
|
||||
A "permanent failure" has occurred, and the platform cannot recover
|
||||
the device. The platform will call error_detected() with a
|
||||
pci_channel_state value of pci_channel_io_perm_failure.
|
||||
|
||||
The device driver should, at this point, assume the worst. It should
|
||||
cancel all pending I/O, refuse all new I/O, returning -EIO to
|
||||
higher layers. The device driver should then clean up all of its
|
||||
memory and remove itself from kernel operations, much as it would
|
||||
during system shutdown.
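
Seen from the driver side, the recovery sequence amounts to a small state
machine. The model below is a userspace sketch: drv_model, the DRV_*
states and the drv_* functions are invented names, and returning -EAGAIN
while recovery is in progress is an assumption of this example (the text
only requires -EIO after a permanent failure).

```c
#include <errno.h>

/* Names mirror enum pci_channel_state. */
enum channel_state { CH_NORMAL, CH_FROZEN, CH_PERM_FAILURE };

/* Driver-private view of where it is in the recovery sequence. */
enum drv_state { DRV_RUNNING, DRV_QUIESCED, DRV_READY, DRV_DEAD };

struct drv_model {
	enum drv_state state;
};

/* error_detected(): stop issuing I/O; on permanent failure give up. */
static void drv_error_detected(struct drv_model *d, enum channel_state ch)
{
	d->state = (ch == CH_PERM_FAILURE) ? DRV_DEAD : DRV_QUIESCED;
}

/* slot_reset(): hardware is re-initialized, but normal I/O stays off. */
static void drv_slot_reset(struct drv_model *d)
{
	if (d->state == DRV_QUIESCED)
		d->state = DRV_READY;
}

/* resume(): normal I/O may restart, unless the device is dead. */
static void drv_resume(struct drv_model *d)
{
	if (d->state != DRV_DEAD)
		d->state = DRV_RUNNING;
}

/* Request entry point: 0 if accepted, negative errno otherwise. */
static int drv_submit(const struct drv_model *d)
{
	switch (d->state) {
	case DRV_RUNNING:
		return 0;
	case DRV_DEAD:
		return -EIO;	/* refuse all new I/O for good */
	default:
		return -EAGAIN;	/* recovery in progress, try later */
	}
}
```

Note that drv_slot_reset() does not re-enable submissions; only
drv_resume() does, which matches the split between STEP 4 and STEP 5.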
|
||||
|
||||
The platform will typically notify the system operator of the
|
||||
permanent failure in some way. If the device is hotplug-capable,
|
||||
the operator will probably want to remove and replace the device.
|
||||
Note, however, not all failures are truly "permanent". Some are
|
||||
caused by over-heating, some by a poorly seated card. Many
|
||||
PCI error events are caused by software bugs, e.g. DMA's to
|
||||
wild addresses or bogus split transactions due to programming
|
||||
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
|
||||
for additional detail on real-life experience of the causes of
|
||||
software errors.
|
||||
|
||||
|
||||
Conclusion; General Remarks
|
||||
---------------------------
|
||||
The way the callbacks are called is platform policy. A platform with
|
||||
no slot reset capability may want to just "ignore" drivers that can't
|
||||
recover (disconnect them) and try to let other cards on the same segment
|
||||
recover. Keep in mind that in most real life cases, though, there will
|
||||
be only one driver per segment.
|
||||
|
||||
Now, a note about interrupts. If you get an interrupt and your
|
||||
device is dead or has been isolated, there is a problem :)
|
||||
The current policy is to turn this into a platform policy.
|
||||
That is, the recovery API only requires that:
|
||||
|
||||
- There is no guarantee that interrupt delivery can proceed from any
|
||||
device on the segment starting from the error detection and until the
|
||||
slot_reset callback is called, at which point interrupts are expected
|
||||
to be fully operational.
|
||||
|
||||
- There is no guarantee that interrupt delivery is stopped, that is,
|
||||
a driver that gets an interrupt after detecting an error, or that detects
|
||||
an error within the interrupt handler such that it prevents proper
|
||||
ack'ing of the interrupt (and thus removal of the source) should just
|
||||
return IRQ_NONE (interrupt not handled). It's up to the platform to deal with that
|
||||
condition, typically by masking the IRQ source during the duration of
|
||||
the error handling. It is expected that the platform "knows" which
|
||||
interrupts are routed to error-management capable slots and can deal
|
||||
with temporarily disabling that IRQ number during error processing (this
|
||||
isn't terribly complex). That means some IRQ latency for other devices
|
||||
sharing the interrupt, but there is simply no other way. High end
|
||||
platforms aren't supposed to share interrupts between many devices
|
||||
anyway :)
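
The interrupt rule above can be modelled as a handler that refuses to
touch the device while recovery is in progress and reports the interrupt
as not handled. This is a userspace sketch: irq_dev_model, the
IRQ_RET_* values and model_irq_handler() are stand-ins for a real
handler and for the kernel's irqreturn_t values.

```c
/* Stand-ins for the "not handled" and "handled" return values. */
enum irq_ret { IRQ_RET_NONE = 0, IRQ_RET_HANDLED = 1 };

struct irq_dev_model {
	int in_error_recovery;	/* set in error_detected(), cleared in resume() */
	unsigned int pending;	/* models the device's interrupt status */
	int handled_count;
};

static enum irq_ret model_irq_handler(struct irq_dev_model *d)
{
	if (d->in_error_recovery)
		return IRQ_RET_NONE;	/* do not touch the device at all */
	if (!d->pending)
		return IRQ_RET_NONE;	/* shared line, interrupt is not ours */

	d->pending = 0;			/* acknowledge the source */
	d->handled_count++;
	return IRQ_RET_HANDLED;
}
```

While in_error_recovery is set the pending state is left untouched; it is
the platform's job to mask the source until recovery completes.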
|
||||
|
||||
>>> Implementation details for the powerpc platform are discussed in
|
||||
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
|
||||
|
||||
>>> As of this writing, there is a growing list of device drivers with
|
||||
>>> patches implementing error recovery. Not all of these patches are in
|
||||
>>> mainline yet. These may be used as "examples":
|
||||
>>>
|
||||
>>> drivers/scsi/ipr
|
||||
>>> drivers/scsi/sym53c8xx_2
|
||||
>>> drivers/scsi/qla2xxx
|
||||
>>> drivers/scsi/lpfc
|
||||
>>> drivers/net/bnx2.c
|
||||
>>> drivers/net/e100.c
|
||||
>>> drivers/net/e1000
|
||||
>>> drivers/net/e1000e
|
||||
>>> drivers/net/ixgb
|
||||
>>> drivers/net/ixgbe
|
||||
>>> drivers/net/cxgb3
|
||||
>>> drivers/net/s2io.c
|
||||
>>> drivers/net/qlge
|
||||
|
||||
The End
|
||||
-------
|
||||
135
Documentation/PCI/pci-iov-howto.txt
Normal file
|
|
@ -0,0 +1,135 @@
|
|||
PCI Express I/O Virtualization Howto
|
||||
Copyright (C) 2009 Intel Corporation
|
||||
Yu Zhao <yu.zhao@intel.com>
|
||||
|
||||
Update: November 2012
|
||||
-- sysfs-based SRIOV enable-/disable-ment
|
||||
Donald Dutile <ddutile@redhat.com>
|
||||
|
||||
1. Overview
|
||||
|
||||
1.1 What is SR-IOV
|
||||
|
||||
Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
|
||||
capability which makes one physical device appear as multiple virtual
|
||||
devices. The physical device is referred to as Physical Function (PF)
|
||||
while the virtual devices are referred to as Virtual Functions (VF).
|
||||
Allocation of the VF can be dynamically controlled by the PF via
|
||||
registers encapsulated in the capability. By default, this feature is
|
||||
not enabled and the PF behaves as a traditional PCIe device. Once it's
|
||||
turned on, each VF's PCI configuration space can be accessed by its own
|
||||
Bus, Device and Function Number (Routing ID). Each VF also has a PCI
Memory Space, which is used to map its register set. The VF device driver
|
||||
operates on the register set so it can be functional and appear as a
|
||||
real existing PCI device.
|
||||
|
||||
2. User Guide
|
||||
|
||||
2.1 How can I enable SR-IOV capability
|
||||
|
||||
Multiple methods are available for SR-IOV enablement.
|
||||
In the first method, the device driver (PF driver) will control the
|
||||
enabling and disabling of the capability via the API provided by the SR-IOV core.
|
||||
If the hardware has SR-IOV capability, loading its PF driver would
|
||||
enable it and all VFs associated with the PF. Some PF drivers require
|
||||
a module parameter to be set to determine the number of VFs to enable.
|
||||
In the second method, a write to the sysfs file sriov_numvfs will
|
||||
enable and disable the VFs associated with a PCIe PF. This method
|
||||
enables per-PF, VF enable/disable values versus the first method,
|
||||
which applies to all PFs of the same device. Additionally, the
|
||||
PCI SRIOV core support ensures that enable/disable operations are
|
||||
valid to reduce duplication in multiple drivers for the same
|
||||
checks, e.g., check numvfs == 0 if enabling VFs, ensure
|
||||
numvfs <= totalvfs.
|
||||
The second method is the recommended method for new/future VF devices.
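
The checks mentioned above (numvfs == 0 before enabling, numvfs <=
totalvfs) can be sketched as one validation function. This is a userspace
model: check_numvfs_request() is a hypothetical name and the specific
errno values are assumptions of this example.

```c
#include <errno.h>

/*
 * Validate a requested VF count against the current and the maximum count.
 * Returns 0 if the request may proceed, a negative errno otherwise.
 */
static int check_numvfs_request(int requested, int current_numvfs,
				int totalvfs)
{
	if (requested < 0 || requested > totalvfs)
		return -ERANGE;		/* outside 0..totalvfs */
	if (requested == current_numvfs)
		return 0;		/* nothing to change */
	if (requested > 0 && current_numvfs != 0)
		return -EBUSY;		/* VFs must be disabled first */
	return 0;
}
```

Doing these checks once in common code means each PF driver's
sriov_configure callback can assume it is called with a sane value.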
|
||||
|
||||
2.2 How can I use the Virtual Functions
|
||||
|
||||
VFs are treated as hot-plugged PCI devices in the kernel, so they
should be able to work in the same way as real PCI devices. A VF
requires a device driver, just as a normal PCI device does.
|
||||
|
||||
3. Developer Guide
|
||||
|
||||
3.1 SR-IOV API
|
||||
|
||||
To enable SR-IOV capability:
|
||||
(a) For the first method, in the driver:
|
||||
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
|
||||
'nr_virtfn' is the number of VFs to be enabled.
|
||||
(b) For the second method, from sysfs:
|
||||
echo 'nr_virtfn' > \
|
||||
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
|
||||
|
||||
To disable SR-IOV capability:
|
||||
(a) For the first method, in the driver:
|
||||
void pci_disable_sriov(struct pci_dev *dev);
|
||||
(b) For the second method, from sysfs:
|
||||
echo 0 > \
|
||||
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
|
||||
|
||||
3.2 Usage example
|
||||
|
||||
The following piece of code illustrates the usage of the SR-IOV API.
|
||||
|
||||
static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
|
||||
{
|
||||
pci_enable_sriov(dev, NR_VIRTFN);
|
||||
|
||||
...
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void dev_remove(struct pci_dev *dev)
|
||||
{
|
||||
pci_disable_sriov(dev);
|
||||
|
||||
...
|
||||
}
|
||||
|
||||
static int dev_suspend(struct pci_dev *dev, pm_message_t state)
|
||||
{
|
||||
...
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int dev_resume(struct pci_dev *dev)
|
||||
{
|
||||
...
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
static void dev_shutdown(struct pci_dev *dev)
|
||||
{
|
||||
...
|
||||
}
|
||||
|
||||
static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
{
	int err;

	if (numvfs > 0) {
		...
		/* pci_enable_sriov() can fail; propagate the error */
		err = pci_enable_sriov(dev, numvfs);
		if (err)
			return err;
		...
		return numvfs;
	}
	if (numvfs == 0) {
		...
		pci_disable_sriov(dev);
		...
		return 0;
	}
	/* defensive: make sure every path returns a value */
	return -EINVAL;
}
|
||||
|
||||
static struct pci_driver dev_driver = {
|
||||
.name = "SR-IOV Physical Function driver",
|
||||
.id_table = dev_id_table,
|
||||
.probe = dev_probe,
|
||||
.remove = dev_remove,
|
||||
.suspend = dev_suspend,
|
||||
.resume = dev_resume,
|
||||
.shutdown = dev_shutdown,
|
||||
.sriov_configure = dev_sriov_configure,
|
||||
};
|
||||
635
Documentation/PCI/pci.txt
Normal file
|
|
@ -0,0 +1,635 @@
|
|||
|
||||
How To Write Linux PCI Drivers
|
||||
|
||||
by Martin Mares <mj@ucw.cz> on 07-Feb-2000
|
||||
updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006
|
||||
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
The world of PCI is vast and full of (mostly unpleasant) surprises.
|
||||
Since each CPU architecture implements different chip-sets and PCI devices
|
||||
have different requirements (erm, "features"), the result is the PCI support
|
||||
in the Linux kernel is not as trivial as one would wish. This short paper
|
||||
tries to introduce all potential driver authors to Linux APIs for
|
||||
PCI device drivers.
|
||||
|
||||
A more complete resource is the third edition of "Linux Device Drivers"
|
||||
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
|
||||
LDD3 is available for free (under Creative Commons License) from:
|
||||
|
||||
http://lwn.net/Kernel/LDD3/
|
||||
|
||||
However, keep in mind that all documents are subject to "bit rot".
|
||||
Refer to the source code if things are not working as described here.
|
||||
|
||||
Please send questions/comments/patches about Linux PCI API to the
|
||||
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
|
||||
|
||||
|
||||
|
||||
0. Structure of PCI drivers
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
PCI drivers "discover" PCI devices in a system via pci_register_driver().
|
||||
Actually, it's the other way around. When the PCI generic code discovers
|
||||
a new device, the driver with a matching "description" will be notified.
|
||||
Details on this below.
|
||||
|
||||
pci_register_driver() leaves most of the probing for devices to
|
||||
the PCI layer and supports online insertion/removal of devices [thus
|
||||
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
|
||||
The pci_register_driver() call requires passing in a table of function
|
||||
pointers and thus dictates the high level structure of a driver.
|
||||
|
||||
Once the driver knows about a PCI device and takes ownership, the
|
||||
driver generally needs to perform the following initialization:
|
||||
|
||||
Enable the device
|
||||
Request MMIO/IOP resources
|
||||
Set the DMA mask size (for both coherent and streaming DMA)
|
||||
Allocate and initialize shared control data (pci_alloc_consistent())
|
||||
Access device configuration space (if needed)
|
||||
Register IRQ handler (request_irq())
|
||||
Initialize non-PCI parts of the chip (i.e. LAN/SCSI/etc.)
|
||||
Enable DMA/processing engines
|
||||
|
||||
When done using the device, and perhaps the module needs to be unloaded,
|
||||
the driver needs to take the following steps:
|
||||
Disable the device from generating IRQs
|
||||
Release the IRQ (free_irq())
|
||||
Stop all DMA activity
|
||||
Release DMA buffers (both streaming and coherent)
|
||||
Unregister from other subsystems (e.g. scsi or netdev)
|
||||
Release MMIO/IOP resources
|
||||
Disable the device
|
||||
|
||||
Most of these topics are covered in the following sections.
|
||||
For the rest look at LDD3 or <linux/pci.h> .
|
||||
|
||||
If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
|
||||
the PCI functions described below are defined as inline functions either
|
||||
completely empty or just returning an appropriate error code to avoid
|
||||
lots of ifdefs in the drivers.
|
||||
|
||||
|
||||
|
||||
1. pci_register_driver() call
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
PCI device drivers call pci_register_driver() during their
|
||||
initialization with a pointer to a structure describing the driver
|
||||
(struct pci_driver):
|
||||
|
||||
field name Description
|
||||
---------- ------------------------------------------------------
|
||||
id_table Pointer to table of device ID's the driver is
|
||||
interested in. Most drivers should export this
|
||||
table using MODULE_DEVICE_TABLE(pci,...).
|
||||
|
||||
probe This probing function gets called (during execution
|
||||
of pci_register_driver() for already existing
|
||||
devices or later if a new device gets inserted) for
|
||||
all PCI devices which match the ID table and are not
|
||||
"owned" by the other drivers yet. This function gets
|
||||
passed a "struct pci_dev *" for each device whose
|
||||
entry in the ID table matches the device. The probe
|
||||
function returns zero when the driver chooses to
|
||||
take "ownership" of the device or an error code
|
||||
(negative number) otherwise.
|
||||
The probe function always gets called from process
|
||||
context, so it can sleep.
|
||||
|
||||
remove The remove() function gets called whenever a device
|
||||
being handled by this driver is removed (either during
|
||||
deregistration of the driver or when it's manually
|
||||
pulled out of a hot-pluggable slot).
|
||||
The remove function always gets called from process
|
||||
context, so it can sleep.
|
||||
|
||||
suspend Put device into low power state.
|
||||
suspend_late Put device into low power state.
|
||||
|
||||
resume_early Wake device from low power state.
|
||||
resume Wake device from low power state.
|
||||
|
||||
(Please see Documentation/power/pci.txt for descriptions
|
||||
of PCI Power Management and the related functions.)
|
||||
|
||||
shutdown Hook into reboot_notifier_list (kernel/sys.c).
|
||||
Intended to stop any idling DMA operations.
|
||||
Useful for enabling wake-on-lan (NIC) or changing
|
||||
the power state of a device before reboot.
|
||||
e.g. drivers/net/e100.c.
|
||||
|
||||
err_handler See Documentation/PCI/pci-error-recovery.txt
|
||||
|
||||
|
||||
The ID table is an array of struct pci_device_id entries ending with an
|
||||
all-zero entry. Definitions with static const are generally preferred.
|
||||
Use of the deprecated macro DEFINE_PCI_DEVICE_TABLE should be avoided.
|
||||
|
||||
Each entry consists of:
|
||||
|
||||
vendor,device Vendor and device ID to match (or PCI_ANY_ID)
|
||||
|
||||
subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID)
|
||||
subdevice,
|
||||
|
||||
class Device class, subclass, and "interface" to match.
|
||||
See Appendix D of the PCI Local Bus Spec or
|
||||
include/linux/pci_ids.h for a full list of classes.
|
||||
Most drivers do not need to specify class/class_mask
|
||||
as vendor/device is normally sufficient.
|
||||
|
||||
class_mask Limits which sub-fields of the class field are compared.
|
||||
See drivers/scsi/sym53c8xx_2/ for example of usage.
|
||||
|
||||
driver_data Data private to the driver.
|
||||
Most drivers don't need to use driver_data field.
|
||||
Best practice is to use driver_data as an index
|
||||
into a static list of equivalent device types,
|
||||
instead of using it as a pointer.
|
||||
|
||||
|
||||
Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up
|
||||
a pci_device_id table.
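
The way the ID table fields combine can be modeled in user-space C. The
sketch below is an illustration only: struct model_id, struct model_dev
and the function names are hypothetical stand-ins, not the kernel's
types. It assumes each ID field matches when it equals the device's
value or is the wildcard, and that class is compared only in the bits
selected by class_mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PCI_ANY_ID 0xffffffffu	/* wildcard, like PCI_ANY_ID */

/* Hypothetical, simplified stand-in for struct pci_device_id. */
struct model_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class, class_mask;
};

/* Hypothetical stand-in for the identity fields of a device. */
struct model_dev {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t class;	/* (class << 16) | (subclass << 8) | interface */
};

static bool field_matches(uint32_t want, uint32_t have)
{
	return want == MODEL_PCI_ANY_ID || want == have;
}

/* True if one table entry matches the device. */
bool model_id_matches(const struct model_id *id, const struct model_dev *dev)
{
	return field_matches(id->vendor, dev->vendor) &&
	       field_matches(id->device, dev->device) &&
	       field_matches(id->subvendor, dev->subvendor) &&
	       field_matches(id->subdevice, dev->subdevice) &&
	       ((id->class ^ dev->class) & id->class_mask) == 0;
}

/* Walk a table ending with an all-zero entry; NULL if nothing matches. */
const struct model_id *model_match_table(const struct model_id *table,
					 const struct model_dev *dev)
{
	assert(table != NULL && dev != NULL);

	for (; table->vendor || table->subvendor || table->class_mask; table++)
		if (model_id_matches(table, dev))
			return table;
	return NULL;
}
```

With class_mask set to 0 the class field is ignored entirely, which is
why most tables only need to fill in vendor and device.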
|
||||
|
||||
New PCI IDs may be added to a device driver pci_ids table at runtime
|
||||
as shown below:
|
||||
|
||||
echo "vendor device subvendor subdevice class class_mask driver_data" > \
|
||||
/sys/bus/pci/drivers/{driver}/new_id
|
||||
|
||||
All fields are passed in as hexadecimal values (no leading 0x).
|
||||
The vendor and device fields are mandatory; the others are optional. Users
|
||||
need pass only as many optional fields as necessary:
|
||||
o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
|
||||
o class and class_mask fields default to 0
|
||||
o driver_data defaults to 0UL.
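
The defaults listed above can be illustrated with a small user-space
parser for a new_id line. This is a sketch under assumptions:
struct model_new_id and model_parse_new_id() are hypothetical names, and
the parsing shown here is one straightforward way to get the described
behavior, not necessarily how the PCI core does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MODEL_PCI_ANY_ID 0xffffffffu	/* wildcard, like PCI_ANY_ID */

/* Hypothetical container for the seven new_id fields. */
struct model_new_id {
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned int class, class_mask;
	unsigned long driver_data;
};

/*
 * Parse a new_id line of hexadecimal fields. Returns 0 on success,
 * -1 if the mandatory vendor and device fields are missing.
 */
int model_parse_new_id(const char *line, struct model_new_id *id)
{
	int fields;

	assert(line != NULL && id != NULL);

	/* Defaults for the optional fields. */
	id->vendor = 0;
	id->device = 0;
	id->subvendor = MODEL_PCI_ANY_ID;
	id->subdevice = MODEL_PCI_ANY_ID;
	id->class = 0;
	id->class_mask = 0;
	id->driver_data = 0UL;

	/* Fields that are absent keep the defaults set above. */
	fields = sscanf(line, "%x %x %x %x %x %x %lx",
			&id->vendor, &id->device,
			&id->subvendor, &id->subdevice,
			&id->class, &id->class_mask,
			&id->driver_data);

	return fields < 2 ? -1 : 0;
}
```

Passing only "vendor device" therefore yields an entry that matches any
subsystem IDs and ignores the class.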
|
||||
|
||||
Note that driver_data must match the value used by any of the pci_device_id
|
||||
entries defined in the driver. This makes the driver_data field mandatory
|
||||
if all the pci_device_id entries have a non-zero driver_data value.
|
||||
|
||||
Once added, the driver probe routine will be invoked for any unclaimed
|
||||
PCI devices listed in its (newly updated) pci_ids list.
|
||||
|
||||
When the driver exits, it just calls pci_unregister_driver() and the PCI layer
|
||||
automatically calls the remove hook for all devices handled by the driver.
|
||||
|
||||
|
||||
1.1 "Attributes" for driver functions/data
|
||||
|
||||
Please mark the initialization and cleanup functions where appropriate
|
||||
(the corresponding macros are defined in <linux/init.h>):
|
||||
|
||||
__init Initialization code. Thrown away after the driver
|
||||
initializes.
|
||||
__exit Exit code. Ignored for non-modular drivers.
|
||||
|
||||
Tips on when/where to use the above attributes:
|
||||
o The module_init()/module_exit() functions (and all
|
||||
initialization functions called _only_ from these)
|
||||
should be marked __init/__exit.
|
||||
|
||||
o Do not mark the struct pci_driver.
|
||||
|
||||
o Do NOT mark a function if you are not sure which mark to use.
|
||||
Better to not mark the function than mark the function wrong.
|
||||
|
||||
|
||||
|
||||
2. How to find PCI devices manually
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
PCI drivers should have a really good reason for not using the
|
||||
pci_register_driver() interface to search for PCI devices.
|
||||
The main reason PCI devices are controlled by multiple drivers
|
||||
is because one PCI device implements several different HW services.
|
||||
E.g. combined serial/parallel port/floppy controller.
|
||||
|
||||
A manual search may be performed using the following constructs:
|
||||
|
||||
Searching by vendor and device ID:
|
||||
|
||||
	struct pci_dev *dev = NULL;

	/* extra parentheses: the assignment is intentional */
	while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)) != NULL)
		configure_device(dev);
|
||||
|
||||
Searching by class ID (iterate in a similar way):
|
||||
|
||||
pci_get_class(CLASS_ID, dev)
|
||||
|
||||
Searching by both vendor/device and subsystem vendor/device ID:
|
||||
|
||||
pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev)
|
||||
|
||||
You can use the constant PCI_ANY_ID as a wildcard replacement for
|
||||
VENDOR_ID or DEVICE_ID. This allows searching for any device from a
|
||||
specific vendor, for example.
|
||||
|
||||
These functions are hotplug-safe. They increment the reference count on
|
||||
the pci_dev that they return. You must eventually (possibly at module unload)
|
||||
decrement the reference count on these devices by calling pci_dev_put().
|
||||
|
||||
|
||||
|
||||
3. Device Initialization Steps
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
As noted in the introduction, most PCI drivers need the following steps
|
||||
for device initialization:
|
||||
|
||||
Enable the device
|
||||
Request MMIO/IOP resources
|
||||
Set the DMA mask size (for both coherent and streaming DMA)
|
||||
Allocate and initialize shared control data (pci_alloc_consistent())
|
||||
Access device configuration space (if needed)
|
||||
Register IRQ handler (request_irq())
|
||||
Initialize non-PCI parts of the chip (i.e. LAN/SCSI/etc.)
|
||||
Enable DMA/processing engines.
|
||||
|
||||
The driver can access PCI config space registers at any time.
|
||||
(Well, almost. When running BIST, config space can go away...but
|
||||
that will just result in a PCI Bus Master Abort and config reads
|
||||
will return garbage).
|
||||
|
||||
|
||||
3.1 Enable the PCI device
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Before touching any device registers, the driver needs to enable
|
||||
the PCI device by calling pci_enable_device(). This will:
|
||||
o wake up the device if it was in suspended state,
|
||||
o allocate I/O and memory regions of the device (if BIOS did not),
|
||||
o allocate an IRQ (if BIOS did not).
|
||||
|
||||
NOTE: pci_enable_device() can fail! Check the return value.
|
||||
|
||||
[ OS BUG: we don't check resource allocations before enabling those
|
||||
resources. The sequence would make more sense if we called
|
||||
pci_request_regions() before calling pci_enable_device().
|
||||
Currently, the device drivers can't detect the bug when two
|
||||
devices have been allocated the same range. This is not a common
|
||||
problem and unlikely to get fixed soon.
|
||||
|
||||
This has been discussed before but not changed as of 2.6.19:
|
||||
http://lkml.org/lkml/2006/3/2/194
|
||||
]
|
||||
|
||||
pci_set_master() will enable DMA by setting the bus master bit
|
||||
in the PCI_COMMAND register. It also fixes the latency timer value if
|
||||
it's set to something bogus by the BIOS. pci_clear_master() will
|
||||
disable DMA by clearing the bus master bit.
|
||||
|
||||
If the PCI device can use the PCI Memory-Write-Invalidate transaction,
|
||||
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
|
||||
and also ensures that the cache line size register is set correctly.
|
||||
Check the return value of pci_set_mwi() as not all architectures
|
||||
or chip-sets may support Memory-Write-Invalidate. Alternatively,
|
||||
if Mem-Wr-Inval would be nice to have but is not required, call
|
||||
pci_try_set_mwi() to have the system do its best effort at enabling
|
||||
Mem-Wr-Inval.
|
||||
|
||||
|
||||
3.2 Request MMIO/IOP resources
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Memory (MMIO) and I/O port addresses should NOT be read directly
|
||||
from the PCI device config space. Use the values in the pci_dev structure
|
||||
as the PCI "bus address" might have been remapped to a "host physical"
|
||||
address by the arch/chip-set specific kernel support.
|
||||
|
||||
See Documentation/io-mapping.txt for how to access device registers
|
||||
or device memory.
|
||||
|
||||
The device driver needs to call pci_request_region() to verify
|
||||
no other device is already using the same address resource.
|
||||
Conversely, drivers should call pci_release_region() AFTER
|
||||
calling pci_disable_device().
|
||||
The idea is to prevent two devices colliding on the same address range.
|
||||
|
||||
[ See OS BUG comment above. Currently (2.6.19), the driver can only
|
||||
determine MMIO and IO Port resource availability _after_ calling
|
||||
pci_enable_device(). ]
|
||||
|
||||
Generic flavors of pci_request_region() are request_mem_region()
|
||||
(for MMIO ranges) and request_region() (for IO Port ranges).
|
||||
Use these for address resources that are not described by "normal" PCI
|
||||
BARs.
|
||||
|
||||
Also see pci_request_selected_regions() below.
|
||||
|
||||
|
||||
3.3 Set the DMA mask size
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
[ If anything below doesn't make sense, please refer to
|
||||
Documentation/DMA-API.txt. This section is just a reminder that
|
||||
drivers need to indicate DMA capabilities of the device and is not
|
||||
an authoritative source for DMA interfaces. ]
|
||||
|
||||
While all drivers should explicitly indicate the DMA capability
|
||||
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
|
||||
32-bit bus master capability for streaming data need the driver
|
||||
to "register" this capability by calling pci_set_dma_mask() with
|
||||
appropriate parameters. In general this allows more efficient DMA
|
||||
on systems where System RAM exists above 4G _physical_ address.
|
||||
|
||||
Drivers for all PCI-X and PCIe compliant devices must call
|
||||
pci_set_dma_mask() as they are 64-bit DMA devices.
|
||||
|
||||
Similarly, drivers must also "register" this capability if the device
|
||||
can directly address "consistent memory" in System RAM above 4G physical
|
||||
address by calling pci_set_consistent_dma_mask().
|
||||
Again, this includes drivers for all PCI-X and PCIe compliant devices.
|
||||
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
|
||||
64-bit DMA capable for payload ("streaming") data but not control
|
||||
("consistent") data.
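
What a DMA mask means for an individual buffer can be shown with a
little arithmetic in user-space C. This is a sketch:
MODEL_DMA_BIT_MASK() is modeled on the kernel's DMA_BIT_MASK() helper,
and model_dma_addr_ok() is a hypothetical function that only
illustrates the idea that a buffer is reachable when its last byte lies
at or below the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low n bits set; valid for 1 <= n <= 64. */
#define MODEL_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * True if a buffer of len bytes starting at bus address addr lies
 * entirely at or below mask. Written to avoid overflow in addr + len.
 */
bool model_dma_addr_ok(uint64_t addr, uint64_t len, uint64_t mask)
{
	assert(len > 0);

	return addr <= mask && len - 1 <= mask - addr;
}
```

A device limited to a 32-bit mask cannot reach a buffer that crosses
the 4G boundary, while a device with a 64-bit mask can; this is why
advertising the wider mask can save bounce buffering or IOMMU mappings
on systems with RAM above 4G.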
|
||||
|
||||
|
||||
3.4 Setup shared control data
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
|
||||
memory. See Documentation/DMA-API.txt for a full description of
|
||||
the DMA APIs. This section is just a reminder that it needs to be done
|
||||
before enabling DMA on the device.
|
||||
|
||||
|
||||
3.5 Initialize device registers
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Some drivers will need specific "capability" fields programmed
or other "vendor specific" registers initialized or reset.
|
||||
E.g. clearing pending interrupts.
|
||||
|
||||
|
||||
3.6 Register IRQ handler
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
While calling request_irq() is the last step described here,
|
||||
this is often just another intermediate step to initialize a device.
|
||||
This step can often be deferred until the device is opened for use.
|
||||
|
||||
All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
|
||||
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
|
||||
can be shared).
|
||||
|
||||
request_irq() will associate an interrupt handler and device handle
|
||||
with an interrupt number. Historically interrupt numbers represent
|
||||
IRQ lines which run from the PCI device to the Interrupt controller.
|
||||
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".
|
||||
|
||||
request_irq() also enables the interrupt. Make sure the device is
|
||||
quiesced and does not have any interrupts pending before registering
|
||||
the interrupt handler.
|
||||
|
||||
MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
|
||||
which deliver interrupts to the CPU via a DMA write to a Local APIC.
|
||||
The fundamental difference between MSI and MSI-X is how multiple
|
||||
"vectors" get allocated. MSI requires contiguous blocks of vectors
|
||||
while MSI-X can allocate several individual ones.
|
||||
|
||||
MSI capability can be enabled by calling pci_enable_msi() or
|
||||
pci_enable_msix() before calling request_irq(). This causes
|
||||
the PCI support to program CPU vector data into the PCI device
|
||||
capability registers.
|
||||
|
||||
If your PCI device supports both, try to enable MSI-X first.
|
||||
Only one can be enabled at a time. Many architectures, chip-sets,
|
||||
or BIOSes do NOT support MSI or MSI-X and the call to pci_enable_msi/msix
|
||||
will fail. This is important to note since many drivers have
|
||||
two (or more) interrupt handlers: one for MSI/MSI-X and another for IRQs.
|
||||
They choose which handler to register with request_irq() based on the
|
||||
return value from pci_enable_msi/msix().
|
||||
|
||||
There are (at least) two really good reasons for using MSI:
|
||||
1) MSI is an exclusive interrupt vector by definition.
|
||||
This means the interrupt handler doesn't have to verify
|
||||
its device caused the interrupt.
|
||||
|
||||
2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
|
||||
to be visible to the host CPU(s) when the MSI is delivered. This
|
||||
is important for both data coherency and avoiding stale control data.
|
||||
This guarantee allows the driver to omit MMIO reads to flush
|
||||
the DMA stream.
|
||||
|
||||
See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
|
||||
of MSI/MSI-X usage.
|
||||
|
||||
|
||||
|
||||
4. PCI device shutdown
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When a PCI device driver is being unloaded, most of the following
|
||||
steps need to be performed:
|
||||
|
||||
Disable the device from generating IRQs
|
||||
Release the IRQ (free_irq())
|
||||
Stop all DMA activity
|
||||
Release DMA buffers (both streaming and consistent)
|
||||
Unregister from other subsystems (e.g. scsi or netdev)
|
||||
Disable device from responding to MMIO/IO Port addresses
|
||||
Release MMIO/IO Port resource(s)
|
||||
|
||||
|
||||
4.1 Stop IRQs on the device
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
How to do this is chip/device specific. If it's not done, it opens
|
||||
the possibility of a "screaming interrupt" if (and only if)
|
||||
the IRQ is shared with another device.
|
||||
|
||||
When the shared IRQ handler is "unhooked", the remaining devices
|
||||
using the same IRQ line will still need the IRQ enabled. Thus if the
"unhooked" device asserts the IRQ line, the system will respond assuming
it was one of the remaining devices that asserted the IRQ line. Since none
|
||||
of the other devices will handle the IRQ, the system will "hang" until
|
||||
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
|
||||
iterations later). Once the shared IRQ is masked, the remaining devices
|
||||
will stop functioning properly. Not a nice situation.
|
||||
|
||||
This is another reason to use MSI or MSI-X if it's available.
|
||||
MSI and MSI-X are defined to be exclusive interrupts and thus
|
||||
are not susceptible to the "screaming interrupt" problem.
|
||||
|
||||
|
||||
4.2 Release the IRQ
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
Once the device is quiesced (no more IRQs), one can call free_irq().
|
||||
This function will return control once any pending IRQs are handled,
"unhook" the driver's IRQ handler from that IRQ, and finally release
|
||||
the IRQ if no one else is using it.
|
||||
|
||||
|
||||
4.3 Stop all DMA activity
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
It's extremely important to stop all DMA operations BEFORE attempting
|
||||
to deallocate DMA control data. Failure to do so can result in memory
|
||||
corruption, hangs, and on some chip-sets a hard crash.
|
||||
|
||||
Stopping DMA after stopping the IRQs can avoid races where the
|
||||
IRQ handler might restart DMA engines.
|
||||
|
||||
While this step sounds obvious and trivial, several "mature" drivers
|
||||
didn't get this step right in the past.
|
||||
|
||||
|
||||
4.4 Release DMA buffers
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Once DMA is stopped, clean up streaming DMA first.
|
||||
I.e. unmap data buffers and return buffers to "upstream"
|
||||
owners if there is one.
|
||||
|
||||
Then clean up "consistent" buffers which contain the control data.
|
||||
|
||||
See Documentation/DMA-API.txt for details on unmapping interfaces.
|
||||
|
||||
|
||||
4.5 Unregister from other subsystems
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Most low level PCI device drivers support some other subsystem
|
||||
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
|
||||
driver isn't losing resources from that other subsystem.
|
||||
If this happens, typically the symptom is an Oops (panic) when
|
||||
the subsystem attempts to call into a driver that has been unloaded.
|
||||
|
||||
|
||||
4.6 Disable Device from responding to MMIO/IO Port addresses
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Unmap MMIO or IO Port resources (e.g. with iounmap() or pci_iounmap())
and then call pci_disable_device().
|
||||
This is the symmetric opposite of pci_enable_device().
|
||||
Do not access device registers after calling pci_disable_device().
|
||||
|
||||
|
||||
4.7 Release MMIO/IO Port Resource(s)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Call pci_release_region() to mark the MMIO or IO Port range as available.
|
||||
Failure to do so usually results in the inability to reload the driver.
|
||||
|
||||
|
||||
|
||||
5. How to access PCI config space
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You can use pci_(read|write)_config_(byte|word|dword) to access the config
|
||||
space of a device represented by struct pci_dev *. All these functions return 0
|
||||
when successful or an error code (PCIBIOS_...) which can be translated to a text
|
||||
string by pcibios_strerror. Most drivers expect that accesses to valid PCI
|
||||
devices don't fail.
|
||||
|
||||
If you don't have a struct pci_dev available, you can call
|
||||
pci_bus_(read|write)_config_(byte|word|dword) to access a given device
|
||||
and function on that bus.
|
||||
|
||||
If you access fields in the standard portion of the config header, please
|
||||
use symbolic names of locations and bits declared in <linux/pci.h>.
|
||||
|
||||
If you need to access Extended PCI Capability registers, just call
|
||||
pci_find_capability() for the particular capability and it will find the
|
||||
corresponding register block for you.
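
To give a feel for what such a search involves, here is a user-space
model of walking the standard capability list in a 256-byte config
space image. It is a sketch, not the kernel's code:
model_find_capability() and the CFG_*/CAP_* names are made up for this
example, while the register offsets follow the standard PCI
configuration header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_STATUS		0x06	/* Status register, low byte */
#define CFG_STATUS_CAP_LIST	0x10	/* capability list is present */
#define CFG_CAP_PTR		0x34	/* offset of the first capability */
#define CAP_WALK_LIMIT		48	/* guard against looped lists */

/*
 * Return the offset of capability cap_id in a 256-byte config space
 * image, or 0 if the device does not advertise it.
 */
uint8_t model_find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
	uint8_t pos;
	int ttl = CAP_WALK_LIMIT;

	assert(cfg != NULL);

	if (!(cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST))
		return 0;

	pos = cfg[CFG_CAP_PTR] & 0xfc;	/* low two bits are reserved */
	while (pos >= 0x40 && ttl-- > 0) {
		if (cfg[pos] == 0xff)	/* unreadable entry, give up */
			break;
		if (cfg[pos] == cap_id)	/* byte 0 of an entry: its ID */
			return pos;
		pos = cfg[pos + 1] & 0xfc;	/* byte 1: next pointer */
	}
	return 0;
}
```

In a real driver the config space is read through the
pci_read_config_*() accessors rather than from an array, and the
returned offset is the base for accessing that capability's registers.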
|
||||
|
||||
|
||||
|
||||
6. Other interesting functions
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
bus, and slot/function number (devfn). If the device is
|
||||
found, its reference count is increased.
|
||||
pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
|
||||
pci_find_capability() Find specified capability in device's capability
|
||||
list.
|
||||
pci_resource_start() Returns bus start address for a given PCI region
|
||||
pci_resource_end() Returns bus end address for a given PCI region
|
||||
pci_resource_len() Returns the byte length of a PCI region
|
||||
pci_set_drvdata() Set private driver data pointer for a pci_dev
|
||||
pci_get_drvdata() Return private driver data pointer for a pci_dev
|
||||
pci_set_mwi() Enable Memory-Write-Invalidate transactions.
|
||||
pci_clear_mwi() Disable Memory-Write-Invalidate transactions.
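
The relationship between a region's start, end and length can be
sketched in user-space C. The names struct model_region and
model_region_len() are hypothetical; the sketch assumes the end address
is inclusive and that a region with start and end both zero is unset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one PCI region (BAR). */
struct model_region {
	uint64_t start;	/* first address of the region */
	uint64_t end;	/* last address of the region (inclusive) */
};

/* Byte length of a region, or 0 if the region is unset. */
uint64_t model_region_len(const struct model_region *r)
{
	assert(r != NULL);

	if (r->start == 0 && r->end == 0)
		return 0;

	return r->end - r->start + 1;
}
```

Because the end address is inclusive, a region spanning 0xfebf0000 to
0xfebfffff is 0x10000 (64K) bytes long, not 0xffff.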
|
||||
|
||||
|
||||
|
||||
7. Miscellaneous hints
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When displaying PCI device names to the user (for example when a driver wants
|
||||
to tell the user what card it has found), please use pci_name(pci_dev).
|
||||
|
||||
Always refer to the PCI devices by a pointer to the pci_dev structure.
|
||||
All PCI layer functions use this identification and it's the only
|
||||
reasonable one. Don't use bus/slot/function numbers except for very
|
||||
special purposes -- on systems with multiple primary buses their semantics
|
||||
can be pretty complex.
|
||||
|
||||
Don't try to turn on Fast Back to Back writes in your driver. All devices
|
||||
on the bus need to be capable of doing it, so this is something which needs
|
||||
to be handled by platform and generic code, not individual drivers.
|
||||
|
||||
|
||||
|
||||
8. Vendor and device identifications
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
One is not required to add new device ids to include/linux/pci_ids.h.
|
||||
Please add PCI_VENDOR_ID_xxx for vendors and a hex constant for device ids.
|
||||
|
||||
PCI_VENDOR_ID_xxx constants are re-used. The device ids are arbitrary
|
||||
hex numbers (vendor controlled) and normally used only in a single
|
||||
location, the pci_device_id table.
|
||||
|
||||
Please DO submit new vendor/device ids to the pciids.sourceforge.net project.
|
||||
|
||||
|
||||
|
||||
9. Obsolete functions
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
There are several functions which you might come across when trying to
|
||||
port an old driver to the new PCI interface. They are no longer present
|
||||
in the kernel as they aren't compatible with hotplug, PCI domains, or
sane locking.
|
||||
|
||||
pci_find_device() Superseded by pci_get_device()
|
||||
pci_find_subsys() Superseded by pci_get_subsys()
|
||||
pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
|
||||
pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
|
||||
|
||||
|
||||
The alternative is the traditional PCI device driver that walks PCI
|
||||
device lists. This is still possible but discouraged.
|
||||
|
||||
|
||||
|
||||
10. MMIO Space and "Write Posting"
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Converting a driver from using I/O Port space to using MMIO space
|
||||
often requires some additional changes. Specifically, "write posting"
|
||||
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
|
||||
already do this. I/O Port space guarantees write transactions reach the PCI
|
||||
device before the CPU can continue. Writes to MMIO space allow the CPU
|
||||
to continue before the transaction reaches the PCI device. HW weenies
|
||||
call this "Write Posting" because the write completion is "posted" to
|
||||
the CPU before the transaction has reached its destination.
|
||||
|
||||
Thus, timing sensitive code should add readl() where the CPU is
|
||||
expected to wait before doing other work. The classic "bit banging"
|
||||
sequence works fine for I/O Port space:
|
||||
|
||||
	for (i = 0; i < 8; i++, val >>= 1) {	/* one pass per bit */
		outb(val & 1, ioport_reg);	/* write bit */
		udelay(10);
	}

The same sequence for MMIO space should be:

	for (i = 0; i < 8; i++, val >>= 1) {	/* one pass per bit */
		writeb(val & 1, mmio_reg);	/* write bit */
		readb(safe_mmio_reg);		/* flush posted write */
		udelay(10);
	}
|
||||
|
||||
It is important that "safe_mmio_reg" not have any side effects that
interfere with the correct operation of the device.
|
||||
|
||||
Another case to watch out for is when resetting a PCI device. Use PCI
|
||||
Configuration space reads to flush the writel(). This will gracefully
|
||||
handle the PCI master abort on all platforms if the PCI device is
|
||||
expected to not respond to a readl(). Most x86 platforms will allow
|
||||
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
|
||||
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
|
||||
|
||||
270
Documentation/PCI/pcieaer-howto.txt
Normal file
|
|
@ -0,0 +1,270 @@
|
|||
The PCI Express Advanced Error Reporting Driver Guide HOWTO
|
||||
T. Long Nguyen <tom.l.nguyen@intel.com>
|
||||
Yanmin Zhang <yanmin.zhang@intel.com>
|
||||
07/29/2006
|
||||
|
||||
|
||||
1. Overview
|
||||
|
||||
1.1 About this guide
|
||||
|
||||
This guide describes the basics of the PCI Express Advanced Error
|
||||
Reporting (AER) driver and provides information on how to use it, as
|
||||
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.
|
||||
|
||||
1.2 Copyright (C) Intel Corporation 2006.
|
||||
|
||||
1.3 What is the PCI Express AER Driver?
|
||||
|
||||
PCI Express error signaling can occur on the PCI Express link itself
|
||||
or on behalf of transactions initiated on the link. PCI Express
|
||||
defines two error reporting paradigms: the baseline capability and
|
||||
the Advanced Error Reporting capability. The baseline capability is
|
||||
required of all PCI Express components providing a minimum defined
|
||||
set of error reporting requirements. Advanced Error Reporting
|
||||
capability is implemented with a PCI Express advanced error reporting
|
||||
extended capability structure providing more robust error reporting.
|
||||
|
||||
The PCI Express AER driver provides the infrastructure to support PCI
|
||||
Express Advanced Error Reporting capability. The PCI Express AER
|
||||
driver provides three basic functions:
|
||||
|
||||
- Gathers comprehensive error information when errors occur.
|
||||
- Reports errors to users.
|
||||
- Performs error recovery actions.
|
||||
|
||||
The AER driver attaches only to root ports that support the PCI Express AER
|
||||
capability.
|
||||
|
||||
|
||||
2. User Guide
|
||||
|
||||
2.1 Include the PCI Express AER Root Driver into the Linux Kernel
|
||||
|
||||
The PCI Express AER Root driver is a Root Port service driver attached
|
||||
to the PCI Express Port Bus driver. If a user wants to use it, the driver
|
||||
has to be compiled. Option CONFIG_PCIEAER supports this capability. It
|
||||
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
|
||||
CONFIG_PCIEAER=y.
|
||||
|
||||
2.2 Load PCI Express AER Root Driver
|
||||
There is a case where a system has AER support in BIOS. Enabling the AER
|
||||
Root driver and having AER support in BIOS may result in unpredictable
|
||||
behavior. To avoid this conflict, a successful load of the AER Root driver
|
||||
requires ACPI _OSC support in the BIOS to allow the AER Root driver to
|
||||
request native control of AER. See the PCI FW 3.0 Specification for
|
||||
details regarding _OSC usage. Currently, much firmware does not provide
|
||||
_OSC support even though the platform uses PCI Express. To support such
|
||||
firmware, forceload, a parameter of type bool, lets AER initialization
|
||||
proceed even though the firmware has no _OSC support. To enable the
|
||||
workaround, please add aerdriver.forceload=y to the kernel boot parameter
|
||||
line when booting the kernel. Note that forceload=n by default.
|
||||
|
||||
nosourceid, another parameter of type bool, can be used when broken
|
||||
hardware (mostly chipsets) has root ports that cannot obtain the reporting
|
||||
source ID. nosourceid=n by default.
|
||||
|
||||
2.3 AER error output
|
||||
When a PCIe AER error is captured, an error message is printed to the
|
||||
console. If it's a correctable error, it is printed as a warning.
|
||||
Otherwise, it is printed as an error. Users can therefore choose a
|
||||
log level to filter out correctable error messages.
|
||||
|
||||
An example is shown below:
|
||||
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
|
||||
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
|
||||
0000:50:00.0: [20] Unsupported Request (First)
|
||||
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
|
||||
|
||||
In the example, 'Requester ID' means the ID of the device that sent
|
||||
the error message to the root port. See the PCI Express specification for
|
||||
other fields.
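
As a reading aid for the log above, here is a minimal user-space sketch
that splits the 16-bit requester ID into bus, device and function, and
finds the bit set in the error status word. The helper names
(decode_requester_id, first_error_bit) are hypothetical and are not part
of the kernel; the field layout (bus in bits 15:8, device in bits 7:3,
function in bits 2:0) is the standard PCI bus/device/function encoding.

```c
#include <stdint.h>

/* Fields of a 16-bit PCI Express requester ID. */
struct requester_id {
	unsigned int bus;	/* bits 15:8 */
	unsigned int dev;	/* bits 7:3  */
	unsigned int fn;	/* bits 2:0  */
};

/* Hypothetical helper: split a requester ID into bus/device/function. */
static struct requester_id decode_requester_id(uint16_t id)
{
	struct requester_id r;

	r.bus = (id >> 8) & 0xff;
	r.dev = (id >> 3) & 0x1f;
	r.fn = id & 0x7;
	return r;
}

/* Hypothetical helper: index of the lowest set bit in a status word, or -1. */
static int first_error_bit(uint32_t status)
{
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (status & ((uint32_t)1 << bit))
			return bit;
	return -1;
}
```

Applied to the example, id=0500 decodes to bus 05, device 00, function 0,
and status 00100000 has bit 20 set, which matches the "[20] Unsupported
Request" line in the log.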
|
||||
|
||||
|
||||
3. Developer Guide
|
||||
|
||||
AER-aware support requires a software driver to configure
|
||||
the AER capability structure within its device and to provide callbacks.
|
||||
|
||||
To support AER well, developers first need to understand how AER
|
||||
works.
|
||||
|
||||
PCI Express errors are classified into two types: correctable errors
|
||||
and uncorrectable errors. This classification is based on the impacts
|
||||
of those errors, which may result in degraded performance or function
|
||||
failure.
|
||||
|
||||
Correctable errors have no impact on the functionality of the
|
||||
interface. The PCI Express protocol can recover without any software
|
||||
intervention or any loss of data. These errors are detected and
|
||||
corrected by hardware. Unlike correctable errors, uncorrectable
|
||||
errors impact functionality of the interface. Uncorrectable errors
|
||||
can cause a particular transaction or a particular PCI Express link
|
||||
to be unreliable. Depending on those error conditions, uncorrectable
|
||||
errors are further classified into non-fatal errors and fatal errors.
|
||||
Non-fatal errors cause the particular transaction to be unreliable,
|
||||
but the PCI Express link itself is fully functional. Fatal errors, on
|
||||
the other hand, cause the link to be unreliable.
|
||||
|
||||
When AER is enabled, a PCI Express device will automatically send an
|
||||
error message to the PCIe root port above it when the device captures
|
||||
an error. The Root Port, upon receiving an error reporting message,
|
||||
internally processes and logs the error message in its PCI Express
|
||||
capability structure. Error information being logged includes storing
|
||||
the error reporting agent's requester ID into the Error Source
|
||||
Identification Registers and setting the error bits of the Root Error
|
||||
Status Register accordingly. If AER error reporting is enabled in Root
|
||||
Error Command Register, the Root Port generates an interrupt if an
|
||||
error is detected.
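
The root port registers mentioned above can be decoded roughly as in
the sketch below. It is a user-space illustration, not kernel code: the
macro and function names are hypothetical, and the bit values are assumed
to match the PCI_ERR_ROOT_* definitions in the kernel's pci_regs.h (bit 0
for a received correctable error, bit 2 for a received fatal or non-fatal
error, with the correctable source ID in the low 16 bits of the Error
Source Identification register and the uncorrectable source ID in the
high 16 bits). Check them against the PCI Express specification before
relying on them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits (assumed to match PCI_ERR_ROOT_* in pci_regs.h). */
#define ROOT_STATUS_COR_RCV		0x00000001u /* ERR_COR received */
#define ROOT_STATUS_MULTI_COR_RCV	0x00000002u /* more than one ERR_COR */
#define ROOT_STATUS_UNCOR_RCV		0x00000004u /* ERR_FATAL/NONFATAL received */
#define ROOT_STATUS_MULTI_UNCOR_RCV	0x00000008u /* more than one of those */

struct root_error_info {
	bool correctable;	/* a correctable error was logged */
	bool uncorrectable;	/* a fatal or non-fatal error was logged */
	uint16_t cor_source;	/* requester ID of the correctable error */
	uint16_t uncor_source;	/* requester ID of the uncorrectable error */
};

/* Hypothetical helper: interpret Root Error Status and Error Source ID. */
static struct root_error_info decode_root_error(uint32_t status,
						uint32_t source_id)
{
	struct root_error_info info;

	info.correctable = (status & ROOT_STATUS_COR_RCV) != 0;
	info.uncorrectable = (status & ROOT_STATUS_UNCOR_RCV) != 0;
	info.cor_source = (uint16_t)(source_id & 0xffff);
	info.uncor_source = (uint16_t)(source_id >> 16);
	return info;
}
```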
|
||||
|
||||
Note that the errors as described above are related to the PCI Express
|
||||
hierarchy and links. These errors do not include any device specific
|
||||
errors because device specific errors will still get sent directly to
|
||||
the device driver.
|
||||
|
||||
3.1 Configure the AER capability structure
|
||||
|
||||
AER-aware drivers of PCI Express components need to change the device
|
||||
control registers to enable AER. They may also change AER registers,
|
||||
including mask and severity registers. The helper function
|
||||
pci_enable_pcie_error_reporting can be used to enable AER. See
|
||||
section 3.3.
|
||||
|
||||
3.2 Provide callbacks
|
||||
|
||||
3.2.1 Callback reset_link to reset the PCI Express link
|
||||
|
||||
This callback is used to reset the PCI Express physical link when a
|
||||
fatal error happens. The root port AER service driver provides a
|
||||
default reset_link function, but different upstream ports might
|
||||
have different requirements for resetting the PCI Express link, so all
|
||||
upstream ports should provide their own reset_link functions.
|
||||
|
||||
In struct pcie_port_service_driver, a new pointer, reset_link, is
|
||||
added.
|
||||
|
||||
pci_ers_result_t (*reset_link) (struct pci_dev *dev);
|
||||
|
||||
Section 3.2.2.2 provides more detailed info on when to call
|
||||
reset_link.
|
||||
|
||||
3.2.2 PCI error-recovery callbacks
|
||||
|
||||
The PCI Express AER Root driver uses error callbacks to coordinate
|
||||
with downstream device drivers associated with a hierarchy in question
|
||||
when performing error recovery actions.
|
||||
|
||||
The pci_driver structure has a pointer, err_handler, to a struct
|
||||
pci_error_handlers, which consists of several callback function
|
||||
pointers. The AER driver follows the rules defined in
|
||||
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
|
||||
reset_link). Please refer to pci-error-recovery.txt for detailed
|
||||
definitions of the callbacks.
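
To make the shape of these callbacks concrete, here is a minimal sketch
of a handler table for a hypothetical driver "foo". So that it compiles
in user space, the types below are simplified stand-ins for the kernel
declarations in <linux/pci.h>; the numeric enum values and the reduced
set of callbacks are assumptions, and a real driver would include the
kernel header instead and point pci_driver->err_handler at the table.
The return values chosen in foo_error_detected are one common policy,
not the only valid one.

```c
#include <stddef.h>

/* Simplified user-space stand-ins for declarations in <linux/pci.h>. */
enum pci_channel_state {
	pci_channel_io_normal = 1,	/* I/O channel is working */
	pci_channel_io_frozen = 2,	/* I/O to the device is blocked */
	pci_channel_io_perm_failure = 3,	/* device is permanently gone */
};

typedef enum {
	PCI_ERS_RESULT_NONE = 1,
	PCI_ERS_RESULT_CAN_RECOVER = 2,
	PCI_ERS_RESULT_NEED_RESET = 3,
	PCI_ERS_RESULT_DISCONNECT = 4,
	PCI_ERS_RESULT_RECOVERED = 5,
} pci_ers_result_t;

struct pci_dev { int placeholder; };	/* stand-in, not the kernel struct */

struct pci_error_handlers {
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   enum pci_channel_state error);
	pci_ers_result_t (*mmio_enabled)(struct pci_dev *dev);
	pci_ers_result_t (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

/* Hypothetical driver "foo": decide how to react to a reported error. */
static pci_ers_result_t foo_error_detected(struct pci_dev *dev,
					   enum pci_channel_state error)
{
	(void)dev;
	switch (error) {
	case pci_channel_io_normal:	/* non-fatal: link still usable */
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:	/* fatal: ask for a reset */
		return PCI_ERS_RESULT_NEED_RESET;
	default:			/* permanent failure */
		return PCI_ERS_RESULT_DISCONNECT;
	}
}

/* Called once MMIO is usable again; a real driver would probe registers. */
static pci_ers_result_t foo_mmio_enabled(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Called after the reset; a real driver would reinitialize the device. */
static pci_ers_result_t foo_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Called when recovery is complete; a real driver would restart I/O. */
static void foo_resume(struct pci_dev *dev)
{
	(void)dev;
}

static const struct pci_error_handlers foo_err_handler = {
	.error_detected = foo_error_detected,
	.mmio_enabled = foo_mmio_enabled,
	.slot_reset = foo_slot_reset,
	.resume = foo_resume,
};
```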
|
||||
|
||||
The sections below specify when the error callback functions are called.
|
||||
|
||||
3.2.2.1 Correctable errors
|
||||
|
||||
Correctable errors have no impact on the functionality of
|
||||
the interface. The PCI Express protocol can recover without any
|
||||
software intervention or any loss of data. These errors do not
|
||||
require any recovery actions. The AER driver clears the device's
|
||||
correctable error status register accordingly and logs these errors.
|
||||
|
||||
3.2.2.2 Non-correctable (non-fatal and fatal) errors
|
||||
|
||||
If an error message indicates a non-fatal error, performing link reset
|
||||
at upstream is not required. The AER driver calls error_detected(dev,
|
||||
pci_channel_io_normal) for all drivers associated with the hierarchy in
|
||||
question. For example,
|
||||
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
|
||||
If Upstream port A captures an AER error, the hierarchy consists of
|
||||
Downstream port B and EndPoint.
|
||||
|
||||
A driver may return PCI_ERS_RESULT_CAN_RECOVER,
|
||||
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
|
||||
whether it can recover. If it can, the AER driver calls mmio_enabled next.
|
||||
|
||||
If an error message indicates a fatal error, the kernel will broadcast
|
||||
error_detected(dev, pci_channel_io_frozen) to all drivers within
|
||||
the hierarchy in question. A link reset at the upstream component is then
|
||||
necessary. As different kinds of devices might use different approaches
|
||||
to reset the link, the AER port service driver is required to provide the
|
||||
function to reset the link. First, the kernel checks whether the upstream
|
||||
component has an AER driver. If it does, the kernel uses the reset_link
|
||||
callback of the AER driver. If the upstream component has no AER driver
|
||||
and the port is a downstream port, the kernel performs a hot reset as the
|
||||
default by setting the Secondary Bus Reset bit of the Bridge Control
|
||||
register associated with the downstream port. As for upstream ports,
|
||||
they should provide their own AER service drivers with a reset_link
|
||||
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
|
||||
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
|
||||
to mmio_enabled.
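
The handling described in sections 3.2.2.1 and 3.2.2.2 can be summarized
as a small decision table. The sketch below is a user-space model of that
text only; every name in it (aer_severity, channel_state, recovery_plan,
plan_recovery) is hypothetical and does not correspond to a kernel symbol.

```c
#include <stdbool.h>

/* Hypothetical severity classes, following the text above. */
enum aer_severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/* Hypothetical model of the state passed to error_detected(). */
enum channel_state { CHANNEL_NONE, CHANNEL_IO_NORMAL, CHANNEL_IO_FROZEN };

struct recovery_plan {
	bool notify_drivers;		/* call error_detected() on drivers? */
	enum channel_state state;	/* state reported to the drivers */
	bool reset_link;		/* is an upstream link reset needed? */
};

/* Map an error severity to the recovery steps described in the text. */
static struct recovery_plan plan_recovery(enum aer_severity sev)
{
	/* Correctable: hardware already recovered, only clear and log. */
	struct recovery_plan plan = { false, CHANNEL_NONE, false };

	switch (sev) {
	case SEV_NONFATAL:
		/* Link is still functional, so no link reset is required. */
		plan.notify_drivers = true;
		plan.state = CHANNEL_IO_NORMAL;
		break;
	case SEV_FATAL:
		/* Link is unreliable: freeze I/O, then reset the link. */
		plan.notify_drivers = true;
		plan.state = CHANNEL_IO_FROZEN;
		plan.reset_link = true;
		break;
	case SEV_CORRECTABLE:
	default:
		break;
	}
	return plan;
}
```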
|
||||
|
||||
3.3 Helper functions
|
||||
|
||||
3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
|
||||
pci_enable_pcie_error_reporting enables the device to send error
|
||||
messages to the root port when an error is detected. Note that devices
|
||||
don't enable error reporting by default, so device drivers need to
|
||||
call this function to enable it.
|
||||
|
||||
3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
|
||||
pci_disable_pcie_error_reporting stops the device from sending error
|
||||
messages to root port when an error is detected.
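
What the two helpers above do to the device can be modeled as setting or
clearing four enable bits in the PCI Express Device Control register. The
sketch below is a user-space illustration under that assumption, not the
kernel implementation: the function names are hypothetical, and the bit
values are assumed to mirror PCI_EXP_DEVCTL_CERE, _NFERE, _FERE and _URRE
from the kernel's pci_regs.h.

```c
#include <stdint.h>

/* Device Control error reporting enables (assumed PCI_EXP_DEVCTL_* values). */
#define DEVCTL_CERE	0x0001	/* Correctable Error Reporting Enable */
#define DEVCTL_NFERE	0x0002	/* Non-Fatal Error Reporting Enable */
#define DEVCTL_FERE	0x0004	/* Fatal Error Reporting Enable */
#define DEVCTL_URRE	0x0008	/* Unsupported Request Reporting Enable */

#define DEVCTL_AER_FLAGS \
	(DEVCTL_CERE | DEVCTL_NFERE | DEVCTL_FERE | DEVCTL_URRE)

/* Hypothetical model of enabling: set all four reporting enable bits. */
static uint16_t devctl_enable_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl | DEVCTL_AER_FLAGS);
}

/* Hypothetical model of disabling: clear the same four bits. */
static uint16_t devctl_disable_reporting(uint16_t devctl)
{
	return (uint16_t)(devctl & ~DEVCTL_AER_FLAGS);
}
```

Other bits of the register are left untouched in both directions, which
is why the model uses a read-modify-write style rather than writing a
fixed value.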
|
||||
|
||||
3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
|
||||
pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
|
||||
error status register.
|
||||
|
||||
3.4 Frequently Asked Questions
|
||||
|
||||
Q: What happens if a PCI Express device driver does not provide an
|
||||
error recovery handler (pci_driver->err_handler is equal to NULL)?
|
||||
|
||||
A: The devices bound to the driver won't be recovered. If the
|
||||
error is fatal, the kernel will print warning messages. Please refer
|
||||
to section 3 for more information.
|
||||
|
||||
Q: What happens if an upstream port service driver does not provide
|
||||
callback reset_link?
|
||||
|
||||
A: Fatal error recovery will fail if the errors are reported by the
|
||||
upstream ports to which the service driver is attached.
|
||||
|
||||
Q: How does this infrastructure deal with a driver that is not PCI
|
||||
Express aware?
|
||||
|
||||
A: This infrastructure calls the error callback functions of the
|
||||
driver when an error happens. But if the driver is not aware of
|
||||
PCI Express, the device might not report its own errors to root
|
||||
port.
|
||||
|
||||
Q: What modifications will that driver need to make it compatible
|
||||
with the PCI Express AER Root driver?
|
||||
|
||||
A: It could call the helper functions to enable AER in devices and
|
||||
clean up the uncorrectable status register. Please refer to section 3.3.
|
||||
|
||||
|
||||
4. Software error injection
|
||||
|
||||
Debugging PCIe AER error recovery code is quite difficult because it
|
||||
is hard to trigger real hardware errors. Software based error
|
||||
injection can be used to fake various kinds of PCIe errors.
|
||||
|
||||
First you should enable PCIe AER software error injection in the kernel
|
||||
configuration; that is, the following item should be in your .config:
|
||||
|
||||
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
|
||||
|
||||
After rebooting with the new kernel or inserting the module, a device file named
|
||||
/dev/aer_inject should be created.
|
||||
|
||||
Then, you need a user space tool named aer-inject, which can be obtained
|
||||
from:
|
||||
http://www.kernel.org/pub/linux/utils/pci/aer-inject/
|
||||
|
||||
More information about aer-inject can be found in the document that comes
|
||||
with its source code.
|
||||