cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

eeh-pci-error-recovery.rst (15264B)


==========================
PCI Bus EEH Error Recovery
==========================

Linas Vepstas <linas@austin.ibm.com>

12 January 2005


Overview:
---------
The IBM POWER-based pSeries and iSeries computers include PCI bus
controller chips that have extended capabilities for detecting and
reporting a large variety of PCI bus error conditions.  These features
go under the name of "EEH", for "Enhanced Error Handling".  The EEH
hardware features allow PCI bus errors to be cleared and a PCI
card to be "rebooted", without also having to reboot the operating
system.

This is in contrast to traditional PCI error handling, where the
PCI chip is wired directly to the CPU, and an error would cause
a CPU machine-check/check-stop condition, halting the CPU entirely.
Another "traditional" technique is to ignore such errors, which
can lead to corruption of either user data or kernel data,
hung/unresponsive adapters, or system crashes/lockups.  Thus,
the idea behind EEH is that the operating system can become more
reliable and robust by protecting it from PCI errors, and giving
the OS the ability to "reboot"/recover individual PCI devices.

Future systems from other vendors, based on the PCI-E specification,
may contain similar features.


Causes of EEH Errors
--------------------
EEH was originally designed to guard against hardware failure, such
as PCI cards dying from heat, humidity, dust, vibration and bad
electrical connections. The vast majority of EEH errors seen in
"real life" are due either to poorly seated PCI cards or,
unfortunately quite commonly, to device driver bugs, device firmware
bugs, and sometimes PCI card hardware bugs.

The most common software bug is one that causes the device to
attempt to DMA to a location in system memory that has not been
reserved for DMA access for that card.  Catching this is a powerful
feature, as it prevents what would otherwise have been silent memory
corruption caused by the bad DMA.  A number of device driver
bugs have been found and fixed in this way over the past few
years.  Other possible causes of EEH errors include data or
address line parity errors (for example, due to poor electrical
connectivity caused by a poorly seated card), and PCI-X split-completion
errors (due to software, device firmware, or device PCI hardware bugs).
The vast majority of "true hardware failures" can be cured by
physically removing and re-seating the PCI card.


Detection and Recovery
----------------------
This section first gives a generic overview of how to detect
and recover from EEH errors.  This is followed
by an overview of how the current implementation in the Linux
kernel does it.  The actual implementation is subject to change,
and some of the finer points are still being debated.  These
may in turn be swayed if or when other architectures implement
similar functionality.

When a PCI Host Bridge (PHB, the bus controller connecting the
PCI bus to the system CPU electronics complex) detects a PCI error
condition, it will "isolate" the affected PCI card.  Isolation
will block all writes (either to the card from the system, or
from the card to the system), and it will cause all reads to
return all-ff's (0xff, 0xffff, 0xffffffff for 8/16/32-bit reads).
This value was chosen because it is the same value you would
get if the device was physically unplugged from the slot.
Isolation applies to PCI memory, I/O space, and PCI config
space alike.  Interrupts, however, will continue to be delivered.
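
The isolation behaviour described above can be modelled in a few lines
of user-space C.  This is only a sketch under stated assumptions:
``struct sim_slot``, ``sim_write32()`` and ``sim_read32()`` are
hypothetical names invented for illustration, not kernel or firmware
interfaces, and a single register stands in for the whole device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one PCI slot behind a PHB (illustration only). */
struct sim_slot {
    bool isolated;   /* set once the PHB has frozen the slot */
    uint32_t reg;    /* a single 32-bit device register */
};

/* Writes to an isolated slot are silently dropped. */
static void sim_write32(struct sim_slot *slot, uint32_t val)
{
    if (!slot->isolated)
        slot->reg = val;
}

/* Reads from an isolated slot return all ones, as if unplugged. */
static uint32_t sim_read32(const struct sim_slot *slot)
{
    return slot->isolated ? UINT32_MAX : slot->reg;
}
```

Note that a driver seeing 0xffffffff cannot tell from the value alone
whether the slot is frozen or the register legitimately holds all ones.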

Detection and recovery are performed with the aid of ppc64
firmware.  The programming interfaces in the Linux kernel
into the firmware are referred to as RTAS (Run-Time Abstraction
Services).  The Linux kernel does not (should not) access
the EEH function in the PCI chipsets directly, primarily because
there are a number of different chipsets out there, each with
different interfaces and quirks. The firmware provides a
uniform abstraction layer that will work with all pSeries
and iSeries hardware (and be forward-compatible).

If the OS or device driver suspects that a PCI slot has been
EEH-isolated, there is a firmware call it can make to determine if
this is the case. If so, then the device driver should put itself
into a consistent state (given that it won't be able to complete any
pending work) and start recovery of the card.  Recovery normally
would consist of resetting the PCI device (holding the PCI #RST
line high for two seconds), followed by setting up the device
config space (the base address registers (BARs), latency timer,
cache line size, interrupt line, and so on).  This is followed by a
reinitialization of the device driver.  In a worst-case scenario,
the power to the card can be toggled, at least on hot-plug-capable
slots.  In principle, layers far above the device driver probably
do not need to know that the PCI card has been "rebooted" in this
way; ideally, there should be at most a pause in Ethernet/disk/USB
I/O while the card is being reset.

If the card cannot be recovered after three or four resets, the
kernel/device driver should assume the worst-case scenario, that the
card has died completely, and report this error to the sysadmin.
In addition, error messages are reported through RTAS and also through
syslogd (/var/log/messages) to alert the sysadmin of PCI resets.
The correct way to deal with failed adapters is to use the standard
PCI hotplug tools to remove and replace the dead card.
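
The generic recovery procedure (reset, restore config space,
reinitialize the driver, give up after a few attempts) can be sketched
as a bounded retry loop.  Everything here is an assumption made for
illustration: ``struct recovery_ops``, ``recover_device()`` and the
``fake_*`` helpers are hypothetical, and the limit of three resets is
just one reading of "three or four".

```c
#include <stdbool.h>

/* Hypothetical per-device recovery hooks (illustration only). */
struct recovery_ops {
    void (*reset)(void *dev);           /* toggle the PCI reset line */
    void (*restore_config)(void *dev);  /* rewrite BARs, latency timer, ... */
    bool (*reinit_driver)(void *dev);   /* true if the device works again */
};

#define MAX_RESETS 3   /* assumed limit; the text says "three or four" */

/* Returns the number of resets used, or -1 if the card is presumed dead. */
static int recover_device(const struct recovery_ops *ops, void *dev)
{
    for (int attempt = 1; attempt <= MAX_RESETS; attempt++) {
        ops->reset(dev);
        ops->restore_config(dev);
        if (ops->reinit_driver(dev))
            return attempt;
    }
    return -1;   /* caller reports the dead card to the sysadmin */
}

/* A fake device that starts working after a given number of resets. */
struct fake_dev {
    int resets;        /* resets performed so far */
    int works_after;   /* resets needed before reinit succeeds */
};

static void fake_reset(void *dev) { ((struct fake_dev *)dev)->resets++; }
static void fake_restore(void *dev) { (void)dev; }
static bool fake_reinit(void *dev)
{
    struct fake_dev *d = dev;
    return d->resets >= d->works_after;
}

static const struct recovery_ops fake_ops = {
    .reset = fake_reset,
    .restore_config = fake_restore,
    .reinit_driver = fake_reinit,
};
```

With these helpers, a device that needs two resets is recovered on the
second attempt, while one that never comes back is declared dead after
``MAX_RESETS`` attempts.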


Current PPC64 Linux EEH Implementation
--------------------------------------
At this time, a generic EEH recovery mechanism has been implemented,
so that individual device drivers do not need to be modified to support
EEH recovery.  This generic mechanism piggy-backs on the PCI hotplug
infrastructure, and percolates events up through the userspace/udev
infrastructure.  Following is a detailed description of how this is
accomplished.

EEH must be enabled in the PHBs very early during the boot process,
and again whenever a PCI slot is hot-plugged. The former is performed by
eeh_init() in arch/powerpc/platforms/pseries/eeh.c, and the latter by
drivers/pci/hotplug/pSeries_pci.c calling into the eeh.c code.
EEH must be enabled before a PCI scan of the device can proceed.
Current Power5 hardware will not work unless EEH is enabled,
although older Power4 can run with it disabled.  Effectively,
EEH can no longer be turned off.  PCI devices *must* be
registered with the EEH code; the EEH code needs to know about
the I/O address ranges of the PCI device in order to detect an
error.  Given an arbitrary address, the routine
pci_get_device_by_addr() will find the PCI device associated
with that address (if any).
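
Conceptually, mapping an address back to a device is a search over the
registered I/O ranges.  The sketch below shows the idea with a plain
table and a linear scan; this document does not describe the data
structure the kernel actually uses, so ``struct io_range`` and
``lookup_by_addr()`` are hypothetical and not the implementation of
pci_get_device_by_addr().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one registered device's I/O range. */
struct io_range {
    uint64_t start;
    uint64_t end;       /* inclusive */
    const char *name;   /* stands in for a device pointer */
};

/* Linear search; returns the owning entry or NULL if no device matches. */
static const struct io_range *
lookup_by_addr(const struct io_range *tbl, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tbl[i].start && addr <= tbl[i].end)
            return &tbl[i];
    return NULL;
}
```

An address that falls in no registered range yields NULL, matching the
"(if any)" in the text above.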

The default arch/powerpc/include/asm/io.h macros readb(), inb(), insb(),
etc. include a check to see if the I/O read returned all-0xff's.
If so, these make a call to eeh_dn_check_failure(), which in turn
asks the firmware if the all-ff's value is the sign of a true EEH
error.  If it is not, processing continues as normal.  The grand
total number of these false alarms or "false positives" can be
seen in /proc/ppc64/eeh (subject to change).  Normally, almost
all of these occur during boot, when the PCI bus is scanned and
a large number of 0xff reads are part of the bus scan procedure.
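
The check performed after each read can be pictured as follows.  This
is a user-space sketch, not the io.h macros themselves:
``check_read32()``, ``sim_slot_frozen`` and ``false_positives`` are
hypothetical names, and a plain variable stands in for the firmware
query.

```c
#include <stdbool.h>
#include <stdint.h>

/* Count of all-ones reads that turned out not to be EEH errors. */
static unsigned long false_positives;

/* Stand-in for the firmware's answer; the real query goes through RTAS. */
static bool sim_slot_frozen;

/*
 * Post-process a 32-bit read.  Only an all-ones value triggers the
 * comparatively expensive firmware check.  Returns true if the slot
 * really is frozen.
 */
static bool check_read32(uint32_t val)
{
    if (val != UINT32_MAX)
        return false;        /* fast path: ordinary data */
    if (sim_slot_frozen)
        return true;         /* genuine EEH error */
    false_positives++;       /* legitimate 0xffffffff, e.g. bus scan */
    return false;
}
```

The design point is that the common case costs one comparison, and the
firmware is consulted only when the suspicious value appears.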

If a frozen slot is detected, code in
arch/powerpc/platforms/pseries/eeh.c will print a stack trace to
syslog (/var/log/messages).  This stack trace has proven to be very
useful to device-driver authors for finding out at what point the EEH
error was detected, as the error itself usually occurs slightly
beforehand.

Next, the EEH code uses the Linux kernel notifier chain/work queue
mechanism to allow any interested parties to find out about the
failure.  Device drivers, or other parts of the kernel, can use
`eeh_register_notifier(struct notifier_block *)` to find out about EEH
events.  The event will include a pointer to the PCI device, the
device node and some state info.  Receivers of the event can "do as
they wish"; the default handler will be described further in this
section.
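
For readers unfamiliar with notifier chains, the pattern is a linked
list of callbacks that are each invoked when an event fires.  The
following user-space sketch illustrates only that pattern; it is a
simplified analogue, and ``struct sim_notifier``, ``sim_register()``
and ``sim_call_chain()`` are hypothetical names rather than the kernel
API.

```c
#include <stddef.h>

/* Simplified, hypothetical analogue of a kernel notifier block. */
struct sim_notifier {
    int (*call)(struct sim_notifier *nb, unsigned long event, void *data);
    struct sim_notifier *next;
};

static struct sim_notifier *chain_head;

/* Add a listener at the front of the chain. */
static void sim_register(struct sim_notifier *nb)
{
    nb->next = chain_head;
    chain_head = nb;
}

/* Deliver an event to every listener; returns how many were called. */
static int sim_call_chain(unsigned long event, void *data)
{
    int n = 0;

    for (struct sim_notifier *nb = chain_head; nb; nb = nb->next, n++)
        nb->call(nb, event, data);
    return n;
}

/* Example listener: remembers the last event it saw. */
static unsigned long last_event;

static int remember_event(struct sim_notifier *nb, unsigned long event,
                          void *data)
{
    (void)nb;
    (void)data;
    last_event = event;
    return 0;
}
```

A driver-side listener would do its own recovery work in the callback
instead of merely recording the event.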

To assist in the recovery of the device, eeh.c exports the
following functions:

rtas_set_slot_reset()
   assert the PCI #RST line for 1/8th of a second
rtas_configure_bridge()
   ask firmware to configure any PCI bridges
   located topologically under the PCI slot.
eeh_save_bars() and eeh_restore_bars()
   save and restore the PCI
   config-space info for a device and any devices under it.


A handler for the EEH notifier_block events is implemented in
drivers/pci/hotplug/pSeries_pci.c, called handle_eeh_events().
It saves the device BARs and then calls rpaphp_unconfig_pci_adapter().
This last call causes the device driver for the card to be stopped,
which causes uevents to go out to user space. This triggers
user-space scripts that might issue commands such as "ifdown eth0"
for Ethernet cards, and so on.  This handler then sleeps for 5 seconds,
hoping to give the user-space scripts enough time to complete.
It then resets the PCI card, reconfigures the device BARs and
any bridges underneath. It then calls rpaphp_enable_pci_slot(),
which restarts the device driver and triggers more user-space
events (for example, calling "ifup eth0" for Ethernet cards).
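
The reason the BARs are saved before the reset is that the reset
returns the device to its power-on state.  The sketch below models
just that ordering in user space.  It assumes a device with six BARs
(as in a standard type-0 PCI header); ``struct sim_dev`` and the
``sim_*`` functions are hypothetical stand-ins, not the eeh.c
routines.

```c
#include <stdint.h>
#include <string.h>

#define NUM_BARS 6   /* a type-0 PCI header has six BARs */

/* Hypothetical device model: only the BARs are tracked. */
struct sim_dev {
    uint32_t bar[NUM_BARS];
    uint32_t saved_bar[NUM_BARS];
};

static void sim_save_bars(struct sim_dev *d)
{
    memcpy(d->saved_bar, d->bar, sizeof(d->bar));
}

/* A reset returns the device to its power-on state: BARs cleared. */
static void sim_reset(struct sim_dev *d)
{
    memset(d->bar, 0, sizeof(d->bar));
}

static void sim_restore_bars(struct sim_dev *d)
{
    memcpy(d->bar, d->saved_bar, sizeof(d->bar));
}

/* Order of operations followed by the handler described above. */
static void sim_handle_event(struct sim_dev *d)
{
    sim_save_bars(d);
    /* driver is unconfigured here; user space gets time to react */
    sim_reset(d);
    sim_restore_bars(d);
    /* bridges are configured and the slot re-enabled here */
}
```

Saving after the reset instead of before it would capture only zeroes,
which is why the save is the first step.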


Device Shutdown and User-Space Events
-------------------------------------
This section documents what happens when a PCI slot is unconfigured,
focusing on how the device driver gets shut down, and on how the
events get delivered to user-space scripts.

The following example shows the sequence of calls that causes a device
driver's close function to be invoked during the first phase of an EEH
reset, using the pcnet32 device driver::

    rpaphp_unconfig_pci_adapter (struct slot *)  // in rpaphp_pci.c
    {
      calls
      pci_remove_bus_device (struct pci_dev *) // in /drivers/pci/remove.c
      {
        calls
        pci_destroy_dev (struct pci_dev *)
        {
          calls
          device_unregister (&dev->dev) // in /drivers/base/core.c
          {
            calls
            device_del (struct device *)
            {
              calls
              bus_remove_device() // in /drivers/base/bus.c
              {
                calls
                device_release_driver()
                {
                  calls
                  struct device_driver->remove() which is just
                  pci_device_remove()  // in /drivers/pci/pci_driver.c
                  {
                    calls
                    struct pci_driver->remove() which is just
                    pcnet32_remove_one() // in /drivers/net/pcnet32.c
                    {
                      calls
                      unregister_netdev() // in /net/core/dev.c
                      {
                        calls
                        dev_close()  // in /net/core/dev.c
                        {
                           calls dev->stop();
                           which is just pcnet32_close() // in pcnet32.c
                           {
                             which does what you wanted
                             to stop the device
                           }
                        }
                     }
                   which
                   frees pcnet32 device driver memory
                }
     }}}}}}


In drivers/pci/pci_driver.c,
struct device_driver->remove() is just pci_device_remove()
which calls struct pci_driver->remove() which is pcnet32_remove_one()
which calls unregister_netdev()  (in net/core/dev.c)
which calls dev_close()  (in net/core/dev.c)
which calls dev->stop() which is pcnet32_close()
which then does the appropriate shutdown.

---

Following is the analogous stack trace for events sent to user-space
when the PCI device is unconfigured::

  rpaphp_unconfig_pci_adapter() {              // in rpaphp_pci.c
    calls
    pci_remove_bus_device (struct pci_dev *) { // in /drivers/pci/remove.c
      calls
      pci_destroy_dev (struct pci_dev *) {
        calls
        device_unregister (&dev->dev) {        // in /drivers/base/core.c
          calls
          device_del(struct device * dev) {    // in /drivers/base/core.c
            calls
            kobject_del() {                    // in /lib/kobject.c
              calls
              kobject_uevent() {               // in /lib/kobject.c
                calls
                kset_uevent() {                // in /lib/kobject.c
                  calls
                  kset->uevent_ops->uevent()   // which is really just
                  a call to
                  dev_uevent() {               // in /drivers/base/core.c
                    calls
                    dev->bus->uevent() which is really just a call to
                    pci_uevent () {            // in drivers/pci/hotplug.c
                      which prints device name, etc....
                   }
                 }
                 then kobject_uevent() sends a netlink uevent to userspace
                 --> userspace uevent
                 (during early boot, nobody listens to netlink events and
                 kobject_uevent() executes uevent_helper[], which runs the
                 event process /sbin/hotplug)
             }
           }
           kobject_del() then calls sysfs_remove_dir(), which would
           trigger any user-space daemon that was watching /sysfs,
           and notice the delete event.


Pros and Cons of the Current Design
-----------------------------------
There are several issues with the current EEH software recovery design,
which may be addressed in future revisions.  But first, note that the
big plus of the current design is that no changes need to be made to
individual device drivers, so that the current design throws a wide net.
The biggest negative of the design is that it potentially disturbs
network daemons and file systems that didn't need to be disturbed.

-  A minor complaint is that resetting the network card causes
   user-space back-to-back ifdown/ifup burps that potentially disturb
   network daemons that didn't even need to know that the PCI
   card was being rebooted.

-  A more serious concern is that the same reset, for SCSI devices,
   causes havoc to mounted file systems.  Scripts cannot post-facto
   unmount a file system without flushing pending buffers, but this
   is impossible, because I/O has already been stopped.  Thus,
   ideally, the reset should happen at or below the block layer,
   so that the file systems are not disturbed.

   Reiserfs does not tolerate errors returned from the block device.
   Ext3fs seems to be tolerant, retrying reads/writes until it does
   succeed. Both have been only lightly tested in this scenario.

   The SCSI-generic subsystem already has built-in code for performing
   SCSI device resets, SCSI bus resets, and SCSI host-bus-adapter
   (HBA) resets.  These are cascaded into a chain of attempted
   resets if a SCSI command fails. These are completely hidden
   from the block layer.  It would be very natural to add an EEH
   reset into this chain of events.

-  If a SCSI error occurs for the root device, all is lost unless
   the sysadmin had the foresight to run /bin, /sbin, /etc, /var
   and so on, out of ramdisk/tmpfs.
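
The idea of adding an EEH reset to the SCSI reset cascade, raised in
the list above, amounts to appending one more level to an escalation
loop.  The sketch below shows that shape only; it is not the SCSI
error-handling code, and ``enum reset_level``, ``escalate()`` and
``needs_at_least()`` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Escalation levels, mildest first.  The EEH entry is the proposed one. */
enum reset_level {
    RESET_DEVICE,
    RESET_BUS,
    RESET_HBA,
    RESET_EEH_SLOT,
    RESET_LEVELS
};

/* Hypothetical hook: try one level, return true if commands work again. */
typedef bool (*try_reset_fn)(enum reset_level level, void *ctx);

/* Walk the chain; returns the level that succeeded, or -1 if none did. */
static int escalate(try_reset_fn try_reset, void *ctx)
{
    for (int level = RESET_DEVICE; level < RESET_LEVELS; level++)
        if (try_reset((enum reset_level)level, ctx))
            return level;
    return -1;
}

/* Example hook: succeeds only at or above the level stored in ctx. */
static bool needs_at_least(enum reset_level level, void *ctx)
{
    return level >= *(enum reset_level *)ctx;
}
```

Because the whole loop sits below the block layer, a fault that only
an EEH slot reset can clear would still be invisible to file systems,
which is the property the list above asks for.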


Conclusions
-----------
There's forward progress ...