.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, so that
users know when something bad has happened to a PCI device. It aims to:

  * Provide alert debug information.
  * Support self-healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All health report handling is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * OOB (out-of-box) initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
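
The relationship between a driver, its reporters and their optional callbacks
can be pictured with a small Python model. This is only an illustrative sketch,
not the kernel API: the class names ``ReporterOps`` and ``HealthRegistry`` and
the reporter names ``tx`` and ``fw`` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ReporterOps:
    """Optional per-reporter callbacks supplied by the driver (hypothetical model)."""
    name: str
    recover: Optional[Callable[[], bool]] = None
    diagnose: Optional[Callable[[], dict]] = None
    dump: Optional[Callable[[], dict]] = None


@dataclass
class HealthRegistry:
    """Stands in for the devlink instance that owns all reporters."""
    reporters: Dict[str, ReporterOps] = field(default_factory=dict)

    def create_reporter(self, ops: ReporterOps) -> None:
        if ops.name in self.reporters:
            raise ValueError(f"reporter {ops.name!r} already exists")
        self.reporters[ops.name] = ops

    def supported_ops(self, name: str) -> List[str]:
        # List the callbacks this reporter actually provides.
        ops = self.reporters[name]
        return [op for op in ("recover", "diagnose", "dump")
                if getattr(ops, op) is not None]


registry = HealthRegistry()
# One part of the driver registers a reporter that can recover and diagnose.
registry.create_reporter(ReporterOps("tx", recover=lambda: True,
                                     diagnose=lambda: {"sq_state": "ok"}))
# Another part registers a different reporter with only a dump handler.
registry.create_reporter(ReporterOps("fw", dump=lambda: {"fw_trace": []}))
```

In this model each reporter carries only the handlers its part of the driver
chose to implement, which is why the two reporters support different
operations.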

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
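
The sequence above can be sketched as a simplified Python model. It is not the
kernel implementation: the ``Reporter`` class, its field names and the exact
grace period comparison are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Reporter:
    """Simplified stand-in for a health reporter (not the kernel struct)."""
    name: str
    recover: Callable[[], bool]
    dump: Callable[[], dict]
    auto_recover: bool = True
    grace_period_ms: int = 0
    error_count: int = 0
    recover_count: int = 0
    healthy: bool = True
    saved_dump: Optional[dict] = None
    last_recovery_ms: Optional[int] = None
    trace: List[str] = field(default_factory=list)


def health_report(r: Reporter, msg: str, now_ms: int) -> bool:
    """Handle one error report; return True if a recovery attempt succeeded."""
    # 1. Log the event (stands in for the kernel trace events buffer).
    r.trace.append(f"{r.name}: {msg}")
    # 2. Update health status and statistics.
    r.healthy = False
    r.error_count += 1
    # 3. Take a dump only if none is stored yet.
    if r.saved_dump is None:
        r.saved_dump = r.dump()
    # 4. Attempt auto recovery, subject to configuration and grace period.
    if not r.auto_recover:
        return False
    # Assumption: recovery is skipped while less than grace_period_ms has
    # passed since the previous recovery attempt.
    if (r.last_recovery_ms is not None
            and now_ms - r.last_recovery_ms < r.grace_period_ms):
        return False
    r.last_recovery_ms = now_ms
    if r.recover():
        r.recover_count += 1
        r.healthy = True
        return True
    return False


r = Reporter("tx", recover=lambda: True, dump=lambda: {"queue": 3},
             grace_period_ms=500)
first = health_report(r, "tx timeout", now_ms=1000)   # recovery runs
second = health_report(r, "tx timeout", now_ms=1200)  # inside grace period
third = health_report(r, "tx timeout", now_ms=1600)   # grace period elapsed
```

In this run only the first report stores a dump, and the second report is
counted but not recovered because it arrives within the grace period.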

User Interface
==============

The user can access/change each reporter's parameters and driver-specific callbacks
via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Retrieve the object dump
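
From user space these operations are typically driven through the iproute2
``devlink health`` tool. The sketch below only builds the argument lists and
does not execute anything; the helper ``health_cmd`` is hypothetical, and the
device handle and reporter name are placeholders that depend on the system.

```python
from typing import List, Optional


def health_cmd(action: str, dev: Optional[str] = None,
               reporter: Optional[str] = None, **params: str) -> List[str]:
    """Build an argv list for the iproute2 ``devlink health`` tool."""
    argv = ["devlink", "health", *action.split()]
    if dev is not None and reporter is not None:
        argv += [dev, "reporter", reporter]
    for key, value in params.items():
        argv += [key, value]
    return argv


DEV = "pci/0000:82:00.0"   # placeholder device handle
REP = "tx"                 # placeholder reporter name

commands = [
    health_cmd("show", DEV, REP),
    health_cmd("set", DEV, REP, grace_period="500", auto_recover="true"),
    health_cmd("recover", DEV, REP),
    health_cmd("diagnose", DEV, REP),
    health_cmd("dump show", DEV, REP),
    health_cmd("dump clear", DEV, REP),
]

for argv in commands:
    print(" ".join(argv))
```

On a machine with a devlink-capable device and sufficient privileges, each list
could be passed to ``subprocess.run(argv, check=True)``.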

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers the reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should closely follow those of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
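
The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be modelled roughly as
follows. The ``DumpSlot`` class is hypothetical, and keeping a dump that was
generated on demand is an assumption of this sketch.

```python
from typing import Callable, List, Optional


class DumpSlot:
    """Models the single saved dump of one reporter (illustrative only)."""

    def __init__(self, generate: Callable[[], dict]) -> None:
        self._generate = generate          # reporter-defined dump callback
        self._saved: Optional[dict] = None

    def dump_get(self) -> dict:
        # Return the stored dump; generate one first if nothing is stored.
        # Assumption: the newly generated dump is then kept as the saved one.
        if self._saved is None:
            self._saved = self._generate()
        return self._saved

    def dump_clear(self) -> None:
        # Forget the saved dump so that the next request generates a new one.
        self._saved = None


calls: List[int] = []


def make_dump() -> dict:
    """Toy dump callback that records how often it was invoked."""
    calls.append(1)
    return {"seq": len(calls)}


slot = DumpSlot(make_dump)
a = slot.dump_get()   # nothing stored yet: a new dump is generated
b = slot.dump_get()   # stored dump is returned, callback is not run again
slot.dump_clear()
c = slot.dump_get()   # cleared, so a fresh dump is generated
```

The callback runs twice in total: once for the first request and once after
the clear.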

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+