==================================================
PCI Express I/O Virtualization Resource on PowerNV
==================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them. The first two sections describe the concepts of Partitionable
Endpoints and their implementation on P8 (IODA2). The next two sections
discuss considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they are set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical.

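
The freeze semantics can be pictured with a small C model (an illustrative
sketch only, not the actual IODA2 state table layout)::

  #include <stdbool.h>

  /* Hypothetical per-PE entry: a pair of "frozen" bits. */
  struct pe_state {
          bool mmio_frozen;
          bool dma_frozen;
  };

  /* A freeze event sets both bits together... */
  static void pe_freeze(struct pe_state *pe)
  {
          pe->mmio_frozen = true;
          pe->dma_frozen = true;
  }

  /* ...but software can clear each one independently. */
  static void pe_thaw_mmio(struct pe_state *pe) { pe->mmio_frozen = false; }
  static void pe_thaw_dma(struct pe_state *pe) { pe->dma_frozen = false; }
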
The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered.  There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one M32
    window and sixteen M64 windows.  They have different characteristics.
    First, what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus, and they must be a power of two
    in size and naturally aligned.  The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to generate
        32-bit PCIe accesses.  We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: the top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there.  The M32 logic
        itself ignores the reservation and will forward accesses in that
        space if we try.)

      * Is divided into 256 segments of equal size.  A table in the chip
        maps each segment to a PE#.  That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity.  For a 2GB window,
        the segment granularity is 2GB/256 = 8MB (see the sketch after
        this list).

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs,
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus).  There is a way to also set the top 14
        bits which are not conveyed by PowerBus, but we don't use this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB; ignore the space
    for the M32, it comes out of a different "reserve").  We configure it
    as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#.  We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces, or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of effectiveness of EEH in that case, but that's
      the best we've found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and "secondary
      PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs.  It would make
    sense, but we haven't done it yet.
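
To make the outbound mapping concrete, here is a minimal sketch of how an
MMIO address would be resolved to a PE# for the two window types described
above (illustrative only, with made-up names; this is not the kernel's
implementation)::

  #include <stdint.h>

  #define NUM_SEGMENTS 256

  /* M32: a table in the chip maps each segment to a PE#.  For the usual
   * 2GB window, the segment size is 2GB / 256 = 8MB. */
  static uint8_t m32_segment_to_pe[NUM_SEGMENTS];  /* filled in by setup code */

  static unsigned int m32_pe_for_addr(uint64_t addr, uint64_t win_base,
                                      uint64_t win_size)
  {
          uint64_t segment = (addr - win_base) / (win_size / NUM_SEGMENTS);

          return m32_segment_to_pe[segment];
  }

  /* Segmented M64: there is no table; the segment number *is* the PE#,
   * so the only way to choose a PE# is to choose where a BAR sits
   * inside the window. */
  static unsigned int m64_pe_for_addr(uint64_t addr, uint64_t win_base,
                                      uint64_t win_size)
  {
          return (addr - win_base) / (win_size / NUM_SEGMENTS);
  }

This is why, with segmented M64 windows, BAR placement and PE# assignment
cannot be chosen independently, which is the crux of the SR-IOV discussion
in the following sections.
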

3. Considerations for SR-IOV on PowerKVM
========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.  For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses.  The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs.  Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.

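
  To make that example concrete, here is a small sketch of where each VF's
  BAR0 lands once the VF BAR0 in the PF SR-IOV Capability has been
  programmed (illustrative only; the base address and sizes are made up)::

     #include <stdint.h>
     #include <stdio.h>

     int main(void)
     {
             uint64_t vf_bar0_base = 0x100000000ULL; /* written to the PF's VF BAR0 */
             uint64_t vf_bar0_size = 1 << 20;        /* 1MB per VF, from BAR sizing */
             unsigned int num_vfs = 8;               /* NumVFs programmed in the PF */

             /* The VF BAR describes one num_vfs * size region (8MB here);
              * VF n's BAR0 is simply the n-th 1MB slice of that region. */
             for (unsigned int n = 0; n < num_vfs; n++)
                     printf("VF%u BAR0 at 0x%llx\n", n,
                            (unsigned long long)(vf_bar0_base + n * vf_bar0_size));

             /* Only the per-VF size (1MB), not the whole 8MB region,
              * constrains the alignment of vf_bar0_base. */
             return 0;
     }
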
  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.  VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window.  Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.  If
    they are different sizes, the entire window has to be small enough that
    the segment size matches the smallest VF BAR, which means larger VF
    BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
    to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used just
    like the M32 window, but the segments can't be individually mapped to
    PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain" of
    multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into 256
    equally-sized segments, and the segment number is the PE#.  But if we
    use several M64 windows, they can be set to different base addresses
    and different segment sizes.  If we have VFs that each have a 1MB BAR
    and a 32MB BAR, we could use one M64 window to assign 1MB segments and
    another M64 window to assign 32MB segments (see the sketch after this
    list).

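
  A minimal sketch of that last strategy (illustrative only; the window
  parameters are made up).  Each VF BAR gets its own segmented M64 window
  whose segment size equals that BAR's size; placing VF i's BARs at segment
  i of each window puts all of that VF's BARs in the same PE::

     #include <stdint.h>

     struct m64_window {
             uint64_t base;          /* naturally aligned window base address */
             uint64_t segment_size;  /* equals the VF BAR size it serves */
     };

     /* One window per VF BAR size: 1MB BARs and 32MB BARs in this example. */
     static const struct m64_window win_1mb = { 0x200000000ULL, 1ULL << 20 };
     static const struct m64_window win_32mb = { 0x400000000ULL, 32ULL << 20 };

     /* In a segmented M64 window the segment number is the PE#, so placing
      * a VF's BAR at segment 'pe' of a window assigns that BAR to PE 'pe'. */
     static uint64_t bar_addr_for_pe(const struct m64_window *w, unsigned int pe)
     {
             return w->base + (uint64_t)pe * w->segment_size;
     }
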
  Finally, the plan is to use M64 windows for SR-IOV, which will be
  described in more detail in the next two sections.  For a given VF BAR,
  we need to effectively reserve the entire 256 segments (256 * VF BAR
  size) and position the VF BAR to start at the beginning of a free range
  of segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE to each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  ranges to PE#s.  Each M64 window defines one MMIO range, and this range
  is divided into 256 segments, with each segment corresponding to one PE.

  We decided to leverage these M64 windows to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR directly
  to one M64 window, some part of the M64 window will map to another
  device's MMIO range.

  IODA supports 256 PEs, so a segmented window contains 256 segments; if
  total_VFs is less than 256, we have the situation shown in Figure 1.0,
  where segments [total_VFs, 255] of the M64 window may map to some MMIO
  range on other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware that
  responds to segments [total_VFs, 255].

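
  In other words, the reservation is sized by the window, not by total_VFs.
  A small sketch of that bookkeeping (illustrative only; the VF BAR size and
  total_VFs values are made up)::

     #include <stdint.h>
     #include <stdio.h>

     #define NUM_SEGMENTS 256

     int main(void)
     {
             uint64_t vf_bar_size = 1 << 20;  /* 1MB VF BAR */
             unsigned int total_vfs = 64;     /* total_VFs from the PF SR-IOV Capability */

             /* Space the VFs actually respond to: segments [0, total_VFs - 1]. */
             uint64_t vf_space = (uint64_t)total_vfs * vf_bar_size;

             /* Space reserved so that the whole segmented M64 window belongs
              * to this one device: all 256 segments, even the unused ones. */
             uint64_t reserved = (uint64_t)NUM_SEGMENTS * vf_bar_size;

             printf("VF(n) BAR space %llu MB, reserved %llu MB, extra %llu MB\n",
                    (unsigned long long)(vf_space >> 20),
                    (unsigned long long)(reserved >> 20),
                    (unsigned long long)((reserved - vf_space) >> 20));
             return 0;
     }
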

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window.  But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

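
A sketch of that positioning (illustrative only; the helper names are made
up).  With the segment size equal to the VF BAR size, choosing the segment
index for VF0 chooses its PE#, and every other VF follows contiguously::

  #include <assert.h>
  #include <stdint.h>

  #define NUM_SEGMENTS 256

  /*
   * Place the VF(n) BAR space so that VF0 lands in segment (and thus PE)
   * 'pe_of_vf0'.  All numVFs VFs must fit inside the 256 segments, which
   * leaves roughly (256 - numVFs) valid choices for pe_of_vf0.
   */
  static uint64_t vf_bar_space_base(uint64_t m64_base, uint64_t vf_bar_size,
                                    unsigned int num_vfs, unsigned int pe_of_vf0)
  {
          assert(pe_of_vf0 + num_vfs <= NUM_SEGMENTS);
          return m64_base + (uint64_t)pe_of_vf0 * vf_bar_size;
  }

  /* With that placement, VF n sits in PE (pe_of_vf0 + n). */
  static unsigned int pe_of_vf(unsigned int pe_of_vf0, unsigned int n)
  {
          return pe_of_vf0 + n;
  }
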
If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR.  That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.
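
A small worked example of that trade-off (illustrative only; the sizes are
made up)::

  #include <stdint.h>
  #include <stdio.h>

  #define NUM_SEGMENTS 256

  int main(void)
  {
          uint64_t vf_bar_size = 4 << 20;    /* 4MB VF BAR */
          uint64_t segment_size = 1 << 20;   /* segments smaller than the BAR */
          unsigned int num_vfs = 16;

          /* Each VF BAR now needs several segments (and therefore PEs)... */
          unsigned int segs_per_vf = (unsigned int)(vf_bar_size / segment_size);

          /* ...so the VF(n) BAR space consumes numVFs * segs_per_vf segments,
           * leaving fewer positions for adjusting its base. */
          unsigned int consumed = num_vfs * segs_per_vf;

          printf("segments per VF %u, consumed %u, spare %u\n",
                 segs_per_vf, consumed, NUM_SEGMENTS - consumed);
          return 0;
  }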