Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used in the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc...

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!).  In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick.  These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.

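The grouping arithmetic for memory devices can be sketched in C. This is
only an illustration, not kernel code: the helper name
``devices_per_rank`` is hypothetical, and the 72-bit bus is the typical
value quoted above rather than a fixed rule.

```c
/*
 * Hypothetical helper (not part of the EDAC API): number of DRAM devices
 * that must be grouped in parallel to fill a bus of bus_bits bits when
 * each device outputs device_bits bits.
 * Returns 0 if the widths are invalid or do not divide evenly.
 */
unsigned int devices_per_rank(unsigned int bus_bits, unsigned int device_bits)
{
	if (device_bits == 0 || bus_bits % device_bits != 0)
		return 0;
	return bus_bits / device_bits;
}
```

Under these assumptions, a 72-bit bus needs 18 x4 devices or 9 x8
devices.
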
* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel.  In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. Most often it is also
called a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered
DIMM memory controller. Typically, it contains two channels. Two channels
on the same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. Due to that, it is capable
of correcting more errors than in single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interlaced across two
DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

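The single-channel and double-channel widths described above follow from
a simple multiplication. The sketch below uses hypothetical names
(``channel_mode``, ``access_width_bits``) that do not come from the EDAC
sources, and it assumes every DIMM in the access has the same width.

```c
/* Hypothetical names for illustration; not taken from the EDAC sources. */
enum channel_mode {
	SINGLE_CHANNEL = 1,	/* data comes from one DIMM */
	DUAL_CHANNEL = 2,	/* data is interlaced across two DIMMs */
};

/* Width in bits of one parallel access, given the width of one DIMM. */
unsigned int access_width_bits(unsigned int dimm_bits, enum channel_mode mode)
{
	return dimm_bits * (unsigned int)mode;
}
```

With a 64-bit DIMM this gives 64 bits in single-channel mode and 128 bits
in double-channel mode. Counting the ECC bits as well (72-bit DIMMs), the
same arithmetic gives 144 bits for a double-channel access.
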
* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, for
dual channel 128 bits. It may not be visible to the memory controller,
as some DIMM types have a memory buffer that can hide direct access to
it from the memory controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices.  The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row.  A single socket
set has two chip-select rows and, if double-sided sticks are used, these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.


Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Drivers allocate a memory controller with :c:func:`edac_mc_alloc`.
Internally, it uses the struct ``mem_ctl_info`` to describe the memory
controllers, which is an opaque struct for the EDAC drivers. Only the EDAC
core is allowed to touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation in sysfs.

This set of structures, and the code that implements the APIs for them,
provides for registering EDAC type devices which are NOT standard memory or
PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

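The two levels (instances, each holding blocks) can be pictured with a
small user-space model. The struct and function names below are
hypothetical and deliberately simplified; they are NOT the kernel's
definitions of :c:type:`edac_device_instance` or
:c:type:`edac_device_block`.

```c
#include <stddef.h>

/* Simplified user-space model; NOT the kernel's struct definitions. */
struct model_block {
	const char *name;		/* e.g. "L1-cache" */
	unsigned long ce_count;		/* correctable errors */
	unsigned long ue_count;		/* uncorrectable errors */
};

struct model_instance {
	const char *name;		/* e.g. "cpu0" */
	const struct model_block *blocks;
	size_t nr_blocks;
};

/* Sum the correctable errors over every block of every instance. */
unsigned long model_total_ce(const struct model_instance *inst, size_t nr_inst)
{
	unsigned long total = 0;
	size_t i, j;

	for (i = 0; i < nr_inst; i++)
		for (j = 0; j < inst[i].nr_blocks; j++)
			total += inst[i].blocks[j].ce_count;
	return total;
}
```

The first level is the array of instances and the second level is the
array of blocks inside each instance, with the error counters kept per
block.
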
For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

	/sys/devices/system/edac/..

	pci/		<existing pci directory (if available)>
	mc/		<existing memory device directory>
	cpu/cpu0/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	cpu/cpu1/..	<L1 and L2 block directory>
		/L1-cache/ce_count
			 /ue_count
		/L2-cache/ce_count
			 /ue_count
	...

	the L1 and L2 directories would be "edac_device_block's"

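A user-space program could read one of the counters in the example tree
roughly as follows. This is a minimal sketch: the helper names
``edac_counter_path`` and ``edac_read_counter`` are hypothetical, the
directory names are taken from the example listing, and whether a given
system exposes them depends on which EDAC drivers are loaded.

```c
#include <stdio.h>

/* Root of the EDAC sysfs tree, as in the example listing. */
#define EDAC_SYSFS_ROOT "/sys/devices/system/edac"

/*
 * Hypothetical helper: build "<root>/<dev>/<instance>/<block>/<counter>"
 * into buf. Returns 0 on success, -1 if the path does not fit.
 */
int edac_counter_path(char *buf, size_t len, const char *dev,
		      const char *instance, const char *block,
		      const char *counter)
{
	int n = snprintf(buf, len, "%s/%s/%s/%s/%s", EDAC_SYSFS_ROOT,
			 dev, instance, block, counter);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read one unsigned decimal counter from a file.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 */
int edac_read_counter(const char *path, unsigned long *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lu", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For instance, calling ``edac_counter_path()`` with ``"cpu"``, ``"cpu0"``,
``"L1-cache"`` and ``"ce_count"`` produces the path of the first counter
in the listing, which can then be passed to ``edac_read_counter()``.
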
.. kernel-doc:: drivers/edac/edac_device.h