cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux
Log | Files | Refs | README | LICENSE | sfeed.txt

btt.rst (12124B)


      1=============================
      2BTT - Block Translation Table
      3=============================
      4
      5
      61. Introduction
      7===============
      8
      9Persistent memory based storage is able to perform IO at byte (or more
     10accurately, cache line) granularity. However, we often want to expose such
     11storage as traditional block devices. The block drivers for persistent memory
     12will do exactly this. However, they do not provide any atomicity guarantees.
     13Traditional SSDs typically provide protection against torn sectors in hardware,
     14using stored energy in capacitors to complete in-flight block writes, or perhaps
     15in firmware. We don't have this luxury with persistent memory - if a write is in
     16progress, and we experience a power failure, the block will contain a mix of old
     17and new data. Applications may not be prepared to handle such a scenario.
     18
     19The Block Translation Table (BTT) provides atomic sector update semantics for
     20persistent memory devices, so that applications that rely on sector writes not
     21being torn can continue to do so. The BTT manifests itself as a stacked block
     22device, and reserves a portion of the underlying storage for its metadata. At
     23the heart of it, is an indirection table that re-maps all the blocks on the
     24volume. It can be thought of as an extremely simple file system that only
     25provides atomic sector updates.
     26
     27
     282. Static Layout
     29================
     30
     31The underlying storage on which a BTT can be laid out is not limited in any way.
     32The BTT, however, splits the available space into chunks of up to 512 GiB,
     33called "Arenas".
     34
     35Each arena follows the same layout for its metadata, and all references in an
     36arena are internal to it (with the exception of one field that points to the
     37next arena). The following depicts the "On-disk" metadata layout::
     38
     39
     40    Backing Store     +------->  Arena
     41  +---------------+   |   +------------------+
     42  |               |   |   | Arena info block |
     43  |    Arena 0    +---+   |       4K         |
     44  |     512G      |       +------------------+
     45  |               |       |                  |
     46  +---------------+       |                  |
     47  |               |       |                  |
     48  |    Arena 1    |       |   Data Blocks    |
     49  |     512G      |       |                  |
     50  |               |       |                  |
     51  +---------------+       |                  |
     52  |       .       |       |                  |
     53  |       .       |       |                  |
     54  |       .       |       |                  |
     55  |               |       |                  |
     56  |               |       |                  |
     57  +---------------+       +------------------+
     58                          |                  |
     59                          |     BTT Map      |
     60                          |                  |
     61                          |                  |
     62                          +------------------+
     63                          |                  |
     64                          |     BTT Flog     |
     65                          |                  |
     66                          +------------------+
     67                          | Info block copy  |
     68                          |       4K         |
     69                          +------------------+
     70
     71
     723. Theory of Operation
     73======================
     74
     75
     76a. The BTT Map
     77--------------
     78
     79The map is a simple lookup/indirection table that maps an LBA to an internal
     80block. Each map entry is 32 bits. The two most significant bits are special
     81flags, and the remaining form the internal block number.
     82
     83======== =============================================================
     84Bit      Description
     85======== =============================================================
     8631 - 30	 Error and Zero flags - Used in the following way::
     87
     88	   == ==  ====================================================
     89	   31 30  Description
     90	   == ==  ====================================================
     91	   0  0	  Initial state. Reads return zeroes; Premap = Postmap
     92	   0  1	  Zero state: Reads return zeroes
     93	   1  0	  Error state: Reads fail; Writes clear 'E' bit
     94	   1  1	  Normal Block – has valid postmap
     95	   == ==  ====================================================
     96
     9729 - 0	 Mappings to internal 'postmap' blocks
     98======== =============================================================
     99
    100
    101Some of the terminology that will be subsequently used:
    102
    103============	================================================================
    104External LBA	LBA as made visible to upper layers.
    105ABA		Arena Block Address - Block offset/number within an arena
    106Premap ABA	The block offset into an arena, which was decided upon by range
    107		checking the External LBA
    108Postmap ABA	The block number in the "Data Blocks" area obtained after
    109		indirection from the map
    110nfree		The number of free blocks that are maintained at any given time.
    111		This is the number of concurrent writes that can happen to the
    112		arena.
    113============	================================================================
    114
    115
    116For example, after adding a BTT, we surface a disk of 1024G. We get a read for
    117the external LBA at 768G. This falls into the second arena, and of the 512G
    118worth of blocks that this arena contributes, this block is at 256G. Thus, the
    119premap ABA is 256G. We now refer to the map, and find out the mapping for block
    120'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
    121
    122
    123b. The BTT Flog
    124---------------
    125
    126The BTT provides sector atomicity by making every write an "allocating write",
    127i.e. Every write goes to a "free" block. A running list of free blocks is
    128maintained in the form of the BTT flog. 'Flog' is a combination of the words
    129"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
    130
    131========  =====================================================================
    132lba       The premap ABA that is being written to
    133old_map   The old postmap ABA - after 'this' write completes, this will be a
    134	  free block.
    135new_map   The new postmap ABA. The map will up updated to reflect this
    136	  lba->postmap_aba mapping, but we log it here in case we have to
    137	  recover.
    138seq	  Sequence number to mark which of the 2 sections of this flog entry is
    139	  valid/newest. It cycles between 01->10->11->01 (binary) under normal
    140	  operation, with 00 indicating an uninitialized state.
    141lba'	  alternate lba entry
    142old_map'  alternate old postmap entry
    143new_map'  alternate new postmap entry
    144seq'	  alternate sequence number.
    145========  =====================================================================
    146
    147Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
    148padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
    149done such that for any entry being written, it:
    150a. overwrites the 'old' section in the entry based on sequence numbers
    151b. writes the 'new' section such that the sequence number is written last.
    152
    153
    154c. The concept of lanes
    155-----------------------
    156
    157While 'nfree' describes the number of concurrent IOs an arena can process
    158concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
    159process::
    160
    161	nlanes = min(nfree, num_cpus)
    162
    163A lane number is obtained at the start of any IO, and is used for indexing into
    164all the on-disk and in-memory data structures for the duration of the IO. If
    165there are more CPUs than the max number of available lanes, than lanes are
    166protected by spinlocks.
    167
    168
    169d. In-memory data structure: Read Tracking Table (RTT)
    170------------------------------------------------------
    171
    172Consider a case where we have two threads, one doing reads and the other,
    173writes. We can hit a condition where the writer thread grabs a free block to do
    174a new IO, but the (slow) reader thread is still reading from it. In other words,
    175the reader consulted a map entry, and started reading the corresponding block. A
    176writer started writing to the same external LBA, and finished the write updating
    177the map for that external LBA to point to its new postmap ABA. At this point the
    178internal, postmap block that the reader is (still) reading has been inserted
    179into the list of free blocks. If another write comes in for the same LBA, it can
    180grab this free block, and start writing to it, causing the reader to read
    181incorrect data. To prevent this, we introduce the RTT.
    182
    183The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
    184into rtt[lane_number], the postmap ABA it is reading, and clears it after the
    185read is complete. Every writer thread, after grabbing a free block, checks the
    186RTT for its presence. If the postmap free block is in the RTT, it waits till the
    187reader clears the RTT entry, and only then starts writing to it.
    188
    189
    190e. In-memory data structure: map locks
    191--------------------------------------
    192
    193Consider a case where two writer threads are writing to the same LBA. There can
    194be a race in the following sequence of steps::
    195
    196	free[lane] = map[premap_aba]
    197	map[premap_aba] = postmap_aba
    198
    199Both threads can update their respective free[lane] with the same old, freed
    200postmap_aba. This has made the layout inconsistent by losing a free entry, and
    201at the same time, duplicating another free entry for two lanes.
    202
    203To solve this, we could have a single map lock (per arena) that has to be taken
    204before performing the above sequence, but we feel that could be too contentious.
    205Instead we use an array of (nfree) map_locks that is indexed by
    206(premap_aba modulo nfree).
    207
    208
    209f. Reconstruction from the Flog
    210-------------------------------
    211
    212On startup, we analyze the BTT flog to create our list of free blocks. We walk
    213through all the entries, and for each lane, of the set of two possible
    214'sections', we always look at the most recent one only (based on the sequence
    215number). The reconstruction rules/steps are simple:
    216
    217- Read map[log_entry.lba].
    218- If log_entry.new matches the map entry, then log_entry.old is free.
    219- If log_entry.new does not match the map entry, then log_entry.new is free.
    220  (This case can only be caused by power-fails/unsafe shutdowns)
    221
    222
    223g. Summarizing - Read and Write flows
    224-------------------------------------
    225
    226Read:
    227
    2281.  Convert external LBA to arena number + pre-map ABA
    2292.  Get a lane (and take lane_lock)
    2303.  Read map to get the entry for this pre-map ABA
    2314.  Enter post-map ABA into RTT[lane]
    2325.  If TRIM flag set in map, return zeroes, and end IO (go to step 8)
    2336.  If ERROR flag set in map, end IO with EIO (go to step 8)
    2347.  Read data from this block
    2358.  Remove post-map ABA entry from RTT[lane]
    2369.  Release lane (and lane_lock)
    237
    238Write:
    239
    2401.  Convert external LBA to Arena number + pre-map ABA
    2412.  Get a lane (and take lane_lock)
    2423.  Use lane to index into in-memory free list and obtain a new block, next flog
    243    index, next sequence number
    2444.  Scan the RTT to check if free block is present, and spin/wait if it is.
    2455.  Write data to this free block
    2466.  Read map to get the existing post-map ABA entry for this pre-map ABA
    2477.  Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
    2488.  Write new post-map ABA into map.
    2499.  Write old post-map entry into the free list
    25010. Calculate next sequence number and write into the free list entry
    25111. Release lane (and lane_lock)
    252
    253
    2544. Error Handling
    255=================
    256
    257An arena would be in an error state if any of the metadata is corrupted
    258irrecoverably, either due to a bug or a media error. The following conditions
    259indicate an error:
    260
    261- Info block checksum does not match (and recovering from the copy also fails)
    262- All internal available blocks are not uniquely and entirely addressed by the
    263  sum of mapped blocks and free blocks (from the BTT flog).
    264- Rebuilding free list from the flog reveals missing/duplicate/impossible
    265  entries
    266- A map entry is out of bounds
    267
    268If any of these error conditions are encountered, the arena is put into a read
    269only state using a flag in the info block.
    270
    271
    2725. Usage
    273========
    274
    275The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
    276(pmem, or blk mode). The easiest way to set up such a namespace is using the
    277'ndctl' utility [1]:
    278
    279For example, the ndctl command line to setup a btt with a 4k sector size is::
    280
    281    ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
    282
    283See ndctl create-namespace --help for more options.
    284
    285[1]: https://github.com/pmem/ndctl