cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

xfs-self-describing-metadata.rst (17018B)


.. SPDX-License-Identifier: GPL-2.0

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes; however,
it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so as the amount of analysis goes
up, the more likely it is that the cause will be lost in the noise.  Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.
     37
     38
Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify if it is the right place. Put simply,
you can't look at a single metadata block in isolation and say "yes, it is
supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding how things like cross linked block lists came about (e.g. sibling
pointers in a btree ending up with loops in them) is the key to understanding
what went wrong, but it is impossible to tell after the fact in what order the
blocks were linked into each other or written to disk.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable.  Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.
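As an illustration only, the following is a minimal bit-at-a-time software
CRC32c (Castagnoli polynomial) in userspace C. It is a sketch for checking
values by hand, not the kernel's implementation; the function name crc32c() and
its signature are chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/*
 * Bit-at-a-time CRC32c: initial value of all ones, final inversion.
 * Slow, but short enough to verify by inspection.
 */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	assert(buf != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++) {
			/* mask is all ones when the low bit is set, else zero */
			uint32_t mask = 0u - (crc & 1u);

			crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
		}
	}
	return ~crc;
}
```

With these parameters the ASCII string "123456789" yields 0xE3069283, the
standard CRC32c check value. Table-driven or hardware accelerated variants
compute the same function faster.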
     94
Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, whether it is valid
or corrupted, and how long ago it was last modified. Knowing the owner
of the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at.  The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, we can determine whether a metadata object is supposed to be free
space or still allocated if it is still referenced by its owner by looking at
when the free space btree block that contains the block was last written
compared to when the metadata object itself was last written.  If the free space
block is more recent than the object and the object's owner, then there is a
very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block has written into it
the Log Sequence Number (LSN) of the most recent transaction in which it was
modified. This number will always increase over the life of the filesystem, and
the only thing that resets it is running xfs_repair on the filesystem. Further,
by use of the LSN we can tell if the corrupted metadata all belonged to the same
log checkpoint and hence have some idea of how much modification occurred
between the first and last instance of corrupt metadata on disk and, further,
how much modification occurred between the corruption being written and when it
was detected.
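The ordering argument above can be sketched in userspace C. This example assumes
the usual XFS encoding of an LSN as a 64 bit value with the log cycle number in
the upper 32 bits and the log block number in the lower 32 bits; the type and
helper names (lsn_t, make_lsn, lsn_cmp, probably_stale_reference) are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;	/* hypothetical stand-in for the kernel's LSN type */

/* Assumed layout: cycle number in the high 32 bits, block in the low 32. */
static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return ((lsn_t)cycle << 32) | block;
}

static uint32_t lsn_cycle(lsn_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_block(lsn_t lsn) { return (uint32_t)lsn; }

/* Returns <0, 0 or >0 if a was written before, with, or after b. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	assert(sizeof(lsn_t) == 8);
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * The free space example: if the free space btree block was written more
 * recently than both the object and its owner, the owner's reference to
 * the object is probably stale.
 */
static bool probably_stale_reference(lsn_t freespace, lsn_t object, lsn_t owner)
{
	return lsn_cmp(freespace, object) > 0 && lsn_cmp(freespace, owner) > 0;
}
```

The result is only a heuristic for a human analyst: it suggests where to look
first and does not prove which block is wrong.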
    146
Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
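The order of the read-side checks can be sketched as a standalone userspace
program. Everything here is an assumption made for illustration: the magic
values, the host-endian struct foo_hdr, the result codes and the choice to
compute the CRC with the crc field treated as zero. The real on-disk fields are
big-endian and the real checks are object specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FOO_MAGIC	0x464f4f31u	/* hypothetical old-format magic */
#define FOO_CRC_MAGIC	0x464f4f33u	/* hypothetical self-describing magic */

/* Host-endian stand-in for the on-disk header (48 bytes, no padding). */
struct foo_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint64_t owner;
	uint64_t blkno;
	uint64_t lsn;
};

enum foo_result { FOO_OK, FOO_BAD_MAGIC, FOO_BAD_CRC, FOO_BAD_LOCATION,
		  FOO_BAD_CONTENTS };

/* Feed len bytes into a running (not yet inverted) CRC32c state. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* CRC32c of the whole block with the crc field treated as zero. */
static uint32_t foo_cksum(const uint8_t *blk, size_t len)
{
	static const uint8_t zero[sizeof(uint32_t)];
	size_t off = offsetof(struct foo_hdr, crc);
	uint32_t crc = 0xFFFFFFFFu;

	assert(len >= sizeof(struct foo_hdr));
	crc = crc32c_update(crc, blk, off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, blk + off + sizeof(zero),
			    len - off - sizeof(zero));
	return ~crc;
}

static enum foo_result foo_read_verify(const uint8_t *blk, size_t len,
				       const uint8_t fs_uuid[16],
				       uint64_t blkno)
{
	struct foo_hdr hdr;

	if (len < sizeof(hdr))
		return FOO_BAD_CONTENTS;
	memcpy(&hdr, blk, sizeof(hdr));

	/* 1. The magic number decides whether CRC validation applies. */
	if (hdr.magic != FOO_CRC_MAGIC && hdr.magic != FOO_MAGIC)
		return FOO_BAD_MAGIC;
	if (hdr.magic == FOO_CRC_MAGIC) {
		/* 2. Integrity: recompute the CRC and compare. */
		if (hdr.crc != foo_cksum(blk, len))
			return FOO_BAD_CRC;
		/* 3. Location: right filesystem and right block. */
		if (memcmp(hdr.uuid, fs_uuid, 16) != 0 || hdr.blkno != blkno)
			return FOO_BAD_LOCATION;
		/* 4. Object specific checks; a single placeholder here. */
		if (hdr.owner == 0)
			return FOO_BAD_CONTENTS;
	}
	return FOO_OK;
}
```

Checking the CRC before the location fields means a location mismatch is only
reported for a block whose contents are otherwise intact, which is the
signature of a misdirected write rather than of bit corruption.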
    180
Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.
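The write-side ordering can be sketched in the same way. As before this is a
userspace illustration under stated assumptions: struct foo_hdr is a
hypothetical host-endian header, FOO_CRC_MAGIC is made up, and the content
check is a placeholder for real object specific verification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FOO_CRC_MAGIC	0x464f4f33u	/* hypothetical self-describing magic */

/* Host-endian stand-in for the on-disk header (48 bytes, no padding). */
struct foo_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint64_t owner;
	uint64_t blkno;
	uint64_t lsn;
};

/* Feed len bytes into a running (not yet inverted) CRC32c state. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* CRC32c of the whole block with the crc field treated as zero. */
static uint32_t foo_cksum(const uint8_t *blk, size_t len)
{
	static const uint8_t zero[sizeof(uint32_t)];
	size_t off = offsetof(struct foo_hdr, crc);
	uint32_t crc = 0xFFFFFFFFu;

	assert(len >= sizeof(struct foo_hdr));
	crc = crc32c_update(crc, blk, off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, blk + off + sizeof(zero),
			    len - off - sizeof(zero));
	return ~crc;
}

/* Returns 0 with the block ready for IO, or -1 with the block untouched. */
static int foo_write_verify(uint8_t *blk, size_t len, uint64_t last_lsn)
{
	struct foo_hdr hdr;

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, blk, sizeof(hdr));

	/* 1. Verify the in-memory contents first (placeholder checks). */
	if (hdr.magic != FOO_CRC_MAGIC || hdr.owner == 0)
		return -1;

	/* 2. Stamp the LSN of the last modification. */
	hdr.lsn = last_lsn;
	memcpy(blk, &hdr, sizeof(hdr));

	/* 3. Compute the CRC last so that it covers the new LSN. */
	hdr.crc = foo_cksum(blk, len);
	memcpy(blk, &hdr, sizeof(hdr));
	return 0;
}
```

The CRC has to be the final step: any field updated after it, including the
LSN, would make the stored checksum fail the next read verification.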
    187
Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
	    __be32  magic;		/* magic number */
	    __be32  crc;		/* CRC, not logged */
	    uuid_t  uuid;		/* filesystem identifier */
	    __be64  owner;		/* parent object */
	    __be64  blkno;		/* location on disk */
	    __be64  lsn;		/* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	  number for location. The two of these combined provide the same
	  information as @owner and @blkno in the above structure, but using 8
	  bytes less space on disk.

	- directory/attribute node blocks have a 16 bit magic number, and the
	  header that contains the magic number has other information in it as
	  well. Hence the additional metadata headers change the overall format
	  of the metadata.
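The 8 byte saving of the short form can be checked with a small userspace
sketch. Both structures below are hypothetical: they use fixed-width host types
in place of the __be32/__be64 and uuid_t fields, and the field order of
struct short_hdr is chosen for the example rather than taken from the real
short btree block layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Long form: 64 bit owner and 64 bit block number. */
struct long_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint64_t owner;
	uint64_t blkno;
	uint64_t lsn;
};

/* Short form: 32 bit owner (AG number) and 32 bit AG-relative block. */
struct short_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint32_t owner;
	uint32_t blkno;
	uint64_t lsn;
};

/* Fail the build if the compiler adds padding we did not expect. */
_Static_assert(sizeof(struct long_hdr) == 48, "unexpected long_hdr size");
_Static_assert(sizeof(struct short_hdr) == 40, "unexpected short_hdr size");

/* Bytes saved per block by the short form. */
static size_t short_form_savings(void)
{
	return sizeof(struct long_hdr) - sizeof(struct short_hdr);
}
```

The saving comes entirely from the two narrowed fields (4 bytes each); the
magic, CRC, UUID and LSN take the same space in both forms.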
    219
A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;

	    /* check the CRC first (if enabled), then the block contents */
	    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
		!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
					    XFS_FOO_CRC_OFF)) ||
		!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
	    }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* the self describing fields only exist when CRCs are enabled */
	    if (xfs_sb_version_hascrc(&mp->m_sb)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    }

	    /* object specific verification checks here */

	    return true;
    }
    270
If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		    /* new format: check the self describing fields */
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* object specific verification checks here */

	    return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order to the read verifiers. A typical write verifier::

    static void
    xfs_foo_write_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_buf_log_item	*bip = bp->b_fspriv;

	    /* verify the contents before anything else is touched */
	    if (!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
		    return;
	    }

	    if (!xfs_sb_version_hascrc(&mp->m_sb))
		    return;

	    /* stamp the LSN, then calculate the CRC over the final contents */
	    if (bip) {
		    struct xfs_ondisk_hdr	*hdr = bp->b_addr;
		    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	    }
	    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.
    328
Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that they contain inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.
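A per-buffer identification pass of this kind might look like the following
userspace sketch. It assumes that each inode starts with the 16 bit big-endian
magic number 0x494e ("IN") and that inodes are packed back to back at a fixed
inode size; the function name and parameters are hypothetical, and no CRC or
per-object verification is attempted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DINODE_MAGIC	0x494eu		/* "IN", stored big-endian on disk */

/*
 * Basic buffer identification only: check that every inode slot in the
 * buffer starts with the inode magic number.
 */
static bool inode_buf_magics_ok(const uint8_t *buf, size_t buf_len,
				size_t inode_size)
{
	assert(buf != NULL);

	/* the buffer must hold a whole number of inode slots */
	if (inode_size < 2 || buf_len == 0 || buf_len % inode_size != 0)
		return false;

	for (size_t off = 0; off < buf_len; off += inode_size) {
		/* assemble the 16 bit big-endian magic byte by byte */
		uint16_t magic = (uint16_t)((buf[off] << 8) | buf[off + 1]);

		if (magic != DINODE_MAGIC)
			return false;
	}
	return true;
}
```

A pass like this only shows that the buffer plausibly contains inodes. Whether
an individual inode is intact is decided later, per object.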
    340
The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_inode_from_disk() when the inode
is first read out of the buffer and the struct xfs_inode is instantiated. The
inode is already extensively verified during writeback in xfs_iflush_int(), so
the only addition here is to add the LSN and CRC to the inode as it is copied
back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.