cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against metadata inconsistencies in the case of a system crash. Up
to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
size limits) can be reserved inside the filesystem as a place to land
“important” data writes on-disk as quickly as possible. Once the important
data transaction is fully written to the disk and flushed from the disk write
cache, a record of the data being committed is also written to the journal. At
some later point in time, the journal code writes the transactions to their
final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.
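
The replay rule above can be sketched in a few lines of Python. This is only
an illustration: the list of ``(blocktype, sequence)`` tuples and the function
name are assumptions made for the example, and checksum verification is left
out entirely.

```python
def complete_transactions(blocks):
    """Group (blocktype, sequence) pairs into committed transactions.

    `blocks` is a hypothetical, already-decoded view of the journal log.
    """
    committed, pending = [], []
    for blocktype, sequence in blocks:
        pending.append((blocktype, sequence))
        if blocktype == "commit":
            # A commit record closes the transaction.
            committed.append(pending)
            pending = []
    # Anything left in `pending` never reached a commit and is discarded.
    return committed
```

A log that ends with a descriptor but no commit yields only the earlier,
fully committed transactions.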

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor_block (data_blocks or revocation_block) [more data or
       revocations] commit_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - h_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - __be32
     - h_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - __be32
     - h_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.
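
As a minimal sketch of how the two tables above fit together, the following
Python decodes the 12-byte header from a raw journal block. The function name
and the short type labels are made up for this example; only the offsets, the
big-endian layout and the magic number come from the tables.

```python
import struct

JBD2_MAGIC = 0xC03B3998

# Short labels chosen for this example; the numeric values are from the table.
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}


def parse_block_header(block: bytes):
    """Return (type label, transaction ID), or None if the magic is absent."""
    # Three consecutive big-endian 32-bit fields: h_magic, h_blocktype, h_sequence.
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return BLOCK_TYPES.get(blocktype, "unknown"), sequence
```

A block that does not start with the magic number is either a data block or
not part of the log, so the sketch returns ``None`` for it.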

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal, and where to find the
start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal_header_t (12 bytes)
     - s_header
     - Common header identifying this as a superblock.
   * - 0xC
     - __be32
     - s_blocksize
     - Journal device block size.
   * - 0x10
     - __be32
     - s_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - __be32
     - s_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - __be32
     - s_sequence
     - First commit ID expected in log.
   * - 0x1C
     - __be32
     - s_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - __be32
     - s_errno
     - Error value, as set by jbd2_journal_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - __be32
     - s_feature_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - __be32
     - s_feature_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - __be32
     - s_feature_ro_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - __u8
     - s_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - __be32
     - s_nr_users
     - Number of file systems sharing this journal.
   * - 0x44
     - __be32
     - s_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - __be32
     - s_max_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - __be32
     - s_max_trans_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - __u8
     - s_checksum_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - __u8[3]
     - s_padding2
     -
   * - 0x54
     - __be32
     - s_num_fc_blocks
     - Number of fast commit blocks in the journal.
   * - 0x58
     - __u32
     - s_padding[42]
     -
   * - 0xFC
     - __be32
     - s_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - __u8
     - s_users[16*48]
     - IDs of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.
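
To make the offsets in the table concrete, here is a hedged Python sketch that
pulls a handful of fields out of a raw 1024-byte journal superblock. The
function name and the dictionary keys are invented for the example, the
checksum is not verified, and the v2-only fields are read only when the block
type in the header says "superblock, v2".

```python
import struct

JBD2_MAGIC = 0xC03B3998


def parse_journal_superblock(sb: bytes) -> dict:
    """Decode selected fields of a jbd2 superblock (illustrative only)."""
    magic, blocktype, _sequence = struct.unpack_from(">III", sb, 0x0)
    if magic != JBD2_MAGIC or blocktype not in (3, 4):
        raise ValueError("not a jbd2 superblock")

    # Static fields at 0xC..0x17, dynamic fields at 0x18..0x23.
    blocksize, maxlen, first = struct.unpack_from(">III", sb, 0xC)
    sequence, start, errno_ = struct.unpack_from(">III", sb, 0x18)
    info = {
        "version": 1 if blocktype == 3 else 2,
        "blocksize": blocksize,
        "maxlen": maxlen,
        "first": first,
        "sequence": sequence,
        "start": start,
        "errno": errno_,
    }

    if blocktype == 4:
        # These fields are only valid in a v2 superblock.
        compat, incompat, ro_compat = struct.unpack_from(">III", sb, 0x24)
        info.update(
            feature_compat=compat,
            feature_incompat=incompat,
            feature_ro_compat=ro_compat,
            uuid=sb[0x30:0x40].hex(),
            nr_users=struct.unpack_from(">I", sb, 0x40)[0],
            checksum_type=sb[0x50],
            num_fc_blocks=struct.unpack_from(">I", sb, 0x54)[0],
        )
    return info
```

Every multi-byte field is unpacked with ``>`` because jbd2 stores its fields
big-endian.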

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2_FEATURE_COMPAT_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2_FEATURE_INCOMPAT_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
   * - 0x20
     - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
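
Because the incompat field is a bitmask, a reader has to test each bit and
should refuse to touch a journal that carries bits it does not know. A small
Python sketch of that check follows; the helper name and the shortened flag
labels are assumptions for the example, the bit values are from the table.

```python
# Bit values from the incompat table; labels are shortened for the example.
INCOMPAT_FLAGS = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
    0x20: "FAST_COMMIT",
}


def decode_incompat(value: int):
    """Return (names of known set bits, mask of unknown set bits)."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    known_mask = sum(INCOMPAT_FLAGS)  # sums the dictionary keys, i.e. 0x3F
    return names, value & ~known_mask
```

A non-zero second element means the journal uses a feature this decoder does
not understand.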

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following.  crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal_block_tag_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be32
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
       not enabled.
   * - 0xC
     - __be32
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.
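
Putting the v3 tag layout and the flags together, a descriptor block can be
walked roughly as follows. This Python sketch assumes
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, does not verify any checksum, and uses
names invented for the example.

```python
import struct

FLAG_ESCAPE = 0x1
FLAG_SAME_UUID = 0x2
FLAG_DELETED = 0x4
FLAG_LAST_TAG = 0x8


def iter_tags_v3(block: bytes):
    """Yield (block number, flags, checksum) for each v3 tag in a descriptor."""
    offset = 12  # skip the common 12-byte block header
    while offset + 16 <= len(block):
        lo, flags, hi, csum = struct.unpack_from(">IIII", block, offset)
        offset += 16
        if not flags & FLAG_SAME_UUID:
            offset += 16  # a 16-byte UUID trails the tag
        yield (hi << 32) | lo, flags, csum
        if flags & FLAG_LAST_TAG:
            break
```

The loop relies on the "last tag" flag to stop; the length check is only a
guard against a malformed block.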

If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - __be16
     - t_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - __be16
     - t_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - __be32
     - t_blocknr_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
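
The tag sizes quoted for both formats follow from three inputs: whether
CSUM_V3 is set, whether 64-bit block numbers are enabled, and whether the
"same UUID" flag lets the tag omit its trailing UUID. A short Python sketch of
that arithmetic, with a function name chosen for the example:

```python
def tag_size(csum_v3: bool, has_64bit: bool, same_uuid: bool) -> int:
    """Return the on-disk size in bytes of one journal block tag."""
    if csum_v3:
        size = 16  # fixed size regardless of block number width
    else:
        size = 12 if has_64bit else 8  # t_blocknr_high is optional here
    if not same_uuid:
        size += 16  # trailing uuid[16]
    return size
```

Enumerating the inputs reproduces the sizes given in the text: 16 or 32 bytes
with CSUM_V3, and 8, 12, 24 or 28 bytes without it.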

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - t_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.
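
The escaping rule is easy to get backwards, so here is a small Python sketch
of both directions. The function names are hypothetical; the magic number and
the flag value 0x1 come from the tables in this document.

```python
JBD2_MAGIC_BYTES = bytes.fromhex("c03b3998")  # magic number in big-endian order
FLAG_ESCAPE = 0x1


def escape_for_journal(data: bytes, flags: int):
    """Zero the first four bytes if they collide with the jbd2 magic."""
    if data[:4] == JBD2_MAGIC_BYTES:
        return bytes(4) + data[4:], flags | FLAG_ESCAPE
    return data, flags


def unescape_on_replay(data: bytes, flags: int) -> bytes:
    """Restore the magic in a block whose tag carries the escape flag."""
    if flags & FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + data[4:]
    return data
```

Escaping followed by unescaping returns the original block, which is the
property replay depends on.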

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_t
     - r_header
     - Common block header.
   * - 0xC
     - __be32
     - r_count
     - Number of bytes used in this block.
   * - 0x10
     - __be32 or __be64
     - blocks[0]
     - Blocks to revoke.

After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
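
A revocation block can be decoded along these lines. The Python sketch below
assumes that ``r_count`` counts bytes from the start of the block, so it
includes the 16-byte header; that reading, like the function name, is an
assumption of the example rather than something the table spells out.

```python
import struct


def parse_revoke_block(block: bytes, has_64bit: bool):
    """Return the list of revoked block numbers in one revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0xC)
    width, fmt = (8, ">Q") if has_64bit else (4, ">I")
    revoked = []
    offset = 0x10  # the array starts right after r_count
    while offset + width <= r_count:
        revoked.append(struct.unpack_from(fmt, block, offset)[0])
        offset += width
    return revoked
```

With 4-byte entries and ``r_count`` equal to 24, the block holds exactly two
revoked block numbers.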

If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - __be32
     - r_checksum
     - Checksum of the journal UUID + revocation block.

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, which, going by
the offsets below, is 60 bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal_header_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h_chksum_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h_chksum_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h_padding[2]
     -
   * - 0x10
     - __be32
     - h_chksum[JBD2_CHECKSUM_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
       are set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - __be64
     - h_commit_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - __be32
     - h_commit_nsec
     - Nanoseconds component of the above timestamp.
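
For illustration, the commit block fields can be read with the offsets from
the table above. This Python sketch does not verify the checksum; the function
name and the dictionary keys are invented for the example.

```python
import struct


def parse_commit_block(block: bytes) -> dict:
    """Decode the commit header fields of a journal commit block."""
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0x0)
    if magic != 0xC03B3998 or blocktype != 2:
        raise ValueError("not a commit block")
    # 32 bytes of checksum space, viewed as eight big-endian 32-bit words.
    chksum = struct.unpack_from(">8I", block, 0x10)
    commit_sec, commit_nsec = struct.unpack_from(">QI", block, 0x30)
    return {
        "sequence": sequence,
        "chksum_type": block[0xC],
        "chksum_size": block[0xD],
        "chksum": chksum[0],  # only the first word is described in the table
        "commit_sec": commit_sec,
        "commit_nsec": commit_nsec,
    }
```

The ``>`` prefix also disables alignment padding in ``struct``, so ``">QI"``
reads exactly the 12 bytes starting at 0x30.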

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV) records.
Each TLV starts with a ``struct ext4_fc_tl``, which stores the tag and the
length of the entire field. It is followed by a variable-length, tag-specific
value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode.
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed.
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file.
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag ends.

Fast Commit Replay Idempotence
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fast commit tags are idempotent in nature provided the recovery code follows
certain rules. The guiding principle that the commit path follows while
committing is that it stores the result of a particular operation instead of
storing the procedure.

Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
was associated with inode 10. During fast commit, instead of storing this
operation as a procedure "rename a to b", we store the resulting file system
state as a "series" of outcomes:

- Link dirent b to inode 10
- Unlink dirent a
- Inode 10 with valid refcount

Now when recovery code runs, it needs to "enforce" this state on the file
system. This is what guarantees idempotence of fast commit replay.

Let's take an example of a procedure that is not idempotent and see how fast
commits make it idempotent. Consider the following sequence of operations:

1) rm A
2) mv B A
3) read A

If we store this sequence of operations as is, then the replay is not
idempotent. Let's say that while in replay, we crash after (2). During the
second replay, file A (which was actually created as a result of the "mv B A"
operation) would get deleted. Thus, a file named A would be absent when we try
to read A. So, this sequence of operations is not idempotent. However, as
mentioned above, instead of storing the procedure, fast commits store the
outcome of each procedure. Thus the fast commit log for the above procedure
would be as follows:

(Let's assume dirent A was linked to inode 10 and dirent B was linked to
inode 11 before the replay)

1) Unlink A
2) Link A to inode 11
3) Unlink B
4) Inode 11

If we crash after (3) we will have file A linked to inode 11. During the second
replay, we will remove file A (inode 11). But we will create it back and make
it point to inode 11. We won't find B, so we'll just skip that step. At this
point, the refcount for inode 11 is not reliable, but that gets fixed by the
replay of the last inode 11 tag. Thus, by converting a non-idempotent procedure
into a series of idempotent outcomes, fast commits ensure idempotence during
the replay.
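
The argument can be checked with a toy model. The Python sketch below treats
the directory as a dictionary from name to inode number and the link counts as
a second dictionary; the tuple encoding of the log and the function name are
purely illustrative and have nothing to do with the on-disk tag format.

```python
def replay(log, dirents, nlink):
    """Enforce each recorded outcome on a toy filesystem state, in order."""
    for tag, *args in log:
        if tag == "unlink":
            # Removing a dirent that is already gone is simply skipped.
            dirents.pop(args[0], None)
        elif tag == "link":
            name, ino = args
            dirents[name] = ino
        elif tag == "inode":
            ino, count = args
            nlink[ino] = count  # the recorded refcount overrides whatever is there


# Outcome log for "rm A; mv B A", with A -> inode 10 and B -> inode 11 before.
FC_LOG = [("unlink", "A"), ("link", "A", 11), ("unlink", "B"), ("inode", 11, 1)]
```

Replaying the first three entries, "crashing", and then replaying the whole
log again leaves the same state as one clean replay: A points to inode 11 and
B is gone.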

Journal Checkpoint
~~~~~~~~~~~~~~~~~~

Checkpointing the journal ensures all transactions and their associated buffers
are submitted to the disk. In-progress transactions are waited upon and included
in the checkpoint. Checkpointing is used internally during critical updates to
the filesystem including journal recovery, filesystem resizing, and freeing of
the journal_t structure.

A journal checkpoint can be triggered from userspace via the ioctl
EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
can be used to verify input to the ioctl. It returns an error if there is any
invalid input; otherwise it returns success without performing
any checkpointing. This can be used to check whether the ioctl exists on a
system and to verify there are no issues with arguments or flags. The
other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
discarded or zero-filled, respectively, after the journal checkpoint is
complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
cannot both be set. The ioctl may be useful when snapshotting a system or for
complying with content deletion SLOs.
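
The flag rules described above amount to a small validity check, sketched here
in Python. The numeric flag values are assumptions made for the example and
should be checked against the kernel headers; the sketch only models the
argument validation, not the ioctl call itself.

```python
# Assumed values for illustration; confirm against the ext4 UAPI definitions.
FLAG_DISCARD = 0x1
FLAG_ZEROOUT = 0x2
FLAG_DRY_RUN = 0x4
VALID_FLAGS = FLAG_DISCARD | FLAG_ZEROOUT | FLAG_DRY_RUN


def checkpoint_flags_ok(flags: int) -> bool:
    """Return True if `flags` is a combination the text describes as valid."""
    if flags & ~VALID_FLAGS:
        return False  # unknown bits are invalid input
    if flags & FLAG_DISCARD and flags & FLAG_ZEROOUT:
        return False  # discard and zeroout cannot both be set
    return True
```

A dry run combined with either discard or zeroout passes the check, while
discard and zeroout together do not.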