.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc.), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with a file, then load the inode to find the metadata for
that file. ext4 cheats a little (for performance reasons) by storing a
copy of the file type (normally stored in the inode) in the directory
entry. (Compare all this to FAT, which stores all the file information
directly in the directory entry, but does not support hard links and is
in general more seek-happy than ext4 due to its simpler block allocator
and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1
   :class: longtable

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - i_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - __le16
     - i_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - __le32
     - i_size_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - __le32
     - i_atime
     - Last access time, in seconds since the epoch. However, if the EA_INODE
       inode flag is set, this inode stores an extended attribute value and
       this field contains the checksum of the value.
   * - 0xC
     - __le32
     - i_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - __le32
     - i_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - __le32
     - i_dtime
     - Deletion Time, in seconds since the epoch.
   * - 0x18
     - __le16
     - i_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - __le16
     - i_links_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - __le32
     - i_blocks_lo
     - Lower 32-bits of “block” count. If the huge_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in
       ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
       << 32)`` 512-byte blocks on disk. If huge_file is set and
       EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file
       consumes ``i_blocks_lo + (i_blocks_hi << 32)`` filesystem blocks on
       disk.
   * - 0x20
     - __le32
     - i_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i_block[EXT4_N_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i_block”.
   * - 0x64
     - __le32
     - i_generation
     - File version (for NFS).
   * - 0x68
     - __le32
     - i_file_acl_lo
     - Lower 32-bits of extended attribute block. ACLs are only one of many
       possible extended attributes; the field name presumably dates from
       ACLs being the first use of extended attributes.
   * - 0x6C
     - __le32
     - i_size_high / i_dir_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i_dir_acl, though it was usually set to zero and never used.
   * - 0x70
     - __le32
     - i_obso_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - __le16
     - i_extra_isize
     - Size of this inode - 128. Alternately, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - __le16
     - i_checksum_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - __le32
     - i_ctime_extra
     - Extra change time bits. This provides sub-second precision. See the
       Inode Timestamps section.
   * - 0x88
     - __le32
     - i_mtime_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - __le32
     - i_atime_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - __le32
     - i_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - __le32
     - i_crtime_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - __le32
     - i_version_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - __le32
     - i_projid
     - Project ID.

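As an illustration of the layout above, the following Python sketch unpacks
the first 128 bytes of an inode record (offsets 0x0 through 0x7F) into named
fields. It is not kernel code: the names ``BASE_INODE``, ``FIELDS``,
``parse_base_inode`` and ``file_size`` are made up for this example, and the
sketch assumes the caller has already read the raw inode record from disk.

```python
import struct

# Little-endian, unpadded layout of offsets 0x0-0x7F from the table above.
# H = __le16, I = __le32, Ns = N raw bytes (i_block and i_osd2 stay opaque).
BASE_INODE = struct.Struct("<HHIIIIIHHIII60sIIII12s")

FIELDS = (
    "i_mode", "i_uid", "i_size_lo", "i_atime", "i_ctime", "i_mtime",
    "i_dtime", "i_gid", "i_links_count", "i_blocks_lo", "i_flags",
    "i_osd1", "i_block", "i_generation", "i_file_acl_lo", "i_size_high",
    "i_obso_faddr", "i_osd2",
)


def parse_base_inode(raw):
    """Return a dict of the fields stored in the first 128 bytes of `raw`."""
    if len(raw) < BASE_INODE.size:
        raise ValueError("inode record shorter than 128 bytes")
    return dict(zip(FIELDS, BASE_INODE.unpack_from(raw)))


def file_size(inode):
    """Combine i_size_lo and i_size_high into the full size in bytes."""
    return inode["i_size_lo"] | (inode["i_size_high"] << 32)
```

Keeping ``i_block`` and ``i_osd2`` as raw bytes is a deliberate choice here,
since their interpretation depends on the inode flags and on the filesystem
creator.
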
.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S_IXOTH (Others may execute)
   * - 0x2
     - S_IWOTH (Others may write)
   * - 0x4
     - S_IROTH (Others may read)
   * - 0x8
     - S_IXGRP (Group members may execute)
   * - 0x10
     - S_IWGRP (Group members may write)
   * - 0x20
     - S_IRGRP (Group members may read)
   * - 0x40
     - S_IXUSR (Owner may execute)
   * - 0x80
     - S_IWUSR (Owner may write)
   * - 0x100
     - S_IRUSR (Owner may read)
   * - 0x200
     - S_ISVTX (Sticky bit)
   * - 0x400
     - S_ISGID (Set GID)
   * - 0x800
     - S_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S_IFIFO (FIFO)
   * - 0x2000
     - S_IFCHR (Character device)
   * - 0x4000
     - S_IFDIR (Directory)
   * - 0x6000
     - S_IFBLK (Block device)
   * - 0x8000
     - S_IFREG (Regular file)
   * - 0xA000
     - S_IFLNK (Symbolic link)
   * - 0xC000
     - S_IFSOCK (Socket)

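A small Python sketch of how the ``i_mode`` table might be applied: the top
four bits select exactly one file type, while the low twelve bits are
independent permission and special bits. The helper name ``decode_mode`` and
its return shape are assumptions made for this example only.

```python
# File types occupy the top four bits and are mutually exclusive.
FILE_TYPES = {
    0x1000: "S_IFIFO",
    0x2000: "S_IFCHR",
    0x4000: "S_IFDIR",
    0x6000: "S_IFBLK",
    0x8000: "S_IFREG",
    0xA000: "S_IFLNK",
    0xC000: "S_IFSOCK",
}

# Permission bits from most to least significant: owner, group, others.
PERMISSION_BITS = (
    (0x100, "r"), (0x80, "w"), (0x40, "x"),
    (0x20, "r"), (0x10, "w"), (0x8, "x"),
    (0x4, "r"), (0x2, "w"), (0x1, "x"),
)

SPECIAL_BITS = ((0x800, "S_ISUID"), (0x400, "S_ISGID"), (0x200, "S_ISVTX"))


def decode_mode(i_mode):
    """Return (file type name, 'rwxrwxrwx'-style string, special bit names)."""
    file_type = FILE_TYPES.get(i_mode & 0xF000, "unknown")
    perms = "".join(ch if i_mode & bit else "-" for bit, ch in PERMISSION_BITS)
    special = [name for bit, name in SPECIAL_BITS if i_mode & bit]
    return file_type, perms, special
```

For example, ``decode_mode(0x81A4)`` describes a regular file with mode 0644.
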
.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4_SECRM_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4_UNRM_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4_COMPR_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4_SYNC_FL).
   * - 0x10
     - File is immutable (EXT4_IMMUTABLE_FL).
   * - 0x20
     - File can only be appended (EXT4_APPEND_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4_NODUMP_FL).
   * - 0x80
     - Do not update access time (EXT4_NOATIME_FL).
   * - 0x100
     - Dirty compressed file (EXT4_DIRTY_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4_NOCOMPR_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was
       EXT4_ECOMPR_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4_INDEX_FL).
   * - 0x2000
     - AFS magic directory (EXT4_IMAGIC_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4_JOURNAL_DATA_FL).
   * - 0x8000
     - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4_DIRSYNC_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4_TOPDIR_FL).
   * - 0x40000
     - This is a huge file (EXT4_HUGE_FILE_FL).
   * - 0x80000
     - Inode uses extents (EXT4_EXTENTS_FL).
   * - 0x100000
     - Verity protected file (EXT4_VERITY_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4_EA_INODE_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4_INLINE_DATA_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4_PROJINHERIT_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4_RESERVED_FL).
   * -
     - Aggregate flags:
   * - 0x705BDFFF
     - User-visible flags.
   * - 0x604BC0FF
     - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and
       EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's
       EXT4_FL_USER_MODIFIABLE mask, since the kernel needs to handle the
       setting of these flags in a special manner and they are masked out of
       the set of flags that are saved directly to i_flags.

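The flag table lends itself to a simple bitmask decoder. The Python sketch
below copies the values and names from the table; ``I_FLAGS``,
``USER_VISIBLE_MASK`` and ``decode_flags`` are illustrative names rather
than kernel identifiers, and bits not listed in the table are reported
separately instead of being guessed at.

```python
# Values and names copied from the i_flags table above.
I_FLAGS = {
    0x1: "EXT4_SECRM_FL",
    0x2: "EXT4_UNRM_FL",
    0x4: "EXT4_COMPR_FL",
    0x8: "EXT4_SYNC_FL",
    0x10: "EXT4_IMMUTABLE_FL",
    0x20: "EXT4_APPEND_FL",
    0x40: "EXT4_NODUMP_FL",
    0x80: "EXT4_NOATIME_FL",
    0x100: "EXT4_DIRTY_FL",
    0x200: "EXT4_COMPRBLK_FL",
    0x400: "EXT4_NOCOMPR_FL",
    0x800: "EXT4_ENCRYPT_FL",
    0x1000: "EXT4_INDEX_FL",
    0x2000: "EXT4_IMAGIC_FL",
    0x4000: "EXT4_JOURNAL_DATA_FL",
    0x8000: "EXT4_NOTAIL_FL",
    0x10000: "EXT4_DIRSYNC_FL",
    0x20000: "EXT4_TOPDIR_FL",
    0x40000: "EXT4_HUGE_FILE_FL",
    0x80000: "EXT4_EXTENTS_FL",
    0x100000: "EXT4_VERITY_FL",
    0x200000: "EXT4_EA_INODE_FL",
    0x400000: "EXT4_EOFBLOCKS_FL",
    0x01000000: "EXT4_SNAPFILE_FL",
    0x04000000: "EXT4_SNAPFILE_DELETED_FL",
    0x08000000: "EXT4_SNAPFILE_SHRUNK_FL",
    0x10000000: "EXT4_INLINE_DATA_FL",
    0x20000000: "EXT4_PROJINHERIT_FL",
    0x80000000: "EXT4_RESERVED_FL",
}

# Aggregate mask from the table: flags that are visible to user space.
USER_VISIBLE_MASK = 0x705BDFFF


def decode_flags(i_flags):
    """Return (names of set flags in ascending bit order, unrecognised bits)."""
    names = [name for bit, name in sorted(I_FLAGS.items()) if i_flags & bit]
    known = 0
    for bit in I_FLAGS:
        known |= bit
    return names, i_flags & ~known & 0xFFFFFFFF
```

A caller interested only in what user space would see could apply
``i_flags & USER_VISIBLE_MASK`` before decoding.
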
.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - l_i_version
     - Inode version. However, if the EA_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - h_i_translator
     - ??

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le32
     - m_i_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - l_i_blocks_high
     - Upper 16-bits of the block count. Please see the note attached to
       i_blocks_lo.
   * - 0x2
     - __le16
     - l_i_file_acl_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - __le16
     - l_i_uid_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - __le16
     - l_i_gid_high
     - Upper 16-bits of the GID.
   * - 0x8
     - __le16
     - l_i_checksum_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - __le16
     - l_i_reserved
     - Unused.

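Several inode values are split between a low half in the base inode and a
high half in the Linux ``osd2`` area. The following Python sketch shows one
way to reassemble them, including the three cases described in the note on
``i_blocks_lo``. The function names and the ``huge_file_feature`` and
``fs_block_size`` parameters are assumptions of this example; a real reader
would take them from the superblock.

```python
import struct

# Six __le16 fields, 12 bytes in total, in the order of the Linux osd2 table.
LINUX_OSD2 = struct.Struct("<HHHHHH")
OSD2_FIELDS = (
    "l_i_blocks_high", "l_i_file_acl_high", "l_i_uid_high",
    "l_i_gid_high", "l_i_checksum_lo", "l_i_reserved",
)

EXT4_HUGE_FILE_FL = 0x40000  # value from the i_flags table


def parse_linux_osd2(osd2):
    """Split the 12-byte osd2 area into its Linux-specific fields."""
    return dict(zip(OSD2_FIELDS, LINUX_OSD2.unpack(osd2)))


def full_id(low16, high16):
    """Join the lower and upper 16 bits of a UID or GID."""
    return low16 | (high16 << 16)


def bytes_on_disk(i_blocks_lo, l_i_blocks_high, i_flags,
                  huge_file_feature, fs_block_size):
    """Apply the i_blocks_lo note to get the space consumed, in bytes."""
    if not huge_file_feature:
        # Without huge_file only the low 32 bits count, in 512-byte units.
        return i_blocks_lo * 512
    count = i_blocks_lo + (l_i_blocks_high << 32)
    if i_flags & EXT4_HUGE_FILE_FL:
        # With the per-inode flag the count is in filesystem blocks.
        return count * fs_block_size
    return count * 512
```
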
Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - h_i_reserved1
     - ??
   * - 0x2
     - __u16
     - h_i_mode_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - __le16
     - h_i_uid_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - __le16
     - h_i_gid_high
     - Upper 16-bits of the GID.
   * - 0x8
     - __u32
     - h_i_author
     - Author code?

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - __le16
     - h_i_reserved1
     - ??
   * - 0x2
     - __u16
     - m_i_file_acl_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - __u32
     - m_i_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by ``struct ext4_inode`` beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows ``struct ext4_inode`` to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
``EXT2_GOOD_OLD_INODE_SIZE`` should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
structure and the end of the inode record can be used to store extended
attributes. Each inode record can be as large as the filesystem block
size, though this is not terribly efficient.

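The bounds check described above can be sketched in Python as follows. The
helper ``read_extra_field`` is hypothetical; it takes a raw inode record, an
absolute field offset from the inode table, and a ``struct`` format string,
and it declines to read any field that ``i_extra_isize`` does not cover.

```python
import struct

EXT2_GOOD_OLD_INODE_SIZE = 128
I_EXTRA_ISIZE_OFFSET = 0x80  # offset of i_extra_isize in the inode table


def read_extra_field(raw, offset, fmt):
    """Return the field at `offset`, or None if i_extra_isize excludes it."""
    if len(raw) < I_EXTRA_ISIZE_OFFSET + 2:
        # A 128-byte record has no i_extra_isize and no extended fields.
        return None
    (extra,) = struct.unpack_from("<H", raw, I_EXTRA_ISIZE_OFFSET)
    end = offset + struct.calcsize(fmt)
    if end > EXT2_GOOD_OLD_INODE_SIZE + extra or end > len(raw):
        return None
    return struct.unpack_from(fmt, raw, offset)[0]
```

With ``i_extra_isize = 32`` every field up to and including ``i_projid`` at
0x9C passes the check, because 0x9C + 4 equals 128 + 32.
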
Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.

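The three formulas above translate directly into a few lines of Python. The
function name ``locate_inode`` is made up for this sketch, and the two
superblock values are passed in as plain integers rather than read from
disk.

```python
def locate_inode(inode_num, inodes_per_group, inode_size):
    """Return (block group, index in group, byte offset in inode table)."""
    if inode_num < 1:
        raise ValueError("inode 0 does not exist")
    # Integer division gives the group, the remainder the index within it.
    bg, index = divmod(inode_num - 1, inodes_per_group)
    return bg, index, index * inode_size
```

For instance, with 8192 inodes per group and 256-byte records, inode 8193 is
the first entry of block group 1.
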
Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. If the filesystem does not have the orphan_file feature,
inodes that are not linked from any directory but are still open (orphan
inodes) have the dtime field overloaded for use with the orphan list. The
superblock field ``s_last_orphan`` points to the first inode in the orphan
list; dtime is then the number of the next orphaned inode, or zero if
there are no more orphans.

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64 bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv_sec
     - Decoded 64-bit tv_sec
     - Valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

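The decoding rule summarised in the table can be written as a short Python
sketch. ``decode_timestamp`` is an illustrative name; it takes the raw
32-bit seconds field and the matching 32-bit ``*_extra`` field as unsigned
integers, as they would come out of a little-endian read, and follows the
rule stated above rather than any particular kernel version's behaviour.

```python
import struct


def decode_timestamp(seconds_field, extra_field):
    """Return (64-bit tv_sec, nanoseconds) from the two on-disk fields."""
    # Reinterpret the unsigned 32-bit value as a signed 32-bit integer.
    (seconds,) = struct.unpack("<i", struct.pack("<I", seconds_field))
    epoch_bits = extra_field & 0x3   # lower two bits extend the seconds
    nanoseconds = extra_field >> 2   # upper 30 bits hold the nanoseconds
    return seconds + (epoch_bits << 32), nanoseconds
```

For example, a seconds field of 0x80000000 with epoch bits ``0 1`` decodes
to 0x080000000, the first value of the third row of the table.
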
This is a somewhat odd encoding since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and 1970.
At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.