cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

ioctl.rst (10606B)


======================
ioctl based interfaces
======================

ioctl() is the most common way for applications to interface
with device drivers. It is flexible and easily extended by adding new
commands and can be passed through character devices, block devices as
well as sockets and other special file descriptors.

However, it is also very easy to get ioctl command definitions wrong,
and hard to fix them later without breaking existing applications,
so this documentation tries to help developers get it right.

Command number definitions
==========================

The command number, or request number, is the second argument passed to
the ioctl system call. While this can be any 32-bit number that uniquely
identifies an action for a particular driver, there are a number of
conventions around defining them.

``include/uapi/asm-generic/ioctl.h`` provides four macros for defining
ioctl commands that follow modern conventions: ``_IO``, ``_IOR``,
``_IOW``, and ``_IOWR``. These should be used for all new commands,
with the correct parameters:

_IO/_IOR/_IOW/_IOWR
   The macro name specifies how the argument will be used.  It may be a
   pointer to data to be passed into the kernel (_IOW), out of the kernel
   (_IOR), or both (_IOWR).  _IO can indicate either commands with no
   argument or those passing an integer value instead of a pointer.
   It is recommended to only use _IO for commands without arguments,
   and use pointers for passing data.

type
   An 8-bit number, often a character literal, specific to a subsystem
   or driver, and listed in Documentation/userspace-api/ioctl/ioctl-number.rst

nr
   An 8-bit number identifying the specific command, unique for a given
   value of 'type'.

data_type
   The name of the data type pointed to by the argument. The command number
   encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer,
   leading to a limit of 8191 bytes for the maximum size of the argument.
   Note: do not pass ``sizeof(data_type)`` itself into _IOR/_IOW/_IOWR, as
   that will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t).
   _IO does not have a data_type parameter.
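
The four macros can be illustrated with a small user space sketch. The
structure ``demo_config``, the ``DEMO_IOC_*`` names and the type value ``'D'``
are hypothetical and were not checked against ioctl-number.rst; only the
``_IO*`` and ``_IOC_*`` macros come from ``<linux/ioctl.h>``.

```c
#include <assert.h>        /* static_assert */
#include <stddef.h>        /* size_t */
#include <linux/ioctl.h>   /* _IO, _IOR, _IOW, _IOWR and the _IOC_* decoders */
#include <linux/types.h>   /* __u32, __u64 */

/* Hypothetical argument structure: naturally aligned, no implicit padding. */
struct demo_config {
    __u32 mode;
    __u32 flags;
    __u64 timeout_ns;
};

/* Hypothetical 'type' value; a real driver must reserve one in ioctl-number.rst. */
#define DEMO_IOC_MAGIC 'D'

#define DEMO_IOC_RESET       _IO(DEMO_IOC_MAGIC, 0)                       /* no argument */
#define DEMO_IOC_GET_CONFIG  _IOR(DEMO_IOC_MAGIC, 1, struct demo_config)  /* kernel -> user */
#define DEMO_IOC_SET_CONFIG  _IOW(DEMO_IOC_MAGIC, 2, struct demo_config)  /* user -> kernel */
#define DEMO_IOC_XCHG_CONFIG _IOWR(DEMO_IOC_MAGIC, 3, struct demo_config) /* both directions */

/* Wrong on purpose: the macro applies sizeof() again, so this encodes sizeof(size_t). */
#define DEMO_IOC_BAD_SET     _IOW(DEMO_IOC_MAGIC, 4, sizeof(struct demo_config))

/* The size field of each command can be checked at compile time. */
static_assert(_IOC_SIZE(DEMO_IOC_RESET) == 0,
              "_IO encodes no argument size");
static_assert(_IOC_SIZE(DEMO_IOC_GET_CONFIG) == sizeof(struct demo_config),
              "data_type is encoded as sizeof(data_type)");
static_assert(_IOC_SIZE(DEMO_IOC_BAD_SET) == sizeof(size_t),
              "passing sizeof(...) encodes sizeof(size_t) instead");
```

The ``_IOC_DIR()``, ``_IOC_TYPE()``, ``_IOC_NR()`` and ``_IOC_SIZE()`` macros
decode the four fields again, which is convenient when checking a definition.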

Interface versions
==================

Some subsystems use version numbers in data structures to overload
commands with different interpretations of the argument.

This is generally a bad idea, since changes to existing commands tend
to break existing applications.

A better approach is to add a new ioctl command with a new number. The
old command still needs to be implemented in the kernel for compatibility,
but this can be a wrapper around the new implementation.
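
The wrapper approach might look roughly like the following user space mock.
All names (``demo_limit_v1``, ``demo_limit_v2``, ``demo_dispatch()`` and the
``DEMO_IOC_SET_LIMIT*`` commands) are hypothetical, and plain pointers stand in
for the user space copies a real handler would perform.

```c
#include <assert.h>        /* static_assert */
#include <errno.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Original argument layout; must keep working for existing applications. */
struct demo_limit_v1 {
    __u32 max_items;
    __u32 reserved;
};

/* Extended layout, introduced together with a new command number. */
struct demo_limit_v2 {
    __u32 max_items;
    __u32 flags;
    __u64 max_bytes;   /* 0 means "no byte limit" in this sketch */
};

#define DEMO_IOC_SET_LIMIT  _IOW('D', 5, struct demo_limit_v1)  /* old command */
#define DEMO_IOC_SET_LIMIT2 _IOW('D', 6, struct demo_limit_v2)  /* new command */

static_assert(DEMO_IOC_SET_LIMIT != DEMO_IOC_SET_LIMIT2,
              "old and new commands need distinct numbers");

/* Stand-in for driver state. */
struct demo_state {
    __u32 max_items;
    __u32 flags;
    __u64 max_bytes;
};

/* New implementation: the only place that contains the real logic. */
static long demo_set_limit2(struct demo_state *st, const struct demo_limit_v2 *arg)
{
    if (arg->flags != 0)
        return -EINVAL;   /* unknown flags are rejected */
    st->max_items = arg->max_items;
    st->flags = arg->flags;
    st->max_bytes = arg->max_bytes;
    return 0;
}

/* Old command: converts the old layout and calls the new implementation. */
static long demo_set_limit(struct demo_state *st, const struct demo_limit_v1 *old)
{
    struct demo_limit_v2 arg = {
        .max_items = old->max_items,
        .flags = 0,
        .max_bytes = 0,
    };

    return demo_set_limit2(st, &arg);
}

long demo_dispatch(struct demo_state *st, unsigned int cmd, const void *arg)
{
    switch (cmd) {
    case DEMO_IOC_SET_LIMIT:
        return demo_set_limit(st, arg);
    case DEMO_IOC_SET_LIMIT2:
        return demo_set_limit2(st, arg);
    default:
        return -ENOTTY;
    }
}
```

Existing binaries keep issuing ``DEMO_IOC_SET_LIMIT`` unchanged, while new
code can use the extended structure without any version field.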

Return code
===========

ioctl commands can return negative error codes as documented in errno(3);
these get turned into errno values in user space. On success, the return
code should be zero. It is also possible but not recommended to return
a positive 'long' value.

When the ioctl callback is called with an unknown command number, the
handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in
-ENOTTY being returned from the system call. Some subsystems return
-ENOSYS or -EINVAL here for historic reasons, but this is wrong.

Prior to Linux 5.5, compat_ioctl handlers were required to return
-ENOIOCTLCMD in order to use the fallback conversion into native
commands. As all subsystems are now responsible for handling compat
mode themselves, this is no longer needed, but it may be important to
consider when backporting bug fixes to older kernels.
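
From user space, the error convention for unknown commands can be observed
with a short probe. This is only a sketch: ``DEMO_IOC_UNKNOWN`` is a
hypothetical command number, and the example assumes that ``/dev/null`` can be
opened and does not implement that command.

```c
#include <assert.h>        /* static_assert */
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/ioctl.h>

/* Hypothetical command number that /dev/null is assumed not to implement. */
#define DEMO_IOC_UNKNOWN _IO('D', 0x7f)

static_assert(_IOC_SIZE(DEMO_IOC_UNKNOWN) == 0,
              "_IO commands carry no argument size");

/*
 * Issue the unknown command and report what user space sees.
 * Returns the errno value set by ioctl(), 0 if the call unexpectedly
 * succeeded, or -1 if /dev/null could not be opened.
 */
int demo_probe_unknown_ioctl(void)
{
    int fd, ret, saved_errno;

    fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;

    errno = 0;
    ret = ioctl(fd, DEMO_IOC_UNKNOWN, 0);
    saved_errno = errno;   /* save before close() can overwrite it */
    close(fd);

    return ret == -1 ? saved_errno : 0;
}
```

The negative value returned inside the kernel never reaches the application
directly: the C library returns -1 and stores the positive error number in
``errno``.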

Timestamps
==========

Traditionally, timestamps and timeout values are passed as ``struct
timespec`` or ``struct timeval``, but these are problematic because of
incompatible definitions of these structures in user space after the
move to 64-bit time_t.

The ``struct __kernel_timespec`` type can be used instead to be embedded
in other data structures when separate second/nanosecond values are
desired, or passed to user space directly. This is still not ideal though,
as the structure matches neither the kernel's timespec64 nor the user
space timespec exactly. The get_timespec64() and put_timespec64() helper
functions can be used to ensure that the layout remains compatible with
user space and the padding is treated correctly.

As it is cheap to convert seconds to nanoseconds, but the opposite
requires an expensive 64-bit division, a simple __u64 nanosecond value
can be simpler and more efficient.
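
The asymmetry between the two conversions can be seen in a minimal sketch.
``struct demo_timespec64`` is a hypothetical stand-in that mirrors the two
64-bit fields of ``struct __kernel_timespec``; the helper names are invented
for illustration, and only non-negative times are considered.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in: two 64-bit fields, seconds and nanoseconds. */
struct demo_timespec64 {
    int64_t tv_sec;
    int64_t tv_nsec;
};

static_assert(sizeof(struct demo_timespec64) == 16,
              "two 64-bit members, no padding expected");

/* Seconds plus nanoseconds to nanoseconds: one multiplication and one addition. */
static inline uint64_t demo_ts_to_ns(struct demo_timespec64 ts)
{
    return (uint64_t)ts.tv_sec * DEMO_NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}

/* Nanoseconds back to seconds plus nanoseconds: a 64-bit division and remainder. */
static inline struct demo_timespec64 demo_ns_to_ts(uint64_t ns)
{
    struct demo_timespec64 ts = {
        .tv_sec = (int64_t)(ns / DEMO_NSEC_PER_SEC),
        .tv_nsec = (int64_t)(ns % DEMO_NSEC_PER_SEC),
    };

    return ts;
}
```

An interface that carries a single ``__u64`` nanosecond value only ever needs
the first helper when a caller starts from seconds.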

Timeout values and timestamps should ideally use CLOCK_MONOTONIC time,
as returned by ktime_get_ns() or ktime_get_ts64().  Unlike
CLOCK_REALTIME, this makes the timestamps immune to jumping backwards
or forwards due to leap second adjustments and clock_settime() calls.

ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that
need to be persistent across a reboot or between multiple machines.

32-bit compat mode
==================

In order to support 32-bit user space running on a 64-bit machine, each
subsystem or driver that implements an ioctl callback handler must also
implement the corresponding compat_ioctl handler.

As long as all the rules for data structures are followed, this is as
easy as setting the .compat_ioctl pointer to a helper function such as
compat_ptr_ioctl() or blkdev_compat_ptr_ioctl().

compat_ptr()
------------

On the s390 architecture, 31-bit user space has ambiguous representations
for data pointers, with the upper bit being ignored. When running such
a process in compat mode, the compat_ptr() helper must be used to
clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit
pointer.  On other architectures, this macro only performs a cast to a
``void __user *`` pointer.

In a compat_ioctl() callback, the last argument is an unsigned long,
which can be interpreted as either a pointer or a scalar depending on
the command. If it is a scalar, then compat_ptr() must not be used, to
ensure that the 64-bit kernel behaves the same way as a 32-bit kernel
for arguments with the upper bit set.
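
The difference between pointer and scalar arguments can be modelled in user
space. The functions below are hypothetical models, not the kernel's
compat_ptr() itself; they return integer addresses so that no invalid pointer
is ever created.

```c
#include <assert.h>   /* static_assert */
#include <stdint.h>

/* Stand-in for the kernel's compat_uptr_t: a 32-bit user pointer value. */
typedef uint32_t demo_compat_uptr_t;

static_assert(sizeof(demo_compat_uptr_t) == 4, "compat pointers are 32 bits wide");

/* Model of the s390 behaviour: clear the upper bit, then widen to 64 bits. */
static inline uint64_t demo_compat_ptr_s390(demo_compat_uptr_t uptr)
{
    return (uint64_t)(uptr & 0x7fffffffU);
}

/* Model of the other architectures: a plain zero-extending conversion. */
static inline uint64_t demo_compat_ptr_generic(demo_compat_uptr_t uptr)
{
    return (uint64_t)uptr;
}

/* Scalar argument: taken as-is, the upper bit is part of the value. */
static inline uint64_t demo_compat_scalar(uint32_t arg)
{
    return (uint64_t)arg;
}
```

With the input ``0x80001000``, the s390 model yields ``0x1000`` while the
scalar path keeps ``0x80001000``, which is why applying compat_ptr() to a
scalar would silently change its value on s390.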

The compat_ptr_ioctl() helper can be used in place of a custom
compat_ioctl file operation for drivers that only take arguments that
are pointers to compatible data structures.

Structure layout
----------------

Compatible data structures have the same layout on all architectures,
avoiding all problematic members:

* ``long`` and ``unsigned long`` are the size of a register, so
  they can be either 32-bit or 64-bit wide and cannot be used in portable
  data structures. Fixed-length replacements are ``__s32``, ``__u32``,
  ``__s64`` and ``__u64``.

* Pointers have the same problem, in addition to requiring the
  use of compat_ptr(). The best workaround is to use ``__u64``
  in place of pointers, which requires a cast to ``uintptr_t`` in user
  space, and the use of u64_to_user_ptr() in the kernel to convert
  it back into a user pointer.

* On the x86-32 (i386) architecture, the alignment of 64-bit variables
  is only 32-bit, but they are naturally aligned on most other
  architectures including x86-64. This means a structure like::

    struct foo {
        __u32 a;
        __u64 b;
        __u32 c;
    };

  has four bytes of padding between a and b on x86-64, plus another four
  bytes of padding at the end, but no padding on i386, and it needs a
  compat_ioctl conversion handler to translate between the two formats.

  To avoid this problem, all structures should have their members
  naturally aligned, or explicit reserved fields added in place of the
  implicit padding. The ``pahole`` tool can be used for checking the
  alignment.

* On ARM OABI user space, structures are padded to multiples of 32 bits,
  making some structs incompatible with modern EABI kernels if they
  do not end on a 32-bit boundary.

* On the m68k architecture, struct members are not guaranteed to have an
  alignment greater than 16 bits, which is a problem when relying on
  implicit padding.

* Bitfields and enums generally work as one would expect them to,
  but some properties of them are implementation-defined, so it is better
  to avoid them completely in ioctl interfaces.

* ``char`` members can be either signed or unsigned, depending on
  the architecture, so the __u8 and __s8 types should be used for 8-bit
  integer values, though char arrays are clearer for fixed-length strings.
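
Two of the rules above, explicit reserved fields and ``__u64`` in place of
pointers, can be sketched as follows. ``struct demo_foo``, ``struct demo_buf``
and ``demo_buf_set()`` are hypothetical names; the compile-time checks only
confirm the layout on the architecture the code is built for.

```c
#include <assert.h>        /* static_assert */
#include <stddef.h>        /* offsetof */
#include <stdint.h>        /* uintptr_t */
#include <linux/types.h>   /* __u32, __u64 */

/* Same members as struct foo above, with the implicit padding made explicit. */
struct demo_foo {
    __u32 a;
    __u32 reserved0;   /* replaces the hidden padding between a and b */
    __u64 b;
    __u32 c;
    __u32 reserved1;   /* replaces the hidden padding at the end */
};

static_assert(offsetof(struct demo_foo, b) == 8, "b is naturally aligned");
static_assert(sizeof(struct demo_foo) == 24, "no implicit padding left");

/* Buffer descriptor that carries a user pointer as a 64-bit integer. */
struct demo_buf {
    __u64 data;        /* user space address, not a pointer type */
    __u32 len;
    __u32 reserved;
};

static_assert(sizeof(struct demo_buf) == 16, "identical for 32-bit and 64-bit callers");

/* User space side: go through uintptr_t so 32-bit pointers are zero-extended. */
static inline void demo_buf_set(struct demo_buf *buf, void *data, __u32 len)
{
    buf->data = (__u64)(uintptr_t)data;
    buf->len = len;
    buf->reserved = 0;   /* reserved fields are always written as zero here */
}
```

On the kernel side, the ``data`` member would be converted back with
u64_to_user_ptr(), as described in the list above.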

Information leaks
=================

Uninitialized data must not be copied back to user space, as this can
cause an information leak, which can be used to defeat kernel address
space layout randomization (KASLR), helping in an attack.

For this reason (and for compat support) it is best to avoid any
implicit padding in data structures.  Where there is implicit padding
in an existing structure, kernel drivers must be careful to fully
initialize an instance of the structure before copying it to user
space.  This is usually done by calling memset() before assigning to
individual members.
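
The memset() pattern can be sketched in user space as follows.
``struct demo_legacy_reply`` and ``demo_fill_reply()`` are hypothetical; in a
driver, the filled structure would then be handed to copy_to_user().

```c
#include <assert.h>        /* static_assert */
#include <stddef.h>        /* offsetof */
#include <string.h>        /* memset */
#include <linux/types.h>   /* __u8, __u64 */

/* Hypothetical existing structure with implicit padding after 'status'. */
struct demo_legacy_reply {
    __u8  status;
    __u64 value;
};

static_assert(offsetof(struct demo_legacy_reply, value) > sizeof(__u8),
              "this sketch assumes padding between status and value");

/* Fill a reply so that every byte, padding included, has a defined value. */
static inline void demo_fill_reply(struct demo_legacy_reply *reply,
                                   __u8 status, __u64 value)
{
    memset(reply, 0, sizeof(*reply));   /* clears the padding bytes as well */
    reply->status = status;
    reply->value = value;
}
```

A designated initializer such as ``= { .status = 1 }`` is not an equivalent
substitute, since the C standard does not guarantee that it clears padding
bytes.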

Subsystem abstractions
======================

While some device drivers implement their own ioctl function, most
subsystems implement the same command for multiple drivers.  Ideally the
subsystem has an .ioctl() handler that copies the arguments from and
to user space, passing them into subsystem specific callback functions
through normal kernel pointers.

This helps in various ways:

* Applications written for one driver are more likely to work for
  another one in the same subsystem if there are no subtle differences
  in the user space ABI.

* The complexity of user space access and data structure layout is
  handled in one place, reducing the potential for implementation bugs.

* It is more likely to be reviewed by experienced developers
  that can spot problems in the interface when the ioctl is shared
  between multiple drivers than when it is only used in a single driver.
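
The split between a subsystem handler and per-driver callbacks might be
sketched as below. This is a user space mock under stated assumptions: every
name is hypothetical, memcpy() stands in for copy_to_user(), and returning
-ENOTTY for a missing callback is a choice made for this sketch.

```c
#include <assert.h>        /* static_assert */
#include <errno.h>
#include <string.h>
#include <linux/ioctl.h>
#include <linux/types.h>

struct demo_info {
    __u32 id;
    __u32 flags;
};

static_assert(sizeof(struct demo_info) == 8, "no implicit padding expected");

#define DEMO_IOC_GET_INFO _IOR('D', 7, struct demo_info)

/* Per-driver callbacks: they only ever see normal pointers. */
struct demo_driver_ops {
    int (*get_info)(void *priv, struct demo_info *info);
};

/* Subsystem-level handler: the one place that touches the "user" buffer. */
long demo_subsys_ioctl(const struct demo_driver_ops *ops, void *priv,
                       unsigned int cmd, void *user_arg)
{
    struct demo_info info;
    int ret;

    switch (cmd) {
    case DEMO_IOC_GET_INFO:
        if (!ops->get_info)
            return -ENOTTY;
        memset(&info, 0, sizeof(info));          /* nothing uninitialized leaves */
        ret = ops->get_info(priv, &info);
        if (ret)
            return ret;
        memcpy(user_arg, &info, sizeof(info));   /* stand-in for copy_to_user() */
        return 0;
    default:
        return -ENOTTY;
    }
}

/* Example driver callback: no user space access, no layout concerns. */
static int demo_dev_get_info(void *priv, struct demo_info *info)
{
    info->id = *(const __u32 *)priv;
    info->flags = 0;
    return 0;
}
```

A second driver would only supply another ``get_info`` callback; the command
number, the structure layout and the copy logic stay shared.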

Alternatives to ioctl
=====================

There are many cases in which ioctl is not the best solution for a
problem. Alternatives include:

* System calls are a better choice for a system-wide feature that
  is not tied to a physical device or constrained by the file system
  permissions of a character device node.

* netlink is the preferred way of configuring any network related
  objects through sockets.

* debugfs is used for ad-hoc interfaces for debugging functionality
  that does not need to be exposed as a stable interface to applications.

* sysfs is a good way to expose the state of an in-kernel object
  that is not tied to a file descriptor.

* configfs can be used for more complex configuration than sysfs.

* A custom file system can provide extra flexibility with a simple
  user interface but adds a lot of complexity to the implementation.