cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack

tcmu-design.rst (14378B)


====================
TCM Userspace Design
====================


.. Contents:

   1) Design
     a) Background
     b) Benefits
     c) Design constraints
     d) Implementation overview
        i. Mailbox
        ii. Command ring
        iii. Data Area
     e) Device discovery
     f) Device events
     g) Other contingencies
   2) Writing a user pass-through handler
     a) Discovering and configuring TCMU uio devices
     b) Waiting for events on the device(s)
     c) Managing the command ring
   3) A final note


Design
======

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel.  TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols.  TCM also modularizes the data storage.  There are existing
modules for file, block device, RAM or using another SCSI device as
storage.  These are called "backstores" or "storage engines".  These
built-in modules are implemented entirely as kernel code.

Background
----------

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage as well. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits
--------

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is there are more distinct components to configure, and
potentially to malfunction. This is unavoidable, but hopefully not
fatal if we're careful to keep things as simple as possible.

Design constraints
------------------

- Good performance: high throughput, low latency
- Cleanly handle if userspace:

   1) never attaches
   2) hangs
   3) dies
   4) misbehaves

- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview
-----------------------

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region is: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.

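The offset-only convention can be illustrated with a minimal sketch. The
helper names ``region_ptr`` and ``region_off`` are hypothetical and are not
part of the TCMU interface; they only show that an offset stays valid no
matter where the region happens to be mapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: turn an offset stored in the shared region into a
 * pointer that is valid for the current mapping. */
static inline void *region_ptr(void *region_base, size_t offset)
{
    return (uint8_t *)region_base + offset;
}

/* Hypothetical helper: the inverse, for code that wants to record a
 * location inside the region in a mapping-independent way. */
static inline size_t region_off(const void *region_base, const void *ptr)
{
    return (size_t)((const uint8_t *)ptr - (const uint8_t *)region_base);
}
```

If a handler restarts and mmap() returns a different address, the same
offset passed to ``region_ptr`` with the new base still refers to the same
byte of the shared region.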
The Mailbox
-----------

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)

flags:
    - TCMU_MAILBOX_FLAG_CAP_OOOC:
        indicates out-of-order completion is supported.
        See "The Command Ring" for details.

cmdr_off
        The offset of the start of the command ring from the start
        of the memory region, to account for the mailbox size.
cmdr_size
        The size of the command ring. This does *not* need to be a
        power of two.
cmd_head
        Modified by the kernel to indicate when a command has been
        placed on the ring.
cmd_tail
        Modified by userspace to indicate when it has completed
        processing of a command.

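As a rough sketch of the checks a handler might run right after mapping the
region, the following uses a simplified stand-in struct, ``ex_mailbox``. It
is an assumption for illustration only: the real ``struct tcmu_mailbox`` in
target_core_user.h has extra alignment and padding, and the bounds checks
beyond the version test are defensive additions, not requirements stated in
this document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct tcmu_mailbox (illustration only). */
struct ex_mailbox {
    uint16_t version;
    uint16_t flags;
    uint32_t cmdr_off;
    uint32_t cmdr_size;
    uint32_t cmd_head;
    uint32_t cmd_tail;
};

/* The version value this document tells userspace to expect. */
#define EX_MAILBOX_VERSION 1

/* Return true if the mailbox looks usable for a region of region_len bytes. */
static bool ex_mailbox_usable(const struct ex_mailbox *mb, size_t region_len)
{
    if (mb->version != EX_MAILBOX_VERSION)
        return false;               /* userspace should abort otherwise */
    if (mb->cmdr_size == 0 || mb->cmdr_off < sizeof(*mb))
        return false;               /* ring must start after the mailbox */
    if ((uint64_t)mb->cmdr_off + mb->cmdr_size > region_len)
        return false;               /* ring must fit inside the mapping */
    return mb->cmd_head < mb->cmdr_size && mb->cmd_tail < mb->cmdr_size;
}
```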
The Command Ring
----------------

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

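The head and tail arithmetic described above can be sketched as follows.
The function names are hypothetical. A true modulo is used rather than a
bit mask because cmdr_size is not required to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Advance a ring offset by len bytes, wrapping at ring_size. */
static inline uint32_t ring_advance(uint32_t off, uint32_t len,
                                    uint32_t ring_size)
{
    return (off + len) % ring_size;
}

/* The ring is empty when head and tail are equal. */
static inline bool ring_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}

/* Bytes of entries between tail and head that userspace has not yet
 * completed. */
static inline uint32_t ring_pending(uint32_t head, uint32_t tail,
                                    uint32_t ring_size)
{
    return (head + ring_size - tail) % ring_size;
}
```

For example, with a 100-byte ring, advancing offset 90 by a 20-byte entry
wraps around to offset 10.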
TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

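Because entries are 8-byte aligned, the low three bits of a length are
always zero, which is what leaves room for the opcode. The sketch below
assumes a 3-bit opcode mask of 0x7, which is believed to match TCMU_OP_MASK
in target_core_user.h but should be checked against that header. The
``ex_`` names are hypothetical, and real code should use the header's own
tcmu_hdr_get_op() and tcmu_hdr_get_len() helpers.

```c
#include <stdint.h>

/* Assumed to match TCMU_OP_MASK in target_core_user.h. */
#define EX_OP_MASK 0x7u

/* Extract the opcode from the low bits of len_op. */
static inline uint32_t ex_hdr_get_op(uint32_t len_op)
{
    return len_op & EX_OP_MASK;
}

/* Extract the entry length (a multiple of 8) from len_op. */
static inline uint32_t ex_hdr_get_len(uint32_t len_op)
{
    return len_op & ~EX_OP_MASK;
}

/* Combine a length and an opcode into one len_op value. */
static inline uint32_t ex_hdr_pack(uint32_t len, uint32_t op)
{
    return (len & ~EX_OP_MASK) | (op & EX_OP_MASK);
}
```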
Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.

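To make the "iov_base is an offset" rule concrete, here is a minimal sketch
that gathers the bytes described by an iovec array into a linear buffer.
The function ``gather_from_region`` is hypothetical; it only assumes that
each iov_base holds an offset from the start of the shared region, as
described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Copy the data described by iov[0..iov_cnt) into dst, treating each
 * iov_base as an offset from the start of the shared region rather than
 * as a pointer. Returns the number of bytes copied (at most dst_len). */
static size_t gather_from_region(void *region, const struct iovec *iov,
                                 size_t iov_cnt, void *dst, size_t dst_len)
{
    size_t copied = 0;

    for (size_t i = 0; i < iov_cnt && copied < dst_len; i++) {
        size_t off = (size_t)(uintptr_t)iov[i].iov_base;
        size_t n = iov[i].iov_len;

        if (n > dst_len - copied)
            n = dst_len - copied;   /* do not overrun the destination */
        memcpy((uint8_t *)dst + copied, (uint8_t *)region + off, n);
        copied += n;
    }
    return copied;
}
```

A handler for a write-type command could call this with the mapped region,
req.iov and req.iov_cnt to collect the Data-Out payload before passing it
to its storage backend.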
When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the length stored in entry.hdr.len_op (mod
cmdr_size) and signals the kernel via the UIO method, a 4-byte write
to the file descriptor.

If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions. In this case, userspace
can handle commands in an order different from the one in which they
were queued. Since the kernel still processes the commands in the same
order they appear in the command ring, userspace needs to update the
cmd->id when completing the command (in other words, steal the
original command's entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area
-------------

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery
----------------

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format::

        tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at::

        /sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

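The name format above lends itself to a small parser. The sketch below is
only an illustration: the struct ``tcmu_name``, the function names, and the
field size limits are assumptions, not part of TCMU. It splits a UIO name
into its components and builds the corresponding configfs path, assuming
the usual mount point.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the parsed parts of a TCMU UIO name. */
struct tcmu_name {
    unsigned int hba_num;
    char device[64];
    char subtype[64];
    char path[128];     /* handler-specific, may be empty */
};

/* Parse "tcm-user/<hba_num>/<device_name>/<subtype>/<path>".
 * Returns 0 on success, -1 if the name is not in the expected format. */
static int parse_tcmu_name(const char *name, struct tcmu_name *out)
{
    int n = 0;

    out->path[0] = '\0';
    if (sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]%n",
               &out->hba_num, out->device, out->subtype, &n) < 3)
        return -1;
    if (name[n] == '/')             /* everything after subtype is <path> */
        snprintf(out->path, sizeof(out->path), "%s", name + n + 1);
    else if (name[n] != '\0')       /* subtype was too long for the buffer */
        return -1;
    return 0;
}

/* Build the configfs directory for the device, assuming the usual
 * /sys/kernel/config mount point. */
static int tcmu_configfs_path(const struct tcmu_name *t, char *buf, size_t len)
{
    return snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                    t->hba_num, t->device);
}
```

A handler would typically compare the parsed subtype against its own known
string and skip any device that does not match.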
For all devices so discovered, the user handler opens /dev/uioX and
calls mmap()::

        mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events
-------------

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies
-------------------

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds.)

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
=======================================================

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices::

      #include <fcntl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* error checking omitted for brevity */

      int fd, dev_fd;
      ssize_t ret;
      char buf[256];
      unsigned long long map_len;
      void *map;

      fd = open("/sys/class/uio/uio0/name", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      /* we only want uio devices whose name is a format we expect */
      if (strncmp(buf, "tcm-user", 8))
        exit(-1);

      /* Further checking for subtype also needed here */

      /* uio0 is hard-coded in this example; a real handler scans every uioX */
      fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      map_len = strtoull(buf, NULL, 0);

      dev_fd = open("/dev/uio0", O_RDWR);
      map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)::

      while (1) {
        char buf[4];

        /* blocks until the kernel signals an event on the UIO device */
        int ret = read(dev_fd, buf, 4);

        handle_device_events(dev_fd, map);
      }


c) Managing the command ring::

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <linux/target_core_user.h>

      /*
       * SCSI_NO_SENSE and SCSI_CHECK_CONDITION are not provided by
       * target_core_user.h. The handler is assumed to define them with
       * the SCSI status values (0x00 and 0x02); see "A final note".
       */

      int handle_device_events(int fd, void *map)
      {
        struct tcmu_mailbox *mb = map;
        struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
        int did_some_work = 0;

        /* Process events from cmd ring until we catch up with cmd_head */
        while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

          if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
            /* cdb_off is relative to the start of the region, not the entry */
            uint8_t *cdb = (void *)mb + ent->req.cdb_off;
            bool success = true;

            /* Handle command here. */
            printf("SCSI opcode: 0x%x\n", cdb[0]);

            /* Set response fields */
            if (success)
              ent->rsp.scsi_status = SCSI_NO_SENSE;
            else {
              /* Also fill in rsp->sense_buffer here */
              ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
            }
          }
          else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
            /* Tell the kernel we didn't handle unknown opcodes */
            ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
          }
          else {
            /* Do nothing for PAD entries except update cmd_tail */
          }

          /* update cmd_tail; tcmu_hdr_get_len() takes the len_op value */
          mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
          ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
          did_some_work = 1;
        }

        /* Notify the kernel that work has been finished */
        if (did_some_work) {
          uint32_t buf = 0;

          write(fd, &buf, 4);
        }

        return 0;
      }


A final note
============

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
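
As a hedged illustration of this note, the sketch below defines the two
status values with their SCSI-specification encodings and fills in
fixed-format sense data for a CHECK CONDITION. The names prefixed with
``EX_`` and the function ``ex_set_check_condition`` are hypothetical. The
caller is assumed to pass a sense buffer of at least 18 bytes, such as
rsp.sense_buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Status codes as encoded by the SCSI specifications. */
enum {
    EX_STATUS_GOOD            = 0x00,
    EX_STATUS_CHECK_CONDITION = 0x02,
};

/* Fill fixed-format sense data (response code 0x70) and return the
 * status byte that should accompany it. */
static uint8_t ex_set_check_condition(uint8_t *sense, size_t sense_len,
                                      uint8_t key, uint8_t asc, uint8_t ascq)
{
    if (sense_len >= 18) {
        memset(sense, 0, sense_len);
        sense[0] = 0x70;            /* current error, fixed format */
        sense[2] = key & 0x0f;      /* sense key */
        sense[7] = 10;              /* additional sense length (bytes 8..17) */
        sense[12] = asc;            /* additional sense code */
        sense[13] = ascq;           /* additional sense code qualifier */
    }
    return EX_STATUS_CHECK_CONDITION;
}
```

For instance, an unsupported command could be reported with sense key 0x05
(ILLEGAL REQUEST) and additional sense code 0x20 (INVALID COMMAND OPERATION
CODE).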