====================
TCM Userspace Design
====================


.. Contents:

   1) Design
     a) Background
     b) Benefits
     c) Design constraints
     d) Implementation overview
        i. Mailbox
        ii. Command ring
        iii. Data Area
     e) Device discovery
     f) Device events
     g) Other contingencies
   2) Writing a user pass-through handler
     a) Discovering and configuring TCMU uio devices
     b) Waiting for events on the device(s)
     c) Managing the command ring
   3) A final note


Design
======

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel.  TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols.  TCM also modularizes the data storage.  There are existing
modules for file, block device, RAM or using another SCSI device as
storage.  These are called "backstores" or "storage engines".  These
built-in modules are implemented entirely as kernel code.

Background
----------

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage as well. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore.
The target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".


Benefits
--------

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is there are more distinct components to configure, and
potentially to malfunction. This is unavoidable, but hopefully not
fatal if we're careful to keep things as simple as possible.

Design constraints
------------------

- Good performance: high throughput, low latency
- Cleanly handle if userspace:

   1) never attaches
   2) hangs
   3) dies
   4) misbehaves

- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend


Implementation overview
-----------------------

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region is: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.

The Mailbox
-----------

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail pointers to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)

flags:
    - TCMU_MAILBOX_FLAG_CAP_OOOC:
        indicates out-of-order completion is supported.
        See "The Command Ring" for details.

cmdr_off
    The offset of the start of the command ring from the start
    of the memory region, to account for the mailbox size.
cmdr_size
    The size of the command ring. This does *not* need to be a
    power of two.
cmd_head
    Modified by the kernel to indicate when a command has been
    placed on the ring.
cmd_tail
    Modified by userspace to indicate when it has completed
    processing of a command.
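
As a rough illustration of the offset-only convention, the sketch below
resolves the next ring entry from a mailbox. ``struct demo_mailbox`` is a
hypothetical, simplified stand-in that only mirrors the fields listed
above; the authoritative layout is in target_core_user.h. The helper
names ``region_ptr``, ``mailbox_version_ok`` and ``next_entry`` are
likewise assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the mailbox described above.
 * The real definition lives in target_core_user.h and may differ in
 * field order, padding and alignment.
 */
struct demo_mailbox {
	uint16_t version;
	uint16_t flags;
	uint32_t cmdr_off;   /* ring start, relative to the region start */
	uint32_t cmdr_size;  /* ring size in bytes */
	uint32_t cmd_head;   /* advanced by the kernel */
	uint32_t cmd_tail;   /* advanced by userspace */
};

/* Turn a region-relative offset into a pointer for this mapping. */
static void *region_ptr(void *region_base, size_t off)
{
	return (uint8_t *)region_base + off;
}

/* Userspace should give up on any mailbox version other than 1. */
static int mailbox_version_ok(const struct demo_mailbox *mb)
{
	return mb->version == 1;
}

/* Next entry for userspace to look at: ring start plus cmd_tail. */
static void *next_entry(struct demo_mailbox *mb)
{
	assert(mb->cmd_tail < mb->cmdr_size);
	return region_ptr(mb, (size_t)mb->cmdr_off + mb->cmd_tail);
}
```

Because nothing in the region stores an absolute address, the same
``cmdr_off`` and ``cmd_tail`` values resolve correctly no matter where
mmap() places the region after a handler restart.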

The Command Ring
----------------

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.

TCMU commands are 8-byte aligned. They start with a common header
containing "len_op", a 32-bit value that stores the length, as well as
the opcode in the lowest unused bits. It also contains cmd_id and
flags fields for setting by the kernel (kflags) and userspace
(uflags).

Currently only two opcodes are defined, TCMU_OP_CMD and TCMU_OP_PAD.

When the opcode is CMD, the entry in the command ring is a struct
tcmu_cmd_entry. Userspace finds the SCSI CDB (Command Data Block) via
tcmu_cmd_entry.req.cdb_off. This is an offset from the start of the
overall shared memory region, not the entry. The data in/out buffers
are accessible via the req.iov[] array. iov_cnt contains the number of
entries in iov[] needed to describe either the Data-In or Data-Out
buffers. For bidirectional commands, iov_cnt specifies how many iovec
entries cover the Data-Out area, and iov_bidi_cnt specifies how many
iovec entries immediately after that in iov[] cover the Data-In
area. Just like other fields, iov.iov_base is an offset from the start
of the region.

When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by the entry's length, taken from hdr.len_op (mod
cmdr_size), and signals the kernel via the UIO method, a 4-byte write
to the file descriptor.

If TCMU_MAILBOX_FLAG_CAP_OOOC is set in mailbox->flags, the kernel is
capable of handling out-of-order completions.
In this case, userspace can handle commands in an order different from
the original. Since the kernel still processes the commands in the same
order they appeared in the command ring, userspace needs to update
cmd->id when completing a command (a.k.a. steal the original command's
entry).

When the opcode is PAD, userspace only updates cmd_tail as above --
it's a no-op. (The kernel inserts PAD entries to ensure each CMD entry
is contiguous within the command ring.)

More opcodes may be added in the future. If userspace encounters an
opcode it does not handle, it must set the UNKNOWN_OP bit (bit 0) in
hdr.uflags, update cmd_tail, and proceed with processing additional
commands, if any.

The Data Area
-------------

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.


Device Discovery
----------------

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format::

	tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at::

	/sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.
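
A handler might split a UIO name of the format shown above into its
components and derive the configfs directory along these lines. This is
only a sketch: ``struct tcmu_uio_name``, ``parse_tcmu_uio_name()`` and
``tcmu_configfs_dir()`` are hypothetical names, the buffer sizes are
arbitrary, and treating a missing <path> as acceptable is an assumption
of this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of a TCMU UIO name. */
struct tcmu_uio_name {
	unsigned int hba_num;
	char device_name[64];
	char subtype[64];
	char path[256];
};

/*
 * Parse "tcm-user/<hba_num>/<device_name>/<subtype>/<path>".
 * The caller is expected to have stripped the trailing newline that
 * sysfs appends. Returns 0 on success, -1 if the name does not match.
 */
static int parse_tcmu_uio_name(const char *name, struct tcmu_uio_name *out)
{
	int n;

	memset(out, 0, sizeof(*out));
	n = sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%255[^\n]",
		   &out->hba_num, out->device_name, out->subtype, out->path);

	/* Assumption: <path> may be absent, so three fields are enough. */
	return n >= 3 ? 0 : -1;
}

/* Build the configfs directory, assuming the usual mount point. */
static int tcmu_configfs_dir(const struct tcmu_uio_name *n,
			     char *buf, size_t len)
{
	int ret = snprintf(buf, len,
			   "/sys/kernel/config/target/core/user_%u/%s",
			   n->hba_num, n->device_name);

	return (ret < 0 || (size_t)ret >= len) ? -1 : 0;
}
```

Names belonging to other UIO drivers fail at the literal "tcm-user/"
prefix, so the same function doubles as the first filter when scanning
class/uio/uio*/name.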

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.

For all devices so discovered, the user handler opens /dev/uioX and
calls mmap()::

	mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.


Device Events
-------------

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.


Other contingencies
-------------------

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds.)

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
=======================================================

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices::

      /* error checking omitted for brevity */
      #include <fcntl.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      int fd, dev_fd;
      ssize_t ret;
      char buf[256];
      unsigned long long map_len;
      void *map;

      fd = open("/sys/class/uio/uio0/name", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      /* we only want uio devices whose name is a format we expect */
      if (strncmp(buf, "tcm-user", 8))
        exit(-1);

      /* Further checking for subtype also needed here */

      fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
      ret = read(fd, buf, sizeof(buf));
      close(fd);
      buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

      map_len = strtoull(buf, NULL, 0);

      dev_fd = open("/dev/uio0", O_RDWR);
      map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);


b) Waiting for events on the device(s)::

      while (1) {
        char buf[4];

        int ret = read(dev_fd, buf, 4); /* will block */

        handle_device_events(dev_fd, map);
      }


c) Managing the command ring::

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <linux/target_core_user.h>

      /*
       * SCSI_NO_SENSE (0x00) and SCSI_CHECK_CONDITION (0x02) are assumed
       * to be defined by the handler, using SCSI specification values.
       */

      int handle_device_events(int fd, void *map)
      {
        struct tcmu_mailbox *mb = map;
        struct tcmu_cmd_entry *ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
        int did_some_work = 0;

        /* Process events from cmd ring until we catch up with cmd_head */
        while (ent != (void *)mb + mb->cmdr_off + mb->cmd_head) {

          if (tcmu_hdr_get_op(ent->hdr.len_op) == TCMU_OP_CMD) {
            uint8_t *cdb = (void *)mb + ent->req.cdb_off;
            bool success = true;

            /* Handle command here. */
            printf("SCSI opcode: 0x%x\n", cdb[0]);

            /* Set response fields */
            if (success)
              ent->rsp.scsi_status = SCSI_NO_SENSE;
            else {
              /* Also fill in rsp->sense_buffer here */
              ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
            }
          }
          else if (tcmu_hdr_get_op(ent->hdr.len_op) != TCMU_OP_PAD) {
            /* Tell the kernel we didn't handle unknown opcodes */
            ent->hdr.uflags |= TCMU_UFLAG_UNKNOWN_OP;
          }
          else {
            /* Do nothing for PAD entries except update cmd_tail */
          }

          /* update cmd_tail; the entry length is encoded in len_op */
          mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(ent->hdr.len_op)) % mb->cmdr_size;
          ent = (void *) mb + mb->cmdr_off + mb->cmd_tail;
          did_some_work = 1;
        }

        /* Notify the kernel that work has been finished */
        if (did_some_work) {
          uint32_t buf = 0;

          write(fd, &buf, 4);
        }

        return 0;
      }


A final note
============

Please be careful to return codes as defined by the SCSI
specifications. These are different than some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
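
To make the point about status codes concrete, here is a minimal sketch
of failing a command with CHECK CONDITION and fixed-format sense data.
The ``DEMO_*`` constants and ``demo_fail_command()`` are hypothetical
names chosen for this example; the numeric values are the ones the SCSI
specifications assign (GOOD is 0x00, CHECK CONDITION is 0x02, fixed-format
sense uses response code 0x70). In a real handler the returned status
would go into rsp.scsi_status and the buffer would be rsp.sense_buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Status byte values as defined by the SCSI specifications. */
#define DEMO_STATUS_GOOD            0x00
#define DEMO_STATUS_CHECK_CONDITION 0x02

/* Sense key and additional sense code for an unsupported CDB opcode. */
#define DEMO_KEY_ILLEGAL_REQUEST    0x05
#define DEMO_ASC_INVALID_OPCODE     0x20

/* Minimum length of fixed-format sense data carrying ASC/ASCQ. */
#define DEMO_FIXED_SENSE_LEN        18

/*
 * Fill a sense buffer with fixed-format sense data and return the
 * status byte to report. If the buffer is too small it is left
 * untouched, but CHECK CONDITION is still returned.
 */
static uint8_t demo_fail_command(uint8_t *sense, size_t sense_len,
				 uint8_t key, uint8_t asc, uint8_t ascq)
{
	if (sense_len >= DEMO_FIXED_SENSE_LEN) {
		memset(sense, 0, sense_len);
		sense[0] = 0x70;   /* fixed format, current error */
		sense[2] = key;    /* sense key */
		sense[7] = 0x0a;   /* additional sense length: bytes 8..17 */
		sense[12] = asc;   /* additional sense code */
		sense[13] = ascq;  /* additional sense code qualifier */
	}
	return DEMO_STATUS_CHECK_CONDITION;
}
```

For an opcode the handler does not implement, a call such as
``demo_fail_command(buf, len, DEMO_KEY_ILLEGAL_REQUEST,
DEMO_ASC_INVALID_OPCODE, 0)`` yields the status byte 2 that an initiator
expects.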