cachepc-linux

Fork of AMDESE/linux with modifications for CachePC side-channel attack
git clone https://git.sinitax.com/sinitax/cachepc-linux

rds.rst


.. SPDX-License-Identifier: GPL-2.0

===
RDS
===

Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not InfiniBand-specific; it was designed to support different
transports.  The current implementation supports RDS over TCP as well
as IB.

The high-level semantics of RDS from the application's point of view are

 *      Addressing

        RDS uses IPv4 addresses and 16-bit port numbers to identify
        the end point of a connection. All socket operations that involve
        passing addresses between kernel and user space generally
        use a struct sockaddr_in.

        The fact that IPv4 addresses are used does not mean the underlying
        transport has to be IP-based. In fact, RDS over IB uses a
        reliable IB connection; the IP address is used exclusively to
        locate the remote node's GID (by ARPing for the given IP).

        The port space is entirely independent of UDP, TCP or any other
        protocol.

 *      Socket interface

        RDS sockets work *mostly* as you would expect from a BSD
        socket. The next section will cover the details. At any rate,
        all I/O is performed through the standard BSD socket API.
        Some additions like zerocopy support are implemented through
        control messages, while other extensions use the getsockopt/
        setsockopt calls.

        Sockets must be bound before you can send or receive data.
        This is needed because binding also selects a transport and
        attaches it to the socket. Once bound, the transport assignment
        does not change. RDS will tolerate IPs moving around (e.g. in
        an active-active HA scenario), but only as long as the address
        doesn't move to a different transport.

 *      sysctls

        RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
        AF_RDS and PF_RDS are the domain type to be used with socket(2)
        to create RDS sockets. SOL_RDS is the socket-level to be used
        with setsockopt(2) and getsockopt(2) for RDS specific socket
        options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
        This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
        RDS honors the send and receive buffer size socket options.
        You are not allowed to queue more than SO_SNDSIZE bytes to
        a socket. A message is queued when sendmsg is called, and
        it leaves the queue when the remote system acknowledges
        its arrival.

        The SO_RCVSIZE option controls the maximum receive queue length.
        This is a soft limit rather than a hard limit - RDS will
        continue to accept and queue incoming messages, even if that
        takes the queue length over the limit. However, it will also
        mark the port as "congested" and send a congestion update to
        the source node. The source node is supposed to throttle any
        processes sending to this congested port.

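        For illustration, these limits are set from userspace with the
        standard SOL_SOCKET buffer options; a minimal sketch (error
        handling reduced to a return code)::

            #include <sys/socket.h>

            /* Enlarge the send and receive queues to 1 MB each.  The
             * send limit bounds the bytes sendmsg() may queue; the
             * receive limit is the soft congestion threshold above. */
            static int rds_set_bufsizes(int fd)
            {
                    int sz = 1024 * 1024;

                    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) < 0)
                            return -1;
                    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
            }
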
  bind(fd, &sockaddr_in, ...)
        This binds the socket to a local IP address and port, and a
        transport, if one has not already been selected via the
        SO_RDS_TRANSPORT socket option.

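        Putting socket creation and binding together, a minimal sketch
        (the local address and port are illustrative placeholders)::

            #include <stdint.h>
            #include <unistd.h>
            #include <arpa/inet.h>
            #include <netinet/in.h>
            #include <sys/socket.h>

            /* Create an RDS socket and bind it; binding also selects
             * and attaches the transport for the socket's lifetime. */
            static int rds_open_bound(const char *local_ip, uint16_t port)
            {
                    struct sockaddr_in sin = { 0 };
                    int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);

                    if (fd < 0)
                            return -1;

                    sin.sin_family = AF_INET;
                    sin.sin_addr.s_addr = inet_addr(local_ip);
                    sin.sin_port = htons(port);

                    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                            close(fd);
                            return -1;
                    }
                    return fd;
            }
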
  sendmsg(fd, ...)
        Sends a message to the indicated recipient. The kernel will
        transparently establish the underlying reliable connection
        if it isn't up yet.

        An attempt to send a message that exceeds SO_SNDSIZE will
        return -EMSGSIZE.

        An attempt to send a message that would take the total number
        of queued bytes over the SO_SNDSIZE threshold will return
        EAGAIN.

        An attempt to send a message to a destination that is marked
        as "congested" will return ENOBUFS.

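        A minimal sketch of sending one datagram; the destination is
        named per call via msg_name, as on any unconnected datagram
        socket::

            #include <netinet/in.h>
            #include <sys/socket.h>
            #include <sys/uio.h>

            static ssize_t rds_send(int fd, const struct sockaddr_in *dst,
                                    void *buf, size_t len)
            {
                    struct iovec iov = { .iov_base = buf, .iov_len = len };
                    struct msghdr msg = {
                            .msg_name = (void *)dst,
                            .msg_namelen = sizeof(*dst),
                            .msg_iov = &iov,
                            .msg_iovlen = 1,
                    };

                    /* Failure modes described above:
                     *   EMSGSIZE - len exceeds the send buffer size
                     *   EAGAIN   - queued bytes would exceed the limit
                     *   ENOBUFS  - the destination port is congested
                     */
                    return sendmsg(fd, &msg, 0);
            }
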
  recvmsg(fd, ...)
        Receives a message that was queued to this socket. The socket's
        recv queue accounting is adjusted, and if the queue length
        drops below SO_RCVSIZE, the port is marked uncongested, and
        a congestion update is sent to all peers.

        Applications can ask the RDS kernel module to receive
        notifications via control messages (for instance, there is a
        notification when a congestion update arrived, or when an RDMA
        operation completes). These notifications are received through
        the msg.msg_control buffer of struct msghdr. The format of the
        messages is described in manpages.

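        A minimal sketch of receiving one datagram together with any
        queued notifications; the cmsg constants come from
        <linux/rds.h>::

            #include <linux/rds.h>
            #include <netinet/in.h>
            #include <sys/socket.h>
            #include <sys/uio.h>

            static ssize_t rds_recv(int fd, void *buf, size_t len)
            {
                    char cbuf[CMSG_SPACE(sizeof(struct rds_rdma_notify))];
                    struct sockaddr_in from;
                    struct iovec iov = { .iov_base = buf, .iov_len = len };
                    struct msghdr msg = {
                            .msg_name = &from,
                            .msg_namelen = sizeof(from),
                            .msg_iov = &iov,
                            .msg_iovlen = 1,
                            .msg_control = cbuf,
                            .msg_controllen = sizeof(cbuf),
                    };
                    struct cmsghdr *cmsg;
                    ssize_t ret = recvmsg(fd, &msg, 0);

                    if (ret < 0)
                            return ret;

                    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
                         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
                            if (cmsg->cmsg_level != SOL_RDS)
                                    continue;
                            if (cmsg->cmsg_type == RDS_CMSG_RDMA_STATUS) {
                                    /* an RDMA operation completed */
                            } else if (cmsg->cmsg_type == RDS_CMSG_CONG_UPDATE) {
                                    /* a congestion update arrived */
                            }
                    }
                    return ret;
            }
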
  poll(fd)
        RDS supports the poll interface to allow the application
        to implement async I/O.

        POLLIN handling is pretty straightforward. When there's an
        incoming message queued to the socket, or a pending notification,
        we signal POLLIN.

        POLLOUT is a little harder. Since you can essentially send
        to any destination, RDS will always signal POLLOUT as long as
        there's room on the send queue (i.e. the number of bytes queued
        is less than the sendbuf size).

        However, the kernel will refuse to accept messages to
        a destination marked congested - in this case you will loop
        forever if you rely on poll to tell you what to do.
        This isn't a trivial problem, but applications can deal with
        this - by using congestion notifications, and by checking for
        ENOBUFS errors returned by sendmsg.

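        A minimal sketch of a send loop combining poll(2) with the
        ENOBUFS handling described above; rds_send() is the sketch from
        the sendmsg section::

            #include <errno.h>
            #include <poll.h>
            #include <netinet/in.h>

            static ssize_t send_when_ready(int fd,
                                           const struct sockaddr_in *dst,
                                           void *buf, size_t len)
            {
                    struct pollfd pfd = { .fd = fd, .events = POLLOUT };

                    for (;;) {
                            ssize_t ret = rds_send(fd, dst, buf, len);

                            if (ret >= 0 || errno != ENOBUFS)
                                    return ret;
                            /* Destination congested: POLLOUT alone would
                             * spin, so wait for POLLIN instead.  A full
                             * application would consume the pending
                             * congestion notification here. */
                            pfd.events = POLLIN;
                            if (poll(&pfd, 1, -1) < 0)
                                    return -1;
                    }
            }
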
  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
        This allows the application to discard all messages queued to a
        specific destination on this particular socket.

        In particular, this lets the application cancel outstanding
        messages if it detects a timeout. For instance, if it tried to
        send a message, and the remote host is unreachable, RDS will keep
        trying forever. The application may decide it's not worth it, and
        cancel the operation. In this case, it would use
        RDS_CANCEL_SENT_TO to nuke any pending messages.

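        A minimal sketch of cancelling everything queued to one peer
        after an application-level timeout::

            #include <netinet/in.h>
            #include <sys/socket.h>
            #include <linux/rds.h>

            /* Drop all not-yet-acknowledged messages this socket has
             * queued toward *dst. */
            static int rds_cancel_sent_to(int fd, struct sockaddr_in *dst)
            {
                    return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                                      dst, sizeof(*dst));
            }
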
  ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
        Set or read an integer defining the underlying
        encapsulating transport to be used for RDS packets on the
        socket. When setting the option, the integer argument may be
        one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
        value, RDS_TRANS_NONE will be returned on an unbound socket.
        This socket option may only be set exactly once on the socket,
        prior to binding it via the bind(2) system call. Attempts to
        set SO_RDS_TRANSPORT on a socket for which the transport has
        been previously attached explicitly (by SO_RDS_TRANSPORT) or
        implicitly (via bind(2)) will return an error of EOPNOTSUPP.
        An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
        always return EINVAL.

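        A minimal sketch of pinning an unbound socket to the TCP
        transport and reading the setting back::

            #include <sys/socket.h>
            #include <linux/rds.h>

            /* Must be called before bind(2), and at most once. */
            static int rds_force_tcp(int fd)
            {
                    int t = RDS_TRANS_TCP;
                    socklen_t len = sizeof(t);

                    if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
                                   &t, sizeof(t)) < 0)
                            return -1;      /* EOPNOTSUPP if already set */
                    if (getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
                                   &t, &len) < 0)
                            return -1;
                    return t;               /* now RDS_TRANS_TCP */
            }
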
RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):

    Fields:

      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          Can be:

          =============  ==================================
          CONG_BITMAP    this is a congestion update bitmap
          ACK_REQUIRED   receiver must ack this packet
          RETRANSMITTED  packet has previously been sent
          =============  ==================================

      h_credit:
          indicates to the other end of the connection that
          it has more credits available (i.e. there is
          more send room)
      h_padding[4]:
          unused, for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.

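    For orientation, the corresponding layout is roughly the following
    (a condensed sketch of the field list above; consult rds.h for the
    authoritative definition)::

        struct rds_header {
                __be64  h_sequence;     /* per-packet sequence number */
                __be64  h_ack;          /* piggybacked ack */
                __be32  h_len;          /* payload length, sans header */
                __be16  h_sport;        /* source port */
                __be16  h_dport;        /* destination port */
                u8      h_flags;        /* CONG_BITMAP, ACK_REQUIRED, ... */
                u8      h_credit;       /* newly available send credits */
                u8      h_padding[4];   /* unused, for future use */
                __sum16 h_csum;         /* header checksum */
                u8      h_exthdr[RDS_HEADER_EXT_SPACE]; /* e.g. RDMA info */
        };
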
  ACK and retransmit handling

      One might think that with reliable IB connections you wouldn't need
      to ack messages that have been received.  The problem is that IB
      hardware generates an ack message before it has DMAed the message
      into memory.  This creates a potential message loss if the HCA is
      disabled for any reason between when it sends the ack and before
      the message is DMAed and processed.  This is only a potential issue
      if another HCA is available for fail-over.

      Sending an ack immediately would allow the sender to free the sent
      message from their send queue quickly, but could cause excessive
      traffic to be used for acks. RDS piggybacks acks on sent data
      packets.  Ack-only packets are reduced by only allowing one to be
      in flight at a time, and by the sender only asking for acks when
      its send buffers start to fill up. All retransmissions are also
      acked.

  Flow Control

      RDS's IB transport uses a credit-based mechanism to verify that
      there is space in the peer's receive buffers for more data. This
      eliminates the need for hardware retries on the connection.

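      As an illustration of the idea (a hypothetical helper, not the
      actual RDS code), a sender only posts a work request after
      consuming a credit, where one credit corresponds to one receive
      buffer posted by the peer::

          #include <linux/atomic.h>
          #include <linux/types.h>

          /* Atomically take one send credit; returns false when the
           * sender must stall until the peer advertises more room
           * via h_credit. */
          static bool try_consume_send_credit(atomic_t *credits)
          {
                  int old;

                  do {
                          old = atomic_read(credits);
                          if (old <= 0)
                                  return false;
                  } while (atomic_cmpxchg(credits, old, old - 1) != old);
                  return true;
          }
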
  Congestion

      Messages waiting in the receive queue on the receiving socket
      are accounted against the socket's SO_RCVBUF option value.  Only
      the payload bytes in the message are accounted for.  If the
      number of bytes queued equals or exceeds rcvbuf then the socket
      is congested.  All sends attempted to this socket's address
      should block or return -EWOULDBLOCK.

      Applications are expected to be reasonably tuned such that this
      situation very rarely occurs.  An application encountering this
      "back-pressure" is considered buggy.

      This is implemented by having each node maintain bitmaps which
      indicate which ports on bound addresses are congested.  As the
      bitmap changes it is sent through all the connections which
      terminate in the local address of the bitmap which changed.

      The bitmaps are allocated as connections are brought up.  This
      avoids allocation in the interrupt handling path which queues
      messages on sockets.  The dense bitmaps let transports send the
      entire bitmap on any bitmap change reasonably efficiently.  This
      is much easier to implement than some finer-grained
      communication of per-port congestion.  The sender does a very
      inexpensive bit test to check whether the port it's about to
      send to is congested or not.

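      That sender-side check is conceptually a single bit test; a sketch
      with hypothetical names (the real code lives in net/rds/cong.c)::

          #include <linux/bitops.h>
          #include <linux/types.h>

          #define RDS_CONG_MAP_PORTS  (1 << 16)  /* one bit per port */

          /* Return nonzero if the peer marked this destination port
           * congested in the bitmap it last sent us. */
          static int rds_port_congested(const unsigned long *map, u16 dport)
          {
                  return test_bit(dport & (RDS_CONG_MAP_PORTS - 1), map);
          }
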
RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other InfiniBand details.


RDS Kernel Structures
=====================

  struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.

  struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.

  struct rds_socket
    per-socket information

  struct rds_connection
    per-connection information

  struct rds_transport
    pointers to transport-specific functions

  struct rds_statistics
    non-transport-specific statistics

  struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.


The send path
=============

  rds_sendmsg()
    - struct rds_message built from incoming data
    - CMSGs parsed (e.g. RDMA ops)
    - transport connection alloced and connected if not already
    - rds_message placed on send queue
    - send worker awoken

  rds_send_worker()
    - calls rds_send_xmit() until queue is empty

  rds_send_xmit()
    - transmits congestion map if one is pending
    - may set ACK_REQUIRED
    - calls transport to send either non-RDMA or RDMA message
      (RDMA ops never retransmitted)

  rds_ib_xmit()
    - allocs work requests from send ring
    - adds any new send credits available to peer (h_credits)
    - maps the rds_message's sg list
    - piggybacks ack
    - populates work requests
    - posts send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    - looks at recv completions
    - unmaps recv buffer from device
    - if no errors, calls rds_ib_process_recv()
    - refills recv ring

  rds_ib_process_recv()
    - validates header checksum
    - copies header to rds_ib_incoming struct if start of a new datagram
    - adds to ibinc's fraglist
    - if completed datagram:
         - update cong map if datagram was cong update
         - call rds_recv_incoming() otherwise
         - note if ack is required

  rds_recv_incoming()
    - drops duplicate packets
    - responds to pings
    - finds the sock associated with this datagram
    - adds to sock queue
    - wakes up sock
    - does some congestion calculations

  rds_recvmsg()
    - copies data into user iovec
    - handles CMSGs
    - returns to application

Multipath RDS (mprds)
=====================
  Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
  (though the concept can be extended to other transports). The classical
  implementation of RDS-over-TCP is implemented by demultiplexing multiple
  PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
  port]) over a single TCP socket between the 2 IP addresses involved. This
  has the limitation that it ends up funneling multiple RDS flows over a
  single TCP flow, thus it
  (a) is upper-bounded by the single-flow bandwidth, and
  (b) suffers from head-of-line blocking for all the RDS sockets.

  Better throughput (for a fixed small packet size, MTU) can be achieved
  by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
  RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
  connection. RDS sockets will be attached to a path based on some hash
  (e.g., of local address and RDS port number) and packets for that RDS
  socket will be sent over the attached path using TCP to segment/reassemble
  RDS datagrams on that path.

  Multipathed RDS is implemented by splitting the struct rds_connection into
  a common (to all paths) part, and a per-path struct rds_conn_path. All
  I/O workqs and reconnect threads are driven from the rds_conn_path.
  Transports such as TCP that are multipath capable may then set up a
  TCP socket per rds_conn_path, and this is managed by the transport via
  the transport-private cp_transport_data pointer.

  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the
  local address and port that the PF_RDS socket is bound to, as sketched
  below.

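  Conceptually, the path choice is that hash reduced modulo the
  negotiated path count; a sketch with hypothetical names, not the
  actual RDS code::

      #include <linux/jhash.h>
      #include <linux/types.h>

      /* Pick the rds_conn_path index for a socket bound to
       * (laddr, lport), given the path count agreed with the peer. */
      static int rds_path_index(__be32 laddr, __be16 lport, int npaths)
      {
              return jhash_2words((__force u32)laddr,
                                  (__force u32)lport, 0) % npaths;
      }
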
  Additionally, even if the transport is MP capable, we may be
  peering with some node that does not support mprds, or supports
  a different number of paths. As a result, the peering nodes need
  to agree on the number of paths to be used for the connection.
  This is done by sending out a control packet exchange before the
  first data packet. The control packet exchange must have completed
  prior to outgoing hash completion in rds_sendmsg() when the transport
  is multipath capable.

  The control packet is an RDS ping packet (i.e., a packet to RDS dest
  port 0) with the ping packet having an RDS extension header option of
  type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
  number of paths supported by the sender. The "probe" ping packet will
  get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>).
  The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
  be able to compute the min(sender_paths, rcvr_paths). The pong
  sent in response to a probe-ping should contain the rcvr's npaths
  when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be
  ignored.  In this case the pong will not have any exthdrs, so the sender
  of the probe-ping can default to single-path mprds.