=======================================
Oracle Data Analytics Accelerator (DAX)
=======================================

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats.  A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:

    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
===================

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. A status code
returned indicates if the request was submitted successfully or if
there was an error.  One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128 byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished, but
the M7 and later processors provide a mechanism to pause the virtual
processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later.  The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it.  The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
=================

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical.  The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs.  When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. So a CCB may contain
either the virtual or real addresses of the buffers or a combination
of them. An "address type" field is available for each address that
may be given in the CCB. In all cases, the Hypervisor will translate
all the addresses to physical before dispatching to hardware. Address
translations are performed using the context of the process initiating
the request.


The Driver API
==============

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use.  This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests.  When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.
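The device probing described above can be sketched as a small helper. This is
only an illustration: open_first() and open_dax() are hypothetical names, not
part of the driver or of libdax, and the /dev paths assume the usual device
node location.

```c
#include <fcntl.h>
#include <stddef.h>

/* Device nodes named in the text: index 0 is DAX1 (M7), index 1 is DAX2 (M8). */
const char *const dax_paths[] = { "/dev/oradax1", "/dev/oradax2" };

/*
 * Hypothetical helper: try each path in order and return the first
 * descriptor that opens. *which receives the index of the path that
 * succeeded. Returns -1 if none of the paths could be opened.
 */
int open_first(const char *const paths[], size_t n, size_t *which)
{
	for (size_t i = 0; i < n; i++) {
		int fd = open(paths[i], O_RDWR);

		if (fd >= 0) {
			*which = i;
			return fd;
		}
	}
	return -1;
}

/* Probe for a DAX device; on success *which + 1 is the DAX version. */
int open_dax(size_t *which)
{
	return open_first(dax_paths, 2, which);
}
```

Because only one of the two devices exists on a given system, the index
reported through which tells the caller whether DAX1 or DAX2 is present.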

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.

CCB_DEQUEUE
-----------

Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources.  No
further status information is returned, so the user should not
subsequently call read().

CCB_KILL
--------

Kills a CCB during execution. The CCB is guaranteed to not continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.

CCB_INFO
--------

Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.
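The write()/read() convention for the three immediate commands can be
sketched as follows. The command values and the layout of struct dax_command
are local mirrors of what oradax.h is believed to define and should be
treated as assumptions; in real code, include the header instead.
dax_immediate() is a hypothetical helper name.

```c
#include <stdint.h>
#include <unistd.h>

/* Assumed to mirror oradax.h; use the real header in production code. */
enum { CCB_KILL = 0, CCB_INFO = 1, CCB_DEQUEUE = 2 };

struct dax_command {
	uint16_t command;	/* CCB_KILL, CCB_INFO or CCB_DEQUEUE */
	uint16_t ca_offset;	/* assumed: offset of the CCB's completion area */
};

/*
 * Hypothetical helper: issue one immediate command. For CCB_KILL and
 * CCB_INFO the result is then fetched with read() into result, which
 * must be result_len bytes long. Returns 0 on success and -1 on
 * failure, leaving errno as set by write() or read().
 */
int dax_immediate(int fd, uint16_t command, uint16_t ca_offset,
		  void *result, size_t result_len)
{
	struct dax_command cmd = { .command = command, .ca_offset = ca_offset };

	/* Success means write() consumed exactly the bytes we passed. */
	if (write(fd, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
		return -1;

	if (command == CCB_DEQUEUE)
		return 0;	/* no further status, so read() is not called */

	if (read(fd, result, result_len) != (ssize_t)result_len)
		return -1;
	return 0;
}
```

The result buffer size expected by the driver is not stated in this section,
so the caller supplies it here rather than the helper guessing.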

Submission of an array of CCBs for execution
---------------------------------------------

A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information.  The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.
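The three possible outcomes of a submission can be told apart from the return
value alone. The sketch below is illustrative: the enum, classify_submit()
and dax_submit() are hypothetical names, and the completion area index is
passed directly as the file offset, as the paragraph above describes.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define CCB_SIZE 64	/* a CCB is eight 64-bit words */

enum submit_outcome {
	SUBMIT_FAILED,		/* -1 returned, errno describes the error */
	SUBMIT_PARTIAL,		/* some CCBs rejected, read() the status */
	SUBMIT_COMPLETE,	/* all CCBs accepted, do not call read() */
};

/* Interpret the return value of write()/pwrite() for a submit. */
enum submit_outcome classify_submit(ssize_t accepted, size_t requested)
{
	if (accepted < 0)
		return SUBMIT_FAILED;
	return (size_t)accepted == requested ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}

/* Hypothetical wrapper: submit nccbs CCBs using completion area ca_index. */
enum submit_outcome dax_submit(int fd, const void *ccbs, size_t nccbs,
			       off_t ca_index)
{
	size_t len = nccbs * CCB_SIZE;

	return classify_submit(pwrite(fd, ccbs, len, ca_index), len);
}
```

Only SUBMIT_PARTIAL calls for a follow-up read(); the other two outcomes are
fully described by the return value and errno.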

MMAP
----

The mmap() function provides access to the completion area allocated
in the driver.  Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.
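Since each completion area is 128 bytes and the submit offset selects one by
index, locating a given completion area inside the mapped region is simple
pointer arithmetic. In this sketch dax_ca() is a hypothetical helper, and the
value given for DAX_MMAP_LEN is an assumption about what oradax.h defines.

```c
#include <stddef.h>

#define DAX_CA_SIZE	128		/* completion area size in bytes */
#define DAX_MMAP_LEN	(16 * 1024)	/* assumed to match oradax.h */

/*
 * Hypothetical helper: return a read-only pointer to completion area
 * number idx inside the region returned by mmap(), or NULL if idx lies
 * outside the mapping.
 */
const volatile unsigned char *dax_ca(const void *base, size_t idx)
{
	if (idx >= DAX_MMAP_LEN / DAX_CA_SIZE)
		return NULL;
	return (const volatile unsigned char *)base + idx * DAX_CA_SIZE;
}
```

The pointer is const because the mapping is read-only, and volatile because
the coprocessor updates the memory behind the compiler's back.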


Completion of a Request
=======================

The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.
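A bounded version of this polling loop might look like the sketch below. The
SPARC branches use the monitored load and mwait sequence described above; the
non-SPARC branches are a plain load and nanosleep(), included only so the
logic can be exercised on other machines. The function names and the poll
limit are illustrative assumptions.

```c
#include <stdint.h>
#include <time.h>

/* Read the status byte, using a monitored load (ASI 0x84) on SPARC. */
static inline uint8_t ca_load_status(const volatile uint8_t *ca)
{
#if defined(__sparc__)
	uint8_t status;

	__asm__ __volatile__("lduba [%1] 0x84, %0\n"
			     : "=r" (status)
			     : "r"  (ca));
	return status;
#else
	return ca[0];		/* plain load, for illustration only */
#endif
}

/* Wait up to 1000 ns; on SPARC, mwait ends early if the line is written. */
static inline void ca_wait(void)
{
#if defined(__sparc__)
	__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);
#else
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };

	nanosleep(&ts, NULL);	/* no early wakeup in this fallback */
#endif
}

/*
 * Poll until the status byte becomes non-zero or max_polls attempts
 * have been made. Returns the status, or 0 if the limit was reached.
 */
uint8_t ca_poll(const volatile uint8_t *ca, unsigned long max_polls)
{
	for (unsigned long i = 0; i < max_polls; i++) {
		uint8_t status = ca_load_status(ca);

		if (status)
			return status;
		ca_wait();
	}
	return 0;
}
```

Bounding the loop gives the caller a point at which to fall back to CCB_INFO
or CCB_KILL instead of spinning indefinitely.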


Application Life Cycle of a DAX Submission
==========================================

 - open dax device
 - call mmap() to get the completion area address
 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
 - submit CCB via write() or pwrite()
 - go into a loop executing monitored load + monitored wait and
   terminate when the command status indicates the request is complete
   (CCB_KILL or CCB_INFO may be used any time as necessary)
 - perform a CCB_DEQUEUE
 - call munmap() for completion area
 - close the dax device


Memory Constraints
==================

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page.  A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
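The page-boundary rule is easy to check before building a CCB. The sketch
below assumes the caller knows the hardware page size backing the buffer (for
the synthesized 8MB huge page that is 4MB); both function names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if the len bytes starting at addr lie within a single page.
 * page_size must be a power of two.
 */
bool dax_fits_in_page(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t mask = ~(uintptr_t)(page_size - 1);

	if (len == 0 || len > page_size)
		return false;
	/* First and last byte must fall in the same page. */
	return (addr & mask) == ((addr + len - 1) & mask);
}

/* Largest transfer starting at addr that stays inside its page. */
size_t dax_max_len(uintptr_t addr, size_t page_size)
{
	return page_size - (size_t)(addr & (page_size - 1));
}
```

For an 8MB huge page at 0x800000, a 4MB buffer passes the check at 0x800000
or 0xC00000 but fails at 0xA00000, which matches the first-half or
second-half restriction described above.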


CCB Structure
-------------
A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs::

   struct ccb {
       u64   control;
       u64   completion;
       u64   input0;
       u64   access;
       u64   input1;
       u64   op_data;
       u64   output;
       u64   table;
   };

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).

The first word (control) is examined by the driver for the following:

 - CCB version, which must be consistent with hardware version
 - Opcode, which must be one of the documented allowable commands
 - Address types, which must be set to "virtual" for all the addresses
   given by the user, thereby ensuring that the application can
   only access memory that it owns


Example Code
============

The DAX is accessible to both user and kernel code.  The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed::

	fd = open("/dev/oradax1", O_RDWR);
	if (fd < 0)
		fd = open("/dev/oradax2", O_RDWR);
	if (fd < 0) {
		/* No DAX found */
	}

Next, the completion area must be mapped::

	completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

All input and output buffers must be fully contained in one hardware
page, since, as explained above, the DAX is strictly constrained by
virtual page boundaries.  In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.
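One way to satisfy both output buffer rules is to round the size up to a
multiple of 64 bytes and align the allocation to the page size, so that a
buffer no larger than one page cannot straddle a boundary. This is a sketch
under stated assumptions: the 8k base page size is taken from the Memory
Constraints section, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define DAX_LINE	64	/* output is written in 64-byte units */
#define BASE_PAGE	8192	/* assumed base page size (8k) */

/* Bytes needed for an nbits-wide bitmap, rounded up to 64 bytes. */
size_t dax_output_size(size_t nbits)
{
	size_t bytes = (nbits + 7) / 8;

	return (bytes + DAX_LINE - 1) / DAX_LINE * DAX_LINE;
}

/*
 * Hypothetical allocator for an output bitmap. The buffer is aligned
 * to BASE_PAGE and limited to BASE_PAGE bytes, so it is 64-byte
 * aligned and stays inside one base page. Returns NULL on failure.
 */
void *dax_alloc_output(size_t nbits, size_t *size_out)
{
	size_t size = dax_output_size(nbits);
	void *p = NULL;

	if (size == 0 || size > BASE_PAGE)
		return NULL;
	if (posix_memalign(&p, BASE_PAGE, size) != 0)
		return NULL;
	*size_out = size;
	return p;
}
```

Larger outputs would need huge pages, which this sketch deliberately does not
attempt to allocate.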

This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.

In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa, so the output bitmap is the input
bitmap inverted.
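A software model of this particular configuration is useful for checking the
coprocessor output in a test. The sketch below is not the hardware algorithm,
only the inversion it is described as producing, and it assumes bits are
packed most-significant-bit first within each byte. The function name is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reference model: 1-bit input elements, match value 0. Each 0 bit in
 * the input yields a 1 bit in the output. Assumes MSB-first packing;
 * unused bits in the final output byte are cleared.
 */
void scan_zero_bits_ref(const uint8_t *in, uint8_t *out, size_t nbits)
{
	size_t full = nbits / 8;
	size_t rem = nbits % 8;

	for (size_t i = 0; i < full; i++)
		out[i] = (uint8_t)~in[i];

	if (rem) {
		/* Keep only the top rem bits of the last byte. */
		uint8_t mask = (uint8_t)(0xFF << (8 - rem));

		out[full] = (uint8_t)(~in[full] & mask);
	}
}
```

When nbits is a multiple of 8 the bit-order assumption has no effect, since
every byte is simply complemented.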

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail::

	ccb->control =          /* Table 36.1, CCB Header Format */
		  (2L << 48)    /* command = Scan Value */
		| (3L << 40)    /* output address type = primary virtual */
		| (3L << 34)    /* primary input address type = primary virtual */
		                /* Section 36.2.1, Query CCB Command Formats */
		| (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
		| (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
		| (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
		| (0 <<  5)     /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
		| (31 << 0);    /* 36.2.1.3 Disable second scan criteria */

	ccb->completion = 0;    /* Completion area address, to be filled in by driver */

	ccb->input0 = (unsigned long) input; /* primary input address */

	ccb->access =           /* Section 36.2.1.2, Data Access Control */
		  (2 << 24)     /* Primary input length format = bits */
		| (nbits - 1);  /* number of bits in primary input stream, minus 1 */

	ccb->input1 = 0;        /* secondary input address, unused */

	ccb->op_data = 0;       /* scan criteria (value to be matched) */

	ccb->output = (unsigned long) output;	/* output address */

	ccb->table = 0;         /* table address, unused */
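The control word above can also be assembled from named fields, which makes
the bit layout easier to audit. The macro names in this sketch are
illustrative and are not taken from oradax.h or libdax; only the shift
positions and values come from the example.

```c
#include <stdint.h>

/* Field positions as used in the example above; names are illustrative. */
#define CCB_CMD_SHIFT		48
#define CCB_OUT_TYPE_SHIFT	40
#define CCB_IN0_TYPE_SHIFT	34
#define CCB_IN0_FMT_SHIFT	28
#define CCB_IN0_ELEM_SHIFT	23
#define CCB_OUT_FMT_SHIFT	10
#define CCB_CRIT1_SIZE_SHIFT	5
#define CCB_CRIT2_SIZE_SHIFT	0

/* Build the control word for the Scan Value example. */
uint64_t scan_control_word(void)
{
	return ((uint64_t)2  << CCB_CMD_SHIFT)		/* Scan Value */
	     | ((uint64_t)3  << CCB_OUT_TYPE_SHIFT)	/* primary virtual */
	     | ((uint64_t)3  << CCB_IN0_TYPE_SHIFT)	/* primary virtual */
	     | ((uint64_t)1  << CCB_IN0_FMT_SHIFT)	/* fixed width bit packed */
	     | ((uint64_t)0  << CCB_IN0_ELEM_SHIFT)	/* element size 0 (1 bit) */
	     | ((uint64_t)8  << CCB_OUT_FMT_SHIFT)	/* bit vector */
	     | ((uint64_t)0  << CCB_CRIT1_SIZE_SHIFT)	/* 1 byte criteria */
	     | ((uint64_t)31 << CCB_CRIT2_SIZE_SHIFT);	/* second criteria off */
}
```

Casting every field to a 64-bit type before shifting avoids relying on the
width of int or long for the high fields.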

The CCB submission is a write() or pwrite() system call to the
driver. If the call fails, then a read() must be used to retrieve the
status::

	if (pwrite(fd, ccb, 64, 0) != 64) {
		struct ccb_exec_result status;
		read(fd, &status, sizeof(status));
		/* bail out */
	}

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document::

	while (1) {
		/* Monitored Load */
		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
				     : "=r" (status)
				     : "r"  (completion_area));

		if (status)	     /* 0 indicates command in progress */
			break;

		/* MWAIT */
		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
	}

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2::

	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
		/* completion_area[0] contains the completion status */
		/* completion_area[1] contains an error code, see 36.2.2 */
	}

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation::

	struct dax_command cmd;
	cmd.command = CCB_DEQUEUE;
	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
		/* bail out */
	}

Finally, normal program cleanup should be done, i.e., unmapping
completion area, closing the dax device, freeing memory etc.

Kernel example
--------------

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB::

	ccb->control |=         /* Table 36.1, CCB Header Format */
		(3L << 32);     /* completion area address type = primary virtual */

	ccb->completion = (unsigned long) completion_area;   /* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1::

	#include <asm/hypervisor.h>

	hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
				 HV_CCB_QUERY_CMD |
				 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
				 HV_CCB_VA_PRIVILEGED,
				 0, &bytes_accepted, &status_data);

	if (hv_rv != HV_EOK) {
		/* hv_rv is an error code, status_data contains */
		/* potential additional status, see 36.3.1.1 */
	}

After the submission, the completion area polling code is identical to
that in user land::

	while (1) {
		/* Monitored Load */
		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
				     : "=r" (status)
				     : "r"  (completion_area));

		if (status)	     /* 0 indicates command in progress */
			break;

		/* MWAIT */
		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
	}

	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
		/* completion_area[0] contains the completion status */
		/* completion_area[1] contains an error code, see 36.2.2 */
	}

The output bitmap is ready for consumption immediately after the
completion status indicates success.

Excerpt from UltraSPARC Virtual Machine Specification
=====================================================

 .. include:: dax-hv-api.txt
    :literal: