classic_vs_extended.rst - cachepc-linux - Fork of AMDESE/linux with modifications for CachePC side-channel attack

	cachepc-linux Fork of AMDESE/linux with modifications for CachePC side-channel attack
	git clone https://git.sinitax.com/sinitax/cachepc-linux
	Log \| Files \| Refs \| README \| LICENSE \| sfeed.txt
classic_vs_extended.rst (13851B)
      1
      2===================
      3Classic BPF vs eBPF
      4===================
      5
      6eBPF is designed to be JITed with one to one mapping, which can also open up
      7the possibility for GCC/LLVM compilers to generate optimized eBPF code through
      8an eBPF backend that performs almost as fast as natively compiled code.
      9
     10Some core changes of the eBPF format from classic BPF:
     11
     12- Number of registers increase from 2 to 10:
     13
     14  The old format had two registers A and X, and a hidden frame pointer. The
     15  new layout extends this to be 10 internal registers and a read-only frame
     16  pointer. Since 64-bit CPUs are passing arguments to functions via registers
     17  the number of args from eBPF program to in-kernel function is restricted
     18  to 5 and one register is used to accept return value from an in-kernel
     19  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
     20  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
     21  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
     22
     23  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
     24  etc, and eBPF calling convention maps directly to ABIs used by the kernel on
     25  64-bit architectures.
     26
     27  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
     28  and may let more complex programs to be interpreted.
     29
     30  R0 - R5 are scratch registers and eBPF program needs spill/fill them if
     31  necessary across calls. Note that there is only one eBPF program (== one
     32  eBPF main routine) and it cannot call other eBPF functions, it can only
     33  call predefined in-kernel functions, though.
     34
     35- Register width increases from 32-bit to 64-bit:
     36
     37  Still, the semantics of the original 32-bit ALU operations are preserved
     38  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
     39  subregisters that zero-extend into 64-bit if they are being written to.
     40  That behavior maps directly to x86_64 and arm64 subregister definition, but
     41  makes other JITs more difficult.
     42
     43  32-bit architectures run 64-bit eBPF programs via interpreter.
     44  Their JITs may convert BPF programs that only use 32-bit subregisters into
     45  native instruction set and let the rest being interpreted.
     46
     47  Operation is 64-bit, because on 64-bit architectures, pointers are also
     48  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
     49  so 32-bit eBPF registers would otherwise require to define register-pair
     50  ABI, thus, there won't be able to use a direct eBPF register to HW register
     51  mapping and JIT would need to do combine/split/move operations for every
     52  register in and out of the function, which is complex, bug prone and slow.
     53  Another reason is the use of atomic 64-bit counters.
     54
     55- Conditional jt/jf targets replaced with jt/fall-through:
     56
     57  While the original design has constructs such as ``if (cond) jump_true;
     58  else jump_false;``, they are being replaced into alternative constructs like
     59  ``if (cond) jump_true; /* else fall-through */``.
     60
     61- Introduces bpf_call insn and register passing convention for zero overhead
     62  calls from/to other kernel functions:
     63
     64  Before an in-kernel function call, the eBPF program needs to
     65  place function arguments into R1 to R5 registers to satisfy calling
     66  convention, then the interpreter will take them from registers and pass
     67  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
     68  that are used for argument passing on given architecture, the JIT compiler
     69  doesn't need to emit extra moves. Function arguments will be in the correct
     70  registers and BPF_CALL instruction will be JITed as single 'call' HW
     71  instruction. This calling convention was picked to cover common call
     72  situations without performance penalty.
     73
     74  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
     75  a return value of the function. Since R6 - R9 are callee saved, their state
     76  is preserved across the call.
     77
     78  For example, consider three C functions::
     79
     80    u64 f1() { return (*_f2)(1); }
     81    u64 f2(u64 a) { return f3(a + 1, a); }
     82    u64 f3(u64 a, u64 b) { return a - b; }
     83
     84  GCC can compile f1, f3 into x86_64::
     85
     86    f1:
     87	movl $1, %edi
     88	movq _f2(%rip), %rax
     89	jmp  *%rax
     90    f3:
     91	movq %rdi, %rax
     92	subq %rsi, %rax
     93	ret
     94
     95  Function f2 in eBPF may look like::
     96
     97    f2:
     98	bpf_mov R2, R1
     99	bpf_add R1, 1
    100	bpf_call f3
    101	bpf_exit
    102
    103  If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
    104  returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
    105  be used to call into f2.
    106
    107  For practical reasons all eBPF programs have only one argument 'ctx' which is
    108  already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
    109  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
    110  are currently not supported, but these restrictions can be lifted if necessary
    111  in the future.
    112
    113  On 64-bit architectures all register map to HW registers one to one. For
    114  example, x86_64 JIT compiler can map them as ...
    115
    116  ::
    117
    118    R0 - rax
    119    R1 - rdi
    120    R2 - rsi
    121    R3 - rdx
    122    R4 - rcx
    123    R5 - r8
    124    R6 - rbx
    125    R7 - r13
    126    R8 - r14
    127    R9 - r15
    128    R10 - rbp
    129
    130  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
    131  and rbx, r12 - r15 are callee saved.
    132
    133  Then the following eBPF pseudo-program::
    134
    135    bpf_mov R6, R1 /* save ctx */
    136    bpf_mov R2, 2
    137    bpf_mov R3, 3
    138    bpf_mov R4, 4
    139    bpf_mov R5, 5
    140    bpf_call foo
    141    bpf_mov R7, R0 /* save foo() return value */
    142    bpf_mov R1, R6 /* restore ctx for next call */
    143    bpf_mov R2, 6
    144    bpf_mov R3, 7
    145    bpf_mov R4, 8
    146    bpf_mov R5, 9
    147    bpf_call bar
    148    bpf_add R0, R7
    149    bpf_exit
    150
    151  After JIT to x86_64 may look like::
    152
    153    push %rbp
    154    mov %rsp,%rbp
    155    sub $0x228,%rsp
    156    mov %rbx,-0x228(%rbp)
    157    mov %r13,-0x220(%rbp)
    158    mov %rdi,%rbx
    159    mov $0x2,%esi
    160    mov $0x3,%edx
    161    mov $0x4,%ecx
    162    mov $0x5,%r8d
    163    callq foo
    164    mov %rax,%r13
    165    mov %rbx,%rdi
    166    mov $0x6,%esi
    167    mov $0x7,%edx
    168    mov $0x8,%ecx
    169    mov $0x9,%r8d
    170    callq bar
    171    add %r13,%rax
    172    mov -0x228(%rbp),%rbx
    173    mov -0x220(%rbp),%r13
    174    leaveq
    175    retq
    176
    177  Which is in this example equivalent in C to::
    178
    179    u64 bpf_filter(u64 ctx)
    180    {
    181	return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    182    }
    183
    184  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
    185  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
    186  registers and place their return value into ``%rax`` which is R0 in eBPF.
    187  Prologue and epilogue are emitted by JIT and are implicit in the
    188  interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
    189  them across the calls as defined by calling convention.
    190
    191  For example the following program is invalid::
    192
    193    bpf_mov R1, 1
    194    bpf_call foo
    195    bpf_mov R0, R1
    196    bpf_exit
    197
    198  After the call the registers R1-R5 contain junk values and cannot be read.
    199  An in-kernel verifier.rst is used to validate eBPF programs.
    200
    201Also in the new design, eBPF is limited to 4096 insns, which means that any
    202program will terminate quickly and will only call a fixed number of kernel
    203functions. Original BPF and eBPF are two operand instructions,
    204which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
    205
    206The input context pointer for invoking the interpreter function is generic,
    207its content is defined by a specific use case. For seccomp register R1 points
    208to seccomp_data, for converted BPF filters R1 points to a skb.
    209
    210A program, that is translated internally consists of the following elements::
    211
    212  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32
    213
    214So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field
    215has room for new instructions. Some of them may use 16/24/32 byte encoding. New
    216instructions must be multiple of 8 bytes to preserve backward compatibility.
    217
    218eBPF is a general purpose RISC instruction set. Not every register and
    219every instruction are used during translation from original BPF to eBPF.
    220For example, socket filters are not using ``exclusive add`` instruction, but
    221tracing filters may do to maintain counters of events, for example. Register R9
    222is not used by socket filters either, but more complex filters may be running
    223out of registers and would have to resort to spill/fill to stack.
    224
    225eBPF can be used as a generic assembler for last step performance
    226optimizations, socket filters and seccomp are using it as assembler. Tracing
    227filters may use it as assembler to generate code from kernel. In kernel usage
    228may not be bounded by security considerations, since generated eBPF code
    229may be optimizing internal code path and not being exposed to the user space.
    230Safety of eBPF can come from the verifier.rst. In such use cases as
    231described, it may be used as safe instruction set.
    232
    233Just like the original BPF, eBPF runs within a controlled environment,
    234is deterministic and the kernel can easily prove that. The safety of the program
    235can be determined in two steps: first step does depth-first-search to disallow
    236loops and other CFG validation; second step starts from the first insn and
    237descends all possible paths. It simulates execution of every insn and observes
    238the state change of registers and stack.
    239
    240opcode encoding
    241===============
    242
    243eBPF is reusing most of the opcode encoding from classic to simplify conversion
    244of classic BPF to eBPF.
    245
    246For arithmetic and jump instructions the 8-bit 'code' field is divided into three
    247parts::
    248
    249  +----------------+--------+--------------------+
    250  |   4 bits       |  1 bit |   3 bits           |
    251  | operation code | source | instruction class  |
    252  +----------------+--------+--------------------+
    253  (MSB)                                      (LSB)
    254
    255Three LSB bits store instruction class which is one of:
    256
    257  ===================     ===============
    258  Classic BPF classes     eBPF classes
    259  ===================     ===============
    260  BPF_LD    0x00          BPF_LD    0x00
    261  BPF_LDX   0x01          BPF_LDX   0x01
    262  BPF_ST    0x02          BPF_ST    0x02
    263  BPF_STX   0x03          BPF_STX   0x03
    264  BPF_ALU   0x04          BPF_ALU   0x04
    265  BPF_JMP   0x05          BPF_JMP   0x05
    266  BPF_RET   0x06          BPF_JMP32 0x06
    267  BPF_MISC  0x07          BPF_ALU64 0x07
    268  ===================     ===============
    269
    270The 4th bit encodes the source operand ...
    271
    272    ::
    273
    274	BPF_K     0x00
    275	BPF_X     0x08
    276
    277 * in classic BPF, this means::
    278
    279	BPF_SRC(code) == BPF_X - use register X as source operand
    280	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
    281
    282 * in eBPF, this means::
    283
    284	BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
    285	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
    286
    287... and four MSB bits store operation code.
    288
    289If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
    290
    291  BPF_ADD   0x00
    292  BPF_SUB   0x10
    293  BPF_MUL   0x20
    294  BPF_DIV   0x30
    295  BPF_OR    0x40
    296  BPF_AND   0x50
    297  BPF_LSH   0x60
    298  BPF_RSH   0x70
    299  BPF_NEG   0x80
    300  BPF_MOD   0x90
    301  BPF_XOR   0xa0
    302  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
    303  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
    304  BPF_END   0xd0  /* eBPF only: endianness conversion */
    305
    306If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
    307
    308  BPF_JA    0x00  /* BPF_JMP only */
    309  BPF_JEQ   0x10
    310  BPF_JGT   0x20
    311  BPF_JGE   0x30
    312  BPF_JSET  0x40
    313  BPF_JNE   0x50  /* eBPF only: jump != */
    314  BPF_JSGT  0x60  /* eBPF only: signed '>' */
    315  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
    316  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
    317  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
    318  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
    319  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
    320  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
    321  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
    322
    323So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
    324and eBPF. There are only two registers in classic BPF, so it means A += X.
    325In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
    326BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
    327src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
    328
    329Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
    330eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
    331BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
    332exactly the same operations as BPF_ALU, but with 64-bit wide operands
    333instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
    334dst_reg = dst_reg + src_reg
    335
    336Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
    337operation. Classic BPF_RET | BPF_K means copy imm32 into return register
    338and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
    339in eBPF means function exit only. The eBPF program needs to store return
    340value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
    341BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
    342operands for the comparisons instead.
    343
    344For load and store instructions the 8-bit 'code' field is divided as::
    345
    346  +--------+--------+-------------------+
    347  | 3 bits | 2 bits |   3 bits          |
    348  |  mode  |  size  | instruction class |
    349  +--------+--------+-------------------+
    350  (MSB)                             (LSB)
    351
    352Size modifier is one of ...
    353
    354::
    355
    356  BPF_W   0x00    /* word */
    357  BPF_H   0x08    /* half word */
    358  BPF_B   0x10    /* byte */
    359  BPF_DW  0x18    /* eBPF only, double word */
    360
    361... which encodes size of load/store operation::
    362
    363 B  - 1 byte
    364 H  - 2 byte
    365 W  - 4 byte
    366 DW - 8 byte (eBPF only)
    367
    368Mode modifier is one of::
    369
    370  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
    371  BPF_ABS     0x20
    372  BPF_IND     0x40
    373  BPF_MEM     0x60
    374  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
    375  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
    376  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */