.. SPDX-License-Identifier: GPL-2.0

.. _networking-filter:

=======================================================
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Notice
------

This file used to document the eBPF format and mechanisms even when not
related to socket filtering.  The ../bpf/index.rst has more details
on eBPF.

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Although there are some distinct differences between BSD and Linux
kernel filtering, when we speak of BPF or LSF in the Linux context, we
mean the very same filtering mechanism in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code,
send it to the kernel via the SO_ATTACH_FILTER option and, if your filter
code passes the kernel check on it, you immediately begin filtering
data on that socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since when you close a socket
that has a filter on it the filter is automagically removed. The other,
less common case is attaching a different filter to a socket that already
has a filter running: the kernel takes care of removing the old one and
putting your new one in its place, provided the new filter passes the
checks. If it fails them, the old filter remains on that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
such as the team driver, PTP code, etc., where BPF is being used.

.. [1] Documentation/userspace-api/seccomp_filter.rst

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures::

	struct sock_filter {	/* Filter block */
		__u16	code;   /* Actual filter code */
		__u8	jt;	/* Jump true */
		__u8	jf;	/* Jump false */
		__u32	k;      /* Generic multiuse field */
	};

A filter is assembled as an array of such 4-tuples, each containing
a code, jt, jf and k value. jt and jf are jump offsets and k is a generic
value to be used for a provided code::

	struct sock_fprog {			/* Required for SO_ATTACH_FILTER. */
		unsigned short		   len;	/* Number of filter blocks */
		struct sock_filter __user *filter;
	};

For socket filtering, a pointer to this structure (as shown in the
following example) is passed to the kernel through setsockopt(2).

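Writing raw hex tuples is error-prone, so the following is a minimal sketch
of how such an array might be built with the ``BPF_STMT()`` and ``BPF_JUMP()``
helper macros from <linux/filter.h>. It encodes the small ARP filter that
appears later in this document. The names ``arp_code`` and ``arp_prog_init``
are made up for this illustration only.

```c
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, BPF_* */

/* The ARP filter from the bpf_asm examples, built with the helper macros. */
static struct sock_filter arp_code[] = {
	BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),            /* ldh [12]             */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1), /* jeq #0x806, jt 0, jf 1 */
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),             /* ret #-1              */
	BPF_STMT(BPF_RET | BPF_K, 0),                      /* ret #0               */
};

/* Fill a sock_fprog so that it describes arp_code[] (hypothetical helper). */
static void arp_prog_init(struct sock_fprog *prog)
{
	prog->len = sizeof(arp_code) / sizeof(arp_code[0]);
	prog->filter = arp_code;
}
```

After ``arp_prog_init(&prog)``, ``&prog`` is what would be handed to
setsockopt(2) with SO_ATTACH_FILTER.
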
Example
-------

::

    #include <sys/socket.h>
    #include <sys/types.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <linux/filter.h>
    #include <linux/if_ether.h>
    /* ... */

    /* Not provided by user-space headers; defined here for the example. */
    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    /* From the example above: tcpdump -i em1 port 22 -dd */
    struct sock_filter code[] = {
	    { 0x28,  0,  0, 0x0000000c },
	    { 0x15,  0,  8, 0x000086dd },
	    { 0x30,  0,  0, 0x00000014 },
	    { 0x15,  2,  0, 0x00000084 },
	    { 0x15,  1,  0, 0x00000006 },
	    { 0x15,  0, 17, 0x00000011 },
	    { 0x28,  0,  0, 0x00000036 },
	    { 0x15, 14,  0, 0x00000016 },
	    { 0x28,  0,  0, 0x00000038 },
	    { 0x15, 12, 13, 0x00000016 },
	    { 0x15,  0, 12, 0x00000800 },
	    { 0x30,  0,  0, 0x00000017 },
	    { 0x15,  2,  0, 0x00000084 },
	    { 0x15,  1,  0, 0x00000006 },
	    { 0x15,  0,  8, 0x00000011 },
	    { 0x28,  0,  0, 0x00000014 },
	    { 0x45,  6,  0, 0x00001fff },
	    { 0xb1,  0,  0, 0x0000000e },
	    { 0x48,  0,  0, 0x0000000e },
	    { 0x15,  2,  0, 0x00000016 },
	    { 0x48,  0,  0, 0x00000010 },
	    { 0x15,  0,  1, 0x00000016 },
	    { 0x06,  0,  0, 0x0000ffff },
	    { 0x06,  0,  0, 0x00000000 },
    };

    struct sock_fprog bpf = {
	    .len = ARRAY_SIZE(code),
	    .filter = code,
    };

    sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0)
	    /* ... bail out ... */

    ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
    if (ret < 0)
	    /* ... bail out ... */

    /* ... */
    close(sock);

The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call for SO_DETACH_FILTER doesn't need any arguments,
while SO_LOCK_FILTER, which prevents the filter from being detached, takes
an integer value of 0 or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Writing a filter "by hand" can be an alternative when i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux extensions
that are not supported by libpcap's compiler, iii) a filter is too complex
to be cleanly implemented with libpcap's compiler, or iv) particular filter
codes should be optimized differently than libpcap's internal compiler
does. For example, xt_bpf and cls_bpf users might have requirements that
could result in more complex filter code, or one that cannot be expressed
with libpcap (e.g. different return codes for various code paths). Moreover,
BPF JIT implementors may wish to manually write test cases and thus need
low-level access to BPF code as well.

BPF engine and instruction set
------------------------------

Under tools/bpf/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for the example scenarios mentioned in
the previous section. The asm-like syntax mentioned here has been
implemented in bpf_asm and will be used for further explanations (instead
of dealing with less readable opcodes directly; the principles are the
same). The syntax is closely modelled after Steven McCanne's and Van
Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  =======          ====================================================
  Element          Description
  =======          ====================================================
  A                32 bit wide accumulator
  X                32 bit wide X register
  M[]              16 x 32 bit wide misc registers aka "scratch memory
                   store", addressable from 0 to 15
  =======          ====================================================

A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned)::

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.

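To make the op field less opaque, here is a small sketch that splits an
opcode into its bit fields using the ``BPF_CLASS()``, ``BPF_SIZE()``,
``BPF_MODE()``, ``BPF_OP()`` and ``BPF_SRC()`` macros that <linux/filter.h>
pulls in. ``struct op_fields`` and ``decode_op`` are illustrative names, not
part of any kernel API. Some fields share bits, so which ones are meaningful
depends on the instruction class.

```c
#include <linux/filter.h>

/* Decoded view of the 16 bit op field of a classic BPF instruction. */
struct op_fields {
	unsigned int cls;   /* instruction class: BPF_LD, BPF_JMP, BPF_RET, ... */
	unsigned int size;  /* load/store width: BPF_W, BPF_H or BPF_B          */
	unsigned int mode;  /* load/store addressing: BPF_ABS, BPF_IND, ...     */
	unsigned int op;    /* ALU/jump operation: BPF_ADD, BPF_JEQ, ...        */
	unsigned int src;   /* ALU/jump operand source: BPF_K or BPF_X          */
};

/* Extract all fields; the caller picks those that apply to the class. */
static struct op_fields decode_op(unsigned short code)
{
	struct op_fields f = {
		.cls  = BPF_CLASS(code),
		.size = BPF_SIZE(code),
		.mode = BPF_MODE(code),
		.op   = BPF_OP(code),
		.src  = BPF_SRC(code),
	};
	return f;
}
```

For instance, ``decode_op(0x28)`` yields class BPF_LD, size BPF_H and mode
BPF_ABS, which is the ``ldh [k]`` form used throughout this document.
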
The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:

  ===========      ===================  =====================
  Instruction      Addressing mode      Description
  ===========      ===================  =====================
  ld               1, 2, 3, 4, 12       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 12          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8, 9, 10          Jump on A == <x>
  jneq             9, 10                Jump on A != <x>
  jne              9, 10                Jump on A != <x>
  jlt              9, 10                Jump on A <  <x>
  jle              9, 10                Jump on A <= <x>
  jgt              7, 8, 9, 10          Jump on A >  <x>
  jge              7, 8, 9, 10          Jump on A >= <x>
  jset             7, 8, 9, 10          Jump on A &  <x>

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   -A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 11                Return
  ===========      ===================  =====================

The next table shows addressing formats from the 2nd column:

  ===============  ===================  ===============================================
  Addressing mode  Syntax               Description
  ===============  ===================  ===============================================
   0               x/%x                 Register X
   1               [k]                  BHW at byte offset k in the packet
   2               [x + k]              BHW at the offset X + k in the packet
   3               M[k]                 Word at offset k in M[]
   4               #k                   Literal value stored in k
   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
   6               L                    Jump label L
   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
   8               x/%x,Lt,Lf           Jump to Lt if true, otherwise jump to Lf
   9               #k,Lt                Jump to Lt if predicate is true
  10               x/%x,Lt              Jump to Lt if predicate is true
  11               a/%a                 Accumulator A
  12               extension            BPF extension
  ===============  ===================  ===============================================

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a BPF
extension is loaded into A.

Possible BPF extensions are shown in the following table:

  ===================================   =================================================
  Extension                             Description
  ===================================   =================================================
  len                                   skb->len
  proto                                 skb->protocol
  type                                  skb->pkt_type
  poff                                  Payload start offset
  ifidx                                 skb->dev->ifindex
  nla                                   Netlink attribute of type X with offset A
  nlan                                  Nested Netlink attribute of type X with offset A
  mark                                  skb->mark
  queue                                 skb->queue_mapping
  hatype                                skb->dev->type
  rxhash                                skb->hash
  cpu                                   raw_smp_processor_id()
  vlan_tci                              skb_vlan_tag_get(skb)
  vlan_avail                            skb_vlan_tag_present(skb)
  vlan_tpid                             skb->vlan_proto
  rand                                  prandom_u32()
  ===================================   =================================================

These extensions can also be prefixed with '#'.

Examples for low-level BPF:

**ARP packets**::

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

**IPv4 TCP packets**::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

**ICMP random packet sampling, 1 in 4**::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0

**SECCOMP filter example**::

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

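For comparison, the following sketch expresses a shortened version of the
SECCOMP filter above in C, using the named constants from <linux/audit.h>
and <linux/seccomp.h> instead of raw numbers. The ``ALLOW_SYSCALL`` macro,
``seccomp_code`` and ``install_seccomp_filter`` are hypothetical names, the
allow-list is trimmed to five syscalls, and the syscall numbers are the
x86-64 ones quoted above. Installing it is irreversible for the calling
process, so treat it as an illustration rather than a ready-made policy.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* If A equals nr, fall through to "allow"; otherwise skip that return. */
#define ALLOW_SYSCALL(nr) \
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

static struct sock_filter seccomp_code[] = {
	/* ld [4]: load the architecture and kill on anything but x86-64 */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_THREAD),
	/* ld [0]: load the syscall number and compare against the allow-list */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	ALLOW_SYSCALL(15),	/* rt_sigreturn */
	ALLOW_SYSCALL(231),	/* exit_group */
	ALLOW_SYSCALL(60),	/* exit */
	ALLOW_SYSCALL(0),	/* read */
	ALLOW_SYSCALL(1),	/* write */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_THREAD),
};

/* Install the filter for the calling thread; returns 0 on success. */
int install_seccomp_filter(void)
{
	struct sock_fprog prog = {
		.len = sizeof(seccomp_code) / sizeof(seccomp_code[0]),
		.filter = seccomp_code,
	};

	/* Needed so that an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Unlike the bpf_asm version, each allowed syscall here carries its own return
instruction, which avoids computing forward jump distances by hand.
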
Examples for low-level BPF extension:

**Packet for interface index 13**::

  ld ifidx
  jneq #13, drop
  ret #-1
  drop: ret #0

**(Accelerated) VLAN w/ id 10**::

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

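The "negative offset + extension offset" encoding of k can be made concrete
with the ``SKF_AD_OFF`` and ``SKF_AD_*`` constants from <linux/filter.h>. The
sketch below, with the made-up array name ``ifidx_code``, is one way the
interface index example above might be written in C.

```c
#include <linux/filter.h>

/* ld ifidx; jneq #13, drop; ret #-1; drop: ret #0 */
static struct sock_filter ifidx_code[] = {
	/* An absolute word load whose k lies in the extension range */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_IFINDEX),
	/* jneq is expressed as jeq with the jump on the "false" branch */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 13, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
	BPF_STMT(BPF_RET | BPF_K, 0),
};
```

Stored in the unsigned 32 bit k field, ``SKF_AD_OFF + SKF_AD_IFINDEX`` becomes
0xfffff008. Replacing ``SKF_AD_IFINDEX`` with ``SKF_AD_VLAN_TAG`` and 13 with 10
gives the VLAN variant.
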
The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, in an output
format that xt_bpf and cls_bpf understand and that can be loaded directly.
Example with above ARP code::

    $ ./bpf_asm foo
    4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

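The decimal format is simply the instruction count followed by one
"code jt jf k" group per instruction, each terminated by a comma. A minimal
sketch of a formatter that reproduces it from a ``struct sock_filter`` array
follows; ``format_filter`` is a hypothetical helper and not part of bpf_asm.

```c
#include <stdio.h>
#include <linux/filter.h>

/*
 * Write "len,code jt jf k,code jt jf k,...," into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int format_filter(const struct sock_filter *f, unsigned int len,
			 char *buf, size_t size)
{
	size_t off;
	unsigned int i;
	int n = snprintf(buf, size, "%u,", len);

	if (n < 0 || (size_t)n >= size)
		return -1;
	off = (size_t)n;
	for (i = 0; i < len; i++) {
		n = snprintf(buf + off, size - off, "%u %u %u %u,",
			     (unsigned int)f[i].code, (unsigned int)f[i].jt,
			     (unsigned int)f[i].jf, (unsigned int)f[i].k);
		if (n < 0 || (size_t)n >= size - off)
			return -1;
		off += (size_t)n;
	}
	return (int)off;
}
```

Feeding it the four ARP instructions yields the same line that bpf_asm
printed above.
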
In copy and paste C-like output::

    $ ./bpf_asm -c foo
    { 0x28,  0,  0, 0x0000000c },
    { 0x15,  0,  1, 0x00000806 },
    { 0x06,  0,  0, 0xffffffff },
    { 0x06,  0,  0, 0000000000 },

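To see how these four instructions behave on a packet, here is a deliberately
tiny interpreter sketch. It only understands absolute loads, ``jeq`` against a
constant and ``ret`` with a constant, and it is not how the kernel executes
filters; ``toy_bpf_run`` is a name invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>

/*
 * Toy classic BPF interpreter. Returns the filter's return value, or 0
 * (drop) on an unsupported opcode or an out-of-bounds packet access.
 */
static uint32_t toy_bpf_run(const struct sock_filter *prog, unsigned int len,
			    const uint8_t *pkt, size_t pkt_len)
{
	uint32_t A = 0;		/* accumulator */
	unsigned int pc = 0;	/* index of the next instruction */

	while (pc < len) {
		const struct sock_filter *i = &prog[pc++];

		switch (i->code) {
		case BPF_LD | BPF_B | BPF_ABS:		/* ldb [k] */
			if (i->k >= pkt_len)
				return 0;
			A = pkt[i->k];
			break;
		case BPF_LD | BPF_H | BPF_ABS:		/* ldh [k], big endian */
			if (pkt_len < 2 || i->k > pkt_len - 2)
				return 0;
			A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
			break;
		case BPF_LD | BPF_W | BPF_ABS:		/* ld [k], big endian */
			if (pkt_len < 4 || i->k > pkt_len - 4)
				return 0;
			A = (uint32_t)pkt[i->k] << 24 |
			    (uint32_t)pkt[i->k + 1] << 16 |
			    (uint32_t)pkt[i->k + 2] << 8 | pkt[i->k + 3];
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:		/* jeq #k, jt, jf */
			/* Offsets are relative to the following instruction. */
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case BPF_RET | BPF_K:			/* ret #k */
			return i->k;
		default:
			return 0;
		}
	}
	return 0;
}
```

Run against an Ethernet frame whose type field at offset 12 is 0x0806, the
ARP program returns 0xffffffff (accept); for any other frame it returns 0.
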
In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets and dumping the BPF machine registers.

Starting bpf_dbg is trivial and just requires issuing::

    # ./bpf_dbg

In case input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0

  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

* load pcap foo.pcap

  Loads standard tcpdump pcap file.

* run [<n>]::

	bpf passes:1 fails:9

  Runs through all packets from a pcap to account how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

* disassemble::

	l0:	ldh [12]
	l1:	jeq #0x800, l2, l5
	l2:	ldb [23]
	l3:	jeq #0x1, l4, l5
	l4:	ret #0xffff
	l5:	ret #0

  Prints out BPF code disassembly.

* dump::

	/* { op, jt, jf, k }, */
	{ 0x28,  0,  0, 0x0000000c },
	{ 0x15,  0,  3, 0x00000800 },
	{ 0x30,  0,  0, 0x00000017 },
	{ 0x15,  0,  1, 0x00000001 },
	{ 0x06,  0,  0, 0x0000ffff },
	{ 0x06,  0,  0, 0000000000 },

  Prints out C-style BPF code dump.

* breakpoint 0::

	breakpoint at: l0:	ldh [12]

* breakpoint 1::

	breakpoint at: l1:	jeq #0x800, l2, l5

  ...

  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  * run::

	-- register dump --
	pc:       [0]                       <-- program counter
	code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
	curr:     l0:	ldh [12]              <-- disassembly of current instruction
	A:        [00000000][0]             <-- content of A (hex, decimal)
	X:        [00000000][0]             <-- content of X (hex, decimal)
	M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
	-- packet dump --                   <-- Current packet from pcap (hex)
	len: 42
	  0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
	 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
	 32: 00 00 00 00 00 00 0a 3b 01 01
	(breakpoint)
	>

  * breakpoint::

	breakpoints: 0 1

    Prints currently set breakpoints.

* step [-<n>, +<n>]

  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

* select <n>

  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

* quit

  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS, RISC-V and s390, which can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
been previously enabled by root::

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers, doing audits etc, each compile run can output the generated
opcode image into the kernel log via::

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg::

    [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
    [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
    [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
    [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
    [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
    [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
setting any other value than that will result in failure. This is even the case for
setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
generally recommended approach instead.

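A program that wants to know which mode is active can read the same procfs
file. The following is a small sketch of that; the function name
``read_bpf_jit_enable`` is hypothetical, and the -1 return for a missing file
assumes that the sysctl is absent on kernels built without CONFIG_BPF_JIT.

```c
#include <stdio.h>

/*
 * Read the current JIT mode from procfs. Returns the value stored in
 * bpf_jit_enable, or -1 if the file is missing or cannot be parsed.
 */
static int read_bpf_jit_enable(void)
{
	FILE *f = fopen("/proc/sys/net/core/bpf_jit_enable", "r");
	int val = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

Reading the file does not require root, whereas changing the value does.
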
In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump::

	# ./bpf_jit_disasm
	70 bytes emitted from JIT compiler (pass:3, flen:6)
	ffffffffa0069c8f + <x>:
	0:	push   %rbp
	1:	mov    %rsp,%rbp
	4:	sub    $0x60,%rsp
	8:	mov    %rbx,-0x8(%rbp)
	c:	mov    0x68(%rdi),%r9d
	10:	sub    0x6c(%rdi),%r9d
	14:	mov    0xd8(%rdi),%r8
	1b:	mov    $0xc,%esi
	20:	callq  0xffffffffe0ff9442
	25:	cmp    $0x800,%eax
	2a:	jne    0x0000000000000042
	2c:	mov    $0x17,%esi
	31:	callq  0xffffffffe0ff945e
	36:	cmp    $0x1,%eax
	39:	jne    0x0000000000000042
	3b:	mov    $0xffff,%eax
	40:	jmp    0x0000000000000044
	42:	xor    %eax,%eax
	44:	leaveq
	45:	retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers::

	# ./bpf_jit_disasm -o
	70 bytes emitted from JIT compiler (pass:3, flen:6)
	ffffffffa0069c8f + <x>:
	0:	push   %rbp
		55
	1:	mov    %rsp,%rbp
		48 89 e5
	4:	sub    $0x60,%rsp
		48 83 ec 60
	8:	mov    %rbx,-0x8(%rbp)
		48 89 5d f8
	c:	mov    0x68(%rdi),%r9d
		44 8b 4f 68
	10:	sub    0x6c(%rdi),%r9d
		44 2b 4f 6c
	14:	mov    0xd8(%rdi),%r8
		4c 8b 87 d8 00 00 00
	1b:	mov    $0xc,%esi
		be 0c 00 00 00
	20:	callq  0xffffffffe0ff9442
		e8 1d 94 ff e0
	25:	cmp    $0x800,%eax
		3d 00 08 00 00
	2a:	jne    0x0000000000000042
		75 16
	2c:	mov    $0x17,%esi
		be 17 00 00 00
	31:	callq  0xffffffffe0ff945e
		e8 28 94 ff e0
	36:	cmp    $0x1,%eax
		83 f8 01
	39:	jne    0x0000000000000042
		75 07
	3b:	mov    $0xffff,%eax
		b8 ff ff 00 00
	40:	jmp    0x0000000000000044
		eb 02
	42:	xor    %eax,%eax
		31 c0
	44:	leaveq
		c9
	45:	retq
		c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------

Internally, the kernel interpreter uses a different instruction set
format that follows principles similar to those of the BPF described in
the previous paragraphs. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that better performance can be achieved. This new ISA is called eBPF.  See
the ../bpf/index.rst for details.  (Note: eBPF, which originates from
[e]xtended BPF, is not the same as BPF extensions! While eBPF is an ISA,
BPF extensions date back to classic BPF's 'overloading' of the
BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile into eBPF with an optional
GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
minimal performance overhead over two steps, that is, C -> eBPF -> native code.

Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
by using bpf_prog_create() for setting up the filter and
bpf_prog_destroy() for destroying it. The function
bpf_prog_run(filter, ctx) transparently invokes the eBPF interpreter or JITed
code to run the filter. 'filter' is a pointer to struct bpf_prog that we
got from bpf_prog_create(), and 'ctx' the given context (e.g.
skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!

Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
sparc64, arm32, riscv64, riscv32 perform JIT compilation from the eBPF
instruction set.

Testing
-------

Next to the BPF toolchain, the kernel also ships a test module that contains
various test cases for classic and eBPF that can be executed against
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
enabled via Kconfig::

  CONFIG_TEST_BPF=m

After the module has been built and installed, the test suite can be executed
via insmod or modprobe against 'test_bpf' module. Results of the test cases
including timings in nsec can be found in the kernel log (dmesg).

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

- Jay Schulist <jschlst@samba.org>
- Daniel Borkmann <daniel@iogearbox.net>
- Alexei Starovoitov <ast@kernel.org>