summaryrefslogtreecommitdiffstats
path: root/kernel/bpf
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | | bpf: Implement bpf_seq_read() for bpf iteratorYonghong Song2020-05-091-0/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bpf iterator uses seq_file to provide a lossless way to transfer data to user space. But we want to call bpf program after all objects have been traversed, and bpf program may write additional data to the seq_file buffer. The current seq_read() does not work for this use case. Besides allowing stop() function to write to the buffer, the bpf_seq_read() also fixed the buffer size to one page. If any single call of show() or stop() will emit data more than one page to cause overflow, -E2BIG error code will be returned to user space. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175904.2475468-1-yhs@fb.com
| * | | | | bpf: Support bpf tracing/iter programs for BPF_LINK_UPDATEYonghong Song2020-05-091-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added BPF_LINK_UPDATE support for tracing/iter programs. This way, a file based bpf iterator, which holds a reference to the link, can have its bpf program updated without creating new files. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175902.2475262-1-yhs@fb.com
| * | | | | bpf: Support bpf tracing/iter programs for BPF_LINK_CREATEYonghong Song2020-05-092-0/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Given a bpf program, the step to create an anonymous bpf iterator is: - create a bpf_iter_link, which combines bpf program and the target. In the future, there could be more information recorded in the link. A link_fd will be returned to the user space. - create an anonymous bpf iterator with the given link_fd. The bpf_iter_link can be pinned to bpffs mount file system to create a file based bpf iterator as well. The benefit to use of bpf_iter_link: - using bpf link simplifies design and implementation as bpf link is used for other tracing bpf programs. - for file based bpf iterator, bpf_iter_link provides a standard way to replace underlying bpf programs. - for both anonymous and free based iterators, bpf link query capability can be leveraged. The patch added support of tracing/iter programs for BPF_LINK_CREATE. A new link type BPF_LINK_TYPE_ITER is added to facilitate link querying. Currently, only prog_id is needed, so there is no additional in-kernel show_fdinfo() and fill_link_info() hook is needed for BPF_LINK_TYPE_ITER link. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
| * | | | | bpf: Allow loading of a bpf_iter programYonghong Song2020-05-092-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bpf_iter program is a tracing program with attach type BPF_TRACE_ITER. The load attribute attach_btf_id is used by the verifier against a particular kernel function, which represents a target, e.g., __bpf_iter__bpf_map for target bpf_map which is implemented later. The program return value must be 0 or 1 for now. 0 : successful, except potential seq_file buffer overflow which is handled by seq_file reader. 1 : request to restart the same object In the future, other return values may be used for filtering or teminating the iterator. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
| * | | | | bpf: Implement an interface to register bpf_iter targetsYonghong Song2020-05-092-1/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The target can call bpf_iter_reg_target() to register itself. The needed information: target: target name seq_ops: the seq_file operations for the target init_seq_private target callback to initialize seq_priv during file open fini_seq_private target callback to clean up seq_priv during file release seq_priv_size: the private_data size needed by the seq_file operations The target name represents a target which provides a seq_ops for iterating objects. The target can provide two callback functions, init_seq_private and fini_seq_private, called during file open/release time. For example, /proc/net/{tcp6, ipv6_route, netlink, ...}, net name space needs to be setup properly during file open and released properly during file release. Function bpf_iter_unreg_target() is also implemented to unregister a particular target. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175859.2474669-1-yhs@fb.com
| * | | | | bpf: Fix use-after-free of bpf_link when priming half-failsAndrii Nakryiko2020-05-011-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If bpf_link_prime() succeeds to allocate new anon file, but then fails to allocate ID for it, link priming is considered to be failed and user is supposed ot be able to directly kfree() bpf_link, because it was never exposed to user-space. But at that point file already keeps a pointer to bpf_link and will eventually call bpf_link_release(), so if bpf_link was kfree()'d by caller, that would lead to use-after-free. Fix this by first allocating ID and only then allocating file. Adding ID to link_idr is ok, because link at that point still doesn't have its ID set, so no user-space process can create a new FD for it. Fixes: a3b80e107894 ("bpf: Allocate ID for bpf_link") Reported-by: syzbot+39b64425f91b5aab714d@syzkaller.appspotmail.com Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200501185622.3088964-1-andriin@fb.com
| * | | | | bpf: Sharing bpf runtime stats with BPF_ENABLE_STATSSong Liu2020-05-011-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats. Typical userspace tools use kernel.bpf_stats_enabled as follows: 1. Enable kernel.bpf_stats_enabled; 2. Check program run_time_ns; 3. Sleep for the monitoring period; 4. Check program run_time_ns again, calculate the difference; 5. Disable kernel.bpf_stats_enabled. The problem with this approach is that only one userspace tool can toggle this sysctl. If multiple tools toggle the sysctl at the same time, the measurement may be inaccurate. To fix this problem while keep backward compatibility, introduce a new bpf command BPF_ENABLE_STATS. On success, this command enables stats and returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently, only one type, BPF_STATS_RUN_TIME, is supported. We can extend the command to support other types of stats in the future. With BPF_ENABLE_STATS, user space tool would have the following flow: 1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid; 2. Check program run_time_ns; 3. Sleep for the monitoring period; 4. Check program run_time_ns again, calculate the difference; 5. Close the fd. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
| * | | | | bpf: Fix unused variable warningArnd Bergmann2020-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hiding the only using of bpf_link_type_strs[] in an #ifdef causes an unused-variable warning: kernel/bpf/syscall.c:2280:20: error: 'bpf_link_type_strs' defined but not used [-Werror=unused-variable] 2280 | static const char *bpf_link_type_strs[] = { Move the definition into the same #ifdef. Fixes: f2e10bff16a0 ("bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200429132217.1294289-1-arnd@arndb.de
| * | | | | bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASHJakub Sitnicki2020-04-291-10/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | White-list map lookup for SOCKMAP/SOCKHASH from BPF. Lookup returns a pointer to a full socket and acquires a reference if necessary. To support it we need to extend the verifier to know that: (1) register storing the lookup result holds a pointer to socket, if lookup was done on SOCKMAP/SOCKHASH, and that (2) map lookup on SOCKMAP/SOCKHASH is a reference acquiring operation, which needs a corresponding reference release with bpf_sk_release. On sock_map side, lookup handlers exposed via bpf_map_ops now bump sk_refcnt if socket is reference counted. In turn, bpf_sk_select_reuseport, the only in-kernel user of SOCKMAP/SOCKHASH ops->map_lookup_elem, was updated to release the reference. Sockets fetched from a map can be used in the same way as ones returned by BPF socket lookup helpers, such as bpf_sk_lookup_tcp. In particular, they can be used with bpf_sk_assign to direct packets toward a socket on TC ingress path. Suggested-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com
| * | | | | bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_linkAndrii Nakryiko2020-04-284-21/+181
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add ability to fetch bpf_link details through BPF_OBJ_GET_INFO_BY_FD command. Also enhance show_fdinfo to potentially include bpf_link type-specific information (similarly to obj_info). Also introduce enum bpf_link_type stored in bpf_link itself and expose it in UAPI. bpf_link_tracing also now will store and return bpf_attach_type. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200429001614.1544-5-andriin@fb.com
| * | | | | bpf: Support GET_FD_BY_ID and GET_NEXT_ID for bpf_linkAndrii Nakryiko2020-04-281-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support to look up bpf_link by ID and iterate over all existing bpf_links in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is done as the very last step in finalizing bpf_link, together with installing FD. This approach allows users of bpf_link in kernel code to not worry about races between user-space and kernel code that hasn't finished attaching and initializing bpf_link. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200429001614.1544-4-andriin@fb.com
| * | | | | bpf: Allocate ID for bpf_linkAndrii Nakryiko2020-04-282-53/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Generate ID for each bpf_link using IDR, similarly to bpf_map and bpf_prog. bpf_link creation, initialization, attachment, and exposing to user-space through FD and ID is a complicated multi-step process, abstract it away through bpf_link_primer and bpf_link_prime(), bpf_link_settle(), and bpf_link_cleanup() internal API. They guarantee that until bpf_link is properly attached, user-space won't be able to access partially-initialized bpf_link either from FD or ID. All this allows to simplify bpf_link attachment and error handling code. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200429001614.1544-3-andriin@fb.com
| * | | | | bpf: Refactor bpf_link update handlingAndrii Nakryiko2020-04-282-9/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make bpf_link update support more generic by making it into another bpf_link_ops methods. This allows generic syscall handling code to be agnostic to various conditionally compiled features (e.g., the case of CONFIG_CGROUP_BPF). This also allows to keep link type-specific code to remain static within respective code base. Refactor existing bpf_cgroup_link code and take advantage of this. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200429001614.1544-2-andriin@fb.com
| * | | | | Merge branch 'work.sysctl' of ↵Daniel Borkmann2020-04-281-23/+12
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
| | * | | | | sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig2020-04-271-23/+12
| | | |_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | bpf, cgroup: Remove unused exportsChristoph Hellwig2020-04-271-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Except for a few of the networking hooks called from modular ipv4 or ipv6 code, all of hooks are just called from guaranteed to be built-in code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20200424064338.538313-2-hch@lst.de
| * | | | | bpf: Make verifier log more relevant by defaultAndrii Nakryiko2020-04-261-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To make BPF verifier verbose log more releavant and easier to use to debug verification failures, "pop" parts of log that were successfully verified. This has effect of leaving only verifier logs that correspond to code branches that lead to verification failure, which in practice should result in much shorter and more relevant verifier log dumps. This behavior is made the default behavior and can be overriden to do exhaustive logging by specifying BPF_LOG_LEVEL2 log level. Using BPF_LOG_LEVEL2 to disable this behavior is not ideal, because in some cases it's good to have BPF_LOG_LEVEL2 per-instruction register dump verbosity, but still have only relevant verifier branches logged. But for this patch, I didn't want to add any new flags. It might be worth-while to just rethink how BPF verifier logging is performed and requested and streamline it a bit. But this trimming of successfully verified branches seems to be useful and a good default behavior. To test this, I modified runqslower slightly to introduce read of uninitialized stack variable. Log (**truncated in the middle** to save many lines out of this commit message) BEFORE this change: ; int handle__sched_switch(u64 *ctx) 0: (bf) r6 = r1 ; struct task_struct *prev = (struct task_struct *)ctx[1]; 1: (79) r1 = *(u64 *)(r6 +8) func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct' 2: (b7) r2 = 0 ; struct event event = {}; 3: (7b) *(u64 *)(r10 -24) = r2 last_idx 3 first_idx 0 regs=4 stack=0 before 2: (b7) r2 = 0 4: (7b) *(u64 *)(r10 -32) = r2 5: (7b) *(u64 *)(r10 -40) = r2 6: (7b) *(u64 *)(r10 -48) = r2 ; if (prev->state == TASK_RUNNING) [ ... instruction dump from insn #7 through #50 are cut out ... ] 51: (b7) r2 = 16 52: (85) call bpf_get_current_comm#16 last_idx 52 first_idx 42 regs=4 stack=0 before 51: (b7) r2 = 16 ; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, 53: (bf) r1 = r6 54: (18) r2 = 0xffff8881f3868800 56: (18) r3 = 0xffffffff 58: (bf) r4 = r7 59: (b7) r5 = 32 60: (85) call bpf_perf_event_output#25 last_idx 60 first_idx 53 regs=20 stack=0 before 59: (b7) r5 = 32 61: (bf) r2 = r10 ; event.pid = pid; 62: (07) r2 += -16 ; bpf_map_delete_elem(&start, &pid); 63: (18) r1 = 0xffff8881f3868000 65: (85) call bpf_map_delete_elem#3 ; } 66: (b7) r0 = 0 67: (95) exit from 44 to 66: safe from 34 to 66: safe from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000 ; bpf_map_update_elem(&start, &pid, &ts, 0); 28: (bf) r2 = r10 ; 29: (07) r2 += -16 ; tsp = bpf_map_lookup_elem(&start, &pid); 30: (18) r1 = 0xffff8881f3868000 32: (85) call bpf_map_lookup_elem#1 invalid indirect read from stack off -16+0 size 4 processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4 Notice how there is a successful code path from instruction 0 through 67, few successfully verified jumps (44->66, 34->66), and only after that 11->28 jump plus error on instruction #32. AFTER this change (full verifier log, **no truncation**): ; int handle__sched_switch(u64 *ctx) 0: (bf) r6 = r1 ; struct task_struct *prev = (struct task_struct *)ctx[1]; 1: (79) r1 = *(u64 *)(r6 +8) func 'sched_switch' arg1 has btf_id 151 type STRUCT 'task_struct' 2: (b7) r2 = 0 ; struct event event = {}; 3: (7b) *(u64 *)(r10 -24) = r2 last_idx 3 first_idx 0 regs=4 stack=0 before 2: (b7) r2 = 0 4: (7b) *(u64 *)(r10 -32) = r2 5: (7b) *(u64 *)(r10 -40) = r2 6: (7b) *(u64 *)(r10 -48) = r2 ; if (prev->state == TASK_RUNNING) 7: (79) r2 = *(u64 *)(r1 +16) ; if (prev->state == TASK_RUNNING) 8: (55) if r2 != 0x0 goto pc+19 R1_w=ptr_task_struct(id=0,off=0,imm=0) R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000 ; trace_enqueue(prev->tgid, prev->pid); 9: (61) r1 = *(u32 *)(r1 +1184) 10: (63) *(u32 *)(r10 -4) = r1 ; if (!pid || (targ_pid && targ_pid != pid)) 11: (15) if r1 == 0x0 goto pc+16 from 11 to 28: R1_w=inv0 R2_w=inv0 R6_w=ctx(id=0,off=0,imm=0) R10=fp0 fp-8=mmmm???? fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000 ; bpf_map_update_elem(&start, &pid, &ts, 0); 28: (bf) r2 = r10 ; 29: (07) r2 += -16 ; tsp = bpf_map_lookup_elem(&start, &pid); 30: (18) r1 = 0xffff8881db3ce800 32: (85) call bpf_map_lookup_elem#1 invalid indirect read from stack off -16+0 size 4 processed 65 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 4 Notice how in this case, there are 0-11 instructions + jump from 11 to 28 is recorded + 28-32 instructions with error on insn #32. test_verifier test runner was updated to specify BPF_LOG_LEVEL2 for VERBOSE_ACCEPT expected result due to potentially "incomplete" success verbose log at BPF_LOG_LEVEL1. On success, verbose log will only have a summary of number of processed instructions, etc, but no branch tracing log. Having just a last succesful branch tracing seemed weird and confusing. Having small and clean summary log in success case seems quite logical and nice, though. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200423195850.1259827-1-andriin@fb.com
| * | | | | bpf: add bpf_ktime_get_boot_ns()Maciej Żenczykowski2020-04-262-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On a device like a cellphone which is constantly suspending and resuming CLOCK_MONOTONIC is not particularly useful for keeping track of or reacting to external network events. Instead you want to use CLOCK_BOOTTIME. Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns() based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC. Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | | net: bpf: Make bpf_ktime_get_ns() available to non GPL programsMaciej Żenczykowski2020-04-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The entire implementation is in kernel/bpf/helpers.c: BPF_CALL_0(bpf_ktime_get_ns) { /* NMI safe access to clock monotonic */ return ktime_get_mono_fast_ns(); } const struct bpf_func_proto bpf_ktime_get_ns_proto = { .func = bpf_ktime_get_ns, .gpl_only = false, .ret_type = RET_INTEGER, }; and this was presumably marked GPL due to kernel/time/timekeeping.c: EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns); and while that may make sense for kernel modules (although even that is doubtful), there is currently AFAICT no other source of time available to ebpf. Furthermore this is really just equivalent to clock_gettime(CLOCK_MONOTONIC) which is exposed to userspace (via vdso even to make it performant)... As such, I see no reason to keep the GPL restriction. (In the future I'd like to have access to time from Apache licensed ebpf code) Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | | | | bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=nStanislav Fomichev2020-04-262-0/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | linux-next build bot reported compile issue [1] with one of its configs. It looks like when we have CONFIG_NET=n and CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto definition (from net/core/filter.c) in cgroup_base_func_proto. I'm reshuffling the code a bit to make it work. The common helpers are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is exported from there. Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing bpf_user_rnd_u32. [1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72 Fixes: 0456ea170cd6 ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
| * | | | | bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}Stanislav Fomichev2020-04-261-17/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the following prog types don't fall back to bpf_base_func_proto() (instead they have cgroup_base_func_proto which has a limited set of helpers from bpf_base_func_proto): * BPF_PROG_TYPE_CGROUP_DEVICE * BPF_PROG_TYPE_CGROUP_SYSCTL * BPF_PROG_TYPE_CGROUP_SOCKOPT I don't see any specific reason why we shouldn't use bpf_base_func_proto(), every other type of program (except bpf-lirc and, understandably, tracing) use it, so let's fall back to bpf_base_func_proto for those prog types as well. This basically boils down to adding access to the following helpers: * BPF_FUNC_get_prandom_u32 * BPF_FUNC_get_smp_processor_id * BPF_FUNC_get_numa_node_id * BPF_FUNC_tail_call * BPF_FUNC_ktime_get_ns * BPF_FUNC_spin_lock (CAP_SYS_ADMIN) * BPF_FUNC_spin_unlock (CAP_SYS_ADMIN) * BPF_FUNC_jiffies64 (CAP_SYS_ADMIN) I've also added bpf_perf_event_output() because it's really handy for logging and debugging. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com
| * | | | | bpf: Remove set but not used variable 'dst_known'Mao Wenan2020-04-261-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: kernel/bpf/verifier.c:5603:18: warning: variable ‘dst_known’ set but not used [-Wunused-but-set-variable], delete this variable. Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200418013735.67882-1-maowenan@huawei.com
| * | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-04-253-14/+39
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simple overlapping changes to linux/vermagic.h Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | xdp: export the DEV_MAP_BULK_SIZE macroIoana Ciornei2020-04-221-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export the DEV_MAP_BULK_SIZE macro to the header file so that drivers can directly use it as the maximum number of xdp_frames received in the .ndo_xdp_xmit() callback. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | mm: remove vmalloc_user_node_flagsChristoph Hellwig2020-06-021-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Open code it in __bpf_map_area_alloc, which is the only caller. Also clean up __bpf_map_area_alloc to have a single vmalloc call with slightly different flags instead of the current two different calls. For this to compile for the nommu case add a __vmalloc_node_range stub to nommu.c. [akpm@linux-foundation.org: fix nommu.c build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200414131348.444715-27-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | mm: remove __vmalloc_node_flags_callerChristoph Hellwig2020-06-021-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just use __vmalloc_node instead which gets and extra argument. To be able to to use __vmalloc_node in all caller make it available outside of vmalloc and implement it in nommu.c. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Kelley <mikelley@microsoft.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200414131348.444715-25-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | mm: remove the pgprot argument to __vmallocChristoph Hellwig2020-06-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv] Acked-by: Gao Xiang <xiang@kernel.org> [erofs] Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Wei Liu <wei.liu@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | Merge branch 'linus' of ↵Linus Torvalds2020-06-011-9/+9
|\ \ \ \ \ \ \ | |_|_|_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Introduce crypto_shash_tfm_digest() and use it wherever possible. - Fix use-after-free and race in crypto_spawn_alg. - Add support for parallel and batch requests to crypto_engine. Algorithms: - Update jitter RNG for SP800-90B compliance. - Always use jitter RNG as seed in drbg. Drivers: - Add Arm CryptoCell driver cctrng. - Add support for SEV-ES to the PSP driver in ccp" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits) crypto: hisilicon - fix driver compatibility issue with different versions of devices crypto: engine - do not requeue in case of fatal error crypto: cavium/nitrox - Fix a typo in a comment crypto: hisilicon/qm - change debugfs file name from qm_regs to regs crypto: hisilicon/qm - add DebugFS for xQC and xQE dump crypto: hisilicon/zip - add debugfs for Hisilicon ZIP crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC crypto: hisilicon/qm - add debugfs to the QM state machine crypto: hisilicon/qm - add debugfs for QM crypto: stm32/crc32 - protect from concurrent accesses crypto: stm32/crc32 - don't sleep in runtime pm crypto: stm32/crc32 - fix multi-instance crypto: stm32/crc32 - fix run-time self test issue. crypto: stm32/crc32 - fix ext4 chksum BUG_ON() crypto: hisilicon/zip - Use temporary sqe when doing work crypto: hisilicon - add device error report through abnormal irq crypto: hisilicon - remove codes of directly report device errors through MSI crypto: hisilicon - QM memory management optimization crypto: hisilicon - unify initial value assignment into QM ...
| * | | | | | crypto: lib/sha1 - rename "sha" to "sha1"Eric Biggers2020-05-081-9/+9
| | |_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The library implementation of the SHA-1 compression function is confusingly called just "sha_transform()". Alongside it are some "SHA_" constants and "sha_init()". Presumably these are left over from a time when SHA just meant SHA-1. But now there are also SHA-2 and SHA-3, and moreover SHA-1 is now considered insecure and thus shouldn't be used. Therefore, rename these functions and constants to make it very clear that they are for SHA-1. Also add a comment to make it clear that these shouldn't be used. For the extra-misleadingly named "SHA_MESSAGE_BYTES", rename it to SHA1_BLOCK_SIZE and define it to just '64' rather than '(512/8)' so that it matches the same definition in <crypto/sha.h>. This prepares for merging <linux/cryptohash.h> into <crypto/sha.h>. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* | | | | | bpf: Fix a verifier issue when assigning 32bit reg states to 64bit onesJohn Fastabend2020-05-291-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the latest trunk llvm (llvm 11), I hit a verifier issue for test_prog subtest test_verif_scale1. The following simplified example illustrate the issue: w9 = 0 /* R9_w=inv0 */ r8 = *(u32 *)(r1 + 80) /* __sk_buff->data_end */ r7 = *(u32 *)(r1 + 76) /* __sk_buff->data */ ...... w2 = w9 /* R2_w=inv0 */ r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */ r6 += r2 /* R6_w=inv(id=0) */ r3 = r6 /* R3_w=inv(id=0) */ r3 += 14 /* R3_w=inv(id=0) */ if r3 > r8 goto end r5 = *(u32 *)(r6 + 0) /* R6_w=inv(id=0) */ <== error here: R6 invalid mem access 'inv' ... end: In real test_verif_scale1 code, "w9 = 0" and "w2 = w9" are in different basic blocks. In the above, after "r6 += r2", r6 becomes a scalar, which eventually caused the memory access error. The correct register state should be a pkt pointer. The inprecise register state starts at "w2 = w9". The 32bit register w9 is 0, in __reg_assign_32_into_64(), the 64bit reg->smax_value is assigned to be U32_MAX. The 64bit reg->smin_value is 0 and the 64bit register itself remains constant based on reg->var_off. In adjust_ptr_min_max_vals(), the verifier checks for a known constant, smin_val must be equal to smax_val. Since they are not equal, the verifier decides r6 is a unknown scalar, which caused later failure. The llvm10 does not have this issue as it generates different code: w9 = 0 /* R9_w=inv0 */ r8 = *(u32 *)(r1 + 80) /* __sk_buff->data_end */ r7 = *(u32 *)(r1 + 76) /* __sk_buff->data */ ...... r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */ r6 += r9 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */ r3 = r6 /* R3_w=pkt(id=0,off=0,r=0,imm=0) */ r3 += 14 /* R3_w=pkt(id=0,off=14,r=0,imm=0) */ if r3 > r8 goto end ... To fix the above issue, we can include zero in the test condition for assigning the s32_max_value and s32_min_value to their 64-bit equivalents smax_value and smin_value. Further, fix the condition to avoid doing zero extension bounds checks when s32_min_value <= 0. This could allow for the case where bounds 32-bit bounds (-1,1) get incorrectly translated to (0,1) 64-bit bounds. When in-fact the -1 min value needs to force U32_MAX bound. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/159077331983.6014.5758956193749002737.stgit@john-Precision-5820-Tower
* | | | | | bpf: Fix use-after-free in fmod_ret checkAlexei Starovoitov2020-05-291-13/+11
| |_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following issue: [ 436.749342] BUG: KASAN: use-after-free in bpf_trampoline_put+0x39/0x2a0 [ 436.749995] Write of size 4 at addr ffff8881ef38b8a0 by task kworker/3:5/2243 [ 436.750712] [ 436.752677] Workqueue: events bpf_prog_free_deferred [ 436.753183] Call Trace: [ 436.756483] bpf_trampoline_put+0x39/0x2a0 [ 436.756904] bpf_prog_free_deferred+0x16d/0x3d0 [ 436.757377] process_one_work+0x94a/0x15b0 [ 436.761969] [ 436.762130] Allocated by task 2529: [ 436.763323] bpf_trampoline_lookup+0x136/0x540 [ 436.763776] bpf_check+0x2872/0xa0a8 [ 436.764144] bpf_prog_load+0xb6f/0x1350 [ 436.764539] __do_sys_bpf+0x16d7/0x3720 [ 436.765825] [ 436.765988] Freed by task 2529: [ 436.767084] kfree+0xc6/0x280 [ 436.767397] bpf_trampoline_put+0x1fd/0x2a0 [ 436.767826] bpf_check+0x6832/0xa0a8 [ 436.768197] bpf_prog_load+0xb6f/0x1350 [ 436.768594] __do_sys_bpf+0x16d7/0x3720 prog->aux->trampoline = tr should be set only when prog is valid. Otherwise prog freeing will try to put trampoline via prog->aux->trampoline, but it may not point to a valid trampoline. Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: KP Singh <kpsingh@google.com> Link: https://lore.kernel.org/bpf/20200529043839.15824-2-alexei.starovoitov@gmail.com
* | | | | bpf: Prevent mmap()'ing read-only maps as writableAndrii Nakryiko2020-05-201-3/+14
| |_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to be frozen and is read-only on BPF program side, because that allows user-space to actually store a writable view to the page even after it is frozen. This is exacerbated by BPF verifier making a strong assumption that contents of such frozen map will remain unchanged. To prevent this, disallow mapping BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever. [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/ Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com
* | | | bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_rangeDaniel Borkmann2020-05-151-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Given bpf_probe_read{,str}() BPF helpers are now only available under CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range() as well to avoid hitting the same issue as in 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net
* | | | bpf: Enforce returning 0 for fentry/fexit progsYonghong Song2020-05-141-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, tracing/fentry and tracing/fexit prog return values are not enforced. In trampoline codes, the fentry/fexit prog return values are ignored. Let us enforce it to be 0 to avoid confusion and allows potential future extension. This patch also explicitly added return value checking for tracing/raw_tp, tracing/fmod_ret, and freplace programs such that these program return values can be anything. The purpose are two folds: 1. to make it explicit about return value expectations for these programs in verifier. 2. for tracing prog_type, if a future attach type is added, the default is -ENOTSUPP which will enforce to specify return value ranges explicitly. Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200514053206.1298415-1-yhs@fb.com
* | | | bpf: Fix bug in mmap() implementation for BPF array mapAndrii Nakryiko2020-05-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mmap() subsystem allows user-space application to memory-map region with initial page offset. This wasn't taken into account in initial implementation of BPF array memory-mapping. This would result in wrong pages, not taking into account requested page shift, being memory-mmaped into user-space. This patch fixes this gap and adds a test for such scenario. Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com
* | | | bpf: Fix error return code in map_lookup_and_delete_elem()Wei Yongjun2020-04-301-1/+3
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | Fix to return negative error code -EFAULT from the copy_to_user() error handling case instead of 0, as done elsewhere in this function. Fixes: bd513cd08f10 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200430081851.166996-1-weiyongjun1@huawei.com
* | | bpf: Make bpf_link_fops staticZou Wei2020-04-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following sparse warning: kernel/bpf/syscall.c:2289:30: warning: symbol 'bpf_link_fops' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/1587609160-117806-1-git-send-email-zou_wei@huawei.com
* | | bpf: Propagate expected_attach_type when verifying freplace programsToke Høiland-Jørgensen2020-04-241-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For some program types, the verifier relies on the expected_attach_type of the program being verified in the verification process. However, for freplace programs, the attach type was not propagated along with the verifier ops, so the expected_attach_type would always be zero for freplace programs. This in turn caused the verifier to sometimes make the wrong call for freplace programs. For all existing uses of expected_attach_type for this purpose, the result of this was only false negatives (i.e., freplace functions would be rejected by the verifier even though they were valid programs for the target they were replacing). However, should a false positive be introduced, this can lead to out-of-bounds accesses and/or crashes. The fix introduced in this patch is to propagate the expected_attach_type to the freplace program during verification, and reset it after that is done. Fixes: be8704ff07d2 ("bpf: Introduce dynamic program extensions") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158773526726.293902.13257293296560360508.stgit@toke.dk
* | | bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fdAndrii Nakryiko2020-04-241-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | Fix bug of not putting bpf_link in LINK_UPDATE command. Also enforce zeroed old_prog_fd if no BPF_F_REPLACE flag is specified. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200424052045.4002963-1-andriin@fb.com
* | | bpf: Fix handling of XADD on BTF memoryJann Horn2020-04-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check_xadd() can cause check_ptr_to_btf_access() to be executed with atype==BPF_READ and value_regno==-1 (meaning "just check whether the access is okay, don't tell me what type it will result in"). Handle that case properly and skip writing type information, instead of indexing into the registers at index -1 and writing into out-of-bounds memory. Note that at least at the moment, you can't actually write through a BTF pointer, so check_xadd() will reject the program after calling check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the verifier has already corrupted memory. This patch assumes that BTF pointers are not available in unprivileged programs. Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com
* | | bpf: Forbid XADD on spilled pointers for unprivileged usersJann Horn2020-04-201-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When check_xadd() verifies an XADD operation on a pointer to a stack slot containing a spilled pointer, check_stack_read() verifies that the read, which is part of XADD, is valid. However, since the placeholder value -1 is passed as `value_regno`, check_stack_read() can only return a binary decision and can't return the type of the value that was read. The intent here is to verify whether the value read from the stack slot may be used as a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and the type information is lost when check_stack_read() returns, this is not enforced, and a malicious user can abuse XADD to leak spilled kernel pointers. Fix it by letting check_stack_read() verify that the value is usable as a SCALAR_VALUE if no type information is passed to the caller. To be able to use __is_pointer_value() in check_stack_read(), move it up. Fix up the expected unprivileged error message for a BPF selftest that, until now, assumed that unprivileged users can use XADD on stack-spilled pointers. This also gives us a test for the behavior introduced in this patch for free. In theory, this could also be fixed by forbidding XADD on stack spills entirely, since XADD is a locked operation (for operations on memory with concurrency) and there can't be any concurrency on the BPF stack; but Alexei has said that he wants to keep XADD on stack slots working to avoid changes to the test suite [1]. The following BPF program demonstrates how to leak a BPF map pointer as an unprivileged user using this bug: // r7 = map_pointer BPF_LD_MAP_FD(BPF_REG_7, small_map), // r8 = launder(map_pointer) BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8), BPF_MOV64_IMM(BPF_REG_1, 0), ((struct bpf_insn) { .code = BPF_STX | BPF_DW | BPF_XADD, .dst_reg = BPF_REG_FP, .src_reg = BPF_REG_1, .off = -8 }), BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8), // store r8 into map BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7), BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP), BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4), BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0), BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), BPF_EXIT_INSN(), BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0), BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN() [1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/ Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com
* | | cpumap: Avoid warning when CONFIG_DEBUG_PER_CPU_MAPS is enabledToke Høiland-Jørgensen2020-04-201-1/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This happens because in this configuration, NR_CPUS can be larger than nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient to guard against hitting the warning in cpumask_check(). Fix this by explicitly checking the supplied key against the nr_cpumask_bits variable before calling cpu_possible(). Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Xiumei Mu <xmu@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com
* | bpf: remove unneeded conversion to bool in __mark_reg_unknownZou Wei2020-04-141-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This issue was detected by using the Coccinelle software: kernel/bpf/verifier.c:1259:16-21: WARNING: conversion to bool not needed here The conversion to bool is unneeded, remove it. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/1586779076-101346-1-git-send-email-zou_wei@huawei.com
* | bpf: Prevent re-mmap()'ing BPF map as writable for initially r/o mappingAndrii Nakryiko2020-04-141-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | VM_MAYWRITE flag during initial memory mapping determines if already mmap()'ed pages can be later remapped as writable ones through mprotect() call. To prevent user application to rewrite contents of memory-mapped as read-only and subsequently frozen BPF map, remove VM_MAYWRITE flag completely on initially read-only mapping. Alternatively, we could treat any memory-mapping on unfrozen map as writable and bump writecnt instead. But there is little legitimate reason to map BPF map as read-only and then re-mmap() it as writable through mprotect(), instead of just mmap()'ing it as read/write from the very beginning. Also, at the suggestion of Jann Horn, drop unnecessary refcounting in mmap operations. We can just rely on VMA holding reference to BPF map's file properly. Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/bpf/20200410202613.3679837-1-andriin@fb.com
* | bpf: Fix a typo "inacitve" -> "inactive"Qiujun Huang2020-04-061-1/+1
|/ | | | | | | | | There is a typo in struct bpf_lru_list's next_inactive_rotation description, thus fix s/inacitve/inactive/. Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/1585901254-30377-1-git-send-email-hqjagain@gmail.com
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2020-03-3011-737/+1830
|\ | | | | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * bpf: Implement bpf_prog replacement for an active bpf_cgroup_linkAndrii Nakryiko2020-03-302-0/+135
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from under given bpf_link. Currently this is only supported for bpf_cgroup_link, but will be extended to other kinds of bpf_links in follow-up patches. For bpf_cgroup_link, implemented functionality matches existing semantics for direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either unconditionally set new bpf_prog regardless of which bpf_prog is currently active under given bpf_link, or, optionally, can specify expected active bpf_prog. If active bpf_prog doesn't match expected one, no changes are performed, old bpf_link stays intact and attached, operation returns a failure. cgroup_bpf_replace() operation is resolving race between auto-detachment and bpf_prog update in the same fashion as it's done for bpf_link detachment, except in this case update has no way of succeeding because of target cgroup marked as dying. So in this case error is returned. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
| * bpf: Implement bpf_link-based cgroup BPF program attachmentAndrii Nakryiko2020-03-302-85/+294
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement new sub-command to attach cgroup BPF programs and return FD-based bpf_link back on success. bpf_link, once attached to cgroup, cannot be replaced, except by owner having its FD. Cgroup bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed. To prevent bpf_cgroup_link from keeping cgroup alive past the point when no BPF program can be executed, implement auto-detachment of link. When cgroup_bpf_release() is called, all attached bpf_links are forced to release cgroup refcounts, but they leave bpf_link otherwise active and allocated, as well as still owning underlying bpf_prog. This is because user-space might still have FDs open and active, so bpf_link as a user-referenced object can't be freed yet. Once last active FD is closed, bpf_link will be freed and underlying bpf_prog refcount will be dropped. But cgroup refcount won't be touched, because cgroup is released already. The inherent race between bpf_cgroup_link release (from closing last FD) and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So the only additional check required is when bpf_cgroup_link attempts to detach itself from cgroup. At that time we need to check whether there is still cgroup associated with that link. And if not, exit with success, because bpf_cgroup_link was already successfully detached. Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
| * bpf: Verifier, refine 32bit bound in do_refine_retval_rangeJohn Fastabend2020-03-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Further refine return values range in do_refine_retval_range by noting these are int return types (We will assume here that int is a 32-bit type). Two reasons to pull this out of original patch. First it makes the original fix impossible to backport. And second I've not seen this as being problematic in practice unlike the other case. Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower
| * bpf: Verifier, do explicit ALU32 bounds trackingJohn Fastabend2020-03-302-281/+852
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is not possible for the current verifier to track ALU32 and JMP ops correctly. This can result in the verifier aborting with errors even though the program should be verifiable. BPF codes that hit this can work around it by changin int variables to 64-bit types, marking variables volatile, etc. But this is all very ugly so it would be better to avoid these tricks. But, the main reason to address this now is do_refine_retval_range() was assuming return values could not be negative. Once we fixed this code that was previously working will no longer work. See do_refine_retval_range() patch for details. And we don't want to suddenly cause programs that used to work to fail. The simplest example code snippet that illustrates the problem is likely this, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] The expected 64-bit and 32-bit bounds after each line are shown on the right. The current issue is without the w* bounds we are forced to use the worst case bound of [0, U32_MAX]. To resolve this type of case, jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit 32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then from branch_taken logic creating new bounds we can track 32-bit bounds explicitly. The next case we observed is ALU ops after the jmp32, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] 55: w8 += 1 // r8 <- [0, U32_MAX+1] // w8 <- [0, X+1] In order to keep the bounds accurate at this point we also need to track ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU ops, mov, add, sub, etc. Finally there is a question of how and when to merge bounds. The cases enumerate here, 1. MOV ALU32 - zext 32-bit -> 64-bit 2. MOV ALU64 - copy 64-bit -> 32-bit 3. op ALU32 - zext 32-bit -> 64-bit 4. op ALU64 - n/a 5. jmp ALU32 - 64-bit: var32_off | upper_32_bits(var64_off) 6. jmp ALU64 - 32-bit: (>> (<< var64_off)) Details for each case, For "MOV ALU32" BPF arch zero extends so we simply copy the bounds from 32-bit into 64-bit ensuring we truncate var_off and 64-bit bounds correctly. See zext_32_to_64. For "MOV ALU64" copy all bounds including 32-bit into new register. If the src register had 32-bit bounds the dst register will as well. For "op ALU32" zero extend 32-bit into 64-bit the same as move, see zext_32_to_64. For "op ALU64" calculate both 32-bit and 64-bit bounds no merging is done here. Except we have a special case. When RSH or ARSH is done we can't simply ignore shifting bits from 64-bit reg into the 32-bit subreg. So currently just push bounds from 64-bit into 32-bit. This will be correct in the sense that they will represent a valid state of the register. However we could lose some accuracy if an ARSH is following a jmp32 operation. We can handle this special case in a follow up series. For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We special case if 64-bit bounds has zero'd upper 32bits at which point we can simply copy 32-bit bounds into 64-bit register. This catches a common compiler trick where upper 32-bits are zeroed and then 32-bit ops are used followed by a 64-bit compare or 64-bit op on a pointer. See __reg_combine_64_into_32(). For "jmp ALU64" cast the bounds of the 64bit to their 32-bit counterpart. For example s32_min_value = (s32)reg->smin_value. For tnum use only the lower 32bits via, (>>(<<var_off)). See __reg_combine_64_into_32(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower