summaryrefslogtreecommitdiffstats
path: root/include/net
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | | nexthop: Allow reporting activity of nexthop bucketsIdo Schimmel2021-03-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel periodically checks the idle time of nexthop buckets to determine if they are idle and can be re-populated with a new nexthop. When the resilient nexthop group is offloaded to hardware, the kernel will not see activity on nexthop buckets unless it is reported from hardware. Add a function that can be periodically called by device drivers to report activity on nexthop buckets after querying it from the underlying device. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | nexthop: Allow setting "offload" and "trap" indication of nexthop bucketsIdo Schimmel2021-03-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a function that can be called by device drivers to set "offload" or "trap" indication on nexthop buckets following nexthop notifications and other changes such as a neighbour becoming invalid. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | nexthop: Add data structures for resilient group notificationsIdo Schimmel2021-03-111-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add data structures that will be used for in-kernel notifications about addition / deletion of a resilient nexthop group and about changes to a hash bucket within a resilient group. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | nexthop: Add implementation of resilient next-hop groupsPetr Machata2021-03-111-0/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At this moment, there is only one type of next-hop group: an mpath group, which implements the hash-threshold algorithm. To select a next hop, hash-threshold algorithm first assigns a range of hashes to each next hop in the group, and then selects the next hop by comparing the SKB hash with the individual ranges. When a next hop is removed from the group, the ranges are recomputed, which leads to reassignment of parts of hash space from one next hop to another. While there will usually be some overlap between the previous and the new distribution, some traffic flows change the next hop that they resolve to. That causes problems e.g. as established TCP connections are reset, because the traffic is forwarded to a server that is not familiar with the connection. Resilient hashing is a technique to address the above problem. Resilient next-hop group has another layer of indirection between the group itself and its constituent next hops: a hash table. The selection algorithm uses a straightforward modulo operation to choose a hash bucket, and then reads the next hop that this bucket contains, and forwards traffic there. This indirection brings an important feature. In the hash-threshold algorithm, the range of hashes associated with a next hop must be continuous. With a hash table, mapping between the hash table buckets and the individual next hops is arbitrary. Therefore when a next hop is deleted the buckets that held it are simply reassigned to other next hops. When weights of next hops in a group are altered, it may be possible to choose a subset of buckets that are currently not used for forwarding traffic, and use those to satisfy the new next-hop distribution demands, keeping the "busy" buckets intact. This way, established flows are ideally kept being forwarded to the same endpoints through the same paths as before the next-hop group change. In a nutshell, the algorithm works as follows. Each next hop has a number of buckets that it wants to have, according to its weight and the number of buckets in the hash table. In case of an event that might cause bucket allocation change, the numbers for individual next hops are updated, similarly to how ranges are updated for mpath group next hops. Following that, a new "upkeep" algorithm runs, and for idle buckets that belong to a next hop that is currently occupying more buckets than it wants (it is "overweight"), it migrates the buckets to one of the next hops that has fewer buckets than it wants (it is "underweight"). If, after this, there are still underweight next hops, another upkeep run is scheduled to a future time. Chances are there are not enough "idle" buckets to satisfy the new demands. The algorithm has knobs to select both what it means for a bucket to be idle, and for whether and when to forcefully migrate buckets if there keeps being an insufficient number of idle buckets. There are three users of the resilient data structures. - The forwarding code accesses them under RCU, and does not modify them except for updating the time a selected bucket was last used. - Netlink code, running under RTNL, which may modify the data. - The delayed upkeep code, which may modify the data. This runs unlocked, and mutual exclusion between the RTNL code and the delayed upkeep is maintained by canceling the delayed work synchronously before the RTNL code touches anything. Later it restarts the delayed work if necessary. The RTNL code has to implement next-hop group replacement, next hop removal, etc. For removal, the mpath code uses a neat trick of having a backup next hop group structure, doing the necessary changes offline, and then RCU-swapping them in. However, the hash tables for resilient hashing are about an order of magnitude larger than the groups themselves (the size might be e.g. 4K entries), and it was felt that keeping two of them is an overkill. Both the primary next-hop group and the spare therefore use the same resilient table, and writers are careful to keep all references valid for the forwarding code. The hash table references next-hop group entries from the next-hop group that is currently in the primary role (i.e. not spare). During the transition from primary to spare, the table references a mix of both the primary group and the spare. When a next hop is deleted, the corresponding buckets are not set to NULL, but instead marked as empty, so that the pointer is valid and can be used by the forwarding code. The buckets are then migrated to a new next-hop group entry during upkeep. The only times that the hash table is invalid is the very beginning and very end of its lifetime. Between those points, it is always kept valid. This patch introduces the core support code itself. It does not handle notifications towards drivers, which are kept as if the group were an mpath one. It does not handle netlink either. The only bit currently exposed to user space is the new next-hop group type, and that is currently bounced. There is therefore no way to actually access this code. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | nexthop: Add a dedicated flag for multipath next-hop groupsPetr Machata2021-03-111-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the introduction of resilient nexthop groups, there will be two types of multipath groups: the current hash-threshold "mpath" ones, and resilient groups. Both are multipath, but to determine the fact, the system needs to consider two flags. This might prove costly in the datapath. Therefore, introduce a new flag, that should be set for next-hop groups that have more than one nexthop, and should be considered multipath. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | seg6: add support for IPv4 decapsulation in ipv6_srh_rcv()Julien Massonneau2021-03-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As specified in IETF RFC 8754, section 4.3.1.2, if the upper layer header is IPv4 or IPv6, perform IPv6 decapsulation and resubmit the decapsulated packet to the IPv4 or IPv6 module. Only IPv6 decapsulation was implemented. This patch adds support for IPv4 decapsulation. Link: https://tools.ietf.org/html/rfc8754#section-4.3.1.2 Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: add a helper to avoid issues with HW TX timestamping and SO_TXTIMEVladimir Oltean2021-03-101-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As explained in commit 29d98f54a4fe ("net: enetc: allow hardware timestamping on TX queues with tc-etf enabled"), hardware TX timestamping requires an skb with skb->tstamp = 0. When a packet is sent with SO_TXTIME, the skb->skb_mstamp_ns corrupts the value of skb->tstamp, so the drivers need to explicitly reset skb->tstamp to zero after consuming the TX time. Create a helper named skb_txtime_consumed() which does just that. All drivers which offload TC_SETUP_QDISC_ETF should implement it, and it would make it easier to assess during review whether they do the right thing in order to be compatible with hardware timestamping or not. Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2021-03-093-53/+11
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-03-09 The following pull-request contains BPF updates for your *net-next* tree. We've added 90 non-merge commits during the last 17 day(s) which contain a total of 114 files changed, 5158 insertions(+), 1288 deletions(-). The main changes are: 1) Faster bpf_redirect_map(), from Björn. 2) skmsg cleanup, from Cong. 3) Support for floating point types in BTF, from Ilya. 4) Documentation for sys_bpf commands, from Joe. 5) Support for sk_lookup in bpf_prog_test_run, form Lorenz. 6) Enable task local storage for tracing programs, from Song. 7) bpf_for_each_map_elem() helper, from Yonghong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | bpf, xdp: Make bpf_redirect_map() a map operationBjörn Töpel2021-03-101-19/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the bpf_redirect_map() implementation dispatches to the correct map-lookup function via a switch-statement. To avoid the dispatching, this change adds bpf_redirect_map() as a map operation. Each map provides its bpf_redirect_map() version, and correct function is automatically selected by the BPF verifier. A nice side-effect of the code movement is that the map lookup functions are now local to the map implementation files, which removes one additional function call. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20210308112907.559576-2-bjorn.topel@gmail.com
| | * | | | skmsg: Move sk_redir from TCP_SKB_CB to skbCong Wang2021-02-261-19/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly does not work for any other non-TCP protocols. We can move them to skb ext, but it introduces a memory allocation on fast path. Fortunately, we only need to a word-size to store all the information, because the flags actually only contains 1 bit so can be just packed into the lowest bit of the "pointer", which is stored as unsigned long. Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is no longer needed after ->sk_data_ready() so we can just drop it. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com
| | * | | | bpf: Compute data_end dynamically with JIT codeCong Wang2021-02-261-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we compute ->data_end with a compile-time constant offset of skb. But as Jakub pointed out, we can actually compute it in eBPF JIT code at run-time, so that we can competely get rid of ->data_end. This is similar to skb_shinfo(skb) computation in bpf_convert_shinfo_access(). Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com
| | * | | | bpf: Clean up sockmap related KconfigsCong Wang2021-02-262-9/+11
| | | |/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As suggested by John, clean up sockmap related Kconfigs: Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream parser, to reflect its name. Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL and CONFIG_INET, the latter is still needed at this point because of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched, as it is used by non-sockmap cases. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com
* | | | | Merge tag 'afs-netfs-lib-20210426' of ↵Linus Torvalds2021-04-271-1/+1
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: "Use the new netfs lib. Begin the process of overhauling the use of the fscache API by AFS and the introduction of support for features such as Transparent Huge Pages (THPs). - Add some support for THPs, including using core VM helper functions to find details of pages. - Use the ITER_XARRAY I/O iterator to mediate access to the pagecache as this handles THPs and doesn't require allocation of large bvec arrays. - Delegate address_space read/pre-write I/O methods for AFS to the netfs helper library. A method is provided to the library that allows it to issue a read against the server. This includes a change in use for PG_fscache (it now indicates a DIO write in progress from the marked page), so a number of waits need to be deployed for it. - Split the core AFS writeback function to make it easier to modify in future patches to handle writing to the cache. [This might feasibly make more sense moved out into my fscache-iter branch]. I've tested these with "xfstests -g quick" against an AFS volume (xfstests needs patching to make it work). With this, AFS without a cache passes all expected xfstests; with a cache, there's an extra failure, but that's also there before these patches. Fixing that probably requires a greater overhaul (as can be found on my fscache-iter branch, but that's for a later time). Thanks should go to Marc Dionne and Jeff Altman of AuriStor for exercising the patches in their test farm also" Link: https://lore.kernel.org/lkml/3785063.1619482429@warthog.procyon.org.uk/ * tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Use the netfs_write_begin() helper afs: Use new netfs lib read helper API afs: Use the fs operation ops to handle FetchData completion afs: Prepare for use of THPs afs: Extract writeback extension into its own function afs: Wait on PG_fscache before modifying/releasing a page afs: Use ITER_XARRAY for writing afs: Set up the iov_iter before calling afs_extract_data() afs: Log remote unmarshalling errors afs: Don't truncate iter during data fetch afs: Move key to afs_read struct afs: Print the operation debug_id when logging an unexpected data version afs: Pass page into dirty region helpers to provide THP size afs: Disable use of the fscache I/O routines
| * | | | | afs: Don't truncate iter during data fetchDavid Howells2021-04-231-1/+1
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't truncate the iterator to correspond to the actual data size when fetching the data from the server - rather, pass the length we want to read to rxrpc. This will allow the clear-after-read code in future to simply clear the remaining iterator capacity rather than having to reinitialise the iterator. Signed-off-by: David Howells <dhowells@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/158861249201.340223.13035445866976590375.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/159465825061.1377938.14403904452300909320.stgit@warthog.procyon.org.uk/ Link: https://lore.kernel.org/r/160588531418.3465195.10712005940763063144.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161118148567.1232039.13380313332292947956.stgit@warthog.procyon.org.uk/ # rfc Link: https://lore.kernel.org/r/161161044610.2537118.17908520793806837792.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/161340407907.1303470.6501394859511712746.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/161539551721.286939.14655713136572200716.stgit@warthog.procyon.org.uk/ # v4 Link: https://lore.kernel.org/r/161653807790.2770958.14034599989374173734.stgit@warthog.procyon.org.uk/ # v5 Link: https://lore.kernel.org/r/161789090823.6155.15673999934535049102.stgit@warthog.procyon.org.uk/ # v6
* | | | | Merge tag 'tty-5.13-rc1' of ↵Linus Torvalds2021-04-261-2/+0
|\ \ \ \ \ | |_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty and serial driver updates from Greg KH: "Here is the big set of tty and serial driver updates for 5.13-rc1. Actually busy this release, with a number of cleanups happening: - much needed core tty cleanups by Jiri Slaby - removal of unused and orphaned old-style serial drivers. If anyone shows up with this hardware, it is trivial to restore these but we really do not think they are in use anymore. - fixes and cleanups from Johan Hovold on a number of termios setting corner cases that loads of drivers got wrong as well as removing unneeded code due to tty core changes from long ago that were never propagated out to the drivers - loads of platform-specific serial port driver updates and fixes - coding style cleanups and other small fixes and updates all over the tty/serial tree. All of these have been in linux-next for a while now with no reported issues" * tag 'tty-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (186 commits) serial: extend compile-test coverage serial: stm32: add FIFO threshold configuration dt-bindings: serial: 8250: update TX FIFO trigger level dt-bindings: serial: stm32: override FIFO threshold properties dt-bindings: serial: add RX and TX FIFO properties serial: xilinx_uartps: drop low-latency workaround serial: vt8500: drop low-latency workaround serial: timbuart: drop low-latency workaround serial: sunsu: drop low-latency workaround serial: sifive: drop low-latency workaround serial: txx9: drop low-latency workaround serial: sa1100: drop low-latency workaround serial: rp2: drop low-latency workaround serial: rda: drop low-latency workaround serial: owl: drop low-latency workaround serial: msm_serial: drop low-latency workaround serial: mpc52xx_uart: drop low-latency workaround serial: meson: drop low-latency workaround serial: mcf: drop low-latency workaround serial: lpc32xx_hs: drop low-latency workaround ...
| * | | | Merge 5.12-rc6 into tty-nextGreg Kroah-Hartman2021-04-057-4/+52
| |\ \ \ \ | | | |_|/ | | |/| | | | | | | | | | | | | | | | | We need the serial/tty fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | net: nfc: nci: drop nci_uart_ops::recv_bufJiri Slaby2021-03-101-2/+0
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is noone setting nci_uart_ops::recv_buf, so the default one (nci_uart_default_recv_buf) is always used. So drop this hook, move nci_uart_default_recv_buf before the use in nci_uart_tty_receive and remove unused parameter flags. Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: Jiri Slaby <jslaby@suse.cz> Link: https://lore.kernel.org/r/20210302062214.29627-16-jslaby@suse.cz Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | ipv6: report errors for iftoken via netlink extackStephen Hemminger2021-04-081-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting iftoken can fail for several different reasons but there and there was no report to user as to the cause. Add netlink extended errors to the processing of the request. This requires adding additional argument through rtnl_af_ops set_link_af callback. Reported-by: Hongren Zheng <li@zenithal.me> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: sched: fix err handler in tcf_action_init()Vlad Buslov2021-04-081-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With recent changes that separated action module load from action initialization tcf_action_init() function error handling code was modified to manually release the loaded modules if loading/initialization of any further action in same batch failed. For the case when all modules successfully loaded and some of the actions were initialized before one of them failed in init handler. In this case for all previous actions the module will be released twice by the error handler: First time by the loop that manually calls module_put() for all ops, and second time by the action destroy code that puts the module after destroying the action. Reproduction: $ sudo tc actions add action simple sdata \"2\" index 2 $ sudo tc actions add action simple sdata \"1\" index 1 \ action simple sdata \"2\" index 2 RTNETLINK answers: File exists We have an error talking to the kernel $ sudo tc actions ls action simple total acts 1 action order 0: Simple <"2"> index 2 ref 1 bind 0 $ sudo tc actions flush action simple $ sudo tc actions ls action simple $ sudo tc actions add action simple sdata \"2\" index 2 Error: Failed to load TC action module. We have an error talking to the kernel $ lsmod | grep simple act_simple 20480 -1 Fix the issue by modifying module reference counting handling in action initialization code: - Get module reference in tcf_idr_create() and put it in tcf_idr_release() instead of taking over the reference held by the caller. - Modify users of tcf_action_init_1() to always release the module reference which they obtain before calling init function instead of assuming that created action takes over the reference. - Finally, modify tcf_action_init_1() to not release the module reference when overwriting existing action as this is no longer necessary since both upper and lower layers obtain and manage their own module references independently. Fixes: d349f9976868 ("net_sched: fix RTNL deadlock again caused by request_module()") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | net: sched: fix action overwrite reference countingVlad Buslov2021-04-081-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Action init code increments reference counter when it changes an action. This is the desired behavior for cls API which needs to obtain action reference for every classifier that points to action. However, act API just needs to change the action and releases the reference before returning. This sequence breaks when the requested action doesn't exist, which causes act API init code to create new action with specified index, but action is still released before returning and is deleted (unless it was referenced concurrently by cls API). Reproduction: $ sudo tc actions ls action gact $ sudo tc actions change action gact drop index 1 $ sudo tc actions ls action gact Extend tcf_action_init() to accept 'init_res' array and initialize it with action->ops->init() result. In tcf_action_add() remove pointers to created actions from actions array before passing it to tcf_action_put_many(). Fixes: cae422f379f3 ("net: sched: use reference counting action init") Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Revert "net: correct sk_acceptq_is_full()"Eric Dumazet2021-03-311-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit f211ac154577ec9ccf07c15f18a6abf0d9bdb4ab. We had similar attempt in the past, and we reverted it. History: 64a146513f8f12ba204b7bf5cb7e9505594ead42 [NET]: Revert incorrect accept queue backlog changes. 8488df894d05d6fa41c2bd298c335f944bb0e401 [NET]: Fix bugs in "Whether sock accept queue is full" checking I am adding a fat comment so that future attempts will be much harder. Fixes: f211ac154577 ("net: correct sk_acceptq_is_full()") Cc: iuyacan <yacanliu@163.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | Merge branch 'master' of ↵David S. Miller2021-03-312-3/+5
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2021-03-31 1) Fix ipv4 pmtu checks for xfrm anf vti interfaces. From Eyal Birger. 2) There are situations where the socket passed to xfrm_output_resume() is not the same as the one attached to the skb. Use the socket passed to xfrm_output_resume() to avoid lookup failures when xfrm is used with VRFs. From Evan Nimmo. 3) Make the xfrm_state_hash_generation sequence counter per network namespace because but its write serialization lock is also per network namespace. Write protection is insufficient otherwise. From Ahmed S. Darwish. 4) Fixup sctp featue flags when used with esp offload. From Xin Long. 5) xfrm BEET mode doesn't support fragments for inner packets. This is a limitation of the protocol, so no fix possible. Warn at least to notify the user about that situation. From Xin Long. 6) Fix NULL pointer dereference on policy lookup when namespaces are uses in combination with esp offload. 7) Fix incorrect transformation on esp offload when packets get segmented at layer 3. 8) Fix some user triggered usages of WARN_ONCE in the xfrm compat layer. From Dmitry Safonov. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | xfrm: Fix NULL pointer dereference on policy lookupSteffen Klassert2021-03-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When xfrm interfaces are used in combination with namespaces and ESP offload, we get a dst_entry NULL pointer dereference. This is because we don't have a dst_entry attached in the ESP offloading case and we need to do a policy lookup before the namespace transition. Fix this by expicit checking of skb_dst(skb) before accessing it. Fixes: f203b76d78092 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | net: xfrm: Use sequence counter with associated spinlockAhmed S. Darwish2021-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A sequence counter write section must be serialized or its internal state can get corrupted. A plain seqcount_t does not contain the information of which lock must be held to guaranteee write side serialization. For xfrm_state_hash_generation, use seqcount_spinlock_t instead of plain seqcount_t. This allows to associate the spinlock used for write serialization with the sequence counter. It thus enables lockdep to verify that the write serialization lock is indeed held before entering the sequence counter write section. If lockdep is disabled, this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | net: xfrm: Localize sequence counter per network namespaceAhmed S. Darwish2021-03-221-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A sequence counter write section must be serialized or its internal state can get corrupted. The "xfrm_state_hash_generation" seqcount is global, but its write serialization lock (net->xfrm.xfrm_state_lock) is instantiated per network namespace. The write protection is thus insufficient. To provide full protection, localize the sequence counter per network namespace instead. This should be safe as both the seqcount read and write sections access data exclusively within the network namespace. It also lays the foundation for transforming "xfrm_state_hash_generation" data type from seqcount_t to seqcount_LOCKNAME_t in further commits. Fixes: b65e3d7be06f ("xfrm: state: add sequence count to detect hash resizes") Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | xfrm: Use actual socket sk instead of skb socket for xfrm_output_resumeEvan Nimmo2021-03-031-1/+1
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A situation can occur where the interface bound to the sk is different to the interface bound to the sk attached to the skb. The interface bound to the sk is the correct one however this information is lost inside xfrm_output2 and instead the sk on the skb is used in xfrm_output_resume instead. This assumes that the sk bound interface and the bound interface attached to the sk within the skb are the same which can lead to lookup failures inside ip_route_me_harder resulting in the packet being dropped. We have an l2tp v3 tunnel with ipsec protection. The tunnel is in the global VRF however we have an encapsulated dot1q tunnel interface that is within a different VRF. We also have a mangle rule that marks the packets causing them to be processed inside ip_route_me_harder. Prior to commit 31c70d5956fc ("l2tp: keep original skb ownership") this worked fine as the sk attached to the skb was changed from the dot1q encapsulated interface to the sk for the tunnel which meant the interface bound to the sk and the interface bound to the skb were identical. Commit 46d6c5ae953c ("netfilter: use actual socket sk rather than skb sk when routing harder") fixed some of these issues however a similar problem existed in the xfrm code. Fixes: 31c70d5956fc ("l2tp: keep original skb ownership") Signed-off-by: Evan Nimmo <evan.nimmo@alliedtelesis.co.nz> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | | net: let skb_orphan_partial wake-up waiters.Paolo Abeni2021-03-301-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the mentioned helper can end-up freeing the socket wmem without waking-up any processes waiting for more write memory. If the partially orphaned skb is attached to an UDP (or raw) socket, the lack of wake-up can hang the user-space. Even for TCP sockets not calling the sk destructor could have bad effects on TSQ. Address the issue using skb_orphan to release the sk wmem before setting the new sock_efree destructor. Additionally bundle the whole ownership update in a new helper, so that later other potential users could avoid duplicate code. v1 -> v2: - use skb_orphan() instead of sort of open coding it (Eric) - provide an helper for the ownership change (Eric) Fixes: f6ba8d33cfbb ("netem: fix skb_orphan_partial()") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | sch_red: fix off-by-one checks in red_check_params()Eric Dumazet2021-03-251-2/+2
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes following syzbot report: UBSAN: shift-out-of-bounds in ./include/net/red.h:237:23 shift exponent 32 is too large for 32-bit type 'unsigned int' CPU: 1 PID: 8418 Comm: syz-executor170 Not tainted 5.12.0-rc4-next-20210324-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 red_set_parms include/net/red.h:237 [inline] choke_change.cold+0x3c/0xc8 net/sched/sch_choke.c:414 qdisc_create+0x475/0x12f0 net/sched/sch_api.c:1247 tc_modify_qdisc+0x4c8/0x1a50 net/sched/sch_api.c:1663 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5553 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2502 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x43f039 Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffdfa725168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000400488 RCX: 000000000043f039 RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000004 RBP: 0000000000403020 R08: 0000000000400488 R09: 0000000000400488 R10: 0000000000400488 R11: 0000000000000246 R12: 00000000004030b0 R13: 0000000000000000 R14: 00000000004ac018 R15: 0000000000400488 Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | sch_red: Fix a typoBhaskar Chowdhury2021-03-191-1/+1
| | | | | | | | | | | | | | | | | | | | | s/recalcultion/recalculation/ Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2021-03-181-0/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Several patches to testore use of memory barriers instead of RCU to ensure consistent access to ruleset, from Mark Tomlinson. 2) Fix dump of expectation via ctnetlink, from Florian Westphal. 3) GRE helper works for IPv6, from Ludovic Senecaux. 4) Set error on unsupported flowtable flags. 5) Use delayed instead of deferrable workqueue in the flowtable, from Yinjun Zhang. 6) Fix spurious EEXIST in case of add-after-delete flowtable in the same batch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | netfilter: nftables: allow to update flowtable flagsPablo Neira Ayuso2021-03-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Honor flowtable flags from the control update path. Disallow disabling to toggle hardware offload support though. Fixes: 8bb69f3b2918 ("netfilter: nf_tables: add flowtable offload control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | | can: dev: Move device back to init netns on owning netns deleteMartin Willi2021-03-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a non-initial netns is destroyed, the usual policy is to delete all virtual network interfaces contained, but move physical interfaces back to the initial netns. This keeps the physical interface visible on the system. CAN devices are somewhat special, as they define rtnl_link_ops even if they are physical devices. If a CAN interface is moved into a non-initial netns, destroying that netns lets the interface vanish instead of moving it back to the initial netns. default_device_exit() skips CAN interfaces due to having rtnl_link_ops set. Reproducer: ip netns add foo ip link set can0 netns foo ip netns delete foo WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60 CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1 Workqueue: netns cleanup_net [<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14) [<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8) [<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114) [<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac) [<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60) [<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380) [<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438) [<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8) [<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c) [<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c) To properly restore physical CAN devices to the initial netns on owning netns exit, introduce a flag on rtnl_link_ops that can be set by drivers. For CAN devices setting this flag, default_device_exit() considers them non-virtual, applying the usual namespace move. The issue was introduced in the commit mentioned below, as at that time CAN devices did not have a dellink() operation. Fixes: e008b5fc8dc7 ("net: Simplfy default_device_exit and improve batching.") Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org Signed-off-by: Martin Willi <martin@strongswan.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
* | | | tcp: relookup sock for RST+ACK packets handled by obsolete req sockAlexander Ovechkin2021-03-151-1/+1
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently tcp_check_req can be called with obsolete req socket for which big socket have been already created (because of CPU race or early demux assigning req socket to multiple packets in gro batch). Commit e0f9759f530bf789e984 ("tcp: try to keep packet if SYN_RCV race is lost") added retry in case when tcp_check_req is called for PSH|ACK packet. But if client sends RST+ACK immediatly after connection being established (it is performing healthcheck, for example) retry does not occur. In that case tcp_check_req tries to close req socket, leaving big socket active. Fixes: e0f9759f530 ("tcp: try to keep packet if SYN_RCV race is lost") Signed-off-by: Alexander Ovechkin <ovov@yandex-team.ru> Reported-by: Oleg Senin <olegsenin@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: correct sk_acceptq_is_full()liuyacan2021-03-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "backlog" argument in listen() specifies the maximom length of pending connections, so the accept queue should be considered full if there are exactly "backlog" elements. Signed-off-by: liuyacan <yacanliu@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: sched: validate stab valuesEric Dumazet2021-03-101-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iproute2 package is well behaved, but malicious user space can provide illegal shift values and trigger UBSAN reports. Add stab parameter to red_check_params() to validate user input. syzbot reported: UBSAN: shift-out-of-bounds in ./include/net/red.h:312:18 shift exponent 111 is too large for 64-bit type 'long unsigned int' CPU: 1 PID: 14662 Comm: syz-executor.3 Not tainted 5.12.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 red_calc_qavg_from_idle_time include/net/red.h:312 [inline] red_calc_qavg include/net/red.h:353 [inline] choke_enqueue.cold+0x18/0x3dd net/sched/sch_choke.c:221 __dev_xmit_skb net/core/dev.c:3837 [inline] __dev_queue_xmit+0x1943/0x2e00 net/core/dev.c:4150 neigh_hh_output include/net/neighbour.h:499 [inline] neigh_output include/net/neighbour.h:508 [inline] ip6_finish_output2+0x911/0x1700 net/ipv6/ip6_output.c:117 __ip6_finish_output net/ipv6/ip6_output.c:182 [inline] __ip6_finish_output+0x4c1/0xe10 net/ipv6/ip6_output.c:161 ip6_finish_output+0x35/0x200 net/ipv6/ip6_output.c:192 NF_HOOK_COND include/linux/netfilter.h:290 [inline] ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:215 dst_output include/net/dst.h:448 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] NF_HOOK include/linux/netfilter.h:295 [inline] ip6_xmit+0x127e/0x1eb0 net/ipv6/ip6_output.c:320 inet6_csk_xmit+0x358/0x630 net/ipv6/inet6_connection_sock.c:135 dccp_transmit_skb+0x973/0x12c0 net/dccp/output.c:138 dccp_send_reset+0x21b/0x2b0 net/dccp/output.c:535 dccp_finish_passive_close net/dccp/proto.c:123 [inline] dccp_finish_passive_close+0xed/0x140 net/dccp/proto.c:118 dccp_terminate_connection net/dccp/proto.c:958 [inline] dccp_close+0xb3c/0xe60 net/dccp/proto.c:1028 inet_release+0x12e/0x280 net/ipv4/af_inet.c:431 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478 __sock_release+0xcd/0x280 net/socket.c:599 sock_close+0x18/0x20 net/socket.c:1258 __fput+0x288/0x920 fs/file_table.c:280 task_work_run+0xdd/0x1a0 kernel/task_work.c:140 tracehook_notify_resume include/linux/tracehook.h:189 [inline] Fixes: 8afa10cbe281 ("net_sched: red: Avoid illegal values") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: fix suspecious RCU usage warningWei Wang2021-03-101-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzbot reported the suspecious RCU usage in nexthop_fib6_nh() when called from ipv6_route_seq_show(). The reason is ipv6_route_seq_start() calls rcu_read_lock_bh(), while nexthop_fib6_nh() calls rcu_dereference_rtnl(). The fix proposed is to add a variant of nexthop_fib6_nh() to use rcu_dereference_bh_rtnl() for ipv6_route_seq_show(). The reported trace is as follows: ./include/net/nexthop.h:416 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 2 locks held by syz-executor.0/17895: at: seq_read+0x71/0x12a0 fs/seq_file.c:169 at: seq_file_net include/linux/seq_file_net.h:19 [inline] at: ipv6_route_seq_start+0xaf/0x300 net/ipv6/ip6_fib.c:2616 stack backtrace: CPU: 1 PID: 17895 Comm: syz-executor.0 Not tainted 4.15.0-syzkaller #0 Call Trace: [<ffffffff849edf9e>] __dump_stack lib/dump_stack.c:17 [inline] [<ffffffff849edf9e>] dump_stack+0xd8/0x147 lib/dump_stack.c:53 [<ffffffff8480b7fa>] lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5745 [<ffffffff8459ada6>] nexthop_fib6_nh include/net/nexthop.h:416 [inline] [<ffffffff8459ada6>] ipv6_route_native_seq_show net/ipv6/ip6_fib.c:2488 [inline] [<ffffffff8459ada6>] ipv6_route_seq_show+0x436/0x7a0 net/ipv6/ip6_fib.c:2673 [<ffffffff81c556df>] seq_read+0xccf/0x12a0 fs/seq_file.c:276 [<ffffffff81dbc62c>] proc_reg_read+0x10c/0x1d0 fs/proc/inode.c:231 [<ffffffff81bc28ae>] do_loop_readv_writev fs/read_write.c:714 [inline] [<ffffffff81bc28ae>] do_loop_readv_writev fs/read_write.c:701 [inline] [<ffffffff81bc28ae>] do_iter_read+0x49e/0x660 fs/read_write.c:935 [<ffffffff81bc81ab>] vfs_readv+0xfb/0x170 fs/read_write.c:997 [<ffffffff81c88847>] kernel_readv fs/splice.c:361 [inline] [<ffffffff81c88847>] default_file_splice_read+0x487/0x9c0 fs/splice.c:416 [<ffffffff81c86189>] do_splice_to+0x129/0x190 fs/splice.c:879 [<ffffffff81c86f66>] splice_direct_to_actor+0x256/0x890 fs/splice.c:951 [<ffffffff81c8777d>] do_splice_direct+0x1dd/0x2b0 fs/splice.c:1060 [<ffffffff81bc4747>] do_sendfile+0x597/0xce0 fs/read_write.c:1459 [<ffffffff81bca205>] SYSC_sendfile64 fs/read_write.c:1520 [inline] [<ffffffff81bca205>] SyS_sendfile64+0x155/0x170 fs/read_write.c:1506 [<ffffffff81015fcf>] do_syscall_64+0x1ff/0x310 arch/x86/entry/common.c:305 [<ffffffff84a00076>] entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Wei Wang <weiwan@google.com> Cc: David Ahern <dsahern@kernel.org> Cc: Ido Schimmel <idosch@idosch.org> Cc: Petr Machata <petrm@nvidia.com> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Consolidate common blackhole dst opsDaniel Borkmann2021-03-101-0/+11
| |/ |/| | | | | | | | | | | | | | | | | Move generic blackhole dst ops to the core and use them from both ipv4_dst_blackhole_ops and ip6_dst_blackhole_ops where possible. No functional change otherwise. We need these also in other locations and having to define them over and over again is not great. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sendingJason A. Donenfeld2021-02-231-1/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The icmp{,v6}_send functions make all sorts of use of skb->cb, casting it with IPCB or IP6CB, assuming the skb to have come directly from the inet layer. But when the packet comes from the ndo layer, especially when forwarded, there's no telling what might be in skb->cb at that point. As a result, the icmp sending code risks reading bogus memory contents, which can result in nasty stack overflows such as this one reported by a user: panic+0x108/0x2ea __stack_chk_fail+0x14/0x20 __icmp_send+0x5bd/0x5c0 icmp_ndo_send+0x148/0x160 In icmp_send, skb->cb is cast with IPCB and an ip_options struct is read from it. The optlen parameter there is of particular note, as it can induce writes beyond bounds. There are quite a few ways that can happen in __ip_options_echo. For example: // sptr/skb are attacker-controlled skb bytes sptr = skb_network_header(skb); // dptr/dopt points to stack memory allocated by __icmp_send dptr = dopt->__data; // sopt is the corrupt skb->cb in question if (sopt->rr) { optlen = sptr[sopt->rr+1]; // corrupt skb->cb + skb->data soffset = sptr[sopt->rr+2]; // corrupt skb->cb + skb->data // this now writes potentially attacker-controlled data, over // flowing the stack: memcpy(dptr, sptr+sopt->rr, optlen); } In the icmpv6_send case, the story is similar, but not as dire, as only IP6CB(skb)->iif and IP6CB(skb)->dsthao are used. The dsthao case is worse than the iif case, but it is passed to ipv6_find_tlv, which does a bit of bounds checking on the value. This is easy to simulate by doing a `memset(skb->cb, 0x41, sizeof(skb->cb));` before calling icmp{,v6}_ndo_send, and it's only by good fortune and the rarity of icmp sending from that context that we've avoided reports like this until now. For example, in KASAN: BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xa0e/0x12b0 Write of size 38 at addr ffff888006f1f80e by task ping/89 CPU: 2 PID: 89 Comm: ping Not tainted 5.10.0-rc7-debug+ #5 Call Trace: dump_stack+0x9a/0xcc print_address_description.constprop.0+0x1a/0x160 __kasan_report.cold+0x20/0x38 kasan_report+0x32/0x40 check_memory_region+0x145/0x1a0 memcpy+0x39/0x60 __ip_options_echo+0xa0e/0x12b0 __icmp_send+0x744/0x1700 Actually, out of the 4 drivers that do this, only gtp zeroed the cb for the v4 case, while the rest did not. So this commit actually removes the gtp-specific zeroing, while putting the code where it belongs in the shared infrastructure of icmp{,v6}_ndo_send. This commit fixes the issue by passing an empty IPCB or IP6CB along to the functions that actually do the work. For the icmp_send, this was already trivial, thanks to __icmp_send providing the plumbing function. For icmpv6_send, this required a tiny bit of refactoring to make it behave like the v4 case, after which it was straight forward. Fixes: a2b78e9b2cac ("sunvnet: generate ICMP PTMUD messages for smaller port MTUs") Reported-by: SinYu <liuxyon@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/netdev/CAF=yD-LOF116aHub6RMe8vB8ZpnrrnoTdqhobEx+bvoA8AsP0w@mail.gmail.com/T/ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://lore.kernel.org/r/20210223131858.72082-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2021-02-171-0/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Add two helper functions to release one table and hooks from the netns and netlink event path. 2) Add table ownership infrastructure, this new infrastructure allows users to bind a table (and its content) to a process through the netlink socket. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nftables: introduce table ownershipPablo Neira Ayuso2021-02-151-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A userspace daemon like firewalld might need to monitor for netlink updates to detect its ruleset removal by the (global) flush ruleset command to ensure ruleset persistency. This adds extra complexity from userspace and, for some little time, the firewall policy is not in place. This patch adds the NFT_TABLE_F_OWNER flag which allows a userspace program to own the table that creates in exclusivity. Tables that are owned... - can only be updated and removed by the owner, non-owners hit EPERM if they try to update it or remove it. - are destroyed when the owner closes the netlink socket or the process is gone (implicit netlink socket closure). - are skipped by the global flush ruleset command. - are listed in the global ruleset. The userspace process that sets on the NFT_TABLE_F_OWNER flag need to leave open the netlink socket. A new NFTA_TABLE_OWNER netlink attribute specifies the netlink port ID to identify the owner from userspace. This patch also updates error reporting when an unknown table flag is specified to change it from EINVAL to EOPNOTSUPP given that EINVAL is usually reserved to report for malformed netlink messages to userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2021-02-161-0/+1
|\ \
| * | net: sched: fix police ext initializationVlad Buslov2021-02-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When police action is created by cls API tcf_exts_validate() first conditional that calls tcf_action_init_1() directly, the action idr is not updated according to latest changes in action API that require caller to commit newly created action to idr with tcf_idr_insert_many(). This results such action not being accessible through act API and causes crash reported by syzbot: ================================================================== BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline] BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] BUG: KASAN: null-ptr-deref in __tcf_idr_release net/sched/act_api.c:178 [inline] BUG: KASAN: null-ptr-deref in tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598 Read of size 4 at addr 0000000000000010 by task kworker/u4:5/204 CPU: 0 PID: 204 Comm: kworker/u4:5 Not tainted 5.11.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 __kasan_report mm/kasan/report.c:400 [inline] kasan_report.cold+0x5f/0xd5 mm/kasan/report.c:413 check_memory_region_inline mm/kasan/generic.c:179 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:185 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] __tcf_idr_release net/sched/act_api.c:178 [inline] tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598 tc_action_net_exit include/net/act_api.h:151 [inline] police_exit_net+0x168/0x360 net/sched/act_police.c:390 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:190 cleanup_net+0x4ea/0xb10 net/core/net_namespace.c:604 process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275 worker_thread+0x64c/0x1120 kernel/workqueue.c:2421 kthread+0x3b1/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296 ================================================================== Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 204 Comm: kworker/u4:5 Tainted: G B 5.11.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: netns cleanup_net Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 panic+0x306/0x73d kernel/panic.c:231 end_report+0x58/0x5e mm/kasan/report.c:100 __kasan_report mm/kasan/report.c:403 [inline] kasan_report.cold+0x67/0xd5 mm/kasan/report.c:413 check_memory_region_inline mm/kasan/generic.c:179 [inline] check_memory_region+0x13d/0x180 mm/kasan/generic.c:185 instrument_atomic_read include/linux/instrumented.h:71 [inline] atomic_read include/asm-generic/atomic-instrumented.h:27 [inline] __tcf_idr_release net/sched/act_api.c:178 [inline] tcf_idrinfo_destroy+0x129/0x1d0 net/sched/act_api.c:598 tc_action_net_exit include/net/act_api.h:151 [inline] police_exit_net+0x168/0x360 net/sched/act_police.c:390 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:190 cleanup_net+0x4ea/0xb10 net/core/net_namespace.c:604 process_one_work+0x98d/0x15f0 kernel/workqueue.c:2275 worker_thread+0x64c/0x1120 kernel/workqueue.c:2421 kthread+0x3b1/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296 Kernel Offset: disabled Fix the issue by calling tcf_idr_insert_many() after successful action initialization. Fixes: 0fedc63fadf0 ("net_sched: commit action insertions together") Reported-by: syzbot+151e3e714d34ae4ce7e8@syzkaller.appspotmail.com Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: dsa: add MRP supportHoratiu Vultur2021-02-161-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for offloading MRP in HW. Currently implement the switchdev calls 'SWITCHDEV_OBJ_ID_MRP', 'SWITCHDEV_OBJ_ID_RING_ROLE_MRP', to allow to create MRP instances and to set the role of these instances. Add DSA_NOTIFIER_MRP_ADD/DEL and DSA_NOTIFIER_MRP_ADD/DEL_RING_ROLE which calls to .port_mrp_add/del and .port_mrp_add/del_ring_role in the DSA driver for the switch. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: mrp: Extend ring_role_mrp and in_role_mrpHoratiu Vultur2021-02-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the member sw_backup to the structures switchdev_obj_ring_role_mrp and switchdev_obj_in_role_mrp. In this way the SW can call the driver in 2 ways, once when sw_backup is set to false, meaning that the driver should implement this completely in HW. And if that is not supported the SW will call again but with sw_backup set to true, meaning that the HW should help or allow the SW to run the protocol. For example when role is MRM, if the HW can't detect when it stops receiving MRP Test frames but it can trap these frames to CPU, then it needs to return -EOPNOTSUPP when sw_backup is false and return 0 when sw_backup is true. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | switchdev: mrp: Remove CONFIG_BRIDGE_MRPHoratiu Vultur2021-02-161-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | Remove #IS_ENABLED(CONFIG_BRIDGE_MRP) from switchdev.h. This will simplify the code implements MRP callbacks and will be similar with the vlan filtering. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2021-02-164-0/+11
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2021-02-16 The following pull-request contains BPF updates for your *net-next* tree. There's a small merge conflict between 7eeba1706eba ("tcp: Add receive timestamp support for receive zerocopy.") from net-next tree and 9cacf81f8161 ("bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows: [...] lock_sock(sk); err = tcp_zerocopy_receive(sk, &zc, &tss); err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname, &zc, &len, err); release_sock(sk); [...] We've added 116 non-merge commits during the last 27 day(s) which contain a total of 156 files changed, 5662 insertions(+), 1489 deletions(-). The main changes are: 1) Adds support of pointers to types with known size among global function args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov. 2) Add bpf_iter for task_vma which can be used to generate information similar to /proc/pid/maps, from Song Liu. 3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow rewriting bind user ports from BPF side below the ip_unprivileged_port_start range, both from Stanislav Fomichev. 4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map as well as per-cpu maps for the latter, from Alexei Starovoitov. 5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer for sleepable programs, both from KP Singh. 6) Extend verifier to enable variable offset read/write access to the BPF program stack, from Andrei Matei. 7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to query device MTU from programs, from Jesper Dangaard Brouer. 8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF tracing programs, from Florent Revest. 9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when otherwise too many passes are required, from Gary Lin. 10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function verification both related to zero-extension handling, from Ilya Leoshkevich. 11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa. 12) Batch of AF_XDP selftest cleanups and small performance improvement for libbpf's xsk map redirect for newer kernels, from Björn Töpel. 13) Follow-up BPF doc and verifier improvements around atomics with BPF_FETCH, from Brendan Jackman. 14) Permit zero-sized data sections e.g. if ELF .rodata section contains read-only data from local variables, from Yonghong Song. 15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net, veth: Alloc skb in bulk for ndo_xdp_xmitLorenzo Bianconi2021-02-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split ndo_xdp_xmit and ndo_start_xmit use cases in veth_xdp_rcv routine in order to alloc skbs in bulk for XDP_PASS verdict. Introduce xdp_alloc_skb_bulk utility routine to alloc skb bulk list. The proposed approach has been tested in the following scenario: eth (ixgbe) --> XDP_REDIRECT --> veth0 --> (remote-ns) veth1 --> XDP_PASS XDP_REDIRECT: xdp_redirect_map bpf sample XDP_PASS: xdp_rxq_info bpf sample traffic generator: pkt_gen sending udp traffic on a remote device bpf-next master: ~3.64Mpps bpf-next + skb bulking allocation: ~3.79Mpps Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/a14a30d3c06fff24e13f836c733d80efc0bd6eb5.1611957532.git.lorenzo@kernel.org
| * | | bpf: Allow rewriting to ports under ip_unprivileged_port_startStanislav Fomichev2021-01-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port to the privileged ones (< ip_unprivileged_port_start), but it will be rejected later on in the __inet_bind or __inet6_bind. Let's add another return value to indicate that CAP_NET_BIND_SERVICE check should be ignored. Use the same idea as we currently use in cgroup/egress where bit #1 indicates CN. Instead, for cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should be bypassed. v5: - rename flags to be less confusing (Andrey Ignatov) - rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags and accept BPF_RET_SET_CN (no behavioral changes) v4: - Add missing IPv6 support (Martin KaFai Lau) v3: - Update description (Martin KaFai Lau) - Fix capability restore in selftest (Martin KaFai Lau) v2: - Switch to explicit return code (Martin KaFai Lau) Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com
| * | | bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVEStanislav Fomichev2021-01-202-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE. We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom call in do_tcp_getsockopt using the on-stack data. This removes 3% overhead for locking/unlocking the socket. Without this patch: 3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt | --3.30%--__cgroup_bpf_run_filter_getsockopt | --0.81%--__kmalloc With the patch applied: 0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern Note, exporting uapi/tcp.h requires removing netinet/tcp.h from test_progs.h because those headers have confliciting definitions. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
| * | | net, xdp: Introduce xdp_build_skb_from_frame utility routineLorenzo Bianconi2021-01-201-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Respect to __xdp_build_skb_from_frame, xdp_build_skb_from_frame will allocate the skb object. Rely on xdp_build_skb_from_frame in veth driver. Introduce missing xdp metadata support in veth_xdp_rcv_one routine. Add missing metadata support in veth_xdp_rcv_one(). Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>