From c54bc0fc84214b203f7a0ebfd1bd308ce2abe920 Mon Sep 17 00:00:00 2001 From: Anna-Maria Behnsen Date: Tue, 5 Apr 2022 21:17:32 +0200 Subject: timers: Fix warning condition in __run_timers() When the timer base is empty, base::next_expiry is set to base::clk + NEXT_TIMER_MAX_DELTA and base::next_expiry_recalc is false. When no timer is queued until jiffies reaches base::next_expiry value, the warning for not finding any expired timer and base::next_expiry_recalc is false in __run_timers() triggers. To prevent triggering the warning in this valid scenario base::timers_pending needs to be added to the warning condition. Fixes: 31cd0e119d50 ("timers: Recalculate next timer interrupt only when necessary") Reported-by: Johannes Berg Signed-off-by: Anna-Maria Behnsen Signed-off-by: Thomas Gleixner Reviewed-by: Frederic Weisbecker Link: https://lore.kernel.org/r/20220405191732.7438-3-anna-maria@linutronix.de --- kernel/time/timer.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 85f1021ad459..9dd2a39cb3b0 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1722,11 +1722,14 @@ static inline void __run_timers(struct timer_base *base) time_after_eq(jiffies, base->next_expiry)) { levels = collect_expired_timers(base, heads); /* - * The only possible reason for not finding any expired - * timer at this clk is that all matching timers have been - * dequeued. + * The two possible reasons for not finding any expired + * timer at this clk are that all matching timers have been + * dequeued or no timer has been queued since + * base::next_expiry was set to base::clk + + * NEXT_TIMER_MAX_DELTA. */ - WARN_ON_ONCE(!levels && !base->next_expiry_recalc); + WARN_ON_ONCE(!levels && !base->next_expiry_recalc + && base->timers_pending); base->clk++; base->next_expiry = __next_timer_interrupt(base); -- cgit v1.2.3-71-gd317 From 40e97e42961f8c6cc7bd5fe67cc18417e02d78f1 Mon Sep 17 00:00:00 2001 From: Paul Gortmaker Date: Mon, 6 Dec 2021 09:59:50 -0500 Subject: tick/nohz: Use WARN_ON_ONCE() to prevent console saturation While running some testing on code that happened to allow the variable tick_nohz_full_running to get set but with no "possible" NOHZ cores to back up that setting, this warning triggered: if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) WARN_ON(tick_nohz_full_running); The console was overwhemled with an endless stream of one WARN per tick per core and there was no way to even see what was going on w/o using a serial console to capture it and then trace it back to this. Change it to WARN_ON_ONCE(). Fixes: 08ae95f4fd3b ("nohz_full: Allow the boot CPU to be nohz_full") Signed-off-by: Paul Gortmaker Signed-off-by: Thomas Gleixner Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211206145950.10927-3-paul.gortmaker@windriver.com --- kernel/time/tick-sched.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 2d76c91b85de..3506f6ed790c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -188,7 +188,7 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now) */ if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) { #ifdef CONFIG_NO_HZ_FULL - WARN_ON(tick_nohz_full_running); + WARN_ON_ONCE(tick_nohz_full_running); #endif tick_do_timer_cpu = cpu; } -- cgit v1.2.3-71-gd317 From 9c95bc25ad3b1a2240cd1f896569292a57d3ce85 Mon Sep 17 00:00:00 2001 From: Jiapeng Chong Date: Mon, 14 Feb 2022 16:47:39 +0800 Subject: tick/sched: Fix non-kernel-doc comment Fixes the following W=1 kernel build warning: kernel/time/tick-sched.c:1563: warning: This comment starts with '/**', but isn't a kernel-doc comment. Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20220214084739.63228-1-jiapeng.chong@linux.alibaba.com --- kernel/time/tick-sched.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 3506f6ed790c..d257721c68b8 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -1538,7 +1538,7 @@ void tick_cancel_sched_timer(int cpu) } #endif -/** +/* * Async notification about clocksource changes */ void tick_clock_notify(void) @@ -1559,7 +1559,7 @@ void tick_oneshot_notify(void) set_bit(0, &ts->check_clocks); } -/** +/* * Check, if a change happened, which makes oneshot possible. * * Called cyclic from the hrtimer softirq (driven by the timer -- cgit v1.2.3-71-gd317 From 08d835dff916bfe8f45acc7b92c7af6c4081c8a7 Mon Sep 17 00:00:00 2001 From: Rei Yamamoto Date: Thu, 31 Mar 2022 09:33:09 +0900 Subject: genirq/affinity: Consider that CPUs on nodes can be unbalanced If CPUs on a node are offline at boot time, the number of nodes is different when building affinity masks for present cpus and when building affinity masks for possible cpus. This causes the following problem: In the case that the number of vectors is less than the number of nodes there are cases where bits of masks for present cpus are overwritten when building masks for possible cpus. Fix this by excluding CPUs, which are not part of the current build mask (present/possible). [ tglx: Massaged changelog and added comment ] Fixes: b82592199032 ("genirq/affinity: Spread IRQs to all available NUMA nodes") Signed-off-by: Rei Yamamoto Signed-off-by: Thomas Gleixner Reviewed-by: Ming Lei Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220331003309.10891-1-yamamoto.rei@jp.fujitsu.com --- kernel/irq/affinity.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index f7ff8919dc9b..fdf170404650 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -269,8 +269,9 @@ static int __irq_build_affinity_masks(unsigned int startvec, */ if (numvecs <= nodes) { for_each_node_mask(n, nodemsk) { - cpumask_or(&masks[curvec].mask, &masks[curvec].mask, - node_to_cpumask[n]); + /* Ensure that only CPUs which are in both masks are set */ + cpumask_and(nmsk, cpu_mask, node_to_cpumask[n]); + cpumask_or(&masks[curvec].mask, &masks[curvec].mask, nmsk); if (++curvec == last_affv) curvec = firstvec; } -- cgit v1.2.3-71-gd317 From 9e949a3886356fe9112c6f6f34a6e23d1d35407f Mon Sep 17 00:00:00 2001 From: Nadav Amit Date: Sat, 19 Mar 2022 00:20:15 -0700 Subject: smp: Fix offline cpu check in flush_smp_call_function_queue() The check in flush_smp_call_function_queue() for callbacks that are sent to offline CPUs currently checks whether the queue is empty. However, flush_smp_call_function_queue() has just deleted all the callbacks from the queue and moved all the entries into a local list. This checks would only be positive if some callbacks were added in the short time after llist_del_all() was called. This does not seem to be the intention of this check. Change the check to look at the local list to which the entries were moved instead of the queue from which all the callbacks were just removed. Fixes: 8d056c48e4862 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline") Signed-off-by: Nadav Amit Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20220319072015.1495036-1-namit@vmware.com --- kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/smp.c b/kernel/smp.c index 01a7c1706a58..65a630f62363 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -579,7 +579,7 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline) /* There shouldn't be any pending callbacks on an offline CPU. */ if (unlikely(warn_cpu_offline && !cpu_online(smp_processor_id()) && - !warned && !llist_empty(head))) { + !warned && entry != NULL)) { warned = true; WARN(1, "IPI on offline CPU %d\n", smp_processor_id()); -- cgit v1.2.3-71-gd317 From b7ba6d8dc3569e49800ef0136799f26f43e237e8 Mon Sep 17 00:00:00 2001 From: Steven Price Date: Mon, 11 Apr 2022 16:22:32 +0100 Subject: cpu/hotplug: Remove the 'cpu' member of cpuhp_cpu_state Currently the setting of the 'cpu' member of struct cpuhp_cpu_state in cpuhp_create() is too late as it is used earlier in _cpu_up(). If kzalloc_node() in __smpboot_create_thread() fails then the rollback will be done with st->cpu==0 causing CPU0 to be erroneously set to be dying, causing the scheduler to get mightily confused and throw its toys out of the pram. However the cpu number is actually available directly, so simply remove the 'cpu' member and avoid the problem in the first place. Fixes: 2ea46c6fc945 ("cpumask/hotplug: Fix cpu_dying() state tracking") Signed-off-by: Steven Price Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20220411152233.474129-2-steven.price@arm.com --- kernel/cpu.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/cpu.c b/kernel/cpu.c index 5797c2a7a93f..d0a9aa0b42e8 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -71,7 +71,6 @@ struct cpuhp_cpu_state { bool rollback; bool single; bool bringup; - int cpu; struct hlist_node *node; struct hlist_node *last; enum cpuhp_state cb_state; @@ -475,7 +474,7 @@ static inline bool cpu_smt_allowed(unsigned int cpu) { return true; } #endif static inline enum cpuhp_state -cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target) +cpuhp_set_state(int cpu, struct cpuhp_cpu_state *st, enum cpuhp_state target) { enum cpuhp_state prev_state = st->state; bool bringup = st->state < target; @@ -486,14 +485,15 @@ cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target) st->target = target; st->single = false; st->bringup = bringup; - if (cpu_dying(st->cpu) != !bringup) - set_cpu_dying(st->cpu, !bringup); + if (cpu_dying(cpu) != !bringup) + set_cpu_dying(cpu, !bringup); return prev_state; } static inline void -cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state) +cpuhp_reset_state(int cpu, struct cpuhp_cpu_state *st, + enum cpuhp_state prev_state) { bool bringup = !st->bringup; @@ -520,8 +520,8 @@ cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state) } st->bringup = bringup; - if (cpu_dying(st->cpu) != !bringup) - set_cpu_dying(st->cpu, !bringup); + if (cpu_dying(cpu) != !bringup) + set_cpu_dying(cpu, !bringup); } /* Regular hotplug invocation of the AP hotplug thread */ @@ -541,15 +541,16 @@ static void __cpuhp_kick_ap(struct cpuhp_cpu_state *st) wait_for_ap_thread(st, st->bringup); } -static int cpuhp_kick_ap(struct cpuhp_cpu_state *st, enum cpuhp_state target) +static int cpuhp_kick_ap(int cpu, struct cpuhp_cpu_state *st, + enum cpuhp_state target) { enum cpuhp_state prev_state; int ret; - prev_state = cpuhp_set_state(st, target); + prev_state = cpuhp_set_state(cpu, st, target); __cpuhp_kick_ap(st); if ((ret = st->result)) { - cpuhp_reset_state(st, prev_state); + cpuhp_reset_state(cpu, st, prev_state); __cpuhp_kick_ap(st); } @@ -581,7 +582,7 @@ static int bringup_wait_for_ap(unsigned int cpu) if (st->target <= CPUHP_AP_ONLINE_IDLE) return 0; - return cpuhp_kick_ap(st, st->target); + return cpuhp_kick_ap(cpu, st, st->target); } static int bringup_cpu(unsigned int cpu) @@ -704,7 +705,7 @@ static int cpuhp_up_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st, ret, cpu, cpuhp_get_step(st->state)->name, st->state); - cpuhp_reset_state(st, prev_state); + cpuhp_reset_state(cpu, st, prev_state); if (can_rollback_cpu(st)) WARN_ON(cpuhp_invoke_callback_range(false, cpu, st, prev_state)); @@ -721,7 +722,6 @@ static void cpuhp_create(unsigned int cpu) init_completion(&st->done_up); init_completion(&st->done_down); - st->cpu = cpu; } static int cpuhp_should_run(unsigned int cpu) @@ -875,7 +875,7 @@ static int cpuhp_kick_ap_work(unsigned int cpu) cpuhp_lock_release(true); trace_cpuhp_enter(cpu, st->target, prev_state, cpuhp_kick_ap_work); - ret = cpuhp_kick_ap(st, st->target); + ret = cpuhp_kick_ap(cpu, st, st->target); trace_cpuhp_exit(cpu, st->state, prev_state, ret); return ret; @@ -1107,7 +1107,7 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st, ret, cpu, cpuhp_get_step(st->state)->name, st->state); - cpuhp_reset_state(st, prev_state); + cpuhp_reset_state(cpu, st, prev_state); if (st->state < prev_state) WARN_ON(cpuhp_invoke_callback_range(true, cpu, st, @@ -1134,7 +1134,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen, cpuhp_tasks_frozen = tasks_frozen; - prev_state = cpuhp_set_state(st, target); + prev_state = cpuhp_set_state(cpu, st, target); /* * If the current CPU state is in the range of the AP hotplug thread, * then we need to kick the thread. @@ -1165,7 +1165,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen, ret = cpuhp_down_callbacks(cpu, st, target); if (ret && st->state < prev_state) { if (st->state == CPUHP_TEARDOWN_CPU) { - cpuhp_reset_state(st, prev_state); + cpuhp_reset_state(cpu, st, prev_state); __cpuhp_kick_ap(st); } else { WARN(1, "DEAD callback error for CPU%d", cpu); @@ -1352,7 +1352,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target) cpuhp_tasks_frozen = tasks_frozen; - cpuhp_set_state(st, target); + cpuhp_set_state(cpu, st, target); /* * If the current CPU state is in the range of the AP hotplug thread, * then we need to kick the thread once more. -- cgit v1.2.3-71-gd317 From 9e02977bfad006af328add9434c8bffa40e053bb Mon Sep 17 00:00:00 2001 From: Chao Gao Date: Wed, 13 Apr 2022 08:32:22 +0200 Subject: dma-direct: avoid redundant memory sync for swiotlb When we looked into FIO performance with swiotlb enabled in VM, we found swiotlb_bounce() is always called one more time than expected for each DMA read request. It turns out that the bounce buffer is copied to original DMA buffer twice after the completion of a DMA request (one is done by in dma_direct_sync_single_for_cpu(), the other by swiotlb_tbl_unmap_single()). But the content in bounce buffer actually doesn't change between the two rounds of copy. So, one round of copy is redundant. Pass DMA_ATTR_SKIP_CPU_SYNC flag to swiotlb_tbl_unmap_single() to skip the memory copy in it. This fix increases FIO 64KB sequential read throughput in a guest with swiotlb=force by 5.6%. Fixes: 55897af63091 ("dma-direct: merge swiotlb_dma_ops into the dma_direct code") Reported-by: Wang Zhaoyang1 Reported-by: Gao Liang Signed-off-by: Chao Gao Reviewed-by: Kevin Tian Signed-off-by: Christoph Hellwig --- kernel/dma/direct.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h index 4632b0f4f72e..8a6cd53dbe8c 100644 --- a/kernel/dma/direct.h +++ b/kernel/dma/direct.h @@ -114,6 +114,7 @@ static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr, dma_direct_sync_single_for_cpu(dev, addr, size, dir); if (unlikely(is_swiotlb_buffer(dev, phys))) - swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); + swiotlb_tbl_unmap_single(dev, phys, size, dir, + attrs | DMA_ATTR_SKIP_CPU_SYNC); } #endif /* _KERNEL_DMA_DIRECT_H */ -- cgit v1.2.3-71-gd317 From 25934fcfb93c4687ad32fd3d062bcf03457129d4 Mon Sep 17 00:00:00 2001 From: Zqiang Date: Thu, 14 Apr 2022 19:13:34 -0700 Subject: irq_work: use kasan_record_aux_stack_noalloc() record callstack On PREEMPT_RT kernel and KASAN is enabled. the kasan_record_aux_stack() may call alloc_pages(), and the rt-spinlock will be acquired, if currently in atomic context, will trigger warning: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 239, name: bootlogd Preemption disabled at: [] rt_mutex_slowunlock+0xa1/0x4e0 CPU: 3 PID: 239 Comm: bootlogd Tainted: G W 5.17.1-rt17-yocto-preempt-rt+ #105 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Call Trace: __might_resched.cold+0x13b/0x173 rt_spin_lock+0x5b/0xf0 get_page_from_freelist+0x20c/0x1610 __alloc_pages+0x25e/0x5e0 __stack_depot_save+0x3c0/0x4a0 kasan_save_stack+0x3a/0x50 __kasan_record_aux_stack+0xb6/0xc0 kasan_record_aux_stack+0xe/0x10 irq_work_queue_on+0x6a/0x1c0 pull_rt_task+0x631/0x6b0 do_balance_callbacks+0x56/0x80 __balance_callbacks+0x63/0x90 rt_mutex_setprio+0x349/0x880 rt_mutex_slowunlock+0x22a/0x4e0 rt_spin_unlock+0x49/0x80 uart_write+0x186/0x2b0 do_output_char+0x2e9/0x3a0 n_tty_write+0x306/0x800 file_tty_write.isra.0+0x2af/0x450 tty_write+0x22/0x30 new_sync_write+0x27c/0x3a0 vfs_write+0x3f7/0x5d0 ksys_write+0xd9/0x180 __x64_sys_write+0x43/0x50 do_syscall_64+0x44/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fix it by using kasan_record_aux_stack_noalloc() to avoid the call to alloc_pages(). Link: https://lkml.kernel.org/r/20220402142555.2699582-1-qiang1.zhang@intel.com Signed-off-by: Zqiang Cc: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/irq_work.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq_work.c b/kernel/irq_work.c index f7df715ec28e..7afa40fe5cc4 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -137,7 +137,7 @@ bool irq_work_queue_on(struct irq_work *work, int cpu) if (!irq_work_claim(work)) return false; - kasan_record_aux_stack(work); + kasan_record_aux_stack_noalloc(work); preempt_disable(); if (cpu != smp_processor_id()) { -- cgit v1.2.3-71-gd317 From 40f5aa4c5eaebfeaca4566217cb9c468e28ed682 Mon Sep 17 00:00:00 2001 From: kuyo chang Date: Thu, 14 Apr 2022 17:02:20 +0800 Subject: sched/pelt: Fix attach_entity_load_avg() corner case The warning in cfs_rq_is_decayed() triggered: SCHED_WARN_ON(cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || cfs_rq->avg.runnable_avg) There exists a corner case in attach_entity_load_avg() which will cause load_sum to be zero while load_avg will not be. Consider se_weight is 88761 as per the sched_prio_to_weight[] table. Further assume the get_pelt_divider() is 47742, this gives: se->avg.load_avg is 1. However, calculating load_sum: se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se)); se->avg.load_sum = 1*47742/88761 = 0. Then enqueue_load_avg() adds this to the cfs_rq totals: cfs_rq->avg.load_avg += se->avg.load_avg; cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum; Resulting in load_avg being 1 with load_sum is 0, which will trigger the WARN. Fixes: f207934fb79d ("sched/fair: Align PELT windows between cfs_rq and its se") Signed-off-by: kuyo chang [peterz: massage changelog] Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Vincent Guittot Tested-by: Dietmar Eggemann Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com --- kernel/sched/fair.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d4bd299d67ab..a68482d66535 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3829,11 +3829,11 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s se->avg.runnable_sum = se->avg.runnable_avg * divider; - se->avg.load_sum = divider; - if (se_weight(se)) { - se->avg.load_sum = - div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se)); - } + se->avg.load_sum = se->avg.load_avg * divider; + if (se_weight(se) < se->avg.load_sum) + se->avg.load_sum = div_u64(se->avg.load_sum, se_weight(se)); + else + se->avg.load_sum = 1; enqueue_load_avg(cfs_rq, se); cfs_rq->avg.util_avg += se->avg.util_avg; -- cgit v1.2.3-71-gd317 From 60490e7966659b26d74bf1fa4aa8693d9a94ca88 Mon Sep 17 00:00:00 2001 From: Zhipeng Xie Date: Wed, 9 Feb 2022 09:54:17 -0500 Subject: perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled This problem can be reproduced with CONFIG_PERF_USE_VMALLOC enabled on both x86_64 and aarch64 arch when using sysdig -B(using ebpf)[1]. sysdig -B works fine after rebuilding the kernel with CONFIG_PERF_USE_VMALLOC disabled. I tracked it down to the if condition event->rb->nr_pages != nr_pages in perf_mmap is true when CONFIG_PERF_USE_VMALLOC is enabled where event->rb->nr_pages = 1 and nr_pages = 2048 resulting perf_mmap to return -EINVAL. This is because when CONFIG_PERF_USE_VMALLOC is enabled, rb->nr_pages is always equal to 1. Arch with CONFIG_PERF_USE_VMALLOC enabled by default: arc/arm/csky/mips/sh/sparc/xtensa Arch with CONFIG_PERF_USE_VMALLOC disabled by default: x86_64/aarch64/... Fix this problem by using data_page_nr() [1] https://github.com/draios/sysdig Fixes: 906010b2134e ("perf_event: Provide vmalloc() based mmap() backing") Signed-off-by: Zhipeng Xie Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20220209145417.6495-1-xiezhipeng1@huawei.com --- kernel/events/core.c | 2 +- kernel/events/internal.h | 5 +++++ kernel/events/ring_buffer.c | 5 ----- 3 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 23bb19716ad3..7858bafffa9d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6247,7 +6247,7 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma) again: mutex_lock(&event->mmap_mutex); if (event->rb) { - if (event->rb->nr_pages != nr_pages) { + if (data_page_nr(event->rb) != nr_pages) { ret = -EINVAL; goto unlock; } diff --git a/kernel/events/internal.h b/kernel/events/internal.h index 082832738c8f..5150d5f84c03 100644 --- a/kernel/events/internal.h +++ b/kernel/events/internal.h @@ -116,6 +116,11 @@ static inline int page_order(struct perf_buffer *rb) } #endif +static inline int data_page_nr(struct perf_buffer *rb) +{ + return rb->nr_pages << page_order(rb); +} + static inline unsigned long perf_data_size(struct perf_buffer *rb) { return rb->nr_pages << (PAGE_SHIFT + page_order(rb)); diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 52868716ec35..fb35b926024c 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -859,11 +859,6 @@ void rb_free(struct perf_buffer *rb) } #else -static int data_page_nr(struct perf_buffer *rb) -{ - return rb->nr_pages << page_order(rb); -} - static struct page * __perf_mmap_to_page(struct perf_buffer *rb, unsigned long pgoff) { -- cgit v1.2.3-71-gd317 From ecc04463d1a36f88baa750d45dfb02c364e1fdb1 Mon Sep 17 00:00:00 2001 From: Aleksandr Nogikh Date: Thu, 21 Apr 2022 16:36:07 -0700 Subject: kcov: don't generate a warning on vm_insert_page()'s failure vm_insert_page()'s failure is not an unexpected condition, so don't do WARN_ONCE() in such a case. Instead, print a kernel message and just return an error code. This flaw has been reported under an OOM condition by sysbot [1]. The message is mainly for the benefit of the test log, in this case the fuzzer's log so that humans inspecting the log can figure out what was going on. KCOV is a testing tool, so I think being a little more chatty when KCOV unexpectedly is about to fail will save someone debugging time. We don't want the WARN, because it's not a kernel bug that syzbot should report, and failure can happen if the fuzzer tries hard enough (as above). Link: https://lkml.kernel.org/r/Ylkr2xrVbhQYwNLf@elver.google.com [1] Link: https://lkml.kernel.org/r/20220401182512.249282-1-nogikh@google.com Fixes: b3d7fe86fbd0 ("kcov: properly handle subsequent mmap calls"), Signed-off-by: Aleksandr Nogikh Acked-by: Marco Elver Cc: Dmitry Vyukov Cc: Andrey Konovalov Cc: Alexander Potapenko Cc: Taras Madan Cc: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kcov.c b/kernel/kcov.c index 475524bd900a..b3732b210593 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -475,8 +475,11 @@ static int kcov_mmap(struct file *filep, struct vm_area_struct *vma) vma->vm_flags |= VM_DONTEXPAND; for (off = 0; off < size; off += PAGE_SIZE) { page = vmalloc_to_page(kcov->area + off); - if (vm_insert_page(vma, vma->vm_start + off, page)) - WARN_ONCE(1, "vm_insert_page() failed"); + res = vm_insert_page(vma, vma->vm_start + off, page); + if (res) { + pr_warn_once("kcov: vm_insert_page() failed\n"); + return res; + } } return 0; exit: -- cgit v1.2.3-71-gd317 From 1d661ed54d8613c97bcff2c7d6181c61e482a1da Mon Sep 17 00:00:00 2001 From: Adam Zabrocki Date: Fri, 22 Apr 2022 18:40:27 +0200 Subject: kprobes: Fix KRETPROBES when CONFIG_KRETPROBE_ON_RETHOOK is set The recent kernel change in 73f9b911faa7 ("kprobes: Use rethook for kretprobe if possible"), introduced a potential NULL pointer dereference bug in the KRETPROBE mechanism. The official Kprobes documentation defines that "Any or all handlers can be NULL". Unfortunately, there is a missing return handler verification to fulfill these requirements and can result in a NULL pointer dereference bug. This patch adds such verification in kretprobe_rethook_handler() function. Fixes: 73f9b911faa7 ("kprobes: Use rethook for kretprobe if possible") Signed-off-by: Adam Zabrocki Signed-off-by: Daniel Borkmann Acked-by: Masami Hiramatsu Cc: Steven Rostedt Cc: Naveen N. Rao Cc: Anil S. Keshavamurthy Link: https://lore.kernel.org/bpf/20220422164027.GA7862@pi3.com.pl --- kernel/kprobes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kprobes.c b/kernel/kprobes.c index dbe57df2e199..dd58c0be9ce2 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -2126,7 +2126,7 @@ static void kretprobe_rethook_handler(struct rethook_node *rh, void *data, struct kprobe_ctlblk *kcb; /* The data must NOT be null. This means rethook data structure is broken. */ - if (WARN_ON_ONCE(!data)) + if (WARN_ON_ONCE(!data) || !rp->handler) return; __this_cpu_write(current_kprobe, &rp->kp); -- cgit v1.2.3-71-gd317