From 1567c3e3467cddeb019a7b53ec632f834b6a9239 Mon Sep 17 00:00:00 2001 From: Giovanni Gherdovich Date: Wed, 22 Jan 2020 16:16:12 +0100 Subject: x86, sched: Add support for frequency invariance Implement arch_scale_freq_capacity() for 'modern' x86. This function is used by the scheduler to correctly account usage in the face of DVFS. The present patch addresses Intel processors specifically and has positive performance and performance-per-watt implications for the schedutil cpufreq governor, bringing it closer to, if not on-par with, the powersave governor from the intel_pstate driver/framework. Large performance gains are obtained when the machine is lightly loaded and no regression are observed at saturation. The benchmarks with the largest gains are kernel compilation, tbench (the networking version of dbench) and shell-intensive workloads. 1. FREQUENCY INVARIANCE: MOTIVATION * Without it, a task looks larger if the CPU runs slower 2. PECULIARITIES OF X86 * freq invariance accounting requires knowing the ratio freq_curr/freq_max 2.1 CURRENT FREQUENCY * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz") 2.2 MAX FREQUENCY * It varies with time (turbo). As an approximation, we set it to a constant, i.e. 4-cores turbo frequency. 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR * The invariant schedutil's formula has no feedback loop and reacts faster to utilization changes 4. KNOWN LIMITATIONS * In some cases tasks can't reach max util despite how hard they try 5. PERFORMANCE TESTING 5.1 MACHINES * Skylake, Broadwell, Haswell 5.2 SETUP * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and intel_pstate/powersave 5.3 BENCHMARK RESULTS 5.3.1 NEUTRAL BENCHMARKS * NAS Parallel Benchmark (HPC), hackbench 5.3.2 NON-NEUTRAL BENCHMARKS * tbench (10-30% better), kernbench (10-15% better), shell-intensive-scripts (30-50% better) * no regressions 5.3.3 SELECTION OF DETAILED RESULTS 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT * dbench (5% worse on one machine), kernbench (3% worse), tbench (5-10% better), shell-intensive-scripts (10-40% better) 6. MICROARCH'ES ADDRESSED HERE * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum etc have different MSRs semantic for querying turbo levels) 7. REFERENCES * MMTests performance testing framework, github.com/gormanm/mmtests +-------------------------------------------------------------------------+ | 1. FREQUENCY INVARIANCE: MOTIVATION +-------------------------------------------------------------------------+ For example; suppose a CPU has two frequencies: 500 and 1000 Mhz. When running a task that would consume 1/3rd of a CPU at 1000 MHz, it would appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the false impression this CPU is almost at capacity, even though it can go faster [*]. In a nutshell, without frequency scale-invariance tasks look larger just because the CPU is running slower. [*] (footnote: this assumes a linear frequency/performance relation; which everybody knows to be false, but given realities its the best approximation we can make.) +-------------------------------------------------------------------------+ | 2. PECULIARITIES OF X86 +-------------------------------------------------------------------------+ Accounting for frequency changes in PELT signals requires the computation of the ratio freq_curr / freq_max. On x86 neither of those terms is readily available. 2.1 CURRENT FREQUENCY ==================== Since modern x86 has hardware control over the actual frequency we run at (because amongst other things, Turbo-Mode), we cannot simply use the frequency as requested through cpufreq. Instead we use the APERF/MPERF MSRs to compute the effective frequency over the recent past. Also, because reading MSRs is expensive, don't do so every time we need the value, but amortize the cost by doing it every tick. 2.2 MAX FREQUENCY ================= Obtaining freq_max is also non-trivial because at any time the hardware can provide a frequency boost to a selected subset of cores if the package has enough power to spare (eg: Turbo Boost). This means that the maximum frequency available to a given core changes with time. The approach taken in this change is to arbitrarily set freq_max to a constant value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most microarchitectures, after evaluating the following candidates: * 1-core (1C) turbo frequency (the fastest turbo state available) * around base frequency (a.k.a. max P-state) * something in between, such as 4C turbo To interpret these options, consider that this is the denominator in freq_curr/freq_max, and that ratio will be used to scale PELT signals such as util_avg and load_avg. A large denominator will undershoot (util_avg looks a bit smaller than it really is), viceversa with a smaller denominator PELT signals will tend to overshoot. Given that PELT drives frequency selection in the schedutil governor, we will have: freq_max set to | effect on DVFS --------------------+------------------ 1C turbo | power efficiency (lower freq choices) base freq | performance (higher util_avg, higher freq requests) 4C turbo | a bit of both 4C turbo proves to be a good compromise in a number of benchmarks (see below). +-------------------------------------------------------------------------+ | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR +-------------------------------------------------------------------------+ Once an architecture implements a frequency scale-invariant utilization (the PELT signal util_avg), schedutil switches its frequency selection formula from freq_next = 1.25 * freq_curr * util [non-invariant util signal] to freq_next = 1.25 * freq_max * util [invariant util signal] where, in the second formula, freq_max is set to the 1C turbo frequency (max turbo). The advantage of the second formula, whose usage we unlock with this patch, is that freq_next doesn't depend on the current frequency in an iterative fashion, but can jump to any frequency in a single update. This absence of feedback in the formula makes it quicker to react to utilization changes and more robust against pathological instabilities. Compare it to the update formula of intel_pstate/powersave: freq_next = 1.25 * freq_max * Busy% where again freq_max is 1C turbo and Busy% is the percentage of time not spent idling (calculated with delta_MPERF / delta_TSC); essentially the same as invariant schedutil, and largely responsible for intel_pstate/powersave good reputation. The non-invariant schedutil formula is derived from the invariant one by approximating util_inv with util_raw * freq_curr / freq_max, but this has limitations. Testing shows improved performances due to better frequency selections when the machine is lightly loaded, and essentially no change in behaviour at saturation / overutilization. +-------------------------------------------------------------------------+ | 4. KNOWN LIMITATIONS +-------------------------------------------------------------------------+ It's been shown that it is possible to create pathological scenarios where a CPU-bound task cannot reach max utilization, if the normalizing factor freq_max is fixed to a constant value (see [Lelli-2018]). If freq_max is set to 4C turbo as we do here, one needs to peg at least 5 cores in a package doing some busywork, and observe that none of those task will ever reach max util (1024) because they're all running at less than the 4C turbo frequency. While this concern still applies, we believe the performance benefit of frequency scale-invariant PELT signals outweights the cost of this limitation. [Lelli-2018] https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/ +-------------------------------------------------------------------------+ | 5. PERFORMANCE TESTING +-------------------------------------------------------------------------+ 5.1 MACHINES ============ We tested the patch on three machines, with Skylake, Broadwell and Haswell CPUs. The details are below, together with the available turbo ratios as reported by the appropriate MSRs. * 8x-SKYLAKE-UMA: Single socket E3-1240 v5, Skylake 4 cores/8 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 800 |******** BASE 3500 |*********************************** 4C 3700 |************************************* 3C 3800 |************************************** 2C 3900 |*************************************** 1C 3900 |*************************************** * 80x-BROADWELL-NUMA: Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2200 |********************** 8C 2900 |***************************** 7C 3000 |****************************** 6C 3100 |******************************* 5C 3200 |******************************** 4C 3300 |********************************* 3C 3400 |********************************** 2C 3600 |************************************ 1C 3600 |************************************ * 48x-HASWELL-NUMA Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads Max EFFiciency, BASE frequency and available turbo levels (MHz): EFFIC 1200 |************ BASE 2300 |*********************** 12C 2600 |************************** 11C 2600 |************************** 10C 2600 |************************** 9C 2600 |************************** 8C 2600 |************************** 7C 2600 |************************** 6C 2600 |************************** 5C 2700 |*************************** 4C 2800 |**************************** 3C 2900 |***************************** 2C 3100 |******************************* 1C 3100 |******************************* 5.2 SETUP ========= * The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate driver in passive mode. * The rationale for choosing the various freq_max values to test have been to try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical on all machines), plus one more value closer to base_freq but still in the turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA). * In addition we've run all tests with intel_pstate/powersave for comparison. * The filesystem is always XFS, the userspace is openSUSE Leap 15.1. * 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs with active intel_pstate on this machine use that. This gives, in terms of combinations tested on each machine: * 8x-SKYLAKE-UMA * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive * intel_pstate active + powersave + HWP * invariant schedutil, freq_max = 1C turbo * invariant schedutil, freq_max = 3C turbo * invariant schedutil, freq_max = 4C turbo * both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA * [same as 8x-SKYLAKE-UMA, but no HWP capable] * invariant schedutil, freq_max = 8C turbo (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo") 5.3 BENCHMARK RESULTS ===================== 5.3.1 NEUTRAL BENCHMARKS ------------------------ Tests that didn't show any measurable difference in performance on any of the test machines between non-invariant schedutil and our patch are: * NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any computational kernel * flexible I/O (FIO) * hackbench (using threads or processes, and using pipes or sockets) 5.3.2 NON-NEUTRAL BENCHMARKS ---------------------------- What follow are summary tables where each benchmark result is given a score. * A tilde (~) means a neutral result, i.e. no difference from baseline. * Scores are computed with the ratio result_new / result_baseline, so a tilde means a score of 1.00. * The results in the score ratio are the geometric means of results running the benchmark with different parameters (eg: for kernbench: using 1, 2, 4, ... number of processes; for pgbench: varying the number of clients, and so on). * The first three tables show higher-is-better kind of tests (i.e. measured in operations/second), the subsequent three show lower-is-better kind of tests (i.e. the workload is fixed and we measure elapsed time, think kernbench). * "gitsource" is a name we made up for the test consisting in running the entire unit tests suite of the Git SCM and measuring how long it takes. We take it as a typical example of shell-intensive serialized workload. * In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other columns show invariant schedutil for different values of freq_max. 4C turbo is circled as it's the value we've chosen for the final implementation. 80x-BROADWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.14 ~ ~ | 1.11 | 1.14 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.06 ~ 1.06 | 1.05 | 1.07 netperf-tcp ~ 1.03 ~ | 1.01 | 1.02 tbench4 1.57 1.18 1.22 | 1.30 | 1.56 +------+ 8x-SKYLAKE-UMA (comparison ratio; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.30 1.14 1.14 | 1.16 | +------+ 48x-HASWELL-NUMA (comparison ratio; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.15 ~ ~ | 1.06 | 1.16 pgbench-rw ~ ~ ~ | ~ | ~ netperf-udp 1.05 0.97 1.04 | 1.04 | 1.02 netperf-tcp 0.96 1.01 1.01 | 1.01 | 1.01 tbench4 1.50 1.05 1.13 | 1.13 | 1.25 +------+ In the table above we see that active intel_pstate is slightly better than our 4C-turbo patch (both in reference to the baseline non-invariant schedutil) on read-only pgbench and much better on tbench. Both cases are notable in which it shows that lowering our freq_max (to 8C-turbo and 12C-turbo on 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant schedutil to get closer. If we ignore active intel_pstate and focus on the comparison with baseline alone, there are several instances of double-digit performance improvement. 80x-BROADWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 1.23 0.95 0.95 | 0.95 | 0.95 kernbench 0.93 0.83 0.83 | 0.83 | 0.82 gitsource 0.98 0.49 0.49 | 0.49 | 0.48 +------+ 8x-SKYLAKE-UMA (comparison ratio; lower is better) +------+ I_PSTATE/HWP 1C 3C | 4C | dbench4 ~ ~ ~ | ~ | kernbench ~ ~ ~ | ~ | gitsource 0.92 0.55 0.55 | 0.55 | +------+ 48x-HASWELL-NUMA (comparison ratio; lower is better) +------+ I_PSTATE 1C 3C | 4C | 8C dbench4 ~ ~ ~ | ~ | ~ kernbench 0.94 0.90 0.89 | 0.90 | 0.90 gitsource 0.97 0.69 0.69 | 0.69 | 0.69 +------+ dbench is not very remarkable here, unless we notice how poorly active intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus non-invariant schedutil. We repeated that run getting consistent results. Out of scope for the patch at hand, but deserving future investigation. Other than that, we previously ran this campaign with Linux v5.0 and saw the patch doing better on dbench a the time. We haven't checked closely and can only speculate at this point. On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in the detailed tables that the gains concentrate on low process counts (lightly loaded machines). The test we call "gitsource" (running the git unit test suite, a long-running single-threaded shell script) appears rather spectacular in this table (gains of 30-50% depending on the machine). It is to be noted, however, that gitsource has no adjustable parameters (such as the number of jobs in kernbench, which we average over in order to get a single-number summary score) and is exactly the kind of low-parallelism workload that benefits the most from this patch. When looking at the detailed tables of kernbench or tbench4, at low process or client counts one can see similar numbers. 5.3.3 SELECTION OF DETAILED RESULTS ----------------------------------- Machine : 48x-HASWELL-NUMA Benchmark : tbench4 (i.e. dbench4 over the network, actually loopback) Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 126.73 +- 0.31% ( ) 315.91 +- 0.66% ( 149.28%) 125.03 +- 0.76% ( -1.34%) Hmean 2 258.04 +- 0.62% ( ) 614.16 +- 0.51% ( 138.01%) 269.58 +- 1.45% ( 4.47%) Hmean 4 514.30 +- 0.67% ( ) 1146.58 +- 0.54% ( 122.94%) 533.84 +- 1.99% ( 3.80%) Hmean 8 1111.38 +- 2.52% ( ) 2159.78 +- 0.38% ( 94.33%) 1359.92 +- 1.56% ( 22.36%) Hmean 16 2286.47 +- 1.36% ( ) 3338.29 +- 0.21% ( 46.00%) 2720.20 +- 0.52% ( 18.97%) Hmean 32 4704.84 +- 0.35% ( ) 4759.03 +- 0.43% ( 1.15%) 4774.48 +- 0.30% ( 1.48%) Hmean 64 7578.04 +- 0.27% ( ) 7533.70 +- 0.43% ( -0.59%) 7462.17 +- 0.65% ( -1.53%) Hmean 128 6998.52 +- 0.16% ( ) 6987.59 +- 0.12% ( -0.16%) 6909.17 +- 0.14% ( -1.28%) Hmean 192 6901.35 +- 0.25% ( ) 6913.16 +- 0.10% ( 0.17%) 6855.47 +- 0.21% ( -0.66%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 12C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 128.43 +- 0.28% ( 1.34%) 130.64 +- 3.81% ( 3.09%) 153.71 +- 5.89% ( 21.30%) Hmean 2 311.70 +- 6.15% ( 20.79%) 281.66 +- 3.40% ( 9.15%) 305.08 +- 5.70% ( 18.23%) Hmean 4 641.98 +- 2.32% ( 24.83%) 623.88 +- 5.28% ( 21.31%) 906.84 +- 4.65% ( 76.32%) Hmean 8 1633.31 +- 1.56% ( 46.96%) 1714.16 +- 0.93% ( 54.24%) 2095.74 +- 0.47% ( 88.57%) Hmean 16 3047.24 +- 0.42% ( 33.27%) 3155.02 +- 0.30% ( 37.99%) 3634.58 +- 0.15% ( 58.96%) Hmean 32 4734.31 +- 0.60% ( 0.63%) 4804.38 +- 0.23% ( 2.12%) 4674.62 +- 0.27% ( -0.64%) Hmean 64 7699.74 +- 0.35% ( 1.61%) 7499.72 +- 0.34% ( -1.03%) 7659.03 +- 0.25% ( 1.07%) Hmean 128 6935.18 +- 0.15% ( -0.91%) 6942.54 +- 0.10% ( -0.80%) 7004.85 +- 0.12% ( 0.09%) Hmean 192 6901.62 +- 0.12% ( 0.00%) 6856.93 +- 0.10% ( -0.64%) 6978.74 +- 0.10% ( 1.12%) This is one of the cases where the patch still can't surpass active intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are visible up to 16 clients and the saturated scenario is the same as baseline. The scores in the summary table from the previous sections are ratios of geometric means of the results over different clients, as seen in this table. Machine : 80x-BROADWELL-NUMA Benchmark : kernbench (kernel compilation) Varying parameter : number of jobs Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 379.68 +- 0.06% ( ) 330.20 +- 0.43% ( 13.03%) 285.93 +- 0.07% ( 24.69%) Amean 4 200.15 +- 0.24% ( ) 175.89 +- 0.22% ( 12.12%) 153.78 +- 0.25% ( 23.17%) Amean 8 106.20 +- 0.31% ( ) 95.54 +- 0.23% ( 10.03%) 86.74 +- 0.10% ( 18.32%) Amean 16 56.96 +- 1.31% ( ) 53.25 +- 1.22% ( 6.50%) 48.34 +- 1.73% ( 15.13%) Amean 32 34.80 +- 2.46% ( ) 33.81 +- 0.77% ( 2.83%) 30.28 +- 1.59% ( 12.99%) Amean 64 26.11 +- 1.63% ( ) 25.04 +- 1.07% ( 4.10%) 22.41 +- 2.37% ( 14.16%) Amean 128 24.80 +- 1.36% ( ) 23.57 +- 1.23% ( 4.93%) 21.44 +- 1.37% ( 13.55%) Amean 160 24.85 +- 0.56% ( ) 23.85 +- 1.17% ( 4.06%) 21.25 +- 1.12% ( 14.49%) 5.2.0 3C-turbo 5.2.0 4C-turbo 5.2.0 8C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 2 284.08 +- 0.13% ( 25.18%) 283.96 +- 0.51% ( 25.21%) 285.05 +- 0.21% ( 24.92%) Amean 4 153.18 +- 0.22% ( 23.47%) 154.70 +- 1.64% ( 22.71%) 153.64 +- 0.30% ( 23.24%) Amean 8 87.06 +- 0.28% ( 18.02%) 86.77 +- 0.46% ( 18.29%) 86.78 +- 0.22% ( 18.28%) Amean 16 48.03 +- 0.93% ( 15.68%) 47.75 +- 1.99% ( 16.17%) 47.52 +- 1.61% ( 16.57%) Amean 32 30.23 +- 1.20% ( 13.14%) 30.08 +- 1.67% ( 13.57%) 30.07 +- 1.67% ( 13.60%) Amean 64 22.59 +- 2.02% ( 13.50%) 22.63 +- 0.81% ( 13.32%) 22.42 +- 0.76% ( 14.12%) Amean 128 21.37 +- 0.67% ( 13.82%) 21.31 +- 1.15% ( 14.07%) 21.17 +- 1.93% ( 14.63%) Amean 160 21.68 +- 0.57% ( 12.76%) 21.18 +- 1.74% ( 14.77%) 21.22 +- 1.00% ( 14.61%) The patch outperform active intel_pstate (and baseline) by a considerable margin; the summary table from the previous section says 4C turbo and active intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is 0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no noticeable difference with regard to the value of freq_max. Machine : 8x-SKYLAKE-UMA Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla 5.2.0 intel_pstate/hwp 5.2.0 1C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 858.85 +- 1.16% ( ) 791.94 +- 0.21% ( 7.79%) 474.95 ( 44.70%) 5.2.0 3C-turbo 5.2.0 4C-turbo - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 475.26 +- 0.20% ( 44.66%) 474.34 +- 0.13% ( 44.77%) In this test, which is of interest as representing shell-intensive (i.e. fork-intensive) serialized workloads, invariant schedutil outperforms intel_pstate/powersave by a whopping 40% margin. 5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT --------------------------------------------- The following table shows average power consumption in watt for each benchmark. Data comes from turbostat (package average), which in turn is read from the RAPL interface on CPUs. We know the patch affects CPU frequencies so it's reasonable to ignore other power consumers (such as memory or I/O). Also, we don't have a power meter available in the lab so RAPL is the best we have. turbostat sampled average power every 10 seconds for the entire duration of each benchmark. We took all those values and averaged them (i.e. with don't have detail on a per-parameter granularity, only on whole benchmarks). 80x-BROADWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 8C pgbench-ro 130.01 142.77 131.11 132.45 | 134.65 | 136.84 pgbench-rw 68.30 60.83 71.45 71.70 | 71.65 | 72.54 dbench4 90.25 59.06 101.43 99.89 | 101.10 | 102.94 netperf-udp 65.70 69.81 66.02 68.03 | 68.27 | 68.95 netperf-tcp 88.08 87.96 88.97 88.89 | 88.85 | 88.20 tbench4 142.32 176.73 153.02 163.91 | 165.58 | 176.07 kernbench 92.94 101.95 114.91 115.47 | 115.52 | 115.10 gitsource 40.92 41.87 75.14 75.20 | 75.40 | 75.70 +--------+ 8x-SKYLAKE-UMA (power consumption, watts) +--------+ BASELINE I_PSTATE/HWP 1C 3C | 4C | pgbench-ro 46.49 46.68 46.56 46.59 | 46.52 | pgbench-rw 29.34 31.38 30.98 31.00 | 31.00 | dbench4 27.28 27.37 27.49 27.41 | 27.38 | netperf-udp 22.33 22.41 22.36 22.35 | 22.36 | netperf-tcp 27.29 27.29 27.30 27.31 | 27.33 | tbench4 41.13 45.61 43.10 43.33 | 43.56 | kernbench 42.56 42.63 43.01 43.01 | 43.01 | gitsource 13.32 13.69 17.33 17.30 | 17.35 | +--------+ 48x-HASWELL-NUMA (power consumption, watts) +--------+ BASELINE I_PSTATE 1C 3C | 4C | 12C pgbench-ro 128.84 136.04 129.87 132.43 | 132.30 | 134.86 pgbench-rw 37.68 37.92 37.17 37.74 | 37.73 | 37.31 dbench4 28.56 28.73 28.60 28.73 | 28.70 | 28.79 netperf-udp 56.70 60.44 56.79 57.42 | 57.54 | 57.52 netperf-tcp 75.49 75.27 75.87 76.02 | 76.01 | 75.95 tbench4 115.44 139.51 119.53 123.07 | 123.97 | 130.22 kernbench 83.23 91.55 95.58 95.69 | 95.72 | 96.04 gitsource 36.79 36.99 39.99 40.34 | 40.35 | 40.23 +--------+ A lower power consumption isn't necessarily better, it depends on what is done with that energy. Here are tables with the ratio of performance-per-watt on each machine and benchmark. Higher is always better; a tilde (~) means a neutral ratio (i.e. 1.00). 80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 8C pgbench-ro 1.04 1.06 0.94 | 1.07 | 1.08 pgbench-rw 1.10 0.97 0.96 | 0.96 | 0.97 dbench4 1.24 0.94 0.95 | 0.94 | 0.92 netperf-udp ~ 1.02 1.02 | ~ | 1.02 netperf-tcp ~ 1.02 ~ | ~ | 1.02 tbench4 1.26 1.10 1.06 | 1.12 | 1.26 kernbench 0.98 0.97 0.97 | 0.97 | 0.98 gitsource ~ 1.11 1.11 | 1.11 | 1.13 +------+ 8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE/HWP 1C 3C | 4C | pgbench-ro ~ ~ ~ | ~ | pgbench-rw 0.95 0.97 0.96 | 0.96 | dbench4 ~ ~ ~ | ~ | netperf-udp ~ ~ ~ | ~ | netperf-tcp ~ ~ ~ | ~ | tbench4 1.17 1.09 1.08 | 1.10 | kernbench ~ ~ ~ | ~ | gitsource 1.06 1.40 1.40 | 1.40 | +------+ 48x-HASWELL-NUMA (performance-per-watt ratios; higher is better) +------+ I_PSTATE 1C 3C | 4C | 12C pgbench-ro 1.09 ~ 1.09 | 1.03 | 1.11 pgbench-rw ~ 0.86 ~ | ~ | 0.86 dbench4 ~ 1.02 1.02 | 1.02 | ~ netperf-udp ~ 0.97 1.03 | 1.02 | ~ netperf-tcp 0.96 ~ ~ | ~ | ~ tbench4 1.24 ~ 1.06 | 1.05 | 1.11 kernbench 0.97 0.97 0.98 | 0.97 | 0.96 gitsource 1.03 1.33 1.32 | 1.32 | 1.33 +------+ These results are overall pleasing: in plenty of cases we observe performance-per-watt improvements. The few regressions (read/write pgbench and dbench on the Broadwell machine) are of small magnitude. kernbench loses a few percentage points (it has a 10-15% performance improvement, but apparently the increase in power consumption is larger than that). tbench4 and gitsource, which benefit the most from the patch, keep a positive score in this table which is a welcome surprise; that suggests that in those particular workloads the non-invariant schedutil (and active intel_pstate, too) makes some rather suboptimal frequency selections. +-------------------------------------------------------------------------+ | 6. MICROARCH'ES ADDRESSED HERE +-------------------------------------------------------------------------+ The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies respectively. This excludes the recent Xeon Scalable Performance processors line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently. Subsequent patches will address: * Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus * Xeon Phi (Knights Landing, Knights Mill) * Atom Silvermont +-------------------------------------------------------------------------+ | 7. REFERENCES +-------------------------------------------------------------------------+ Tests have been run with the help of the MMTests performance testing framework, see github.com/gormanm/mmtests. The configuration file names for the benchmark used are: db-pgbench-timed-ro-small-xfs db-pgbench-timed-rw-small-xfs io-dbench4-async-xfs network-netperf-unbound network-tbench scheduler-unbound workload-kerndevel-xfs workload-shellscripts-xfs hpc-nas-c-class-mpi-full-xfs hpc-nas-c-class-omp-full All those benchmarks are generally available on the web: pgbench: https://www.postgresql.org/docs/10/pgbench.html netperf: https://hewlettpackard.github.io/netperf/ dbench/tbench: https://dbench.samba.org/ gitsource: git unit test suite, github.com/git/git NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c Suggested-by: Peter Zijlstra Signed-off-by: Giovanni Gherdovich Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Doug Smythies Acked-by: Rafael J. Wysocki Link: https://lkml.kernel.org/r/20200122151617.531-2-ggherdovich@suse.cz --- kernel/sched/core.c | 1 + kernel/sched/sched.h | 7 +++++++ 2 files changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 89e54f3ed571..45f79bcc3146 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3600,6 +3600,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; + arch_scale_freq_tick(); sched_clock_tick(); rq_lock(rq, &rf); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1a88dc8ad11b..0844e81964e5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1968,6 +1968,13 @@ static inline int hrtick_enabled(struct rq *rq) #endif /* CONFIG_SCHED_HRTICK */ +#ifndef arch_scale_freq_tick +static __always_inline +void arch_scale_freq_tick(void) +{ +} +#endif + #ifndef arch_scale_freq_capacity static __always_inline unsigned long arch_scale_freq_capacity(int cpu) -- cgit v1.2.3-71-gd317 From bec2860a2bd6cd38ea34434d04f4033eb32f0f31 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Fri, 6 Dec 2019 22:54:22 +0530 Subject: sched/fair: Optimize select_idle_core() Currently we loop through all threads of a core to evaluate if the core is idle or not. This is unnecessary. If a thread of a core is not idle, skip evaluating other threads of a core. Also while clearing the cpumask, bits of all CPUs of a core can be cleared in one-shot. Collecting ticks on a Power 9 SMT 8 system around select_idle_core while running schbench shows us (units are in ticks, hence lesser is better) Without patch N Min Max Median Avg Stddev x 130 151 1083 284 322.72308 144.41494 With patch N Min Max Median Avg Stddev Improvement x 164 88 610 201 225.79268 106.78943 30.03% Signed-off-by: Srikar Dronamraju Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Valentin Schneider Reviewed-by: Vincent Guittot Acked-by: Mel Gorman Link: https://lkml.kernel.org/r/20191206172422.6578-1-srikar@linux.vnet.ibm.com --- kernel/sched/fair.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 25dffc03f0f6..1a0ce83e835a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5787,10 +5787,12 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int bool idle = true; for_each_cpu(cpu, cpu_smt_mask(core)) { - __cpumask_clear_cpu(cpu, cpus); - if (!available_idle_cpu(cpu)) + if (!available_idle_cpu(cpu)) { idle = false; + break; + } } + cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); if (idle) return core; -- cgit v1.2.3-71-gd317 From b4fb015eeff7f3e5518a7dbe8061169a3e2f2bc7 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Sat, 25 Jan 2020 17:50:38 +0300 Subject: sched/rt: Optimize checking group RT scheduler constraints Group RT scheduler contains protection against setting zero runtime for cgroup with RT tasks. Right now function tg_set_rt_bandwidth() iterates over all CPU cgroups and calls tg_has_rt_tasks() for any cgroup which runtime is zero (not only for changed one). Default RT runtime is zero, thus tg_has_rt_tasks() will is called for almost at CPU cgroups. This protection already is slightly racy: runtime limit could be changed between cpu_cgroup_can_attach() and cpu_cgroup_attach() because changing cgroup attribute does not lock cgroup_mutex while attach does not lock rt_constraints_mutex. Changing task scheduler class also races with changing rt runtime: check in __sched_setscheduler() isn't protected. Function tg_has_rt_tasks() iterates over all threads in the system. This gives NR_CGROUPS * NR_TASKS operations under single tasklist_lock locked for read tg_set_rt_bandwidth(). Any concurrent attempt of locking tasklist_lock for write (for example fork) will stuck with disabled irqs. This patch makes two optimizations: 1) Remove locking tasklist_lock and iterate only tasks in cgroup 2) Call tg_has_rt_tasks() iff rt runtime changes from non-zero to zero All changed code is under CONFIG_RT_GROUP_SCHED. Testcase: # mkdir /sys/fs/cgroup/cpu/test{1..10000} # echo 0 | tee /sys/fs/cgroup/cpu/test*/cpu.rt_runtime_us At the same time without patch fork time will be >100ms: # perf trace -e clone --duration 100 stress-ng --fork 1 Also remote ping will show timings >100ms caused by irq latency. Signed-off-by: Konstantin Khlebnikov Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/157996383820.4651.11292439232549211693.stgit@buzz --- kernel/sched/rt.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 4043abe45459..55a4a5042292 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -2449,10 +2449,11 @@ const struct sched_class rt_sched_class = { */ static DEFINE_MUTEX(rt_constraints_mutex); -/* Must be called with tasklist_lock held */ static inline int tg_has_rt_tasks(struct task_group *tg) { - struct task_struct *g, *p; + struct task_struct *task; + struct css_task_iter it; + int ret = 0; /* * Autogroups do not have RT tasks; see autogroup_create(). @@ -2460,12 +2461,12 @@ static inline int tg_has_rt_tasks(struct task_group *tg) if (task_group_is_autogroup(tg)) return 0; - for_each_process_thread(g, p) { - if (rt_task(p) && task_group(p) == tg) - return 1; - } + css_task_iter_start(&tg->css, 0, &it); + while (!ret && (task = css_task_iter_next(&it))) + ret |= rt_task(task); + css_task_iter_end(&it); - return 0; + return ret; } struct rt_schedulable_data { @@ -2496,9 +2497,10 @@ static int tg_rt_schedulable(struct task_group *tg, void *data) return -EINVAL; /* - * Ensure we don't starve existing RT tasks. + * Ensure we don't starve existing RT tasks if runtime turns zero. */ - if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg)) + if (rt_bandwidth_enabled() && !runtime && + tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg)) return -EBUSY; total = to_ratio(period, runtime); @@ -2564,7 +2566,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg, return -EINVAL; mutex_lock(&rt_constraints_mutex); - read_lock(&tasklist_lock); err = __rt_schedulable(tg, rt_period, rt_runtime); if (err) goto unlock; @@ -2582,7 +2583,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg, } raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock); unlock: - read_unlock(&tasklist_lock); mutex_unlock(&rt_constraints_mutex); return err; @@ -2641,9 +2641,7 @@ static int sched_rt_global_constraints(void) int ret = 0; mutex_lock(&rt_constraints_mutex); - read_lock(&tasklist_lock); ret = __rt_schedulable(NULL, 0, 0); - read_unlock(&tasklist_lock); mutex_unlock(&rt_constraints_mutex); return ret; -- cgit v1.2.3-71-gd317 From 70b3eeed49e8190d97139806f6fbaf8964306cdb Mon Sep 17 00:00:00 2001 From: Steve Grubb Date: Fri, 24 Jan 2020 17:29:16 -0500 Subject: audit: CONFIG_CHANGE don't log internal bookkeeping as an event Common Criteria calls out for any action that modifies the audit trail to be recorded. That usually is interpreted to mean insertion or removal of rules. It is not required to log modification of the inode information since the watch is still in effect. Additionally, if the rule is a never rule and the underlying file is one they do not want events for, they get an event for this bookkeeping update against their wishes. Since no device/inode info is logged at insertion and no device/inode information is logged on update, there is nothing meaningful being communicated to the admin by the CONFIG_CHANGE updated_rules event. One can assume that the rule was not "modified" because it is still watching the intended target. If the device or inode cannot be resolved, then audit_panic is called which is sufficient. The correct resolution is to drop logging config_update events since the watch is still in effect but just on another unknown inode. Signed-off-by: Steve Grubb Signed-off-by: Paul Moore --- kernel/audit_watch.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 4508d5e0cf69..8a8fd732ff6d 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -302,8 +302,6 @@ static void audit_update_watch(struct audit_parent *parent, if (oentry->rule.exe) audit_remove_mark(oentry->rule.exe); - audit_watch_log_rule_change(r, owatch, "updated_rules"); - call_rcu(&oentry->rcu, audit_free_rule_rcu); } -- cgit v1.2.3-71-gd317 From caa72c3bc584bc28b557bcf1a47532a7a6f37e6f Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 3 Feb 2020 15:31:25 +0200 Subject: console: Drop double check for console_drivers being non-NULL There is no need to explicitly check for console_drivers to be non-NULL since for_each_console() does this. Link: http://lkml.kernel.org/r/20200203133130.11591-2-andriy.shevchenko@linux.intel.com To: linux-kernel@vger.kernel.org To: Steven Rostedt Signed-off-by: Andy Shevchenko Reviewed-by: Petr Mladek Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index fada22dc4ab6..51337ed426e0 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -1772,9 +1772,6 @@ static void call_console_drivers(const char *ext_text, size_t ext_len, trace_console_rcuidle(text, len); - if (!console_drivers) - return; - for_each_console(con) { if (exclusive_console && con != exclusive_console) continue; @@ -2653,18 +2650,17 @@ void register_console(struct console *newcon) struct console_cmdline *c; static bool has_preferred; - if (console_drivers) - for_each_console(bcon) - if (WARN(bcon == newcon, - "console '%s%d' already registered\n", - bcon->name, bcon->index)) - return; + for_each_console(bcon) { + if (WARN(bcon == newcon, "console '%s%d' already registered\n", + bcon->name, bcon->index)) + return; + } /* * before we register a new CON_BOOT console, make sure we don't * already have a valid console */ - if (console_drivers && newcon->flags & CON_BOOT) { + if (newcon->flags & CON_BOOT) { /* find the last or real console */ for_each_console(bcon) { if (!(bcon->flags & CON_BOOT)) { -- cgit v1.2.3-71-gd317 From 12825e6ba8ea8efbb6d81363b0cf6b5bf84c051a Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 3 Feb 2020 15:31:26 +0200 Subject: console: Use for_each_console() helper in unregister_console() We have rather open coded single linked list manipulations where we may simple use for_each_console() helper with properly set exit conditions. Replace open coded single-linked list handling with for_each_console() helper in use. Link: http://lkml.kernel.org/r/20200203133130.11591-3-andriy.shevchenko@linux.intel.com To: linux-kernel@vger.kernel.org To: Steven Rostedt Signed-off-by: Andy Shevchenko Reviewed-by: Petr Mladek Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 51337ed426e0..d40a316908da 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -2809,7 +2809,7 @@ EXPORT_SYMBOL(register_console); int unregister_console(struct console *console) { - struct console *a, *b; + struct console *con; int res; pr_info("%sconsole [%s%d] disabled\n", @@ -2825,11 +2825,10 @@ int unregister_console(struct console *console) if (console_drivers == console) { console_drivers=console->next; res = 0; - } else if (console_drivers) { - for (a=console_drivers->next, b=console_drivers ; - a; b=a, a=b->next) { - if (a == console) { - b->next = a->next; + } else { + for_each_console(con) { + if (con->next == console) { + con->next = console->next; res = 0; break; } -- cgit v1.2.3-71-gd317 From d58ad10122e6f8f84b1ef60227233b7c5de2bc02 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 3 Feb 2020 15:31:27 +0200 Subject: console: Drop misleading comment /* find the last or real console */ This comment is misleading. The purpose of the loop is to check if we are trying to register boot console after a real one has already been registered. This is already mentioned in a comment above. Link: http://lkml.kernel.org/r/20200203133130.11591-4-andriy.shevchenko@linux.intel.com To: linux-kernel@vger.kernel.org To: Steven Rostedt Suggested-by: Petr Mladek Signed-off-by: Andy Shevchenko [pmladek@suse.com: Updated commit message.] Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index d40a316908da..b0c12c08cac4 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -2661,7 +2661,6 @@ void register_console(struct console *newcon) * already have a valid console */ if (newcon->flags & CON_BOOT) { - /* find the last or real console */ for_each_console(bcon) { if (!(bcon->flags & CON_BOOT)) { pr_info("Too late to register bootconsole %s%d\n", -- cgit v1.2.3-71-gd317 From bb72e3981d8e7bb80db7cebc57a4e10769a509c9 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 3 Feb 2020 15:31:28 +0200 Subject: console: Avoid positive return code from unregister_console() There are only two callers that use the returned code from unregister_console(): - unregister_early_console() in arch/m68k/kernel/early_printk.c - kgdb_unregister_nmi_console() in drivers/tty/serial/kgdb_nmi.c They both expect to get "0" on success and a non-zero value on error. But the current behavior is confusing and buggy: - _braille_unregister_console() returns "1" on success - unregister_console() returns "1" on error Fix and clean up the behavior: - Return success when _braille_unregister_console() succeeded - Return a meaningful error code when the console was not registered before Link: http://lkml.kernel.org/r/20200203133130.11591-5-andriy.shevchenko@linux.intel.com To: linux-kernel@vger.kernel.org To: Steven Rostedt Signed-off-by: Andy Shevchenko Reviewed-by: Sergey Senozhatsky Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index b0c12c08cac4..61d188f4c672 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -2816,10 +2816,12 @@ int unregister_console(struct console *console) console->name, console->index); res = _braille_unregister_console(console); - if (res) + if (res < 0) return res; + if (res > 0) + return 0; - res = 1; + res = -ENODEV; console_lock(); if (console_drivers == console) { console_drivers=console->next; -- cgit v1.2.3-71-gd317 From e78bedbd42b7caa65551f3d56cbff451ccbd0aee Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 3 Feb 2020 15:31:29 +0200 Subject: console: Don't notify user space when unregister non-listed console If console is not on the list then there is nothing for us to do and sysfs notify is pointless. Note, that nr_ext_console_drivers is being changed only for listed consoles. Suggested-by: Sergey Senozhatsky Link: http://lkml.kernel.org/r/20200203133130.11591-6-andriy.shevchenko@linux.intel.com To: linux-kernel@vger.kernel.org To: Steven Rostedt Signed-off-by: Andy Shevchenko Reviewed-by: Sergey Senozhatsky Reviewed-by: Petr Mladek Signed-off-by: Petr Mladek --- kernel/printk/printk.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 61d188f4c672..f435a17ef988 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -2836,7 +2836,10 @@ int unregister_console(struct console *console) } } - if (!res && (console->flags & CON_EXTENDED)) + if (res) + goto out_disable_unlock; + + if (console->flags & CON_EXTENDED) nr_ext_console_drivers--; /* @@ -2849,6 +2852,13 @@ int unregister_console(struct console *console) console->flags &= ~CON_ENABLED; console_unlock(); console_sysfs_notify(); + + return res; + +out_disable_unlock: + console->flags &= ~CON_ENABLED; + console_unlock(); + return res; } EXPORT_SYMBOL(unregister_console); -- cgit v1.2.3-71-gd317 From ed31685c96e18f773ca11dd1a637974d62130673 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Mon, 3 Feb 2020 15:31:30 +0200 Subject: console: Introduce ->exit() callback Some consoles might require special operations on unregistering. For instance, serial console, when registered in the kernel, keeps power on for entire time, until it gets unregistered. Example of use: ->setup(console): pm_runtime_get(...); ->exit(console): pm_runtime_put(...); For such cases to have a balance we would provide ->exit() callback. Link: http://lkml.kernel.org/r/20200203133130.11591-7-andriy.shevchenko@linux.intel.com To: linux-kernel@vger.kernel.org To: Steven Rostedt Signed-off-by: Andy Shevchenko Reviewed-by: Sergey Senozhatsky Reviewed-by: Petr Mladek Signed-off-by: Petr Mladek --- include/linux/console.h | 1 + kernel/printk/printk.c | 3 +++ 2 files changed, 4 insertions(+) (limited to 'kernel') diff --git a/include/linux/console.h b/include/linux/console.h index d09951d5a94e..0cbd2a36aa00 100644 --- a/include/linux/console.h +++ b/include/linux/console.h @@ -149,6 +149,7 @@ struct console { struct tty_driver *(*device)(struct console *, int *); void (*unblank)(void); int (*setup)(struct console *, char *); + int (*exit)(struct console *); int (*match)(struct console *, char *name, int idx, char *options); short flags; short index; diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index f435a17ef988..633f41a11d75 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -2853,6 +2853,9 @@ int unregister_console(struct console *console) console_unlock(); console_sysfs_notify(); + if (console->exit) + res = console->exit(console); + return res; out_disable_unlock: -- cgit v1.2.3-71-gd317 From b3b9c187dc2544923a601733a85352b9ddaba9b3 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 6 Feb 2020 10:24:03 -0500 Subject: locking/lockdep: Decrement IRQ context counters when removing lock chain There are currently three counters to track the IRQ context of a lock chain - nr_hardirq_chains, nr_softirq_chains and nr_process_chains. They are incremented when a new lock chain is added, but they are not decremented when a lock chain is removed. That causes some of the statistic counts reported by /proc/lockdep_stats to be incorrect. IRQ Fix that by decrementing the right counter when a lock chain is removed. Since inc_chains() no longer accesses hardirq_context and softirq_context directly, it is moved out from the CONFIG_TRACE_IRQFLAGS conditional compilation block. Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200206152408.24165-2-longman@redhat.com --- kernel/locking/lockdep.c | 40 ++++++++++++++++++++++---------------- kernel/locking/lockdep_internals.h | 6 ++++++ 2 files changed, 29 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 32406ef0d6a2..35449f5b79fb 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2298,18 +2298,6 @@ static int check_irq_usage(struct task_struct *curr, struct held_lock *prev, return 0; } -static void inc_chains(void) -{ - if (current->hardirq_context) - nr_hardirq_chains++; - else { - if (current->softirq_context) - nr_softirq_chains++; - else - nr_process_chains++; - } -} - #else static inline int check_irq_usage(struct task_struct *curr, @@ -2317,13 +2305,27 @@ static inline int check_irq_usage(struct task_struct *curr, { return 1; } +#endif /* CONFIG_TRACE_IRQFLAGS */ -static inline void inc_chains(void) +static void inc_chains(int irq_context) { - nr_process_chains++; + if (irq_context & LOCK_CHAIN_HARDIRQ_CONTEXT) + nr_hardirq_chains++; + else if (irq_context & LOCK_CHAIN_SOFTIRQ_CONTEXT) + nr_softirq_chains++; + else + nr_process_chains++; } -#endif /* CONFIG_TRACE_IRQFLAGS */ +static void dec_chains(int irq_context) +{ + if (irq_context & LOCK_CHAIN_HARDIRQ_CONTEXT) + nr_hardirq_chains--; + else if (irq_context & LOCK_CHAIN_SOFTIRQ_CONTEXT) + nr_softirq_chains--; + else + nr_process_chains--; +} static void print_deadlock_scenario(struct held_lock *nxt, struct held_lock *prv) @@ -2843,7 +2845,7 @@ static inline int add_chain_cache(struct task_struct *curr, hlist_add_head_rcu(&chain->entry, hash_head); debug_atomic_inc(chain_lookup_misses); - inc_chains(); + inc_chains(chain->irq_context); return 1; } @@ -3596,7 +3598,8 @@ lock_used: static inline unsigned int task_irq_context(struct task_struct *task) { - return 2 * !!task->hardirq_context + !!task->softirq_context; + return LOCK_CHAIN_HARDIRQ_CONTEXT * !!task->hardirq_context + + LOCK_CHAIN_SOFTIRQ_CONTEXT * !!task->softirq_context; } static int separate_irq_context(struct task_struct *curr, @@ -4798,6 +4801,8 @@ recalc: return; /* Overwrite the chain key for concurrent RCU readers. */ WRITE_ONCE(chain->chain_key, chain_key); + dec_chains(chain->irq_context); + /* * Note: calling hlist_del_rcu() from inside a * hlist_for_each_entry_rcu() loop is safe. @@ -4819,6 +4824,7 @@ recalc: } *new_chain = *chain; hlist_add_head_rcu(&new_chain->entry, chainhashentry(chain_key)); + inc_chains(new_chain->irq_context); #endif } diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index 18d85aebbb57..a525368b8cf6 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -106,6 +106,12 @@ static const unsigned long LOCKF_USED_IN_IRQ_READ = #define STACK_TRACE_HASH_SIZE 16384 #endif +/* + * Bit definitions for lock_chain.irq_context + */ +#define LOCK_CHAIN_SOFTIRQ_CONTEXT (1 << 0) +#define LOCK_CHAIN_HARDIRQ_CONTEXT (1 << 1) + #define MAX_LOCKDEP_CHAINS (1UL << MAX_LOCKDEP_CHAINS_BITS) #define MAX_LOCKDEP_CHAIN_HLOCKS (MAX_LOCKDEP_CHAINS*5) -- cgit v1.2.3-71-gd317 From b9875e9882295749a14b31e16dd504ae904cf070 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 6 Feb 2020 10:24:04 -0500 Subject: locking/lockdep: Display irq_context names in /proc/lockdep_chains Currently, the irq_context field of a lock chains displayed in /proc/lockdep_chains is just a number. It is likely that many people may not know what a non-zero number means. To make the information more useful, print the actual irq names ("softirq" and "hardirq") instead. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200206152408.24165-3-longman@redhat.com --- kernel/locking/lockdep_proc.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c index 231684cfc5ae..c2224381c149 100644 --- a/kernel/locking/lockdep_proc.c +++ b/kernel/locking/lockdep_proc.c @@ -128,6 +128,13 @@ static int lc_show(struct seq_file *m, void *v) struct lock_chain *chain = v; struct lock_class *class; int i; + static const char * const irq_strs[] = { + [0] = "0", + [LOCK_CHAIN_HARDIRQ_CONTEXT] = "hardirq", + [LOCK_CHAIN_SOFTIRQ_CONTEXT] = "softirq", + [LOCK_CHAIN_SOFTIRQ_CONTEXT| + LOCK_CHAIN_HARDIRQ_CONTEXT] = "hardirq|softirq", + }; if (v == SEQ_START_TOKEN) { if (nr_chain_hlocks > MAX_LOCKDEP_CHAIN_HLOCKS) @@ -136,7 +143,7 @@ static int lc_show(struct seq_file *m, void *v) return 0; } - seq_printf(m, "irq_context: %d\n", chain->irq_context); + seq_printf(m, "irq_context: %s\n", irq_strs[chain->irq_context]); for (i = 0; i < chain->depth; i++) { class = lock_chain_get_class(chain, i); -- cgit v1.2.3-71-gd317 From 1d44bcb4fdb650b2a57c9ff593a4d246a10ad801 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 6 Feb 2020 10:24:05 -0500 Subject: locking/lockdep: Track number of zapped classes The whole point of the lockdep dynamic key patch is to allow unused locks to be removed from the lockdep data buffers so that existing buffer space can be reused. However, there is no way to find out how many unused locks are zapped and so we don't know if the zapping process is working properly. Add a new nr_zapped_classes counter to track that and show it in /proc/lockdep_stats. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200206152408.24165-4-longman@redhat.com --- kernel/locking/lockdep.c | 2 ++ kernel/locking/lockdep_internals.h | 1 + kernel/locking/lockdep_proc.c | 6 ++++++ 3 files changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 35449f5b79fb..28222d03345f 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -147,6 +147,7 @@ static DECLARE_BITMAP(list_entries_in_use, MAX_LOCKDEP_ENTRIES); #define KEYHASH_SIZE (1UL << KEYHASH_BITS) static struct hlist_head lock_keys_hash[KEYHASH_SIZE]; unsigned long nr_lock_classes; +unsigned long nr_zapped_classes; #ifndef CONFIG_DEBUG_LOCKDEP static #endif @@ -4880,6 +4881,7 @@ static void zap_class(struct pending_free *pf, struct lock_class *class) } remove_class_from_lock_chains(pf, class); + nr_zapped_classes++; } static void reinit_class(struct lock_class *class) diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index a525368b8cf6..76db80446a32 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -130,6 +130,7 @@ extern const char *__get_key_name(const struct lockdep_subclass_key *key, struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i); extern unsigned long nr_lock_classes; +extern unsigned long nr_zapped_classes; extern unsigned long nr_list_entries; long lockdep_next_lockchain(long i); unsigned long lock_chain_count(void); diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c index c2224381c149..99e2f3f6adca 100644 --- a/kernel/locking/lockdep_proc.c +++ b/kernel/locking/lockdep_proc.c @@ -343,6 +343,12 @@ static int lockdep_stats_show(struct seq_file *m, void *v) seq_printf(m, " debug_locks: %11u\n", debug_locks); + /* + * Zappped classes and lockdep data buffers reuse statistics. + */ + seq_puts(m, "\n"); + seq_printf(m, " zapped classes: %11lu\n", + nr_zapped_classes); return 0; } -- cgit v1.2.3-71-gd317 From 836bd74b5957779171c9648e9e29145fc089fffe Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 6 Feb 2020 10:24:06 -0500 Subject: locking/lockdep: Throw away all lock chains with zapped class If a lock chain contains a class that is zapped, the whole lock chain is likely to be invalid. If the zapped class is at the end of the chain, the partial chain without the zapped class should have been stored already as the current code will store all its predecessor chains. If the zapped class is somewhere in the middle, there is no guarantee that the partial chain will actually happen. It may just clutter up the hash and make searching slower. I would rather prefer storing the chain only when it actually happens. So just dump the corresponding chain_hlocks entries for now. A latter patch will try to reuse the freed chain_hlocks entries. This patch also changes the type of nr_chain_hlocks to unsigned integer to be consistent with the other counters. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200206152408.24165-5-longman@redhat.com --- kernel/locking/lockdep.c | 37 ++++--------------------------------- kernel/locking/lockdep_internals.h | 4 ++-- kernel/locking/lockdep_proc.c | 2 +- 3 files changed, 7 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 28222d03345f..ef2a6432dd10 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2625,8 +2625,8 @@ out_bug: struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS]; static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS); -int nr_chain_hlocks; static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; +unsigned int nr_chain_hlocks; struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i) { @@ -4772,36 +4772,23 @@ static void remove_class_from_lock_chain(struct pending_free *pf, struct lock_class *class) { #ifdef CONFIG_PROVE_LOCKING - struct lock_chain *new_chain; - u64 chain_key; int i; for (i = chain->base; i < chain->base + chain->depth; i++) { if (chain_hlocks[i] != class - lock_classes) continue; - /* The code below leaks one chain_hlock[] entry. */ - if (--chain->depth > 0) { - memmove(&chain_hlocks[i], &chain_hlocks[i + 1], - (chain->base + chain->depth - i) * - sizeof(chain_hlocks[0])); - } /* * Each lock class occurs at most once in a lock chain so once * we found a match we can break out of this loop. */ - goto recalc; + goto free_lock_chain; } /* Since the chain has not been modified, return. */ return; -recalc: - chain_key = INITIAL_CHAIN_KEY; - for (i = chain->base; i < chain->base + chain->depth; i++) - chain_key = iterate_chain_key(chain_key, chain_hlocks[i]); - if (chain->depth && chain->chain_key == chain_key) - return; +free_lock_chain: /* Overwrite the chain key for concurrent RCU readers. */ - WRITE_ONCE(chain->chain_key, chain_key); + WRITE_ONCE(chain->chain_key, INITIAL_CHAIN_KEY); dec_chains(chain->irq_context); /* @@ -4810,22 +4797,6 @@ recalc: */ hlist_del_rcu(&chain->entry); __set_bit(chain - lock_chains, pf->lock_chains_being_freed); - if (chain->depth == 0) - return; - /* - * If the modified lock chain matches an existing lock chain, drop - * the modified lock chain. - */ - if (lookup_chain_cache(chain_key)) - return; - new_chain = alloc_lock_chain(); - if (WARN_ON_ONCE(!new_chain)) { - debug_locks_off(); - return; - } - *new_chain = *chain; - hlist_add_head_rcu(&new_chain->entry, chainhashentry(chain_key)); - inc_chains(new_chain->irq_context); #endif } diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index 76db80446a32..926bfa4b4564 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -134,14 +134,14 @@ extern unsigned long nr_zapped_classes; extern unsigned long nr_list_entries; long lockdep_next_lockchain(long i); unsigned long lock_chain_count(void); -extern int nr_chain_hlocks; extern unsigned long nr_stack_trace_entries; extern unsigned int nr_hardirq_chains; extern unsigned int nr_softirq_chains; extern unsigned int nr_process_chains; -extern unsigned int max_lockdep_depth; +extern unsigned int nr_chain_hlocks; +extern unsigned int max_lockdep_depth; extern unsigned int max_bfs_queue_depth; #ifdef CONFIG_PROVE_LOCKING diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c index 99e2f3f6adca..53c2a2ab4f07 100644 --- a/kernel/locking/lockdep_proc.c +++ b/kernel/locking/lockdep_proc.c @@ -278,7 +278,7 @@ static int lockdep_stats_show(struct seq_file *m, void *v) #ifdef CONFIG_PROVE_LOCKING seq_printf(m, " dependency chains: %11lu [max: %lu]\n", lock_chain_count(), MAX_LOCKDEP_CHAINS); - seq_printf(m, " dependency chain hlocks: %11d [max: %lu]\n", + seq_printf(m, " dependency chain hlocks: %11u [max: %lu]\n", nr_chain_hlocks, MAX_LOCKDEP_CHAIN_HLOCKS); #endif -- cgit v1.2.3-71-gd317 From 797b82eb906eeba24dcd6e9ab92faef01fc684cb Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 6 Feb 2020 10:24:07 -0500 Subject: locking/lockdep: Track number of zapped lock chains Add a new counter nr_zapped_lock_chains to track the number lock chains that have been removed. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200206152408.24165-6-longman@redhat.com --- kernel/locking/lockdep.c | 2 ++ kernel/locking/lockdep_internals.h | 1 + kernel/locking/lockdep_proc.c | 4 ++++ 3 files changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index ef2a6432dd10..a63976c75253 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2626,6 +2626,7 @@ out_bug: struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS]; static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS); static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; +unsigned long nr_zapped_lock_chains; unsigned int nr_chain_hlocks; struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i) @@ -4797,6 +4798,7 @@ free_lock_chain: */ hlist_del_rcu(&chain->entry); __set_bit(chain - lock_chains, pf->lock_chains_being_freed); + nr_zapped_lock_chains++; #endif } diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index 926bfa4b4564..af722ceeda33 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -131,6 +131,7 @@ struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i); extern unsigned long nr_lock_classes; extern unsigned long nr_zapped_classes; +extern unsigned long nr_zapped_lock_chains; extern unsigned long nr_list_entries; long lockdep_next_lockchain(long i); unsigned long lock_chain_count(void); diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c index 53c2a2ab4f07..524580db4779 100644 --- a/kernel/locking/lockdep_proc.c +++ b/kernel/locking/lockdep_proc.c @@ -349,6 +349,10 @@ static int lockdep_stats_show(struct seq_file *m, void *v) seq_puts(m, "\n"); seq_printf(m, " zapped classes: %11lu\n", nr_zapped_classes); +#ifdef CONFIG_PROVE_LOCKING + seq_printf(m, " zapped lock chains: %11lu\n", + nr_zapped_lock_chains); +#endif return 0; } -- cgit v1.2.3-71-gd317 From 810507fe6fd5ff3de429121adff49523fabb643a Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Thu, 6 Feb 2020 10:24:08 -0500 Subject: locking/lockdep: Reuse freed chain_hlocks entries Once a lock class is zapped, all the lock chains that include the zapped class are essentially useless. The lock_chain structure itself can be reused, but not the corresponding chain_hlocks[] entries. Over time, we will run out of chain_hlocks entries while there are still plenty of other lockdep array entries available. To fix this imbalance, we have to make chain_hlocks entries reusable just like the others. As the freed chain_hlocks entries are in blocks of various lengths. A simple bitmap like the one used in the other reusable lockdep arrays isn't applicable. Instead the chain_hlocks entries are put into bucketed lists (MAX_CHAIN_BUCKETS) of chain blocks. Bucket 0 is the variable size bucket which houses chain blocks of size larger than MAX_CHAIN_BUCKETS sorted in decreasing size order. Initially, the whole array is in one chain block (the primordial chain block) in bucket 0. The minimum size of a chain block is 2 chain_hlocks entries. That will be the minimum allocation size. In other word, allocation requests for one chain_hlocks entry will cause 2-entry block to be returned and hence 1 entry will be wasted. Allocation requests for the chain_hlocks are fulfilled first by looking for chain block of matching size. If not found, the first chain block from bucket[0] (the largest one) is split. That can cause hlock entries fragmentation and reduce allocation efficiency if a chain block of size > MAX_CHAIN_BUCKETS is ever zapped and put back to after the primordial chain block. So the MAX_CHAIN_BUCKETS must be large enough that this should seldom happen. By reusing the chain_hlocks entries, we are able to handle workloads that add and zap a lot of lock classes without the risk of running out of chain_hlocks entries as long as the total number of outstanding lock classes at any time remain within a reasonable limit. Two new tracking counters, nr_free_chain_hlocks & nr_large_chain_blocks, are added to track the total number of chain_hlocks entries in the free bucketed lists and the number of large chain blocks in buckets[0] respectively. The nr_free_chain_hlocks replaces nr_chain_hlocks. The nr_large_chain_blocks counter enables to see if we should increase the number of buckets (MAX_CHAIN_BUCKETS) available so as to avoid to avoid the fragmentation problem in bucket[0]. An internal nfsd test that ran for more than an hour and kept on loading and unloading kernel modules could cause the following message to be displayed. [ 4318.443670] BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low! The patched kernel was able to complete the test with a lot of free chain_hlocks entries to spare: # cat /proc/lockdep_stats : dependency chains: 18867 [max: 65536] dependency chain hlocks: 74926 [max: 327680] dependency chain hlocks lost: 0 : zapped classes: 1541 zapped lock chains: 56765 large chain blocks: 1 By changing MAX_CHAIN_BUCKETS to 3 and add a counter for the size of the largest chain block. The system still worked and We got the following lockdep_stats data: dependency chains: 18601 [max: 65536] dependency chain hlocks used: 73133 [max: 327680] dependency chain hlocks lost: 0 : zapped classes: 1541 zapped lock chains: 56702 large chain blocks: 45165 large chain block size: 20165 By running the test again, I was indeed able to cause chain_hlocks entries to get lost: dependency chain hlocks used: 74806 [max: 327680] dependency chain hlocks lost: 575 : large chain blocks: 48737 large chain block size: 7 Due to the fragmentation, it is possible that the "MAX_LOCKDEP_CHAIN_HLOCKS too low!" error can happen even if a lot of of chain_hlocks entries appear to be free. Fortunately, a MAX_CHAIN_BUCKETS value of 16 should be big enough that few variable sized chain blocks, other than the initial one, should ever be present in bucket 0. Suggested-by: Peter Zijlstra Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200206152408.24165-7-longman@redhat.com --- kernel/locking/lockdep.c | 254 +++++++++++++++++++++++++++++++++++-- kernel/locking/lockdep_internals.h | 4 +- kernel/locking/lockdep_proc.c | 12 +- 3 files changed, 255 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index a63976c75253..e55c4ee14e64 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -1071,13 +1071,15 @@ static inline void check_data_structures(void) { } #endif /* CONFIG_DEBUG_LOCKDEP */ +static void init_chain_block_buckets(void); + /* * Initialize the lock_classes[] array elements, the free_lock_classes list * and also the delayed_free structure. */ static void init_data_structures_once(void) { - static bool ds_initialized, rcu_head_initialized; + static bool __read_mostly ds_initialized, rcu_head_initialized; int i; if (likely(rcu_head_initialized)) @@ -1101,6 +1103,7 @@ static void init_data_structures_once(void) INIT_LIST_HEAD(&lock_classes[i].locks_after); INIT_LIST_HEAD(&lock_classes[i].locks_before); } + init_chain_block_buckets(); } static inline struct hlist_head *keyhashentry(const struct lock_class_key *key) @@ -2627,7 +2630,233 @@ struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS]; static DECLARE_BITMAP(lock_chains_in_use, MAX_LOCKDEP_CHAINS); static u16 chain_hlocks[MAX_LOCKDEP_CHAIN_HLOCKS]; unsigned long nr_zapped_lock_chains; -unsigned int nr_chain_hlocks; +unsigned int nr_free_chain_hlocks; /* Free chain_hlocks in buckets */ +unsigned int nr_lost_chain_hlocks; /* Lost chain_hlocks */ +unsigned int nr_large_chain_blocks; /* size > MAX_CHAIN_BUCKETS */ + +/* + * The first 2 chain_hlocks entries in the chain block in the bucket + * list contains the following meta data: + * + * entry[0]: + * Bit 15 - always set to 1 (it is not a class index) + * Bits 0-14 - upper 15 bits of the next block index + * entry[1] - lower 16 bits of next block index + * + * A next block index of all 1 bits means it is the end of the list. + * + * On the unsized bucket (bucket-0), the 3rd and 4th entries contain + * the chain block size: + * + * entry[2] - upper 16 bits of the chain block size + * entry[3] - lower 16 bits of the chain block size + */ +#define MAX_CHAIN_BUCKETS 16 +#define CHAIN_BLK_FLAG (1U << 15) +#define CHAIN_BLK_LIST_END 0xFFFFU + +static int chain_block_buckets[MAX_CHAIN_BUCKETS]; + +static inline int size_to_bucket(int size) +{ + if (size > MAX_CHAIN_BUCKETS) + return 0; + + return size - 1; +} + +/* + * Iterate all the chain blocks in a bucket. + */ +#define for_each_chain_block(bucket, prev, curr) \ + for ((prev) = -1, (curr) = chain_block_buckets[bucket]; \ + (curr) >= 0; \ + (prev) = (curr), (curr) = chain_block_next(curr)) + +/* + * next block or -1 + */ +static inline int chain_block_next(int offset) +{ + int next = chain_hlocks[offset]; + + WARN_ON_ONCE(!(next & CHAIN_BLK_FLAG)); + + if (next == CHAIN_BLK_LIST_END) + return -1; + + next &= ~CHAIN_BLK_FLAG; + next <<= 16; + next |= chain_hlocks[offset + 1]; + + return next; +} + +/* + * bucket-0 only + */ +static inline int chain_block_size(int offset) +{ + return (chain_hlocks[offset + 2] << 16) | chain_hlocks[offset + 3]; +} + +static inline void init_chain_block(int offset, int next, int bucket, int size) +{ + chain_hlocks[offset] = (next >> 16) | CHAIN_BLK_FLAG; + chain_hlocks[offset + 1] = (u16)next; + + if (size && !bucket) { + chain_hlocks[offset + 2] = size >> 16; + chain_hlocks[offset + 3] = (u16)size; + } +} + +static inline void add_chain_block(int offset, int size) +{ + int bucket = size_to_bucket(size); + int next = chain_block_buckets[bucket]; + int prev, curr; + + if (unlikely(size < 2)) { + /* + * We can't store single entries on the freelist. Leak them. + * + * One possible way out would be to uniquely mark them, other + * than with CHAIN_BLK_FLAG, such that we can recover them when + * the block before it is re-added. + */ + if (size) + nr_lost_chain_hlocks++; + return; + } + + nr_free_chain_hlocks += size; + if (!bucket) { + nr_large_chain_blocks++; + + /* + * Variable sized, sort large to small. + */ + for_each_chain_block(0, prev, curr) { + if (size >= chain_block_size(curr)) + break; + } + init_chain_block(offset, curr, 0, size); + if (prev < 0) + chain_block_buckets[0] = offset; + else + init_chain_block(prev, offset, 0, 0); + return; + } + /* + * Fixed size, add to head. + */ + init_chain_block(offset, next, bucket, size); + chain_block_buckets[bucket] = offset; +} + +/* + * Only the first block in the list can be deleted. + * + * For the variable size bucket[0], the first block (the largest one) is + * returned, broken up and put back into the pool. So if a chain block of + * length > MAX_CHAIN_BUCKETS is ever used and zapped, it will just be + * queued up after the primordial chain block and never be used until the + * hlock entries in the primordial chain block is almost used up. That + * causes fragmentation and reduce allocation efficiency. That can be + * monitored by looking at the "large chain blocks" number in lockdep_stats. + */ +static inline void del_chain_block(int bucket, int size, int next) +{ + nr_free_chain_hlocks -= size; + chain_block_buckets[bucket] = next; + + if (!bucket) + nr_large_chain_blocks--; +} + +static void init_chain_block_buckets(void) +{ + int i; + + for (i = 0; i < MAX_CHAIN_BUCKETS; i++) + chain_block_buckets[i] = -1; + + add_chain_block(0, ARRAY_SIZE(chain_hlocks)); +} + +/* + * Return offset of a chain block of the right size or -1 if not found. + * + * Fairly simple worst-fit allocator with the addition of a number of size + * specific free lists. + */ +static int alloc_chain_hlocks(int req) +{ + int bucket, curr, size; + + /* + * We rely on the MSB to act as an escape bit to denote freelist + * pointers. Make sure this bit isn't set in 'normal' class_idx usage. + */ + BUILD_BUG_ON((MAX_LOCKDEP_KEYS-1) & CHAIN_BLK_FLAG); + + init_data_structures_once(); + + if (nr_free_chain_hlocks < req) + return -1; + + /* + * We require a minimum of 2 (u16) entries to encode a freelist + * 'pointer'. + */ + req = max(req, 2); + bucket = size_to_bucket(req); + curr = chain_block_buckets[bucket]; + + if (bucket) { + if (curr >= 0) { + del_chain_block(bucket, req, chain_block_next(curr)); + return curr; + } + /* Try bucket 0 */ + curr = chain_block_buckets[0]; + } + + /* + * The variable sized freelist is sorted by size; the first entry is + * the largest. Use it if it fits. + */ + if (curr >= 0) { + size = chain_block_size(curr); + if (likely(size >= req)) { + del_chain_block(0, size, chain_block_next(curr)); + add_chain_block(curr + req, size - req); + return curr; + } + } + + /* + * Last resort, split a block in a larger sized bucket. + */ + for (size = MAX_CHAIN_BUCKETS; size > req; size--) { + bucket = size_to_bucket(size); + curr = chain_block_buckets[bucket]; + if (curr < 0) + continue; + + del_chain_block(bucket, size, chain_block_next(curr)); + add_chain_block(curr + req, size - req); + return curr; + } + + return -1; +} + +static inline void free_chain_hlocks(int base, int size) +{ + add_chain_block(base, max(size, 2)); +} struct lock_class *lock_chain_get_class(struct lock_chain *chain, int i) { @@ -2828,15 +3057,8 @@ static inline int add_chain_cache(struct task_struct *curr, BUILD_BUG_ON((1UL << 6) <= ARRAY_SIZE(curr->held_locks)); BUILD_BUG_ON((1UL << 8*sizeof(chain_hlocks[0])) <= ARRAY_SIZE(lock_classes)); - if (likely(nr_chain_hlocks + chain->depth <= MAX_LOCKDEP_CHAIN_HLOCKS)) { - chain->base = nr_chain_hlocks; - for (j = 0; j < chain->depth - 1; j++, i++) { - int lock_id = curr->held_locks[i].class_idx; - chain_hlocks[chain->base + j] = lock_id; - } - chain_hlocks[chain->base + j] = class - lock_classes; - nr_chain_hlocks += chain->depth; - } else { + j = alloc_chain_hlocks(chain->depth); + if (j < 0) { if (!debug_locks_off_graph_unlock()) return 0; @@ -2845,6 +3067,13 @@ static inline int add_chain_cache(struct task_struct *curr, return 0; } + chain->base = j; + for (j = 0; j < chain->depth - 1; j++, i++) { + int lock_id = curr->held_locks[i].class_idx; + + chain_hlocks[chain->base + j] = lock_id; + } + chain_hlocks[chain->base + j] = class - lock_classes; hlist_add_head_rcu(&chain->entry, hash_head); debug_atomic_inc(chain_lookup_misses); inc_chains(chain->irq_context); @@ -2991,6 +3220,8 @@ static inline int validate_chain(struct task_struct *curr, { return 1; } + +static void init_chain_block_buckets(void) { } #endif /* CONFIG_PROVE_LOCKING */ /* @@ -4788,6 +5019,7 @@ static void remove_class_from_lock_chain(struct pending_free *pf, return; free_lock_chain: + free_chain_hlocks(chain->base, chain->depth); /* Overwrite the chain key for concurrent RCU readers. */ WRITE_ONCE(chain->chain_key, INITIAL_CHAIN_KEY); dec_chains(chain->irq_context); diff --git a/kernel/locking/lockdep_internals.h b/kernel/locking/lockdep_internals.h index af722ceeda33..baca699b94e9 100644 --- a/kernel/locking/lockdep_internals.h +++ b/kernel/locking/lockdep_internals.h @@ -140,7 +140,9 @@ extern unsigned long nr_stack_trace_entries; extern unsigned int nr_hardirq_chains; extern unsigned int nr_softirq_chains; extern unsigned int nr_process_chains; -extern unsigned int nr_chain_hlocks; +extern unsigned int nr_free_chain_hlocks; +extern unsigned int nr_lost_chain_hlocks; +extern unsigned int nr_large_chain_blocks; extern unsigned int max_lockdep_depth; extern unsigned int max_bfs_queue_depth; diff --git a/kernel/locking/lockdep_proc.c b/kernel/locking/lockdep_proc.c index 524580db4779..5525cd3ba0c8 100644 --- a/kernel/locking/lockdep_proc.c +++ b/kernel/locking/lockdep_proc.c @@ -137,7 +137,7 @@ static int lc_show(struct seq_file *m, void *v) }; if (v == SEQ_START_TOKEN) { - if (nr_chain_hlocks > MAX_LOCKDEP_CHAIN_HLOCKS) + if (!nr_free_chain_hlocks) seq_printf(m, "(buggered) "); seq_printf(m, "all lock chains:\n"); return 0; @@ -278,8 +278,12 @@ static int lockdep_stats_show(struct seq_file *m, void *v) #ifdef CONFIG_PROVE_LOCKING seq_printf(m, " dependency chains: %11lu [max: %lu]\n", lock_chain_count(), MAX_LOCKDEP_CHAINS); - seq_printf(m, " dependency chain hlocks: %11u [max: %lu]\n", - nr_chain_hlocks, MAX_LOCKDEP_CHAIN_HLOCKS); + seq_printf(m, " dependency chain hlocks used: %11lu [max: %lu]\n", + MAX_LOCKDEP_CHAIN_HLOCKS - + (nr_free_chain_hlocks + nr_lost_chain_hlocks), + MAX_LOCKDEP_CHAIN_HLOCKS); + seq_printf(m, " dependency chain hlocks lost: %11u\n", + nr_lost_chain_hlocks); #endif #ifdef CONFIG_TRACE_IRQFLAGS @@ -352,6 +356,8 @@ static int lockdep_stats_show(struct seq_file *m, void *v) #ifdef CONFIG_PROVE_LOCKING seq_printf(m, " zapped lock chains: %11lu\n", nr_zapped_lock_chains); + seq_printf(m, " large chain blocks: %11u\n", + nr_large_chain_blocks); #endif return 0; } -- cgit v1.2.3-71-gd317 From 1751060e2527462714359573a39dca10451ffbf8 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 30 Oct 2019 20:01:26 +0100 Subject: locking/percpu-rwsem, lockdep: Make percpu-rwsem use its own lockdep_map As preparation for replacing the embedded rwsem, give percpu-rwsem its own lockdep_map. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Davidlohr Bueso Acked-by: Will Deacon Acked-by: Waiman Long Tested-by: Juri Lelli Link: https://lkml.kernel.org/r/20200131151539.927625541@infradead.org --- include/linux/percpu-rwsem.h | 29 +++++++++++++++++++---------- kernel/cpu.c | 4 ++-- kernel/locking/percpu-rwsem.c | 16 ++++++++++++---- kernel/locking/rwsem.c | 4 ++-- kernel/locking/rwsem.h | 2 ++ 5 files changed, 37 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index ad2ca2a89d5b..f2c36fb5e661 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -15,8 +15,17 @@ struct percpu_rw_semaphore { struct rw_semaphore rw_sem; /* slowpath */ struct rcuwait writer; /* blocked writer */ int readers_block; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + struct lockdep_map dep_map; +#endif }; +#ifdef CONFIG_DEBUG_LOCK_ALLOC +#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname }, +#else +#define __PERCPU_RWSEM_DEP_MAP_INIT(lockname) +#endif + #define __DEFINE_PERCPU_RWSEM(name, is_static) \ static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_rc_##name); \ is_static struct percpu_rw_semaphore name = { \ @@ -24,7 +33,9 @@ is_static struct percpu_rw_semaphore name = { \ .read_count = &__percpu_rwsem_rc_##name, \ .rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \ .writer = __RCUWAIT_INITIALIZER(name.writer), \ + __PERCPU_RWSEM_DEP_MAP_INIT(name) \ } + #define DEFINE_PERCPU_RWSEM(name) \ __DEFINE_PERCPU_RWSEM(name, /* not static */) #define DEFINE_STATIC_PERCPU_RWSEM(name) \ @@ -37,7 +48,7 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem) { might_sleep(); - rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 0, _RET_IP_); + rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_); preempt_disable(); /* @@ -76,13 +87,15 @@ static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem) */ if (ret) - rwsem_acquire_read(&sem->rw_sem.dep_map, 0, 1, _RET_IP_); + rwsem_acquire_read(&sem->dep_map, 0, 1, _RET_IP_); return ret; } static inline void percpu_up_read(struct percpu_rw_semaphore *sem) { + rwsem_release(&sem->dep_map, _RET_IP_); + preempt_disable(); /* * Same as in percpu_down_read(). @@ -92,8 +105,6 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem) else __percpu_up_read(sem); /* Unconditional memory barrier */ preempt_enable(); - - rwsem_release(&sem->rw_sem.dep_map, _RET_IP_); } extern void percpu_down_write(struct percpu_rw_semaphore *); @@ -110,15 +121,13 @@ extern void percpu_free_rwsem(struct percpu_rw_semaphore *); __percpu_init_rwsem(sem, #sem, &rwsem_key); \ }) -#define percpu_rwsem_is_held(sem) lockdep_is_held(&(sem)->rw_sem) - -#define percpu_rwsem_assert_held(sem) \ - lockdep_assert_held(&(sem)->rw_sem) +#define percpu_rwsem_is_held(sem) lockdep_is_held(sem) +#define percpu_rwsem_assert_held(sem) lockdep_assert_held(sem) static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem, bool read, unsigned long ip) { - lock_release(&sem->rw_sem.dep_map, ip); + lock_release(&sem->dep_map, ip); #ifdef CONFIG_RWSEM_SPIN_ON_OWNER if (!read) atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN); @@ -128,7 +137,7 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem, static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem, bool read, unsigned long ip) { - lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip); + lock_acquire(&sem->dep_map, 0, 1, read, 1, NULL, ip); #ifdef CONFIG_RWSEM_SPIN_ON_OWNER if (!read) atomic_long_set(&sem->rw_sem.owner, (long)current); diff --git a/kernel/cpu.c b/kernel/cpu.c index 9c706af713fb..221bf6a9e98a 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -331,12 +331,12 @@ void lockdep_assert_cpus_held(void) static void lockdep_acquire_cpus_lock(void) { - rwsem_acquire(&cpu_hotplug_lock.rw_sem.dep_map, 0, 0, _THIS_IP_); + rwsem_acquire(&cpu_hotplug_lock.dep_map, 0, 0, _THIS_IP_); } static void lockdep_release_cpus_lock(void) { - rwsem_release(&cpu_hotplug_lock.rw_sem.dep_map, _THIS_IP_); + rwsem_release(&cpu_hotplug_lock.dep_map, _THIS_IP_); } /* diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index 364d38a0c444..aa2b118d2f88 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -11,7 +11,7 @@ #include "rwsem.h" int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, - const char *name, struct lock_class_key *rwsem_key) + const char *name, struct lock_class_key *key) { sem->read_count = alloc_percpu(int); if (unlikely(!sem->read_count)) @@ -19,9 +19,13 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, /* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */ rcu_sync_init(&sem->rss); - __init_rwsem(&sem->rw_sem, name, rwsem_key); + init_rwsem(&sem->rw_sem); rcuwait_init(&sem->writer); sem->readers_block = 0; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + debug_check_no_locks_freed((void *)sem, sizeof(*sem)); + lockdep_init_map(&sem->dep_map, name, key, 0); +#endif return 0; } EXPORT_SYMBOL_GPL(__percpu_init_rwsem); @@ -142,10 +146,12 @@ static bool readers_active_check(struct percpu_rw_semaphore *sem) void percpu_down_write(struct percpu_rw_semaphore *sem) { + rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); + /* Notify readers to take the slow path. */ rcu_sync_enter(&sem->rss); - down_write(&sem->rw_sem); + __down_write(&sem->rw_sem); /* * Notify new readers to block; up until now, and thus throughout the @@ -168,6 +174,8 @@ EXPORT_SYMBOL_GPL(percpu_down_write); void percpu_up_write(struct percpu_rw_semaphore *sem) { + rwsem_release(&sem->dep_map, _RET_IP_); + /* * Signal the writer is done, no fast path yet. * @@ -183,7 +191,7 @@ void percpu_up_write(struct percpu_rw_semaphore *sem) /* * Release the write lock, this will allow readers back in the game. */ - up_write(&sem->rw_sem); + __up_write(&sem->rw_sem); /* * Once this completes (at least one RCU-sched grace period hence) the diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 0d9b6be9ecc8..30df8dff217b 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -1383,7 +1383,7 @@ static inline int __down_read_trylock(struct rw_semaphore *sem) /* * lock for writing */ -static inline void __down_write(struct rw_semaphore *sem) +inline void __down_write(struct rw_semaphore *sem) { long tmp = RWSEM_UNLOCKED_VALUE; @@ -1446,7 +1446,7 @@ inline void __up_read(struct rw_semaphore *sem) /* * unlock after writing */ -static inline void __up_write(struct rw_semaphore *sem) +inline void __up_write(struct rw_semaphore *sem) { long tmp; diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index 2534ce49f648..d0d33a59622d 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -6,5 +6,7 @@ extern void __down_read(struct rw_semaphore *sem); extern void __up_read(struct rw_semaphore *sem); +extern void __down_write(struct rw_semaphore *sem); +extern void __up_write(struct rw_semaphore *sem); #endif /* __INTERNAL_RWSEM_H */ -- cgit v1.2.3-71-gd317 From 206c98ffbeda588dbbd9d272505c42acbc364a30 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 30 Oct 2019 20:12:37 +0100 Subject: locking/percpu-rwsem: Convert to bool Use bool where possible. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Davidlohr Bueso Acked-by: Will Deacon Acked-by: Waiman Long Tested-by: Juri Lelli Link: https://lkml.kernel.org/r/20200131151539.984626569@infradead.org --- include/linux/percpu-rwsem.h | 6 +++--- kernel/locking/percpu-rwsem.c | 8 ++++---- 2 files changed, 7 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index f2c36fb5e661..4ceaa1921951 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -41,7 +41,7 @@ is_static struct percpu_rw_semaphore name = { \ #define DEFINE_STATIC_PERCPU_RWSEM(name) \ __DEFINE_PERCPU_RWSEM(name, static) -extern int __percpu_down_read(struct percpu_rw_semaphore *, int); +extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool); extern void __percpu_up_read(struct percpu_rw_semaphore *); static inline void percpu_down_read(struct percpu_rw_semaphore *sem) @@ -69,9 +69,9 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem) preempt_enable(); } -static inline int percpu_down_read_trylock(struct percpu_rw_semaphore *sem) +static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem) { - int ret = 1; + bool ret = true; preempt_disable(); /* diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index aa2b118d2f88..969389df6eee 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -45,7 +45,7 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *sem) } EXPORT_SYMBOL_GPL(percpu_free_rwsem); -int __percpu_down_read(struct percpu_rw_semaphore *sem, int try) +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) { /* * Due to having preemption disabled the decrement happens on @@ -69,7 +69,7 @@ int __percpu_down_read(struct percpu_rw_semaphore *sem, int try) * release in percpu_up_write(). */ if (likely(!smp_load_acquire(&sem->readers_block))) - return 1; + return true; /* * Per the above comment; we still have preemption disabled and @@ -78,7 +78,7 @@ int __percpu_down_read(struct percpu_rw_semaphore *sem, int try) __percpu_up_read(sem); if (try) - return 0; + return false; /* * We either call schedule() in the wait, or we'll fall through @@ -94,7 +94,7 @@ int __percpu_down_read(struct percpu_rw_semaphore *sem, int try) __up_read(&sem->rw_sem); preempt_disable(); - return 1; + return true; } EXPORT_SYMBOL_GPL(__percpu_down_read); -- cgit v1.2.3-71-gd317 From 71365d40232110f7b029befc9033ea311d680611 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 30 Oct 2019 20:17:51 +0100 Subject: locking/percpu-rwsem: Move __this_cpu_inc() into the slowpath As preparation to rework __percpu_down_read() move the __this_cpu_inc() into it. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Davidlohr Bueso Acked-by: Will Deacon Acked-by: Waiman Long Tested-by: Juri Lelli Link: https://lkml.kernel.org/r/20200131151540.041600199@infradead.org --- include/linux/percpu-rwsem.h | 10 ++++++---- kernel/locking/percpu-rwsem.c | 2 ++ 2 files changed, 8 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index 4ceaa1921951..bb5b71c1e446 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -59,8 +59,9 @@ static inline void percpu_down_read(struct percpu_rw_semaphore *sem) * and that once the synchronize_rcu() is done, the writer will see * anything we did within this RCU-sched read-size critical section. */ - __this_cpu_inc(*sem->read_count); - if (unlikely(!rcu_sync_is_idle(&sem->rss))) + if (likely(rcu_sync_is_idle(&sem->rss))) + __this_cpu_inc(*sem->read_count); + else __percpu_down_read(sem, false); /* Unconditional memory barrier */ /* * The preempt_enable() prevents the compiler from @@ -77,8 +78,9 @@ static inline bool percpu_down_read_trylock(struct percpu_rw_semaphore *sem) /* * Same as in percpu_down_read(). */ - __this_cpu_inc(*sem->read_count); - if (unlikely(!rcu_sync_is_idle(&sem->rss))) + if (likely(rcu_sync_is_idle(&sem->rss))) + __this_cpu_inc(*sem->read_count); + else ret = __percpu_down_read(sem, true); /* Unconditional memory barrier */ preempt_enable(); /* diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index 969389df6eee..becf925b27b5 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -47,6 +47,8 @@ EXPORT_SYMBOL_GPL(percpu_free_rwsem); bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) { + __this_cpu_inc(*sem->read_count); + /* * Due to having preemption disabled the decrement happens on * the same CPU as the increment, avoiding the -- cgit v1.2.3-71-gd317 From 75ff64572e497578e238fefbdff221c96f29067a Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 31 Oct 2019 12:34:23 +0100 Subject: locking/percpu-rwsem: Extract __percpu_down_read_trylock() In preparation for removing the embedded rwsem and building a custom lock, extract the read-trylock primitive. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Davidlohr Bueso Acked-by: Will Deacon Acked-by: Waiman Long Tested-by: Juri Lelli Link: https://lkml.kernel.org/r/20200131151540.098485539@infradead.org --- kernel/locking/percpu-rwsem.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index becf925b27b5..b155e8e7ac39 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -45,7 +45,7 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *sem) } EXPORT_SYMBOL_GPL(percpu_free_rwsem); -bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) +static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem) { __this_cpu_inc(*sem->read_count); @@ -73,11 +73,18 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) if (likely(!smp_load_acquire(&sem->readers_block))) return true; - /* - * Per the above comment; we still have preemption disabled and - * will thus decrement on the same CPU as we incremented. - */ - __percpu_up_read(sem); + __this_cpu_dec(*sem->read_count); + + /* Prod writer to re-evaluate readers_active_check() */ + rcuwait_wake_up(&sem->writer); + + return false; +} + +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) +{ + if (__percpu_down_read_trylock(sem)) + return true; if (try) return false; -- cgit v1.2.3-71-gd317 From 7f26482a872c36b2ee87ea95b9dcd96e3d5805df Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 30 Oct 2019 20:30:41 +0100 Subject: locking/percpu-rwsem: Remove the embedded rwsem The filesystem freezer uses percpu-rwsem in a way that is effectively write_non_owner() and achieves this with a few horrible hacks that rely on the rwsem (!percpu) implementation. When PREEMPT_RT replaces the rwsem implementation with a PI aware variant this comes apart. Remove the embedded rwsem and implement it using a waitqueue and an atomic_t. - make readers_block an atomic, and use it, with the waitqueue for a blocking test-and-set write-side. - have the read-side wait for the 'lock' state to clear. Have the waiters use FIFO queueing and mark them (reader/writer) with a new WQ_FLAG. Use a custom wake_function to wake either a single writer or all readers until a writer. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Davidlohr Bueso Acked-by: Will Deacon Acked-by: Waiman Long Tested-by: Juri Lelli Link: https://lkml.kernel.org/r/20200204092403.GB14879@hirez.programming.kicks-ass.net --- include/linux/percpu-rwsem.h | 19 ++---- include/linux/wait.h | 1 + kernel/locking/percpu-rwsem.c | 153 +++++++++++++++++++++++++++++++----------- kernel/locking/rwsem.c | 9 ++- kernel/locking/rwsem.h | 12 ---- 5 files changed, 123 insertions(+), 71 deletions(-) (limited to 'kernel') diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index bb5b71c1e446..f5ecf6a8a1dd 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -3,18 +3,18 @@ #define _LINUX_PERCPU_RWSEM_H #include -#include #include #include +#include #include #include struct percpu_rw_semaphore { struct rcu_sync rss; unsigned int __percpu *read_count; - struct rw_semaphore rw_sem; /* slowpath */ - struct rcuwait writer; /* blocked writer */ - int readers_block; + struct rcuwait writer; + wait_queue_head_t waiters; + atomic_t block; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif @@ -31,8 +31,9 @@ static DEFINE_PER_CPU(unsigned int, __percpu_rwsem_rc_##name); \ is_static struct percpu_rw_semaphore name = { \ .rss = __RCU_SYNC_INITIALIZER(name.rss), \ .read_count = &__percpu_rwsem_rc_##name, \ - .rw_sem = __RWSEM_INITIALIZER(name.rw_sem), \ .writer = __RCUWAIT_INITIALIZER(name.writer), \ + .waiters = __WAIT_QUEUE_HEAD_INITIALIZER(name.waiters), \ + .block = ATOMIC_INIT(0), \ __PERCPU_RWSEM_DEP_MAP_INIT(name) \ } @@ -130,20 +131,12 @@ static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem, bool read, unsigned long ip) { lock_release(&sem->dep_map, ip); -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER - if (!read) - atomic_long_set(&sem->rw_sem.owner, RWSEM_OWNER_UNKNOWN); -#endif } static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem, bool read, unsigned long ip) { lock_acquire(&sem->dep_map, 0, 1, read, 1, NULL, ip); -#ifdef CONFIG_RWSEM_SPIN_ON_OWNER - if (!read) - atomic_long_set(&sem->rw_sem.owner, (long)current); -#endif } #endif diff --git a/include/linux/wait.h b/include/linux/wait.h index 3283c8d02137..feeb6be5cad6 100644 --- a/include/linux/wait.h +++ b/include/linux/wait.h @@ -20,6 +20,7 @@ int default_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, int #define WQ_FLAG_EXCLUSIVE 0x01 #define WQ_FLAG_WOKEN 0x02 #define WQ_FLAG_BOOKMARK 0x04 +#define WQ_FLAG_CUSTOM 0x08 /* * A single wait-queue entry structure: diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index b155e8e7ac39..a136677543b4 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -1,15 +1,14 @@ // SPDX-License-Identifier: GPL-2.0-only #include -#include #include +#include #include #include #include #include +#include #include -#include "rwsem.h" - int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, const char *name, struct lock_class_key *key) { @@ -17,11 +16,10 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *sem, if (unlikely(!sem->read_count)) return -ENOMEM; - /* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */ rcu_sync_init(&sem->rss); - init_rwsem(&sem->rw_sem); rcuwait_init(&sem->writer); - sem->readers_block = 0; + init_waitqueue_head(&sem->waiters); + atomic_set(&sem->block, 0); #ifdef CONFIG_DEBUG_LOCK_ALLOC debug_check_no_locks_freed((void *)sem, sizeof(*sem)); lockdep_init_map(&sem->dep_map, name, key, 0); @@ -54,23 +52,23 @@ static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem) * the same CPU as the increment, avoiding the * increment-on-one-CPU-and-decrement-on-another problem. * - * If the reader misses the writer's assignment of readers_block, then - * the writer is guaranteed to see the reader's increment. + * If the reader misses the writer's assignment of sem->block, then the + * writer is guaranteed to see the reader's increment. * * Conversely, any readers that increment their sem->read_count after - * the writer looks are guaranteed to see the readers_block value, - * which in turn means that they are guaranteed to immediately - * decrement their sem->read_count, so that it doesn't matter that the - * writer missed them. + * the writer looks are guaranteed to see the sem->block value, which + * in turn means that they are guaranteed to immediately decrement + * their sem->read_count, so that it doesn't matter that the writer + * missed them. */ smp_mb(); /* A matches D */ /* - * If !readers_block the critical section starts here, matched by the + * If !sem->block the critical section starts here, matched by the * release in percpu_up_write(). */ - if (likely(!smp_load_acquire(&sem->readers_block))) + if (likely(!atomic_read_acquire(&sem->block))) return true; __this_cpu_dec(*sem->read_count); @@ -81,6 +79,88 @@ static bool __percpu_down_read_trylock(struct percpu_rw_semaphore *sem) return false; } +static inline bool __percpu_down_write_trylock(struct percpu_rw_semaphore *sem) +{ + if (atomic_read(&sem->block)) + return false; + + return atomic_xchg(&sem->block, 1) == 0; +} + +static bool __percpu_rwsem_trylock(struct percpu_rw_semaphore *sem, bool reader) +{ + if (reader) { + bool ret; + + preempt_disable(); + ret = __percpu_down_read_trylock(sem); + preempt_enable(); + + return ret; + } + return __percpu_down_write_trylock(sem); +} + +/* + * The return value of wait_queue_entry::func means: + * + * <0 - error, wakeup is terminated and the error is returned + * 0 - no wakeup, a next waiter is tried + * >0 - woken, if EXCLUSIVE, counted towards @nr_exclusive. + * + * We use EXCLUSIVE for both readers and writers to preserve FIFO order, + * and play games with the return value to allow waking multiple readers. + * + * Specifically, we wake readers until we've woken a single writer, or until a + * trylock fails. + */ +static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry, + unsigned int mode, int wake_flags, + void *key) +{ + struct task_struct *p = get_task_struct(wq_entry->private); + bool reader = wq_entry->flags & WQ_FLAG_CUSTOM; + struct percpu_rw_semaphore *sem = key; + + /* concurrent against percpu_down_write(), can get stolen */ + if (!__percpu_rwsem_trylock(sem, reader)) + return 1; + + list_del_init(&wq_entry->entry); + smp_store_release(&wq_entry->private, NULL); + + wake_up_process(p); + put_task_struct(p); + + return !reader; /* wake (readers until) 1 writer */ +} + +static void percpu_rwsem_wait(struct percpu_rw_semaphore *sem, bool reader) +{ + DEFINE_WAIT_FUNC(wq_entry, percpu_rwsem_wake_function); + bool wait; + + spin_lock_irq(&sem->waiters.lock); + /* + * Serialize against the wakeup in percpu_up_write(), if we fail + * the trylock, the wakeup must see us on the list. + */ + wait = !__percpu_rwsem_trylock(sem, reader); + if (wait) { + wq_entry.flags |= WQ_FLAG_EXCLUSIVE | reader * WQ_FLAG_CUSTOM; + __add_wait_queue_entry_tail(&sem->waiters, &wq_entry); + } + spin_unlock_irq(&sem->waiters.lock); + + while (wait) { + set_current_state(TASK_UNINTERRUPTIBLE); + if (!smp_load_acquire(&wq_entry.private)) + break; + schedule(); + } + __set_current_state(TASK_RUNNING); +} + bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) { if (__percpu_down_read_trylock(sem)) @@ -89,20 +169,10 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) if (try) return false; - /* - * We either call schedule() in the wait, or we'll fall through - * and reschedule on the preempt_enable() in percpu_down_read(). - */ - preempt_enable_no_resched(); - - /* - * Avoid lockdep for the down/up_read() we already have them. - */ - __down_read(&sem->rw_sem); - this_cpu_inc(*sem->read_count); - __up_read(&sem->rw_sem); - + preempt_enable(); + percpu_rwsem_wait(sem, /* .reader = */ true); preempt_disable(); + return true; } EXPORT_SYMBOL_GPL(__percpu_down_read); @@ -117,7 +187,7 @@ void __percpu_up_read(struct percpu_rw_semaphore *sem) */ __this_cpu_dec(*sem->read_count); - /* Prod writer to recheck readers_active */ + /* Prod writer to re-evaluate readers_active_check() */ rcuwait_wake_up(&sem->writer); } EXPORT_SYMBOL_GPL(__percpu_up_read); @@ -137,6 +207,8 @@ EXPORT_SYMBOL_GPL(__percpu_up_read); * zero. If this sum is zero, then it is stable due to the fact that if any * newly arriving readers increment a given counter, they will immediately * decrement that same counter. + * + * Assumes sem->block is set. */ static bool readers_active_check(struct percpu_rw_semaphore *sem) { @@ -160,23 +232,22 @@ void percpu_down_write(struct percpu_rw_semaphore *sem) /* Notify readers to take the slow path. */ rcu_sync_enter(&sem->rss); - __down_write(&sem->rw_sem); - /* - * Notify new readers to block; up until now, and thus throughout the - * longish rcu_sync_enter() above, new readers could still come in. + * Try set sem->block; this provides writer-writer exclusion. + * Having sem->block set makes new readers block. */ - WRITE_ONCE(sem->readers_block, 1); + if (!__percpu_down_write_trylock(sem)) + percpu_rwsem_wait(sem, /* .reader = */ false); - smp_mb(); /* D matches A */ + /* smp_mb() implied by __percpu_down_write_trylock() on success -- D matches A */ /* - * If they don't see our writer of readers_block, then we are - * guaranteed to see their sem->read_count increment, and therefore - * will wait for them. + * If they don't see our store of sem->block, then we are guaranteed to + * see their sem->read_count increment, and therefore will wait for + * them. */ - /* Wait for all now active readers to complete. */ + /* Wait for all active readers to complete. */ rcuwait_wait_event(&sem->writer, readers_active_check(sem)); } EXPORT_SYMBOL_GPL(percpu_down_write); @@ -195,12 +266,12 @@ void percpu_up_write(struct percpu_rw_semaphore *sem) * Therefore we force it through the slow path which guarantees an * acquire and thereby guarantees the critical section's consistency. */ - smp_store_release(&sem->readers_block, 0); + atomic_set_release(&sem->block, 0); /* - * Release the write lock, this will allow readers back in the game. + * Prod any pending reader/writer to make progress. */ - __up_write(&sem->rw_sem); + __wake_up(&sem->waiters, TASK_NORMAL, 1, sem); /* * Once this completes (at least one RCU-sched grace period hence) the diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 30df8dff217b..81c0d75d5dab 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -28,7 +28,6 @@ #include #include -#include "rwsem.h" #include "lock_events.h" /* @@ -1338,7 +1337,7 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem) /* * lock for reading */ -inline void __down_read(struct rw_semaphore *sem) +static inline void __down_read(struct rw_semaphore *sem) { if (!rwsem_read_trylock(sem)) { rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE); @@ -1383,7 +1382,7 @@ static inline int __down_read_trylock(struct rw_semaphore *sem) /* * lock for writing */ -inline void __down_write(struct rw_semaphore *sem) +static inline void __down_write(struct rw_semaphore *sem) { long tmp = RWSEM_UNLOCKED_VALUE; @@ -1426,7 +1425,7 @@ static inline int __down_write_trylock(struct rw_semaphore *sem) /* * unlock after reading */ -inline void __up_read(struct rw_semaphore *sem) +static inline void __up_read(struct rw_semaphore *sem) { long tmp; @@ -1446,7 +1445,7 @@ inline void __up_read(struct rw_semaphore *sem) /* * unlock after writing */ -inline void __up_write(struct rw_semaphore *sem) +static inline void __up_write(struct rw_semaphore *sem) { long tmp; diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h index d0d33a59622d..e69de29bb2d1 100644 --- a/kernel/locking/rwsem.h +++ b/kernel/locking/rwsem.h @@ -1,12 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ - -#ifndef __INTERNAL_RWSEM_H -#define __INTERNAL_RWSEM_H -#include - -extern void __down_read(struct rw_semaphore *sem); -extern void __up_read(struct rw_semaphore *sem); -extern void __down_write(struct rw_semaphore *sem); -extern void __up_write(struct rw_semaphore *sem); - -#endif /* __INTERNAL_RWSEM_H */ -- cgit v1.2.3-71-gd317 From bcba67cd806800fa8e973ac49dbc7d2d8fb3e55e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 4 Feb 2020 09:34:37 +0100 Subject: locking/rwsem: Remove RWSEM_OWNER_UNKNOWN Remove the now unused RWSEM_OWNER_UNKNOWN hack. This hack breaks PREEMPT_RT and getting rid of it was the entire motivation for re-writing the percpu rwsem. The biggest problem is that it is fundamentally incompatible with any form of Priority Inheritance, any exclusively held lock must have a distinct owner. Requested-by: Christoph Hellwig Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Reviewed-by: Davidlohr Bueso Acked-by: Will Deacon Acked-by: Waiman Long Tested-by: Juri Lelli Link: https://lkml.kernel.org/r/20200204092228.GP14946@hirez.programming.kicks-ass.net --- include/linux/rwsem.h | 6 ------ kernel/locking/rwsem.c | 2 -- 2 files changed, 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 00d6054687dd..8a418d9eeb7a 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -53,12 +53,6 @@ struct rw_semaphore { #endif }; -/* - * Setting all bits of the owner field except bit 0 will indicate - * that the rwsem is writer-owned with an unknown owner. - */ -#define RWSEM_OWNER_UNKNOWN (-2L) - /* In all implementations count != 0 means locked */ static inline int rwsem_is_locked(struct rw_semaphore *sem) { diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 81c0d75d5dab..e6f437bafb23 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -659,8 +659,6 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem, unsigned long flags; bool ret = true; - BUILD_BUG_ON(!(RWSEM_OWNER_UNKNOWN & RWSEM_NONSPINNABLE)); - if (need_resched()) { lockevent_inc(rwsem_opt_fail); return false; -- cgit v1.2.3-71-gd317 From ac8dec420970f5cbaf2f6eda39153a60ec5b257b Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Mon, 18 Nov 2019 15:19:35 -0800 Subject: locking/percpu-rwsem: Fold __percpu_up_read() Now that __percpu_up_read() is only ever used from percpu_up_read() merge them, it's a small function. Signed-off-by: Davidlohr Bueso Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Will Deacon Acked-by: Waiman Long Link: https://lkml.kernel.org/r/20200131151540.212415454@infradead.org --- include/linux/percpu-rwsem.h | 19 +++++++++++++++---- kernel/exit.c | 1 + kernel/locking/percpu-rwsem.c | 15 --------------- 3 files changed, 16 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h index f5ecf6a8a1dd..5e033fe1ff4e 100644 --- a/include/linux/percpu-rwsem.h +++ b/include/linux/percpu-rwsem.h @@ -43,7 +43,6 @@ is_static struct percpu_rw_semaphore name = { \ __DEFINE_PERCPU_RWSEM(name, static) extern bool __percpu_down_read(struct percpu_rw_semaphore *, bool); -extern void __percpu_up_read(struct percpu_rw_semaphore *); static inline void percpu_down_read(struct percpu_rw_semaphore *sem) { @@ -103,10 +102,22 @@ static inline void percpu_up_read(struct percpu_rw_semaphore *sem) /* * Same as in percpu_down_read(). */ - if (likely(rcu_sync_is_idle(&sem->rss))) + if (likely(rcu_sync_is_idle(&sem->rss))) { __this_cpu_dec(*sem->read_count); - else - __percpu_up_read(sem); /* Unconditional memory barrier */ + } else { + /* + * slowpath; reader will only ever wake a single blocked + * writer. + */ + smp_mb(); /* B matches C */ + /* + * In other words, if they see our decrement (presumably to + * aggregate zero, as that is the only time it matters) they + * will also see our critical section. + */ + __this_cpu_dec(*sem->read_count); + rcuwait_wake_up(&sem->writer); + } preempt_enable(); } diff --git a/kernel/exit.c b/kernel/exit.c index 2833ffb0c211..f64a8f9d412a 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -258,6 +258,7 @@ void rcuwait_wake_up(struct rcuwait *w) wake_up_process(task); rcu_read_unlock(); } +EXPORT_SYMBOL_GPL(rcuwait_wake_up); /* * Determine if a process group is "orphaned", according to the POSIX diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index a136677543b4..8048a9a255d5 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -177,21 +177,6 @@ bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try) } EXPORT_SYMBOL_GPL(__percpu_down_read); -void __percpu_up_read(struct percpu_rw_semaphore *sem) -{ - smp_mb(); /* B matches C */ - /* - * In other words, if they see our decrement (presumably to aggregate - * zero, as that is the only time it matters) they will also see our - * critical section. - */ - __this_cpu_dec(*sem->read_count); - - /* Prod writer to re-evaluate readers_active_check() */ - rcuwait_wake_up(&sem->writer); -} -EXPORT_SYMBOL_GPL(__percpu_up_read); - #define per_cpu_sum(var) \ ({ \ typeof(var) __sum = 0; \ -- cgit v1.2.3-71-gd317 From 41f0e29190ac9e38099a37abd1a8a4cb1dc21233 Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Tue, 7 Jan 2020 17:33:04 -0800 Subject: locking/percpu-rwsem: Add might_sleep() for writer locking We are missing this annotation in percpu_down_write(). Correct this. Signed-off-by: Davidlohr Bueso Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Acked-by: Will Deacon Acked-by: Waiman Long Link: https://lkml.kernel.org/r/20200108013305.7732-1-dave@stgolabs.net --- kernel/locking/percpu-rwsem.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index 8048a9a255d5..183a3aaa8509 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -212,6 +212,7 @@ static bool readers_active_check(struct percpu_rw_semaphore *sem) void percpu_down_write(struct percpu_rw_semaphore *sem) { + might_sleep(); rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_); /* Notify readers to take the slow path. */ -- cgit v1.2.3-71-gd317 From bbfd5e4fab63703375eafaf241a0c696024a59e1 Mon Sep 17 00:00:00 2001 From: Kan Liang Date: Mon, 27 Jan 2020 08:53:54 -0800 Subject: perf/core: Add new branch sample type for HW index of raw branch records The low level index is the index in the underlying hardware buffer of the most recently captured taken branch which is always saved in branch_entries[0]. It is very useful for reconstructing the call stack. For example, in Intel LBR call stack mode, the depth of reconstructed LBR call stack limits to the number of LBR registers. With the low level index information, perf tool may stitch the stacks of two samples. The reconstructed LBR call stack can break the HW limitation. Add a new branch sample type to retrieve low level index of raw branch records. The low level index is between -1 (unknown) and max depth which can be retrieved in /sys/devices/cpu/caps/branches. Only when the new branch sample type is set, the low level index information is dumped into the PERF_SAMPLE_BRANCH_STACK output. Perf tool should check the attr.branch_sample_type, and apply the corresponding format for PERF_SAMPLE_BRANCH_STACK samples. Otherwise, some user case may be broken. For example, users may parse a perf.data, which include the new branch sample type, with an old version perf tool (without the check). Users probably get incorrect information without any warning. Signed-off-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com --- arch/powerpc/perf/core-book3s.c | 1 + arch/x86/events/intel/lbr.c | 3 +++ include/linux/perf_event.h | 12 ++++++++++++ include/uapi/linux/perf_event.h | 8 +++++++- kernel/events/core.c | 10 ++++++++++ 5 files changed, 33 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c index 3086055bf681..3dcfecf858f3 100644 --- a/arch/powerpc/perf/core-book3s.c +++ b/arch/powerpc/perf/core-book3s.c @@ -518,6 +518,7 @@ static void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events * } } cpuhw->bhrb_stack.nr = u_index; + cpuhw->bhrb_stack.hw_idx = -1ULL; return; } diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c index 534c76606049..7639e2097101 100644 --- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -585,6 +585,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc) cpuc->lbr_entries[i].reserved = 0; } cpuc->lbr_stack.nr = i; + cpuc->lbr_stack.hw_idx = -1ULL; } /* @@ -680,6 +681,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc) out++; } cpuc->lbr_stack.nr = out; + cpuc->lbr_stack.hw_idx = -1ULL; } void intel_pmu_lbr_read(void) @@ -1120,6 +1122,7 @@ void intel_pmu_store_pebs_lbrs(struct pebs_lbr *lbr) int i; cpuc->lbr_stack.nr = x86_pmu.lbr_nr; + cpuc->lbr_stack.hw_idx = -1ULL; for (i = 0; i < x86_pmu.lbr_nr; i++) { u64 info = lbr->lbr[i].info; struct perf_branch_entry *e = &cpuc->lbr_entries[i]; diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 547773f5894e..68e21e828893 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -93,14 +93,26 @@ struct perf_raw_record { /* * branch stack layout: * nr: number of taken branches stored in entries[] + * hw_idx: The low level index of raw branch records + * for the most recent branch. + * -1ULL means invalid/unknown. * * Note that nr can vary from sample to sample * branches (to, from) are stored from most recent * to least recent, i.e., entries[0] contains the most * recent branch. + * The entries[] is an abstraction of raw branch records, + * which may not be stored in age order in HW, e.g. Intel LBR. + * The hw_idx is to expose the low level index of raw + * branch record for the most recent branch aka entries[0]. + * The hw_idx index is between -1 (unknown) and max depth, + * which can be retrieved in /sys/devices/cpu/caps/branches. + * For the architectures whose raw branch records are + * already stored in age order, the hw_idx should be 0. */ struct perf_branch_stack { __u64 nr; + __u64 hw_idx; struct perf_branch_entry entries[0]; }; diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 377d794d3105..397cfd65b3fe 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -181,6 +181,8 @@ enum perf_branch_sample_type_shift { PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT = 16, /* save branch type */ + PERF_SAMPLE_BRANCH_HW_INDEX_SHIFT = 17, /* save low level index of raw branch records */ + PERF_SAMPLE_BRANCH_MAX_SHIFT /* non-ABI */ }; @@ -208,6 +210,8 @@ enum perf_branch_sample_type { PERF_SAMPLE_BRANCH_TYPE_SAVE = 1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT, + PERF_SAMPLE_BRANCH_HW_INDEX = 1U << PERF_SAMPLE_BRANCH_HW_INDEX_SHIFT, + PERF_SAMPLE_BRANCH_MAX = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT, }; @@ -853,7 +857,9 @@ enum perf_event_type { * char data[size];}&& PERF_SAMPLE_RAW * * { u64 nr; - * { u64 from, to, flags } lbr[nr];} && PERF_SAMPLE_BRANCH_STACK + * { u64 hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX + * { u64 from, to, flags } lbr[nr]; + * } && PERF_SAMPLE_BRANCH_STACK * * { u64 abi; # enum perf_sample_regs_abi * u64 regs[weight(mask)]; } && PERF_SAMPLE_REGS_USER diff --git a/kernel/events/core.c b/kernel/events/core.c index e453589da97c..3f1f77de7247 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6555,6 +6555,11 @@ static void perf_output_read(struct perf_output_handle *handle, perf_output_read_one(handle, event, enabled, running); } +static inline bool perf_sample_save_hw_index(struct perf_event *event) +{ + return event->attr.branch_sample_type & PERF_SAMPLE_BRANCH_HW_INDEX; +} + void perf_output_sample(struct perf_output_handle *handle, struct perf_event_header *header, struct perf_sample_data *data, @@ -6643,6 +6648,8 @@ void perf_output_sample(struct perf_output_handle *handle, * sizeof(struct perf_branch_entry); perf_output_put(handle, data->br_stack->nr); + if (perf_sample_save_hw_index(event)) + perf_output_put(handle, data->br_stack->hw_idx); perf_output_copy(handle, data->br_stack->entries, size); } else { /* @@ -6836,6 +6843,9 @@ void perf_prepare_sample(struct perf_event_header *header, if (sample_type & PERF_SAMPLE_BRANCH_STACK) { int size = sizeof(u64); /* nr */ if (data->br_stack) { + if (perf_sample_save_hw_index(event)) + size += sizeof(u64); + size += data->br_stack->nr * sizeof(struct perf_branch_entry); } -- cgit v1.2.3-71-gd317 From 34896620422eefe1bad998ed4b48117426ad81f9 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 11 Feb 2020 23:52:35 +0100 Subject: PM: QoS: Drop debugfs interface After commit c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class in use is PM_QOS_CPU_DMA_LATENCY and the existing PM QoS debugfs interface has become overly complicated (as it takes other potentially possible PM QoS classes that are not there any more into account). It is also not particularly useful (the "type" of the PM_QOS_CPU_DMA_LATENCY is known, its aggregate value can be read from /dev/cpu_dma_latency and the number of requests in the queue does not really matter) and there are no known users depending on it. Moreover, there are dedicated trace events that can be used for tracking PM QoS usage with much higher precision. For these reasons, drop the PM QoS debugfs interface altogether. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson --- kernel/power/qos.c | 73 ++---------------------------------------------------- 1 file changed, 2 insertions(+), 71 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 83edf8698118..d932fa42e8e4 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -137,69 +137,6 @@ static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) c->target_value = value; } -static int pm_qos_debug_show(struct seq_file *s, void *unused) -{ - struct pm_qos_object *qos = (struct pm_qos_object *)s->private; - struct pm_qos_constraints *c; - struct pm_qos_request *req; - char *type; - unsigned long flags; - int tot_reqs = 0; - int active_reqs = 0; - - if (IS_ERR_OR_NULL(qos)) { - pr_err("%s: bad qos param!\n", __func__); - return -EINVAL; - } - c = qos->constraints; - if (IS_ERR_OR_NULL(c)) { - pr_err("%s: Bad constraints on qos?\n", __func__); - return -EINVAL; - } - - /* Lock to ensure we have a snapshot */ - spin_lock_irqsave(&pm_qos_lock, flags); - if (plist_head_empty(&c->list)) { - seq_puts(s, "Empty!\n"); - goto out; - } - - switch (c->type) { - case PM_QOS_MIN: - type = "Minimum"; - break; - case PM_QOS_MAX: - type = "Maximum"; - break; - case PM_QOS_SUM: - type = "Sum"; - break; - default: - type = "Unknown"; - } - - plist_for_each_entry(req, &c->list, node) { - char *state = "Default"; - - if ((req->node).prio != c->default_value) { - active_reqs++; - state = "Active"; - } - tot_reqs++; - seq_printf(s, "%d: %d: %s\n", tot_reqs, - (req->node).prio, state); - } - - seq_printf(s, "Type=%s, Value=%d, Requests: active=%d / total=%d\n", - type, pm_qos_get_value(c), active_reqs, tot_reqs); - -out: - spin_unlock_irqrestore(&pm_qos_lock, flags); - return 0; -} - -DEFINE_SHOW_ATTRIBUTE(pm_qos_debug); - /** * pm_qos_update_target - manages the constraints list and calls the notifiers * if needed @@ -529,15 +466,12 @@ int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier) EXPORT_SYMBOL_GPL(pm_qos_remove_notifier); /* User space interface to PM QoS classes via misc devices */ -static int register_pm_qos_misc(struct pm_qos_object *qos, struct dentry *d) +static int register_pm_qos_misc(struct pm_qos_object *qos) { qos->pm_qos_power_miscdev.minor = MISC_DYNAMIC_MINOR; qos->pm_qos_power_miscdev.name = qos->name; qos->pm_qos_power_miscdev.fops = &pm_qos_power_fops; - debugfs_create_file(qos->name, S_IRUGO, d, (void *)qos, - &pm_qos_debug_fops); - return misc_register(&qos->pm_qos_power_miscdev); } @@ -631,14 +565,11 @@ static int __init pm_qos_power_init(void) { int ret = 0; int i; - struct dentry *d; BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); - d = debugfs_create_dir("pm_qos", NULL); - for (i = PM_QOS_CPU_DMA_LATENCY; i < PM_QOS_NUM_CLASSES; i++) { - ret = register_pm_qos_misc(pm_qos_array[i], d); + ret = register_pm_qos_misc(pm_qos_array[i]); if (ret < 0) { pr_err("%s: %s setup failed\n", __func__, pm_qos_array[i]->name); -- cgit v1.2.3-71-gd317 From f43caa2adc96fc9c95fd77eef63cdff86ebf33cb Mon Sep 17 00:00:00 2001 From: Michal Koutný Date: Fri, 24 Jan 2020 12:40:16 +0100 Subject: cgroup: Clean up css_set task traversal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit css_task_iter stores pointer to head of each iterable list, this dates back to commit 0f0a2b4fa621 ("cgroup: reorganize css_task_iter") when we did not store cur_cset. Let us utilize list heads directly in cur_cset and streamline css_task_iter_advance_css_set a bit. This is no intentional function change. Signed-off-by: Michal Koutný Signed-off-by: Tejun Heo --- include/linux/cgroup.h | 3 --- kernel/cgroup/cgroup.c | 61 +++++++++++++++++++++++--------------------------- 2 files changed, 28 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index e75d2191226b..f1219b927817 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -58,9 +58,6 @@ struct css_task_iter { struct list_head *tcset_head; struct list_head *task_pos; - struct list_head *tasks_head; - struct list_head *mg_tasks_head; - struct list_head *dying_tasks_head; struct list_head *cur_tasks_head; struct css_set *cur_cset; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index c719a4154d6d..b4c4c4fbd6de 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -4391,29 +4391,24 @@ static void css_task_iter_advance_css_set(struct css_task_iter *it) lockdep_assert_held(&css_set_lock); - /* Advance to the next non-empty css_set */ - do { - cset = css_task_iter_next_css_set(it); - if (!cset) { - it->task_pos = NULL; - return; + /* Advance to the next non-empty css_set and find first non-empty tasks list*/ + while ((cset = css_task_iter_next_css_set(it))) { + if (!list_empty(&cset->tasks)) { + it->cur_tasks_head = &cset->tasks; + break; + } else if (!list_empty(&cset->mg_tasks)) { + it->cur_tasks_head = &cset->mg_tasks; + break; + } else if (!list_empty(&cset->dying_tasks)) { + it->cur_tasks_head = &cset->dying_tasks; + break; } - } while (!css_set_populated(cset) && list_empty(&cset->dying_tasks)); - - if (!list_empty(&cset->tasks)) { - it->task_pos = cset->tasks.next; - it->cur_tasks_head = &cset->tasks; - } else if (!list_empty(&cset->mg_tasks)) { - it->task_pos = cset->mg_tasks.next; - it->cur_tasks_head = &cset->mg_tasks; - } else { - it->task_pos = cset->dying_tasks.next; - it->cur_tasks_head = &cset->dying_tasks; } - - it->tasks_head = &cset->tasks; - it->mg_tasks_head = &cset->mg_tasks; - it->dying_tasks_head = &cset->dying_tasks; + if (!cset) { + it->task_pos = NULL; + return; + } + it->task_pos = it->cur_tasks_head->next; /* * We don't keep css_sets locked across iteration steps and thus @@ -4458,24 +4453,24 @@ static void css_task_iter_advance(struct css_task_iter *it) repeat: if (it->task_pos) { /* - * Advance iterator to find next entry. cset->tasks is - * consumed first and then ->mg_tasks. After ->mg_tasks, - * we move onto the next cset. + * Advance iterator to find next entry. We go through cset + * tasks, mg_tasks and dying_tasks, when consumed we move onto + * the next cset. */ if (it->flags & CSS_TASK_ITER_SKIPPED) it->flags &= ~CSS_TASK_ITER_SKIPPED; else it->task_pos = it->task_pos->next; - if (it->task_pos == it->tasks_head) { - it->task_pos = it->mg_tasks_head->next; - it->cur_tasks_head = it->mg_tasks_head; + if (it->task_pos == &it->cur_cset->tasks) { + it->cur_tasks_head = &it->cur_cset->mg_tasks; + it->task_pos = it->cur_tasks_head->next; } - if (it->task_pos == it->mg_tasks_head) { - it->task_pos = it->dying_tasks_head->next; - it->cur_tasks_head = it->dying_tasks_head; + if (it->task_pos == &it->cur_cset->mg_tasks) { + it->cur_tasks_head = &it->cur_cset->dying_tasks; + it->task_pos = it->cur_tasks_head->next; } - if (it->task_pos == it->dying_tasks_head) + if (it->task_pos == &it->cur_cset->dying_tasks) css_task_iter_advance_css_set(it); } else { /* called from start, proceed to the first cset */ @@ -4493,12 +4488,12 @@ repeat: goto repeat; /* and dying leaders w/o live member threads */ - if (it->cur_tasks_head == it->dying_tasks_head && + if (it->cur_tasks_head == &it->cur_cset->dying_tasks && !atomic_read(&task->signal->live)) goto repeat; } else { /* skip all dying ones */ - if (it->cur_tasks_head == it->dying_tasks_head) + if (it->cur_tasks_head == &it->cur_cset->dying_tasks) goto repeat; } } -- cgit v1.2.3-71-gd317 From 3010c5b9f5f476b35b24955f08a3a6c06ec8e878 Mon Sep 17 00:00:00 2001 From: Madhuparna Bhowmik Date: Sat, 18 Jan 2020 08:40:51 +0530 Subject: cgroup.c: Use built-in RCU list checking MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit list_for_each_entry_rcu has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled by default. Even though the function css_next_child() already checks if cgroup_mutex or rcu_read_lock() is held using cgroup_assert_mutex_or_rcu_locked(), there is a need to pass cond to list_for_each_entry_rcu() to avoid false positive lockdep warning. Signed-off-by: Madhuparna Bhowmik Acked-by: Michal Koutný Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index b4c4c4fbd6de..7a310db6c807 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -4148,7 +4148,8 @@ struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos, } else if (likely(!(pos->flags & CSS_RELEASED))) { next = list_entry_rcu(pos->sibling.next, struct cgroup_subsys_state, sibling); } else { - list_for_each_entry_rcu(next, &parent->children, sibling) + list_for_each_entry_rcu(next, &parent->children, sibling, + lockdep_is_held(&cgroup_mutex)) if (next->serial_nr > pos->serial_nr) break; } -- cgit v1.2.3-71-gd317 From a49e4629b5edf1db856de05fbf1aae05502ef1af Mon Sep 17 00:00:00 2001 From: Prateek Sood Date: Fri, 24 Jan 2020 20:37:29 +0530 Subject: cpuset: Make cpuset hotplug synchronous Convert cpuset_hotplug_workfn() into synchronous call for cpu hotplug path. For memory hotplug path it still gets queued as a work item. Since cpuset_hotplug_workfn() can be made synchronous for cpu hotplug path, it is not required to wait for cpuset hotplug while thawing processes. Signed-off-by: Prateek Sood Signed-off-by: Tejun Heo --- include/linux/cpuset.h | 3 --- kernel/cgroup/cpuset.c | 31 +++++++++++++++++++------------ kernel/power/process.c | 2 -- 3 files changed, 19 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index 04c20de66afc..cede4cb98b78 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -54,7 +54,6 @@ extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); extern void cpuset_update_active_cpus(void); -extern void cpuset_wait_for_hotplug(void); extern void cpuset_read_lock(void); extern void cpuset_read_unlock(void); extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask); @@ -176,8 +175,6 @@ static inline void cpuset_update_active_cpus(void) partition_sched_domains(1, NULL, NULL); } -static inline void cpuset_wait_for_hotplug(void) { } - static inline void cpuset_read_lock(void) { } static inline void cpuset_read_unlock(void) { } diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 58f5073acff7..cafd4d2ff882 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3101,7 +3101,7 @@ update_tasks: } /** - * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset + * cpuset_hotplug - handle CPU/memory hotunplug for a cpuset * * This function is called after either CPU or memory configuration has * changed and updates cpuset accordingly. The top_cpuset is always @@ -3116,7 +3116,7 @@ update_tasks: * Note that CPU offlining during suspend is ignored. We don't modify * cpusets across suspend/resume cycles at all. */ -static void cpuset_hotplug_workfn(struct work_struct *work) +static void cpuset_hotplug(bool use_cpu_hp_lock) { static cpumask_t new_cpus; static nodemask_t new_mems; @@ -3201,25 +3201,32 @@ static void cpuset_hotplug_workfn(struct work_struct *work) /* rebuild sched domains if cpus_allowed has changed */ if (cpus_updated || force_rebuild) { force_rebuild = false; - rebuild_sched_domains(); + if (use_cpu_hp_lock) + rebuild_sched_domains(); + else { + /* Acquiring cpu_hotplug_lock is not required. + * When cpuset_hotplug() is called in hotplug path, + * cpu_hotplug_lock is held by the hotplug context + * which is waiting for cpuhp_thread_fun to indicate + * completion of callback. + */ + percpu_down_write(&cpuset_rwsem); + rebuild_sched_domains_locked(); + percpu_up_write(&cpuset_rwsem); + } } free_cpumasks(NULL, ptmp); } -void cpuset_update_active_cpus(void) +static void cpuset_hotplug_workfn(struct work_struct *work) { - /* - * We're inside cpu hotplug critical region which usually nests - * inside cgroup synchronization. Bounce actual hotplug processing - * to a work item to avoid reverse locking order. - */ - schedule_work(&cpuset_hotplug_work); + cpuset_hotplug(true); } -void cpuset_wait_for_hotplug(void) +void cpuset_update_active_cpus(void) { - flush_work(&cpuset_hotplug_work); + cpuset_hotplug(false); } /* diff --git a/kernel/power/process.c b/kernel/power/process.c index 4b6a54da7e65..08f7019357ee 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -204,8 +204,6 @@ void thaw_processes(void) __usermodehelper_set_disable_depth(UMH_FREEZING); thaw_workqueues(); - cpuset_wait_for_hotplug(); - read_lock(&tasklist_lock); for_each_process_thread(g, p) { /* No other threads should have PF_SUSPEND_TASK set */ -- cgit v1.2.3-71-gd317 From 6df970e4f5d2c273554550d40d8b92cea9bec1a0 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 5 Feb 2020 14:26:18 +0100 Subject: cgroup: unify attach permission checking MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The core codepaths to check whether a process can be attached to a cgroup are the same for threads and thread-group leaders. Only a small piece of code verifying that source and destination cgroup are in the same domain differentiates the thread permission checking from thread-group leader permission checking. Since cgroup_migrate_vet_dst() only matters cgroup2 - it is a noop on cgroup1 - we can move it out of cgroup_attach_task(). All checks can now be consolidated into a new helper cgroup_attach_permissions() callable from both cgroup_procs_write() and cgroup_threads_write(). Cc: Tejun Heo Cc: Li Zefan Cc: Johannes Weiner Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný Signed-off-by: Christian Brauner Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 39 +++++++++++++++++++++++++-------------- 1 file changed, 25 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 7a310db6c807..9ca51bf3769a 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -2714,11 +2714,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader, { DEFINE_CGROUP_MGCTX(mgctx); struct task_struct *task; - int ret; - - ret = cgroup_migrate_vet_dst(dst_cgrp); - if (ret) - return ret; + int ret = 0; /* look up all src csets */ spin_lock_irq(&css_set_lock); @@ -4695,6 +4691,26 @@ static int cgroup_procs_write_permission(struct cgroup *src_cgrp, return 0; } +static int cgroup_attach_permissions(struct cgroup *src_cgrp, + struct cgroup *dst_cgrp, + struct super_block *sb, bool threadgroup) +{ + int ret = 0; + + ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb); + if (ret) + return ret; + + ret = cgroup_migrate_vet_dst(dst_cgrp); + if (ret) + return ret; + + if (!threadgroup && (src_cgrp->dom_cgrp != dst_cgrp->dom_cgrp)) + ret = -EOPNOTSUPP; + + return ret; +} + static ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { @@ -4717,8 +4733,8 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file *of, src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root); spin_unlock_irq(&css_set_lock); - ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, - of->file->f_path.dentry->d_sb); + ret = cgroup_attach_permissions(src_cgrp, dst_cgrp, + of->file->f_path.dentry->d_sb, true); if (ret) goto out_finish; @@ -4762,16 +4778,11 @@ static ssize_t cgroup_threads_write(struct kernfs_open_file *of, spin_unlock_irq(&css_set_lock); /* thread migrations follow the cgroup.procs delegation rule */ - ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, - of->file->f_path.dentry->d_sb); + ret = cgroup_attach_permissions(src_cgrp, dst_cgrp, + of->file->f_path.dentry->d_sb, false); if (ret) goto out_finish; - /* and must be contained in the same domain */ - ret = -EOPNOTSUPP; - if (src_cgrp->dom_cgrp != dst_cgrp->dom_cgrp) - goto out_finish; - ret = cgroup_attach_task(dst_cgrp, task, false); out_finish: -- cgit v1.2.3-71-gd317 From 17703097f3456498e6424614571648c6452f4d34 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 5 Feb 2020 14:26:19 +0100 Subject: cgroup: add cgroup_get_from_file() helper MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a helper cgroup_get_from_file(). The helper will be used in subsequent patches to retrieve a cgroup while holding a reference to the struct file it was taken from. Cc: Tejun Heo Cc: Johannes Weiner Cc: Li Zefan Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný Signed-off-by: Christian Brauner Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 30 +++++++++++++++++++----------- 1 file changed, 19 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 9ca51bf3769a..16fe1c6cad35 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5880,6 +5880,24 @@ void cgroup_fork(struct task_struct *child) INIT_LIST_HEAD(&child->cg_list); } +static struct cgroup *cgroup_get_from_file(struct file *f) +{ + struct cgroup_subsys_state *css; + struct cgroup *cgrp; + + css = css_tryget_online_from_dir(f->f_path.dentry, NULL); + if (IS_ERR(css)) + return ERR_CAST(css); + + cgrp = css->cgroup; + if (!cgroup_on_dfl(cgrp)) { + cgroup_put(cgrp); + return ERR_PTR(-EBADF); + } + + return cgrp; +} + /** * cgroup_can_fork - called on a new task before the process is exposed * @child: the task in question. @@ -6171,7 +6189,6 @@ EXPORT_SYMBOL_GPL(cgroup_get_from_path); */ struct cgroup *cgroup_get_from_fd(int fd) { - struct cgroup_subsys_state *css; struct cgroup *cgrp; struct file *f; @@ -6179,17 +6196,8 @@ struct cgroup *cgroup_get_from_fd(int fd) if (!f) return ERR_PTR(-EBADF); - css = css_tryget_online_from_dir(f->f_path.dentry, NULL); + cgrp = cgroup_get_from_file(f); fput(f); - if (IS_ERR(css)) - return ERR_CAST(css); - - cgrp = css->cgroup; - if (!cgroup_on_dfl(cgrp)) { - cgroup_put(cgrp); - return ERR_PTR(-EBADF); - } - return cgrp; } EXPORT_SYMBOL_GPL(cgroup_get_from_fd); -- cgit v1.2.3-71-gd317 From 5a5cf5cb30d7815c01035fde4b84edef85d11c68 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 5 Feb 2020 14:26:20 +0100 Subject: cgroup: refactor fork helpers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This refactors the fork helpers so they can be easily modified in the next patches. The patch just moves the cgroup threadgroup rwsem grab and release into the helpers. They don't need to be directly exposed in fork.c. Cc: Tejun Heo Cc: Johannes Weiner Cc: Li Zefan Cc: cgroups@vger.kernel.org Acked-by: Michal Koutný Signed-off-by: Christian Brauner Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 44 ++++++++++++++++++++++++++------------------ kernel/fork.c | 6 +----- 2 files changed, 27 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 16fe1c6cad35..502769b2683c 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5900,17 +5900,20 @@ static struct cgroup *cgroup_get_from_file(struct file *f) /** * cgroup_can_fork - called on a new task before the process is exposed - * @child: the task in question. + * @child: the child process * - * This calls the subsystem can_fork() callbacks. If the can_fork() callback - * returns an error, the fork aborts with that error code. This allows for - * a cgroup subsystem to conditionally allow or deny new forks. + * This calls the subsystem can_fork() callbacks. If the cgroup_can_fork() + * callback returns an error, the fork aborts with that error code. This + * allows for a cgroup subsystem to conditionally allow or deny new forks. */ int cgroup_can_fork(struct task_struct *child) + __acquires(&cgroup_threadgroup_rwsem) __releases(&cgroup_threadgroup_rwsem) { struct cgroup_subsys *ss; int i, j, ret; + cgroup_threadgroup_change_begin(current); + do_each_subsys_mask(ss, i, have_canfork_callback) { ret = ss->can_fork(child); if (ret) @@ -5927,17 +5930,20 @@ out_revert: ss->cancel_fork(child); } + cgroup_threadgroup_change_end(current); + return ret; } /** - * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork() - * @child: the task in question - * - * This calls the cancel_fork() callbacks if a fork failed *after* - * cgroup_can_fork() succeded. - */ + * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork() + * @child: the child process + * + * This calls the cancel_fork() callbacks if a fork failed *after* + * cgroup_can_fork() succeded. + */ void cgroup_cancel_fork(struct task_struct *child) + __releases(&cgroup_threadgroup_rwsem) { struct cgroup_subsys *ss; int i; @@ -5945,19 +5951,19 @@ void cgroup_cancel_fork(struct task_struct *child) for_each_subsys(ss, i) if (ss->cancel_fork) ss->cancel_fork(child); + + cgroup_threadgroup_change_end(current); } /** - * cgroup_post_fork - called on a new task after adding it to the task list - * @child: the task in question - * - * Adds the task to the list running through its css_set if necessary and - * call the subsystem fork() callbacks. Has to be after the task is - * visible on the task list in case we race with the first call to - * cgroup_task_iter_start() - to guarantee that the new task ends up on its - * list. + * cgroup_post_fork - finalize cgroup setup for the child process + * @child: the child process + * + * Attach the child process to its css_set calling the subsystem fork() + * callbacks. */ void cgroup_post_fork(struct task_struct *child) + __releases(&cgroup_threadgroup_rwsem) { struct cgroup_subsys *ss; struct css_set *cset; @@ -6003,6 +6009,8 @@ void cgroup_post_fork(struct task_struct *child) do_each_subsys_mask(ss, i, have_fork_callback) { ss->fork(child); } while_each_subsys_mask(); + + cgroup_threadgroup_change_end(current); } /** diff --git a/kernel/fork.c b/kernel/fork.c index 60a1295f4384..9245b6e53f55 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2174,7 +2174,6 @@ static __latent_entropy struct task_struct *copy_process( INIT_LIST_HEAD(&p->thread_group); p->task_works = NULL; - cgroup_threadgroup_change_begin(current); /* * Ensure that the cgroup subsystem policies allow the new process to be * forked. It should be noted the the new process's css_set can be changed @@ -2183,7 +2182,7 @@ static __latent_entropy struct task_struct *copy_process( */ retval = cgroup_can_fork(p); if (retval) - goto bad_fork_cgroup_threadgroup_change_end; + goto bad_fork_put_pidfd; /* * From this point on we must avoid any synchronous user-space @@ -2289,7 +2288,6 @@ static __latent_entropy struct task_struct *copy_process( proc_fork_connector(p); cgroup_post_fork(p); - cgroup_threadgroup_change_end(current); perf_event_fork(p); trace_task_newtask(p, clone_flags); @@ -2301,8 +2299,6 @@ bad_fork_cancel_cgroup: spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); cgroup_cancel_fork(p); -bad_fork_cgroup_threadgroup_change_end: - cgroup_threadgroup_change_end(current); bad_fork_put_pidfd: if (clone_flags & CLONE_PIDFD) { fput(pidfile); -- cgit v1.2.3-71-gd317 From f3553220d4cc458d69f7da6e71a3a6097778bd28 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 5 Feb 2020 14:26:21 +0100 Subject: cgroup: add cgroup_may_write() helper Add a cgroup_may_write() helper which we can use in the CLONE_INTO_CGROUP patch series to verify that we can write to the destination cgroup. Cc: Tejun Heo Cc: Johannes Weiner Cc: Li Zefan Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 502769b2683c..6d8bdddd8c28 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -4654,13 +4654,28 @@ static int cgroup_procs_show(struct seq_file *s, void *v) return 0; } +static int cgroup_may_write(const struct cgroup *cgrp, struct super_block *sb) +{ + int ret; + struct inode *inode; + + lockdep_assert_held(&cgroup_mutex); + + inode = kernfs_get_inode(sb, cgrp->procs_file.kn); + if (!inode) + return -ENOMEM; + + ret = inode_permission(inode, MAY_WRITE); + iput(inode); + return ret; +} + static int cgroup_procs_write_permission(struct cgroup *src_cgrp, struct cgroup *dst_cgrp, struct super_block *sb) { struct cgroup_namespace *ns = current->nsproxy->cgroup_ns; struct cgroup *com_cgrp = src_cgrp; - struct inode *inode; int ret; lockdep_assert_held(&cgroup_mutex); @@ -4670,12 +4685,7 @@ static int cgroup_procs_write_permission(struct cgroup *src_cgrp, com_cgrp = cgroup_parent(com_cgrp); /* %current should be authorized to migrate to the common ancestor */ - inode = kernfs_get_inode(sb, com_cgrp->procs_file.kn); - if (!inode) - return -ENOMEM; - - ret = inode_permission(inode, MAY_WRITE); - iput(inode); + ret = cgroup_may_write(com_cgrp, sb); if (ret) return ret; -- cgit v1.2.3-71-gd317 From ef2c41cf38a7559bbf91af42d5b6a4429db8fc68 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Wed, 5 Feb 2020 14:26:22 +0100 Subject: clone3: allow spawning processes into cgroups This adds support for creating a process in a different cgroup than its parent. Callers can limit and account processes and threads right from the moment they are spawned: - A service manager can directly spawn new services into dedicated cgroups. - A process can be directly created in a frozen cgroup and will be frozen as well. - The initial accounting jitter experienced by process supervisors and daemons is eliminated with this. - Threaded applications or even thread implementations can choose to create a specific cgroup layout where each thread is spawned directly into a dedicated cgroup. This feature is limited to the unified hierarchy. Callers need to pass a directory file descriptor for the target cgroup. The caller can choose to pass an O_PATH file descriptor. All usual migration restrictions apply, i.e. there can be no processes in inner nodes. In general, creating a process directly in a target cgroup adheres to all migration restrictions. One of the biggest advantages of this feature is that CLONE_INTO_GROUP does not need to grab the write side of the cgroup cgroup_threadgroup_rwsem. This global lock makes moving tasks/threads around super expensive. With clone3() this lock is avoided. Cc: Tejun Heo Cc: Ingo Molnar Cc: Oleg Nesterov Cc: Johannes Weiner Cc: Li Zefan Cc: Peter Zijlstra Cc: cgroups@vger.kernel.org Signed-off-by: Christian Brauner Signed-off-by: Tejun Heo --- include/linux/cgroup-defs.h | 5 +- include/linux/cgroup.h | 20 +++-- include/linux/sched/task.h | 4 + include/uapi/linux/sched.h | 5 ++ kernel/cgroup/cgroup.c | 191 ++++++++++++++++++++++++++++++++++++++------ kernel/cgroup/pids.c | 15 +++- kernel/fork.c | 13 ++- 7 files changed, 214 insertions(+), 39 deletions(-) (limited to 'kernel') diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 63097cb243cb..68c391f451d1 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -628,8 +628,9 @@ struct cgroup_subsys { void (*cancel_attach)(struct cgroup_taskset *tset); void (*attach)(struct cgroup_taskset *tset); void (*post_attach)(void); - int (*can_fork)(struct task_struct *task); - void (*cancel_fork)(struct task_struct *task); + int (*can_fork)(struct task_struct *task, + struct css_set *cset); + void (*cancel_fork)(struct task_struct *task, struct css_set *cset); void (*fork)(struct task_struct *task); void (*exit)(struct task_struct *task); void (*release)(struct task_struct *task); diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index f1219b927817..4598e4da6b1b 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -27,6 +27,8 @@ #include +struct kernel_clone_args; + #ifdef CONFIG_CGROUPS /* @@ -119,9 +121,12 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); void cgroup_fork(struct task_struct *p); -extern int cgroup_can_fork(struct task_struct *p); -extern void cgroup_cancel_fork(struct task_struct *p); -extern void cgroup_post_fork(struct task_struct *p); +extern int cgroup_can_fork(struct task_struct *p, + struct kernel_clone_args *kargs); +extern void cgroup_cancel_fork(struct task_struct *p, + struct kernel_clone_args *kargs); +extern void cgroup_post_fork(struct task_struct *p, + struct kernel_clone_args *kargs); void cgroup_exit(struct task_struct *p); void cgroup_release(struct task_struct *p); void cgroup_free(struct task_struct *p); @@ -705,9 +710,12 @@ static inline int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry) { return -EINVAL; } static inline void cgroup_fork(struct task_struct *p) {} -static inline int cgroup_can_fork(struct task_struct *p) { return 0; } -static inline void cgroup_cancel_fork(struct task_struct *p) {} -static inline void cgroup_post_fork(struct task_struct *p) {} +static inline int cgroup_can_fork(struct task_struct *p, + struct kernel_clone_args *kargs) { return 0; } +static inline void cgroup_cancel_fork(struct task_struct *p, + struct kernel_clone_args *kargs) {} +static inline void cgroup_post_fork(struct task_struct *p, + struct kernel_clone_args *kargs) {} static inline void cgroup_exit(struct task_struct *p) {} static inline void cgroup_release(struct task_struct *p) {} static inline void cgroup_free(struct task_struct *p) {} diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index f1879884238e..38359071236a 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -13,6 +13,7 @@ struct task_struct; struct rusage; union thread_union; +struct css_set; /* All the bits taken by the old clone syscall. */ #define CLONE_LEGACY_FLAGS 0xffffffffULL @@ -29,6 +30,9 @@ struct kernel_clone_args { pid_t *set_tid; /* Number of elements in *set_tid */ size_t set_tid_size; + int cgroup; + struct cgroup *cgrp; + struct css_set *cset; }; /* diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 2e3bc22c6f20..3bac0a8ceab2 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -35,6 +35,7 @@ /* Flags for the clone3() syscall. */ #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */ +#define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */ /* * cloning flags intersect with CSIGNAL so can be used with unshare and clone3 @@ -81,6 +82,8 @@ * @set_tid_size: This defines the size of the array referenced * in @set_tid. This cannot be larger than the * kernel's limit of nested PID namespaces. + * @cgroup: If CLONE_INTO_CGROUP is specified set this to + * a file descriptor for the cgroup. * * The structure is versioned by size and thus extensible. * New struct members must go at the end of the struct and @@ -97,11 +100,13 @@ struct clone_args { __aligned_u64 tls; __aligned_u64 set_tid; __aligned_u64 set_tid_size; + __aligned_u64 cgroup; }; #endif #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */ #define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */ +#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */ /* * Scheduling policies diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 6d8bdddd8c28..9a8a5ded3c48 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5881,8 +5881,7 @@ out: * @child: pointer to task_struct of forking parent process. * * A task is associated with the init_css_set until cgroup_post_fork() - * attaches it to the parent's css_set. Empty cg_list indicates that - * @child isn't holding reference to its css_set. + * attaches it to the target css_set. */ void cgroup_fork(struct task_struct *child) { @@ -5908,24 +5907,154 @@ static struct cgroup *cgroup_get_from_file(struct file *f) return cgrp; } +/** + * cgroup_css_set_fork - find or create a css_set for a child process + * @kargs: the arguments passed to create the child process + * + * This functions finds or creates a new css_set which the child + * process will be attached to in cgroup_post_fork(). By default, + * the child process will be given the same css_set as its parent. + * + * If CLONE_INTO_CGROUP is specified this function will try to find an + * existing css_set which includes the requested cgroup and if not create + * a new css_set that the child will be attached to later. If this function + * succeeds it will hold cgroup_threadgroup_rwsem on return. If + * CLONE_INTO_CGROUP is requested this function will grab cgroup mutex + * before grabbing cgroup_threadgroup_rwsem and will hold a reference + * to the target cgroup. + */ +static int cgroup_css_set_fork(struct kernel_clone_args *kargs) + __acquires(&cgroup_mutex) __acquires(&cgroup_threadgroup_rwsem) +{ + int ret; + struct cgroup *dst_cgrp = NULL; + struct css_set *cset; + struct super_block *sb; + struct file *f; + + if (kargs->flags & CLONE_INTO_CGROUP) + mutex_lock(&cgroup_mutex); + + cgroup_threadgroup_change_begin(current); + + spin_lock_irq(&css_set_lock); + cset = task_css_set(current); + get_css_set(cset); + spin_unlock_irq(&css_set_lock); + + if (!(kargs->flags & CLONE_INTO_CGROUP)) { + kargs->cset = cset; + return 0; + } + + f = fget_raw(kargs->cgroup); + if (!f) { + ret = -EBADF; + goto err; + } + sb = f->f_path.dentry->d_sb; + + dst_cgrp = cgroup_get_from_file(f); + if (IS_ERR(dst_cgrp)) { + ret = PTR_ERR(dst_cgrp); + dst_cgrp = NULL; + goto err; + } + + if (cgroup_is_dead(dst_cgrp)) { + ret = -ENODEV; + goto err; + } + + /* + * Verify that we the target cgroup is writable for us. This is + * usually done by the vfs layer but since we're not going through + * the vfs layer here we need to do it "manually". + */ + ret = cgroup_may_write(dst_cgrp, sb); + if (ret) + goto err; + + ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb, + !(kargs->flags & CLONE_THREAD)); + if (ret) + goto err; + + kargs->cset = find_css_set(cset, dst_cgrp); + if (!kargs->cset) { + ret = -ENOMEM; + goto err; + } + + put_css_set(cset); + fput(f); + kargs->cgrp = dst_cgrp; + return ret; + +err: + cgroup_threadgroup_change_end(current); + mutex_unlock(&cgroup_mutex); + if (f) + fput(f); + if (dst_cgrp) + cgroup_put(dst_cgrp); + put_css_set(cset); + if (kargs->cset) + put_css_set(kargs->cset); + return ret; +} + +/** + * cgroup_css_set_put_fork - drop references we took during fork + * @kargs: the arguments passed to create the child process + * + * Drop references to the prepared css_set and target cgroup if + * CLONE_INTO_CGROUP was requested. + */ +static void cgroup_css_set_put_fork(struct kernel_clone_args *kargs) + __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex) +{ + cgroup_threadgroup_change_end(current); + + if (kargs->flags & CLONE_INTO_CGROUP) { + struct cgroup *cgrp = kargs->cgrp; + struct css_set *cset = kargs->cset; + + mutex_unlock(&cgroup_mutex); + + if (cset) { + put_css_set(cset); + kargs->cset = NULL; + } + + if (cgrp) { + cgroup_put(cgrp); + kargs->cgrp = NULL; + } + } +} + /** * cgroup_can_fork - called on a new task before the process is exposed * @child: the child process * + * This prepares a new css_set for the child process which the child will + * be attached to in cgroup_post_fork(). * This calls the subsystem can_fork() callbacks. If the cgroup_can_fork() * callback returns an error, the fork aborts with that error code. This * allows for a cgroup subsystem to conditionally allow or deny new forks. */ -int cgroup_can_fork(struct task_struct *child) - __acquires(&cgroup_threadgroup_rwsem) __releases(&cgroup_threadgroup_rwsem) +int cgroup_can_fork(struct task_struct *child, struct kernel_clone_args *kargs) { struct cgroup_subsys *ss; int i, j, ret; - cgroup_threadgroup_change_begin(current); + ret = cgroup_css_set_fork(kargs); + if (ret) + return ret; do_each_subsys_mask(ss, i, have_canfork_callback) { - ret = ss->can_fork(child); + ret = ss->can_fork(child, kargs->cset); if (ret) goto out_revert; } while_each_subsys_mask(); @@ -5937,32 +6066,34 @@ out_revert: if (j >= i) break; if (ss->cancel_fork) - ss->cancel_fork(child); + ss->cancel_fork(child, kargs->cset); } - cgroup_threadgroup_change_end(current); + cgroup_css_set_put_fork(kargs); return ret; } /** - * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork() - * @child: the child process - * - * This calls the cancel_fork() callbacks if a fork failed *after* - * cgroup_can_fork() succeded. - */ -void cgroup_cancel_fork(struct task_struct *child) - __releases(&cgroup_threadgroup_rwsem) + * cgroup_cancel_fork - called if a fork failed after cgroup_can_fork() + * @child: the child process + * @kargs: the arguments passed to create the child process + * + * This calls the cancel_fork() callbacks if a fork failed *after* + * cgroup_can_fork() succeded and cleans up references we took to + * prepare a new css_set for the child process in cgroup_can_fork(). + */ +void cgroup_cancel_fork(struct task_struct *child, + struct kernel_clone_args *kargs) { struct cgroup_subsys *ss; int i; for_each_subsys(ss, i) if (ss->cancel_fork) - ss->cancel_fork(child); + ss->cancel_fork(child, kargs->cset); - cgroup_threadgroup_change_end(current); + cgroup_css_set_put_fork(kargs); } /** @@ -5972,22 +6103,27 @@ void cgroup_cancel_fork(struct task_struct *child) * Attach the child process to its css_set calling the subsystem fork() * callbacks. */ -void cgroup_post_fork(struct task_struct *child) - __releases(&cgroup_threadgroup_rwsem) +void cgroup_post_fork(struct task_struct *child, + struct kernel_clone_args *kargs) + __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex) { struct cgroup_subsys *ss; struct css_set *cset; int i; + cset = kargs->cset; + kargs->cset = NULL; + spin_lock_irq(&css_set_lock); /* init tasks are special, only link regular threads */ if (likely(child->pid)) { WARN_ON_ONCE(!list_empty(&child->cg_list)); - cset = task_css_set(current); /* current is @child's parent */ - get_css_set(cset); cset->nr_tasks++; css_set_move_task(child, NULL, cset, false); + } else { + put_css_set(cset); + cset = NULL; } /* @@ -6020,7 +6156,16 @@ void cgroup_post_fork(struct task_struct *child) ss->fork(child); } while_each_subsys_mask(); - cgroup_threadgroup_change_end(current); + /* Make the new cset the root_cset of the new cgroup namespace. */ + if (kargs->flags & CLONE_NEWCGROUP) { + struct css_set *rcset = child->nsproxy->cgroup_ns->root_cset; + + get_css_set(cset); + child->nsproxy->cgroup_ns->root_cset = cset; + put_css_set(rcset); + } + + cgroup_css_set_put_fork(kargs); } /** diff --git a/kernel/cgroup/pids.c b/kernel/cgroup/pids.c index 138059eb730d..511af87f685e 100644 --- a/kernel/cgroup/pids.c +++ b/kernel/cgroup/pids.c @@ -33,6 +33,7 @@ #include #include #include +#include #define PIDS_MAX (PID_MAX_LIMIT + 1ULL) #define PIDS_MAX_STR "max" @@ -214,13 +215,16 @@ static void pids_cancel_attach(struct cgroup_taskset *tset) * task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies * on cgroup_threadgroup_change_begin() held by the copy_process(). */ -static int pids_can_fork(struct task_struct *task) +static int pids_can_fork(struct task_struct *task, struct css_set *cset) { struct cgroup_subsys_state *css; struct pids_cgroup *pids; int err; - css = task_css_check(current, pids_cgrp_id, true); + if (cset) + css = cset->subsys[pids_cgrp_id]; + else + css = task_css_check(current, pids_cgrp_id, true); pids = css_pids(css); err = pids_try_charge(pids, 1); if (err) { @@ -235,12 +239,15 @@ static int pids_can_fork(struct task_struct *task) return err; } -static void pids_cancel_fork(struct task_struct *task) +static void pids_cancel_fork(struct task_struct *task, struct css_set *cset) { struct cgroup_subsys_state *css; struct pids_cgroup *pids; - css = task_css_check(current, pids_cgrp_id, true); + if (cset) + css = cset->subsys[pids_cgrp_id]; + else + css = task_css_check(current, pids_cgrp_id, true); pids = css_pids(css); pids_uncharge(pids, 1); } diff --git a/kernel/fork.c b/kernel/fork.c index 9245b6e53f55..635d6369dfb9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2180,7 +2180,7 @@ static __latent_entropy struct task_struct *copy_process( * between here and cgroup_post_fork() if an organisation operation is in * progress. */ - retval = cgroup_can_fork(p); + retval = cgroup_can_fork(p, args); if (retval) goto bad_fork_put_pidfd; @@ -2287,7 +2287,7 @@ static __latent_entropy struct task_struct *copy_process( write_unlock_irq(&tasklist_lock); proc_fork_connector(p); - cgroup_post_fork(p); + cgroup_post_fork(p, args); perf_event_fork(p); trace_task_newtask(p, clone_flags); @@ -2298,7 +2298,7 @@ static __latent_entropy struct task_struct *copy_process( bad_fork_cancel_cgroup: spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); - cgroup_cancel_fork(p); + cgroup_cancel_fork(p, args); bad_fork_put_pidfd: if (clone_flags & CLONE_PIDFD) { fput(pidfile); @@ -2627,6 +2627,9 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, !valid_signal(args.exit_signal))) return -EINVAL; + if ((args.flags & CLONE_INTO_CGROUP) && args.cgroup < 0) + return -EINVAL; + *kargs = (struct kernel_clone_args){ .flags = args.flags, .pidfd = u64_to_user_ptr(args.pidfd), @@ -2637,6 +2640,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, .stack_size = args.stack_size, .tls = args.tls, .set_tid_size = args.set_tid_size, + .cgroup = args.cgroup, }; if (args.set_tid && @@ -2680,7 +2684,8 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs) static bool clone3_args_valid(struct kernel_clone_args *kargs) { /* Verify that no unknown flags are passed along. */ - if (kargs->flags & ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND)) + if (kargs->flags & + ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP)) return false; /* -- cgit v1.2.3-71-gd317 From 5a7ea52b6faec6f8877e49645a80349b2d970f0a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 11 Feb 2020 23:58:17 +0100 Subject: PM: QoS: Drop pm_qos_update_request_timeout() The pm_qos_update_request_timeout() function is not called from anywhere, so drop it along with the work member in struct pm_qos_request needed by it. Also drop the useless pm_qos_update_request_timeout trace event that is only triggered by that function (so it never triggers at all) and update the trace events documentation accordingly. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- Documentation/trace/events-power.rst | 2 -- include/linux/pm_qos.h | 4 --- include/trace/events/power.h | 24 ------------------ kernel/power/qos.c | 48 ------------------------------------ 4 files changed, 78 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst index 2ef318962e29..eec7453a168e 100644 --- a/Documentation/trace/events-power.rst +++ b/Documentation/trace/events-power.rst @@ -78,11 +78,9 @@ target/flags update. pm_qos_add_request "pm_qos_class=%s value=%d" pm_qos_update_request "pm_qos_class=%s value=%d" pm_qos_remove_request "pm_qos_class=%s value=%d" - pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld" The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY"). The second parameter is value to be added/updated/removed. -The third parameter is timeout value in usec. :: pm_qos_update_target "action=%s prev_value=%d curr_value=%d" diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index 19eafca5680e..4747bdb6bed2 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -8,7 +8,6 @@ #include #include #include -#include enum { PM_QOS_RESERVED = 0, @@ -43,7 +42,6 @@ enum pm_qos_flags_status { struct pm_qos_request { struct plist_node node; int pm_qos_class; - struct delayed_work work; /* for pm_qos_update_request_timeout */ }; struct pm_qos_flags_request { @@ -149,8 +147,6 @@ void pm_qos_add_request(struct pm_qos_request *req, int pm_qos_class, s32 value); void pm_qos_update_request(struct pm_qos_request *req, s32 new_value); -void pm_qos_update_request_timeout(struct pm_qos_request *req, - s32 new_value, unsigned long timeout_us); void pm_qos_remove_request(struct pm_qos_request *req); int pm_qos_request(int pm_qos_class); diff --git a/include/trace/events/power.h b/include/trace/events/power.h index 7457e238e1b7..ecf39daabf16 100644 --- a/include/trace/events/power.h +++ b/include/trace/events/power.h @@ -404,30 +404,6 @@ DEFINE_EVENT(pm_qos_request, pm_qos_remove_request, TP_ARGS(pm_qos_class, value) ); -TRACE_EVENT(pm_qos_update_request_timeout, - - TP_PROTO(int pm_qos_class, s32 value, unsigned long timeout_us), - - TP_ARGS(pm_qos_class, value, timeout_us), - - TP_STRUCT__entry( - __field( int, pm_qos_class ) - __field( s32, value ) - __field( unsigned long, timeout_us ) - ), - - TP_fast_assign( - __entry->pm_qos_class = pm_qos_class; - __entry->value = value; - __entry->timeout_us = timeout_us; - ), - - TP_printk("pm_qos_class=%s value=%d, timeout_us=%ld", - __print_symbolic(__entry->pm_qos_class, - { PM_QOS_CPU_DMA_LATENCY, "CPU_DMA_LATENCY" }), - __entry->value, __entry->timeout_us) -); - DECLARE_EVENT_CLASS(pm_qos_update, TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), diff --git a/kernel/power/qos.c b/kernel/power/qos.c index d932fa42e8e4..67dab7f330e4 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -295,21 +295,6 @@ static void __pm_qos_update_request(struct pm_qos_request *req, &req->node, PM_QOS_UPDATE_REQ, new_value); } -/** - * pm_qos_work_fn - the timeout handler of pm_qos_update_request_timeout - * @work: work struct for the delayed work (timeout) - * - * This cancels the timeout request by falling back to the default at timeout. - */ -static void pm_qos_work_fn(struct work_struct *work) -{ - struct pm_qos_request *req = container_of(to_delayed_work(work), - struct pm_qos_request, - work); - - __pm_qos_update_request(req, PM_QOS_DEFAULT_VALUE); -} - /** * pm_qos_add_request - inserts new qos request into the list * @req: pointer to a preallocated handle @@ -334,7 +319,6 @@ void pm_qos_add_request(struct pm_qos_request *req, return; } req->pm_qos_class = pm_qos_class; - INIT_DELAYED_WORK(&req->work, pm_qos_work_fn); trace_pm_qos_add_request(pm_qos_class, value); pm_qos_update_target(pm_qos_array[pm_qos_class]->constraints, &req->node, PM_QOS_ADD_REQ, value); @@ -362,40 +346,10 @@ void pm_qos_update_request(struct pm_qos_request *req, return; } - cancel_delayed_work_sync(&req->work); __pm_qos_update_request(req, new_value); } EXPORT_SYMBOL_GPL(pm_qos_update_request); -/** - * pm_qos_update_request_timeout - modifies an existing qos request temporarily. - * @req : handle to list element holding a pm_qos request to use - * @new_value: defines the temporal qos request - * @timeout_us: the effective duration of this qos request in usecs. - * - * After timeout_us, this qos request is cancelled automatically. - */ -void pm_qos_update_request_timeout(struct pm_qos_request *req, s32 new_value, - unsigned long timeout_us) -{ - if (!req) - return; - if (WARN(!pm_qos_request_active(req), - "%s called for unknown object.", __func__)) - return; - - cancel_delayed_work_sync(&req->work); - - trace_pm_qos_update_request_timeout(req->pm_qos_class, - new_value, timeout_us); - if (new_value != req->node.prio) - pm_qos_update_target( - pm_qos_array[req->pm_qos_class]->constraints, - &req->node, PM_QOS_UPDATE_REQ, new_value); - - schedule_delayed_work(&req->work, usecs_to_jiffies(timeout_us)); -} - /** * pm_qos_remove_request - modifies an existing qos request * @req: handle to request list element @@ -415,8 +369,6 @@ void pm_qos_remove_request(struct pm_qos_request *req) return; } - cancel_delayed_work_sync(&req->work); - trace_pm_qos_remove_request(req->pm_qos_class, PM_QOS_DEFAULT_VALUE); pm_qos_update_target(pm_qos_array[req->pm_qos_class]->constraints, &req->node, PM_QOS_REMOVE_REQ, -- cgit v1.2.3-71-gd317 From 87ad7356799622b69478076cd99df7a5024992cb Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 11 Feb 2020 23:58:26 +0100 Subject: PM: QoS: Drop the PM_QOS_SUM QoS type The PM_QOS_SUM QoS type is not used, so drop it along with the code referring to it in pm_qos_get_value() and the related local variables in there. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- include/linux/pm_qos.h | 1 - kernel/power/qos.c | 9 --------- 2 files changed, 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index 4747bdb6bed2..48bfb96a9360 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -53,7 +53,6 @@ enum pm_qos_type { PM_QOS_UNITIALIZED, PM_QOS_MAX, /* return the largest value */ PM_QOS_MIN, /* return the smallest value */ - PM_QOS_SUM /* return the sum */ }; /* diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 67dab7f330e4..a6be7faa1974 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -101,9 +101,6 @@ static const struct file_operations pm_qos_power_fops = { /* unlocked internal variant */ static inline int pm_qos_get_value(struct pm_qos_constraints *c) { - struct plist_node *node; - int total_value = 0; - if (plist_head_empty(&c->list)) return c->no_constraint_value; @@ -114,12 +111,6 @@ static inline int pm_qos_get_value(struct pm_qos_constraints *c) case PM_QOS_MAX: return plist_last(&c->list)->prio; - case PM_QOS_SUM: - plist_for_each(node, &c->list) - total_value += node->prio; - - return total_value; - default: /* runtime check for not using enum */ BUG(); -- cgit v1.2.3-71-gd317 From 7b35370b2ebc2608ba0f2e0ecb996fa7f38e9731 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 11 Feb 2020 23:58:33 +0100 Subject: PM: QoS: Clean up pm_qos_update_target() and pm_qos_update_flags() Clean up the pm_qos_update_target() function: * Update its kerneldoc comment. * Drop the redundant ret local variable from it. * Reorder definitions of local variables in it. * Update a comment in it. Also update the kerneldoc comment of pm_qos_update_flags() (e.g. notifiers are not called by it any more) and add one emtpy line to its body (for more visual clarity). No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- kernel/power/qos.c | 56 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 29 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index a6be7faa1974..6a36809d6160 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -129,24 +129,30 @@ static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) } /** - * pm_qos_update_target - manages the constraints list and calls the notifiers - * if needed - * @c: constraints data struct - * @node: request to add to the list, to update or to remove - * @action: action to take on the constraints list - * @value: value of the request to add or update + * pm_qos_update_target - Update a list of PM QoS constraint requests. + * @c: List of PM QoS requests. + * @node: Target list entry. + * @action: Action to carry out (add, update or remove). + * @value: New request value for the target list entry. * - * This function returns 1 if the aggregated constraint value has changed, 0 - * otherwise. + * Update the given list of PM QoS constraint requests, @c, by carrying an + * @action involving the @node list entry and @value on it. + * + * The recognized values of @action are PM_QOS_ADD_REQ (store @value in @node + * and add it to the list), PM_QOS_UPDATE_REQ (remove @node from the list, store + * @value in it and add it to the list again), and PM_QOS_REMOVE_REQ (remove + * @node from the list, ignore @value). + * + * Return: 1 if the aggregate constraint value has changed, 0 otherwise. */ int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, enum pm_qos_req_action action, int value) { - unsigned long flags; int prev_value, curr_value, new_value; - int ret; + unsigned long flags; spin_lock_irqsave(&pm_qos_lock, flags); + prev_value = pm_qos_get_value(c); if (value == PM_QOS_DEFAULT_VALUE) new_value = c->default_value; @@ -159,9 +165,8 @@ int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, break; case PM_QOS_UPDATE_REQ: /* - * to change the list, we atomically remove, reinit - * with new value and add, then see if the extremal - * changed + * To change the list, atomically remove, reinit with new value + * and add, then see if the aggregate has changed. */ plist_del(node, &c->list); /* fall through */ @@ -180,16 +185,14 @@ int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, spin_unlock_irqrestore(&pm_qos_lock, flags); trace_pm_qos_update_target(action, prev_value, curr_value); - if (prev_value != curr_value) { - ret = 1; - if (c->notifiers) - blocking_notifier_call_chain(c->notifiers, - (unsigned long)curr_value, - NULL); - } else { - ret = 0; - } - return ret; + + if (prev_value == curr_value) + return 0; + + if (c->notifiers) + blocking_notifier_call_chain(c->notifiers, curr_value, NULL); + + return 1; } /** @@ -211,14 +214,12 @@ static void pm_qos_flags_remove_req(struct pm_qos_flags *pqf, /** * pm_qos_update_flags - Update a set of PM QoS flags. - * @pqf: Set of flags to update. + * @pqf: Set of PM QoS flags to update. * @req: Request to add to the set, to modify, or to remove from the set. * @action: Action to take on the set. * @val: Value of the request to add or modify. * - * Update the given set of PM QoS flags and call notifiers if the aggregate - * value has changed. Returns 1 if the aggregate constraint value has changed, - * 0 otherwise. + * Return: 1 if the aggregate constraint value has changed, 0 otherwise. */ bool pm_qos_update_flags(struct pm_qos_flags *pqf, struct pm_qos_flags_request *req, @@ -254,6 +255,7 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, spin_unlock_irqrestore(&pm_qos_lock, irqflags); trace_pm_qos_update_flags(action, prev_value, curr_value); + return prev_value != curr_value; } -- cgit v1.2.3-71-gd317 From dcd70ca1a3bf33aa5cf53aa6f72b8e51afb1ac48 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 11 Feb 2020 23:58:39 +0100 Subject: PM: QoS: Clean up pm_qos_read_value() and pm_qos_get/set_value() Move the definition of pm_qos_read_value() before the one of pm_qos_get_value() and add a kerneldoc comment to it (as it is not static). Also replace the BUG() in pm_qos_get_value() with WARN() (to prevent the kernel from crashing if an unknown PM QoS type is used by mistake) and drop the comment next to it that is not necessary any more. Additionally, drop the unnecessary inline modifier from the header of pm_qos_set_value(). Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- kernel/power/qos.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 6a36809d6160..f09eca5ffe07 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -98,8 +98,16 @@ static const struct file_operations pm_qos_power_fops = { .llseek = noop_llseek, }; -/* unlocked internal variant */ -static inline int pm_qos_get_value(struct pm_qos_constraints *c) +/** + * pm_qos_read_value - Return the current effective constraint value. + * @c: List of PM QoS constraint requests. + */ +s32 pm_qos_read_value(struct pm_qos_constraints *c) +{ + return c->target_value; +} + +static int pm_qos_get_value(struct pm_qos_constraints *c) { if (plist_head_empty(&c->list)) return c->no_constraint_value; @@ -112,18 +120,12 @@ static inline int pm_qos_get_value(struct pm_qos_constraints *c) return plist_last(&c->list)->prio; default: - /* runtime check for not using enum */ - BUG(); + WARN(1, "Unknown PM QoS type in %s\n", __func__); return PM_QOS_DEFAULT_VALUE; } } -s32 pm_qos_read_value(struct pm_qos_constraints *c) -{ - return c->target_value; -} - -static inline void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) +static void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) { c->target_value = value; } -- cgit v1.2.3-71-gd317 From 63cffc05348ea1fe65aa99e80ed330e33bf54220 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 11 Feb 2020 23:59:22 +0100 Subject: PM: QoS: Drop iterations over global QoS classes After commit c3082a674f46 ("PM: QoS: Get rid of unused flags") the only global PM QoS class in use is PM_QOS_CPU_DMA_LATENCY, so it does not really make sense to iterate over global QoS classes anywhere, since there is only one. Remove iterations over global QoS classes from the code and use PM_QOS_CPU_DMA_LATENCY as the target class directly where needed. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- kernel/power/qos.c | 52 ++++++++++++++-------------------------------------- 1 file changed, 14 insertions(+), 38 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index f09eca5ffe07..57ff542a4f9d 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -412,7 +412,8 @@ int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier) } EXPORT_SYMBOL_GPL(pm_qos_remove_notifier); -/* User space interface to PM QoS classes via misc devices */ +/* User space interface to global PM QoS via misc device. */ + static int register_pm_qos_misc(struct pm_qos_object *qos) { qos->pm_qos_power_miscdev.minor = MISC_DYNAMIC_MINOR; @@ -422,35 +423,18 @@ static int register_pm_qos_misc(struct pm_qos_object *qos) return misc_register(&qos->pm_qos_power_miscdev); } -static int find_pm_qos_object_by_minor(int minor) -{ - int pm_qos_class; - - for (pm_qos_class = PM_QOS_CPU_DMA_LATENCY; - pm_qos_class < PM_QOS_NUM_CLASSES; pm_qos_class++) { - if (minor == - pm_qos_array[pm_qos_class]->pm_qos_power_miscdev.minor) - return pm_qos_class; - } - return -1; -} - static int pm_qos_power_open(struct inode *inode, struct file *filp) { - long pm_qos_class; + struct pm_qos_request *req; - pm_qos_class = find_pm_qos_object_by_minor(iminor(inode)); - if (pm_qos_class >= PM_QOS_CPU_DMA_LATENCY) { - struct pm_qos_request *req = kzalloc(sizeof(*req), GFP_KERNEL); - if (!req) - return -ENOMEM; + req = kzalloc(sizeof(*req), GFP_KERNEL); + if (!req) + return -ENOMEM; - pm_qos_add_request(req, pm_qos_class, PM_QOS_DEFAULT_VALUE); - filp->private_data = req; + pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE); + filp->private_data = req; - return 0; - } - return -EPERM; + return 0; } static int pm_qos_power_release(struct inode *inode, struct file *filp) @@ -464,7 +448,6 @@ static int pm_qos_power_release(struct inode *inode, struct file *filp) return 0; } - static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, size_t count, loff_t *f_pos) { @@ -507,26 +490,19 @@ static ssize_t pm_qos_power_write(struct file *filp, const char __user *buf, return count; } - static int __init pm_qos_power_init(void) { - int ret = 0; - int i; + int ret; BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); - for (i = PM_QOS_CPU_DMA_LATENCY; i < PM_QOS_NUM_CLASSES; i++) { - ret = register_pm_qos_misc(pm_qos_array[i]); - if (ret < 0) { - pr_err("%s: %s setup failed\n", - __func__, pm_qos_array[i]->name); - return ret; - } - } + ret = register_pm_qos_misc(pm_qos_array[PM_QOS_CPU_DMA_LATENCY]); + if (ret < 0) + pr_err("%s: %s setup failed\n", __func__, + pm_qos_array[PM_QOS_CPU_DMA_LATENCY]->name); return ret; } - late_initcall(pm_qos_power_init); /* Definitions related to the frequency QoS below. */ -- cgit v1.2.3-71-gd317 From 299a229830a29a2d691ac17b27964c7d820f16a6 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:00:12 +0100 Subject: PM: QoS: Clean up misc device file operations Reorder the code to avoid using extra function header declarations for the pm_qos_power_*() family of functions and drop those declarations. Also clean up the internals of those functions to consolidate checks, avoid using redundant local variables and similar. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- kernel/power/qos.c | 62 +++++++++++++++++++++++------------------------------- 1 file changed, 26 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 57ff542a4f9d..9f67584d4466 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -83,21 +83,6 @@ static struct pm_qos_object *pm_qos_array[] = { &cpu_dma_pm_qos, }; -static ssize_t pm_qos_power_write(struct file *filp, const char __user *buf, - size_t count, loff_t *f_pos); -static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, - size_t count, loff_t *f_pos); -static int pm_qos_power_open(struct inode *inode, struct file *filp); -static int pm_qos_power_release(struct inode *inode, struct file *filp); - -static const struct file_operations pm_qos_power_fops = { - .write = pm_qos_power_write, - .read = pm_qos_power_read, - .open = pm_qos_power_open, - .release = pm_qos_power_release, - .llseek = noop_llseek, -}; - /** * pm_qos_read_value - Return the current effective constraint value. * @c: List of PM QoS constraint requests. @@ -414,15 +399,6 @@ EXPORT_SYMBOL_GPL(pm_qos_remove_notifier); /* User space interface to global PM QoS via misc device. */ -static int register_pm_qos_misc(struct pm_qos_object *qos) -{ - qos->pm_qos_power_miscdev.minor = MISC_DYNAMIC_MINOR; - qos->pm_qos_power_miscdev.name = qos->name; - qos->pm_qos_power_miscdev.fops = &pm_qos_power_fops; - - return misc_register(&qos->pm_qos_power_miscdev); -} - static int pm_qos_power_open(struct inode *inode, struct file *filp) { struct pm_qos_request *req; @@ -439,9 +415,10 @@ static int pm_qos_power_open(struct inode *inode, struct file *filp) static int pm_qos_power_release(struct inode *inode, struct file *filp) { - struct pm_qos_request *req; + struct pm_qos_request *req = filp->private_data; + + filp->private_data = NULL; - req = filp->private_data; pm_qos_remove_request(req); kfree(req); @@ -449,15 +426,13 @@ static int pm_qos_power_release(struct inode *inode, struct file *filp) } static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, - size_t count, loff_t *f_pos) + size_t count, loff_t *f_pos) { - s32 value; - unsigned long flags; struct pm_qos_request *req = filp->private_data; + unsigned long flags; + s32 value; - if (!req) - return -EINVAL; - if (!pm_qos_request_active(req)) + if (!req || !pm_qos_request_active(req)) return -EINVAL; spin_lock_irqsave(&pm_qos_lock, flags); @@ -468,10 +443,9 @@ static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, } static ssize_t pm_qos_power_write(struct file *filp, const char __user *buf, - size_t count, loff_t *f_pos) + size_t count, loff_t *f_pos) { s32 value; - struct pm_qos_request *req; if (count == sizeof(s32)) { if (copy_from_user(&value, buf, sizeof(s32))) @@ -484,12 +458,28 @@ static ssize_t pm_qos_power_write(struct file *filp, const char __user *buf, return ret; } - req = filp->private_data; - pm_qos_update_request(req, value); + pm_qos_update_request(filp->private_data, value); return count; } +static const struct file_operations pm_qos_power_fops = { + .write = pm_qos_power_write, + .read = pm_qos_power_read, + .open = pm_qos_power_open, + .release = pm_qos_power_release, + .llseek = noop_llseek, +}; + +static int register_pm_qos_misc(struct pm_qos_object *qos) +{ + qos->pm_qos_power_miscdev.minor = MISC_DYNAMIC_MINOR; + qos->pm_qos_power_miscdev.name = qos->name; + qos->pm_qos_power_miscdev.fops = &pm_qos_power_fops; + + return misc_register(&qos->pm_qos_power_miscdev); +} + static int __init pm_qos_power_init(void) { int ret; -- cgit v1.2.3-71-gd317 From 02c92a378940ba1a46b6b7ad277f1b07f8093dbc Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:01:18 +0100 Subject: PM: QoS: Redefine struct pm_qos_request and drop struct pm_qos_object First, change the definition of struct pm_qos_request so that it contains a struct pm_qos_constraints pointer (called "qos") instead of a PM QoS class number (in preparation for dropping the PM QoS classes concept altogether going forward) and move its definition (along with the definition of struct pm_qos_flags_request that does not change) after the definition of struct pm_qos_constraints. Next, drop the definition of struct pm_qos_object and the null_pm_qos and cpu_dma_pm_qos variables of that type along with pm_qos_array[] holding pointers to them and change the code to refer to the pm_qos_constraints structure directly or to use the new qos pointer in struct pm_qos_request for that instead of going through pm_qos_array[] to access it. Also update kerneldoc comments that mention pm_qos_class to refer to PM_QOS_CPU_DMA_LATENCY directly instead. Finally, drop register_pm_qos_misc(), introduce cpu_latency_qos_miscdev (with the name field set to "cpu_dma_latency") to implement the CPU latency QoS interface in /dev/ and register it directly from pm_qos_power_init(). After these changes the notion of PM QoS classes remains only in the API (in the form of redundant function parameters that are ignored) and in the definitions of PM QoS trace events. While at it, some redundant local variables are dropped etc. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- include/linux/pm_qos.h | 20 ++++---- kernel/power/qos.c | 121 +++++++++++++++++-------------------------------- 2 files changed, 51 insertions(+), 90 deletions(-) (limited to 'kernel') diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index 48bfb96a9360..bef110aa80cc 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -39,16 +39,6 @@ enum pm_qos_flags_status { #define PM_QOS_FLAG_NO_POWER_OFF (1 << 0) -struct pm_qos_request { - struct plist_node node; - int pm_qos_class; -}; - -struct pm_qos_flags_request { - struct list_head node; - s32 flags; /* Do not change to 64 bit */ -}; - enum pm_qos_type { PM_QOS_UNITIALIZED, PM_QOS_MAX, /* return the largest value */ @@ -69,6 +59,16 @@ struct pm_qos_constraints { struct blocking_notifier_head *notifiers; }; +struct pm_qos_request { + struct plist_node node; + struct pm_qos_constraints *qos; +}; + +struct pm_qos_flags_request { + struct list_head node; + s32 flags; /* Do not change to 64 bit */ +}; + struct pm_qos_flags { struct list_head list; s32 effective_flags; /* Do not change to 64 bit */ diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 9f67584d4466..952c5f55e23c 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -54,16 +54,8 @@ * or pm_qos_object list and pm_qos_objects need to happen with pm_qos_lock * held, taken with _irqsave. One lock to rule them all */ -struct pm_qos_object { - struct pm_qos_constraints *constraints; - struct miscdevice pm_qos_power_miscdev; - char *name; -}; - static DEFINE_SPINLOCK(pm_qos_lock); -static struct pm_qos_object null_pm_qos; - static BLOCKING_NOTIFIER_HEAD(cpu_dma_lat_notifier); static struct pm_qos_constraints cpu_dma_constraints = { .list = PLIST_HEAD_INIT(cpu_dma_constraints.list), @@ -73,15 +65,6 @@ static struct pm_qos_constraints cpu_dma_constraints = { .type = PM_QOS_MIN, .notifiers = &cpu_dma_lat_notifier, }; -static struct pm_qos_object cpu_dma_pm_qos = { - .constraints = &cpu_dma_constraints, - .name = "cpu_dma_latency", -}; - -static struct pm_qos_object *pm_qos_array[] = { - &null_pm_qos, - &cpu_dma_pm_qos, -}; /** * pm_qos_read_value - Return the current effective constraint value. @@ -248,46 +231,34 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, /** * pm_qos_request - returns current system wide qos expectation - * @pm_qos_class: identification of which qos value is requested + * @pm_qos_class: Ignored. * * This function returns the current target value. */ int pm_qos_request(int pm_qos_class) { - return pm_qos_read_value(pm_qos_array[pm_qos_class]->constraints); + return pm_qos_read_value(&cpu_dma_constraints); } EXPORT_SYMBOL_GPL(pm_qos_request); int pm_qos_request_active(struct pm_qos_request *req) { - return req->pm_qos_class != 0; + return req->qos == &cpu_dma_constraints; } EXPORT_SYMBOL_GPL(pm_qos_request_active); -static void __pm_qos_update_request(struct pm_qos_request *req, - s32 new_value) -{ - trace_pm_qos_update_request(req->pm_qos_class, new_value); - - if (new_value != req->node.prio) - pm_qos_update_target( - pm_qos_array[req->pm_qos_class]->constraints, - &req->node, PM_QOS_UPDATE_REQ, new_value); -} - /** * pm_qos_add_request - inserts new qos request into the list * @req: pointer to a preallocated handle - * @pm_qos_class: identifies which list of qos request to use + * @pm_qos_class: Ignored. * @value: defines the qos request * - * This function inserts a new entry in the pm_qos_class list of requested qos - * performance characteristics. It recomputes the aggregate QoS expectations - * for the pm_qos_class of parameters and initializes the pm_qos_request + * This function inserts a new entry in the PM_QOS_CPU_DMA_LATENCY list of + * requested QoS performance characteristics. It recomputes the aggregate QoS + * expectations for the PM_QOS_CPU_DMA_LATENCY list and initializes the @req * handle. Caller needs to save this handle for later use in updates and * removal. */ - void pm_qos_add_request(struct pm_qos_request *req, int pm_qos_class, s32 value) { @@ -298,10 +269,11 @@ void pm_qos_add_request(struct pm_qos_request *req, WARN(1, KERN_ERR "pm_qos_add_request() called for already added request\n"); return; } - req->pm_qos_class = pm_qos_class; - trace_pm_qos_add_request(pm_qos_class, value); - pm_qos_update_target(pm_qos_array[pm_qos_class]->constraints, - &req->node, PM_QOS_ADD_REQ, value); + + trace_pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY, value); + + req->qos = &cpu_dma_constraints; + pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ, value); } EXPORT_SYMBOL_GPL(pm_qos_add_request); @@ -310,13 +282,12 @@ EXPORT_SYMBOL_GPL(pm_qos_add_request); * @req : handle to list element holding a pm_qos request to use * @value: defines the qos request * - * Updates an existing qos request for the pm_qos_class of parameters along - * with updating the target pm_qos_class value. + * Updates an existing qos request for the PM_QOS_CPU_DMA_LATENCY list along + * with updating the target PM_QOS_CPU_DMA_LATENCY value. * * Attempts are made to make this code callable on hot code paths. */ -void pm_qos_update_request(struct pm_qos_request *req, - s32 new_value) +void pm_qos_update_request(struct pm_qos_request *req, s32 new_value) { if (!req) /*guard against callers passing in null */ return; @@ -326,7 +297,12 @@ void pm_qos_update_request(struct pm_qos_request *req, return; } - __pm_qos_update_request(req, new_value); + trace_pm_qos_update_request(PM_QOS_CPU_DMA_LATENCY, new_value); + + if (new_value == req->node.prio) + return; + + pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ, new_value); } EXPORT_SYMBOL_GPL(pm_qos_update_request); @@ -335,7 +311,7 @@ EXPORT_SYMBOL_GPL(pm_qos_update_request); * @req: handle to request list element * * Will remove pm qos request from the list of constraints and - * recompute the current target value for the pm_qos_class. Call this + * recompute the current target value for PM_QOS_CPU_DMA_LATENCY. Call this * on slow code paths. */ void pm_qos_remove_request(struct pm_qos_request *req) @@ -349,9 +325,9 @@ void pm_qos_remove_request(struct pm_qos_request *req) return; } - trace_pm_qos_remove_request(req->pm_qos_class, PM_QOS_DEFAULT_VALUE); - pm_qos_update_target(pm_qos_array[req->pm_qos_class]->constraints, - &req->node, PM_QOS_REMOVE_REQ, + trace_pm_qos_remove_request(PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE); + + pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE); memset(req, 0, sizeof(*req)); } @@ -359,41 +335,31 @@ EXPORT_SYMBOL_GPL(pm_qos_remove_request); /** * pm_qos_add_notifier - sets notification entry for changes to target value - * @pm_qos_class: identifies which qos target changes should be notified. + * @pm_qos_class: Ignored. * @notifier: notifier block managed by caller. * * will register the notifier into a notification chain that gets called - * upon changes to the pm_qos_class target value. + * upon changes to the PM_QOS_CPU_DMA_LATENCY target value. */ int pm_qos_add_notifier(int pm_qos_class, struct notifier_block *notifier) { - int retval; - - retval = blocking_notifier_chain_register( - pm_qos_array[pm_qos_class]->constraints->notifiers, - notifier); - - return retval; + return blocking_notifier_chain_register(cpu_dma_constraints.notifiers, + notifier); } EXPORT_SYMBOL_GPL(pm_qos_add_notifier); /** * pm_qos_remove_notifier - deletes notification entry from chain. - * @pm_qos_class: identifies which qos target changes are notified. + * @pm_qos_class: Ignored. * @notifier: notifier block to be removed. * * will remove the notifier from the notification chain that gets called - * upon changes to the pm_qos_class target value. + * upon changes to the PM_QOS_CPU_DMA_LATENCY target value. */ int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier) { - int retval; - - retval = blocking_notifier_chain_unregister( - pm_qos_array[pm_qos_class]->constraints->notifiers, - notifier); - - return retval; + return blocking_notifier_chain_unregister(cpu_dma_constraints.notifiers, + notifier); } EXPORT_SYMBOL_GPL(pm_qos_remove_notifier); @@ -436,7 +402,7 @@ static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, return -EINVAL; spin_lock_irqsave(&pm_qos_lock, flags); - value = pm_qos_get_value(pm_qos_array[req->pm_qos_class]->constraints); + value = pm_qos_get_value(&cpu_dma_constraints); spin_unlock_irqrestore(&pm_qos_lock, flags); return simple_read_from_buffer(buf, count, f_pos, &value, sizeof(s32)); @@ -471,25 +437,20 @@ static const struct file_operations pm_qos_power_fops = { .llseek = noop_llseek, }; -static int register_pm_qos_misc(struct pm_qos_object *qos) -{ - qos->pm_qos_power_miscdev.minor = MISC_DYNAMIC_MINOR; - qos->pm_qos_power_miscdev.name = qos->name; - qos->pm_qos_power_miscdev.fops = &pm_qos_power_fops; - - return misc_register(&qos->pm_qos_power_miscdev); -} +static struct miscdevice cpu_latency_qos_miscdev = { + .minor = MISC_DYNAMIC_MINOR, + .name = "cpu_dma_latency", + .fops = &pm_qos_power_fops, +}; static int __init pm_qos_power_init(void) { int ret; - BUILD_BUG_ON(ARRAY_SIZE(pm_qos_array) != PM_QOS_NUM_CLASSES); - - ret = register_pm_qos_misc(pm_qos_array[PM_QOS_CPU_DMA_LATENCY]); + ret = misc_register(&cpu_latency_qos_miscdev); if (ret < 0) pr_err("%s: %s setup failed\n", __func__, - pm_qos_array[PM_QOS_CPU_DMA_LATENCY]->name); + cpu_latency_qos_miscdev.name); return ret; } -- cgit v1.2.3-71-gd317 From 3a4a0042228a854d9b1073050620820b4a977e6e Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:02:30 +0100 Subject: PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY notifier chain Notice that pm_qos_remove_notifier() is not used at all and the only caller of pm_qos_add_notifier() is the cpuidle core, which only needs the PM_QOS_CPU_DMA_LATENCY notifier to invoke wake_up_all_idle_cpus() upon changes of the PM_QOS_CPU_DMA_LATENCY target value. First, to ensure that wake_up_all_idle_cpus() will be called whenever the PM_QOS_CPU_DMA_LATENCY target value changes, modify the pm_qos_add/update/remove_request() family of functions to check if the effective constraint for the PM_QOS_CPU_DMA_LATENCY has changed and call wake_up_all_idle_cpus() directly in that case. Next, drop the PM_QOS_CPU_DMA_LATENCY notifier from cpuidle as it is not necessary any more. Finally, drop both pm_qos_add_notifier() and pm_qos_remove_notifier(), as they have no callers now, along with cpu_dma_lat_notifier which is only used by them. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- drivers/cpuidle/cpuidle.c | 40 +--------------------------------------- include/linux/pm_qos.h | 2 -- kernel/power/qos.c | 47 +++++++++++------------------------------------ 3 files changed, 12 insertions(+), 77 deletions(-) (limited to 'kernel') diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index de81298051b3..c149d9e20dfd 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -736,53 +736,15 @@ int cpuidle_register(struct cpuidle_driver *drv, } EXPORT_SYMBOL_GPL(cpuidle_register); -#ifdef CONFIG_SMP - -/* - * This function gets called when a part of the kernel has a new latency - * requirement. This means we need to get all processors out of their C-state, - * and then recalculate a new suitable C-state. Just do a cross-cpu IPI; that - * wakes them all right up. - */ -static int cpuidle_latency_notify(struct notifier_block *b, - unsigned long l, void *v) -{ - wake_up_all_idle_cpus(); - return NOTIFY_OK; -} - -static struct notifier_block cpuidle_latency_notifier = { - .notifier_call = cpuidle_latency_notify, -}; - -static inline void latency_notifier_init(struct notifier_block *n) -{ - pm_qos_add_notifier(PM_QOS_CPU_DMA_LATENCY, n); -} - -#else /* CONFIG_SMP */ - -#define latency_notifier_init(x) do { } while (0) - -#endif /* CONFIG_SMP */ - /** * cpuidle_init - core initializer */ static int __init cpuidle_init(void) { - int ret; - if (cpuidle_disabled()) return -ENODEV; - ret = cpuidle_add_interface(cpu_subsys.dev_root); - if (ret) - return ret; - - latency_notifier_init(&cpuidle_latency_notifier); - - return 0; + return cpuidle_add_interface(cpu_subsys.dev_root); } module_param(off, int, 0444); diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index bef110aa80cc..cb57e5918a25 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -149,8 +149,6 @@ void pm_qos_update_request(struct pm_qos_request *req, void pm_qos_remove_request(struct pm_qos_request *req); int pm_qos_request(int pm_qos_class); -int pm_qos_add_notifier(int pm_qos_class, struct notifier_block *notifier); -int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier); int pm_qos_request_active(struct pm_qos_request *req); s32 pm_qos_read_value(struct pm_qos_constraints *c); diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 952c5f55e23c..201b43bc6457 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -56,14 +56,12 @@ */ static DEFINE_SPINLOCK(pm_qos_lock); -static BLOCKING_NOTIFIER_HEAD(cpu_dma_lat_notifier); static struct pm_qos_constraints cpu_dma_constraints = { .list = PLIST_HEAD_INIT(cpu_dma_constraints.list), .target_value = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE, .default_value = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE, .no_constraint_value = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE, .type = PM_QOS_MIN, - .notifiers = &cpu_dma_lat_notifier, }; /** @@ -247,6 +245,14 @@ int pm_qos_request_active(struct pm_qos_request *req) } EXPORT_SYMBOL_GPL(pm_qos_request_active); +static void cpu_latency_qos_update(struct pm_qos_request *req, + enum pm_qos_req_action action, s32 value) +{ + int ret = pm_qos_update_target(req->qos, &req->node, action, value); + if (ret > 0) + wake_up_all_idle_cpus(); +} + /** * pm_qos_add_request - inserts new qos request into the list * @req: pointer to a preallocated handle @@ -273,7 +279,7 @@ void pm_qos_add_request(struct pm_qos_request *req, trace_pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY, value); req->qos = &cpu_dma_constraints; - pm_qos_update_target(req->qos, &req->node, PM_QOS_ADD_REQ, value); + cpu_latency_qos_update(req, PM_QOS_ADD_REQ, value); } EXPORT_SYMBOL_GPL(pm_qos_add_request); @@ -302,7 +308,7 @@ void pm_qos_update_request(struct pm_qos_request *req, s32 new_value) if (new_value == req->node.prio) return; - pm_qos_update_target(req->qos, &req->node, PM_QOS_UPDATE_REQ, new_value); + cpu_latency_qos_update(req, PM_QOS_UPDATE_REQ, new_value); } EXPORT_SYMBOL_GPL(pm_qos_update_request); @@ -327,42 +333,11 @@ void pm_qos_remove_request(struct pm_qos_request *req) trace_pm_qos_remove_request(PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE); - pm_qos_update_target(req->qos, &req->node, PM_QOS_REMOVE_REQ, - PM_QOS_DEFAULT_VALUE); + cpu_latency_qos_update(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE); memset(req, 0, sizeof(*req)); } EXPORT_SYMBOL_GPL(pm_qos_remove_request); -/** - * pm_qos_add_notifier - sets notification entry for changes to target value - * @pm_qos_class: Ignored. - * @notifier: notifier block managed by caller. - * - * will register the notifier into a notification chain that gets called - * upon changes to the PM_QOS_CPU_DMA_LATENCY target value. - */ -int pm_qos_add_notifier(int pm_qos_class, struct notifier_block *notifier) -{ - return blocking_notifier_chain_register(cpu_dma_constraints.notifiers, - notifier); -} -EXPORT_SYMBOL_GPL(pm_qos_add_notifier); - -/** - * pm_qos_remove_notifier - deletes notification entry from chain. - * @pm_qos_class: Ignored. - * @notifier: notifier block to be removed. - * - * will remove the notifier from the notification chain that gets called - * upon changes to the PM_QOS_CPU_DMA_LATENCY target value. - */ -int pm_qos_remove_notifier(int pm_qos_class, struct notifier_block *notifier) -{ - return blocking_notifier_chain_unregister(cpu_dma_constraints.notifiers, - notifier); -} -EXPORT_SYMBOL_GPL(pm_qos_remove_notifier); - /* User space interface to global PM QoS via misc device. */ static int pm_qos_power_open(struct inode *inode, struct file *filp) -- cgit v1.2.3-71-gd317 From 2552d3520132a22834e0be85c51168a7a798608c Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:04:31 +0100 Subject: PM: QoS: Rename things related to the CPU latency QoS First, rename PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE to PM_QOS_CPU_LATENCY_DEFAULT_VALUE and update all of the code referring to it accordingly. Next, rename cpu_dma_constraints to cpu_latency_constraints, move the definition of it closer to the functions referring to it and update all of them accordingly. [While at it, add a comment to mark the start of the code related to the CPU latency QoS.] Finally, rename the pm_qos_power_*() family of functions and pm_qos_power_fops to cpu_latency_qos_*() and cpu_latency_qos_fops, respectively, and update the definition of cpu_latency_qos_miscdev. [While at it, update the miscdev interface code start comment.] No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Acked-by: Greg Kroah-Hartman Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- drivers/tty/serial/8250/8250_omap.c | 6 ++-- drivers/tty/serial/omap-serial.c | 6 ++-- include/linux/pm_qos.h | 2 +- kernel/power/qos.c | 56 +++++++++++++++++++------------------ 4 files changed, 36 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c index 6f343ca08440..19f8d2f9e7ba 100644 --- a/drivers/tty/serial/8250/8250_omap.c +++ b/drivers/tty/serial/8250/8250_omap.c @@ -1222,8 +1222,8 @@ static int omap8250_probe(struct platform_device *pdev) DEFAULT_CLK_SPEED); } - priv->latency = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE; - priv->calc_latency = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE; + priv->latency = PM_QOS_CPU_LATENCY_DEFAULT_VALUE; + priv->calc_latency = PM_QOS_CPU_LATENCY_DEFAULT_VALUE; pm_qos_add_request(&priv->pm_qos_request, PM_QOS_CPU_DMA_LATENCY, priv->latency); INIT_WORK(&priv->qos_work, omap8250_uart_qos_work); @@ -1445,7 +1445,7 @@ static int omap8250_runtime_suspend(struct device *dev) if (up->dma && up->dma->rxchan) omap_8250_rx_dma_flush(up); - priv->latency = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE; + priv->latency = PM_QOS_CPU_LATENCY_DEFAULT_VALUE; schedule_work(&priv->qos_work); return 0; diff --git a/drivers/tty/serial/omap-serial.c b/drivers/tty/serial/omap-serial.c index 48017cec7f2f..ce2558767eee 100644 --- a/drivers/tty/serial/omap-serial.c +++ b/drivers/tty/serial/omap-serial.c @@ -1722,8 +1722,8 @@ static int serial_omap_probe(struct platform_device *pdev) DEFAULT_CLK_SPEED); } - up->latency = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE; - up->calc_latency = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE; + up->latency = PM_QOS_CPU_LATENCY_DEFAULT_VALUE; + up->calc_latency = PM_QOS_CPU_LATENCY_DEFAULT_VALUE; pm_qos_add_request(&up->pm_qos_request, PM_QOS_CPU_DMA_LATENCY, up->latency); INIT_WORK(&up->qos_work, serial_omap_uart_qos_work); @@ -1869,7 +1869,7 @@ static int serial_omap_runtime_suspend(struct device *dev) serial_omap_enable_wakeup(up, true); - up->latency = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE; + up->latency = PM_QOS_CPU_LATENCY_DEFAULT_VALUE; schedule_work(&up->qos_work); return 0; diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index cb57e5918a25..a3e0bfc6c470 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -28,7 +28,7 @@ enum pm_qos_flags_status { #define PM_QOS_LATENCY_ANY S32_MAX #define PM_QOS_LATENCY_ANY_NS ((s64)PM_QOS_LATENCY_ANY * NSEC_PER_USEC) -#define PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE (2000 * USEC_PER_SEC) +#define PM_QOS_CPU_LATENCY_DEFAULT_VALUE (2000 * USEC_PER_SEC) #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE PM_QOS_LATENCY_ANY #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT PM_QOS_LATENCY_ANY #define PM_QOS_RESUME_LATENCY_NO_CONSTRAINT_NS PM_QOS_LATENCY_ANY_NS diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 201b43bc6457..a6bf53e9db17 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -56,14 +56,6 @@ */ static DEFINE_SPINLOCK(pm_qos_lock); -static struct pm_qos_constraints cpu_dma_constraints = { - .list = PLIST_HEAD_INIT(cpu_dma_constraints.list), - .target_value = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE, - .default_value = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE, - .no_constraint_value = PM_QOS_CPU_DMA_LAT_DEFAULT_VALUE, - .type = PM_QOS_MIN, -}; - /** * pm_qos_read_value - Return the current effective constraint value. * @c: List of PM QoS constraint requests. @@ -227,6 +219,16 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, return prev_value != curr_value; } +/* Definitions related to the CPU latency QoS. */ + +static struct pm_qos_constraints cpu_latency_constraints = { + .list = PLIST_HEAD_INIT(cpu_latency_constraints.list), + .target_value = PM_QOS_CPU_LATENCY_DEFAULT_VALUE, + .default_value = PM_QOS_CPU_LATENCY_DEFAULT_VALUE, + .no_constraint_value = PM_QOS_CPU_LATENCY_DEFAULT_VALUE, + .type = PM_QOS_MIN, +}; + /** * pm_qos_request - returns current system wide qos expectation * @pm_qos_class: Ignored. @@ -235,13 +237,13 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, */ int pm_qos_request(int pm_qos_class) { - return pm_qos_read_value(&cpu_dma_constraints); + return pm_qos_read_value(&cpu_latency_constraints); } EXPORT_SYMBOL_GPL(pm_qos_request); int pm_qos_request_active(struct pm_qos_request *req) { - return req->qos == &cpu_dma_constraints; + return req->qos == &cpu_latency_constraints; } EXPORT_SYMBOL_GPL(pm_qos_request_active); @@ -278,7 +280,7 @@ void pm_qos_add_request(struct pm_qos_request *req, trace_pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY, value); - req->qos = &cpu_dma_constraints; + req->qos = &cpu_latency_constraints; cpu_latency_qos_update(req, PM_QOS_ADD_REQ, value); } EXPORT_SYMBOL_GPL(pm_qos_add_request); @@ -338,9 +340,9 @@ void pm_qos_remove_request(struct pm_qos_request *req) } EXPORT_SYMBOL_GPL(pm_qos_remove_request); -/* User space interface to global PM QoS via misc device. */ +/* User space interface to the CPU latency QoS via misc device. */ -static int pm_qos_power_open(struct inode *inode, struct file *filp) +static int cpu_latency_qos_open(struct inode *inode, struct file *filp) { struct pm_qos_request *req; @@ -354,7 +356,7 @@ static int pm_qos_power_open(struct inode *inode, struct file *filp) return 0; } -static int pm_qos_power_release(struct inode *inode, struct file *filp) +static int cpu_latency_qos_release(struct inode *inode, struct file *filp) { struct pm_qos_request *req = filp->private_data; @@ -366,8 +368,8 @@ static int pm_qos_power_release(struct inode *inode, struct file *filp) return 0; } -static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, - size_t count, loff_t *f_pos) +static ssize_t cpu_latency_qos_read(struct file *filp, char __user *buf, + size_t count, loff_t *f_pos) { struct pm_qos_request *req = filp->private_data; unsigned long flags; @@ -377,14 +379,14 @@ static ssize_t pm_qos_power_read(struct file *filp, char __user *buf, return -EINVAL; spin_lock_irqsave(&pm_qos_lock, flags); - value = pm_qos_get_value(&cpu_dma_constraints); + value = pm_qos_get_value(&cpu_latency_constraints); spin_unlock_irqrestore(&pm_qos_lock, flags); return simple_read_from_buffer(buf, count, f_pos, &value, sizeof(s32)); } -static ssize_t pm_qos_power_write(struct file *filp, const char __user *buf, - size_t count, loff_t *f_pos) +static ssize_t cpu_latency_qos_write(struct file *filp, const char __user *buf, + size_t count, loff_t *f_pos) { s32 value; @@ -404,21 +406,21 @@ static ssize_t pm_qos_power_write(struct file *filp, const char __user *buf, return count; } -static const struct file_operations pm_qos_power_fops = { - .write = pm_qos_power_write, - .read = pm_qos_power_read, - .open = pm_qos_power_open, - .release = pm_qos_power_release, +static const struct file_operations cpu_latency_qos_fops = { + .write = cpu_latency_qos_write, + .read = cpu_latency_qos_read, + .open = cpu_latency_qos_open, + .release = cpu_latency_qos_release, .llseek = noop_llseek, }; static struct miscdevice cpu_latency_qos_miscdev = { .minor = MISC_DYNAMIC_MINOR, .name = "cpu_dma_latency", - .fops = &pm_qos_power_fops, + .fops = &cpu_latency_qos_fops, }; -static int __init pm_qos_power_init(void) +static int __init cpu_latency_qos_init(void) { int ret; @@ -429,7 +431,7 @@ static int __init pm_qos_power_init(void) return ret; } -late_initcall(pm_qos_power_init); +late_initcall(cpu_latency_qos_init); /* Definitions related to the frequency QoS below. */ -- cgit v1.2.3-71-gd317 From 333eed7d20069e2d80446f5fdf9ac3868b55e7b9 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:06:17 +0100 Subject: PM: QoS: Simplify definitions of CPU latency QoS trace events Modify the definitions of the CPU latency QoS trace events to take one argument (since PM_QOS_CPU_DMA_LATENCY is always passed as the pm_qos_class argument to them) and update the documentation of them accordingly (while at it, make it explicitly mention CPU latency QoS and relocate it after the device PM QoS trace events documentation). The names and output format of the trace events do not change to preserve user space compatibility. No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- Documentation/trace/events-power.rst | 19 ++++++++++--------- include/trace/events/power.h | 35 +++++++++++++++++------------------ kernel/power/qos.c | 16 ++++++++-------- 3 files changed, 35 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst index eec7453a168e..f45bf11fa88d 100644 --- a/Documentation/trace/events-power.rst +++ b/Documentation/trace/events-power.rst @@ -73,14 +73,6 @@ The second parameter is the power domain target state. ================ The PM QoS events are used for QoS add/update/remove request and for target/flags update. -:: - - pm_qos_add_request "pm_qos_class=%s value=%d" - pm_qos_update_request "pm_qos_class=%s value=%d" - pm_qos_remove_request "pm_qos_class=%s value=%d" - -The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY"). -The second parameter is value to be added/updated/removed. :: pm_qos_update_target "action=%s prev_value=%d curr_value=%d" @@ -90,7 +82,7 @@ The first parameter gives the QoS action name (e.g. "ADD_REQ"). The second parameter is the previous QoS value. The third parameter is the current QoS value to update. -And, there are also events used for device PM QoS add/update/remove request. +There are also events used for device PM QoS add/update/remove request. :: dev_pm_qos_add_request "device=%s type=%s new_value=%d" @@ -101,3 +93,12 @@ The first parameter gives the device name which tries to add/update/remove QoS requests. The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY"). The third parameter is value to be added/updated/removed. + +And, there are events used for CPU latency QoS add/update/remove request. +:: + + pm_qos_add_request "value=%d" + pm_qos_update_request "value=%d" + pm_qos_remove_request "value=%d" + +The parameter is the value to be added/updated/removed. diff --git a/include/trace/events/power.h b/include/trace/events/power.h index ecf39daabf16..af5018aa9517 100644 --- a/include/trace/events/power.h +++ b/include/trace/events/power.h @@ -359,51 +359,50 @@ DEFINE_EVENT(power_domain, power_domain_target, ); /* - * The pm qos events are used for pm qos update + * CPU latency QoS events used for global CPU latency QoS list updates */ -DECLARE_EVENT_CLASS(pm_qos_request, +DECLARE_EVENT_CLASS(cpu_latency_qos_request, - TP_PROTO(int pm_qos_class, s32 value), + TP_PROTO(s32 value), - TP_ARGS(pm_qos_class, value), + TP_ARGS(value), TP_STRUCT__entry( - __field( int, pm_qos_class ) __field( s32, value ) ), TP_fast_assign( - __entry->pm_qos_class = pm_qos_class; __entry->value = value; ), - TP_printk("pm_qos_class=%s value=%d", - __print_symbolic(__entry->pm_qos_class, - { PM_QOS_CPU_DMA_LATENCY, "CPU_DMA_LATENCY" }), + TP_printk("CPU_DMA_LATENCY value=%d", __entry->value) ); -DEFINE_EVENT(pm_qos_request, pm_qos_add_request, +DEFINE_EVENT(cpu_latency_qos_request, pm_qos_add_request, - TP_PROTO(int pm_qos_class, s32 value), + TP_PROTO(s32 value), - TP_ARGS(pm_qos_class, value) + TP_ARGS(value) ); -DEFINE_EVENT(pm_qos_request, pm_qos_update_request, +DEFINE_EVENT(cpu_latency_qos_request, pm_qos_update_request, - TP_PROTO(int pm_qos_class, s32 value), + TP_PROTO(s32 value), - TP_ARGS(pm_qos_class, value) + TP_ARGS(value) ); -DEFINE_EVENT(pm_qos_request, pm_qos_remove_request, +DEFINE_EVENT(cpu_latency_qos_request, pm_qos_remove_request, - TP_PROTO(int pm_qos_class, s32 value), + TP_PROTO(s32 value), - TP_ARGS(pm_qos_class, value) + TP_ARGS(value) ); +/* + * General PM QoS events used for updates of PM QoS request lists + */ DECLARE_EVENT_CLASS(pm_qos_update, TP_PROTO(enum pm_qos_req_action action, int prev_value, int curr_value), diff --git a/kernel/power/qos.c b/kernel/power/qos.c index a6bf53e9db17..afac7010e0f2 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -247,8 +247,8 @@ int pm_qos_request_active(struct pm_qos_request *req) } EXPORT_SYMBOL_GPL(pm_qos_request_active); -static void cpu_latency_qos_update(struct pm_qos_request *req, - enum pm_qos_req_action action, s32 value) +static void cpu_latency_qos_apply(struct pm_qos_request *req, + enum pm_qos_req_action action, s32 value) { int ret = pm_qos_update_target(req->qos, &req->node, action, value); if (ret > 0) @@ -278,10 +278,10 @@ void pm_qos_add_request(struct pm_qos_request *req, return; } - trace_pm_qos_add_request(PM_QOS_CPU_DMA_LATENCY, value); + trace_pm_qos_add_request(value); req->qos = &cpu_latency_constraints; - cpu_latency_qos_update(req, PM_QOS_ADD_REQ, value); + cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value); } EXPORT_SYMBOL_GPL(pm_qos_add_request); @@ -305,12 +305,12 @@ void pm_qos_update_request(struct pm_qos_request *req, s32 new_value) return; } - trace_pm_qos_update_request(PM_QOS_CPU_DMA_LATENCY, new_value); + trace_pm_qos_update_request(new_value); if (new_value == req->node.prio) return; - cpu_latency_qos_update(req, PM_QOS_UPDATE_REQ, new_value); + cpu_latency_qos_apply(req, PM_QOS_UPDATE_REQ, new_value); } EXPORT_SYMBOL_GPL(pm_qos_update_request); @@ -333,9 +333,9 @@ void pm_qos_remove_request(struct pm_qos_request *req) return; } - trace_pm_qos_remove_request(PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE); + trace_pm_qos_remove_request(PM_QOS_DEFAULT_VALUE); - cpu_latency_qos_update(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE); + cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE); memset(req, 0, sizeof(*req)); } EXPORT_SYMBOL_GPL(pm_qos_remove_request); -- cgit v1.2.3-71-gd317 From e033b6c175a32870a2f7cd3741c0d48c858c2d04 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:07:01 +0100 Subject: PM: QoS: Adjust pm_qos_request() signature and reorder pm_qos.h Change the return type of pm_qos_request() to be the same as the one of pm_qos_read_value() called by it internally and stop exporting it to modules (because its only caller, cpuidle, is not modular). Also move the pm_qos_read_value() header away from the CPU latency QoS API function headers in pm_qos.h (because it technically does not belong to that API). No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- include/linux/pm_qos.h | 6 +++--- kernel/power/qos.c | 3 +-- 2 files changed, 4 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index a3e0bfc6c470..3c4bee29ecda 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -137,20 +137,20 @@ static inline int dev_pm_qos_request_active(struct dev_pm_qos_request *req) return req->dev != NULL; } +s32 pm_qos_read_value(struct pm_qos_constraints *c); int pm_qos_update_target(struct pm_qos_constraints *c, struct plist_node *node, enum pm_qos_req_action action, int value); bool pm_qos_update_flags(struct pm_qos_flags *pqf, struct pm_qos_flags_request *req, enum pm_qos_req_action action, s32 val); + void pm_qos_add_request(struct pm_qos_request *req, int pm_qos_class, s32 value); void pm_qos_update_request(struct pm_qos_request *req, s32 new_value); void pm_qos_remove_request(struct pm_qos_request *req); - -int pm_qos_request(int pm_qos_class); +s32 pm_qos_request(int pm_qos_class); int pm_qos_request_active(struct pm_qos_request *req); -s32 pm_qos_read_value(struct pm_qos_constraints *c); #ifdef CONFIG_PM enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask); diff --git a/kernel/power/qos.c b/kernel/power/qos.c index afac7010e0f2..7bb55aca03bb 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -235,11 +235,10 @@ static struct pm_qos_constraints cpu_latency_constraints = { * * This function returns the current target value. */ -int pm_qos_request(int pm_qos_class) +s32 pm_qos_request(int pm_qos_class) { return pm_qos_read_value(&cpu_latency_constraints); } -EXPORT_SYMBOL_GPL(pm_qos_request); int pm_qos_request_active(struct pm_qos_request *req) { -- cgit v1.2.3-71-gd317 From 67b06ba01857ed077e1a66bfa139156e7c68bab2 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:35:04 +0100 Subject: PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY and rename related functions Drop the PM QoS classes enum including PM_QOS_CPU_DMA_LATENCY, drop the wrappers around pm_qos_request(), pm_qos_request_active(), and pm_qos_add/update/remove_request() introduced previously, rename these functions, respectively, to cpu_latency_qos_limit(), cpu_latency_qos_request_active(), and cpu_latency_qos_add/update/remove_request(), and update their kerneldoc comments. [While at it, drop some useless comments from these functions.] No intentional functional impact. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- include/linux/pm_qos.h | 47 +++--------------------- kernel/power/qos.c | 98 +++++++++++++++++++++++++------------------------- 2 files changed, 54 insertions(+), 91 deletions(-) (limited to 'kernel') diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index 63d39e66f95d..e0ca4d780457 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -9,14 +9,6 @@ #include #include -enum { - PM_QOS_RESERVED = 0, - PM_QOS_CPU_DMA_LATENCY, - - /* insert new class ID */ - PM_QOS_NUM_CLASSES, -}; - enum pm_qos_flags_status { PM_QOS_FLAGS_UNDEFINED = -1, PM_QOS_FLAGS_NONE, @@ -144,40 +136,11 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, struct pm_qos_flags_request *req, enum pm_qos_req_action action, s32 val); -void pm_qos_add_request(struct pm_qos_request *req, int pm_qos_class, - s32 value); -void pm_qos_update_request(struct pm_qos_request *req, - s32 new_value); -void pm_qos_remove_request(struct pm_qos_request *req); -s32 pm_qos_request(int pm_qos_class); -int pm_qos_request_active(struct pm_qos_request *req); - -static inline void cpu_latency_qos_add_request(struct pm_qos_request *req, - s32 value) -{ - pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, value); -} - -static inline void cpu_latency_qos_update_request(struct pm_qos_request *req, - s32 new_value) -{ - pm_qos_update_request(req, new_value); -} - -static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) -{ - pm_qos_remove_request(req); -} - -static inline bool cpu_latency_qos_request_active(struct pm_qos_request *req) -{ - return pm_qos_request_active(req); -} - -static inline s32 cpu_latency_qos_limit(void) -{ - return pm_qos_request(PM_QOS_CPU_DMA_LATENCY); -} +s32 cpu_latency_qos_limit(void); +bool cpu_latency_qos_request_active(struct pm_qos_request *req); +void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value); +void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value); +void cpu_latency_qos_remove_request(struct pm_qos_request *req); #ifdef CONFIG_PM enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask); diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 7bb55aca03bb..7374c76f409a 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -230,21 +230,25 @@ static struct pm_qos_constraints cpu_latency_constraints = { }; /** - * pm_qos_request - returns current system wide qos expectation - * @pm_qos_class: Ignored. - * - * This function returns the current target value. + * cpu_latency_qos_limit - Return current system-wide CPU latency QoS limit. */ -s32 pm_qos_request(int pm_qos_class) +s32 cpu_latency_qos_limit(void) { return pm_qos_read_value(&cpu_latency_constraints); } -int pm_qos_request_active(struct pm_qos_request *req) +/** + * cpu_latency_qos_request_active - Check the given PM QoS request. + * @req: PM QoS request to check. + * + * Return: 'true' if @req has been added to the CPU latency QoS list, 'false' + * otherwise. + */ +bool cpu_latency_qos_request_active(struct pm_qos_request *req) { return req->qos == &cpu_latency_constraints; } -EXPORT_SYMBOL_GPL(pm_qos_request_active); +EXPORT_SYMBOL_GPL(cpu_latency_qos_request_active); static void cpu_latency_qos_apply(struct pm_qos_request *req, enum pm_qos_req_action action, s32 value) @@ -255,25 +259,24 @@ static void cpu_latency_qos_apply(struct pm_qos_request *req, } /** - * pm_qos_add_request - inserts new qos request into the list - * @req: pointer to a preallocated handle - * @pm_qos_class: Ignored. - * @value: defines the qos request + * cpu_latency_qos_add_request - Add new CPU latency QoS request. + * @req: Pointer to a preallocated handle. + * @value: Requested constraint value. + * + * Use @value to initialize the request handle pointed to by @req, insert it as + * a new entry to the CPU latency QoS list and recompute the effective QoS + * constraint for that list. * - * This function inserts a new entry in the PM_QOS_CPU_DMA_LATENCY list of - * requested QoS performance characteristics. It recomputes the aggregate QoS - * expectations for the PM_QOS_CPU_DMA_LATENCY list and initializes the @req - * handle. Caller needs to save this handle for later use in updates and - * removal. + * Callers need to save the handle for later use in updates and removal of the + * QoS request represented by it. */ -void pm_qos_add_request(struct pm_qos_request *req, - int pm_qos_class, s32 value) +void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value) { - if (!req) /*guard against callers passing in null */ + if (!req) return; - if (pm_qos_request_active(req)) { - WARN(1, KERN_ERR "pm_qos_add_request() called for already added request\n"); + if (cpu_latency_qos_request_active(req)) { + WARN(1, KERN_ERR "%s called for already added request\n", __func__); return; } @@ -282,25 +285,24 @@ void pm_qos_add_request(struct pm_qos_request *req, req->qos = &cpu_latency_constraints; cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value); } -EXPORT_SYMBOL_GPL(pm_qos_add_request); +EXPORT_SYMBOL_GPL(cpu_latency_qos_add_request); /** - * pm_qos_update_request - modifies an existing qos request - * @req : handle to list element holding a pm_qos request to use - * @value: defines the qos request - * - * Updates an existing qos request for the PM_QOS_CPU_DMA_LATENCY list along - * with updating the target PM_QOS_CPU_DMA_LATENCY value. + * cpu_latency_qos_update_request - Modify existing CPU latency QoS request. + * @req : QoS request to update. + * @new_value: New requested constraint value. * - * Attempts are made to make this code callable on hot code paths. + * Use @new_value to update the QoS request represented by @req in the CPU + * latency QoS list along with updating the effective constraint value for that + * list. */ -void pm_qos_update_request(struct pm_qos_request *req, s32 new_value) +void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value) { - if (!req) /*guard against callers passing in null */ + if (!req) return; - if (!pm_qos_request_active(req)) { - WARN(1, KERN_ERR "pm_qos_update_request() called for unknown object\n"); + if (!cpu_latency_qos_request_active(req)) { + WARN(1, KERN_ERR "%s called for unknown object\n", __func__); return; } @@ -311,24 +313,22 @@ void pm_qos_update_request(struct pm_qos_request *req, s32 new_value) cpu_latency_qos_apply(req, PM_QOS_UPDATE_REQ, new_value); } -EXPORT_SYMBOL_GPL(pm_qos_update_request); +EXPORT_SYMBOL_GPL(cpu_latency_qos_update_request); /** - * pm_qos_remove_request - modifies an existing qos request - * @req: handle to request list element + * cpu_latency_qos_remove_request - Remove existing CPU latency QoS request. + * @req: QoS request to remove. * - * Will remove pm qos request from the list of constraints and - * recompute the current target value for PM_QOS_CPU_DMA_LATENCY. Call this - * on slow code paths. + * Remove the CPU latency QoS request represented by @req from the CPU latency + * QoS list along with updating the effective constraint value for that list. */ -void pm_qos_remove_request(struct pm_qos_request *req) +void cpu_latency_qos_remove_request(struct pm_qos_request *req) { - if (!req) /*guard against callers passing in null */ + if (!req) return; - /* silent return to keep pcm code cleaner */ - if (!pm_qos_request_active(req)) { - WARN(1, KERN_ERR "pm_qos_remove_request() called for unknown object\n"); + if (!cpu_latency_qos_request_active(req)) { + WARN(1, KERN_ERR "%s called for unknown object\n", __func__); return; } @@ -337,7 +337,7 @@ void pm_qos_remove_request(struct pm_qos_request *req) cpu_latency_qos_apply(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE); memset(req, 0, sizeof(*req)); } -EXPORT_SYMBOL_GPL(pm_qos_remove_request); +EXPORT_SYMBOL_GPL(cpu_latency_qos_remove_request); /* User space interface to the CPU latency QoS via misc device. */ @@ -349,7 +349,7 @@ static int cpu_latency_qos_open(struct inode *inode, struct file *filp) if (!req) return -ENOMEM; - pm_qos_add_request(req, PM_QOS_CPU_DMA_LATENCY, PM_QOS_DEFAULT_VALUE); + cpu_latency_qos_add_request(req, PM_QOS_DEFAULT_VALUE); filp->private_data = req; return 0; @@ -361,7 +361,7 @@ static int cpu_latency_qos_release(struct inode *inode, struct file *filp) filp->private_data = NULL; - pm_qos_remove_request(req); + cpu_latency_qos_remove_request(req); kfree(req); return 0; @@ -374,7 +374,7 @@ static ssize_t cpu_latency_qos_read(struct file *filp, char __user *buf, unsigned long flags; s32 value; - if (!req || !pm_qos_request_active(req)) + if (!req || !cpu_latency_qos_request_active(req)) return -EINVAL; spin_lock_irqsave(&pm_qos_lock, flags); @@ -400,7 +400,7 @@ static ssize_t cpu_latency_qos_write(struct file *filp, const char __user *buf, return ret; } - pm_qos_update_request(filp->private_data, value); + cpu_latency_qos_update_request(filp->private_data, value); return count; } -- cgit v1.2.3-71-gd317 From fe52de36dc5d60960230cabec88c357113d9c32a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:35:39 +0100 Subject: PM: QoS: Update file information comments Update the file information comments in include/linux/pm_qos.h and kernel/power/qos.c by adding titles along with copyright and authors information to them and changing the qos.c description to better reflect its contents (outdated information is dropped from it in particular). Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- include/linux/pm_qos.h | 15 +++++++++++---- kernel/power/qos.c | 34 ++++++++++++---------------------- 2 files changed, 23 insertions(+), 26 deletions(-) (limited to 'kernel') diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index e0ca4d780457..df065db3f57a 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -1,10 +1,17 @@ /* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _LINUX_PM_QOS_H -#define _LINUX_PM_QOS_H -/* interface for the pm_qos_power infrastructure of the linux kernel. +/* + * Definitions related to Power Management Quality of Service (PM QoS). + * + * Copyright (C) 2020 Intel Corporation * - * Mark Gross + * Authors: + * Mark Gross + * Rafael J. Wysocki */ + +#ifndef _LINUX_PM_QOS_H +#define _LINUX_PM_QOS_H + #include #include #include diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 7374c76f409a..ef73573db43d 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -1,31 +1,21 @@ // SPDX-License-Identifier: GPL-2.0-only /* - * This module exposes the interface to kernel space for specifying - * QoS dependencies. It provides infrastructure for registration of: + * Power Management Quality of Service (PM QoS) support base. * - * Dependents on a QoS value : register requests - * Watchers of QoS value : get notified when target QoS value changes + * Copyright (C) 2020 Intel Corporation * - * This QoS design is best effort based. Dependents register their QoS needs. - * Watchers register to keep track of the current QoS needs of the system. + * Authors: + * Mark Gross + * Rafael J. Wysocki * - * There are 3 basic classes of QoS parameter: latency, timeout, throughput - * each have defined units: - * latency: usec - * timeout: usec <-- currently not used. - * throughput: kbs (kilo byte / sec) + * Provided here is an interface for specifying PM QoS dependencies. It allows + * entities depending on QoS constraints to register their requests which are + * aggregated as appropriate to produce effective constraints (target values) + * that can be monitored by entities needing to respect them, either by polling + * or through a built-in notification mechanism. * - * There are lists of pm_qos_objects each one wrapping requests, notifiers - * - * User mode requests on a QOS parameter register themselves to the - * subsystem by opening the device node /dev/... and writing there request to - * the node. As long as the process holds a file handle open to the node the - * client continues to be accounted for. Upon file release the usermode - * request is removed and a new qos target is computed. This way when the - * request that the application has is cleaned up when closes the file - * pointer or exits the pm_qos_object will get an opportunity to clean up. - * - * Mark Gross + * In addition to the basic functionality, more specific interfaces for managing + * global CPU latency QoS requests and frequency QoS requests are provided. */ /*#define DEBUG*/ -- cgit v1.2.3-71-gd317 From 814d51f8889bc4afa75f647eeffd5ff0a5620e8d Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Wed, 12 Feb 2020 00:37:11 +0100 Subject: PM: QoS: Make CPU latency QoS depend on CONFIG_CPU_IDLE Because cpuidle is the only user of the effective constraint coming from the CPU latency QoS, add #ifdef CONFIG_CPU_IDLE around that code to avoid building it unnecessarily. Signed-off-by: Rafael J. Wysocki Reviewed-by: Ulf Hansson Reviewed-by: Amit Kucheria Tested-by: Amit Kucheria --- include/linux/pm_qos.h | 13 +++++++++++++ kernel/power/qos.c | 2 ++ 2 files changed, 15 insertions(+) (limited to 'kernel') diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h index df065db3f57a..4a69d4af3ff8 100644 --- a/include/linux/pm_qos.h +++ b/include/linux/pm_qos.h @@ -143,11 +143,24 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, struct pm_qos_flags_request *req, enum pm_qos_req_action action, s32 val); +#ifdef CONFIG_CPU_IDLE s32 cpu_latency_qos_limit(void); bool cpu_latency_qos_request_active(struct pm_qos_request *req); void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value); void cpu_latency_qos_update_request(struct pm_qos_request *req, s32 new_value); void cpu_latency_qos_remove_request(struct pm_qos_request *req); +#else +static inline s32 cpu_latency_qos_limit(void) { return INT_MAX; } +static inline bool cpu_latency_qos_request_active(struct pm_qos_request *req) +{ + return false; +} +static inline void cpu_latency_qos_add_request(struct pm_qos_request *req, + s32 value) {} +static inline void cpu_latency_qos_update_request(struct pm_qos_request *req, + s32 new_value) {} +static inline void cpu_latency_qos_remove_request(struct pm_qos_request *req) {} +#endif #ifdef CONFIG_PM enum pm_qos_flags_status __dev_pm_qos_flags(struct device *dev, s32 mask); diff --git a/kernel/power/qos.c b/kernel/power/qos.c index ef73573db43d..32927682bcc4 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -209,6 +209,7 @@ bool pm_qos_update_flags(struct pm_qos_flags *pqf, return prev_value != curr_value; } +#ifdef CONFIG_CPU_IDLE /* Definitions related to the CPU latency QoS. */ static struct pm_qos_constraints cpu_latency_constraints = { @@ -421,6 +422,7 @@ static int __init cpu_latency_qos_init(void) return ret; } late_initcall(cpu_latency_qos_init); +#endif /* CONFIG_CPU_IDLE */ /* Definitions related to the frequency QoS below. */ -- cgit v1.2.3-71-gd317 From 490f561b783dac2c4825e288e6dbbf83481eea34 Mon Sep 17 00:00:00 2001 From: Frederic Weisbecker Date: Mon, 27 Jan 2020 16:41:52 +0100 Subject: context-tracking: Introduce CONFIG_HAVE_TIF_NOHZ A few archs (x86, arm, arm64) don't rely anymore on TIF_NOHZ to call into context tracking on user entry/exit but instead use static keys (or not) to optimize those calls. Ideally every arch should migrate to that behaviour in the long run. Settle a config option to let those archs remove their TIF_NOHZ definitions. Signed-off-by: Frederic Weisbecker Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Andy Lutomirski Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Ralf Baechle Cc: Paul Burton Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: David S. Miller --- arch/Kconfig | 16 +++++++++++----- arch/arm/Kconfig | 1 + arch/arm64/Kconfig | 1 + arch/mips/Kconfig | 1 + arch/powerpc/Kconfig | 1 + arch/sparc/Kconfig | 1 + arch/x86/Kconfig | 1 + kernel/context_tracking.c | 2 ++ 8 files changed, 19 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/arch/Kconfig b/arch/Kconfig index 98de654b79b3..dbf420a9f87b 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -540,11 +540,17 @@ config HAVE_CONTEXT_TRACKING help Provide kernel/user boundaries probes necessary for subsystems that need it, such as userspace RCU extended quiescent state. - Syscalls need to be wrapped inside user_exit()-user_enter() through - the slow path using TIF_NOHZ flag. Exceptions handlers must be - wrapped as well. Irqs are already protected inside - rcu_irq_enter/rcu_irq_exit() but preemption or signal handling on - irq exit still need to be protected. + Syscalls need to be wrapped inside user_exit()-user_enter(), either + optimized behind static key or through the slow path using TIF_NOHZ + flag. Exceptions handlers must be wrapped as well. Irqs are already + protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal + handling on irq exit still need to be protected. + +config HAVE_TIF_NOHZ + bool + help + Arch relies on TIF_NOHZ and syscall slow path to implement context + tracking calls to user_enter()/user_exit(). config HAVE_VIRT_CPU_ACCOUNTING bool diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 97864aabc2a6..38b764cae559 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -72,6 +72,7 @@ config ARM select HAVE_ARM_SMCCC if CPU_V7 select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32 select HAVE_CONTEXT_TRACKING + select HAVE_TIF_NOHZ select HAVE_COPY_THREAD_TLS select HAVE_C_RECORDMCOUNT select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 0b30e884e088..5c945fa3df26 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -140,6 +140,7 @@ config ARM64 select HAVE_CMPXCHG_DOUBLE select HAVE_CMPXCHG_LOCAL select HAVE_CONTEXT_TRACKING + select HAVE_TIF_NOHZ select HAVE_COPY_THREAD_TLS select HAVE_DEBUG_BUGVERBOSE select HAVE_DEBUG_KMEMLEAK diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 797d7f1ad5fe..2589d4760e45 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -51,6 +51,7 @@ config MIPS select HAVE_ASM_MODVERSIONS select HAVE_CBPF_JIT if !64BIT && !CPU_MICROMIPS select HAVE_CONTEXT_TRACKING + select HAVE_TIF_NOHZ select HAVE_COPY_THREAD_TLS select HAVE_C_RECORDMCOUNT select HAVE_DEBUG_KMEMLEAK diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 497b7d0b2d7e..6f40af294685 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -182,6 +182,7 @@ config PPC select HAVE_STACKPROTECTOR if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13) select HAVE_STACKPROTECTOR if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2) select HAVE_CONTEXT_TRACKING if PPC64 + select HAVE_TIF_NOHZ if PPC64 select HAVE_COPY_THREAD_TLS select HAVE_DEBUG_KMEMLEAK select HAVE_DEBUG_STACKOVERFLOW diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index c1dd6dd642f4..9cc9ab04bd99 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -71,6 +71,7 @@ config SPARC64 select HAVE_FTRACE_MCOUNT_RECORD select HAVE_SYSCALL_TRACEPOINTS select HAVE_CONTEXT_TRACKING + select HAVE_TIF_NOHZ select HAVE_DEBUG_KMEMLEAK select IOMMU_HELPER select SPARSE_IRQ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index beea77046f9b..549eed3460c9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -211,6 +211,7 @@ config X86 select HAVE_STACK_VALIDATION if X86_64 select HAVE_RSEQ select HAVE_SYSCALL_TRACEPOINTS + select HAVE_TIF_NOHZ if X86_64 select HAVE_UNSTABLE_SCHED_CLOCK select HAVE_USER_RETURN_NOTIFIER select HAVE_GENERIC_VDSO diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 0296b4bda8f1..ce430885c26c 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -198,11 +198,13 @@ void __init context_tracking_cpu_set(int cpu) if (initialized) return; +#ifdef CONFIG_HAVE_TIF_NOHZ /* * Set TIF_NOHZ to init/0 and let it propagate to all tasks through fork * This assumes that init is the only task at this early boot stage. */ set_tsk_thread_flag(&init_task, TIF_NOHZ); +#endif WARN_ON_ONCE(!tasklist_empty()); initialized = true; -- cgit v1.2.3-71-gd317 From 5d51bee725cc1497352d6b0b604e42a90c680540 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:38:55 +0100 Subject: clocksource: Add common vdso clock mode storage All architectures which use the generic VDSO code have their own storage for the VDSO clock mode. That's pointless and just requires duplicate code. Provide generic storage for it. The new Kconfig symbol is intermediate and will be removed once all architectures are converted over. Signed-off-by: Thomas Gleixner Tested-by: Vincenzo Frascino Reviewed-by: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200207124403.028046322@linutronix.de --- include/linux/clocksource.h | 12 +++++++++++- kernel/time/clocksource.c | 9 +++++++++ kernel/time/vsyscall.c | 10 ++++++++-- lib/vdso/Kconfig | 3 +++ lib/vdso/gettimeofday.c | 13 +++++++++++-- 5 files changed, 42 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 2c4574b517d2..6d5ed1b4d24d 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -23,10 +23,19 @@ struct clocksource; struct module; -#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA +#if defined(CONFIG_ARCH_CLOCKSOURCE_DATA) || \ + defined(CONFIG_GENERIC_VDSO_CLOCK_MODE) #include #endif +enum vdso_clock_mode { + VDSO_CLOCKMODE_NONE, +#ifdef CONFIG_GENERIC_VDSO_CLOCK_MODE + VDSO_ARCH_CLOCKMODES, +#endif + VDSO_CLOCKMODE_MAX, +}; + /** * struct clocksource - hardware abstraction for a free running counter * Provides mostly state-free accessors to the underlying hardware. @@ -97,6 +106,7 @@ struct clocksource { const char *name; struct list_head list; int rating; + enum vdso_clock_mode vdso_clock_mode; unsigned long flags; int (*enable)(struct clocksource *cs); diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 428beb69426a..7cb09c4cf21c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -928,6 +928,15 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq) clocksource_arch_init(cs); +#ifdef CONFIG_GENERIC_VDSO_CLOCK_MODE + if (cs->vdso_clock_mode < 0 || + cs->vdso_clock_mode >= VDSO_CLOCKMODE_MAX) { + pr_warn("clocksource %s registered with invalid VDSO mode %d. Disabling VDSO support.\n", + cs->name, cs->vdso_clock_mode); + cs->vdso_clock_mode = VDSO_CLOCKMODE_NONE; + } +#endif + /* Initialize mult/shift and max_idle_ns */ __clocksource_update_freq_scale(cs, scale, freq); diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c index 9577c89179cd..f9a5178c69bb 100644 --- a/kernel/time/vsyscall.c +++ b/kernel/time/vsyscall.c @@ -71,13 +71,19 @@ void update_vsyscall(struct timekeeper *tk) { struct vdso_data *vdata = __arch_get_k_vdso_data(); struct vdso_timestamp *vdso_ts; + s32 clock_mode; u64 nsec; /* copy vsyscall data */ vdso_write_begin(vdata); - vdata[CS_HRES_COARSE].clock_mode = __arch_get_clock_mode(tk); - vdata[CS_RAW].clock_mode = __arch_get_clock_mode(tk); +#ifdef CONFIG_GENERIC_VDSO_CLOCK_MODE + clock_mode = tk->tkr_mono.clock->vdso_clock_mode; +#else + clock_mode = __arch_get_clock_mode(tk); +#endif + vdata[CS_HRES_COARSE].clock_mode = clock_mode; + vdata[CS_RAW].clock_mode = clock_mode; /* CLOCK_REALTIME also required for time() */ vdso_ts = &vdata[CS_HRES_COARSE].basetime[CLOCK_REALTIME]; diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig index d883ac299508..d9f43c84fcc6 100644 --- a/lib/vdso/Kconfig +++ b/lib/vdso/Kconfig @@ -30,4 +30,7 @@ config GENERIC_VDSO_TIME_NS Selected by architectures which support time namespaces in the VDSO +config GENERIC_VDSO_CLOCK_MODE + bool + endif diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 5804e4e168e7..3f2d8b859130 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -64,10 +65,14 @@ static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, do { seq = vdso_read_begin(vd); + if (IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && + vd->clock_mode == VDSO_CLOCKMODE_NONE) + return -1; cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; - if (unlikely((s64)cycles < 0)) + if (!IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && + unlikely((s64)cycles < 0)) return -1; ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult); @@ -132,10 +137,14 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, } smp_rmb(); + if (IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && + vd->clock_mode == VDSO_CLOCKMODE_NONE) + return -1; cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; - if (unlikely((s64)cycles < 0)) + if (!IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && + unlikely((s64)cycles < 0)) return -1; ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult); -- cgit v1.2.3-71-gd317 From f86fd32db706613fe8d0104057efa6e83e0d7e8f Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:38:59 +0100 Subject: lib/vdso: Cleanup clock mode storage leftovers Now that all architectures are converted to use the generic storage the helpers and conditionals can be removed. Signed-off-by: Thomas Gleixner Tested-by: Vincenzo Frascino Reviewed-by: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200207124403.470699892@linutronix.de --- arch/arm/mm/Kconfig | 1 - arch/arm64/Kconfig | 1 - arch/mips/Kconfig | 1 - arch/x86/Kconfig | 1 - include/asm-generic/vdso/vsyscall.h | 7 ------- include/linux/clocksource.h | 4 ++-- kernel/time/vsyscall.c | 4 ---- lib/vdso/Kconfig | 3 --- lib/vdso/gettimeofday.c | 17 +++++------------ 9 files changed, 7 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/arch/arm/mm/Kconfig b/arch/arm/mm/Kconfig index 865e888bb84f..65e4482e3849 100644 --- a/arch/arm/mm/Kconfig +++ b/arch/arm/mm/Kconfig @@ -900,7 +900,6 @@ config VDSO select GENERIC_TIME_VSYSCALL select GENERIC_VDSO_32 select GENERIC_GETTIMEOFDAY - select GENERIC_VDSO_CLOCK_MODE help Place in the process address space an ELF shared object providing fast implementations of gettimeofday and diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7809d4976269..c6c32fb7f546 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -110,7 +110,6 @@ config ARM64 select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL select GENERIC_GETTIMEOFDAY - select GENERIC_VDSO_CLOCK_MODE select HANDLE_DOMAIN_IRQ select HARDIRQS_SW_RESEND select HAVE_PCI diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 23b5c0578776..654369a7209d 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -37,7 +37,6 @@ config MIPS select GENERIC_SCHED_CLOCK if !CAVIUM_OCTEON_SOC select GENERIC_SMP_IDLE_THREAD select GENERIC_TIME_VSYSCALL - select GENERIC_VDSO_CLOCK_MODE select GUP_GET_PTE_LOW_HIGH if CPU_MIPS32 && PHYS_ADDR_T_64BIT select HANDLE_DOMAIN_IRQ select HAVE_ARCH_COMPILER_H diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 698e9c835cc5..8b995db3d10f 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -125,7 +125,6 @@ config X86 select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL select GENERIC_GETTIMEOFDAY - select GENERIC_VDSO_CLOCK_MODE select GENERIC_VDSO_TIME_NS select GUP_GET_PTE_LOW_HIGH if X86_PAE select HARDLOCKUP_CHECK_TIMESTAMP if X86_64 diff --git a/include/asm-generic/vdso/vsyscall.h b/include/asm-generic/vdso/vsyscall.h index cec543d9e87b..4a28797495d7 100644 --- a/include/asm-generic/vdso/vsyscall.h +++ b/include/asm-generic/vdso/vsyscall.h @@ -18,13 +18,6 @@ static __always_inline bool __arch_update_vdso_data(void) } #endif /* __arch_update_vdso_data */ -#ifndef __arch_get_clock_mode -static __always_inline int __arch_get_clock_mode(struct timekeeper *tk) -{ - return 0; -} -#endif /* __arch_get_clock_mode */ - #ifndef __arch_update_vsyscall static __always_inline void __arch_update_vsyscall(struct vdso_data *vdata, struct timekeeper *tk) diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 6d5ed1b4d24d..7fefe0b21a14 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -24,13 +24,13 @@ struct clocksource; struct module; #if defined(CONFIG_ARCH_CLOCKSOURCE_DATA) || \ - defined(CONFIG_GENERIC_VDSO_CLOCK_MODE) + defined(CONFIG_GENERIC_GETTIMEOFDAY) #include #endif enum vdso_clock_mode { VDSO_CLOCKMODE_NONE, -#ifdef CONFIG_GENERIC_VDSO_CLOCK_MODE +#ifdef CONFIG_GENERIC_GETTIMEOFDAY VDSO_ARCH_CLOCKMODES, #endif VDSO_CLOCKMODE_MAX, diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c index f9a5178c69bb..d31a5ef4ade5 100644 --- a/kernel/time/vsyscall.c +++ b/kernel/time/vsyscall.c @@ -77,11 +77,7 @@ void update_vsyscall(struct timekeeper *tk) /* copy vsyscall data */ vdso_write_begin(vdata); -#ifdef CONFIG_GENERIC_VDSO_CLOCK_MODE clock_mode = tk->tkr_mono.clock->vdso_clock_mode; -#else - clock_mode = __arch_get_clock_mode(tk); -#endif vdata[CS_HRES_COARSE].clock_mode = clock_mode; vdata[CS_RAW].clock_mode = clock_mode; diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig index d9f43c84fcc6..d883ac299508 100644 --- a/lib/vdso/Kconfig +++ b/lib/vdso/Kconfig @@ -30,7 +30,4 @@ config GENERIC_VDSO_TIME_NS Selected by architectures which support time namespaces in the VDSO -config GENERIC_VDSO_CLOCK_MODE - bool - endif diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 3f2d8b859130..00f8d1f1405b 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -65,16 +65,13 @@ static int do_hres_timens(const struct vdso_data *vdns, clockid_t clk, do { seq = vdso_read_begin(vd); - if (IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && - vd->clock_mode == VDSO_CLOCKMODE_NONE) + + if (unlikely(vd->clock_mode == VDSO_CLOCKMODE_NONE)) return -1; + cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; - if (!IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && - unlikely((s64)cycles < 0)) - return -1; - ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult); ns >>= vd->shift; sec = vdso_ts->sec; @@ -137,16 +134,12 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, } smp_rmb(); - if (IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && - vd->clock_mode == VDSO_CLOCKMODE_NONE) + if (unlikely(vd->clock_mode == VDSO_CLOCKMODE_NONE)) return -1; + cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; - if (!IS_ENABLED(CONFIG_GENERIC_VDSO_CLOCK_MODE) && - unlikely((s64)cycles < 0)) - return -1; - ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult); ns >>= vd->shift; sec = vdso_ts->sec; -- cgit v1.2.3-71-gd317 From c7a18100bdffdff440c7291db6e80863fab0461e Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:39:00 +0100 Subject: lib/vdso: Avoid highres update if clocksource is not VDSO capable If the current clocksource is not VDSO capable there is no point in updating the high resolution parts of the VDSO data. Replace the architecture specific check with a check for a VDSO capable clocksource and skip the update if there is none. Signed-off-by: Thomas Gleixner Tested-by: Vincenzo Frascino Reviewed-by: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200207124403.563379423@linutronix.de --- arch/arm/include/asm/vdso/vsyscall.h | 7 ------- include/asm-generic/vdso/vsyscall.h | 7 ------- kernel/time/vsyscall.c | 6 +++--- 3 files changed, 3 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/arch/arm/include/asm/vdso/vsyscall.h b/arch/arm/include/asm/vdso/vsyscall.h index 002f9edd8a2b..47e41ae8ccd0 100644 --- a/arch/arm/include/asm/vdso/vsyscall.h +++ b/arch/arm/include/asm/vdso/vsyscall.h @@ -21,13 +21,6 @@ struct vdso_data *__arm_get_k_vdso_data(void) } #define __arch_get_k_vdso_data __arm_get_k_vdso_data -static __always_inline -bool __arm_update_vdso_data(void) -{ - return cntvct_ok; -} -#define __arch_update_vdso_data __arm_update_vdso_data - static __always_inline void __arm_sync_vdso_data(struct vdso_data *vdata) { diff --git a/include/asm-generic/vdso/vsyscall.h b/include/asm-generic/vdso/vsyscall.h index 4a28797495d7..c835607f78ae 100644 --- a/include/asm-generic/vdso/vsyscall.h +++ b/include/asm-generic/vdso/vsyscall.h @@ -11,13 +11,6 @@ static __always_inline struct vdso_data *__arch_get_k_vdso_data(void) } #endif /* __arch_get_k_vdso_data */ -#ifndef __arch_update_vdso_data -static __always_inline bool __arch_update_vdso_data(void) -{ - return true; -} -#endif /* __arch_update_vdso_data */ - #ifndef __arch_update_vsyscall static __always_inline void __arch_update_vsyscall(struct vdso_data *vdata, struct timekeeper *tk) diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c index d31a5ef4ade5..54ce6eb2ca36 100644 --- a/kernel/time/vsyscall.c +++ b/kernel/time/vsyscall.c @@ -105,10 +105,10 @@ void update_vsyscall(struct timekeeper *tk) WRITE_ONCE(vdata[CS_HRES_COARSE].hrtimer_res, hrtimer_resolution); /* - * Architectures can opt out of updating the high resolution part - * of the VDSO. + * If the current clocksource is not VDSO capable, then spare the + * update of the high reolution parts. */ - if (__arch_update_vdso_data()) + if (clock_mode != VDSO_CLOCKMODE_NONE) update_vdso_data(vdata, tk); __arch_update_vsyscall(vdata, tk); -- cgit v1.2.3-71-gd317 From 2d6b01bd88ccabba06d342ef80eaab6b39d12497 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 7 Feb 2020 13:39:01 +0100 Subject: lib/vdso: Move VCLOCK_TIMENS to vdso_clock_modes Move the time namespace indicator clock mode to the other ones for consistency sake. Signed-off-by: Thomas Gleixner Reviewed-by: Vincenzo Frascino Link: https://lkml.kernel.org/r/20200207124403.656097274@linutronix.de --- include/linux/clocksource.h | 3 +++ include/vdso/datapage.h | 2 -- kernel/time/namespace.c | 7 ++++--- lib/vdso/gettimeofday.c | 18 ++++++++++-------- 4 files changed, 17 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 7fefe0b21a14..02e3282719bd 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -34,6 +34,9 @@ enum vdso_clock_mode { VDSO_ARCH_CLOCKMODES, #endif VDSO_CLOCKMODE_MAX, + + /* Indicator for time namespace VDSO */ + VDSO_CLOCKMODE_TIMENS = INT_MAX }; /** diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h index c5f347cc5e55..30c4cb0428d1 100644 --- a/include/vdso/datapage.h +++ b/include/vdso/datapage.h @@ -21,8 +21,6 @@ #define CS_RAW 1 #define CS_BASES (CS_RAW + 1) -#define VCLOCK_TIMENS UINT_MAX - /** * struct vdso_timestamp - basetime per clock_id * @sec: seconds diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c index 12858507d75a..e6ba064ce773 100644 --- a/kernel/time/namespace.c +++ b/kernel/time/namespace.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -172,8 +173,8 @@ static struct timens_offset offset_from_ts(struct timespec64 off) * for vdso_data->clock_mode is a non-issue. The task is spin waiting for the * update to finish and for 'seq' to become even anyway. * - * Timens page has vdso_data->clock_mode set to VCLOCK_TIMENS which enforces - * the time namespace handling path. + * Timens page has vdso_data->clock_mode set to VDSO_CLOCKMODE_TIMENS which + * enforces the time namespace handling path. */ static void timens_setup_vdso_data(struct vdso_data *vdata, struct time_namespace *ns) @@ -183,7 +184,7 @@ static void timens_setup_vdso_data(struct vdso_data *vdata, struct timens_offset boottime = offset_from_ts(ns->offsets.boottime); vdata->seq = 1; - vdata->clock_mode = VCLOCK_TIMENS; + vdata->clock_mode = VDSO_CLOCKMODE_TIMENS; offset[CLOCK_MONOTONIC] = monotonic; offset[CLOCK_MONOTONIC_RAW] = monotonic; offset[CLOCK_MONOTONIC_COARSE] = monotonic; diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index 00f8d1f1405b..a76ac8d17c5f 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -116,10 +116,10 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, do { /* - * Open coded to handle VCLOCK_TIMENS. Time namespace + * Open coded to handle VDSO_CLOCKMODE_TIMENS. Time namespace * enabled tasks have a special VVAR page installed which * has vd->seq set to 1 and vd->clock_mode set to - * VCLOCK_TIMENS. For non time namespace affected tasks + * VDSO_CLOCKMODE_TIMENS. For non time namespace affected tasks * this does not affect performance because if vd->seq is * odd, i.e. a concurrent update is in progress the extra * check for vd->clock_mode is just a few extra @@ -128,7 +128,7 @@ static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk, */ while (unlikely((seq = READ_ONCE(vd->seq)) & 1)) { if (IS_ENABLED(CONFIG_TIME_NS) && - vd->clock_mode == VCLOCK_TIMENS) + vd->clock_mode == VDSO_CLOCKMODE_TIMENS) return do_hres_timens(vd, clk, ts); cpu_relax(); } @@ -200,12 +200,12 @@ static __always_inline int do_coarse(const struct vdso_data *vd, clockid_t clk, do { /* - * Open coded to handle VCLOCK_TIMENS. See comment in + * Open coded to handle VDSO_CLOCK_TIMENS. See comment in * do_hres(). */ while ((seq = READ_ONCE(vd->seq)) & 1) { if (IS_ENABLED(CONFIG_TIME_NS) && - vd->clock_mode == VCLOCK_TIMENS) + vd->clock_mode == VDSO_CLOCKMODE_TIMENS) return do_coarse_timens(vd, clk, ts); cpu_relax(); } @@ -292,7 +292,7 @@ __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz) if (unlikely(tz != NULL)) { if (IS_ENABLED(CONFIG_TIME_NS) && - vd->clock_mode == VCLOCK_TIMENS) + vd->clock_mode == VDSO_CLOCKMODE_TIMENS) vd = __arch_get_timens_vdso_data(); tz->tz_minuteswest = vd[CS_HRES_COARSE].tz_minuteswest; @@ -308,7 +308,8 @@ static __maybe_unused __kernel_old_time_t __cvdso_time(__kernel_old_time_t *time const struct vdso_data *vd = __arch_get_vdso_data(); __kernel_old_time_t t; - if (IS_ENABLED(CONFIG_TIME_NS) && vd->clock_mode == VCLOCK_TIMENS) + if (IS_ENABLED(CONFIG_TIME_NS) && + vd->clock_mode == VDSO_CLOCKMODE_TIMENS) vd = __arch_get_timens_vdso_data(); t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec); @@ -332,7 +333,8 @@ int __cvdso_clock_getres_common(clockid_t clock, struct __kernel_timespec *res) if (unlikely((u32) clock >= MAX_CLOCKS)) return -1; - if (IS_ENABLED(CONFIG_TIME_NS) && vd->clock_mode == VCLOCK_TIMENS) + if (IS_ENABLED(CONFIG_TIME_NS) && + vd->clock_mode == VDSO_CLOCKMODE_TIMENS) vd = __arch_get_timens_vdso_data(); /* -- cgit v1.2.3-71-gd317 From 6e317c32fd39a13e4854a27958d5e35d15d196be Mon Sep 17 00:00:00 2001 From: Alexander Popov Date: Sat, 18 Jan 2020 01:59:00 +0300 Subject: timer: Improve the comment describing schedule_timeout() When working commit 6dcd5d7a7a29c1e, a mistake was noticed by Linus: schedule_timeout() was called without setting the task state to anything particular. It calls the scheduler, but doesn't delay anything, because the task stays runnable. That happens because sched_submit_work() does nothing for tasks in TASK_RUNNING state. That turned out to be the intended behavior. Adding a WARN() is not useful as the task could be woken up right after setting the state and before reaching schedule_timeout(). Improve the comment about schedule_timeout() and describe that more explicitly. Signed-off-by: Alexander Popov Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200117225900.16340-1-alex.popov@linux.com --- kernel/time/timer.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 4820823515e9..cb34fac9d9f7 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1828,21 +1828,23 @@ static void process_timeout(struct timer_list *t) * schedule_timeout - sleep until timeout * @timeout: timeout value in jiffies * - * Make the current task sleep until @timeout jiffies have - * elapsed. The routine will return immediately unless - * the current task state has been set (see set_current_state()). + * Make the current task sleep until @timeout jiffies have elapsed. + * The function behavior depends on the current task state + * (see also set_current_state() description): * - * You can set the task state as follows - + * %TASK_RUNNING - the scheduler is called, but the task does not sleep + * at all. That happens because sched_submit_work() does nothing for + * tasks in %TASK_RUNNING state. * * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to * pass before the routine returns unless the current task is explicitly - * woken up, (e.g. by wake_up_process())". + * woken up, (e.g. by wake_up_process()). * * %TASK_INTERRUPTIBLE - the routine may return early if a signal is * delivered to the current task or the current task is explicitly woken * up. * - * The current task state is guaranteed to be TASK_RUNNING when this + * The current task state is guaranteed to be %TASK_RUNNING when this * routine returns. * * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule @@ -1850,7 +1852,7 @@ static void process_timeout(struct timer_list *t) * value will be %MAX_SCHEDULE_TIMEOUT. * * Returns 0 when the timer has expired otherwise the remaining time in - * jiffies will be returned. In all cases the return value is guaranteed + * jiffies will be returned. In all cases the return value is guaranteed * to be non-negative. */ signed long __sched schedule_timeout(signed long timeout) -- cgit v1.2.3-71-gd317 From 5fb1c2a5bbf79ccca8d17cf97f66085be5808027 Mon Sep 17 00:00:00 2001 From: Amol Grover Date: Sun, 16 Feb 2020 13:13:30 +0530 Subject: posix-timers: Pass lockdep expression to RCU lists head is traversed using hlist_for_each_entry_rcu outside an RCU read-side critical section but under the protection of hash_lock. Hence, add corresponding lockdep expression to silence false-positive lockdep warnings, and harden RCU lists. [ tglx: Removed the macro and put the condition right where it's used ] Signed-off-by: Amol Grover Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200216074330.GA14025@workstation-portable --- kernel/time/posix-timers.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c index ff0eb30de346..07709ac30439 100644 --- a/kernel/time/posix-timers.c +++ b/kernel/time/posix-timers.c @@ -121,7 +121,8 @@ static struct k_itimer *__posix_timers_find(struct hlist_head *head, { struct k_itimer *timer; - hlist_for_each_entry_rcu(timer, head, t_hash) { + hlist_for_each_entry_rcu(timer, head, t_hash, + lockdep_is_held(&hash_lock)) { if ((timer->it_signal == sig) && (timer->it_id == id)) return timer; } -- cgit v1.2.3-71-gd317 From 0f74226649fb2875a91b68f3750f55220aa73425 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Thu, 13 Feb 2020 09:14:09 -0600 Subject: kernel: module: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Jessica Yu --- kernel/module.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/module.c b/kernel/module.c index 33569a01d6e1..b88ec9cd2a7f 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1515,7 +1515,7 @@ struct module_sect_attr { struct module_sect_attrs { struct attribute_group grp; unsigned int nsections; - struct module_sect_attr attrs[0]; + struct module_sect_attr attrs[]; }; static ssize_t module_sect_show(struct module_attribute *mattr, @@ -1608,7 +1608,7 @@ static void remove_sect_attrs(struct module *mod) struct module_notes_attrs { struct kobject *dir; unsigned int notes; - struct bin_attribute attrs[0]; + struct bin_attribute attrs[]; }; static ssize_t module_notes_read(struct file *filp, struct kobject *kobj, -- cgit v1.2.3-71-gd317 From b80b033bedae68dae8fc703ab8a69811ce678f2e Mon Sep 17 00:00:00 2001 From: Song Liu Date: Fri, 14 Feb 2020 15:41:46 -0800 Subject: bpf: Allow bpf_perf_event_read_value in all BPF programs bpf_perf_event_read_value() is NMI safe. Enable it for all BPF programs. This can be used in fentry/fexit to profile BPF program and individual kernel function with hardware counters. Signed-off-by: Song Liu Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200214234146.2910011-1-songliubraving@fb.com --- kernel/trace/bpf_trace.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 19e793aa441a..4ddd5ac46094 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -843,6 +843,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_send_signal_proto; case BPF_FUNC_send_signal_thread: return &bpf_send_signal_thread_proto; + case BPF_FUNC_perf_event_read_value: + return &bpf_perf_event_read_value_proto; default: return NULL; } @@ -858,8 +860,6 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_stackid_proto; case BPF_FUNC_get_stack: return &bpf_get_stack_proto; - case BPF_FUNC_perf_event_read_value: - return &bpf_perf_event_read_value_proto; #ifdef CONFIG_BPF_KPROBE_OVERRIDE case BPF_FUNC_override_return: return &bpf_override_return_proto; -- cgit v1.2.3-71-gd317 From fff7b64355eac6e29b50229ad1512315bc04b44e Mon Sep 17 00:00:00 2001 From: Daniel Xu Date: Mon, 17 Feb 2020 19:04:31 -0800 Subject: bpf: Add bpf_read_branch_records() helper Branch records are a CPU feature that can be configured to record certain branches that are taken during code execution. This data is particularly interesting for profile guided optimizations. perf has had branch record support for a while but the data collection can be a bit coarse grained. We (Facebook) have seen in experiments that associating metadata with branch records can improve results (after postprocessing). We generally use bpf_probe_read_*() to get metadata out of userspace. That's why bpf support for branch records is useful. Aside from this particular use case, having branch data available to bpf progs can be useful to get stack traces out of userspace applications that omit frame pointers. Signed-off-by: Daniel Xu Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz --- include/uapi/linux/bpf.h | 25 ++++++++++++++++++++++++- kernel/trace/bpf_trace.c | 41 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index f1d74a2bd234..a7e59756853f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2892,6 +2892,25 @@ union bpf_attr { * Obtain the 64bit jiffies * Return * The 64 bit jiffies + * + * int bpf_read_branch_records(struct bpf_perf_event_data *ctx, void *buf, u32 size, u64 flags) + * Description + * For an eBPF program attached to a perf event, retrieve the + * branch records (struct perf_branch_entry) associated to *ctx* + * and store it in the buffer pointed by *buf* up to size + * *size* bytes. + * Return + * On success, number of bytes written to *buf*. On error, a + * negative value. + * + * The *flags* can be set to **BPF_F_GET_BRANCH_RECORDS_SIZE** to + * instead return the number of bytes required to store all the + * branch entries. If this flag is set, *buf* may be NULL. + * + * **-EINVAL** if arguments invalid or **size** not a multiple + * of sizeof(struct perf_branch_entry). + * + * **-ENOENT** if architecture does not support branch records. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3012,7 +3031,8 @@ union bpf_attr { FN(probe_read_kernel_str), \ FN(tcp_send_ack), \ FN(send_signal_thread), \ - FN(jiffies64), + FN(jiffies64), \ + FN(read_branch_records), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3091,6 +3111,9 @@ enum bpf_func_id { /* BPF_FUNC_sk_storage_get flags */ #define BPF_SK_STORAGE_GET_F_CREATE (1ULL << 0) +/* BPF_FUNC_read_branch_records flags. */ +#define BPF_F_GET_BRANCH_RECORDS_SIZE (1ULL << 0) + /* Mode for BPF_FUNC_skb_adjust_room helper. */ enum bpf_adj_room_mode { BPF_ADJ_ROOM_NET, diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 4ddd5ac46094..b8661bd0d028 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1028,6 +1028,45 @@ static const struct bpf_func_proto bpf_perf_prog_read_value_proto = { .arg3_type = ARG_CONST_SIZE, }; +BPF_CALL_4(bpf_read_branch_records, struct bpf_perf_event_data_kern *, ctx, + void *, buf, u32, size, u64, flags) +{ +#ifndef CONFIG_X86 + return -ENOENT; +#else + static const u32 br_entry_size = sizeof(struct perf_branch_entry); + struct perf_branch_stack *br_stack = ctx->data->br_stack; + u32 to_copy; + + if (unlikely(flags & ~BPF_F_GET_BRANCH_RECORDS_SIZE)) + return -EINVAL; + + if (unlikely(!br_stack)) + return -EINVAL; + + if (flags & BPF_F_GET_BRANCH_RECORDS_SIZE) + return br_stack->nr * br_entry_size; + + if (!buf || (size % br_entry_size != 0)) + return -EINVAL; + + to_copy = min_t(u32, br_stack->nr * br_entry_size, size); + memcpy(buf, br_stack->entries, to_copy); + + return to_copy; +#endif +} + +static const struct bpf_func_proto bpf_read_branch_records_proto = { + .func = bpf_read_branch_records, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM_OR_NULL, + .arg3_type = ARG_CONST_SIZE_OR_ZERO, + .arg4_type = ARG_ANYTHING, +}; + static const struct bpf_func_proto * pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -1040,6 +1079,8 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_stack_proto_tp; case BPF_FUNC_perf_prog_read_value: return &bpf_perf_prog_read_value_proto; + case BPF_FUNC_read_branch_records: + return &bpf_read_branch_records_proto; default: return tracing_func_proto(func_id, prog); } -- cgit v1.2.3-71-gd317 From 82e0516ce3a147365a5dd2a9bedd5ba43a18663d Mon Sep 17 00:00:00 2001 From: Scott Wood Date: Mon, 3 Feb 2020 19:35:58 -0500 Subject: sched/core: Remove duplicate assignment in sched_tick_remote() A redundant "curr = rq->curr" was added; remove it. Fixes: ebc0f83c78a2 ("timers/nohz: Update NOHZ load in remote tick") Signed-off-by: Scott Wood Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com --- kernel/sched/core.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 45f79bcc3146..377ec26e9159 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3683,7 +3683,6 @@ static void sched_tick_remote(struct work_struct *work) if (cpu_is_offline(cpu)) goto out_unlock; - curr = rq->curr; update_rq_clock(rq); if (!is_idle_task(curr)) { -- cgit v1.2.3-71-gd317 From b7a331615d254191e7f5f0e35aec9adcd6acdc54 Mon Sep 17 00:00:00 2001 From: Morten Rasmussen Date: Thu, 6 Feb 2020 19:19:54 +0000 Subject: sched/fair: Add asymmetric CPU capacity wakeup scan Issue ===== On asymmetric CPU capacity topologies, we currently rely on wake_cap() to drive select_task_rq_fair() towards either: - its slow-path (find_idlest_cpu()) if either the previous or current (waking) CPU has too little capacity for the waking task - its fast-path (select_idle_sibling()) otherwise Commit: 3273163c6775 ("sched/fair: Let asymmetric CPU configurations balance at wake-up") points out that this relies on the assumption that "[...]the CPU capacities within an SD_SHARE_PKG_RESOURCES domain (sd_llc) are homogeneous". This assumption no longer holds on newer generations of big.LITTLE systems (DynamIQ), which can accommodate CPUs of different compute capacity within a single LLC domain. To hopefully paint a better picture, a regular big.LITTLE topology would look like this: +---------+ +---------+ | L2 | | L2 | +----+----+ +----+----+ |CPU0|CPU1| |CPU2|CPU3| +----+----+ +----+----+ ^^^ ^^^ LITTLEs bigs which would result in the following scheduler topology: DIE [ ] <- sd_asym_cpucapacity MC [ ] [ ] <- sd_llc 0 1 2 3 Conversely, a DynamIQ topology could look like: +-------------------+ | L3 | +----+----+----+----+ | L2 | L2 | L2 | L2 | +----+----+----+----+ |CPU0|CPU1|CPU2|CPU3| +----+----+----+----+ ^^^^^ ^^^^^ LITTLEs bigs which would result in the following scheduler topology: MC [ ] <- sd_llc, sd_asym_cpucapacity 0 1 2 3 What this means is that, on DynamIQ systems, we could pass the wake_cap() test (IOW presume the waking task fits on the CPU capacities of some LLC domain), thus go through select_idle_sibling(). This function operates on an LLC domain, which here spans both bigs and LITTLEs, so it could very well pick a CPU of too small capacity for the task, despite there being fitting idle CPUs - it very much depends on the CPU iteration order, on which we have absolutely no guarantees capacity-wise. Implementation ============== Introduce yet another select_idle_sibling() helper function that takes CPU capacity into account. The policy is to pick the first idle CPU which is big enough for the task (task_util * margin < cpu_capacity). If no idle CPU is big enough, we pick the idle one with the highest capacity. Unlike other select_idle_sibling() helpers, this one operates on the sd_asym_cpucapacity sched_domain pointer, which is guaranteed to span all known CPU capacities in the system. As such, this will work for both "legacy" big.LITTLE (LITTLEs & bigs split at MC, joined at DIE) and for newer DynamIQ systems (e.g. LITTLEs and bigs in the same MC domain). Note that this limits the scope of select_idle_sibling() to select_idle_capacity() for asymmetric CPU capacity systems - the LLC domain will not be scanned, and no further heuristic will be applied. Signed-off-by: Morten Rasmussen Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Reviewed-by: Quentin Perret Link: https://lkml.kernel.org/r/20200206191957.12325-2-valentin.schneider@arm.com --- kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 56 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1a0ce83e835a..6fb47a2f7383 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5896,6 +5896,40 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t return cpu; } +/* + * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which + * the task fits. If no CPU is big enough, but there are idle ones, try to + * maximize capacity. + */ +static int +select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target) +{ + unsigned long best_cap = 0; + int cpu, best_cpu = -1; + struct cpumask *cpus; + + sync_entity_load_avg(&p->se); + + cpus = this_cpu_cpumask_var_ptr(select_idle_mask); + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); + + for_each_cpu_wrap(cpu, cpus, target) { + unsigned long cpu_cap = capacity_of(cpu); + + if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu)) + continue; + if (task_fits_capacity(p, cpu_cap)) + return cpu; + + if (cpu_cap > best_cap) { + best_cap = cpu_cap; + best_cpu = cpu; + } + } + + return best_cpu; +} + /* * Try and locate an idle core/thread in the LLC cache domain. */ @@ -5904,6 +5938,28 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target) struct sched_domain *sd; int i, recent_used_cpu; + /* + * For asymmetric CPU capacity systems, our domain of interest is + * sd_asym_cpucapacity rather than sd_llc. + */ + if (static_branch_unlikely(&sched_asym_cpucapacity)) { + sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target)); + /* + * On an asymmetric CPU capacity system where an exclusive + * cpuset defines a symmetric island (i.e. one unique + * capacity_orig value through the cpuset), the key will be set + * but the CPUs within that cpuset will not have a domain with + * SD_ASYM_CPUCAPACITY. These should follow the usual symmetric + * capacity path. + */ + if (!sd) + goto symmetric; + + i = select_idle_capacity(p, sd, target); + return ((unsigned)i < nr_cpumask_bits) ? i : target; + } + +symmetric: if (available_idle_cpu(target) || sched_idle_cpu(target)) return target; -- cgit v1.2.3-71-gd317 From a526d466798d65cff120ee00ef92931075bf3769 Mon Sep 17 00:00:00 2001 From: Morten Rasmussen Date: Thu, 6 Feb 2020 19:19:55 +0000 Subject: sched/topology: Remove SD_BALANCE_WAKE on asymmetric capacity systems SD_BALANCE_WAKE was previously added to lower sched_domain levels on asymmetric CPU capacity systems by commit: 9ee1cda5ee25 ("sched/core: Enable SD_BALANCE_WAKE for asymmetric capacity systems") to enable the use of find_idlest_cpu() and friends to find an appropriate CPU for tasks. That responsibility has now been shifted to select_idle_sibling() and friends, and hence the flag can be removed. Note that this causes asymmetric CPU capacity systems to no longer enter the slow wakeup path (find_idlest_cpu()) on wakeups - only on execs and forks (which is aligned with all other mainline topologies). Signed-off-by: Morten Rasmussen [Changelog tweaks] Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Reviewed-by: Quentin Perret Link: https://lkml.kernel.org/r/20200206191957.12325-3-valentin.schneider@arm.com --- kernel/sched/topology.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index dfb64c08a407..00911884b7e7 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1374,18 +1374,9 @@ sd_init(struct sched_domain_topology_level *tl, * Convert topological properties into behaviour. */ - if (sd->flags & SD_ASYM_CPUCAPACITY) { - struct sched_domain *t = sd; - - /* - * Don't attempt to spread across CPUs of different capacities. - */ - if (sd->child) - sd->child->flags &= ~SD_PREFER_SIBLING; - - for_each_lower_domain(t) - t->flags |= SD_BALANCE_WAKE; - } + /* Don't attempt to spread across CPUs of different capacities. */ + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child) + sd->child->flags &= ~SD_PREFER_SIBLING; if (sd->flags & SD_SHARE_CPUCAPACITY) { sd->imbalance_pct = 110; -- cgit v1.2.3-71-gd317 From f8459197e75b045d8d1d87b9856486b39e375721 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Thu, 6 Feb 2020 19:19:56 +0000 Subject: sched/core: Remove for_each_lower_domain() The last remaining user of this macro has just been removed, get rid of it. Suggested-by: Dietmar Eggemann Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Reviewed-by: Quentin Perret Link: https://lkml.kernel.org/r/20200206191957.12325-4-valentin.schneider@arm.com --- kernel/sched/sched.h | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0844e81964e5..878910e8b299 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1337,8 +1337,6 @@ extern void sched_ttwu_pending(void); for (__sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); \ __sd; __sd = __sd->parent) -#define for_each_lower_domain(sd) for (; sd; sd = sd->child) - /** * highest_flag_domain - Return highest sched_domain containing flag. * @cpu: The CPU whose highest level of sched domain is to -- cgit v1.2.3-71-gd317 From 000619680c3714020ce9db17eef6a4a7ce2dc28b Mon Sep 17 00:00:00 2001 From: Morten Rasmussen Date: Thu, 6 Feb 2020 19:19:57 +0000 Subject: sched/fair: Remove wake_cap() Capacity-awareness in the wake-up path previously involved disabling wake_affine in certain scenarios. We have just made select_idle_sibling() capacity-aware, so this isn't needed anymore. Remove wake_cap() entirely. Signed-off-by: Morten Rasmussen [Changelog tweaks] Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) [Changelog tweaks] Signed-off-by: Ingo Molnar Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200206191957.12325-5-valentin.schneider@arm.com --- kernel/sched/fair.c | 30 +----------------------------- 1 file changed, 1 insertion(+), 29 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6fb47a2f7383..a7e11b1bb64c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6145,33 +6145,6 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p) return min_t(unsigned long, util, capacity_orig_of(cpu)); } -/* - * Disable WAKE_AFFINE in the case where task @p doesn't fit in the - * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu. - * - * In that case WAKE_AFFINE doesn't make sense and we'll let - * BALANCE_WAKE sort things out. - */ -static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) -{ - long min_cap, max_cap; - - if (!static_branch_unlikely(&sched_asym_cpucapacity)) - return 0; - - min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu)); - max_cap = cpu_rq(cpu)->rd->max_cpu_capacity; - - /* Minimum capacity is close to max, no need to abort wake_affine */ - if (max_cap - min_cap < max_cap >> 3) - return 0; - - /* Bring task utilization in sync with prev_cpu */ - sync_entity_load_avg(&p->se); - - return !task_fits_capacity(p, min_cap); -} - /* * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued) * to @dst_cpu. @@ -6436,8 +6409,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f new_cpu = prev_cpu; } - want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) && - cpumask_test_cpu(cpu, p->cpus_ptr); + want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr); } rcu_read_lock(); -- cgit v1.2.3-71-gd317 From 82dd8419e225958f01708cda8a3fc6c3c5356228 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 15 Dec 2019 11:38:57 -0800 Subject: rcu: Warn on for_each_leaf_node_cpu_mask() from non-leaf The for_each_leaf_node_cpu_mask() and for_each_leaf_node_possible_cpu() macros must be invoked only on leaf rcu_node structures. Failing to abide by this restriction can result in infinite loops on systems with more than 64 CPUs (or for more than 32 CPUs on 32-bit systems). This commit therefore adds WARN_ON_ONCE() calls to make misuse of these two macros easier to debug. Reported-by: Qian Cai Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu.h | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 05f936ed167a..f6ce173e35f6 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -325,7 +325,8 @@ static inline void rcu_init_levelspread(int *levelspread, const int *levelcnt) * Iterate over all possible CPUs in a leaf RCU node. */ #define for_each_leaf_node_possible_cpu(rnp, cpu) \ - for ((cpu) = cpumask_next((rnp)->grplo - 1, cpu_possible_mask); \ + for (WARN_ON_ONCE(!rcu_is_leaf_node(rnp)), \ + (cpu) = cpumask_next((rnp)->grplo - 1, cpu_possible_mask); \ (cpu) <= rnp->grphi; \ (cpu) = cpumask_next((cpu), cpu_possible_mask)) @@ -335,7 +336,8 @@ static inline void rcu_init_levelspread(int *levelspread, const int *levelcnt) #define rcu_find_next_bit(rnp, cpu, mask) \ ((rnp)->grplo + find_next_bit(&(mask), BITS_PER_LONG, (cpu))) #define for_each_leaf_node_cpu_mask(rnp, cpu, mask) \ - for ((cpu) = rcu_find_next_bit((rnp), 0, (mask)); \ + for (WARN_ON_ONCE(!rcu_is_leaf_node(rnp)), \ + (cpu) = rcu_find_next_bit((rnp), 0, (mask)); \ (cpu) <= rnp->grphi; \ (cpu) = rcu_find_next_bit((rnp), (cpu) + 1 - (rnp->grplo), (mask))) -- cgit v1.2.3-71-gd317 From 24bb9eccf7ff335c16c2970ac7cd5c32a92821a6 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 22 Dec 2019 19:55:50 -0800 Subject: rcu: Fix exp_funnel_lock()/rcu_exp_wait_wake() datarace The rcu_node structure's ->exp_seq_rq field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index dcbd75791f39..d7e04849c7ab 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -589,7 +589,7 @@ static void rcu_exp_wait_wake(unsigned long s) spin_lock(&rnp->exp_lock); /* Recheck, avoid hang in case someone just arrived. */ if (ULONG_CMP_LT(rnp->exp_seq_rq, s)) - rnp->exp_seq_rq = s; + WRITE_ONCE(rnp->exp_seq_rq, s); spin_unlock(&rnp->exp_lock); } smp_mb(); /* All above changes before wakeup. */ -- cgit v1.2.3-71-gd317 From 8a7e8f51714004112a0bbbf751f3dd0fcbbbc983 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 2 Jan 2020 16:48:05 -0800 Subject: rcu: Provide debug symbols and line numbers in KCSAN runs This commit adds "-g -fno-omit-frame-pointer" to ease interpretation of KCSAN output, but only for CONFIG_KCSAN=y kerrnels. Signed-off-by: Paul E. McKenney --- kernel/rcu/Makefile | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile index 82d5fba48b2f..f91f2c2cf138 100644 --- a/kernel/rcu/Makefile +++ b/kernel/rcu/Makefile @@ -3,6 +3,10 @@ # and is generally not a function of system call inputs. KCOV_INSTRUMENT := n +ifeq ($(CONFIG_KCSAN),y) +KBUILD_CFLAGS += -g -fno-omit-frame-pointer +endif + obj-y += update.o sync.o obj-$(CONFIG_TREE_SRCU) += srcutree.o obj-$(CONFIG_TINY_SRCU) += srcutiny.o -- cgit v1.2.3-71-gd317 From 7672d647ddae37d2ea267159950bcc311e962434 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 11:38:51 -0800 Subject: rcu: Add WRITE_ONCE() to rcu_node ->qsmask update The rcu_node structure's ->qsmask field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d91c9156fab2..bb57c24dbe9a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1881,7 +1881,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp, WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */ WARN_ON_ONCE(!rcu_is_leaf_node(rnp) && rcu_preempt_blocked_readers_cgp(rnp)); - rnp->qsmask &= ~mask; + WRITE_ONCE(rnp->qsmask, rnp->qsmask & ~mask); trace_rcu_quiescent_state_report(rcu_state.name, rnp->gp_seq, mask, rnp->qsmask, rnp->level, rnp->grplo, rnp->grphi, -- cgit v1.2.3-71-gd317 From b0c18c87730a4de6da0303fa99aea43e814233f9 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 12:12:06 -0800 Subject: rcu: Add WRITE_ONCE to rcu_node ->exp_seq_rq store The rcu_node structure's ->exp_seq_rq field is read locklessly, so this commit adds the WRITE_ONCE() to a load in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_exp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index d7e04849c7ab..85b009e05637 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -314,7 +314,7 @@ static bool exp_funnel_lock(unsigned long s) sync_exp_work_done(s)); return true; } - rnp->exp_seq_rq = s; /* Followers can wait on us. */ + WRITE_ONCE(rnp->exp_seq_rq, s); /* Followers can wait on us. */ spin_unlock(&rnp->exp_lock); trace_rcu_exp_funnel_lock(rcu_state.name, rnp->level, rnp->grplo, rnp->grphi, TPS("nxtlvl")); -- cgit v1.2.3-71-gd317 From 0937d045732b5dd5e5df1fb6355d5f4e01f3c4d7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 14:53:31 -0800 Subject: rcu: Add READ_ONCE() to rcu_node ->gp_seq The rcu_node structure's ->gp_seq field is read locklessly, so this commit adds the READ_ONCE() to several loads in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Not appropriate for backporting because this affects only tracing and warnings. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index bb57c24dbe9a..236692d7762f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1126,8 +1126,9 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) static void trace_rcu_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, unsigned long gp_seq_req, const char *s) { - trace_rcu_future_grace_period(rcu_state.name, rnp->gp_seq, gp_seq_req, - rnp->level, rnp->grplo, rnp->grphi, s); + trace_rcu_future_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), + gp_seq_req, rnp->level, + rnp->grplo, rnp->grphi, s); } /* @@ -1904,7 +1905,7 @@ static void rcu_report_qs_rnp(unsigned long mask, struct rcu_node *rnp, rnp_c = rnp; rnp = rnp->parent; raw_spin_lock_irqsave_rcu_node(rnp, flags); - oldmask = rnp_c->qsmask; + oldmask = READ_ONCE(rnp_c->qsmask); } /* @@ -2052,7 +2053,7 @@ int rcutree_dying_cpu(unsigned int cpu) return 0; blkd = !!(rnp->qsmask & rdp->grpmask); - trace_rcu_grace_period(rcu_state.name, rnp->gp_seq, + trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), blkd ? TPS("cpuofl") : TPS("cpuofl-bgp")); return 0; } -- cgit v1.2.3-71-gd317 From 2906d2154cd6ad6e4452cc25315d3cea7bb7f2d7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 15:17:12 -0800 Subject: rcu: Add WRITE_ONCE() to rcu_state ->gp_req_activity The rcu_state structure's ->gp_req_activity field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 236692d7762f..944390981aed 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1200,7 +1200,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, } trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("Startedroot")); WRITE_ONCE(rcu_state.gp_flags, rcu_state.gp_flags | RCU_GP_FLAG_INIT); - rcu_state.gp_req_activity = jiffies; + WRITE_ONCE(rcu_state.gp_req_activity, jiffies); if (!rcu_state.gp_kthread) { trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("NoGPkthread")); goto unlock_out; @@ -1775,7 +1775,7 @@ static void rcu_gp_cleanup(void) rcu_segcblist_is_offloaded(&rdp->cblist); if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) { WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); - rcu_state.gp_req_activity = jiffies; + WRITE_ONCE(rcu_state.gp_req_activity, jiffies); trace_rcu_grace_period(rcu_state.name, READ_ONCE(rcu_state.gp_seq), TPS("newreq")); -- cgit v1.2.3-71-gd317 From 105abf82b0a62664edabcd4c5b4d9414c74ef399 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 15:44:23 -0800 Subject: rcu: Add WRITE_ONCE() to rcu_node ->qsmaskinitnext The rcu_state structure's ->qsmaskinitnext field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely for systems not doing incessant CPU-hotplug operations. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 944390981aed..346321aa3c90 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3379,7 +3379,7 @@ void rcu_cpu_starting(unsigned int cpu) rnp = rdp->mynode; mask = rdp->grpmask; raw_spin_lock_irqsave_rcu_node(rnp, flags); - rnp->qsmaskinitnext |= mask; + WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask); oldmask = rnp->expmaskinitnext; rnp->expmaskinitnext |= mask; oldmask ^= rnp->expmaskinitnext; @@ -3432,7 +3432,7 @@ void rcu_report_dead(unsigned int cpu) rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); raw_spin_lock_irqsave_rcu_node(rnp, flags); } - rnp->qsmaskinitnext &= ~mask; + WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); raw_spin_unlock(&rcu_state.ofl_lock); -- cgit v1.2.3-71-gd317 From 0050c7b2d27c3cc126df59bd8094fb3d25b00ade Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 15:59:12 -0800 Subject: locking/rtmutex: rcu: Add WRITE_ONCE() to rt_mutex ->owner The rt_mutex structure's ->owner field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Will Deacon --- kernel/locking/rtmutex.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 851bbb10819d..c9f090d64f00 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -57,7 +57,7 @@ rt_mutex_set_owner(struct rt_mutex *lock, struct task_struct *owner) if (rt_mutex_has_waiters(lock)) val |= RT_MUTEX_HAS_WAITERS; - lock->owner = (struct task_struct *)val; + WRITE_ONCE(lock->owner, (struct task_struct *)val); } static inline void clear_rt_mutex_waiters(struct rt_mutex *lock) -- cgit v1.2.3-71-gd317 From bfeebe24212d374f82bbf5b005371fe13acabb93 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 16:14:08 -0800 Subject: rcu: Add READ_ONCE() to rcu_segcblist ->tails[] The rcu_segcblist structure's ->tails[] array entries are read locklessly, so this commit adds the READ_ONCE() to a load in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 5f4fd3b8777c..426a472e7308 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -182,7 +182,7 @@ void rcu_segcblist_offload(struct rcu_segcblist *rsclp) bool rcu_segcblist_ready_cbs(struct rcu_segcblist *rsclp) { return rcu_segcblist_is_enabled(rsclp) && - &rsclp->head != rsclp->tails[RCU_DONE_TAIL]; + &rsclp->head != READ_ONCE(rsclp->tails[RCU_DONE_TAIL]); } /* -- cgit v1.2.3-71-gd317 From 8ff37290d6622e130553a38ec2662a728e46cdba Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 4 Jan 2020 11:33:17 -0800 Subject: rcu: Add *_ONCE() for grace-period progress indicators The various RCU structures' ->gp_seq, ->gp_seq_needed, ->gp_req_activity, and ->gp_activity fields are read locklessly, so they must be updated with WRITE_ONCE() and, when read locklessly, with READ_ONCE(). This commit makes these changes. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 14 +++++++------- kernel/rcu/tree_plugin.h | 2 +- kernel/rcu/tree_stall.h | 26 +++++++++++++++----------- 3 files changed, 23 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 346321aa3c90..53946b169699 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1175,7 +1175,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, TPS("Prestarted")); goto unlock_out; } - rnp->gp_seq_needed = gp_seq_req; + WRITE_ONCE(rnp->gp_seq_needed, gp_seq_req); if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq))) { /* * We just marked the leaf or internal node, and a @@ -1210,8 +1210,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, unlock_out: /* Push furthest requested GP to leaf node and rcu_data structure. */ if (ULONG_CMP_LT(gp_seq_req, rnp->gp_seq_needed)) { - rnp_start->gp_seq_needed = rnp->gp_seq_needed; - rdp->gp_seq_needed = rnp->gp_seq_needed; + WRITE_ONCE(rnp_start->gp_seq_needed, rnp->gp_seq_needed); + WRITE_ONCE(rdp->gp_seq_needed, rnp->gp_seq_needed); } if (rnp != rnp_start) raw_spin_unlock_rcu_node(rnp); @@ -1423,7 +1423,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) } rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */ if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap) - rdp->gp_seq_needed = rnp->gp_seq_needed; + WRITE_ONCE(rdp->gp_seq_needed, rnp->gp_seq_needed); WRITE_ONCE(rdp->gpwrap, false); rcu_gpnum_ovf(rnp, rdp); return ret; @@ -3276,12 +3276,12 @@ int rcutree_prepare_cpu(unsigned int cpu) rnp = rdp->mynode; raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ rdp->beenonline = true; /* We have now been online. */ - rdp->gp_seq = rnp->gp_seq; - rdp->gp_seq_needed = rnp->gp_seq; + rdp->gp_seq = READ_ONCE(rnp->gp_seq); + rdp->gp_seq_needed = rdp->gp_seq; rdp->cpu_no_qs.b.norm = true; rdp->core_needs_qs = false; rdp->rcu_iw_pending = false; - rdp->rcu_iw_gp_seq = rnp->gp_seq - 1; + rdp->rcu_iw_gp_seq = rdp->gp_seq - 1; trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuonl")); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); rcu_prepare_kthreads(cpu); diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c6ea81cd4189..b5ba14864015 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -753,7 +753,7 @@ dump_blkd_tasks(struct rcu_node *rnp, int ncheck) raw_lockdep_assert_held_rcu_node(rnp); pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n", __func__, rnp->grplo, rnp->grphi, rnp->level, - (long)rnp->gp_seq, (long)rnp->completedqs); + (long)READ_ONCE(rnp->gp_seq), (long)rnp->completedqs); for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent) pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n", __func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext); diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 55f9b84790d3..43dc688c3785 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -592,21 +592,22 @@ void show_rcu_gp_kthreads(void) (long)READ_ONCE(rcu_get_root()->gp_seq_needed), READ_ONCE(rcu_state.gp_flags)); rcu_for_each_node_breadth_first(rnp) { - if (ULONG_CMP_GE(rcu_state.gp_seq, rnp->gp_seq_needed)) + if (ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), + READ_ONCE(rnp->gp_seq_needed))) continue; pr_info("\trcu_node %d:%d ->gp_seq %ld ->gp_seq_needed %ld\n", - rnp->grplo, rnp->grphi, (long)rnp->gp_seq, - (long)rnp->gp_seq_needed); + rnp->grplo, rnp->grphi, (long)READ_ONCE(rnp->gp_seq), + (long)READ_ONCE(rnp->gp_seq_needed)); if (!rcu_is_leaf_node(rnp)) continue; for_each_leaf_node_possible_cpu(rnp, cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); if (rdp->gpwrap || - ULONG_CMP_GE(rcu_state.gp_seq, - rdp->gp_seq_needed)) + ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), + READ_ONCE(rdp->gp_seq_needed))) continue; pr_info("\tcpu %d ->gp_seq_needed %ld\n", - cpu, (long)rdp->gp_seq_needed); + cpu, (long)READ_ONCE(rdp->gp_seq_needed)); } } for_each_possible_cpu(cpu) { @@ -631,7 +632,8 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, static atomic_t warned = ATOMIC_INIT(0); if (!IS_ENABLED(CONFIG_PROVE_RCU) || rcu_gp_in_progress() || - ULONG_CMP_GE(rnp_root->gp_seq, rnp_root->gp_seq_needed)) + ULONG_CMP_GE(READ_ONCE(rnp_root->gp_seq), + READ_ONCE(rnp_root->gp_seq_needed))) return; j = jiffies; /* Expensive access, and in common case don't get here. */ if (time_before(j, READ_ONCE(rcu_state.gp_req_activity) + gpssdelay) || @@ -642,7 +644,8 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, raw_spin_lock_irqsave_rcu_node(rnp, flags); j = jiffies; if (rcu_gp_in_progress() || - ULONG_CMP_GE(rnp_root->gp_seq, rnp_root->gp_seq_needed) || + ULONG_CMP_GE(READ_ONCE(rnp_root->gp_seq), + READ_ONCE(rnp_root->gp_seq_needed)) || time_before(j, READ_ONCE(rcu_state.gp_req_activity) + gpssdelay) || time_before(j, READ_ONCE(rcu_state.gp_activity) + gpssdelay) || atomic_read(&warned)) { @@ -655,9 +658,10 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, raw_spin_lock_rcu_node(rnp_root); /* irqs already disabled. */ j = jiffies; if (rcu_gp_in_progress() || - ULONG_CMP_GE(rnp_root->gp_seq, rnp_root->gp_seq_needed) || - time_before(j, rcu_state.gp_req_activity + gpssdelay) || - time_before(j, rcu_state.gp_activity + gpssdelay) || + ULONG_CMP_GE(READ_ONCE(rnp_root->gp_seq), + READ_ONCE(rnp_root->gp_seq_needed)) || + time_before(j, READ_ONCE(rcu_state.gp_req_activity) + gpssdelay) || + time_before(j, READ_ONCE(rcu_state.gp_activity) + gpssdelay) || atomic_xchg(&warned, 1)) { if (rnp_root != rnp) /* irqs remain disabled. */ -- cgit v1.2.3-71-gd317 From 65bb0dc437c3e57a6cde2b81170c8af4b9c90735 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 6 Jan 2020 21:08:02 +0100 Subject: rcu: Fix typos in file-header comments Convert to plural and add a note that this is for Tree RCU. Signed-off-by: SeongJae Park Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 2 +- kernel/rcu/tree.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 657e6a7d1c03..7ddb29cc7dba 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -5,7 +5,7 @@ * Copyright (C) IBM Corporation, 2006 * Copyright (C) Fujitsu, 2012 * - * Author: Paul McKenney + * Authors: Paul McKenney * Lai Jiangshan * * For detailed explanation of Read-Copy Update mechanism see - diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 53946b169699..a70f56bb56a7 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1,12 +1,12 @@ // SPDX-License-Identifier: GPL-2.0+ /* - * Read-Copy Update mechanism for mutual exclusion + * Read-Copy Update mechanism for mutual exclusion (tree-based version) * * Copyright IBM Corporation, 2008 * * Authors: Dipankar Sarma * Manfred Spraul - * Paul E. McKenney Hierarchical version + * Paul E. McKenney * * Based on the original work by Paul McKenney * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen. -- cgit v1.2.3-71-gd317 From a5b8950180f8e5acb802d1672e0b4d0ceee6126e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 7 Jan 2020 15:48:39 -0800 Subject: rcu: Add READ_ONCE() to rcu_data ->gpwrap The rcu_data structure's ->gpwrap field is read locklessly, and so this commit adds the required READ_ONCE() to a pair of laods in order to avoid destructive compiler optimizations. This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- kernel/rcu/tree_stall.h | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a70f56bb56a7..e851a12920e6 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1322,7 +1322,7 @@ static void rcu_accelerate_cbs_unlocked(struct rcu_node *rnp, rcu_lockdep_assert_cblist_protected(rdp); c = rcu_seq_snap(&rcu_state.gp_seq); - if (!rdp->gpwrap && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { + if (!READ_ONCE(rdp->gpwrap) && ULONG_CMP_GE(rdp->gp_seq_needed, c)) { /* Old request still live, so mark recent callbacks. */ (void)rcu_segcblist_accelerate(&rdp->cblist, c); return; diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 43dc688c3785..bca637b274fb 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -602,7 +602,7 @@ void show_rcu_gp_kthreads(void) continue; for_each_leaf_node_possible_cpu(rnp, cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); - if (rdp->gpwrap || + if (READ_ONCE(rdp->gpwrap) || ULONG_CMP_GE(READ_ONCE(rcu_state.gp_seq), READ_ONCE(rdp->gp_seq_needed))) continue; -- cgit v1.2.3-71-gd317 From 2a2ae872ef7aa958f2152b8b24c6e94cf5f1d0df Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 8 Jan 2020 20:06:25 -0800 Subject: rcu: Add *_ONCE() to rcu_data ->rcu_forced_tick The rcu_data structure's ->rcu_forced_tick field is read locklessly, so this commit adds WRITE_ONCE() to all updates and READ_ONCE() to all lockless reads. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e851a12920e6..be59a5d7299d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -818,11 +818,12 @@ static __always_inline void rcu_nmi_enter_common(bool irq) incby = 1; } else if (tick_nohz_full_cpu(rdp->cpu) && rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE && - READ_ONCE(rdp->rcu_urgent_qs) && !rdp->rcu_forced_tick) { + READ_ONCE(rdp->rcu_urgent_qs) && + !READ_ONCE(rdp->rcu_forced_tick)) { raw_spin_lock_rcu_node(rdp->mynode); // Recheck under lock. if (rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) { - rdp->rcu_forced_tick = true; + WRITE_ONCE(rdp->rcu_forced_tick, true); tick_dep_set_cpu(rdp->cpu, TICK_DEP_BIT_RCU); } raw_spin_unlock_rcu_node(rdp->mynode); @@ -899,7 +900,7 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp) WRITE_ONCE(rdp->rcu_need_heavy_qs, false); if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick) { tick_dep_clear_cpu(rdp->cpu, TICK_DEP_BIT_RCU); - rdp->rcu_forced_tick = false; + WRITE_ONCE(rdp->rcu_forced_tick, false); } } -- cgit v1.2.3-71-gd317 From 3ca3b0e2cbe0050d1777a22b7fc13cad620eb2ba Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 8 Jan 2020 20:12:59 -0800 Subject: rcu: Add *_ONCE() to rcu_node ->boost_kthread_status The rcu_node structure's ->boost_kthread_status field is accessed locklessly, so this commit causes all updates to use WRITE_ONCE() and all reads to use READ_ONCE(). This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index b5ba14864015..0f8b714f09f5 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1032,18 +1032,18 @@ static int rcu_boost_kthread(void *arg) trace_rcu_utilization(TPS("Start boost kthread@init")); for (;;) { - rnp->boost_kthread_status = RCU_KTHREAD_WAITING; + WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_WAITING); trace_rcu_utilization(TPS("End boost kthread@rcu_wait")); rcu_wait(rnp->boost_tasks || rnp->exp_tasks); trace_rcu_utilization(TPS("Start boost kthread@rcu_wait")); - rnp->boost_kthread_status = RCU_KTHREAD_RUNNING; + WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_RUNNING); more2boost = rcu_boost(rnp); if (more2boost) spincnt++; else spincnt = 0; if (spincnt > 10) { - rnp->boost_kthread_status = RCU_KTHREAD_YIELDING; + WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_YIELDING); trace_rcu_utilization(TPS("End boost kthread@rcu_yield")); schedule_timeout_interruptible(2); trace_rcu_utilization(TPS("Start boost kthread@rcu_yield")); @@ -1082,7 +1082,7 @@ static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags) rnp->boost_tasks = rnp->gp_tasks; raw_spin_unlock_irqrestore_rcu_node(rnp, flags); rcu_wake_cond(rnp->boost_kthread_task, - rnp->boost_kthread_status); + READ_ONCE(rnp->boost_kthread_status)); } else { raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } -- cgit v1.2.3-71-gd317 From 90c018942c2babab73814648b37808fc3bf2ed1a Mon Sep 17 00:00:00 2001 From: Eric Dumazet Date: Thu, 7 Nov 2019 11:37:38 -0800 Subject: timer: Use hlist_unhashed_lockless() in timer_pending() The timer_pending() function is mostly used in lockless contexts, so Without proper annotations, KCSAN might detect a data-race [1]. Using hlist_unhashed_lockless() instead of hand-coding it seems appropriate (as suggested by Paul E. McKenney). [1] BUG: KCSAN: data-race in del_timer / detach_if_pending write to 0xffff88808697d870 of 8 bytes by task 10 on cpu 0: __hlist_del include/linux/list.h:764 [inline] detach_timer kernel/time/timer.c:815 [inline] detach_if_pending+0xcd/0x2d0 kernel/time/timer.c:832 try_to_del_timer_sync+0x60/0xb0 kernel/time/timer.c:1226 del_timer_sync+0x6b/0xa0 kernel/time/timer.c:1365 schedule_timeout+0x2d2/0x6e0 kernel/time/timer.c:1896 rcu_gp_fqs_loop+0x37c/0x580 kernel/rcu/tree.c:1639 rcu_gp_kthread+0x143/0x230 kernel/rcu/tree.c:1799 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 read to 0xffff88808697d870 of 8 bytes by task 12060 on cpu 1: del_timer+0x3b/0xb0 kernel/time/timer.c:1198 sk_stop_timer+0x25/0x60 net/core/sock.c:2845 inet_csk_clear_xmit_timers+0x69/0xa0 net/ipv4/inet_connection_sock.c:523 tcp_clear_xmit_timers include/net/tcp.h:606 [inline] tcp_v4_destroy_sock+0xa3/0x3f0 net/ipv4/tcp_ipv4.c:2096 inet_csk_destroy_sock+0xf4/0x250 net/ipv4/inet_connection_sock.c:836 tcp_close+0x6f3/0x970 net/ipv4/tcp.c:2497 inet_release+0x86/0x100 net/ipv4/af_inet.c:427 __sock_release+0x85/0x160 net/socket.c:590 sock_close+0x24/0x30 net/socket.c:1268 __fput+0x1e1/0x520 fs/file_table.c:280 ____fput+0x1f/0x30 fs/file_table.c:313 task_work_run+0xf6/0x130 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_usermode_loop+0x2b4/0x2c0 arch/x86/entry/common.c:163 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 12060 Comm: syz-executor.5 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, Signed-off-by: Eric Dumazet Cc: Thomas Gleixner [ paulmck: Pulled in Eric's later amendments. ] Signed-off-by: Paul E. McKenney --- include/linux/timer.h | 2 +- kernel/time/timer.c | 7 ++++--- 2 files changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/timer.h b/include/linux/timer.h index 1e6650ed066d..0dc19a8c39c9 100644 --- a/include/linux/timer.h +++ b/include/linux/timer.h @@ -164,7 +164,7 @@ static inline void destroy_timer_on_stack(struct timer_list *timer) { } */ static inline int timer_pending(const struct timer_list * timer) { - return timer->entry.pprev != NULL; + return !hlist_unhashed_lockless(&timer->entry); } extern void add_timer_on(struct timer_list *timer, int cpu); diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 4820823515e9..568564ae3597 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -944,6 +944,7 @@ static struct timer_base *lock_timer_base(struct timer_list *timer, #define MOD_TIMER_PENDING_ONLY 0x01 #define MOD_TIMER_REDUCE 0x02 +#define MOD_TIMER_NOTPENDING 0x04 static inline int __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int options) @@ -960,7 +961,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int option * the timer is re-modified to have the same timeout or ends up in the * same array bucket then just return: */ - if (timer_pending(timer)) { + if (!(options & MOD_TIMER_NOTPENDING) && timer_pending(timer)) { /* * The downside of this optimization is that it can result in * larger granularity than you would get from adding a new @@ -1133,7 +1134,7 @@ EXPORT_SYMBOL(timer_reduce); void add_timer(struct timer_list *timer) { BUG_ON(timer_pending(timer)); - mod_timer(timer, timer->expires); + __mod_timer(timer, timer->expires, MOD_TIMER_NOTPENDING); } EXPORT_SYMBOL(add_timer); @@ -1891,7 +1892,7 @@ signed long __sched schedule_timeout(signed long timeout) timer.task = current; timer_setup_on_stack(&timer.timer, process_timeout, 0); - __mod_timer(&timer.timer, expire, 0); + __mod_timer(&timer.timer, expire, MOD_TIMER_NOTPENDING); schedule(); del_singleshot_timer_sync(&timer.timer); -- cgit v1.2.3-71-gd317 From 57721fd15a02f7df9dad1f3cca27f21e03ee118f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 15 Jan 2020 19:17:02 -0800 Subject: rcu: Remove dead code from rcu_segcblist_insert_pend_cbs() The rcu_segcblist_insert_pend_cbs() function currently (partially) initializes the rcu_cblist that it pulls callbacks from. However, all the resulting stores are dead because all callers pass in the address of an on-stack cblist that is not used afterwards. This commit therefore removes this pointless initialization. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 426a472e7308..9a0f66133b4b 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -381,8 +381,6 @@ void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp, return; /* Nothing to do. */ WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rclp->head); WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], rclp->tail); - rclp->head = NULL; - rclp->tail = &rclp->head; } /* -- cgit v1.2.3-71-gd317 From 59881bcd85a06565c53fd13ce3b0ad7fa55e560c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 20 Jan 2020 15:29:04 -0800 Subject: rcu: Add WRITE_ONCE() to rcu_state ->gp_start The rcu_state structure's ->gp_start field is read locklessly, so this commit adds the WRITE_ONCE() to an update in order to provide proper documentation and READ_ONCE()/WRITE_ONCE() pairing. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_stall.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index bca637b274fb..488b71d03030 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -102,7 +102,7 @@ static void record_gp_stall_check_time(void) unsigned long j = jiffies; unsigned long j1; - rcu_state.gp_start = j; + WRITE_ONCE(rcu_state.gp_start, j); j1 = rcu_jiffies_till_stall_check(); /* Record ->gp_start before ->jiffies_stall. */ smp_store_release(&rcu_state.jiffies_stall, j + j1); /* ^^^ */ -- cgit v1.2.3-71-gd317 From aa24f93753e256c4b14fe46f7261f150cff2a50c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 20 Jan 2020 15:43:45 -0800 Subject: rcu: Fix rcu_barrier_callback() race condition The rcu_barrier_callback() function does an atomic_dec_and_test(), and if it is the last CPU to check in, does the required wakeup. Either way, it does an event trace. Unfortunately, this is susceptible to the following sequence of events: o CPU 0 invokes rcu_barrier_callback(), but atomic_dec_and_test() says that it is not last. But at this point, CPU 0 is delayed, perhaps due to an NMI, SMI, or vCPU preemption. o CPU 1 invokes rcu_barrier_callback(), and atomic_dec_and_test() says that it is last. So CPU 1 traces completion and does the needed wakeup. o The awakened rcu_barrier() function does cleanup and releases rcu_state.barrier_mutex. o Another CPU now acquires rcu_state.barrier_mutex and starts another round of rcu_barrier() processing, including updating rcu_state.barrier_sequence. o CPU 0 gets its act back together and does its tracing. Except that rcu_state.barrier_sequence has already been updated, so its tracing is incorrect and probably quite confusing. (Wait! Why did this CPU check in twice for one rcu_barrier() invocation???) This commit therefore causes rcu_barrier_callback() to take a snapshot of the value of rcu_state.barrier_sequence before invoking atomic_dec_and_test(), thus guaranteeing that the event-trace output is sensible, even if the timing of the event-trace output might still be confusing. (Wait! Why did the old rcu_barrier() complete before all of its CPUs checked in???) But being that this is RCU, only so much confusion can reasonably be eliminated. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to the mild consequences of the failure, namely a confusing event trace. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index be59a5d7299d..62383ce5161f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3077,15 +3077,22 @@ static void rcu_barrier_trace(const char *s, int cpu, unsigned long done) /* * RCU callback function for rcu_barrier(). If we are last, wake * up the task executing rcu_barrier(). + * + * Note that the value of rcu_state.barrier_sequence must be captured + * before the atomic_dec_and_test(). Otherwise, if this CPU is not last, + * other CPUs might count the value down to zero before this CPU gets + * around to invoking rcu_barrier_trace(), which might result in bogus + * data from the next instance of rcu_barrier(). */ static void rcu_barrier_callback(struct rcu_head *rhp) { + unsigned long __maybe_unused s = rcu_state.barrier_sequence; + if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) { - rcu_barrier_trace(TPS("LastCB"), -1, - rcu_state.barrier_sequence); + rcu_barrier_trace(TPS("LastCB"), -1, s); complete(&rcu_state.barrier_completion); } else { - rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence); + rcu_barrier_trace(TPS("CB"), -1, s); } } -- cgit v1.2.3-71-gd317 From 5648d6591230c811972543ff146ce969babdd732 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 21 Jan 2020 12:30:22 -0800 Subject: rcu: Don't flag non-starting GPs before GP kthread is running Currently rcu_check_gp_start_stall() complains if a grace period takes too long to start, where "too long" is roughly one RCU CPU stall-warning interval. This has worked well, but there are some debugging Kconfig options (such as CONFIG_EFI_PGT_DUMP=y) that can make booting take a very long time, so much so that the stall-warning interval has expired before RCU's grace-period kthread has even been spawned. This commit therefore resets the rcu_state.gp_req_activity and rcu_state.gp_activity timestamps just before the grace-period kthread is spawned, and modifies the checks and adds ordering to ensure that if rcu_check_gp_start_stall() sees that the grace-period kthread has been spawned, that it will also see the resets applied to the rcu_state.gp_req_activity and rcu_state.gp_activity timestamps. Reported-by: Qian Cai Signed-off-by: Paul E. McKenney [ paulmck: Fix whitespace issues reported by Qian Cai. ] Tested-by: Qian Cai [ paulmck: Simplify grace-period wakeup check per Steve Rostedt feedback. ] --- kernel/rcu/tree.c | 28 ++++++++++++++++------------ kernel/rcu/tree_stall.h | 7 ++++--- 2 files changed, 20 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 62383ce5161f..4a4a975503e5 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1202,7 +1202,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("Startedroot")); WRITE_ONCE(rcu_state.gp_flags, rcu_state.gp_flags | RCU_GP_FLAG_INIT); WRITE_ONCE(rcu_state.gp_req_activity, jiffies); - if (!rcu_state.gp_kthread) { + if (!READ_ONCE(rcu_state.gp_kthread)) { trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("NoGPkthread")); goto unlock_out; } @@ -1237,12 +1237,13 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp) } /* - * Awaken the grace-period kthread. Don't do a self-awaken (unless in - * an interrupt or softirq handler), and don't bother awakening when there - * is nothing for the grace-period kthread to do (as in several CPUs raced - * to awaken, and we lost), and finally don't try to awaken a kthread that - * has not yet been created. If all those checks are passed, track some - * debug information and awaken. + * Awaken the grace-period kthread. Don't do a self-awaken (unless in an + * interrupt or softirq handler, in which case we just might immediately + * sleep upon return, resulting in a grace-period hang), and don't bother + * awakening when there is nothing for the grace-period kthread to do + * (as in several CPUs raced to awaken, we lost), and finally don't try + * to awaken a kthread that has not yet been created. If all those checks + * are passed, track some debug information and awaken. * * So why do the self-wakeup when in an interrupt or softirq handler * in the grace-period kthread's context? Because the kthread might have @@ -1252,10 +1253,10 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp) */ static void rcu_gp_kthread_wake(void) { - if ((current == rcu_state.gp_kthread && - !in_irq() && !in_serving_softirq()) || - !READ_ONCE(rcu_state.gp_flags) || - !rcu_state.gp_kthread) + struct task_struct *t = READ_ONCE(rcu_state.gp_kthread); + + if ((current == t && !in_irq() && !in_serving_softirq()) || + !READ_ONCE(rcu_state.gp_flags) || !t) return; WRITE_ONCE(rcu_state.gp_wake_time, jiffies); WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq)); @@ -3554,7 +3555,10 @@ static int __init rcu_spawn_gp_kthread(void) } rnp = rcu_get_root(); raw_spin_lock_irqsave_rcu_node(rnp, flags); - rcu_state.gp_kthread = t; + WRITE_ONCE(rcu_state.gp_activity, jiffies); + WRITE_ONCE(rcu_state.gp_req_activity, jiffies); + // Reset .gp_activity and .gp_req_activity before setting .gp_kthread. + smp_store_release(&rcu_state.gp_kthread, t); /* ^^^ */ raw_spin_unlock_irqrestore_rcu_node(rnp, flags); wake_up_process(t); rcu_spawn_nocb_kthreads(); diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 488b71d03030..16ad7ad9a185 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -578,6 +578,7 @@ void show_rcu_gp_kthreads(void) unsigned long jw; struct rcu_data *rdp; struct rcu_node *rnp; + struct task_struct *t = READ_ONCE(rcu_state.gp_kthread); j = jiffies; ja = j - READ_ONCE(rcu_state.gp_activity); @@ -585,8 +586,7 @@ void show_rcu_gp_kthreads(void) jw = j - READ_ONCE(rcu_state.gp_wake_time); pr_info("%s: wait state: %s(%d) ->state: %#lx delta ->gp_activity %lu ->gp_req_activity %lu ->gp_wake_time %lu ->gp_wake_seq %ld ->gp_seq %ld ->gp_seq_needed %ld ->gp_flags %#x\n", rcu_state.name, gp_state_getname(rcu_state.gp_state), - rcu_state.gp_state, - rcu_state.gp_kthread ? rcu_state.gp_kthread->state : 0x1ffffL, + rcu_state.gp_state, t ? t->state : 0x1ffffL, ja, jr, jw, (long)READ_ONCE(rcu_state.gp_wake_seq), (long)READ_ONCE(rcu_state.gp_seq), (long)READ_ONCE(rcu_get_root()->gp_seq_needed), @@ -633,7 +633,8 @@ static void rcu_check_gp_start_stall(struct rcu_node *rnp, struct rcu_data *rdp, if (!IS_ENABLED(CONFIG_PROVE_RCU) || rcu_gp_in_progress() || ULONG_CMP_GE(READ_ONCE(rnp_root->gp_seq), - READ_ONCE(rnp_root->gp_seq_needed))) + READ_ONCE(rnp_root->gp_seq_needed)) || + !smp_load_acquire(&rcu_state.gp_kthread)) // Get stable kthread. return; j = jiffies; /* Expensive access, and in common case don't get here. */ if (time_before(j, READ_ONCE(rcu_state.gp_req_activity) + gpssdelay) || -- cgit v1.2.3-71-gd317 From 9ced454807191e44ef093aeeee68194be9ce3a1a Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 20 Jan 2020 22:42:15 +0000 Subject: rcu: Add missing annotation for rcu_nocb_bypass_lock() Sparse reports warning at rcu_nocb_bypass_lock() |warning: context imbalance in rcu_nocb_bypass_lock() - wrong count at exit To fix this, this commit adds an __acquires(&rdp->nocb_bypass_lock). Given that rcu_nocb_bypass_lock() does actually call raw_spin_lock() when raw_spin_trylock() fails, this not only fixes the warning but also improves on the readability of the code. Signed-off-by: Jules Irenge Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 0f8b714f09f5..6db2cad7dab7 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1486,6 +1486,7 @@ module_param(nocb_nobypass_lim_per_jiffy, int, 0); * flag the contention. */ static void rcu_nocb_bypass_lock(struct rcu_data *rdp) + __acquires(&rdp->nocb_bypass_lock) { lockdep_assert_irqs_disabled(); if (raw_spin_trylock(&rdp->nocb_bypass_lock)) -- cgit v1.2.3-71-gd317 From 92c0b889f2ff6898710d49458b6eae1de50895c6 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Thu, 30 Jan 2020 00:30:09 +0000 Subject: rcu/nocb: Add missing annotation for rcu_nocb_bypass_unlock() Sparse reports warning at rcu_nocb_bypass_unlock() warning: context imbalance in rcu_nocb_bypass_unlock() - unexpected unlock The root cause is a missing annotation of rcu_nocb_bypass_unlock() which causes the warning. This commit therefore adds the missing __releases(&rdp->nocb_bypass_lock) annotation. Signed-off-by: Jules Irenge Signed-off-by: Paul E. McKenney Acked-by: Boqun Feng --- kernel/rcu/tree_plugin.h | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 6db2cad7dab7..384651915d74 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1530,6 +1530,7 @@ static bool rcu_nocb_bypass_trylock(struct rcu_data *rdp) * Release the specified rcu_data structure's ->nocb_bypass_lock. */ static void rcu_nocb_bypass_unlock(struct rcu_data *rdp) + __releases(&rdp->nocb_bypass_lock) { lockdep_assert_irqs_disabled(); raw_spin_unlock(&rdp->nocb_bypass_lock); -- cgit v1.2.3-71-gd317 From faa059c397dec8a452c79e9dba64419113ea64e2 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 3 Feb 2020 14:20:00 -0800 Subject: rcu: Optimize and protect atomic_cmpxchg() loop This commit reworks the atomic_cmpxchg() loop in rcu_eqs_special_set() to do only the initial read from the current CPU's rcu_data structure's ->dynticks field explicitly. On subsequent passes, this value is instead retained from the failing atomic_cmpxchg() operation. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4a4a975503e5..6c624814ee2d 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -342,14 +342,17 @@ bool rcu_eqs_special_set(int cpu) { int old; int new; + int new_old; struct rcu_data *rdp = &per_cpu(rcu_data, cpu); + new_old = atomic_read(&rdp->dynticks); do { - old = atomic_read(&rdp->dynticks); + old = new_old; if (old & RCU_DYNTICK_CTRL_CTR) return false; new = old | RCU_DYNTICK_CTRL_MASK; - } while (atomic_cmpxchg(&rdp->dynticks, old, new) != old); + new_old = atomic_cmpxchg(&rdp->dynticks, old, new); + } while (new_old != old); return true; } -- cgit v1.2.3-71-gd317 From 13817dd589f426aee9c36e3670e7cb2a7f067d23 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 4 Feb 2020 08:56:41 -0800 Subject: rcu: Tighten rcu_lockdep_assert_cblist_protected() check The ->nocb_lock lockdep assertion is currently guarded by cpu_online(), which is incorrect for no-CBs CPUs, whose callback lists must be protected by ->nocb_lock regardless of whether or not the corresponding CPU is online. This situation could result in failure to detect bugs resulting from failing to hold ->nocb_lock for offline CPUs. This commit therefore removes the cpu_online() guard. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 384651915d74..70b3c0f4ea37 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1579,8 +1579,7 @@ static void rcu_nocb_unlock_irqrestore(struct rcu_data *rdp, static void rcu_lockdep_assert_cblist_protected(struct rcu_data *rdp) { lockdep_assert_irqs_disabled(); - if (rcu_segcblist_is_offloaded(&rdp->cblist) && - cpu_online(rdp->cpu)) + if (rcu_segcblist_is_offloaded(&rdp->cblist)) lockdep_assert_held(&rdp->nocb_lock); } -- cgit v1.2.3-71-gd317 From 3d05031ae6de6ad084aa41263aed1343065233d5 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 4 Feb 2020 14:55:29 -0800 Subject: rcu: Make nocb_gp_wait() double-check unexpected-callback warning Currently, nocb_gp_wait() unconditionally complains if there is a callback not already associated with a grace period. This assumes that either there was no such callback initially on the one hand, or that the rcu_advance_cbs() function assigned all such callbacks to a grace period on the other. However, in theory there are some situations that would prevent rcu_advance_cbs() from assigning all of the callbacks. This commit therefore checks for unassociated callbacks immediately after rcu_advance_cbs() returns, while the corresponding rcu_node structure's ->lock is still held. If there are unassociated callbacks at that point, the subsequent WARN_ON_ONCE() is disabled. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 70b3c0f4ea37..36e71c99970a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1931,6 +1931,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) struct rcu_data *rdp; struct rcu_node *rnp; unsigned long wait_gp_seq = 0; // Suppress "use uninitialized" warning. + bool wasempty = false; /* * Each pass through the following loop checks for CBs and for the @@ -1970,10 +1971,13 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) rcu_seq_done(&rnp->gp_seq, cur_gp_seq))) { raw_spin_lock_rcu_node(rnp); /* irqs disabled. */ needwake_gp = rcu_advance_cbs(rnp, rdp); + wasempty = rcu_segcblist_restempty(&rdp->cblist, + RCU_NEXT_READY_TAIL); raw_spin_unlock_rcu_node(rnp); /* irqs disabled. */ } // Need to wait on some grace period? - WARN_ON_ONCE(!rcu_segcblist_restempty(&rdp->cblist, + WARN_ON_ONCE(wasempty && + !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)); if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq)) { if (!needwait_gp || -- cgit v1.2.3-71-gd317 From 34c881745549e78f31ec65f319457c82aacc53bd Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 20 Jan 2020 15:42:25 +0100 Subject: rcu: Support kfree_bulk() interface in kfree_rcu() The kfree_rcu() logic can be improved further by using kfree_bulk() interface along with "basic batching support" introduced earlier. The are at least two advantages of using "bulk" interface: - in case of large number of kfree_rcu() requests kfree_bulk() reduces the per-object overhead caused by calling kfree() per-object. - reduces the number of cache-misses due to "pointer chasing" between objects which can be far spread between each other. This approach defines a new kfree_rcu_bulk_data structure that stores pointers in an array with a specific size. Number of entries in that array depends on PAGE_SIZE making kfree_rcu_bulk_data structure to be exactly one page. Since it deals with "block-chain" technique there is an extra need in dynamic allocation when a new block is required. Memory is allocated with GFP_NOWAIT | __GFP_NOWARN flags, i.e. that allows to skip direct reclaim under low memory condition to prevent stalling and fails silently under high memory pressure. The "emergency path" gets maintained when a system is run out of memory. In that case objects are linked into regular list. The "rcuperf" was run to analyze this change in terms of memory consumption and kfree_bulk() throughput. 1) Testing on the Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, 12xCPUs with following parameters: kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 kfree_vary_obj_size=1 dev.2020.01.10a branch Default / CONFIG_SLAB 53607352517 ns, loops: 200000, batches: 1885, memory footprint: 1248MB 53529637912 ns, loops: 200000, batches: 1921, memory footprint: 1193MB 53570175705 ns, loops: 200000, batches: 1929, memory footprint: 1250MB Patch / CONFIG_SLAB 23981587315 ns, loops: 200000, batches: 810, memory footprint: 1219MB 23879375281 ns, loops: 200000, batches: 822, memory footprint: 1190MB 24086841707 ns, loops: 200000, batches: 794, memory footprint: 1380MB Default / CONFIG_SLUB 51291025022 ns, loops: 200000, batches: 1713, memory footprint: 741MB 51278911477 ns, loops: 200000, batches: 1671, memory footprint: 719MB 51256183045 ns, loops: 200000, batches: 1719, memory footprint: 647MB Patch / CONFIG_SLUB 50709919132 ns, loops: 200000, batches: 1618, memory footprint: 456MB 50736297452 ns, loops: 200000, batches: 1633, memory footprint: 507MB 50660403893 ns, loops: 200000, batches: 1628, memory footprint: 429MB in case of CONFIG_SLAB there is double increase in performance and slightly higher memory usage. As for CONFIG_SLUB, the performance figures are better together with lower memory usage. 2) Testing on the HiKey-960, arm64, 8xCPUs with below parameters: CONFIG_SLAB=y kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1 102898760401 ns, loops: 200000, batches: 5822, memory footprint: 158MB 89947009882 ns, loops: 200000, batches: 6715, memory footprint: 115MB rcuperf shows approximately ~12% better throughput in case of using "bulk" interface. The "drain logic" or its RCU callback does the work faster that leads to better throughput. Signed-off-by: Uladzislau Rezki (Sony) Tested-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 204 ++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 169 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d91c9156fab2..51a3aa884a7c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2689,22 +2689,47 @@ EXPORT_SYMBOL_GPL(call_rcu); #define KFREE_DRAIN_JIFFIES (HZ / 50) #define KFREE_N_BATCHES 2 +/* + * This macro defines how many entries the "records" array + * will contain. It is based on the fact that the size of + * kfree_rcu_bulk_data structure becomes exactly one page. + */ +#define KFREE_BULK_MAX_ENTR ((PAGE_SIZE / sizeof(void *)) - 3) + +/** + * struct kfree_rcu_bulk_data - single block to store kfree_rcu() pointers + * @nr_records: Number of active pointers in the array + * @records: Array of the kfree_rcu() pointers + * @next: Next bulk object in the block chain + * @head_free_debug: For debug, when CONFIG_DEBUG_OBJECTS_RCU_HEAD is set + */ +struct kfree_rcu_bulk_data { + unsigned long nr_records; + void *records[KFREE_BULK_MAX_ENTR]; + struct kfree_rcu_bulk_data *next; + struct rcu_head *head_free_debug; +}; + /** * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period * @head_free: List of kfree_rcu() objects waiting for a grace period + * @bhead_free: Bulk-List of kfree_rcu() objects waiting for a grace period * @krcp: Pointer to @kfree_rcu_cpu structure */ struct kfree_rcu_cpu_work { struct rcu_work rcu_work; struct rcu_head *head_free; + struct kfree_rcu_bulk_data *bhead_free; struct kfree_rcu_cpu *krcp; }; /** * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period * @head: List of kfree_rcu() objects not yet waiting for a grace period + * @bhead: Bulk-List of kfree_rcu() objects not yet waiting for a grace period + * @bcached: Keeps at most one object for later reuse when build chain blocks * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period * @lock: Synchronize access to this structure * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES @@ -2718,6 +2743,8 @@ struct kfree_rcu_cpu_work { */ struct kfree_rcu_cpu { struct rcu_head *head; + struct kfree_rcu_bulk_data *bhead; + struct kfree_rcu_bulk_data *bcached; struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES]; spinlock_t lock; struct delayed_work monitor_work; @@ -2727,14 +2754,24 @@ struct kfree_rcu_cpu { static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc); +static __always_inline void +debug_rcu_head_unqueue_bulk(struct rcu_head *head) +{ +#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD + for (; head; head = head->next) + debug_rcu_head_unqueue(head); +#endif +} + /* * This function is invoked in workqueue context after a grace period. - * It frees all the objects queued on ->head_free. + * It frees all the objects queued on ->bhead_free or ->head_free. */ static void kfree_rcu_work(struct work_struct *work) { unsigned long flags; struct rcu_head *head, *next; + struct kfree_rcu_bulk_data *bhead, *bnext; struct kfree_rcu_cpu *krcp; struct kfree_rcu_cpu_work *krwp; @@ -2744,22 +2781,41 @@ static void kfree_rcu_work(struct work_struct *work) spin_lock_irqsave(&krcp->lock, flags); head = krwp->head_free; krwp->head_free = NULL; + bhead = krwp->bhead_free; + krwp->bhead_free = NULL; spin_unlock_irqrestore(&krcp->lock, flags); - // List "head" is now private, so traverse locklessly. + /* "bhead" is now private, so traverse locklessly. */ + for (; bhead; bhead = bnext) { + bnext = bhead->next; + + debug_rcu_head_unqueue_bulk(bhead->head_free_debug); + + rcu_lock_acquire(&rcu_callback_map); + kfree_bulk(bhead->nr_records, bhead->records); + rcu_lock_release(&rcu_callback_map); + + if (cmpxchg(&krcp->bcached, NULL, bhead)) + free_page((unsigned long) bhead); + + cond_resched_tasks_rcu_qs(); + } + + /* + * Emergency case only. It can happen under low memory + * condition when an allocation gets failed, so the "bulk" + * path can not be temporary maintained. + */ for (; head; head = next) { unsigned long offset = (unsigned long)head->func; next = head->next; - // Potentially optimize with kfree_bulk in future. debug_rcu_head_unqueue(head); rcu_lock_acquire(&rcu_callback_map); trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset); - if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) { - /* Could be optimized with kfree_bulk() in future. */ + if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) kfree((void *)head - offset); - } rcu_lock_release(&rcu_callback_map); cond_resched_tasks_rcu_qs(); @@ -2774,26 +2830,48 @@ static void kfree_rcu_work(struct work_struct *work) */ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp) { + struct kfree_rcu_cpu_work *krwp; + bool queued = false; int i; - struct kfree_rcu_cpu_work *krwp = NULL; lockdep_assert_held(&krcp->lock); - for (i = 0; i < KFREE_N_BATCHES; i++) - if (!krcp->krw_arr[i].head_free) { - krwp = &(krcp->krw_arr[i]); - break; - } - // If a previous RCU batch is in progress, we cannot immediately - // queue another one, so return false to tell caller to retry. - if (!krwp) - return false; + for (i = 0; i < KFREE_N_BATCHES; i++) { + krwp = &(krcp->krw_arr[i]); - krwp->head_free = krcp->head; - krcp->head = NULL; - INIT_RCU_WORK(&krwp->rcu_work, kfree_rcu_work); - queue_rcu_work(system_wq, &krwp->rcu_work); - return true; + /* + * Try to detach bhead or head and attach it over any + * available corresponding free channel. It can be that + * a previous RCU batch is in progress, it means that + * immediately to queue another one is not possible so + * return false to tell caller to retry. + */ + if ((krcp->bhead && !krwp->bhead_free) || + (krcp->head && !krwp->head_free)) { + /* Channel 1. */ + if (!krwp->bhead_free) { + krwp->bhead_free = krcp->bhead; + krcp->bhead = NULL; + } + + /* Channel 2. */ + if (!krwp->head_free) { + krwp->head_free = krcp->head; + krcp->head = NULL; + } + + /* + * One work is per one batch, so there are two "free channels", + * "bhead_free" and "head_free" the batch can handle. It can be + * that the work is in the pending state when two channels have + * been detached following each other, one by one. + */ + queue_rcu_work(system_wq, &krwp->rcu_work); + queued = true; + } + } + + return queued; } static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp, @@ -2830,19 +2908,65 @@ static void kfree_rcu_monitor(struct work_struct *work) spin_unlock_irqrestore(&krcp->lock, flags); } +static inline bool +kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, + struct rcu_head *head, rcu_callback_t func) +{ + struct kfree_rcu_bulk_data *bnode; + + if (unlikely(!krcp->initialized)) + return false; + + lockdep_assert_held(&krcp->lock); + + /* Check if a new block is required. */ + if (!krcp->bhead || + krcp->bhead->nr_records == KFREE_BULK_MAX_ENTR) { + bnode = xchg(&krcp->bcached, NULL); + if (!bnode) { + WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE); + + bnode = (struct kfree_rcu_bulk_data *) + __get_free_page(GFP_NOWAIT | __GFP_NOWARN); + } + + /* Switch to emergency path. */ + if (unlikely(!bnode)) + return false; + + /* Initialize the new block. */ + bnode->nr_records = 0; + bnode->next = krcp->bhead; + bnode->head_free_debug = NULL; + + /* Attach it to the head. */ + krcp->bhead = bnode; + } + +#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD + head->func = func; + head->next = krcp->bhead->head_free_debug; + krcp->bhead->head_free_debug = head; +#endif + + /* Finally insert. */ + krcp->bhead->records[krcp->bhead->nr_records++] = + (void *) head - (unsigned long) func; + + return true; +} + /* - * Queue a request for lazy invocation of kfree() after a grace period. + * Queue a request for lazy invocation of kfree_bulk()/kfree() after a grace + * period. Please note there are two paths are maintained, one is the main one + * that uses kfree_bulk() interface and second one is emergency one, that is + * used only when the main path can not be maintained temporary, due to memory + * pressure. * * Each kfree_call_rcu() request is added to a batch. The batch will be drained - * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch - * will be kfree'd in workqueue context. This allows us to: - * - * 1. Batch requests together to reduce the number of grace periods during - * heavy kfree_rcu() load. - * - * 2. It makes it possible to use kfree_bulk() on a large number of - * kfree_rcu() requests thus reducing cache misses and the per-object - * overhead of kfree(). + * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will + * be free'd in workqueue context. This allows us to: batch requests together to + * reduce the number of grace periods during heavy kfree_rcu() load. */ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) { @@ -2861,9 +2985,16 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) __func__, head); goto unlock_return; } - head->func = func; - head->next = krcp->head; - krcp->head = head; + + /* + * Under high memory pressure GFP_NOWAIT can fail, + * in that case the emergency path is maintained. + */ + if (unlikely(!kfree_call_rcu_add_ptr_to_bulk(krcp, head, func))) { + head->func = func; + head->next = krcp->head; + krcp->head = head; + } // Set timer to drain after KFREE_DRAIN_JIFFIES. if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING && @@ -3769,8 +3900,11 @@ static void __init kfree_rcu_batch_init(void) struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); spin_lock_init(&krcp->lock); - for (i = 0; i < KFREE_N_BATCHES; i++) + for (i = 0; i < KFREE_N_BATCHES; i++) { + INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); krcp->krw_arr[i].krcp = krcp; + } + INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor); krcp->initialized = true; } -- cgit v1.2.3-71-gd317 From 613707929b304737e6eb841588772f1994f6702b Mon Sep 17 00:00:00 2001 From: "Uladzislau Rezki (Sony)" Date: Mon, 20 Jan 2020 15:42:26 +0100 Subject: rcu: Add a trace event for kfree_rcu() use of kfree_bulk() The event is given three parameters, first one is the name of RCU flavour, second one is the number of elements in array for free and last one is an address of the array holding pointers to be freed by the kfree_bulk() function. To enable the trace event your kernel has to be build with CONFIG_RCU_TRACE=y, after that it is possible to track the events using ftrace subsystem. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- include/trace/events/rcu.h | 28 ++++++++++++++++++++++++++++ kernel/rcu/tree.c | 3 +++ 2 files changed, 31 insertions(+) (limited to 'kernel') diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 5e49b06e8104..49a49e68b916 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -623,6 +623,34 @@ TRACE_EVENT_RCU(rcu_invoke_kfree_callback, __entry->rcuname, __entry->rhp, __entry->offset) ); +/* + * Tracepoint for the invocation of a single RCU callback of the special + * kfree_bulk() form. The first argument is the RCU flavor, the second + * argument is a number of elements in array to free, the third is an + * address of the array holding nr_records entries. + */ +TRACE_EVENT_RCU(rcu_invoke_kfree_bulk_callback, + + TP_PROTO(const char *rcuname, unsigned long nr_records, void **p), + + TP_ARGS(rcuname, nr_records, p), + + TP_STRUCT__entry( + __field(const char *, rcuname) + __field(unsigned long, nr_records) + __field(void **, p) + ), + + TP_fast_assign( + __entry->rcuname = rcuname; + __entry->nr_records = nr_records; + __entry->p = p; + ), + + TP_printk("%s bulk=0x%p nr_records=%lu", + __entry->rcuname, __entry->p, __entry->nr_records) +); + /* * Tracepoint for exiting rcu_do_batch after RCU callbacks have been * invoked. The first argument is the name of the RCU flavor, diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 51a3aa884a7c..909f97efb1ed 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2792,6 +2792,9 @@ static void kfree_rcu_work(struct work_struct *work) debug_rcu_head_unqueue_bulk(bhead->head_free_debug); rcu_lock_acquire(&rcu_callback_map); + trace_rcu_invoke_kfree_bulk_callback(rcu_state.name, + bhead->nr_records, bhead->records); + kfree_bulk(bhead->nr_records, bhead->records); rcu_lock_release(&rcu_callback_map); -- cgit v1.2.3-71-gd317 From 80c503e0e68fbe271680ab48f0fe29bc034b01b7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 23 Jan 2020 09:19:01 -0800 Subject: locktorture: Print ratio of acquisitions, not failures The __torture_print_stats() function in locktorture.c carefully initializes local variable "min" to statp[0].n_lock_acquired, but then compares it to statp[i].n_lock_fail. Given that the .n_lock_fail field should normally be zero, and given the initialization, it seems reasonable to display the maximum and minimum number acquisitions instead of miscomputing the maximum and minimum number of failures. This commit therefore switches from failures to acquisitions. And this turns out to be not only a day-zero bug, but entirely my own fault. I hate it when that happens! Fixes: 0af3fe1efa53 ("locktorture: Add a lock-torture kernel module") Reported-by: Will Deacon Signed-off-by: Paul E. McKenney Acked-by: Will Deacon Cc: Davidlohr Bueso Cc: Josh Triplett Cc: Peter Zijlstra --- kernel/locking/locktorture.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 99475a66c94f..687c1d83dc20 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -696,10 +696,10 @@ static void __torture_print_stats(char *page, if (statp[i].n_lock_fail) fail = true; sum += statp[i].n_lock_acquired; - if (max < statp[i].n_lock_fail) - max = statp[i].n_lock_fail; - if (min > statp[i].n_lock_fail) - min = statp[i].n_lock_fail; + if (max < statp[i].n_lock_acquired) + max = statp[i].n_lock_acquired; + if (min > statp[i].n_lock_acquired) + min = statp[i].n_lock_acquired; } page += sprintf(page, "%s: Total: %lld Max/Min: %ld/%ld %s Fail: %d %s\n", -- cgit v1.2.3-71-gd317 From c0e1472d80784206ead1dd803dd4bc10e282b17f Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 24 Jan 2020 12:58:15 -0800 Subject: locktorture: Use private random-number generators Both lock_torture_writer() and lock_torture_reader() use the "static" keyword on their DEFINE_TORTURE_RANDOM(rand) declarations, which means that a single instance of a random-number generator are shared among all the writers and another is shared among all the readers. Unfortunately, this random-number generator was not designed for concurrent access. This commit therefore removes both "static" keywords so that each reader and each writer gets its own random-number generator. Signed-off-by: Paul E. McKenney --- kernel/locking/locktorture.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 687c1d83dc20..5baf904e8f39 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -618,7 +618,7 @@ static struct lock_torture_ops percpu_rwsem_lock_ops = { static int lock_torture_writer(void *arg) { struct lock_stress_stats *lwsp = arg; - static DEFINE_TORTURE_RANDOM(rand); + DEFINE_TORTURE_RANDOM(rand); VERBOSE_TOROUT_STRING("lock_torture_writer task started"); set_user_nice(current, MAX_NICE); @@ -655,7 +655,7 @@ static int lock_torture_writer(void *arg) static int lock_torture_reader(void *arg) { struct lock_stress_stats *lrsp = arg; - static DEFINE_TORTURE_RANDOM(rand); + DEFINE_TORTURE_RANDOM(rand); VERBOSE_TOROUT_STRING("lock_torture_reader task started"); set_user_nice(current, MAX_NICE); -- cgit v1.2.3-71-gd317 From 28e09a2e48486ce8ff0a72e21570d59b1243b308 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 30 Jan 2020 20:37:04 -0800 Subject: locktorture: Forgive apparent unfairness if CPU hotplug If CPU hotplug testing is enabled, a lock might appear to be maximally unfair just because one of the CPUs was offline almost all the time. This commit therefore forgives unfairness if CPU hotplug testing was enabled. Signed-off-by: Paul E. McKenney --- kernel/locking/locktorture.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/locktorture.c b/kernel/locking/locktorture.c index 5baf904e8f39..5efbfc68ce99 100644 --- a/kernel/locking/locktorture.c +++ b/kernel/locking/locktorture.c @@ -704,7 +704,8 @@ static void __torture_print_stats(char *page, page += sprintf(page, "%s: Total: %lld Max/Min: %ld/%ld %s Fail: %d %s\n", write ? "Writes" : "Reads ", - sum, max, min, max / 2 > min ? "???" : "", + sum, max, min, + !onoff_interval && max / 2 > min ? "???" : "", fail, fail ? "!!!" : ""); if (fail) atomic_inc(&cxt.n_lock_torture_errors); -- cgit v1.2.3-71-gd317 From b5ea03709d12e98fa341aecfa6940cc9f49e8817 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 9 Dec 2019 15:19:45 -0800 Subject: rcu: Clear ->core_needs_qs at GP end or self-reported QS The rcu_data structure's ->core_needs_qs field does not necessarily get cleared in a timely fashion after the corresponding CPUs' quiescent state has been reported. From a functional viewpoint, no harm done, but this can result in excessive invocation of RCU core processing, as witnessed by the kernel test robot, which saw greatly increased softirq overhead. This commit therefore restores the rcu_report_qs_rdp() function's clearing of this field, but only when running on the corresponding CPU. Cases where some other CPU reports the quiescent state (for example, on behalf of an idle CPU) are handled by setting this field appropriately within the __note_gp_changes() function's end-of-grace-period checks. This handling is carried out regardless of whether the end of a grace period actually happened, thus handling the case where a CPU goes non-idle after a quiescent state is reported on its behalf, but before the grace period ends. This fix also avoids cross-CPU updates to ->core_needs_qs, While in the area, this commit changes the __note_gp_changes() need_gp variable's name to need_qs because it is a quiescent state that is needed from the CPU in question. Fixes: ed93dfc6bc00 ("rcu: Confine ->core_needs_qs accesses to the corresponding CPU") Reported-by: kernel test robot Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d91c9156fab2..31d01f80a1f6 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1386,7 +1386,7 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp, static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) { bool ret = false; - bool need_gp; + bool need_qs; const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && rcu_segcblist_is_offloaded(&rdp->cblist); @@ -1400,10 +1400,13 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) unlikely(READ_ONCE(rdp->gpwrap))) { if (!offloaded) ret = rcu_advance_cbs(rnp, rdp); /* Advance CBs. */ + rdp->core_needs_qs = false; trace_rcu_grace_period(rcu_state.name, rdp->gp_seq, TPS("cpuend")); } else { if (!offloaded) ret = rcu_accelerate_cbs(rnp, rdp); /* Recent CBs. */ + if (rdp->core_needs_qs) + rdp->core_needs_qs = !!(rnp->qsmask & rdp->grpmask); } /* Now handle the beginnings of any new-to-this-CPU grace periods. */ @@ -1415,9 +1418,9 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) * go looking for one. */ trace_rcu_grace_period(rcu_state.name, rnp->gp_seq, TPS("cpustart")); - need_gp = !!(rnp->qsmask & rdp->grpmask); - rdp->cpu_no_qs.b.norm = need_gp; - rdp->core_needs_qs = need_gp; + need_qs = !!(rnp->qsmask & rdp->grpmask); + rdp->cpu_no_qs.b.norm = need_qs; + rdp->core_needs_qs = need_qs; zero_cpu_stall_ticks(rdp); } rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */ @@ -1987,6 +1990,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp) return; } mask = rdp->grpmask; + if (rdp->cpu == smp_processor_id()) + rdp->core_needs_qs = false; if ((rnp->qsmask & mask) == 0) { raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } else { -- cgit v1.2.3-71-gd317 From b2b00ddf193bf83dc561d965c67b18eb54ebcd83 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 30 Oct 2019 11:56:10 -0700 Subject: rcu: React to callback overload by aggressively seeking quiescent states In default configutions, RCU currently waits at least 100 milliseconds before asking cond_resched() and/or resched_rcu() for help seeking quiescent states to end a grace period. But 100 milliseconds can be one good long time during an RCU callback flood, for example, as can happen when user processes repeatedly open and close files in a tight loop. These 100-millisecond gaps in successive grace periods during a callback flood can result in excessive numbers of callbacks piling up, unnecessarily increasing memory footprint. This commit therefore asks cond_resched() and/or resched_rcu() for help as early as the first FQS scan when at least one of the CPUs has more than 20,000 callbacks queued, a number that can be changed using the new rcutree.qovld kernel boot parameter. An auxiliary qovld_calc variable is used to avoid acquisition of locks that have not yet been initialized. Early tests indicate that this reduces the RCU-callback memory footprint during rcutorture floods by from 50% to 4x, depending on configuration. Reported-by: Joel Fernandes (Google) Reported-by: Tejun Heo [ paulmck: Fix bug located by Qian Cai. ] Signed-off-by: Paul E. McKenney Tested-by: Dexuan Cui Tested-by: Qian Cai --- Documentation/admin-guide/kernel-parameters.txt | 9 +++ kernel/rcu/tree.c | 75 +++++++++++++++++++++++-- kernel/rcu/tree.h | 4 ++ kernel/rcu/tree_plugin.h | 2 + 4 files changed, 86 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index dbc22d684627..dbd4b4a65209 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3980,6 +3980,15 @@ Set threshold of queued RCU callbacks below which batch limiting is re-enabled. + rcutree.qovld= [KNL] + Set threshold of queued RCU callbacks beyond which + RCU's force-quiescent-state scan will aggressively + enlist help from cond_resched() and sched IPIs to + help CPUs more quickly reach quiescent states. + Set to less than zero to make this be set based + on rcutree.qhimark at boot time and to zero to + disable more aggressive help enlistment. + rcutree.rcu_idle_gp_delay= [KNL] Set wakeup interval for idle CPUs that have RCU callbacks (RCU_FAST_NO_HZ=y). diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 31d01f80a1f6..48fba2257748 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -150,6 +150,7 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu) static void invoke_rcu_core(void); static void rcu_report_exp_rdp(struct rcu_data *rdp); static void sync_sched_exp_online_cleanup(int cpu); +static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp); /* rcuc/rcub kthread realtime priority */ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0; @@ -410,10 +411,15 @@ static long blimit = DEFAULT_RCU_BLIMIT; static long qhimark = DEFAULT_RCU_QHIMARK; #define DEFAULT_RCU_QLOMARK 100 /* Once only this many pending, use blimit. */ static long qlowmark = DEFAULT_RCU_QLOMARK; +#define DEFAULT_RCU_QOVLD_MULT 2 +#define DEFAULT_RCU_QOVLD (DEFAULT_RCU_QOVLD_MULT * DEFAULT_RCU_QHIMARK) +static long qovld = DEFAULT_RCU_QOVLD; /* If this many pending, hammer QS. */ +static long qovld_calc = -1; /* No pre-initialization lock acquisitions! */ module_param(blimit, long, 0444); module_param(qhimark, long, 0444); module_param(qlowmark, long, 0444); +module_param(qovld, long, 0444); static ulong jiffies_till_first_fqs = ULONG_MAX; static ulong jiffies_till_next_fqs = ULONG_MAX; @@ -1072,7 +1078,8 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) rnhqp = &per_cpu(rcu_data.rcu_need_heavy_qs, rdp->cpu); if (!READ_ONCE(*rnhqp) && (time_after(jiffies, rcu_state.gp_start + jtsq * 2) || - time_after(jiffies, rcu_state.jiffies_resched))) { + time_after(jiffies, rcu_state.jiffies_resched) || + rcu_state.cbovld)) { WRITE_ONCE(*rnhqp, true); /* Store rcu_need_heavy_qs before rcu_urgent_qs. */ smp_store_release(ruqp, true); @@ -1089,8 +1096,8 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) * So hit them over the head with the resched_cpu() hammer! */ if (tick_nohz_full_cpu(rdp->cpu) && - time_after(jiffies, - READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) { + (time_after(jiffies, READ_ONCE(rdp->last_fqs_resched) + jtsq * 3) || + rcu_state.cbovld)) { WRITE_ONCE(*ruqp, true); resched_cpu(rdp->cpu); WRITE_ONCE(rdp->last_fqs_resched, jiffies); @@ -1704,8 +1711,9 @@ static void rcu_gp_fqs_loop(void) */ static void rcu_gp_cleanup(void) { - unsigned long gp_duration; + int cpu; bool needgp = false; + unsigned long gp_duration; unsigned long new_gp_seq; bool offloaded; struct rcu_data *rdp; @@ -1751,6 +1759,12 @@ static void rcu_gp_cleanup(void) needgp = __note_gp_changes(rnp, rdp) || needgp; /* smp_mb() provided by prior unlock-lock pair. */ needgp = rcu_future_gp_cleanup(rnp) || needgp; + // Reset overload indication for CPUs no longer overloaded + if (rcu_is_leaf_node(rnp)) + for_each_leaf_node_cpu_mask(rnp, cpu, rnp->cbovldmask) { + rdp = per_cpu_ptr(&rcu_data, cpu); + check_cb_ovld_locked(rdp, rnp); + } sq = rcu_nocb_gp_get(rnp); raw_spin_unlock_irq_rcu_node(rnp); rcu_nocb_gp_cleanup(sq); @@ -2299,10 +2313,13 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp)) struct rcu_data *rdp; struct rcu_node *rnp; + rcu_state.cbovld = rcu_state.cbovldnext; + rcu_state.cbovldnext = false; rcu_for_each_leaf_node(rnp) { cond_resched_tasks_rcu_qs(); mask = 0; raw_spin_lock_irqsave_rcu_node(rnp, flags); + rcu_state.cbovldnext |= !!rnp->cbovldmask; if (rnp->qsmask == 0) { if (!IS_ENABLED(CONFIG_PREEMPT_RCU) || rcu_preempt_blocked_readers_cgp(rnp)) { @@ -2583,6 +2600,48 @@ static void rcu_leak_callback(struct rcu_head *rhp) { } +/* + * Check and if necessary update the leaf rcu_node structure's + * ->cbovldmask bit corresponding to the current CPU based on that CPU's + * number of queued RCU callbacks. The caller must hold the leaf rcu_node + * structure's ->lock. + */ +static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp) +{ + raw_lockdep_assert_held_rcu_node(rnp); + if (qovld_calc <= 0) + return; // Early boot and wildcard value set. + if (rcu_segcblist_n_cbs(&rdp->cblist) >= qovld_calc) + WRITE_ONCE(rnp->cbovldmask, rnp->cbovldmask | rdp->grpmask); + else + WRITE_ONCE(rnp->cbovldmask, rnp->cbovldmask & ~rdp->grpmask); +} + +/* + * Check and if necessary update the leaf rcu_node structure's + * ->cbovldmask bit corresponding to the current CPU based on that CPU's + * number of queued RCU callbacks. No locks need be held, but the + * caller must have disabled interrupts. + * + * Note that this function ignores the possibility that there are a lot + * of callbacks all of which have already seen the end of their respective + * grace periods. This omission is due to the need for no-CBs CPUs to + * be holding ->nocb_lock to do this check, which is too heavy for a + * common-case operation. + */ +static void check_cb_ovld(struct rcu_data *rdp) +{ + struct rcu_node *const rnp = rdp->mynode; + + if (qovld_calc <= 0 || + ((rcu_segcblist_n_cbs(&rdp->cblist) >= qovld_calc) == + !!(READ_ONCE(rnp->cbovldmask) & rdp->grpmask))) + return; // Early boot wildcard value or already set correctly. + raw_spin_lock_rcu_node(rnp); + check_cb_ovld_locked(rdp, rnp); + raw_spin_unlock_rcu_node(rnp); +} + /* * Helper function for call_rcu() and friends. The cpu argument will * normally be -1, indicating "currently running CPU". It may specify @@ -2626,6 +2685,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) rcu_segcblist_init(&rdp->cblist); } + check_cb_ovld(rdp); if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) return; // Enqueued onto ->nocb_bypass, so just leave. /* If we get here, rcu_nocb_try_bypass() acquired ->nocb_lock. */ @@ -3814,6 +3874,13 @@ void __init rcu_init(void) rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0); WARN_ON(!rcu_par_gp_wq); srcu_init(); + + /* Fill in default value for rcutree.qovld boot parameter. */ + /* -After- the rcu_node ->lock fields are initialized! */ + if (qovld < 0) + qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark; + else + qovld_calc = qovld; } #include "tree_stall.h" diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 0c87e4c161c2..9dc2ec021da5 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -68,6 +68,8 @@ struct rcu_node { /* Online CPUs for next expedited GP. */ /* Any CPU that has ever been online will */ /* have its bit set. */ + unsigned long cbovldmask; + /* CPUs experiencing callback overload. */ unsigned long ffmask; /* Fully functional CPUs. */ unsigned long grpmask; /* Mask to apply to parent qsmask. */ /* Only one bit will be set in this mask. */ @@ -321,6 +323,8 @@ struct rcu_state { atomic_t expedited_need_qs; /* # CPUs left to check in. */ struct swait_queue_head expedited_wq; /* Wait for check-ins. */ int ncpus_snap; /* # CPUs seen last time. */ + u8 cbovld; /* Callback overload now? */ + u8 cbovldnext; /* ^ ^ next time? */ unsigned long jiffies_force_qs; /* Time at which to invoke */ /* force_quiescent_state(). */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index c6ea81cd4189..0be8fad08daa 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -56,6 +56,8 @@ static void __init rcu_bootup_announce_oddness(void) pr_info("\tBoot-time adjustment of callback high-water mark to %ld.\n", qhimark); if (qlowmark != DEFAULT_RCU_QLOMARK) pr_info("\tBoot-time adjustment of callback low-water mark to %ld.\n", qlowmark); + if (qovld != DEFAULT_RCU_QOVLD) + pr_info("\tBoot-time adjustment of callback overload leval to %ld.\n", qovld); if (jiffies_till_first_fqs != ULONG_MAX) pr_info("\tBoot-time adjustment of first FQS scan delay to %ld jiffies.\n", jiffies_till_first_fqs); if (jiffies_till_next_fqs != ULONG_MAX) -- cgit v1.2.3-71-gd317 From 8c14263d351b4766a040346ee917b8d81583a460 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 7 Nov 2019 01:10:55 -0800 Subject: rcu: React to callback overload by boosting RCU readers RCU priority boosting currently is not applied until the grace period is at least 250 milliseconds old (or the number of milliseconds specified by the CONFIG_RCU_BOOST_DELAY Kconfig option). Although this has worked well, it can result in OOM under conditions of RCU callback flooding. One can argue that the real-time systems using RCU priority boosting should carefully avoid RCU callback flooding, but one can just as well argue that an OOM is a rather obnoxious error message. This commit therefore disables the RCU priority boosting delay when there are excessive numbers of callbacks queued. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 0be8fad08daa..4d4637c361b7 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1079,7 +1079,7 @@ static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags) (rnp->gp_tasks != NULL && rnp->boost_tasks == NULL && rnp->qsmask == 0 && - ULONG_CMP_GE(jiffies, rnp->boost_time))) { + (ULONG_CMP_GE(jiffies, rnp->boost_time) || rcu_state.cbovld))) { if (rnp->exp_tasks == NULL) rnp->boost_tasks = rnp->gp_tasks; raw_spin_unlock_irqrestore_rcu_node(rnp, flags); -- cgit v1.2.3-71-gd317 From aa96a93ba2bbad5e1b24706daffa14c4f1c50d2a Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Thu, 12 Dec 2019 17:36:43 +0000 Subject: rcu: Fix spelling mistake "leval" -> "level" This commit fixes a spelling mistake in a pr_info() message. Signed-off-by: Colin Ian King Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 4d4637c361b7..0765784012f8 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -57,7 +57,7 @@ static void __init rcu_bootup_announce_oddness(void) if (qlowmark != DEFAULT_RCU_QLOMARK) pr_info("\tBoot-time adjustment of callback low-water mark to %ld.\n", qlowmark); if (qovld != DEFAULT_RCU_QOVLD) - pr_info("\tBoot-time adjustment of callback overload leval to %ld.\n", qovld); + pr_info("\tBoot-time adjustment of callback overload level to %ld.\n", qovld); if (jiffies_till_first_fqs != ULONG_MAX) pr_info("\tBoot-time adjustment of first FQS scan delay to %ld jiffies.\n", jiffies_till_first_fqs); if (jiffies_till_next_fqs != ULONG_MAX) -- cgit v1.2.3-71-gd317 From b692dc4adfcff5430467bec8efed3c07d1772eaf Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 11 Feb 2020 07:29:02 -0800 Subject: rcu: Update __call_rcu() comments The __call_rcu() function's header comment refers to a cpu argument that no longer exists, and the comment of the return path from rcu_nocb_try_bypass() ignores the non-no-CBs CPU case. This commit therefore update both comments. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 48fba2257748..0a9de9fd43ed 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2642,12 +2642,7 @@ static void check_cb_ovld(struct rcu_data *rdp) raw_spin_unlock_rcu_node(rnp); } -/* - * Helper function for call_rcu() and friends. The cpu argument will - * normally be -1, indicating "currently running CPU". It may specify - * a CPU only if that CPU is a no-CBs CPU. Currently, only rcu_barrier() - * is expected to specify a CPU. - */ +/* Helper function for call_rcu() and friends. */ static void __call_rcu(struct rcu_head *head, rcu_callback_t func) { @@ -2688,7 +2683,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) check_cb_ovld(rdp); if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags)) return; // Enqueued onto ->nocb_bypass, so just leave. - /* If we get here, rcu_nocb_try_bypass() acquired ->nocb_lock. */ + // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock. rcu_segcblist_enqueue(&rdp->cblist, head); if (__is_kfree_rcu_offset((unsigned long)func)) trace_rcu_kfree_callback(rcu_state.name, head, -- cgit v1.2.3-71-gd317 From fcb7381265e6cceb1d54283878d145db52b9d9d7 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 6 Jan 2020 11:59:58 -0800 Subject: rcu-tasks: *_ONCE() for rcu_tasks_cbs_head The RCU tasks list of callbacks, rcu_tasks_cbs_head, is sampled locklessly by rcu_tasks_kthread() when waiting for work to do. This commit therefore applies READ_ONCE() to that lockless sampling and WRITE_ONCE() to the single potential store outside of rcu_tasks_kthread. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/update.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 6c4b862f57d6..a27df761b149 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -528,7 +528,7 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func) rhp->func = func; raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags); needwake = !rcu_tasks_cbs_head; - *rcu_tasks_cbs_tail = rhp; + WRITE_ONCE(*rcu_tasks_cbs_tail, rhp); rcu_tasks_cbs_tail = &rhp->next; raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags); /* We can't create the thread unless interrupts are enabled. */ @@ -658,7 +658,7 @@ static int __noreturn rcu_tasks_kthread(void *arg) /* If there were none, wait a bit and start over. */ if (!list) { wait_event_interruptible(rcu_tasks_cbs_wq, - rcu_tasks_cbs_head); + READ_ONCE(rcu_tasks_cbs_head)); if (!rcu_tasks_cbs_head) { WARN_ON(signal_pending(current)); schedule_timeout_interruptible(HZ/10); -- cgit v1.2.3-71-gd317 From e1e9bdc00ade6651c11bf5628ee9d313ee6089ac Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 20 Jan 2020 22:41:26 +0000 Subject: rcu: Add missing annotation for exit_tasks_rcu_start() Sparse reports a warning at exit_tasks_rcu_start(void) |warning: context imbalance in exit_tasks_rcu_start() - wrong count at exit To fix this, this commit adds an __acquires(&tasks_rcu_exit_srcu). Given that exit_tasks_rcu_start() does actually call __srcu_read_lock(), this not only fixes the warning but also improves on the readability of the code. Signed-off-by: Jules Irenge Signed-off-by: Paul E. McKenney Reviewed-by: Joel Fernandes (Google) --- kernel/rcu/update.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index a27df761b149..a04fe54bc4ab 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -801,7 +801,7 @@ static int __init rcu_spawn_tasks_kthread(void) core_initcall(rcu_spawn_tasks_kthread); /* Do the srcu_read_lock() for the above synchronize_srcu(). */ -void exit_tasks_rcu_start(void) +void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu) { preempt_disable(); current->rcu_tasks_idx = __srcu_read_lock(&tasks_rcu_exit_srcu); -- cgit v1.2.3-71-gd317 From 90ba11ba99e0a4cc75302335d10a225b27a44918 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Mon, 20 Jan 2020 22:41:54 +0000 Subject: rcu: Add missing annotation for exit_tasks_rcu_finish() Sparse reports a warning at exit_tasks_rcu_finish(void) |warning: context imbalance in exit_tasks_rcu_finish() |- wrong count at exit To fix this, this commit adds a __releases(&tasks_rcu_exit_srcu). Given that exit_tasks_rcu_finish() does actually call __srcu_read_lock(), this not only fixes the warning but also improves on the readability of the code. Signed-off-by: Jules Irenge Signed-off-by: Paul E. McKenney Reviewed-by: Joel Fernandes (Google) --- kernel/rcu/update.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index a04fe54bc4ab..ede656c5e1e9 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -809,7 +809,7 @@ void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu) } /* Do the srcu_read_unlock() for the above synchronize_srcu(). */ -void exit_tasks_rcu_finish(void) +void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu) { preempt_disable(); __srcu_read_unlock(&tasks_rcu_exit_srcu, current->rcu_tasks_idx); -- cgit v1.2.3-71-gd317 From 7ff8b4502bc0f576450d4ecbda97fa30e8002ed1 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 22 Dec 2019 19:32:54 -0800 Subject: srcu: Fix __call_srcu()/process_srcu() datarace The srcu_node structure's ->srcu_gp_seq_needed_exp field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 657e6a7d1c03..b1edac93e403 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -550,7 +550,7 @@ static void srcu_gp_end(struct srcu_struct *ssp) snp->srcu_have_cbs[idx] = gpseq; rcu_seq_set_state(&snp->srcu_have_cbs[idx], 1); if (ULONG_CMP_LT(snp->srcu_gp_seq_needed_exp, gpseq)) - snp->srcu_gp_seq_needed_exp = gpseq; + WRITE_ONCE(snp->srcu_gp_seq_needed_exp, gpseq); mask = snp->srcu_data_have_cbs[idx]; snp->srcu_data_have_cbs[idx] = 0; spin_unlock_irq_rcu_node(snp); @@ -660,7 +660,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, if (snp == sdp->mynode) snp->srcu_data_have_cbs[idx] |= sdp->grpmask; if (!do_norm && ULONG_CMP_LT(snp->srcu_gp_seq_needed_exp, s)) - snp->srcu_gp_seq_needed_exp = s; + WRITE_ONCE(snp->srcu_gp_seq_needed_exp, s); spin_unlock_irqrestore_rcu_node(snp, flags); } -- cgit v1.2.3-71-gd317 From 8c9e0cb32315835825ea1ad725a858a2d2ce4a8e Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 22 Dec 2019 19:36:33 -0800 Subject: srcu: Fix __call_srcu()/srcu_get_delay() datarace The srcu_struct structure's ->srcu_gp_seq_needed_exp field is accessed locklessly, so updates must use WRITE_ONCE(). This commit therefore adds the needed WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index b1edac93e403..79848f7d575d 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -534,7 +534,7 @@ static void srcu_gp_end(struct srcu_struct *ssp) rcu_seq_end(&ssp->srcu_gp_seq); gpseq = rcu_seq_current(&ssp->srcu_gp_seq); if (ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, gpseq)) - ssp->srcu_gp_seq_needed_exp = gpseq; + WRITE_ONCE(ssp->srcu_gp_seq_needed_exp, gpseq); spin_unlock_irq_rcu_node(ssp); mutex_unlock(&ssp->srcu_gp_mutex); /* A new grace period can start at this point. But only one. */ @@ -614,7 +614,7 @@ static void srcu_funnel_exp_start(struct srcu_struct *ssp, struct srcu_node *snp } spin_lock_irqsave_rcu_node(ssp, flags); if (ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, s)) - ssp->srcu_gp_seq_needed_exp = s; + WRITE_ONCE(ssp->srcu_gp_seq_needed_exp, s); spin_unlock_irqrestore_rcu_node(ssp, flags); } @@ -674,7 +674,7 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp, smp_store_release(&ssp->srcu_gp_seq_needed, s); /*^^^*/ } if (!do_norm && ULONG_CMP_LT(ssp->srcu_gp_seq_needed_exp, s)) - ssp->srcu_gp_seq_needed_exp = s; + WRITE_ONCE(ssp->srcu_gp_seq_needed_exp, s); /* If grace period not already done and none in progress, start it. */ if (!rcu_seq_done(&ssp->srcu_gp_seq, s) && -- cgit v1.2.3-71-gd317 From 39f91504a03a7a2abdb52205106942fa4a548d0d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sun, 22 Dec 2019 19:39:35 -0800 Subject: srcu: Fix process_srcu()/srcu_batches_completed() datarace The srcu_struct structure's ->srcu_idx field is accessed locklessly, so reads must use READ_ONCE(). This commit therefore adds the needed READ_ONCE() invocation where it was missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely. Signed-off-by: Paul E. McKenney --- kernel/rcu/srcutree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 79848f7d575d..119a37319e67 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -1079,7 +1079,7 @@ EXPORT_SYMBOL_GPL(srcu_barrier); */ unsigned long srcu_batches_completed(struct srcu_struct *ssp) { - return ssp->srcu_idx; + return READ_ONCE(ssp->srcu_idx); } EXPORT_SYMBOL_GPL(srcu_batches_completed); -- cgit v1.2.3-71-gd317 From 710426068dc60f2d2e139478d6185710802cdc0a Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 11:42:05 -0800 Subject: srcu: Hold srcu_struct ->lock when updating ->srcu_gp_seq A read of the srcu_struct structure's ->srcu_gp_seq field should not need READ_ONCE() when that structure's ->lock is held. Except that this lock is not always held when updating this field. This commit therefore acquires the lock around updates and removes a now-unneeded READ_ONCE(). This data race was reported by KCSAN. Signed-off-by: Paul E. McKenney [ paulmck: Switch from READ_ONCE() to lock per Peter Zilstra question. ] Acked-by: Peter Zijlstra (Intel) --- kernel/rcu/srcutree.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 119a37319e67..c19c1df0d198 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -450,7 +450,7 @@ static void srcu_gp_start(struct srcu_struct *ssp) spin_unlock_rcu_node(sdp); /* Interrupts remain disabled. */ smp_mb(); /* Order prior store to ->srcu_gp_seq_needed vs. GP start. */ rcu_seq_start(&ssp->srcu_gp_seq); - state = rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)); + state = rcu_seq_state(ssp->srcu_gp_seq); WARN_ON_ONCE(state != SRCU_STATE_SCAN1); } @@ -1130,7 +1130,9 @@ static void srcu_advance_state(struct srcu_struct *ssp) return; /* readers present, retry later. */ } srcu_flip(ssp); + spin_lock_irq_rcu_node(ssp); rcu_seq_set_state(&ssp->srcu_gp_seq, SRCU_STATE_SCAN2); + spin_unlock_irq_rcu_node(ssp); } if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)) == SRCU_STATE_SCAN2) { -- cgit v1.2.3-71-gd317 From 59ee0326ccf712f9a637d5df2465a16a784cbfb0 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 28 Nov 2019 18:54:06 -0800 Subject: rcutorture: Suppress forward-progress complaints during early boot Some larger systems can take in excess of 50 seconds to complete their early boot initcalls prior to spawing init. This does not in any way help the forward-progress judgments of built-in rcutorture (when rcutorture is built as a module, the insmod or modprobe command normally cannot happen until some time after boot completes). This commit therefore suppresses such complaints until about the time that init is spawned. This also includes a fix to a stupid error located by kbuild test robot. [ paulmck: Apply kbuild test robot feedback. ] Signed-off-by: Paul E. McKenney [ paulmck: Fix to nohz_full slow-expediting recovery logic, per bpetkov. ] [ paulmck: Restrict splat to CONFIG_PREEMPT_RT=y kernels and simplify. ] Tested-by: Borislav Petkov --- include/linux/rcutiny.h | 1 + include/linux/rcutree.h | 1 + kernel/rcu/rcutorture.c | 3 ++- kernel/rcu/tree_exp.h | 7 ++++++- kernel/rcu/update.c | 12 ++++++++++++ 5 files changed, 22 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index b2b2dc990da9..045c28b71f4f 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -83,6 +83,7 @@ void rcu_scheduler_starting(void); static inline void rcu_scheduler_starting(void) { } #endif /* #else #ifndef CONFIG_SRCU */ static inline void rcu_end_inkernel_boot(void) { } +static inline bool rcu_inkernel_boot_has_ended(void) { return true; } static inline bool rcu_is_watching(void) { return true; } static inline void rcu_momentary_dyntick_idle(void) { } static inline void kfree_rcu_scheduler_running(void) { } diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 2f787b9029d1..45f3f66bb04d 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -54,6 +54,7 @@ void exit_rcu(void); void rcu_scheduler_starting(void); extern int rcu_scheduler_active __read_mostly; void rcu_end_inkernel_boot(void); +bool rcu_inkernel_boot_has_ended(void); bool rcu_is_watching(void); #ifndef CONFIG_PREEMPTION void rcu_all_qs(void); diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 1aeecc165b21..9ba49788cb48 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1067,7 +1067,8 @@ rcu_torture_writer(void *arg) if (stutter_wait("rcu_torture_writer") && !READ_ONCE(rcu_fwd_cb_nodelay) && !cur_ops->slow_gps && - !torture_must_stop()) + !torture_must_stop() && + rcu_inkernel_boot_has_ended()) for (i = 0; i < ARRAY_SIZE(rcu_tortures); i++) if (list_empty(&rcu_tortures[i].rtort_free) && rcu_access_pointer(rcu_torture_current) != diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index dcbd75791f39..a72d16eb869e 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -485,6 +485,7 @@ static bool synchronize_rcu_expedited_wait_once(long tlimit) static void synchronize_rcu_expedited_wait(void) { int cpu; + unsigned long j; unsigned long jiffies_stall; unsigned long jiffies_start; unsigned long mask; @@ -496,7 +497,7 @@ static void synchronize_rcu_expedited_wait(void) trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait")); jiffies_stall = rcu_jiffies_till_stall_check(); jiffies_start = jiffies; - if (IS_ENABLED(CONFIG_NO_HZ_FULL)) { + if (tick_nohz_full_enabled() && rcu_inkernel_boot_has_ended()) { if (synchronize_rcu_expedited_wait_once(1)) return; rcu_for_each_leaf_node(rnp) { @@ -508,6 +509,10 @@ static void synchronize_rcu_expedited_wait(void) tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP); } } + j = READ_ONCE(jiffies_till_first_fqs); + if (synchronize_rcu_expedited_wait_once(j + HZ)) + return; + WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)); } for (;;) { diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 6c4b862f57d6..feaaec5747a3 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -183,6 +183,8 @@ void rcu_unexpedite_gp(void) } EXPORT_SYMBOL_GPL(rcu_unexpedite_gp); +static bool rcu_boot_ended __read_mostly; + /* * Inform RCU of the end of the in-kernel boot sequence. */ @@ -191,7 +193,17 @@ void rcu_end_inkernel_boot(void) rcu_unexpedite_gp(); if (rcu_normal_after_boot) WRITE_ONCE(rcu_normal, 1); + rcu_boot_ended = 1; +} + +/* + * Let rcutorture know when it is OK to turn it up to eleven. + */ +bool rcu_inkernel_boot_has_ended(void) +{ + return rcu_boot_ended; } +EXPORT_SYMBOL_GPL(rcu_inkernel_boot_has_ended); #endif /* #ifndef CONFIG_TINY_RCU */ -- cgit v1.2.3-71-gd317 From 435508095ab5b6870e8140948983920ce4684e9b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 4 Dec 2019 15:58:41 -0800 Subject: rcutorture: Refrain from callback flooding during boot Additional rcutorture aggression can result in, believe it or not, boot times in excess of three minutes on large hyperthreaded systems. This is long enough for rcutorture to decide to do some callback flooding, which seems a bit excessive given that userspace cannot have started until long after boot, and it is userspace that does the real-world callback flooding. Worse yet, because Tiny RCU lacks forward-progress functionality, the looping-in-the-kernel tests can also be problematic during early boot. This commit therefore causes rcutorture to hold off on callback flooding until about the time that init is spawned, and the same for looping-in-the-kernel tests for Tiny RCU. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 9ba49788cb48..08fa4ef23914 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1994,8 +1994,11 @@ static int rcu_torture_fwd_prog(void *args) schedule_timeout_interruptible(fwd_progress_holdoff * HZ); WRITE_ONCE(rcu_fwd_emergency_stop, false); register_oom_notifier(&rcutorture_oom_nb); - rcu_torture_fwd_prog_nr(rfp, &tested, &tested_tries); - rcu_torture_fwd_prog_cr(rfp); + if (!IS_ENABLED(CONFIG_TINY_RCU) || + rcu_inkernel_boot_has_ended()) + rcu_torture_fwd_prog_nr(rfp, &tested, &tested_tries); + if (rcu_inkernel_boot_has_ended()) + rcu_torture_fwd_prog_cr(rfp); unregister_oom_notifier(&rcutorture_oom_nb); /* Avoid slow periods, better to test when busy. */ -- cgit v1.2.3-71-gd317 From a59ee765a6890e7f4281070008a2654337458311 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 5 Dec 2019 10:49:11 -0800 Subject: torture: Forgive -EBUSY from boottime CPU-hotplug operations During boot, CPU hotplug is often disabled, for example by PCI probing. On large systems that take substantial time to boot, this can result in spurious RCU_HOTPLUG errors. This commit therefore forgives any boottime -EBUSY CPU-hotplug failures by adjusting counters to pretend that the corresponding attempt never happened. A non-splat record of the failed attempt is emitted to the console with the added string "(-EBUSY forgiven during boot)". Signed-off-by: Paul E. McKenney --- kernel/torture.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/torture.c b/kernel/torture.c index 7c13f5558b71..e377b5b17de8 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -84,6 +84,7 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes, { unsigned long delta; int ret; + char *s; unsigned long starttime; if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu)) @@ -99,10 +100,16 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes, (*n_offl_attempts)++; ret = cpu_down(cpu); if (ret) { + s = ""; + if (!rcu_inkernel_boot_has_ended() && ret == -EBUSY) { + // PCI probe frequently disables hotplug during boot. + (*n_offl_attempts)--; + s = " (-EBUSY forgiven during boot)"; + } if (verbose) pr_alert("%s" TORTURE_FLAG - "torture_onoff task: offline %d failed: errno %d\n", - torture_type, cpu, ret); + "torture_onoff task: offline %d failed%s: errno %d\n", + torture_type, cpu, s, ret); } else { if (verbose > 1) pr_alert("%s" TORTURE_FLAG @@ -137,6 +144,7 @@ bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes, { unsigned long delta; int ret; + char *s; unsigned long starttime; if (cpu_online(cpu) || !cpu_is_hotpluggable(cpu)) @@ -150,10 +158,16 @@ bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes, (*n_onl_attempts)++; ret = cpu_up(cpu); if (ret) { + s = ""; + if (!rcu_inkernel_boot_has_ended() && ret == -EBUSY) { + // PCI probe frequently disables hotplug during boot. + (*n_onl_attempts)--; + s = " (-EBUSY forgiven during boot)"; + } if (verbose) pr_alert("%s" TORTURE_FLAG - "torture_onoff task: online %d failed: errno %d\n", - torture_type, cpu, ret); + "torture_onoff task: online %d failed%s: errno %d\n", + torture_type, cpu, s, ret); } else { if (verbose > 1) pr_alert("%s" TORTURE_FLAG -- cgit v1.2.3-71-gd317 From 58c53360b36d2077cbb843e7ad2bf75f0498271c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 5 Dec 2019 11:29:01 -0800 Subject: rcutorture: Allow boottime stall warnings to be suppressed In normal production, an RCU CPU stall warning at boottime is often just as bad as at any other time. In fact, given the desire for fast boot, any sort of long-term stall at boot is a bad idea. However, heavy rcutorture testing on large hyperthreaded systems can generate boottime RCU CPU stalls as a matter of course. This commit therefore provides a kernel boot parameter that suppresses reporting of boottime RCU CPU stall warnings and similarly of rcutorture writer stalls. Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 6 ++++++ kernel/rcu/rcu.h | 17 +++++++++++++++++ kernel/rcu/rcutorture.c | 2 +- kernel/rcu/tree_exp.h | 2 +- kernel/rcu/tree_stall.h | 6 +++--- kernel/rcu/update.c | 8 +++++++- 6 files changed, 35 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index dbc22d684627..ee007b5c874f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4195,6 +4195,12 @@ rcupdate.rcu_cpu_stall_suppress= [KNL] Suppress RCU CPU stall warning messages. + rcupdate.rcu_cpu_stall_suppress_at_boot= [KNL] + Suppress RCU CPU stall warning messages and + rcutorture writer stall warnings that occur + during early boot, that is, during the time + before the init task is spawned. + rcupdate.rcu_cpu_stall_timeout= [KNL] Set timeout for RCU CPU stall warning messages. diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 05f936ed167a..1779cbf33cd1 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -198,6 +198,13 @@ static inline void debug_rcu_head_unqueue(struct rcu_head *head) } #endif /* #else !CONFIG_DEBUG_OBJECTS_RCU_HEAD */ +extern int rcu_cpu_stall_suppress_at_boot; + +static inline bool rcu_stall_is_suppressed_at_boot(void) +{ + return rcu_cpu_stall_suppress_at_boot && !rcu_inkernel_boot_has_ended(); +} + #ifdef CONFIG_RCU_STALL_COMMON extern int rcu_cpu_stall_ftrace_dump; @@ -205,6 +212,11 @@ extern int rcu_cpu_stall_suppress; extern int rcu_cpu_stall_timeout; int rcu_jiffies_till_stall_check(void); +static inline bool rcu_stall_is_suppressed(void) +{ + return rcu_stall_is_suppressed_at_boot() || rcu_cpu_stall_suppress; +} + #define rcu_ftrace_dump_stall_suppress() \ do { \ if (!rcu_cpu_stall_suppress) \ @@ -218,6 +230,11 @@ do { \ } while (0) #else /* #endif #ifdef CONFIG_RCU_STALL_COMMON */ + +static inline bool rcu_stall_is_suppressed(void) +{ + return rcu_stall_is_suppressed_at_boot(); +} #define rcu_ftrace_dump_stall_suppress() #define rcu_ftrace_dump_stall_unsuppress() #endif /* #ifdef CONFIG_RCU_STALL_COMMON */ diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 08fa4ef23914..16c84ec182bd 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1479,7 +1479,7 @@ rcu_torture_stats_print(void) if (cur_ops->stats) cur_ops->stats(); if (rtcv_snap == rcu_torture_current_version && - rcu_torture_current != NULL) { + rcu_torture_current != NULL && !rcu_stall_is_suppressed()) { int __maybe_unused flags = 0; unsigned long __maybe_unused gp_seq = 0; diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index a72d16eb869e..c28d9f0034c3 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -518,7 +518,7 @@ static void synchronize_rcu_expedited_wait(void) for (;;) { if (synchronize_rcu_expedited_wait_once(jiffies_stall)) return; - if (rcu_cpu_stall_suppress) + if (rcu_stall_is_suppressed()) continue; panic_on_rcu_stall(); pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {", diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 55f9b84790d3..7ee8a1cc0d8b 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -383,7 +383,7 @@ static void print_other_cpu_stall(unsigned long gp_seq) /* Kick and suppress, if so configured. */ rcu_stall_kick_kthreads(); - if (rcu_cpu_stall_suppress) + if (rcu_stall_is_suppressed()) return; /* @@ -452,7 +452,7 @@ static void print_cpu_stall(void) /* Kick and suppress, if so configured. */ rcu_stall_kick_kthreads(); - if (rcu_cpu_stall_suppress) + if (rcu_stall_is_suppressed()) return; /* @@ -504,7 +504,7 @@ static void check_cpu_stall(struct rcu_data *rdp) unsigned long js; struct rcu_node *rnp; - if ((rcu_cpu_stall_suppress && !rcu_kick_kthreads) || + if ((rcu_stall_is_suppressed() && !rcu_kick_kthreads) || !rcu_gp_in_progress()) return; rcu_stall_kick_kthreads(); diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index feaaec5747a3..085f08a898fe 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -476,13 +476,19 @@ EXPORT_SYMBOL_GPL(rcutorture_sched_setaffinity); #ifdef CONFIG_RCU_STALL_COMMON int rcu_cpu_stall_ftrace_dump __read_mostly; module_param(rcu_cpu_stall_ftrace_dump, int, 0644); -int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */ +int rcu_cpu_stall_suppress __read_mostly; // !0 = suppress stall warnings. EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress); module_param(rcu_cpu_stall_suppress, int, 0644); int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT; module_param(rcu_cpu_stall_timeout, int, 0644); #endif /* #ifdef CONFIG_RCU_STALL_COMMON */ +// Suppress boot-time RCU CPU stall warnings and rcutorture writer stall +// warnings. Also used by rcutorture even if stall warnings are excluded. +int rcu_cpu_stall_suppress_at_boot __read_mostly; // !0 = suppress boot stalls. +EXPORT_SYMBOL_GPL(rcu_cpu_stall_suppress_at_boot); +module_param(rcu_cpu_stall_suppress_at_boot, int, 0444); + #ifdef CONFIG_TASKS_RCU /* -- cgit v1.2.3-71-gd317 From 4ab00bdd99a906c089b5c20ee7b5cb91e7c61123 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 5 Dec 2019 15:53:28 -0800 Subject: rcutorture: Suppress boottime bad-sequence warnings In normal production, an excessively long wait on a grace period (synchronize_rcu(), for example) at boottime is often just as bad as at any other time. In fact, given the desire for fast boot, any sort of long wait at boot is a bad idea. However, heavy rcutorture testing on large hyperthreaded systems can generate such long waits during boot as a matter of course. This commit therefore causes the rcupdate.rcu_cpu_stall_suppress_at_boot kernel boot parameter to suppress reporting of bootime bad-sequence warning due to excessively long grace-period waits. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 16c84ec182bd..5efd9503df56 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1423,7 +1423,8 @@ rcu_torture_stats_print(void) pr_alert("%s%s ", torture_type, TORTURE_FLAG); pr_cont("rtc: %p %s: %lu tfle: %d rta: %d rtaf: %d rtf: %d ", rcu_torture_current, - rcu_torture_current ? "ver" : "VER", + rcu_torture_current && !rcu_stall_is_suppressed_at_boot() + ? "ver" : "VER", rcu_torture_current_version, list_empty(&rcu_torture_freelist), atomic_read(&n_rcu_torture_alloc), -- cgit v1.2.3-71-gd317 From 8171d3e0dafd37a9c833904e5a936f4154a1e95b Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 6 Dec 2019 15:02:59 -0800 Subject: torture: Allow disabling of boottime CPU-hotplug torture operations In theory, RCU-hotplug operations are supposed to work as soon as there is more than one CPU online. However, in practice, in normal production there is no way to make them happen until userspace is up and running. Besides which, on smaller systems, rcutorture doesn't start doing hotplug operations until 30 seconds after the start of boot, which on most systems also means the better part of 30 seconds after the end of boot. This commit therefore provides a new torture.disable_onoff_at_boot kernel boot parameter that suppresses CPU-hotplug torture operations until about the time that init is spawned. Of course, if you know of a need for boottime CPU-hotplug operations, then you should avoid passing this argument to any of the torture tests. You might also want to look at the splats linked to below. Link: https://lore.kernel.org/lkml/20191206185208.GA25636@paulmck-ThinkPad-P72/ Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 4 ++++ kernel/torture.c | 7 +++++++ 2 files changed, 11 insertions(+) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ee007b5c874f..868f59a48580 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4873,6 +4873,10 @@ topology updates sent by the hypervisor to this LPAR. + torture.disable_onoff_at_boot= [KNL] + Prevent the CPU-hotplug component of torturing + until after init has spawned. + tp720= [HW,PS2] tpm_suspend_pcr=[HW,TPM] diff --git a/kernel/torture.c b/kernel/torture.c index e377b5b17de8..8683375dc0c7 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -42,6 +42,9 @@ MODULE_LICENSE("GPL"); MODULE_AUTHOR("Paul E. McKenney "); +static bool disable_onoff_at_boot; +module_param(disable_onoff_at_boot, bool, 0444); + static char *torture_type; static int verbose; @@ -229,6 +232,10 @@ torture_onoff(void *arg) VERBOSE_TOROUT_STRING("torture_onoff end holdoff"); } while (!torture_must_stop()) { + if (disable_onoff_at_boot && !rcu_inkernel_boot_has_ended()) { + schedule_timeout_interruptible(HZ / 10); + continue; + } cpu = (torture_random(&rand) >> 4) % (maxcpu + 1); if (!torture_offline(cpu, &n_offline_attempts, &n_offline_successes, -- cgit v1.2.3-71-gd317 From 202489101f2e6cee3f6dffc087a4abd5fdfcebda Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 21 Dec 2019 10:41:48 -0800 Subject: rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race The ->rtort_pipe_count field in the rcu_torture structure checks for too-short grace periods, and is therefore read by rcutorture's readers while being updated by rcutorture's writers. This commit therefore adds the needed READ_ONCE() and WRITE_ONCE() invocations. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being rcutorture. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 5efd9503df56..edd97465a0f7 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -375,11 +375,12 @@ rcu_torture_pipe_update_one(struct rcu_torture *rp) { int i; - i = rp->rtort_pipe_count; + i = READ_ONCE(rp->rtort_pipe_count); if (i > RCU_TORTURE_PIPE_LEN) i = RCU_TORTURE_PIPE_LEN; atomic_inc(&rcu_torture_wcount[i]); - if (++rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { + WRITE_ONCE(rp->rtort_pipe_count, i + 1); + if (rp->rtort_pipe_count >= RCU_TORTURE_PIPE_LEN) { rp->rtort_mbtest = 0; return true; } @@ -1015,7 +1016,8 @@ rcu_torture_writer(void *arg) if (i > RCU_TORTURE_PIPE_LEN) i = RCU_TORTURE_PIPE_LEN; atomic_inc(&rcu_torture_wcount[i]); - old_rp->rtort_pipe_count++; + WRITE_ONCE(old_rp->rtort_pipe_count, + old_rp->rtort_pipe_count + 1); switch (synctype[torture_random(&rand) % nsynctypes]) { case RTWS_DEF_FREE: rcu_torture_writer_state = RTWS_DEF_FREE; @@ -1291,7 +1293,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp) atomic_inc(&n_rcu_torture_mberror); rtrsp = rcutorture_loop_extend(&readstate, trsp, rtrsp); preempt_disable(); - pipe_count = p->rtort_pipe_count; + pipe_count = READ_ONCE(p->rtort_pipe_count); if (pipe_count > RCU_TORTURE_PIPE_LEN) { /* Should not happen, but... */ pipe_count = RCU_TORTURE_PIPE_LEN; -- cgit v1.2.3-71-gd317 From 102c14d2f87976e8390d2cb892ccd14e3532e020 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Sat, 21 Dec 2019 11:23:50 -0800 Subject: rcutorture: Fix stray access to rcu_fwd_cb_nodelay The rcu_fwd_cb_nodelay variable suppresses excessively long read-side delays while carrying out an rcutorture forward-progress test. As such, it is accessed both by readers and updaters, and most of the accesses therefore use *_ONCE(). Except for one in rcu_read_delay(), which this commit fixes. This data race was reported by KCSAN. Not appropriate for backporting due to this being rcutorture. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index edd97465a0f7..124160a610fa 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -339,7 +339,7 @@ rcu_read_delay(struct torture_random_state *rrsp, struct rt_read_seg *rtrsp) * period, and we want a long delay occasionally to trigger * force_quiescent_state. */ - if (!rcu_fwd_cb_nodelay && + if (!READ_ONCE(rcu_fwd_cb_nodelay) && !(torture_random(rrsp) % (nrealreaders * 2000 * longdelay_ms))) { started = cur_ops->get_gp_seq(); ts = rcu_trace_clock_local(); -- cgit v1.2.3-71-gd317 From f042a436c8dc9f9cfe8ed1ee5de372697269657d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Fri, 3 Jan 2020 16:27:00 -0800 Subject: rcutorture: Add READ_ONCE() to rcu_torture_count and rcu_torture_batch The rcutorture rcu_torture_count and rcu_torture_batch per-CPU variables are read locklessly, so this commit adds the READ_ONCE() to a load in order to avoid various types of compiler vandalism^Woptimization. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being rcutorture. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 124160a610fa..0b9ce9a00623 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1413,8 +1413,8 @@ rcu_torture_stats_print(void) for_each_possible_cpu(cpu) { for (i = 0; i < RCU_TORTURE_PIPE_LEN + 1; i++) { - pipesummary[i] += per_cpu(rcu_torture_count, cpu)[i]; - batchsummary[i] += per_cpu(rcu_torture_batch, cpu)[i]; + pipesummary[i] += READ_ONCE(per_cpu(rcu_torture_count, cpu)[i]); + batchsummary[i] += READ_ONCE(per_cpu(rcu_torture_batch, cpu)[i]); } } for (i = RCU_TORTURE_PIPE_LEN - 1; i >= 0; i--) { -- cgit v1.2.3-71-gd317 From 5396d31d3a396039502f75a128bd8064819cba61 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 8 Jan 2020 19:58:13 -0800 Subject: rcutorture: Annotation lockless accesses to rcu_torture_current The rcutorture global variable rcu_torture_current is accessed locklessly, so it must use the RCU pointer load/store primitives. This commit therefore adds several that were missed. This data race was reported by KCSAN. Not appropriate for backporting due to failure being unlikely and due to this being used only by rcutorture. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 0b9ce9a00623..7e01e9a87352 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1407,6 +1407,7 @@ rcu_torture_stats_print(void) int i; long pipesummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 }; long batchsummary[RCU_TORTURE_PIPE_LEN + 1] = { 0 }; + struct rcu_torture *rtcp; static unsigned long rtcv_snap = ULONG_MAX; static bool splatted; struct task_struct *wtp; @@ -1423,10 +1424,10 @@ rcu_torture_stats_print(void) } pr_alert("%s%s ", torture_type, TORTURE_FLAG); + rtcp = rcu_access_pointer(rcu_torture_current); pr_cont("rtc: %p %s: %lu tfle: %d rta: %d rtaf: %d rtf: %d ", - rcu_torture_current, - rcu_torture_current && !rcu_stall_is_suppressed_at_boot() - ? "ver" : "VER", + rtcp, + rtcp && !rcu_stall_is_suppressed_at_boot() ? "ver" : "VER", rcu_torture_current_version, list_empty(&rcu_torture_freelist), atomic_read(&n_rcu_torture_alloc), @@ -1482,7 +1483,8 @@ rcu_torture_stats_print(void) if (cur_ops->stats) cur_ops->stats(); if (rtcv_snap == rcu_torture_current_version && - rcu_torture_current != NULL && !rcu_stall_is_suppressed()) { + rcu_access_pointer(rcu_torture_current) && + !rcu_stall_is_suppressed()) { int __maybe_unused flags = 0; unsigned long __maybe_unused gp_seq = 0; -- cgit v1.2.3-71-gd317 From 12af660321263d7744d5d06b765c7b03fade3858 Mon Sep 17 00:00:00 2001 From: "Joel Fernandes (Google)" Date: Thu, 19 Dec 2019 11:22:42 -0500 Subject: rcuperf: Measure memory footprint during kfree_rcu() test During changes to kfree_rcu() code, we often check the amount of free memory. As an alternative to checking this manually, this commit adds a measurement in the test itself. It measures four times during the test for available memory, digitally filters these measurements to produce a running average with a weight of 0.5, and compares this digitally filtered value with the amount of available memory at the beginning of the test. Something like the following is printed at the end of the run: Total time taken by all kfree'ers: 6369738407 ns, loops: 10000, batches: 764, memory footprint: 216MB Signed-off-by: Joel Fernandes (Google) Signed-off-by: Paul E. McKenney --- kernel/rcu/rcuperf.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index da94b89cd531..a4a8d097d84d 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -611,6 +612,7 @@ kfree_perf_thread(void *arg) long me = (long)arg; struct kfree_obj *alloc_ptr; u64 start_time, end_time; + long long mem_begin, mem_during = 0; VERBOSE_PERFOUT_STRING("kfree_perf_thread task started"); set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); @@ -626,6 +628,12 @@ kfree_perf_thread(void *arg) } do { + if (!mem_during) { + mem_during = mem_begin = si_mem_available(); + } else if (loop % (kfree_loops / 4) == 0) { + mem_during = (mem_during + si_mem_available()) / 2; + } + for (i = 0; i < kfree_alloc_num; i++) { alloc_ptr = kmalloc(sizeof(struct kfree_obj), GFP_KERNEL); if (!alloc_ptr) @@ -645,9 +653,11 @@ kfree_perf_thread(void *arg) else b_rcu_gp_test_finished = cur_ops->get_gp_seq(); - pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld\n", + pr_alert("Total time taken by all kfree'ers: %llu ns, loops: %d, batches: %ld, memory footprint: %lldMB\n", (unsigned long long)(end_time - start_time), kfree_loops, - rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started)); + rcuperf_seq_diff(b_rcu_gp_test_finished, b_rcu_gp_test_started), + (mem_begin - mem_during) >> (20 - PAGE_SHIFT)); + if (shutdown) { smp_mb(); /* Assign before wake. */ wake_up(&shutdown_wq); -- cgit v1.2.3-71-gd317 From 50d4b62970e21e9573daf0e3c1138b4d1ebcca47 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 4 Feb 2020 15:00:56 -0800 Subject: rcutorture: Make rcu_torture_barrier_cbs() post from corresponding CPU Currently, rcu_torture_barrier_cbs() posts callbacks from whatever CPU it is running on, which means that all these kthreads might well be posting from the same CPU, which would drastically reduce the effectiveness of this test. This commit therefore uses IPIs to make the callbacks be posted from the corresponding CPU (given by local variable myid). If the IPI fails (which can happen if the target CPU is offline or does not exist at all), the callback is posted on whatever CPU is currently running. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 7e01e9a87352..f82515cded34 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2053,6 +2053,14 @@ static void rcu_torture_barrier_cbf(struct rcu_head *rcu) atomic_inc(&barrier_cbs_invoked); } +/* IPI handler to get callback posted on desired CPU, if online. */ +static void rcu_torture_barrier1cb(void *rcu_void) +{ + struct rcu_head *rhp = rcu_void; + + cur_ops->call(rhp, rcu_torture_barrier_cbf); +} + /* kthread function to register callbacks used to test RCU barriers. */ static int rcu_torture_barrier_cbs(void *arg) { @@ -2076,9 +2084,11 @@ static int rcu_torture_barrier_cbs(void *arg) * The above smp_load_acquire() ensures barrier_phase load * is ordered before the following ->call(). */ - local_irq_disable(); /* Just to test no-irq call_rcu(). */ - cur_ops->call(&rcu, rcu_torture_barrier_cbf); - local_irq_enable(); + if (smp_call_function_single(myid, rcu_torture_barrier1cb, + &rcu, 1)) { + // IPI failed, so use direct call from current CPU. + cur_ops->call(&rcu, rcu_torture_barrier_cbf); + } if (atomic_dec_and_test(&barrier_cbs_count)) wake_up(&barrier_wq); } while (!torture_must_stop()); -- cgit v1.2.3-71-gd317 From 9470a18fabd056e67ee12059dab04faf6e1f253c Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 5 Feb 2020 12:54:34 -0800 Subject: rcutorture: Manually clean up after rcu_barrier() failure Currently, if rcu_barrier() returns too soon, the test waits 100ms and then does another instance of the test. However, if rcu_barrier() were to have waited for more than 100ms too short a time, this could cause the test's rcu_head structures to be reused while they were still on RCU's callback lists. This can result in knock-on errors that obscure the original rcu_barrier() test failure. This commit therefore adds code that attempts to wait until all of the test's callbacks have been invoked. Of course, if RCU completely lost track of the corresponding rcu_head structures, this wait could be forever. This commit therefore also complains if this attempted recovery takes more than one second, and it also gives up when the test ends. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index f82515cded34..5453bd557f43 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2124,7 +2124,21 @@ static int rcu_torture_barrier(void *arg) pr_err("barrier_cbs_invoked = %d, n_barrier_cbs = %d\n", atomic_read(&barrier_cbs_invoked), n_barrier_cbs); - WARN_ON_ONCE(1); + WARN_ON(1); + // Wait manually for the remaining callbacks + i = 0; + do { + if (WARN_ON(i++ > HZ)) + i = INT_MIN; + schedule_timeout_interruptible(1); + cur_ops->cb_barrier(); + } while (atomic_read(&barrier_cbs_invoked) != + n_barrier_cbs && + !torture_must_stop()); + smp_mb(); // Can't trust ordering if broken. + if (!torture_must_stop()) + pr_err("Recovered: barrier_cbs_invoked = %d\n", + atomic_read(&barrier_cbs_invoked)); } else { n_barrier_successes++; } -- cgit v1.2.3-71-gd317 From 9fed9000c5c6cacfcaaa48aff74818072ae294cc Mon Sep 17 00:00:00 2001 From: Jakub Sitnicki Date: Tue, 18 Feb 2020 17:10:20 +0000 Subject: bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH SOCKMAP & SOCKHASH now support storing references to listening sockets. Nothing keeps us from using these map types a collection of sockets to select from in BPF reuseport programs. Whitelist the map types with the bpf_sk_select_reuseport helper. The restriction that the socket has to be a member of a reuseport group still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb set are not a valid target and we signal it with -EINVAL. The main benefit from this change is that, in contrast to REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a listening socket can be just one BPF map at the same time. Signed-off-by: Jakub Sitnicki Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com --- kernel/bpf/verifier.c | 10 +++++++--- net/core/filter.c | 15 ++++++++++----- 2 files changed, 17 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1cc945daa9c8..6d15dfbd4b88 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3693,14 +3693,16 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (func_id != BPF_FUNC_sk_redirect_map && func_id != BPF_FUNC_sock_map_update && func_id != BPF_FUNC_map_delete_elem && - func_id != BPF_FUNC_msg_redirect_map) + func_id != BPF_FUNC_msg_redirect_map && + func_id != BPF_FUNC_sk_select_reuseport) goto error; break; case BPF_MAP_TYPE_SOCKHASH: if (func_id != BPF_FUNC_sk_redirect_hash && func_id != BPF_FUNC_sock_hash_update && func_id != BPF_FUNC_map_delete_elem && - func_id != BPF_FUNC_msg_redirect_hash) + func_id != BPF_FUNC_msg_redirect_hash && + func_id != BPF_FUNC_sk_select_reuseport) goto error; break; case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY: @@ -3774,7 +3776,9 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, goto error; break; case BPF_FUNC_sk_select_reuseport: - if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) + if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY && + map->map_type != BPF_MAP_TYPE_SOCKMAP && + map->map_type != BPF_MAP_TYPE_SOCKHASH) goto error; break; case BPF_FUNC_map_peek_elem: diff --git a/net/core/filter.c b/net/core/filter.c index c180871e606d..77d2f471b3bb 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8620,6 +8620,7 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern, struct bpf_map *, map, void *, key, u32, flags) { + bool is_sockarray = map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY; struct sock_reuseport *reuse; struct sock *selected_sk; @@ -8628,12 +8629,16 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern, return -ENOENT; reuse = rcu_dereference(selected_sk->sk_reuseport_cb); - if (!reuse) - /* selected_sk is unhashed (e.g. by close()) after the - * above map_lookup_elem(). Treat selected_sk has already - * been removed from the map. + if (!reuse) { + /* reuseport_array has only sk with non NULL sk_reuseport_cb. + * The only (!reuse) case here is - the sk has already been + * unhashed (e.g. by close()), so treat it as -ENOENT. + * + * Other maps (e.g. sock_map) do not provide this guarantee and + * the sk may never be in the reuseport group to begin with. */ - return -ENOENT; + return is_sockarray ? -ENOENT : -EINVAL; + } if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) { struct sock *sk; -- cgit v1.2.3-71-gd317 From 035ff358f2d9e2f5e1639ba4defe4dc40ac642dd Mon Sep 17 00:00:00 2001 From: Jakub Sitnicki Date: Tue, 18 Feb 2020 17:10:21 +0000 Subject: net: Generate reuseport group ID on group creation Commit 736b46027eb4 ("net: Add ID (if needed) to sock_reuseport and expose reuseport_lock") has introduced lazy generation of reuseport group IDs that survive group resize. By comparing the identifier we check if BPF reuseport program is not trying to select a socket from a BPF map that belongs to a different reuseport group than the one the packet is for. Because SOCKARRAY used to be the only BPF map type that can be used with reuseport BPF, it was possible to delay the generation of reuseport group ID until a socket from the group was inserted into BPF map for the first time. Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options, either generate the reuseport ID on map update, like SOCKARRAY does, or allocate an ID from the start when reuseport group gets created. This patch takes the latter approach to keep sockmap free of calls into reuseport code. This streamlines the reuseport_id access as its lifetime now matches the longevity of reuseport object. The cost of this simplification, however, is that we allocate reuseport IDs for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their setups. With the way identifiers are currently generated, we can have at most S32_MAX reuseport groups, which hopefully is sufficient. If we ever get close to the limit, we can switch an u64 counter like sk_cookie. Another change is that we now always call into SOCKARRAY logic to unlink the socket from the map when unhashing or closing the socket. Previously we did it only when at least one socket from the group was in a BPF map. It is worth noting that this doesn't conflict with sockmap tear-down in case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport group. sockmap tear-down happens first: prot->unhash `- tcp_bpf_unhash |- tcp_bpf_remove | `- while (sk_psock_link_pop(psock)) | `- sk_psock_unlink | `- sock_map_delete_from_link | `- __sock_map_delete | `- sock_map_unref | `- sk_psock_put | `- sk_psock_drop | `- rcu_assign_sk_user_data(sk, NULL) `- inet_unhash `- reuseport_detach_sock `- bpf_sk_reuseport_detach `- WRITE_ONCE(sk->sk_user_data, NULL) Suggested-by: Martin Lau Signed-off-by: Jakub Sitnicki Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com --- include/net/sock_reuseport.h | 2 -- kernel/bpf/reuseport_array.c | 5 ----- net/core/filter.c | 12 +---------- net/core/sock_reuseport.c | 50 +++++++++++++++++++------------------------- 4 files changed, 22 insertions(+), 47 deletions(-) (limited to 'kernel') diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 43f4a818d88f..3ecaa15d1850 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -55,6 +55,4 @@ static inline bool reuseport_has_conns(struct sock *sk, bool set) return ret; } -int reuseport_get_id(struct sock_reuseport *reuse); - #endif /* _SOCK_REUSEPORT_H */ diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c index 50c083ba978c..01badd3eda7a 100644 --- a/kernel/bpf/reuseport_array.c +++ b/kernel/bpf/reuseport_array.c @@ -305,11 +305,6 @@ int bpf_fd_reuseport_array_update_elem(struct bpf_map *map, void *key, if (err) goto put_file_unlock; - /* Ensure reuse->reuseport_id is set */ - err = reuseport_get_id(reuse); - if (err < 0) - goto put_file_unlock; - WRITE_ONCE(nsk->sk_user_data, &array->ptrs[index]); rcu_assign_pointer(array->ptrs[index], nsk); free_osk = osk; diff --git a/net/core/filter.c b/net/core/filter.c index 77d2f471b3bb..925b23de218b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8641,18 +8641,8 @@ BPF_CALL_4(sk_select_reuseport, struct sk_reuseport_kern *, reuse_kern, } if (unlikely(reuse->reuseport_id != reuse_kern->reuseport_id)) { - struct sock *sk; - - if (unlikely(!reuse_kern->reuseport_id)) - /* There is a small race between adding the - * sk to the map and setting the - * reuse_kern->reuseport_id. - * Treat it as the sk has not been added to - * the bpf map yet. - */ - return -ENOENT; + struct sock *sk = reuse_kern->sk; - sk = reuse_kern->sk; if (sk->sk_protocol != selected_sk->sk_protocol) return -EPROTOTYPE; else if (sk->sk_family != selected_sk->sk_family) diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 91e9f2223c39..adcb3aea576d 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -16,27 +16,8 @@ DEFINE_SPINLOCK(reuseport_lock); -#define REUSEPORT_MIN_ID 1 static DEFINE_IDA(reuseport_ida); -int reuseport_get_id(struct sock_reuseport *reuse) -{ - int id; - - if (reuse->reuseport_id) - return reuse->reuseport_id; - - id = ida_simple_get(&reuseport_ida, REUSEPORT_MIN_ID, 0, - /* Called under reuseport_lock */ - GFP_ATOMIC); - if (id < 0) - return id; - - reuse->reuseport_id = id; - - return reuse->reuseport_id; -} - static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks) { unsigned int size = sizeof(struct sock_reuseport) + @@ -55,6 +36,7 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks) int reuseport_alloc(struct sock *sk, bool bind_inany) { struct sock_reuseport *reuse; + int id, ret = 0; /* bh lock used since this function call may precede hlist lock in * soft irq of receive path or setsockopt from process context @@ -78,10 +60,18 @@ int reuseport_alloc(struct sock *sk, bool bind_inany) reuse = __reuseport_alloc(INIT_SOCKS); if (!reuse) { - spin_unlock_bh(&reuseport_lock); - return -ENOMEM; + ret = -ENOMEM; + goto out; } + id = ida_alloc(&reuseport_ida, GFP_ATOMIC); + if (id < 0) { + kfree(reuse); + ret = id; + goto out; + } + + reuse->reuseport_id = id; reuse->socks[0] = sk; reuse->num_socks = 1; reuse->bind_inany = bind_inany; @@ -90,7 +80,7 @@ int reuseport_alloc(struct sock *sk, bool bind_inany) out: spin_unlock_bh(&reuseport_lock); - return 0; + return ret; } EXPORT_SYMBOL(reuseport_alloc); @@ -134,8 +124,7 @@ static void reuseport_free_rcu(struct rcu_head *head) reuse = container_of(head, struct sock_reuseport, rcu); sk_reuseport_prog_free(rcu_dereference_protected(reuse->prog, 1)); - if (reuse->reuseport_id) - ida_simple_remove(&reuseport_ida, reuse->reuseport_id); + ida_free(&reuseport_ida, reuse->reuseport_id); kfree(reuse); } @@ -199,12 +188,15 @@ void reuseport_detach_sock(struct sock *sk) reuse = rcu_dereference_protected(sk->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); - /* At least one of the sk in this reuseport group is added to - * a bpf map. Notify the bpf side. The bpf map logic will - * remove the sk if it is indeed added to a bpf map. + /* Notify the bpf side. The sk may be added to a sockarray + * map. If so, sockarray logic will remove it from the map. + * + * Other bpf map types that work with reuseport, like sockmap, + * don't need an explicit callback from here. They override sk + * unhash/close ops to remove the sk from the map before we + * get to this point. */ - if (reuse->reuseport_id) - bpf_sk_reuseport_detach(sk); + bpf_sk_reuseport_detach(sk); rcu_assign_pointer(sk->sk_reuseport_cb, NULL); -- cgit v1.2.3-71-gd317 From 41ccdbfd5427bbbf3ed58b16750113b38fad1780 Mon Sep 17 00:00:00 2001 From: Daniel Jordan Date: Mon, 10 Feb 2020 13:11:00 -0500 Subject: padata: fix uninitialized return value in padata_replace() According to Geert's report[0], kernel/padata.c: warning: 'err' may be used uninitialized in this function [-Wuninitialized]: => 539:2 Warning is seen only with older compilers on certain archs. The runtime effect is potentially returning garbage down the stack when padata's cpumasks are modified before any pcrypt requests have run. Simplest fix is to initialize err to the success value. [0] http://lkml.kernel.org/r/20200210135506.11536-1-geert@linux-m68k.org Reported-by: Geert Uytterhoeven Fixes: bbefa1dd6a6d ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Daniel Jordan Cc: Herbert Xu Cc: Steffen Klassert Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu --- kernel/padata.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/padata.c b/kernel/padata.c index 72777c10bb9c..62082597d4a2 100644 --- a/kernel/padata.c +++ b/kernel/padata.c @@ -512,7 +512,7 @@ static int padata_replace_one(struct padata_shell *ps) static int padata_replace(struct padata_instance *pinst) { struct padata_shell *ps; - int err; + int err = 0; pinst->flags |= PADATA_RESET; -- cgit v1.2.3-71-gd317 From f22aef4afb0d6cc22e408d8254cf6d02d7982ef1 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:12 +0000 Subject: sched/numa: Trace when no candidate CPU was found on the preferred node sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node. The case where no candidate CPU could be found is not traced which is an important gap. The tracepoint is not fired when the task is not allowed to run on any CPU on the preferred node or the task is already running on the target CPU but neither are interesting corner cases. Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Steven Rostedt Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-3-mgorman@techsingularity.net --- kernel/sched/fair.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f38ff5a335d3..f524ce3cea82 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1848,8 +1848,10 @@ static int task_numa_migrate(struct task_struct *p) } /* No better CPU than the current one was found. */ - if (env.best_cpu == -1) + if (env.best_cpu == -1) { + trace_sched_stick_numa(p, env.src_cpu, -1); return -EAGAIN; + } best_rq = cpu_rq(env.best_cpu); if (env.best_task == NULL) { -- cgit v1.2.3-71-gd317 From b2b2042b204796190af7c20069ab790a614c36d0 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:13 +0000 Subject: sched/numa: Distinguish between the different task_numa_migrate() failure cases sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node but from the trace, it's possibile to tell the difference between "no CPU found", "migration to idle CPU failed" and "tasks could not be swapped". Extend the tracepoint accordingly. Signed-off-by: Mel Gorman [ Minor edits. ] Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Steven Rostedt Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-4-mgorman@techsingularity.net --- include/trace/events/sched.h | 49 ++++++++++++++++++++++++-------------------- kernel/sched/fair.c | 6 +++--- 2 files changed, 30 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 420e80e56e55..9c3ebb7c83a5 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -487,7 +487,11 @@ TRACE_EVENT(sched_process_hang, ); #endif /* CONFIG_DETECT_HUNG_TASK */ -DECLARE_EVENT_CLASS(sched_move_task_template, +/* + * Tracks migration of tasks from one runqueue to another. Can be used to + * detect if automatic NUMA balancing is bouncing between nodes. + */ +TRACE_EVENT(sched_move_numa, TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu), @@ -519,23 +523,7 @@ DECLARE_EVENT_CLASS(sched_move_task_template, __entry->dst_cpu, __entry->dst_nid) ); -/* - * Tracks migration of tasks from one runqueue to another. Can be used to - * detect if automatic NUMA balancing is bouncing between nodes - */ -DEFINE_EVENT(sched_move_task_template, sched_move_numa, - TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu), - - TP_ARGS(tsk, src_cpu, dst_cpu) -); - -DEFINE_EVENT(sched_move_task_template, sched_stick_numa, - TP_PROTO(struct task_struct *tsk, int src_cpu, int dst_cpu), - - TP_ARGS(tsk, src_cpu, dst_cpu) -); - -TRACE_EVENT(sched_swap_numa, +DECLARE_EVENT_CLASS(sched_numa_pair_template, TP_PROTO(struct task_struct *src_tsk, int src_cpu, struct task_struct *dst_tsk, int dst_cpu), @@ -561,11 +549,11 @@ TRACE_EVENT(sched_swap_numa, __entry->src_ngid = task_numa_group_id(src_tsk); __entry->src_cpu = src_cpu; __entry->src_nid = cpu_to_node(src_cpu); - __entry->dst_pid = task_pid_nr(dst_tsk); - __entry->dst_tgid = task_tgid_nr(dst_tsk); - __entry->dst_ngid = task_numa_group_id(dst_tsk); + __entry->dst_pid = dst_tsk ? task_pid_nr(dst_tsk) : 0; + __entry->dst_tgid = dst_tsk ? task_tgid_nr(dst_tsk) : 0; + __entry->dst_ngid = dst_tsk ? task_numa_group_id(dst_tsk) : 0; __entry->dst_cpu = dst_cpu; - __entry->dst_nid = cpu_to_node(dst_cpu); + __entry->dst_nid = dst_cpu >= 0 ? cpu_to_node(dst_cpu) : -1; ), TP_printk("src_pid=%d src_tgid=%d src_ngid=%d src_cpu=%d src_nid=%d dst_pid=%d dst_tgid=%d dst_ngid=%d dst_cpu=%d dst_nid=%d", @@ -575,6 +563,23 @@ TRACE_EVENT(sched_swap_numa, __entry->dst_cpu, __entry->dst_nid) ); +DEFINE_EVENT(sched_numa_pair_template, sched_stick_numa, + + TP_PROTO(struct task_struct *src_tsk, int src_cpu, + struct task_struct *dst_tsk, int dst_cpu), + + TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu) +); + +DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa, + + TP_PROTO(struct task_struct *src_tsk, int src_cpu, + struct task_struct *dst_tsk, int dst_cpu), + + TP_ARGS(src_tsk, src_cpu, dst_tsk, dst_cpu) +); + + /* * Tracepoint for waking a polling cpu without an IPI. */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f524ce3cea82..5d9c23c134af 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1849,7 +1849,7 @@ static int task_numa_migrate(struct task_struct *p) /* No better CPU than the current one was found. */ if (env.best_cpu == -1) { - trace_sched_stick_numa(p, env.src_cpu, -1); + trace_sched_stick_numa(p, env.src_cpu, NULL, -1); return -EAGAIN; } @@ -1858,7 +1858,7 @@ static int task_numa_migrate(struct task_struct *p) ret = migrate_task_to(p, env.best_cpu); WRITE_ONCE(best_rq->numa_migrate_on, 0); if (ret != 0) - trace_sched_stick_numa(p, env.src_cpu, env.best_cpu); + trace_sched_stick_numa(p, env.src_cpu, NULL, env.best_cpu); return ret; } @@ -1866,7 +1866,7 @@ static int task_numa_migrate(struct task_struct *p) WRITE_ONCE(best_rq->numa_migrate_on, 0); if (ret != 0) - trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task)); + trace_sched_stick_numa(p, env.src_cpu, env.best_task, env.best_cpu); put_task_struct(env.best_task); return ret; } -- cgit v1.2.3-71-gd317 From 6d4d22468dae3d8757af9f8b81b848a76ef4409d Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Mon, 24 Feb 2020 09:52:14 +0000 Subject: sched/fair: Reorder enqueue/dequeue_task_fair path The walk through the cgroup hierarchy during the enqueue/dequeue of a task is split in 2 distinct parts for throttled cfs_rq without any added value but making code less readable. Change the code ordering such that everything related to a cfs_rq (throttled or not) will be done in the same loop. In addition, the same steps ordering is used when updating a cfs_rq: - update_load_avg - update_cfs_group - update *h_nr_running This reordering enables the use of h_nr_running in PELT algorithm. No functional and performance changes are expected and have been noticed during tests. Signed-off-by: Vincent Guittot Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Reviewed-by: "Dietmar Eggemann " Acked-by: Peter Zijlstra Cc: Juri Lelli Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net --- kernel/sched/fair.c | 42 ++++++++++++++++++++---------------------- 1 file changed, 20 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5d9c23c134af..a6c7f8bfc0e5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5260,32 +5260,31 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) cfs_rq = cfs_rq_of(se); enqueue_entity(cfs_rq, se, flags); - /* - * end evaluation on encountering a throttled cfs_rq - * - * note: in the case of encountering a throttled cfs_rq we will - * post the final h_nr_running increment below. - */ - if (cfs_rq_throttled(cfs_rq)) - break; cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; + /* end evaluation on encountering a throttled cfs_rq */ + if (cfs_rq_throttled(cfs_rq)) + goto enqueue_throttle; + flags = ENQUEUE_WAKEUP; } for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - cfs_rq->h_nr_running++; - cfs_rq->idle_h_nr_running += idle_h_nr_running; + /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - break; + goto enqueue_throttle; update_load_avg(cfs_rq, se, UPDATE_TG); update_cfs_group(se); + + cfs_rq->h_nr_running++; + cfs_rq->idle_h_nr_running += idle_h_nr_running; } +enqueue_throttle: if (!se) { add_nr_running(rq, 1); /* @@ -5346,17 +5345,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) cfs_rq = cfs_rq_of(se); dequeue_entity(cfs_rq, se, flags); - /* - * end evaluation on encountering a throttled cfs_rq - * - * note: in the case of encountering a throttled cfs_rq we will - * post the final h_nr_running decrement below. - */ - if (cfs_rq_throttled(cfs_rq)) - break; cfs_rq->h_nr_running--; cfs_rq->idle_h_nr_running -= idle_h_nr_running; + /* end evaluation on encountering a throttled cfs_rq */ + if (cfs_rq_throttled(cfs_rq)) + goto dequeue_throttle; + /* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { /* Avoid re-evaluating load for this entity: */ @@ -5374,16 +5369,19 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - cfs_rq->h_nr_running--; - cfs_rq->idle_h_nr_running -= idle_h_nr_running; + /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - break; + goto dequeue_throttle; update_load_avg(cfs_rq, se, UPDATE_TG); update_cfs_group(se); + + cfs_rq->h_nr_running--; + cfs_rq->idle_h_nr_running -= idle_h_nr_running; } +dequeue_throttle: if (!se) sub_nr_running(rq, 1); -- cgit v1.2.3-71-gd317 From 6499b1b2dd1b8d404a16b9fbbf1af6b9b3c1d83d Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Mon, 24 Feb 2020 09:52:15 +0000 Subject: sched/numa: Replace runnable_load_avg by load_avg Similarly to what has been done for the normal load balancer, we can replace runnable_load_avg by load_avg in numa load balancing and track the other statistics like the utilization and the number of running tasks to get to better view of the current state of a node. Signed-off-by: Vincent Guittot Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Reviewed-by: "Dietmar Eggemann " Acked-by: Peter Zijlstra Cc: Juri Lelli Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-6-mgorman@techsingularity.net --- kernel/sched/fair.c | 102 +++++++++++++++++++++++++++++++++++----------------- 1 file changed, 70 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a6c7f8bfc0e5..bc3d6518a06c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; } -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq); - -static unsigned long cpu_runnable_load(struct rq *rq) -{ - return cfs_rq_runnable_load_avg(&rq->cfs); -} +/* + * 'numa_type' describes the node at the moment of load balancing. + */ +enum numa_type { + /* The node has spare capacity that can be used to run more tasks. */ + node_has_spare = 0, + /* + * The node is fully used and the tasks don't compete for more CPU + * cycles. Nevertheless, some tasks might wait before running. + */ + node_fully_busy, + /* + * The node is overloaded and can't provide expected CPU cycles to all + * tasks. + */ + node_overloaded +}; /* Cached statistics for all CPUs within a node */ struct numa_stats { unsigned long load; - + unsigned long util; /* Total compute capacity of CPUs on a node */ unsigned long compute_capacity; + unsigned int nr_running; + unsigned int weight; + enum numa_type node_type; }; -/* - * XXX borrowed from update_sg_lb_stats - */ -static void update_numa_stats(struct numa_stats *ns, int nid) -{ - int cpu; - - memset(ns, 0, sizeof(*ns)); - for_each_cpu(cpu, cpumask_of_node(nid)) { - struct rq *rq = cpu_rq(cpu); - - ns->load += cpu_runnable_load(rq); - ns->compute_capacity += capacity_of(cpu); - } - -} - struct task_numa_env { struct task_struct *p; @@ -1521,6 +1518,47 @@ struct task_numa_env { int best_cpu; }; +static unsigned long cpu_load(struct rq *rq); +static unsigned long cpu_util(int cpu); + +static inline enum +numa_type numa_classify(unsigned int imbalance_pct, + struct numa_stats *ns) +{ + if ((ns->nr_running > ns->weight) && + ((ns->compute_capacity * 100) < (ns->util * imbalance_pct))) + return node_overloaded; + + if ((ns->nr_running < ns->weight) || + ((ns->compute_capacity * 100) > (ns->util * imbalance_pct))) + return node_has_spare; + + return node_fully_busy; +} + +/* + * XXX borrowed from update_sg_lb_stats + */ +static void update_numa_stats(struct task_numa_env *env, + struct numa_stats *ns, int nid) +{ + int cpu; + + memset(ns, 0, sizeof(*ns)); + for_each_cpu(cpu, cpumask_of_node(nid)) { + struct rq *rq = cpu_rq(cpu); + + ns->load += cpu_load(rq); + ns->util += cpu_util(cpu); + ns->nr_running += rq->cfs.h_nr_running; + ns->compute_capacity += capacity_of(cpu); + } + + ns->weight = cpumask_weight(cpumask_of_node(nid)); + + ns->node_type = numa_classify(env->imbalance_pct, ns); +} + static void task_numa_assign(struct task_numa_env *env, struct task_struct *p, long imp) { @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load, long orig_src_load, orig_dst_load; long src_capacity, dst_capacity; + + /* If dst node has spare capacity, there is no real load imbalance */ + if (env->dst_stats.node_type == node_has_spare) + return false; + /* * The load is corrected for the CPU capacity available on each node. * @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p) dist = env.dist = node_distance(env.src_nid, env.dst_nid); taskweight = task_weight(p, env.src_nid, dist); groupweight = group_weight(p, env.src_nid, dist); - update_numa_stats(&env.src_stats, env.src_nid); + update_numa_stats(&env, &env.src_stats, env.src_nid); taskimp = task_weight(p, env.dst_nid, dist) - taskweight; groupimp = group_weight(p, env.dst_nid, dist) - groupweight; - update_numa_stats(&env.dst_stats, env.dst_nid); + update_numa_stats(&env, &env.dst_stats, env.dst_nid); /* Try to find a spot on the preferred nid. */ task_numa_find_cpu(&env, taskimp, groupimp); @@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p) env.dist = dist; env.dst_nid = nid; - update_numa_stats(&env.dst_stats, env.dst_nid); + update_numa_stats(&env, &env.dst_stats, env.dst_nid); task_numa_find_cpu(&env, taskimp, groupimp); } } @@ -3686,11 +3729,6 @@ static void remove_entity_load_avg(struct sched_entity *se) raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags); } -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq) -{ - return cfs_rq->avg.runnable_load_avg; -} - static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) { return cfs_rq->avg.load_avg; -- cgit v1.2.3-71-gd317 From fb86f5b2119245afd339280099b4e9417cc0b03a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:16 +0000 Subject: sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity The standard load balancer generally tries to keep the number of running tasks or idle CPUs balanced between NUMA domains. The NUMA balancer allows tasks to move if there is spare capacity but this causes a conflict and utilisation between NUMA nodes gets badly skewed. This patch uses similar logic between the NUMA balancer and load balancer when deciding if a task migrating to its preferred node can use an idle CPU. Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-7-mgorman@techsingularity.net --- kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++-------------------- 1 file changed, 50 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc3d6518a06c..7a3c66f1762f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1520,6 +1520,7 @@ struct task_numa_env { static unsigned long cpu_load(struct rq *rq); static unsigned long cpu_util(int cpu); +static inline long adjust_numa_imbalance(int imbalance, int src_nr_running); static inline enum numa_type numa_classify(unsigned int imbalance_pct, @@ -1594,11 +1595,6 @@ static bool load_too_imbalanced(long src_load, long dst_load, long orig_src_load, orig_dst_load; long src_capacity, dst_capacity; - - /* If dst node has spare capacity, there is no real load imbalance */ - if (env->dst_stats.node_type == node_has_spare) - return false; - /* * The load is corrected for the CPU capacity available on each node. * @@ -1757,19 +1753,42 @@ unlock: static void task_numa_find_cpu(struct task_numa_env *env, long taskimp, long groupimp) { - long src_load, dst_load, load; bool maymove = false; int cpu; - load = task_h_load(env->p); - dst_load = env->dst_stats.load + load; - src_load = env->src_stats.load - load; - /* - * If the improvement from just moving env->p direction is better - * than swapping tasks around, check if a move is possible. + * If dst node has spare capacity, then check if there is an + * imbalance that would be overruled by the load balancer. */ - maymove = !load_too_imbalanced(src_load, dst_load, env); + if (env->dst_stats.node_type == node_has_spare) { + unsigned int imbalance; + int src_running, dst_running; + + /* + * Would movement cause an imbalance? Note that if src has + * more running tasks that the imbalance is ignored as the + * move improves the imbalance from the perspective of the + * CPU load balancer. + * */ + src_running = env->src_stats.nr_running - 1; + dst_running = env->dst_stats.nr_running + 1; + imbalance = max(0, dst_running - src_running); + imbalance = adjust_numa_imbalance(imbalance, src_running); + + /* Use idle CPU if there is no imbalance */ + if (!imbalance) + maymove = true; + } else { + long src_load, dst_load, load; + /* + * If the improvement from just moving env->p direction is better + * than swapping tasks around, check if a move is possible. + */ + load = task_h_load(env->p); + dst_load = env->dst_stats.load + load; + src_load = env->src_stats.load - load; + maymove = !load_too_imbalanced(src_load, dst_load, env); + } for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) { /* Skip this CPU if the source task cannot migrate */ @@ -8694,6 +8713,21 @@ next_group: } } +static inline long adjust_numa_imbalance(int imbalance, int src_nr_running) +{ + unsigned int imbalance_min; + + /* + * Allow a small imbalance based on a simple pair of communicating + * tasks that remain local when the source domain is almost idle. + */ + imbalance_min = 2; + if (src_nr_running <= imbalance_min) + return 0; + + return imbalance; +} + /** * calculate_imbalance - Calculate the amount of imbalance present within the * groups of a given sched_domain during load balance. @@ -8790,24 +8824,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s } /* Consider allowing a small imbalance between NUMA groups */ - if (env->sd->flags & SD_NUMA) { - unsigned int imbalance_min; - - /* - * Compute an allowed imbalance based on a simple - * pair of communicating tasks that should remain - * local and ignore them. - * - * NOTE: Generally this would have been based on - * the domain size and this was evaluated. However, - * the benefit is similar across a range of workloads - * and machines but scaling by the domain size adds - * the risk that lower domains have to be rebalanced. - */ - imbalance_min = 2; - if (busiest->sum_nr_running <= imbalance_min) - env->imbalance = 0; - } + if (env->sd->flags & SD_NUMA) + env->imbalance = adjust_numa_imbalance(env->imbalance, + busiest->sum_nr_running); return; } -- cgit v1.2.3-71-gd317 From 0dacee1bfa70e171be3a12a30414c228453048d2 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Mon, 24 Feb 2020 09:52:17 +0000 Subject: sched/pelt: Remove unused runnable load average Now that runnable_load_avg is no more used, we can remove it to make space for a new signal. Signed-off-by: Vincent Guittot Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Reviewed-by: "Dietmar Eggemann " Acked-by: Peter Zijlstra Cc: Juri Lelli Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net --- include/linux/sched.h | 5 +- kernel/sched/core.c | 2 - kernel/sched/debug.c | 8 ---- kernel/sched/fair.c | 130 +++++++------------------------------------------- kernel/sched/pelt.c | 62 ++++++++++-------------- kernel/sched/sched.h | 7 +-- 6 files changed, 43 insertions(+), 171 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 04278493bf15..037eaffabc24 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -357,7 +357,7 @@ struct util_est { /* * The load_avg/util_avg accumulates an infinite geometric series - * (see __update_load_avg() in kernel/sched/fair.c). + * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c). * * [load_avg definition] * @@ -401,11 +401,9 @@ struct util_est { struct sched_avg { u64 last_update_time; u64 load_sum; - u64 runnable_load_sum; u32 util_sum; u32 period_contrib; unsigned long load_avg; - unsigned long runnable_load_avg; unsigned long util_avg; struct util_est util_est; } ____cacheline_aligned; @@ -449,7 +447,6 @@ struct sched_statistics { struct sched_entity { /* For load-balancing: */ struct load_weight load; - unsigned long runnable_weight; struct rb_node run_node; struct list_head group_node; unsigned int on_rq; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e94819d573be..8e6f38073ab3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -761,7 +761,6 @@ static void set_load_weight(struct task_struct *p, bool update_load) if (task_has_idle_policy(p)) { load->weight = scale_load(WEIGHT_IDLEPRIO); load->inv_weight = WMULT_IDLEPRIO; - p->se.runnable_weight = load->weight; return; } @@ -774,7 +773,6 @@ static void set_load_weight(struct task_struct *p, bool update_load) } else { load->weight = scale_load(sched_prio_to_weight[prio]); load->inv_weight = sched_prio_to_wmult[prio]; - p->se.runnable_weight = load->weight; } } diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 879d3ccf3806..cfecaad387c0 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -402,11 +402,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group } P(se->load.weight); - P(se->runnable_weight); #ifdef CONFIG_SMP P(se->avg.load_avg); P(se->avg.util_avg); - P(se->avg.runnable_load_avg); #endif #undef PN_SCHEDSTAT @@ -524,11 +522,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running); SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight); #ifdef CONFIG_SMP - SEQ_printf(m, " .%-30s: %ld\n", "runnable_weight", cfs_rq->runnable_weight); SEQ_printf(m, " .%-30s: %lu\n", "load_avg", cfs_rq->avg.load_avg); - SEQ_printf(m, " .%-30s: %lu\n", "runnable_load_avg", - cfs_rq->avg.runnable_load_avg); SEQ_printf(m, " .%-30s: %lu\n", "util_avg", cfs_rq->avg.util_avg); SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued", @@ -947,13 +942,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, "nr_involuntary_switches", (long long)p->nivcsw); P(se.load.weight); - P(se.runnable_weight); #ifdef CONFIG_SMP P(se.avg.load_sum); - P(se.avg.runnable_load_sum); P(se.avg.util_sum); P(se.avg.load_avg); - P(se.avg.runnable_load_avg); P(se.avg.util_avg); P(se.avg.last_update_time); P(se.avg.util_est.ewma); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7a3c66f1762f..b0fb3d6a6185 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -741,9 +741,7 @@ void init_entity_runnable_average(struct sched_entity *se) * nothing has been attached to the task group yet. */ if (entity_is_task(se)) - sa->runnable_load_avg = sa->load_avg = scale_load_down(se->load.weight); - - se->runnable_weight = se->load.weight; + sa->load_avg = scale_load_down(se->load.weight); /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */ } @@ -2898,25 +2896,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) } while (0) #ifdef CONFIG_SMP -static inline void -enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ - cfs_rq->runnable_weight += se->runnable_weight; - - cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg; - cfs_rq->avg.runnable_load_sum += se_runnable(se) * se->avg.runnable_load_sum; -} - -static inline void -dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) -{ - cfs_rq->runnable_weight -= se->runnable_weight; - - sub_positive(&cfs_rq->avg.runnable_load_avg, se->avg.runnable_load_avg); - sub_positive(&cfs_rq->avg.runnable_load_sum, - se_runnable(se) * se->avg.runnable_load_sum); -} - static inline void enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { @@ -2932,28 +2911,22 @@ dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) } #else static inline void -enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } -static inline void -dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } -static inline void enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } static inline void dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { } #endif static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, - unsigned long weight, unsigned long runnable) + unsigned long weight) { if (se->on_rq) { /* commit outstanding execution time */ if (cfs_rq->curr == se) update_curr(cfs_rq); account_entity_dequeue(cfs_rq, se); - dequeue_runnable_load_avg(cfs_rq, se); } dequeue_load_avg(cfs_rq, se); - se->runnable_weight = runnable; update_load_set(&se->load, weight); #ifdef CONFIG_SMP @@ -2961,16 +2934,13 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, u32 divider = LOAD_AVG_MAX - 1024 + se->avg.period_contrib; se->avg.load_avg = div_u64(se_weight(se) * se->avg.load_sum, divider); - se->avg.runnable_load_avg = - div_u64(se_runnable(se) * se->avg.runnable_load_sum, divider); } while (0); #endif enqueue_load_avg(cfs_rq, se); - if (se->on_rq) { + if (se->on_rq) account_entity_enqueue(cfs_rq, se); - enqueue_runnable_load_avg(cfs_rq, se); - } + } void reweight_task(struct task_struct *p, int prio) @@ -2980,7 +2950,7 @@ void reweight_task(struct task_struct *p, int prio) struct load_weight *load = &se->load; unsigned long weight = scale_load(sched_prio_to_weight[prio]); - reweight_entity(cfs_rq, se, weight, weight); + reweight_entity(cfs_rq, se, weight); load->inv_weight = sched_prio_to_wmult[prio]; } @@ -3092,50 +3062,6 @@ static long calc_group_shares(struct cfs_rq *cfs_rq) */ return clamp_t(long, shares, MIN_SHARES, tg_shares); } - -/* - * This calculates the effective runnable weight for a group entity based on - * the group entity weight calculated above. - * - * Because of the above approximation (2), our group entity weight is - * an load_avg based ratio (3). This means that it includes blocked load and - * does not represent the runnable weight. - * - * Approximate the group entity's runnable weight per ratio from the group - * runqueue: - * - * grq->avg.runnable_load_avg - * ge->runnable_weight = ge->load.weight * -------------------------- (7) - * grq->avg.load_avg - * - * However, analogous to above, since the avg numbers are slow, this leads to - * transients in the from-idle case. Instead we use: - * - * ge->runnable_weight = ge->load.weight * - * - * max(grq->avg.runnable_load_avg, grq->runnable_weight) - * ----------------------------------------------------- (8) - * max(grq->avg.load_avg, grq->load.weight) - * - * Where these max() serve both to use the 'instant' values to fix the slow - * from-idle and avoid the /0 on to-idle, similar to (6). - */ -static long calc_group_runnable(struct cfs_rq *cfs_rq, long shares) -{ - long runnable, load_avg; - - load_avg = max(cfs_rq->avg.load_avg, - scale_load_down(cfs_rq->load.weight)); - - runnable = max(cfs_rq->avg.runnable_load_avg, - scale_load_down(cfs_rq->runnable_weight)); - - runnable *= shares; - if (load_avg) - runnable /= load_avg; - - return clamp_t(long, runnable, MIN_SHARES, shares); -} #endif /* CONFIG_SMP */ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); @@ -3147,7 +3073,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); static void update_cfs_group(struct sched_entity *se) { struct cfs_rq *gcfs_rq = group_cfs_rq(se); - long shares, runnable; + long shares; if (!gcfs_rq) return; @@ -3156,16 +3082,15 @@ static void update_cfs_group(struct sched_entity *se) return; #ifndef CONFIG_SMP - runnable = shares = READ_ONCE(gcfs_rq->tg->shares); + shares = READ_ONCE(gcfs_rq->tg->shares); if (likely(se->load.weight == shares)) return; #else shares = calc_group_shares(gcfs_rq); - runnable = calc_group_runnable(gcfs_rq, shares); #endif - reweight_entity(cfs_rq_of(se), se, shares, runnable); + reweight_entity(cfs_rq_of(se), se, shares); } #else /* CONFIG_FAIR_GROUP_SCHED */ @@ -3290,11 +3215,11 @@ void set_task_rq_fair(struct sched_entity *se, * _IFF_ we look at the pure running and runnable sums. Because they * represent the very same entity, just at different points in the hierarchy. * - * Per the above update_tg_cfs_util() is trivial and simply copies the running - * sum over (but still wrong, because the group entity and group rq do not have - * their PELT windows aligned). + * Per the above update_tg_cfs_util() is trivial * and simply copies the + * running sum over (but still wrong, because the group entity and group rq do + * not have their PELT windows aligned). * - * However, update_tg_cfs_runnable() is more complex. So we have: + * However, update_tg_cfs_load() is more complex. So we have: * * ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg (2) * @@ -3375,11 +3300,11 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq } static inline void -update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) { long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum; - unsigned long runnable_load_avg, load_avg; - u64 runnable_load_sum, load_sum = 0; + unsigned long load_avg; + u64 load_sum = 0; s64 delta_sum; if (!runnable_sum) @@ -3427,20 +3352,6 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf se->avg.load_avg = load_avg; add_positive(&cfs_rq->avg.load_avg, delta_avg); add_positive(&cfs_rq->avg.load_sum, delta_sum); - - runnable_load_sum = (s64)se_runnable(se) * runnable_sum; - runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX); - - if (se->on_rq) { - delta_sum = runnable_load_sum - - se_weight(se) * se->avg.runnable_load_sum; - delta_avg = runnable_load_avg - se->avg.runnable_load_avg; - add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg); - add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum); - } - - se->avg.runnable_load_sum = runnable_sum; - se->avg.runnable_load_avg = runnable_load_avg; } static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) @@ -3468,7 +3379,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum); update_tg_cfs_util(cfs_rq, se, gcfs_rq); - update_tg_cfs_runnable(cfs_rq, se, gcfs_rq); + update_tg_cfs_load(cfs_rq, se, gcfs_rq); trace_pelt_cfs_tp(cfs_rq); trace_pelt_se_tp(se); @@ -3612,8 +3523,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se)); } - se->avg.runnable_load_sum = se->avg.load_sum; - enqueue_load_avg(cfs_rq, se); cfs_rq->avg.util_avg += se->avg.util_avg; cfs_rq->avg.util_sum += se->avg.util_sum; @@ -4074,14 +3983,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) /* * When enqueuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. - * - Add its load to cfs_rq->runnable_avg * - For group_entity, update its weight to reflect the new share of * its group cfs_rq * - Add its new weight to cfs_rq->load.weight */ update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH); update_cfs_group(se); - enqueue_runnable_load_avg(cfs_rq, se); account_entity_enqueue(cfs_rq, se); if (flags & ENQUEUE_WAKEUP) @@ -4158,13 +4065,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. - * - Subtract its load from the cfs_rq->runnable_avg. * - Subtract its previous weight from cfs_rq->load.weight. * - For group entity, update its weight to reflect the new share * of its group cfs_rq. */ update_load_avg(cfs_rq, se, UPDATE_TG); - dequeue_runnable_load_avg(cfs_rq, se); update_stats_dequeue(cfs_rq, se, flags); @@ -7649,9 +7554,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) if (cfs_rq->avg.util_sum) return false; - if (cfs_rq->avg.runnable_load_sum) - return false; - return true; } diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c index bd006b79b360..3eb0ed333dcb 100644 --- a/kernel/sched/pelt.c +++ b/kernel/sched/pelt.c @@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) */ static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa, - unsigned long load, unsigned long runnable, int running) + unsigned long load, int running) { u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */ u64 periods; @@ -121,8 +121,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa, */ if (periods) { sa->load_sum = decay_load(sa->load_sum, periods); - sa->runnable_load_sum = - decay_load(sa->runnable_load_sum, periods); sa->util_sum = decay_load((u64)(sa->util_sum), periods); /* @@ -148,8 +146,6 @@ accumulate_sum(u64 delta, struct sched_avg *sa, if (load) sa->load_sum += load * contrib; - if (runnable) - sa->runnable_load_sum += runnable * contrib; if (running) sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; @@ -186,7 +182,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa, */ static __always_inline int ___update_load_sum(u64 now, struct sched_avg *sa, - unsigned long load, unsigned long runnable, int running) + unsigned long load, int running) { u64 delta; @@ -222,7 +218,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa, * Also see the comment in accumulate_sum(). */ if (!load) - runnable = running = 0; + running = 0; /* * Now we know we crossed measurement unit boundaries. The *_avg @@ -231,14 +227,14 @@ ___update_load_sum(u64 now, struct sched_avg *sa, * Step 1: accumulate *_sum since last_update_time. If we haven't * crossed period boundaries, finish. */ - if (!accumulate_sum(delta, sa, load, runnable, running)) + if (!accumulate_sum(delta, sa, load, running)) return 0; return 1; } static __always_inline void -___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runnable) +___update_load_avg(struct sched_avg *sa, unsigned long load) { u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib; @@ -246,7 +242,6 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna * Step 2: update *_avg. */ sa->load_avg = div_u64(load * sa->load_sum, divider); - sa->runnable_load_avg = div_u64(runnable * sa->runnable_load_sum, divider); WRITE_ONCE(sa->util_avg, sa->util_sum / divider); } @@ -254,17 +249,13 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna * sched_entity: * * task: - * se_runnable() == se_weight() + * se_weight() = se->load.weight * * group: [ see update_cfs_group() ] * se_weight() = tg->weight * grq->load_avg / tg->load_avg - * se_runnable() = se_weight(se) * grq->runnable_load_avg / grq->load_avg * - * load_sum := runnable_sum - * load_avg = se_weight(se) * runnable_avg - * - * runnable_load_sum := runnable_sum - * runnable_load_avg = se_runnable(se) * runnable_avg + * load_sum := runnable + * load_avg = se_weight(se) * load_sum * * XXX collapse load_sum and runnable_load_sum * @@ -272,15 +263,12 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna * * load_sum = \Sum se_weight(se) * se->avg.load_sum * load_avg = \Sum se->avg.load_avg - * - * runnable_load_sum = \Sum se_runnable(se) * se->avg.runnable_load_sum - * runnable_load_avg = \Sum se->avg.runable_load_avg */ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se) { - if (___update_load_sum(now, &se->avg, 0, 0, 0)) { - ___update_load_avg(&se->avg, se_weight(se), se_runnable(se)); + if (___update_load_sum(now, &se->avg, 0, 0)) { + ___update_load_avg(&se->avg, se_weight(se)); trace_pelt_se_tp(se); return 1; } @@ -290,10 +278,9 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se) int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se) { - if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq, - cfs_rq->curr == se)) { + if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) { - ___update_load_avg(&se->avg, se_weight(se), se_runnable(se)); + ___update_load_avg(&se->avg, se_weight(se)); cfs_se_util_change(&se->avg); trace_pelt_se_tp(se); return 1; @@ -306,10 +293,9 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq) { if (___update_load_sum(now, &cfs_rq->avg, scale_load_down(cfs_rq->load.weight), - scale_load_down(cfs_rq->runnable_weight), cfs_rq->curr != NULL)) { - ___update_load_avg(&cfs_rq->avg, 1, 1); + ___update_load_avg(&cfs_rq->avg, 1); trace_pelt_cfs_tp(cfs_rq); return 1; } @@ -322,20 +308,19 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq) * * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked * util_sum = cpu_scale * load_sum - * runnable_load_sum = load_sum + * runnable_sum = util_sum * - * load_avg and runnable_load_avg are not supported and meaningless. + * load_avg is not supported and meaningless. * */ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) { if (___update_load_sum(now, &rq->avg_rt, - running, running, running)) { - ___update_load_avg(&rq->avg_rt, 1, 1); + ___update_load_avg(&rq->avg_rt, 1); trace_pelt_rt_tp(rq); return 1; } @@ -348,18 +333,19 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) * * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked * util_sum = cpu_scale * load_sum - * runnable_load_sum = load_sum + * runnable_sum = util_sum + * + * load_avg is not supported and meaningless. * */ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) { if (___update_load_sum(now, &rq->avg_dl, - running, running, running)) { - ___update_load_avg(&rq->avg_dl, 1, 1); + ___update_load_avg(&rq->avg_dl, 1); trace_pelt_dl_tp(rq); return 1; } @@ -373,7 +359,9 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) * * util_sum = \Sum se->avg.util_sum but se->avg.util_sum is not tracked * util_sum = cpu_scale * load_sum - * runnable_load_sum = load_sum + * runnable_sum = util_sum + * + * load_avg is not supported and meaningless. * */ @@ -401,16 +389,14 @@ int update_irq_load_avg(struct rq *rq, u64 running) * rq->clock += delta with delta >= running */ ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, - 0, 0, 0); ret += ___update_load_sum(rq->clock, &rq->avg_irq, - 1, 1, 1); if (ret) { - ___update_load_avg(&rq->avg_irq, 1, 1); + ___update_load_avg(&rq->avg_irq, 1); trace_pelt_irq_tp(rq); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 12bf82d86156..ce27e588fa7c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -489,7 +489,6 @@ struct cfs_bandwidth { }; /* CFS-related fields in a runqueue */ struct cfs_rq { struct load_weight load; - unsigned long runnable_weight; unsigned int nr_running; unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */ unsigned int idle_h_nr_running; /* SCHED_IDLE */ @@ -688,8 +687,10 @@ struct dl_rq { #ifdef CONFIG_FAIR_GROUP_SCHED /* An entity is a task if it doesn't "own" a runqueue */ #define entity_is_task(se) (!se->my_q) + #else #define entity_is_task(se) 1 + #endif #ifdef CONFIG_SMP @@ -701,10 +702,6 @@ static inline long se_weight(struct sched_entity *se) return scale_load_down(se->load.weight); } -static inline long se_runnable(struct sched_entity *se) -{ - return scale_load_down(se->runnable_weight); -} static inline bool sched_asym_prefer(int a, int b) { -- cgit v1.2.3-71-gd317 From 9f68395333ad7f5bfe2f83473fed363d4229f11c Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Mon, 24 Feb 2020 09:52:18 +0000 Subject: sched/pelt: Add a new runnable average signal Now that runnable_load_avg has been removed, we can replace it by a new signal that will highlight the runnable pressure on a cfs_rq. This signal track the waiting time of tasks on rq and can help to better define the state of rqs. At now, only util_avg is used to define the state of a rq: A rq with more that around 80% of utilization and more than 1 tasks is considered as overloaded. But the util_avg signal of a rq can become temporaly low after that a task migrated onto another rq which can bias the classification of the rq. When tasks compete for the same rq, their runnable average signal will be higher than util_avg as it will include the waiting time and we can use this signal to better classify cfs_rqs. The new runnable_avg will track the runnable time of a task which simply adds the waiting time to the running time. The runnable _avg of cfs_rq will be the /Sum of se's runnable_avg and the runnable_avg of group entity will follow the one of the rq similarly to util_avg. Signed-off-by: Vincent Guittot Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Reviewed-by: "Dietmar Eggemann " Acked-by: Peter Zijlstra Cc: Juri Lelli Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net --- include/linux/sched.h | 26 ++++++++++------- kernel/sched/debug.c | 9 ++++-- kernel/sched/fair.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++----- kernel/sched/pelt.c | 39 ++++++++++++++++++-------- kernel/sched/sched.h | 22 ++++++++++++++- 5 files changed, 142 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 037eaffabc24..2e9199bf947b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -356,28 +356,30 @@ struct util_est { } __attribute__((__aligned__(sizeof(u64)))); /* - * The load_avg/util_avg accumulates an infinite geometric series + * The load/runnable/util_avg accumulates an infinite geometric series * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c). * * [load_avg definition] * * load_avg = runnable% * scale_load_down(load) * - * where runnable% is the time ratio that a sched_entity is runnable. - * For cfs_rq, it is the aggregated load_avg of all runnable and - * blocked sched_entities. + * [runnable_avg definition] + * + * runnable_avg = runnable% * SCHED_CAPACITY_SCALE * * [util_avg definition] * * util_avg = running% * SCHED_CAPACITY_SCALE * - * where running% is the time ratio that a sched_entity is running on - * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable - * and blocked sched_entities. + * where runnable% is the time ratio that a sched_entity is runnable and + * running% the time ratio that a sched_entity is running. + * + * For cfs_rq, they are the aggregated values of all runnable and blocked + * sched_entities. * - * load_avg and util_avg don't direcly factor frequency scaling and CPU - * capacity scaling. The scaling is done through the rq_clock_pelt that - * is used for computing those signals (see update_rq_clock_pelt()) + * The load/runnable/util_avg doesn't direcly factor frequency scaling and CPU + * capacity scaling. The scaling is done through the rq_clock_pelt that is used + * for computing those signals (see update_rq_clock_pelt()) * * N.B., the above ratios (runnable% and running%) themselves are in the * range of [0, 1]. To do fixed point arithmetics, we therefore scale them @@ -401,9 +403,11 @@ struct util_est { struct sched_avg { u64 last_update_time; u64 load_sum; + u64 runnable_sum; u32 util_sum; u32 period_contrib; unsigned long load_avg; + unsigned long runnable_avg; unsigned long util_avg; struct util_est util_est; } ____cacheline_aligned; @@ -467,6 +471,8 @@ struct sched_entity { struct cfs_rq *cfs_rq; /* rq "owned" by this entity/group: */ struct cfs_rq *my_q; + /* cached value of my_q->h_nr_running */ + unsigned long runnable_weight; #endif #ifdef CONFIG_SMP diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index cfecaad387c0..8331bc04aea2 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -405,6 +405,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group #ifdef CONFIG_SMP P(se->avg.load_avg); P(se->avg.util_avg); + P(se->avg.runnable_avg); #endif #undef PN_SCHEDSTAT @@ -524,6 +525,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) #ifdef CONFIG_SMP SEQ_printf(m, " .%-30s: %lu\n", "load_avg", cfs_rq->avg.load_avg); + SEQ_printf(m, " .%-30s: %lu\n", "runnable_avg", + cfs_rq->avg.runnable_avg); SEQ_printf(m, " .%-30s: %lu\n", "util_avg", cfs_rq->avg.util_avg); SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued", @@ -532,8 +535,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq) cfs_rq->removed.load_avg); SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg", cfs_rq->removed.util_avg); - SEQ_printf(m, " .%-30s: %ld\n", "removed.runnable_sum", - cfs_rq->removed.runnable_sum); + SEQ_printf(m, " .%-30s: %ld\n", "removed.runnable_avg", + cfs_rq->removed.runnable_avg); #ifdef CONFIG_FAIR_GROUP_SCHED SEQ_printf(m, " .%-30s: %lu\n", "tg_load_avg_contrib", cfs_rq->tg_load_avg_contrib); @@ -944,8 +947,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P(se.load.weight); #ifdef CONFIG_SMP P(se.avg.load_sum); + P(se.avg.runnable_sum); P(se.avg.util_sum); P(se.avg.load_avg); + P(se.avg.runnable_avg); P(se.avg.util_avg); P(se.avg.last_update_time); P(se.avg.util_est.ewma); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b0fb3d6a6185..49b36d62cc35 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -794,6 +794,8 @@ void post_init_entity_util_avg(struct task_struct *p) } } + sa->runnable_avg = cpu_scale; + if (p->sched_class != &fair_sched_class) { /* * For !fair tasks do: @@ -3215,9 +3217,9 @@ void set_task_rq_fair(struct sched_entity *se, * _IFF_ we look at the pure running and runnable sums. Because they * represent the very same entity, just at different points in the hierarchy. * - * Per the above update_tg_cfs_util() is trivial * and simply copies the - * running sum over (but still wrong, because the group entity and group rq do - * not have their PELT windows aligned). + * Per the above update_tg_cfs_util() and update_tg_cfs_runnable() are trivial + * and simply copies the running/runnable sum over (but still wrong, because + * the group entity and group rq do not have their PELT windows aligned). * * However, update_tg_cfs_load() is more complex. So we have: * @@ -3299,6 +3301,32 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX; } +static inline void +update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) +{ + long delta = gcfs_rq->avg.runnable_avg - se->avg.runnable_avg; + + /* Nothing to update */ + if (!delta) + return; + + /* + * The relation between sum and avg is: + * + * LOAD_AVG_MAX - 1024 + sa->period_contrib + * + * however, the PELT windows are not aligned between grq and gse. + */ + + /* Set new sched_entity's runnable */ + se->avg.runnable_avg = gcfs_rq->avg.runnable_avg; + se->avg.runnable_sum = se->avg.runnable_avg * LOAD_AVG_MAX; + + /* Update parent cfs_rq runnable */ + add_positive(&cfs_rq->avg.runnable_avg, delta); + cfs_rq->avg.runnable_sum = cfs_rq->avg.runnable_avg * LOAD_AVG_MAX; +} + static inline void update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq) { @@ -3379,6 +3407,7 @@ static inline int propagate_entity_load_avg(struct sched_entity *se) add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum); update_tg_cfs_util(cfs_rq, se, gcfs_rq); + update_tg_cfs_runnable(cfs_rq, se, gcfs_rq); update_tg_cfs_load(cfs_rq, se, gcfs_rq); trace_pelt_cfs_tp(cfs_rq); @@ -3449,7 +3478,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) { - unsigned long removed_load = 0, removed_util = 0, removed_runnable_sum = 0; + unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0; struct sched_avg *sa = &cfs_rq->avg; int decayed = 0; @@ -3460,7 +3489,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) raw_spin_lock(&cfs_rq->removed.lock); swap(cfs_rq->removed.util_avg, removed_util); swap(cfs_rq->removed.load_avg, removed_load); - swap(cfs_rq->removed.runnable_sum, removed_runnable_sum); + swap(cfs_rq->removed.runnable_avg, removed_runnable); cfs_rq->removed.nr = 0; raw_spin_unlock(&cfs_rq->removed.lock); @@ -3472,7 +3501,16 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) sub_positive(&sa->util_avg, r); sub_positive(&sa->util_sum, r * divider); - add_tg_cfs_propagate(cfs_rq, -(long)removed_runnable_sum); + r = removed_runnable; + sub_positive(&sa->runnable_avg, r); + sub_positive(&sa->runnable_sum, r * divider); + + /* + * removed_runnable is the unweighted version of removed_load so we + * can use it to estimate removed_load_sum. + */ + add_tg_cfs_propagate(cfs_rq, + -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT); decayed = 1; } @@ -3517,6 +3555,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s */ se->avg.util_sum = se->avg.util_avg * divider; + se->avg.runnable_sum = se->avg.runnable_avg * divider; + se->avg.load_sum = divider; if (se_weight(se)) { se->avg.load_sum = @@ -3526,6 +3566,8 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s enqueue_load_avg(cfs_rq, se); cfs_rq->avg.util_avg += se->avg.util_avg; cfs_rq->avg.util_sum += se->avg.util_sum; + cfs_rq->avg.runnable_avg += se->avg.runnable_avg; + cfs_rq->avg.runnable_sum += se->avg.runnable_sum; add_tg_cfs_propagate(cfs_rq, se->avg.load_sum); @@ -3547,6 +3589,8 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s dequeue_load_avg(cfs_rq, se); sub_positive(&cfs_rq->avg.util_avg, se->avg.util_avg); sub_positive(&cfs_rq->avg.util_sum, se->avg.util_sum); + sub_positive(&cfs_rq->avg.runnable_avg, se->avg.runnable_avg); + sub_positive(&cfs_rq->avg.runnable_sum, se->avg.runnable_sum); add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum); @@ -3653,10 +3697,15 @@ static void remove_entity_load_avg(struct sched_entity *se) ++cfs_rq->removed.nr; cfs_rq->removed.util_avg += se->avg.util_avg; cfs_rq->removed.load_avg += se->avg.load_avg; - cfs_rq->removed.runnable_sum += se->avg.load_sum; /* == runnable_sum */ + cfs_rq->removed.runnable_avg += se->avg.runnable_avg; raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags); } +static inline unsigned long cfs_rq_runnable_avg(struct cfs_rq *cfs_rq) +{ + return cfs_rq->avg.runnable_avg; +} + static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq) { return cfs_rq->avg.load_avg; @@ -3983,11 +4032,13 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) /* * When enqueuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. + * - Add its load to cfs_rq->runnable_avg * - For group_entity, update its weight to reflect the new share of * its group cfs_rq * - Add its new weight to cfs_rq->load.weight */ update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH); + se_update_runnable(se); update_cfs_group(se); account_entity_enqueue(cfs_rq, se); @@ -4065,11 +4116,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. + * - Subtract its load from the cfs_rq->runnable_avg. * - Subtract its previous weight from cfs_rq->load.weight. * - For group entity, update its weight to reflect the new share * of its group cfs_rq. */ update_load_avg(cfs_rq, se, UPDATE_TG); + se_update_runnable(se); update_stats_dequeue(cfs_rq, se, flags); @@ -5240,6 +5293,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) goto enqueue_throttle; update_load_avg(cfs_rq, se, UPDATE_TG); + se_update_runnable(se); update_cfs_group(se); cfs_rq->h_nr_running++; @@ -5337,6 +5391,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) goto dequeue_throttle; update_load_avg(cfs_rq, se, UPDATE_TG); + se_update_runnable(se); update_cfs_group(se); cfs_rq->h_nr_running--; @@ -5409,6 +5464,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p) return load; } +static unsigned long cpu_runnable(struct rq *rq) +{ + return cfs_rq_runnable_avg(&rq->cfs); +} + static unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; @@ -7554,6 +7614,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq) if (cfs_rq->avg.util_sum) return false; + if (cfs_rq->avg.runnable_sum) + return false; + return true; } diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c index 3eb0ed333dcb..c40d57a2a248 100644 --- a/kernel/sched/pelt.c +++ b/kernel/sched/pelt.c @@ -108,7 +108,7 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) */ static __always_inline u32 accumulate_sum(u64 delta, struct sched_avg *sa, - unsigned long load, int running) + unsigned long load, unsigned long runnable, int running) { u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */ u64 periods; @@ -121,6 +121,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa, */ if (periods) { sa->load_sum = decay_load(sa->load_sum, periods); + sa->runnable_sum = + decay_load(sa->runnable_sum, periods); sa->util_sum = decay_load((u64)(sa->util_sum), periods); /* @@ -146,6 +148,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa, if (load) sa->load_sum += load * contrib; + if (runnable) + sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT; if (running) sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; @@ -182,7 +186,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa, */ static __always_inline int ___update_load_sum(u64 now, struct sched_avg *sa, - unsigned long load, int running) + unsigned long load, unsigned long runnable, int running) { u64 delta; @@ -218,7 +222,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa, * Also see the comment in accumulate_sum(). */ if (!load) - running = 0; + runnable = running = 0; /* * Now we know we crossed measurement unit boundaries. The *_avg @@ -227,7 +231,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa, * Step 1: accumulate *_sum since last_update_time. If we haven't * crossed period boundaries, finish. */ - if (!accumulate_sum(delta, sa, load, running)) + if (!accumulate_sum(delta, sa, load, runnable, running)) return 0; return 1; @@ -242,6 +246,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load) * Step 2: update *_avg. */ sa->load_avg = div_u64(load * sa->load_sum, divider); + sa->runnable_avg = div_u64(sa->runnable_sum, divider); WRITE_ONCE(sa->util_avg, sa->util_sum / divider); } @@ -250,24 +255,30 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load) * * task: * se_weight() = se->load.weight + * se_runnable() = !!on_rq * * group: [ see update_cfs_group() ] * se_weight() = tg->weight * grq->load_avg / tg->load_avg + * se_runnable() = grq->h_nr_running + * + * runnable_sum = se_runnable() * runnable = grq->runnable_sum + * runnable_avg = runnable_sum * * load_sum := runnable * load_avg = se_weight(se) * load_sum * - * XXX collapse load_sum and runnable_load_sum - * * cfq_rq: * + * runnable_sum = \Sum se->avg.runnable_sum + * runnable_avg = \Sum se->avg.runnable_avg + * * load_sum = \Sum se_weight(se) * se->avg.load_sum * load_avg = \Sum se->avg.load_avg */ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se) { - if (___update_load_sum(now, &se->avg, 0, 0)) { + if (___update_load_sum(now, &se->avg, 0, 0, 0)) { ___update_load_avg(&se->avg, se_weight(se)); trace_pelt_se_tp(se); return 1; @@ -278,7 +289,8 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se) int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se) { - if (___update_load_sum(now, &se->avg, !!se->on_rq, cfs_rq->curr == se)) { + if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se), + cfs_rq->curr == se)) { ___update_load_avg(&se->avg, se_weight(se)); cfs_se_util_change(&se->avg); @@ -293,6 +305,7 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq) { if (___update_load_sum(now, &cfs_rq->avg, scale_load_down(cfs_rq->load.weight), + cfs_rq->h_nr_running, cfs_rq->curr != NULL)) { ___update_load_avg(&cfs_rq->avg, 1); @@ -310,13 +323,14 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq) * util_sum = cpu_scale * load_sum * runnable_sum = util_sum * - * load_avg is not supported and meaningless. + * load_avg and runnable_avg are not supported and meaningless. * */ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) { if (___update_load_sum(now, &rq->avg_rt, + running, running, running)) { @@ -335,13 +349,14 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) * util_sum = cpu_scale * load_sum * runnable_sum = util_sum * - * load_avg is not supported and meaningless. + * load_avg and runnable_avg are not supported and meaningless. * */ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) { if (___update_load_sum(now, &rq->avg_dl, + running, running, running)) { @@ -361,7 +376,7 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) * util_sum = cpu_scale * load_sum * runnable_sum = util_sum * - * load_avg is not supported and meaningless. + * load_avg and runnable_avg are not supported and meaningless. * */ @@ -389,9 +404,11 @@ int update_irq_load_avg(struct rq *rq, u64 running) * rq->clock += delta with delta >= running */ ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, + 0, 0, 0); ret += ___update_load_sum(rq->clock, &rq->avg_irq, + 1, 1, 1); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ce27e588fa7c..2a0caf394dd4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -527,7 +527,7 @@ struct cfs_rq { int nr; unsigned long load_avg; unsigned long util_avg; - unsigned long runnable_sum; + unsigned long runnable_avg; } removed; #ifdef CONFIG_FAIR_GROUP_SCHED @@ -688,9 +688,29 @@ struct dl_rq { /* An entity is a task if it doesn't "own" a runqueue */ #define entity_is_task(se) (!se->my_q) +static inline void se_update_runnable(struct sched_entity *se) +{ + if (!entity_is_task(se)) + se->runnable_weight = se->my_q->h_nr_running; +} + +static inline long se_runnable(struct sched_entity *se) +{ + if (entity_is_task(se)) + return !!se->on_rq; + else + return se->runnable_weight; +} + #else #define entity_is_task(se) 1 +static inline void se_update_runnable(struct sched_entity *se) {} + +static inline long se_runnable(struct sched_entity *se) +{ + return !!se->on_rq; +} #endif #ifdef CONFIG_SMP -- cgit v1.2.3-71-gd317 From 070f5e860ee2bf588c99ef7b4c202451faa48236 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Mon, 24 Feb 2020 09:52:19 +0000 Subject: sched/fair: Take into account runnable_avg to classify group Take into account the new runnable_avg signal to classify a group and to mitigate the volatility of util_avg in face of intensive migration or new task with random utilization. Signed-off-by: Vincent Guittot Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Reviewed-by: "Dietmar Eggemann " Acked-by: Peter Zijlstra Cc: Juri Lelli Cc: Steven Rostedt Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-10-mgorman@techsingularity.net --- kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++++- 1 file changed, 30 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 49b36d62cc35..87521acb3698 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5469,6 +5469,24 @@ static unsigned long cpu_runnable(struct rq *rq) return cfs_rq_runnable_avg(&rq->cfs); } +static unsigned long cpu_runnable_without(struct rq *rq, struct task_struct *p) +{ + struct cfs_rq *cfs_rq; + unsigned int runnable; + + /* Task has no contribution or is new */ + if (cpu_of(rq) != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time)) + return cpu_runnable(rq); + + cfs_rq = &rq->cfs; + runnable = READ_ONCE(cfs_rq->avg.runnable_avg); + + /* Discount task's runnable from CPU's runnable */ + lsub_positive(&runnable, p->se.avg.runnable_avg); + + return runnable; +} + static unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; @@ -7752,7 +7770,8 @@ struct sg_lb_stats { unsigned long avg_load; /*Avg load across the CPUs of the group */ unsigned long group_load; /* Total load over the CPUs of the group */ unsigned long group_capacity; - unsigned long group_util; /* Total utilization of the group */ + unsigned long group_util; /* Total utilization over the CPUs of the group */ + unsigned long group_runnable; /* Total runnable time over the CPUs of the group */ unsigned int sum_nr_running; /* Nr of tasks running in the group */ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */ unsigned int idle_cpus; @@ -7973,6 +7992,10 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs) if (sgs->sum_nr_running < sgs->group_weight) return true; + if ((sgs->group_capacity * imbalance_pct) < + (sgs->group_runnable * 100)) + return false; + if ((sgs->group_capacity * 100) > (sgs->group_util * imbalance_pct)) return true; @@ -7998,6 +8021,10 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs) (sgs->group_util * imbalance_pct)) return true; + if ((sgs->group_capacity * imbalance_pct) < + (sgs->group_runnable * 100)) + return true; + return false; } @@ -8092,6 +8119,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_load += cpu_load(rq); sgs->group_util += cpu_util(i); + sgs->group_runnable += cpu_runnable(rq); sgs->sum_h_nr_running += rq->cfs.h_nr_running; nr_running = rq->nr_running; @@ -8367,6 +8395,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd, sgs->group_load += cpu_load_without(rq, p); sgs->group_util += cpu_util_without(i, p); + sgs->group_runnable += cpu_runnable_without(rq, p); local = task_running_on_cpu(i, p); sgs->sum_h_nr_running += rq->cfs.h_nr_running - local; -- cgit v1.2.3-71-gd317 From ff7db0bf24db919f69121bf5df8f3cb6d79f49af Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:20 +0000 Subject: sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks task_numa_find_cpu() can scan a node multiple times. Minimally it scans to gather statistics and later to find a suitable target. In some cases, the second scan will simply pick an idle CPU if the load is not imbalanced. This patch caches information on an idle core while gathering statistics and uses it immediately if load is not imbalanced to avoid a second scan of the node runqueues. Preference is given to an idle core rather than an idle SMT sibling to avoid packing HT siblings due to linearly scanning the node cpumask. As a side-effect, even when the second scan is necessary, the importance of using select_idle_sibling is much reduced because information on idle CPUs is cached and can be reused. Note that this patch actually makes is harder to move to an idle CPU as multiple tasks can race for the same idle CPU due to a race checking numa_migrate_on. This is addressed in the next patch. Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-11-mgorman@techsingularity.net --- kernel/sched/fair.c | 119 ++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 102 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 87521acb3698..2da21f44e4d0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1500,8 +1500,29 @@ struct numa_stats { unsigned int nr_running; unsigned int weight; enum numa_type node_type; + int idle_cpu; }; +static inline bool is_core_idle(int cpu) +{ +#ifdef CONFIG_SCHED_SMT + int sibling; + + for_each_cpu(sibling, cpu_smt_mask(cpu)) { + if (cpu == sibling) + continue; + + if (!idle_cpu(cpu)) + return false; + } +#endif + + return true; +} + +/* Forward declarations of select_idle_sibling helpers */ +static inline bool test_idle_cores(int cpu, bool def); + struct task_numa_env { struct task_struct *p; @@ -1537,15 +1558,39 @@ numa_type numa_classify(unsigned int imbalance_pct, return node_fully_busy; } +static inline int numa_idle_core(int idle_core, int cpu) +{ +#ifdef CONFIG_SCHED_SMT + if (!static_branch_likely(&sched_smt_present) || + idle_core >= 0 || !test_idle_cores(cpu, false)) + return idle_core; + + /* + * Prefer cores instead of packing HT siblings + * and triggering future load balancing. + */ + if (is_core_idle(cpu)) + idle_core = cpu; +#endif + + return idle_core; +} + /* - * XXX borrowed from update_sg_lb_stats + * Gather all necessary information to make NUMA balancing placement + * decisions that are compatible with standard load balancer. This + * borrows code and logic from update_sg_lb_stats but sharing a + * common implementation is impractical. */ static void update_numa_stats(struct task_numa_env *env, - struct numa_stats *ns, int nid) + struct numa_stats *ns, int nid, + bool find_idle) { - int cpu; + int cpu, idle_core = -1; memset(ns, 0, sizeof(*ns)); + ns->idle_cpu = -1; + for_each_cpu(cpu, cpumask_of_node(nid)) { struct rq *rq = cpu_rq(cpu); @@ -1553,11 +1598,25 @@ static void update_numa_stats(struct task_numa_env *env, ns->util += cpu_util(cpu); ns->nr_running += rq->cfs.h_nr_running; ns->compute_capacity += capacity_of(cpu); + + if (find_idle && !rq->nr_running && idle_cpu(cpu)) { + if (READ_ONCE(rq->numa_migrate_on) || + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) + continue; + + if (ns->idle_cpu == -1) + ns->idle_cpu = cpu; + + idle_core = numa_idle_core(idle_core, cpu); + } } ns->weight = cpumask_weight(cpumask_of_node(nid)); ns->node_type = numa_classify(env->imbalance_pct, ns); + + if (idle_core >= 0) + ns->idle_cpu = idle_core; } static void task_numa_assign(struct task_numa_env *env, @@ -1566,7 +1625,7 @@ static void task_numa_assign(struct task_numa_env *env, struct rq *rq = cpu_rq(env->dst_cpu); /* Bail out if run-queue part of active NUMA balance. */ - if (xchg(&rq->numa_migrate_on, 1)) + if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) return; /* @@ -1730,19 +1789,39 @@ static void task_numa_compare(struct task_numa_env *env, goto unlock; assign: - /* - * One idle CPU per node is evaluated for a task numa move. - * Call select_idle_sibling to maybe find a better one. - */ + /* Evaluate an idle CPU for a task numa move. */ if (!cur) { + int cpu = env->dst_stats.idle_cpu; + + /* Nothing cached so current CPU went idle since the search. */ + if (cpu < 0) + cpu = env->dst_cpu; + /* - * select_idle_siblings() uses an per-CPU cpumask that - * can be used from IRQ context. + * If the CPU is no longer truly idle and the previous best CPU + * is, keep using it. */ - local_irq_disable(); - env->dst_cpu = select_idle_sibling(env->p, env->src_cpu, + if (!idle_cpu(cpu) && env->best_cpu >= 0 && + idle_cpu(env->best_cpu)) { + cpu = env->best_cpu; + } + + /* + * Use select_idle_sibling if the previously found idle CPU is + * not idle any more. + */ + if (!idle_cpu(cpu)) { + /* + * select_idle_siblings() uses an per-CPU cpumask that + * can be used from IRQ context. + */ + local_irq_disable(); + cpu = select_idle_sibling(env->p, env->src_cpu, env->dst_cpu); - local_irq_enable(); + local_irq_enable(); + } + + env->dst_cpu = cpu; } task_numa_assign(env, cur, imp); @@ -1776,8 +1855,14 @@ static void task_numa_find_cpu(struct task_numa_env *env, imbalance = adjust_numa_imbalance(imbalance, src_running); /* Use idle CPU if there is no imbalance */ - if (!imbalance) + if (!imbalance) { maymove = true; + if (env->dst_stats.idle_cpu >= 0) { + env->dst_cpu = env->dst_stats.idle_cpu; + task_numa_assign(env, NULL, 0); + return; + } + } } else { long src_load, dst_load, load; /* @@ -1850,10 +1935,10 @@ static int task_numa_migrate(struct task_struct *p) dist = env.dist = node_distance(env.src_nid, env.dst_nid); taskweight = task_weight(p, env.src_nid, dist); groupweight = group_weight(p, env.src_nid, dist); - update_numa_stats(&env, &env.src_stats, env.src_nid); + update_numa_stats(&env, &env.src_stats, env.src_nid, false); taskimp = task_weight(p, env.dst_nid, dist) - taskweight; groupimp = group_weight(p, env.dst_nid, dist) - groupweight; - update_numa_stats(&env, &env.dst_stats, env.dst_nid); + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); /* Try to find a spot on the preferred nid. */ task_numa_find_cpu(&env, taskimp, groupimp); @@ -1886,7 +1971,7 @@ static int task_numa_migrate(struct task_struct *p) env.dist = dist; env.dst_nid = nid; - update_numa_stats(&env, &env.dst_stats, env.dst_nid); + update_numa_stats(&env, &env.dst_stats, env.dst_nid, true); task_numa_find_cpu(&env, taskimp, groupimp); } } -- cgit v1.2.3-71-gd317 From 5fb52dd93a2fe9a738f730de9da108bd1f6c30d0 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:21 +0000 Subject: sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Multiple tasks can attempt to select and idle CPU but fail because numa_migrate_on is already set and the migration fails. Instead of failing, scan for an alternative idle CPU. select_idle_sibling is not used because it requires IRQs to be disabled and it ignores numa_migrate_on allowing multiple tasks to stack. This scan may still fail if there are idle candidate CPUs due to races but if this occurs, it's best that a task stay on an available CPU that move to a contended one. Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-12-mgorman@techsingularity.net --- kernel/sched/fair.c | 40 ++++++++++++++++++++++------------------ 1 file changed, 22 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2da21f44e4d0..050c1b19bfc0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1624,15 +1624,34 @@ static void task_numa_assign(struct task_numa_env *env, { struct rq *rq = cpu_rq(env->dst_cpu); - /* Bail out if run-queue part of active NUMA balance. */ - if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) + /* Check if run-queue part of active NUMA balance. */ + if (env->best_cpu != env->dst_cpu && xchg(&rq->numa_migrate_on, 1)) { + int cpu; + int start = env->dst_cpu; + + /* Find alternative idle CPU. */ + for_each_cpu_wrap(cpu, cpumask_of_node(env->dst_nid), start) { + if (cpu == env->best_cpu || !idle_cpu(cpu) || + !cpumask_test_cpu(cpu, env->p->cpus_ptr)) { + continue; + } + + env->dst_cpu = cpu; + rq = cpu_rq(env->dst_cpu); + if (!xchg(&rq->numa_migrate_on, 1)) + goto assign; + } + + /* Failed to find an alternative idle CPU */ return; + } +assign: /* * Clear previous best_cpu/rq numa-migrate flag, since task now * found a better CPU to move/swap. */ - if (env->best_cpu != -1) { + if (env->best_cpu != -1 && env->best_cpu != env->dst_cpu) { rq = cpu_rq(env->best_cpu); WRITE_ONCE(rq->numa_migrate_on, 0); } @@ -1806,21 +1825,6 @@ assign: cpu = env->best_cpu; } - /* - * Use select_idle_sibling if the previously found idle CPU is - * not idle any more. - */ - if (!idle_cpu(cpu)) { - /* - * select_idle_siblings() uses an per-CPU cpumask that - * can be used from IRQ context. - */ - local_irq_disable(); - cpu = select_idle_sibling(env->p, env->src_cpu, - env->dst_cpu); - local_irq_enable(); - } - env->dst_cpu = cpu; } -- cgit v1.2.3-71-gd317 From 88cca72c9673e631b63eca7a1dba4a9722a3f414 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:22 +0000 Subject: sched/numa: Bias swapping tasks based on their preferred node When swapping tasks for NUMA balancing, it is preferred that tasks move to or remain on their preferred node. When considering an imbalance, encourage tasks to move to their preferred node and discourage tasks from moving away from their preferred node. Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-13-mgorman@techsingularity.net --- kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++------ 1 file changed, 37 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 050c1b19bfc0..8c1ac01a10d5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1741,18 +1741,27 @@ static void task_numa_compare(struct task_numa_env *env, goto unlock; } + /* Skip this swap candidate if cannot move to the source cpu. */ + if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr)) + goto unlock; + + /* + * Skip this swap candidate if it is not moving to its preferred + * node and the best task is. + */ + if (env->best_task && + env->best_task->numa_preferred_nid == env->src_nid && + cur->numa_preferred_nid != env->src_nid) { + goto unlock; + } + /* * "imp" is the fault differential for the source task between the * source and destination node. Calculate the total differential for * the source task and potential destination task. The more negative * the value is, the more remote accesses that would be expected to * be incurred if the tasks were swapped. - */ - /* Skip this swap candidate if cannot move to the source cpu */ - if (!cpumask_test_cpu(env->src_cpu, cur->cpus_ptr)) - goto unlock; - - /* + * * If dst and source tasks are in the same NUMA group, or not * in any group then look only at task weights. */ @@ -1779,12 +1788,34 @@ static void task_numa_compare(struct task_numa_env *env, task_weight(cur, env->dst_nid, dist); } + /* Discourage picking a task already on its preferred node */ + if (cur->numa_preferred_nid == env->dst_nid) + imp -= imp / 16; + + /* + * Encourage picking a task that moves to its preferred node. + * This potentially makes imp larger than it's maximum of + * 1998 (see SMALLIMP and task_weight for why) but in this + * case, it does not matter. + */ + if (cur->numa_preferred_nid == env->src_nid) + imp += imp / 8; + if (maymove && moveimp > imp && moveimp > env->best_imp) { imp = moveimp; cur = NULL; goto assign; } + /* + * Prefer swapping with a task moving to its preferred node over a + * task that is not. + */ + if (env->best_task && cur->numa_preferred_nid == env->src_nid && + env->best_task->numa_preferred_nid != env->src_nid) { + goto assign; + } + /* * If the NUMA importance is less than SMALLIMP, * task migration might only result in ping pong -- cgit v1.2.3-71-gd317 From a0f03b617c3b2644d3d47bf7d9e60aed01bd5b10 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Mon, 24 Feb 2020 09:52:23 +0000 Subject: sched/numa: Stop an exhastive search if a reasonable swap candidate or idle CPU is found When domains are imbalanced or overloaded a search of all CPUs on the target domain is searched and compared with task_numa_compare. In some circumstances, a candidate is found that is an obvious win. o A task can move to an idle CPU and an idle CPU is found o A swap candidate is found that would move to its preferred domain This patch terminates the search when either condition is met. Signed-off-by: Mel Gorman Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Cc: Vincent Guittot Cc: Juri Lelli Cc: Dietmar Eggemann Cc: Steven Rostedt Cc: Valentin Schneider Cc: Phil Auld Cc: Hillf Danton Link: https://lore.kernel.org/r/20200224095223.13361-14-mgorman@techsingularity.net --- kernel/sched/fair.c | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 8c1ac01a10d5..fcc968669aea 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1707,7 +1707,7 @@ static bool load_too_imbalanced(long src_load, long dst_load, * into account that it might be best if task running on the dst_cpu should * be exchanged with the source task */ -static void task_numa_compare(struct task_numa_env *env, +static bool task_numa_compare(struct task_numa_env *env, long taskimp, long groupimp, bool maymove) { struct numa_group *cur_ng, *p_ng = deref_curr_numa_group(env->p); @@ -1718,9 +1718,10 @@ static void task_numa_compare(struct task_numa_env *env, int dist = env->dist; long moveimp = imp; long load; + bool stopsearch = false; if (READ_ONCE(dst_rq->numa_migrate_on)) - return; + return false; rcu_read_lock(); cur = rcu_dereference(dst_rq->curr); @@ -1731,8 +1732,10 @@ static void task_numa_compare(struct task_numa_env *env, * Because we have preemption enabled we can get migrated around and * end try selecting ourselves (current == env->p) as a swap candidate. */ - if (cur == env->p) + if (cur == env->p) { + stopsearch = true; goto unlock; + } if (!cur) { if (maymove && moveimp >= env->best_imp) @@ -1860,8 +1863,27 @@ assign: } task_numa_assign(env, cur, imp); + + /* + * If a move to idle is allowed because there is capacity or load + * balance improves then stop the search. While a better swap + * candidate may exist, a search is not free. + */ + if (maymove && !cur && env->best_cpu >= 0 && idle_cpu(env->best_cpu)) + stopsearch = true; + + /* + * If a swap candidate must be identified and the current best task + * moves its preferred node then stop the search. + */ + if (!maymove && env->best_task && + env->best_task->numa_preferred_nid == env->src_nid) { + stopsearch = true; + } unlock: rcu_read_unlock(); + + return stopsearch; } static void task_numa_find_cpu(struct task_numa_env *env, @@ -1916,7 +1938,8 @@ static void task_numa_find_cpu(struct task_numa_env *env, continue; env->dst_cpu = cpu; - task_numa_compare(env, taskimp, groupimp, maymove); + if (task_numa_compare(env, taskimp, groupimp, maymove)) + break; } } -- cgit v1.2.3-71-gd317 From 7bc3e6e55acf065500a24621f3b313e7e5998acf Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 19 Feb 2020 18:22:26 -0600 Subject: proc: Use a list of inodes to flush from proc Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot ", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman --- fs/proc/base.c | 111 ++++++++++++++++-------------------------------- fs/proc/inode.c | 2 +- fs/proc/internal.h | 1 + include/linux/pid.h | 1 + include/linux/proc_fs.h | 4 +- kernel/exit.c | 4 +- kernel/pid.c | 1 + 7 files changed, 45 insertions(+), 79 deletions(-) (limited to 'kernel') diff --git a/fs/proc/base.c b/fs/proc/base.c index c7c64272b0fa..e7efe9d6f3d6 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1834,11 +1834,25 @@ void task_dump_owner(struct task_struct *task, umode_t mode, *rgid = gid; } +void proc_pid_evict_inode(struct proc_inode *ei) +{ + struct pid *pid = ei->pid; + + if (S_ISDIR(ei->vfs_inode.i_mode)) { + spin_lock(&pid->wait_pidfd.lock); + hlist_del_init_rcu(&ei->sibling_inodes); + spin_unlock(&pid->wait_pidfd.lock); + } + + put_pid(pid); +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; + struct pid *pid; /* We need a new inode */ @@ -1856,10 +1870,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb, /* * grab the reference to task. */ - ei->pid = get_task_pid(task, PIDTYPE_PID); - if (!ei->pid) + pid = get_task_pid(task, PIDTYPE_PID); + if (!pid) goto out_unlock; + /* Let the pid remember us for quick removal */ + ei->pid = pid; + if (S_ISDIR(mode)) { + spin_lock(&pid->wait_pidfd.lock); + hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes); + spin_unlock(&pid->wait_pidfd.lock); + } + task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid); security_task_to_inode(task, inode); @@ -3230,90 +3252,29 @@ static const struct inode_operations proc_tgid_base_inode_operations = { .permission = proc_pid_permission, }; -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid) -{ - struct dentry *dentry, *leader, *dir; - char buf[10 + 1]; - struct qstr name; - - name.name = buf; - name.len = snprintf(buf, sizeof(buf), "%u", pid); - /* no ->d_hash() rejects on procfs */ - dentry = d_hash_and_lookup(mnt->mnt_root, &name); - if (dentry) { - d_invalidate(dentry); - dput(dentry); - } - - if (pid == tgid) - return; - - name.name = buf; - name.len = snprintf(buf, sizeof(buf), "%u", tgid); - leader = d_hash_and_lookup(mnt->mnt_root, &name); - if (!leader) - goto out; - - name.name = "task"; - name.len = strlen(name.name); - dir = d_hash_and_lookup(leader, &name); - if (!dir) - goto out_put_leader; - - name.name = buf; - name.len = snprintf(buf, sizeof(buf), "%u", pid); - dentry = d_hash_and_lookup(dir, &name); - if (dentry) { - d_invalidate(dentry); - dput(dentry); - } - - dput(dir); -out_put_leader: - dput(leader); -out: - return; -} - /** - * proc_flush_task - Remove dcache entries for @task from the /proc dcache. - * @task: task that should be flushed. + * proc_flush_pid - Remove dcache entries for @pid from the /proc dcache. + * @pid: pid that should be flushed. * - * When flushing dentries from proc, one needs to flush them from global - * proc (proc_mnt) and from all the namespaces' procs this task was seen - * in. This call is supposed to do all of this job. - * - * Looks in the dcache for - * /proc/@pid - * /proc/@tgid/task/@pid - * if either directory is present flushes it and all of it'ts children - * from the dcache. + * This function walks a list of inodes (that belong to any proc + * filesystem) that are attached to the pid and flushes them from + * the dentry cache. * * It is safe and reasonable to cache /proc entries for a task until * that task exits. After that they just clog up the dcache with * useless entries, possibly causing useful dcache entries to be - * flushed instead. This routine is proved to flush those useless - * dcache entries at process exit time. + * flushed instead. This routine is provided to flush those useless + * dcache entries when a process is reaped. * * NOTE: This routine is just an optimization so it does not guarantee - * that no dcache entries will exist at process exit time it - * just makes it very unlikely that any will persist. + * that no dcache entries will exist after a process is reaped + * it just makes it very unlikely that any will persist. */ -void proc_flush_task(struct task_struct *task) +void proc_flush_pid(struct pid *pid) { - int i; - struct pid *pid, *tgid; - struct upid *upid; - - pid = task_pid(task); - tgid = task_tgid(task); - - for (i = 0; i <= pid->level; i++) { - upid = &pid->numbers[i]; - proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr, - tgid->numbers[i].nr); - } + proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock); + put_pid(pid); } static struct dentry *proc_pid_instantiate(struct dentry * dentry, diff --git a/fs/proc/inode.c b/fs/proc/inode.c index d9243b24554a..1e730ea1dcd6 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -40,7 +40,7 @@ static void proc_evict_inode(struct inode *inode) /* Stop tracking associated processes */ if (ei->pid) { - put_pid(ei->pid); + proc_pid_evict_inode(ei); ei->pid = NULL; } diff --git a/fs/proc/internal.h b/fs/proc/internal.h index fd470172675f..9e294f0290e5 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -158,6 +158,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *, extern const struct dentry_operations pid_dentry_operations; extern int pid_getattr(const struct path *, struct kstat *, u32, unsigned int); extern int proc_setattr(struct dentry *, struct iattr *); +extern void proc_pid_evict_inode(struct proc_inode *); extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t); extern void pid_update_inode(struct task_struct *, struct inode *); extern int pid_delete_dentry(const struct dentry *); diff --git a/include/linux/pid.h b/include/linux/pid.h index 998ae7d24450..01a0d4e28506 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -62,6 +62,7 @@ struct pid unsigned int level; /* lists of tasks that use this pid */ struct hlist_head tasks[PIDTYPE_MAX]; + struct hlist_head inodes; /* wait queue for pidfd notifications */ wait_queue_head_t wait_pidfd; struct rcu_head rcu; diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 3dfa92633af3..40a7982b7285 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -32,7 +32,7 @@ struct proc_ops { typedef int (*proc_write_t)(struct file *, char *, size_t); extern void proc_root_init(void); -extern void proc_flush_task(struct task_struct *); +extern void proc_flush_pid(struct pid *); extern struct proc_dir_entry *proc_symlink(const char *, struct proc_dir_entry *, const char *); @@ -105,7 +105,7 @@ static inline void proc_root_init(void) { } -static inline void proc_flush_task(struct task_struct *task) +static inline void proc_flush_pid(struct pid *pid) { } diff --git a/kernel/exit.c b/kernel/exit.c index 2833ffb0c211..502b4995b688 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -191,6 +191,7 @@ void put_task_struct_rcu_user(struct task_struct *task) void release_task(struct task_struct *p) { struct task_struct *leader; + struct pid *thread_pid; int zap_leader; repeat: /* don't need to get the RCU readlock here - the process is dead and @@ -199,11 +200,11 @@ repeat: atomic_dec(&__task_cred(p)->user->processes); rcu_read_unlock(); - proc_flush_task(p); cgroup_release(p); write_lock_irq(&tasklist_lock); ptrace_release_task(p); + thread_pid = get_pid(p->thread_pid); __exit_signal(p); /* @@ -226,6 +227,7 @@ repeat: } write_unlock_irq(&tasklist_lock); + proc_flush_pid(thread_pid); release_thread(p); put_task_struct_rcu_user(p); diff --git a/kernel/pid.c b/kernel/pid.c index 0f4ecb57214c..ca08d6a3aa77 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -258,6 +258,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, INIT_HLIST_HEAD(&pid->tasks[type]); init_waitqueue_head(&pid->wait_pidfd); + INIT_HLIST_HEAD(&pid->inodes); upid = pid->numbers + ns->level; spin_lock_irq(&pidmap_lock); -- cgit v1.2.3-71-gd317 From 94dacdbd5d2dfa2cffd308f128d78c99f855f5be Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:32 +0100 Subject: bpf: Tighten the requirements for preallocated hash maps The assumption that only programs attached to perf NMI events can deadlock on memory allocators is wrong. Assume the following simplified callchain: kmalloc() from regular non BPF context cache empty freelist empty lock(zone->lock); tracepoint or kprobe BPF() update_elem() lock(bucket) kmalloc() cache empty freelist empty lock(zone->lock); <- DEADLOCK There are other ways which do not involve locking to create wreckage: kmalloc() from regular non BPF context local_irq_save(); ... obj = slab_first(); kprobe() BPF() update_elem() lock(bucket) kmalloc() local_irq_save(); ... obj = slab_first(); <- Same object as above ... So preallocation _must_ be enforced for all variants of intrusive instrumentation. Unfortunately immediate enforcement would break backwards compatibility, so for now such programs still are allowed to run, but a one time warning is emitted in dmesg and the verifier emits a warning in the verifier log as well so developers are made aware about this and can fix their programs before the enforcement becomes mandatory. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145642.540542802@linutronix.de --- kernel/bpf/verifier.c | 39 ++++++++++++++++++++++++++++----------- 1 file changed, 28 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6d15dfbd4b88..532b7b0320ba 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8143,26 +8143,43 @@ static bool is_tracing_prog_type(enum bpf_prog_type type) } } +static bool is_preallocated_map(struct bpf_map *map) +{ + if (!check_map_prealloc(map)) + return false; + if (map->inner_map_meta && !check_map_prealloc(map->inner_map_meta)) + return false; + return true; +} + static int check_map_prog_compatibility(struct bpf_verifier_env *env, struct bpf_map *map, struct bpf_prog *prog) { - /* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use - * preallocated hash maps, since doing memory allocation - * in overflow_handler can crash depending on where nmi got - * triggered. + /* + * Validate that trace type programs use preallocated hash maps. + * + * For programs attached to PERF events this is mandatory as the + * perf NMI can hit any arbitrary code sequence. + * + * All other trace types using preallocated hash maps are unsafe as + * well because tracepoint or kprobes can be inside locked regions + * of the memory allocator or at a place where a recursion into the + * memory allocator would see inconsistent state. + * + * For now running such programs is allowed for backwards + * compatibility reasons, but warnings are emitted so developers are + * made aware of the unsafety and can fix their programs before this + * is enforced. */ - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { - if (!check_map_prealloc(map)) { + if (is_tracing_prog_type(prog->type) && !is_preallocated_map(map)) { + if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { verbose(env, "perf_event programs can only use preallocated hash map\n"); return -EINVAL; } - if (map->inner_map_meta && - !check_map_prealloc(map->inner_map_meta)) { - verbose(env, "perf_event programs can only use preallocated inner hash map\n"); - return -EINVAL; - } + WARN_ONCE(1, "trace type BPF program uses run-time allocation\n"); + verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n"); } if ((is_tracing_prog_type(prog->type) || -- cgit v1.2.3-71-gd317 From 2ed905c521e56aead6987df94c083efb0ee59895 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:33 +0100 Subject: bpf: Enforce preallocation for instrumentation programs on RT Aside of the general unsafety of run-time map allocation for instrumentation type programs RT enabled kernels have another constraint: The instrumentation programs are invoked with preemption disabled, but the memory allocator spinlocks cannot be acquired in atomic context because they are converted to 'sleeping' spinlocks on RT. Therefore enforce map preallocation for these programs types when RT is enabled. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145642.648784007@linutronix.de --- kernel/bpf/verifier.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 532b7b0320ba..289383edfc8c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8168,16 +8168,21 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, * of the memory allocator or at a place where a recursion into the * memory allocator would see inconsistent state. * - * For now running such programs is allowed for backwards - * compatibility reasons, but warnings are emitted so developers are - * made aware of the unsafety and can fix their programs before this - * is enforced. + * On RT enabled kernels run-time allocation of all trace type + * programs is strictly prohibited due to lock type constraints. On + * !RT kernels it is allowed for backwards compatibility reasons for + * now, but warnings are emitted so developers are made aware of + * the unsafety and can fix their programs before this is enforced. */ if (is_tracing_prog_type(prog->type) && !is_preallocated_map(map)) { if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { verbose(env, "perf_event programs can only use preallocated hash map\n"); return -EINVAL; } + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + verbose(env, "trace type programs can only use preallocated hash map\n"); + return -EINVAL; + } WARN_ONCE(1, "trace type BPF program uses run-time allocation\n"); verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n"); } -- cgit v1.2.3-71-gd317 From dbca151cad736c99f4817076daf9fd02ed0c2daa Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:34 +0100 Subject: bpf: Update locking comment in hashtab code The comment where the bucket lock is acquired says: /* bpf_map_update_elem() can be called in_irq() */ which is not really helpful and aside of that it does not explain the subtle details of the hash bucket locks expecially in the context of BPF and perf, kprobes and tracing. Add a comment at the top of the file which explains the protection scopes and the details how potential deadlocks are prevented. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de --- kernel/bpf/hashtab.c | 24 ++++++++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index a1468e3f5af2..65711a220fe0 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -27,6 +27,26 @@ .map_delete_batch = \ generic_map_delete_batch +/* + * The bucket lock has two protection scopes: + * + * 1) Serializing concurrent operations from BPF programs on differrent + * CPUs + * + * 2) Serializing concurrent operations from BPF programs and sys_bpf() + * + * BPF programs can execute in any context including perf, kprobes and + * tracing. As there are almost no limits where perf, kprobes and tracing + * can be invoked from the lock operations need to be protected against + * deadlocks. Deadlocks can be caused by recursion and by an invocation in + * the lock held section when functions which acquire this lock are invoked + * from sys_bpf(). BPF recursion is prevented by incrementing the per CPU + * variable bpf_prog_active, which prevents BPF programs attached to perf + * events, kprobes and tracing to be invoked before the prior invocation + * from one of these contexts completed. sys_bpf() uses the same mechanism + * by pinning the task to the current CPU and incrementing the recursion + * protection accross the map operation. + */ struct bucket { struct hlist_nulls_head head; raw_spinlock_t lock; @@ -884,7 +904,6 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, */ } - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -964,7 +983,6 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, return -ENOMEM; memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size); - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1019,7 +1037,6 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, b = __select_bucket(htab, hash); head = &b->head; - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1083,7 +1100,6 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, return -ENOMEM; } - /* bpf_map_update_elem() can be called in_irq() */ raw_spin_lock_irqsave(&b->lock, flags); l_old = lookup_elem_raw(head, hash, key, key_size); -- cgit v1.2.3-71-gd317 From f03efe49bd16c017107ff5079d08ea428e390dde Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:35 +0100 Subject: bpf/tracing: Remove redundant preempt_disable() in __bpf_trace_run() __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation. This is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de --- kernel/trace/bpf_trace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b8661bd0d028..4d42a5d05ec9 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1516,10 +1516,9 @@ void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) static __always_inline void __bpf_trace_run(struct bpf_prog *prog, u64 *args) { + cant_sleep(); rcu_read_lock(); - preempt_disable(); (void) BPF_PROG_RUN(prog, args); - preempt_enable(); rcu_read_unlock(); } -- cgit v1.2.3-71-gd317 From 1b7a51a63b031092b8b8acda3f6820b62a8b5e5d Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:36 +0100 Subject: bpf/trace: Remove EXPORT from trace_call_bpf() All callers are built in. No point to export this. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 4d42a5d05ec9..15fafaed027c 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -119,7 +119,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) return ret; } -EXPORT_SYMBOL_GPL(trace_call_bpf); #ifdef CONFIG_BPF_KPROBE_OVERRIDE BPF_CALL_2(bpf_override_return, struct pt_regs *, regs, unsigned long, rc) -- cgit v1.2.3-71-gd317 From 70ed0706a48e3da3eb4515214fef658ff1184b9f Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Mon, 24 Feb 2020 11:27:15 -0800 Subject: bpf: disable preemption for bpf progs attached to uprobe trace_call_bpf() no longer disables preemption on its own. All callers of this function has to do it explicitly. Signed-off-by: Alexei Starovoitov Acked-by: Thomas Gleixner --- kernel/trace/trace_uprobe.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 18d16f3ef980..2a8e8e9c1c75 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1333,8 +1333,15 @@ static void __uprobe_perf_func(struct trace_uprobe *tu, int size, esize; int rctx; - if (bpf_prog_array_valid(call) && !trace_call_bpf(call, regs)) - return; + if (bpf_prog_array_valid(call)) { + u32 ret; + + preempt_disable(); + ret = trace_call_bpf(call, regs); + preempt_enable(); + if (!ret) + return; + } esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu)); -- cgit v1.2.3-71-gd317 From b0a81b94cc50a112601721fcc2f91fab78d7b9f3 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:37 +0100 Subject: bpf/trace: Remove redundant preempt_disable from trace_call_bpf() Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is invoked from a trace point via __DO_TRACE() which already disables preemption _before_ invoking any of the functions which are attached to a trace point. Remove it and add a cant_sleep() check. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de --- kernel/trace/bpf_trace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 15fafaed027c..07764c761073 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -83,7 +83,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) if (in_nmi()) /* not supported yet */ return 1; - preempt_disable(); + cant_sleep(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) { /* @@ -115,7 +115,6 @@ unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx) out: __this_cpu_dec(bpf_prog_active); - preempt_enable(); return ret; } -- cgit v1.2.3-71-gd317 From 1d7bf6b7d3e8353c3fac648f3f9b3010458570c2 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:38 +0100 Subject: perf/bpf: Remove preempt disable around BPF invocation The BPF invocation from the perf event overflow handler does not require to disable preemption because this is called from NMI or at least hard interrupt context which is already non-preemptible. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145643.151953573@linutronix.de --- kernel/events/core.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index e453589da97c..bbdfac0182f4 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9206,7 +9206,6 @@ static void bpf_overflow_handler(struct perf_event *event, int ret = 0; ctx.regs = perf_arch_bpf_user_pt_regs(regs); - preempt_disable(); if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) goto out; rcu_read_lock(); @@ -9214,7 +9213,6 @@ static void bpf_overflow_handler(struct perf_event *event, rcu_read_unlock(); out: __this_cpu_dec(bpf_prog_active); - preempt_enable(); if (!ret) return; -- cgit v1.2.3-71-gd317 From 8a37963c7ac9ecb7f86f8ebda020e3f8d6d7b8a0 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:39 +0100 Subject: bpf: Remove recursion prevention from rcu free callback If an element is freed via RCU then recursion into BPF instrumentation functions is not a concern. The element is already detached from the map and the RCU callback does not hold any locks on which a kprobe, perf event or tracepoint attached BPF program could deadlock. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de --- kernel/bpf/hashtab.c | 8 -------- 1 file changed, 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 65711a220fe0..431cef22d29d 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -706,15 +706,7 @@ static void htab_elem_free_rcu(struct rcu_head *head) struct htab_elem *l = container_of(head, struct htab_elem, rcu); struct bpf_htab *htab = l->htab; - /* must increment bpf_prog_active to avoid kprobe+bpf triggering while - * we're calling kfree, otherwise deadlock is possible if kprobes - * are placed somewhere inside of slub - */ - preempt_disable(); - __this_cpu_inc(bpf_prog_active); htab_elem_free(htab, l); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); } static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l) -- cgit v1.2.3-71-gd317 From 569de905ebc30b9c61be7aa557403aeb5a9141a4 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:40 +0100 Subject: bpf: Dont iterate over possible CPUs with interrupts disabled pcpu_freelist_populate() is disabling interrupts and then iterates over the possible CPUs. The reason why this disables interrupts is to silence lockdep because the invoked ___pcpu_freelist_push() takes spin locks. Neither the interrupt disabling nor the locking are required in this function because it's called during initialization and the resulting map is not yet visible to anything. Split out the actual push assignement into an inline, call it from the loop and remove the interrupt disable. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145643.365930116@linutronix.de --- kernel/bpf/percpu_freelist.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c index 6e090140b924..b367430e611c 100644 --- a/kernel/bpf/percpu_freelist.c +++ b/kernel/bpf/percpu_freelist.c @@ -25,12 +25,18 @@ void pcpu_freelist_destroy(struct pcpu_freelist *s) free_percpu(s->freelist); } +static inline void pcpu_freelist_push_node(struct pcpu_freelist_head *head, + struct pcpu_freelist_node *node) +{ + node->next = head->first; + head->first = node; +} + static inline void ___pcpu_freelist_push(struct pcpu_freelist_head *head, struct pcpu_freelist_node *node) { raw_spin_lock(&head->lock); - node->next = head->first; - head->first = node; + pcpu_freelist_push_node(head, node); raw_spin_unlock(&head->lock); } @@ -56,21 +62,16 @@ void pcpu_freelist_populate(struct pcpu_freelist *s, void *buf, u32 elem_size, u32 nr_elems) { struct pcpu_freelist_head *head; - unsigned long flags; int i, cpu, pcpu_entries; pcpu_entries = nr_elems / num_possible_cpus() + 1; i = 0; - /* disable irq to workaround lockdep false positive - * in bpf usage pcpu_freelist_populate() will never race - * with pcpu_freelist_push() - */ - local_irq_save(flags); for_each_possible_cpu(cpu) { again: head = per_cpu_ptr(s->freelist, cpu); - ___pcpu_freelist_push(head, buf); + /* No locking required as this is not visible yet. */ + pcpu_freelist_push_node(head, buf); i++; buf += elem_size; if (i == nr_elems) @@ -78,7 +79,6 @@ again: if (i % pcpu_entries) goto again; } - local_irq_restore(flags); } struct pcpu_freelist_node *__pcpu_freelist_pop(struct pcpu_freelist *s) -- cgit v1.2.3-71-gd317 From 3d9f773cf2876c01a505b9fe27270901d464e90a Mon Sep 17 00:00:00 2001 From: David Miller Date: Mon, 24 Feb 2020 15:01:43 +0100 Subject: bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites. All of these cases are strictly of the form: preempt_disable(); BPF_PROG_RUN(...); preempt_enable(); Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN() with: migrate_disable(); BPF_PROG_RUN(...); migrate_enable(); On non RT enabled kernels this maps to preempt_disable/enable() and on RT enabled kernels this solely prevents migration, which is sufficient as there is no requirement to prevent reentrancy to any BPF program from a preempting task. The only requirement is that the program stays on the same CPU. Therefore, this is a trivially correct transformation. The seccomp loop does not need protection over the loop. It only needs protection per BPF filter program [ tglx: Converted to bpf_prog_run_pin_on_cpu() ] Signed-off-by: David S. Miller Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de --- include/linux/filter.h | 4 +--- kernel/seccomp.c | 4 +--- net/core/flow_dissector.c | 4 +--- net/core/skmsg.c | 8 ++------ net/kcm/kcmsock.c | 4 +--- 5 files changed, 6 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/include/linux/filter.h b/include/linux/filter.h index 1982a52eb4c9..9270de2a0df8 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -717,9 +717,7 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog, if (unlikely(prog->cb_access)) memset(cb_data, 0, BPF_SKB_CB_LEN); - preempt_disable(); - res = BPF_PROG_RUN(prog, skb); - preempt_enable(); + res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } diff --git a/kernel/seccomp.c b/kernel/seccomp.c index b6ea3dcb57bf..787041eb011b 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -268,16 +268,14 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). */ - preempt_disable(); for (; f; f = f->prev) { - u32 cur_ret = BPF_PROG_RUN(f->prog, sd); + u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd); if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) { ret = cur_ret; *match = f; } } - preempt_enable(); return ret; } #endif /* CONFIG_SECCOMP_FILTER */ diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index a1670dff0629..3eff84824c8b 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -920,9 +920,7 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx, (int)FLOW_DISSECTOR_F_STOP_AT_ENCAP); flow_keys->flags = flags; - preempt_disable(); - result = BPF_PROG_RUN(prog, ctx); - preempt_enable(); + result = bpf_prog_run_pin_on_cpu(prog, ctx); flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen); flow_keys->thoff = clamp_t(u16, flow_keys->thoff, diff --git a/net/core/skmsg.c b/net/core/skmsg.c index eeb28cb85664..c479372f2cd2 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -628,7 +628,6 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, struct bpf_prog *prog; int ret; - preempt_disable(); rcu_read_lock(); prog = READ_ONCE(psock->progs.msg_parser); if (unlikely(!prog)) { @@ -638,7 +637,7 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, sk_msg_compute_data_pointers(msg); msg->sk = sk; - ret = BPF_PROG_RUN(prog, msg); + ret = bpf_prog_run_pin_on_cpu(prog, msg); ret = sk_psock_map_verd(ret, msg->sk_redir); psock->apply_bytes = msg->apply_bytes; if (ret == __SK_REDIRECT) { @@ -653,7 +652,6 @@ int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock, } out: rcu_read_unlock(); - preempt_enable(); return ret; } EXPORT_SYMBOL_GPL(sk_psock_msg_verdict); @@ -665,9 +663,7 @@ static int sk_psock_bpf_run(struct sk_psock *psock, struct bpf_prog *prog, skb->sk = psock->sk; bpf_compute_data_end_sk_skb(skb); - preempt_disable(); - ret = BPF_PROG_RUN(prog, skb); - preempt_enable(); + ret = bpf_prog_run_pin_on_cpu(prog, skb); /* strparser clones the skb before handing it to a upper layer, * meaning skb_orphan has been called. We NULL sk on the way out * to ensure we don't trigger a BUG_ON() in skb/sk operations diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c index ea9e73428ed9..56fac24a627a 100644 --- a/net/kcm/kcmsock.c +++ b/net/kcm/kcmsock.c @@ -380,9 +380,7 @@ static int kcm_parse_func_strparser(struct strparser *strp, struct sk_buff *skb) struct bpf_prog *prog = psock->bpf_prog; int res; - preempt_disable(); - res = BPF_PROG_RUN(prog, skb); - preempt_enable(); + res = bpf_prog_run_pin_on_cpu(prog, skb); return res; } -- cgit v1.2.3-71-gd317 From 02ad05965491ca72034327d47da6cb25f3a92603 Mon Sep 17 00:00:00 2001 From: David Miller Date: Mon, 24 Feb 2020 15:01:45 +0100 Subject: bpf: Use migrate_disable/enabe() in trampoline code. Instead of preemption disable/enable to reflect the purpose. This allows PREEMPT_RT to substitute it with an actual migration disable implementation. On non RT kernels this is still mapped to preempt_disable/enable(). Signed-off-by: David S. Miller Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145643.891428873@linutronix.de --- kernel/bpf/trampoline.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 6b264a92064b..704fa787fec0 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -367,8 +367,9 @@ out: mutex_unlock(&trampoline_mutex); } -/* The logic is similar to BPF_PROG_RUN, but with explicit rcu and preempt that - * are needed for trampoline. The macro is split into +/* The logic is similar to BPF_PROG_RUN, but with an explicit + * rcu_read_lock() and migrate_disable() which are required + * for the trampoline. The macro is split into * call _bpf_prog_enter * call prog->bpf_func * call __bpf_prog_exit @@ -378,7 +379,7 @@ u64 notrace __bpf_prog_enter(void) u64 start = 0; rcu_read_lock(); - preempt_disable(); + migrate_disable(); if (static_branch_unlikely(&bpf_stats_enabled_key)) start = sched_clock(); return start; @@ -401,7 +402,7 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) stats->nsecs += sched_clock() - start; u64_stats_update_end(&stats->syncp); } - preempt_enable(); + migrate_enable(); rcu_read_unlock(); } -- cgit v1.2.3-71-gd317 From 085fee1a72a9fba101a4a68a2c02fa8bd2b6f913 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:48 +0100 Subject: bpf: Use recursion prevention helpers in hashtab code The required protection is that the caller cannot be migrated to a different CPU as these places take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_disable/enable() and this_cpu_inc/dec() pairs with the new recursion prevention helpers to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable in the helpers map to preempt_disable/enable(), i.e. no functional change. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145644.211208533@linutronix.de --- kernel/bpf/hashtab.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 431cef22d29d..ef83b012d8d8 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -1333,8 +1333,7 @@ alloc: } again: - preempt_disable(); - this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); again_nocopy: dst_key = keys; @@ -1362,8 +1361,7 @@ again_nocopy: */ raw_spin_unlock_irqrestore(&b->lock, flags); rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); goto after_loop; } @@ -1374,8 +1372,7 @@ again_nocopy: */ raw_spin_unlock_irqrestore(&b->lock, flags); rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); kvfree(keys); kvfree(values); goto alloc; @@ -1445,8 +1442,7 @@ next_batch: } rcu_read_unlock(); - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys, key_size * bucket_cnt) || copy_to_user(uvalues + total * value_size, values, -- cgit v1.2.3-71-gd317 From b6e5dae15a61b0cc9219799926813baad0b58967 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:49 +0100 Subject: bpf: Replace open coded recursion prevention in sys_bpf() The required protection is that the caller cannot be migrated to a different CPU as these functions end up in places which take either a hash bucket lock or might trigger a kprobe inside the memory allocator. Both scenarios can lead to deadlocks. The deadlock prevention is per CPU by incrementing a per CPU variable which temporarily blocks the invocation of BPF programs from perf and kprobes. Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs with the new helper functions. These functions are already prepared to make BPF work on PREEMPT_RT enabled kernels. No functional change for !RT kernels. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145644.317843926@linutronix.de --- kernel/bpf/syscall.c | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index a91ad518c050..a79743a89815 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -171,11 +171,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, flags); } - /* must increment bpf_prog_active to avoid kprobe+bpf triggering from - * inside bpf map update or delete otherwise deadlocks are possible - */ - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_update(map, key, value, flags); @@ -206,8 +202,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, err = map->ops->map_update_elem(map, key, value, flags); rcu_read_unlock(); } - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); return err; @@ -222,8 +217,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, if (bpf_map_is_dev_bound(map)) return bpf_map_offload_lookup_elem(map, key, value); - preempt_disable(); - this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); if (map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) { err = bpf_percpu_hash_copy(map, key, value); @@ -268,8 +262,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value, rcu_read_unlock(); } - this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); return err; @@ -1136,13 +1129,11 @@ static int map_delete_elem(union bpf_attr *attr) goto out; } - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); err = map->ops->map_delete_elem(map, key); rcu_read_unlock(); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); out: kfree(key); @@ -1254,13 +1245,11 @@ int generic_map_delete_batch(struct bpf_map *map, break; } - preempt_disable(); - __this_cpu_inc(bpf_prog_active); + bpf_disable_instrumentation(); rcu_read_lock(); err = map->ops->map_delete_elem(map, key); rcu_read_unlock(); - __this_cpu_dec(bpf_prog_active); - preempt_enable(); + bpf_enable_instrumentation(); maybe_wait_bpf_programs(map); if (err) break; -- cgit v1.2.3-71-gd317 From d01f9b198ca985b49717d8cd0b1f57806cb1319a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:50 +0100 Subject: bpf: Factor out hashtab bucket lock operations As a preparation for making the BPF locking RT friendly, factor out the hash bucket lock operations into inline functions. This allows to do the necessary RT modification in one place instead of sprinkling it all over the place. No functional change. The now unused htab argument of the lock/unlock functions will be used in the next step which adds PREEMPT_RT support. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145644.420416916@linutronix.de --- kernel/bpf/hashtab.c | 69 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 46 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index ef83b012d8d8..9a3b926e915c 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -88,6 +88,32 @@ struct htab_elem { char key[0] __aligned(8); }; +static void htab_init_buckets(struct bpf_htab *htab) +{ + unsigned i; + + for (i = 0; i < htab->n_buckets; i++) { + INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); + raw_spin_lock_init(&htab->buckets[i].lock); + } +} + +static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab, + struct bucket *b) +{ + unsigned long flags; + + raw_spin_lock_irqsave(&b->lock, flags); + return flags; +} + +static inline void htab_unlock_bucket(const struct bpf_htab *htab, + struct bucket *b, + unsigned long flags) +{ + raw_spin_unlock_irqrestore(&b->lock, flags); +} + static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); static bool htab_is_lru(const struct bpf_htab *htab) @@ -348,8 +374,8 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU); bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC); struct bpf_htab *htab; - int err, i; u64 cost; + int err; htab = kzalloc(sizeof(*htab), GFP_USER); if (!htab) @@ -411,10 +437,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr) else htab->hashrnd = get_random_int(); - for (i = 0; i < htab->n_buckets; i++) { - INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); - raw_spin_lock_init(&htab->buckets[i].lock); - } + htab_init_buckets(htab); if (prealloc) { err = prealloc_init(htab); @@ -622,7 +645,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node) b = __select_bucket(htab, tgt_l->hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) if (l == tgt_l) { @@ -630,7 +653,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node) break; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return l == tgt_l; } @@ -896,7 +919,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, */ } - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -937,7 +960,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @@ -975,7 +998,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, return -ENOMEM; memcpy(l_new->key + round_up(map->key_size, 8), value, map->value_size); - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -994,7 +1017,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (ret) bpf_lru_push_free(&htab->lru, &l_new->lru_node); @@ -1029,7 +1052,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1052,7 +1075,7 @@ static int __htab_percpu_map_update_elem(struct bpf_map *map, void *key, } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @@ -1092,7 +1115,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, return -ENOMEM; } - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l_old = lookup_elem_raw(head, hash, key, key_size); @@ -1114,7 +1137,7 @@ static int __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key, } ret = 0; err: - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (l_new) bpf_lru_push_free(&htab->lru, &l_new->lru_node); return ret; @@ -1152,7 +1175,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l = lookup_elem_raw(head, hash, key, key_size); @@ -1162,7 +1185,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) ret = 0; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); return ret; } @@ -1184,7 +1207,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) b = __select_bucket(htab, hash); head = &b->head; - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); l = lookup_elem_raw(head, hash, key, key_size); @@ -1193,7 +1216,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) ret = 0; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); if (l) bpf_lru_push_free(&htab->lru, &l->lru_node); return ret; @@ -1342,7 +1365,7 @@ again_nocopy: head = &b->head; /* do not grab the lock unless need it (bucket_cnt > 0). */ if (locked) - raw_spin_lock_irqsave(&b->lock, flags); + flags = htab_lock_bucket(htab, b); bucket_cnt = 0; hlist_nulls_for_each_entry_rcu(l, n, head, hash_node) @@ -1359,7 +1382,7 @@ again_nocopy: /* Note that since bucket_cnt > 0 here, it is implicit * that the locked was grabbed, so release it. */ - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); rcu_read_unlock(); bpf_enable_instrumentation(); goto after_loop; @@ -1370,7 +1393,7 @@ again_nocopy: /* Note that since bucket_cnt > 0 here, it is implicit * that the locked was grabbed, so release it. */ - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); rcu_read_unlock(); bpf_enable_instrumentation(); kvfree(keys); @@ -1423,7 +1446,7 @@ again_nocopy: dst_val += value_size; } - raw_spin_unlock_irqrestore(&b->lock, flags); + htab_unlock_bucket(htab, b, flags); locked = false; while (node_to_free) { -- cgit v1.2.3-71-gd317 From 7f805d17f1523c7b2c0da319ddb427d6c5d94ff1 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:51 +0100 Subject: bpf: Prepare hashtab locking for PREEMPT_RT PREEMPT_RT forbids certain operations like memory allocations (even with GFP_ATOMIC) from atomic contexts. This is required because even with GFP_ATOMIC the memory allocator calls into code pathes which acquire locks with long held lock sections. To ensure the deterministic behaviour these locks are regular spinlocks, which are converted to 'sleepable' spinlocks on RT. The only true atomic contexts on an RT kernel are the low level hardware handling, scheduling, low level interrupt handling, NMIs etc. None of these contexts should ever do memory allocations. As regular device interrupt handlers and soft interrupts are forced into thread context, the existing code which does spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*(); just works. In theory the BPF locks could be converted to regular spinlocks as well, but the bucket locks and percpu_freelist locks can be taken from arbitrary contexts (perf, kprobes, tracepoints) which are required to be atomic contexts even on RT. These mechanisms require preallocated maps, so there is no need to invoke memory allocations within the lock held sections. BPF maps which need dynamic allocation are only used from (forced) thread context on RT and can therefore use regular spinlocks which in turn allows to invoke memory allocations from the lock held section. To achieve this make the hash bucket lock a union of a raw and a regular spinlock and initialize and lock/unlock either the raw spinlock for preallocated maps or the regular variant for maps which require memory allocations. On a non RT kernel this distinction is neither possible nor required. spinlock maps to raw_spinlock and the extra code and conditional is optimized out by the compiler. No functional change. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145644.509685912@linutronix.de --- kernel/bpf/hashtab.c | 65 ++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 56 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 9a3b926e915c..53d9483fee10 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -46,10 +46,43 @@ * from one of these contexts completed. sys_bpf() uses the same mechanism * by pinning the task to the current CPU and incrementing the recursion * protection accross the map operation. + * + * This has subtle implications on PREEMPT_RT. PREEMPT_RT forbids certain + * operations like memory allocations (even with GFP_ATOMIC) from atomic + * contexts. This is required because even with GFP_ATOMIC the memory + * allocator calls into code pathes which acquire locks with long held lock + * sections. To ensure the deterministic behaviour these locks are regular + * spinlocks, which are converted to 'sleepable' spinlocks on RT. The only + * true atomic contexts on an RT kernel are the low level hardware + * handling, scheduling, low level interrupt handling, NMIs etc. None of + * these contexts should ever do memory allocations. + * + * As regular device interrupt handlers and soft interrupts are forced into + * thread context, the existing code which does + * spin_lock*(); alloc(GPF_ATOMIC); spin_unlock*(); + * just works. + * + * In theory the BPF locks could be converted to regular spinlocks as well, + * but the bucket locks and percpu_freelist locks can be taken from + * arbitrary contexts (perf, kprobes, tracepoints) which are required to be + * atomic contexts even on RT. These mechanisms require preallocated maps, + * so there is no need to invoke memory allocations within the lock held + * sections. + * + * BPF maps which need dynamic allocation are only used from (forced) + * thread context on RT and can therefore use regular spinlocks which in + * turn allows to invoke memory allocations from the lock held section. + * + * On a non RT kernel this distinction is neither possible nor required. + * spinlock maps to raw_spinlock and the extra code is optimized out by the + * compiler. */ struct bucket { struct hlist_nulls_head head; - raw_spinlock_t lock; + union { + raw_spinlock_t raw_lock; + spinlock_t lock; + }; }; struct bpf_htab { @@ -88,13 +121,26 @@ struct htab_elem { char key[0] __aligned(8); }; +static inline bool htab_is_prealloc(const struct bpf_htab *htab) +{ + return !(htab->map.map_flags & BPF_F_NO_PREALLOC); +} + +static inline bool htab_use_raw_lock(const struct bpf_htab *htab) +{ + return (!IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab)); +} + static void htab_init_buckets(struct bpf_htab *htab) { unsigned i; for (i = 0; i < htab->n_buckets; i++) { INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i); - raw_spin_lock_init(&htab->buckets[i].lock); + if (htab_use_raw_lock(htab)) + raw_spin_lock_init(&htab->buckets[i].raw_lock); + else + spin_lock_init(&htab->buckets[i].lock); } } @@ -103,7 +149,10 @@ static inline unsigned long htab_lock_bucket(const struct bpf_htab *htab, { unsigned long flags; - raw_spin_lock_irqsave(&b->lock, flags); + if (htab_use_raw_lock(htab)) + raw_spin_lock_irqsave(&b->raw_lock, flags); + else + spin_lock_irqsave(&b->lock, flags); return flags; } @@ -111,7 +160,10 @@ static inline void htab_unlock_bucket(const struct bpf_htab *htab, struct bucket *b, unsigned long flags) { - raw_spin_unlock_irqrestore(&b->lock, flags); + if (htab_use_raw_lock(htab)) + raw_spin_unlock_irqrestore(&b->raw_lock, flags); + else + spin_unlock_irqrestore(&b->lock, flags); } static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node); @@ -128,11 +180,6 @@ static bool htab_is_percpu(const struct bpf_htab *htab) htab->map.map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH; } -static bool htab_is_prealloc(const struct bpf_htab *htab) -{ - return !(htab->map.map_flags & BPF_F_NO_PREALLOC); -} - static inline void htab_elem_set_ptr(struct htab_elem *l, u32 key_size, void __percpu *pptr) { -- cgit v1.2.3-71-gd317 From 66150d0dde030c5ee68ccd93e4c54a73c47ebebd Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Mon, 24 Feb 2020 15:01:52 +0100 Subject: bpf, lpm: Make locking RT friendly The LPM trie map cannot be used in contexts like perf, kprobes and tracing as this map type dynamically allocates memory. The memory allocation happens with a raw spinlock held which is a truly spinning lock on a PREEMPT RT enabled kernel which disables preemption and interrupts. As RT does not allow memory allocation from such a section for various reasons, convert the raw spinlock to a regular spinlock. On a RT enabled kernel these locks are substituted by 'sleeping' spinlocks which provide the proper protection but keep the code preemptible. On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does not cause any functional change. Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145644.602129531@linutronix.de --- kernel/bpf/lpm_trie.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 56e6c75d354d..3b3c420bc8ed 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -34,7 +34,7 @@ struct lpm_trie { size_t n_entries; size_t max_prefixlen; size_t data_size; - raw_spinlock_t lock; + spinlock_t lock; }; /* This trie implements a longest prefix match algorithm that can be used to @@ -315,7 +315,7 @@ static int trie_update_elem(struct bpf_map *map, if (key->prefixlen > trie->max_prefixlen) return -EINVAL; - raw_spin_lock_irqsave(&trie->lock, irq_flags); + spin_lock_irqsave(&trie->lock, irq_flags); /* Allocate and fill a new node */ @@ -422,7 +422,7 @@ out: kfree(im_node); } - raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + spin_unlock_irqrestore(&trie->lock, irq_flags); return ret; } @@ -442,7 +442,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key) if (key->prefixlen > trie->max_prefixlen) return -EINVAL; - raw_spin_lock_irqsave(&trie->lock, irq_flags); + spin_lock_irqsave(&trie->lock, irq_flags); /* Walk the tree looking for an exact key/length match and keeping * track of the path we traverse. We will need to know the node @@ -518,7 +518,7 @@ static int trie_delete_elem(struct bpf_map *map, void *_key) kfree_rcu(node, rcu); out: - raw_spin_unlock_irqrestore(&trie->lock, irq_flags); + spin_unlock_irqrestore(&trie->lock, irq_flags); return ret; } @@ -575,7 +575,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr) if (ret) goto out_err; - raw_spin_lock_init(&trie->lock); + spin_lock_init(&trie->lock); return &trie->map; out_err: -- cgit v1.2.3-71-gd317 From 099bfaa731ec347d3f16a463ae53b88a1700c0af Mon Sep 17 00:00:00 2001 From: David Miller Date: Mon, 24 Feb 2020 15:01:53 +0100 Subject: bpf/stackmap: Dont trylock mmap_sem with PREEMPT_RT and interrupts disabled In a RT kernel down_read_trylock() cannot be used from NMI context and up_read_non_owner() is another problematic issue. So in such a configuration, simply elide the annotated stackmap and just report the raw IPs. In the longer term, it might be possible to provide a atomic friendly versions of the page cache traversal which will at least provide the info if the pages are resident and don't need to be paged in. [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work callback and add a comment ] Signed-off-by: David S. Miller Signed-off-by: Thomas Gleixner Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200224145644.708960317@linutronix.de --- kernel/bpf/stackmap.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index 3f958b90d914..db76339fe358 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -40,6 +40,9 @@ static void do_up_read(struct irq_work *entry) { struct stack_map_irq_work *work; + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT))) + return; + work = container_of(entry, struct stack_map_irq_work, irq_work); up_read_non_owner(work->sem); work->sem = NULL; @@ -288,10 +291,19 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs, struct stack_map_irq_work *work = NULL; if (irqs_disabled()) { - work = this_cpu_ptr(&up_read_work); - if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) - /* cannot queue more up_read, fallback */ + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { + work = this_cpu_ptr(&up_read_work); + if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) { + /* cannot queue more up_read, fallback */ + irq_work_busy = true; + } + } else { + /* + * PREEMPT_RT does not allow to trylock mmap sem in + * interrupt disabled context. Force the fallback code. + */ irq_work_busy = true; + } } /* -- cgit v1.2.3-71-gd317 From d7f10df86202273155a9d8f8553bc2ad28e0dd46 Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Wed, 26 Feb 2020 18:17:44 -0600 Subject: bpf: Replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Daniel Borkmann Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor --- include/linux/bpf-cgroup.h | 2 +- include/linux/bpf.h | 2 +- include/uapi/linux/bpf.h | 2 +- kernel/bpf/bpf_struct_ops.c | 2 +- kernel/bpf/hashtab.c | 2 +- kernel/bpf/lpm_trie.c | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index a11d5b7dbbf3..a7cd5c7a2509 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -36,7 +36,7 @@ struct bpf_cgroup_storage_map; struct bpf_storage_buffer { struct rcu_head rcu; - char data[0]; + char data[]; }; struct bpf_cgroup_storage { diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 1acd5bf70350..9aa33b8f3d55 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -859,7 +859,7 @@ struct bpf_prog_array_item { struct bpf_prog_array { struct rcu_head rcu; - struct bpf_prog_array_item items[0]; + struct bpf_prog_array_item items[]; }; struct bpf_prog_array *bpf_prog_array_alloc(u32 prog_cnt, gfp_t flags); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 906e9f2752db..8e98ced0963b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -73,7 +73,7 @@ struct bpf_insn { /* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */ struct bpf_lpm_trie_key { __u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */ - __u8 data[0]; /* Arbitrary size */ + __u8 data[]; /* Arbitrary size */ }; struct bpf_cgroup_storage_key { diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 042f95534f86..c498f0fffb40 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -23,7 +23,7 @@ enum bpf_struct_ops_state { struct bpf_struct_ops_value { BPF_STRUCT_OPS_COMMON_VALUE; - char data[0] ____cacheline_aligned_in_smp; + char data[] ____cacheline_aligned_in_smp; }; struct bpf_struct_ops_map { diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 53d9483fee10..d541c8486c95 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -118,7 +118,7 @@ struct htab_elem { struct bpf_lru_node lru_node; }; u32 hash; - char key[0] __aligned(8); + char key[] __aligned(8); }; static inline bool htab_is_prealloc(const struct bpf_htab *htab) diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 3b3c420bc8ed..65c236cf341e 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -25,7 +25,7 @@ struct lpm_trie_node { struct lpm_trie_node __rcu *child[2]; u32 prefixlen; u32 flags; - u8 data[0]; + u8 data[]; }; struct lpm_trie { -- cgit v1.2.3-71-gd317 From 1ed4d92458a969e71e7914550b6f0c730c14d84e Mon Sep 17 00:00:00 2001 From: Martin KaFai Lau Date: Tue, 25 Feb 2020 15:04:21 -0800 Subject: bpf: INET_DIAG support in bpf_sk_storage This patch adds INET_DIAG support to bpf_sk_storage. 1. Although this series adds bpf_sk_storage diag capability to inet sk, bpf_sk_storage is in general applicable to all fullsock. Hence, the bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as the argument. 2. The request will be like: INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch) SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32) SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32) ...... Considering there could have multiple bpf_sk_storages in a sk, instead of reusing INET_DIAG_INFO ("ss -i"), the user can select some specific bpf_sk_storage to dump by specifying an array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD. If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages of a sk. 3. The reply will be like: INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch) SK_DIAG_BPF_STORAGE (nla_nest) SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) SK_DIAG_BPF_STORAGE (nla_nest) SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) ...... 4. Unlike other INET_DIAG info of a sk which is pretty static, the size required to dump the bpf_sk_storage(s) of a sk is dynamic as the system adding more bpf_sk_storage_map. It is hard to set a static min_dump_alloc size. Hence, this series learns it at the runtime and adjust the cb->min_dump_alloc as it iterates all sk(s) of a system. The "unsigned int *res_diag_size" in bpf_sk_storage_diag_put() is for this purpose. The next patch will update the cb->min_dump_alloc as it iterates the sk(s). Signed-off-by: Martin KaFai Lau Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com --- include/linux/bpf.h | 1 + include/net/bpf_sk_storage.h | 27 ++++ include/uapi/linux/sock_diag.h | 26 ++++ kernel/bpf/syscall.c | 15 +++ net/core/bpf_sk_storage.c | 283 ++++++++++++++++++++++++++++++++++++++++- 5 files changed, 346 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 9aa33b8f3d55..6015a4daf118 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1023,6 +1023,7 @@ void __bpf_free_used_maps(struct bpf_prog_aux *aux, void bpf_prog_free_id(struct bpf_prog *prog, bool do_idr_lock); void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock); +struct bpf_map *bpf_map_get(u32 ufd); struct bpf_map *bpf_map_get_with_uref(u32 ufd); struct bpf_map *__bpf_map_get(struct fd f); void bpf_map_inc(struct bpf_map *map); diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h index 8e4f831d2e52..5036c94c0503 100644 --- a/include/net/bpf_sk_storage.h +++ b/include/net/bpf_sk_storage.h @@ -10,14 +10,41 @@ void bpf_sk_storage_free(struct sock *sk); extern const struct bpf_func_proto bpf_sk_storage_get_proto; extern const struct bpf_func_proto bpf_sk_storage_delete_proto; +struct bpf_sk_storage_diag; +struct sk_buff; +struct nlattr; +struct sock; + #ifdef CONFIG_BPF_SYSCALL int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk); +struct bpf_sk_storage_diag * +bpf_sk_storage_diag_alloc(const struct nlattr *nla_stgs); +void bpf_sk_storage_diag_free(struct bpf_sk_storage_diag *diag); +int bpf_sk_storage_diag_put(struct bpf_sk_storage_diag *diag, + struct sock *sk, struct sk_buff *skb, + int stg_array_type, + unsigned int *res_diag_size); #else static inline int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) { return 0; } +static inline struct bpf_sk_storage_diag * +bpf_sk_storage_diag_alloc(const struct nlattr *nla) +{ + return NULL; +} +static inline void bpf_sk_storage_diag_free(struct bpf_sk_storage_diag *diag) +{ +} +static inline int bpf_sk_storage_diag_put(struct bpf_sk_storage_diag *diag, + struct sock *sk, struct sk_buff *skb, + int stg_array_type, + unsigned int *res_diag_size) +{ + return 0; +} #endif #endif /* _BPF_SK_STORAGE_H */ diff --git a/include/uapi/linux/sock_diag.h b/include/uapi/linux/sock_diag.h index e5925009a652..5f74a5f6091d 100644 --- a/include/uapi/linux/sock_diag.h +++ b/include/uapi/linux/sock_diag.h @@ -36,4 +36,30 @@ enum sknetlink_groups { }; #define SKNLGRP_MAX (__SKNLGRP_MAX - 1) +enum { + SK_DIAG_BPF_STORAGE_REQ_NONE, + SK_DIAG_BPF_STORAGE_REQ_MAP_FD, + __SK_DIAG_BPF_STORAGE_REQ_MAX, +}; + +#define SK_DIAG_BPF_STORAGE_REQ_MAX (__SK_DIAG_BPF_STORAGE_REQ_MAX - 1) + +enum { + SK_DIAG_BPF_STORAGE_REP_NONE, + SK_DIAG_BPF_STORAGE, + __SK_DIAG_BPF_STORAGE_REP_MAX, +}; + +#define SK_DIAB_BPF_STORAGE_REP_MAX (__SK_DIAG_BPF_STORAGE_REP_MAX - 1) + +enum { + SK_DIAG_BPF_STORAGE_NONE, + SK_DIAG_BPF_STORAGE_PAD, + SK_DIAG_BPF_STORAGE_MAP_ID, + SK_DIAG_BPF_STORAGE_MAP_VALUE, + __SK_DIAG_BPF_STORAGE_MAX, +}; + +#define SK_DIAG_BPF_STORAGE_MAX (__SK_DIAG_BPF_STORAGE_MAX - 1) + #endif /* _UAPI__SOCK_DIAG_H__ */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index a79743a89815..c536c65256ad 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -902,6 +902,21 @@ void bpf_map_inc_with_uref(struct bpf_map *map) } EXPORT_SYMBOL_GPL(bpf_map_inc_with_uref); +struct bpf_map *bpf_map_get(u32 ufd) +{ + struct fd f = fdget(ufd); + struct bpf_map *map; + + map = __bpf_map_get(f); + if (IS_ERR(map)) + return map; + + bpf_map_inc(map); + fdput(f); + + return map; +} + struct bpf_map *bpf_map_get_with_uref(u32 ufd) { struct fd f = fdget(ufd); diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c index 3ab23f698221..3415a4896c59 100644 --- a/net/core/bpf_sk_storage.c +++ b/net/core/bpf_sk_storage.c @@ -8,6 +8,7 @@ #include #include #include +#include #include static atomic_t cache_idx; @@ -606,6 +607,14 @@ static void bpf_sk_storage_map_free(struct bpf_map *map) kfree(map); } +/* U16_MAX is much more than enough for sk local storage + * considering a tcp_sock is ~2k. + */ +#define MAX_VALUE_SIZE \ + min_t(u32, \ + (KMALLOC_MAX_SIZE - MAX_BPF_STACK - sizeof(struct bpf_sk_storage_elem)), \ + (U16_MAX - sizeof(struct bpf_sk_storage_elem))) + static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr) { if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK || @@ -619,12 +628,7 @@ static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr) if (!capable(CAP_SYS_ADMIN)) return -EPERM; - if (attr->value_size >= KMALLOC_MAX_SIZE - - MAX_BPF_STACK - sizeof(struct bpf_sk_storage_elem) || - /* U16_MAX is much more than enough for sk local storage - * considering a tcp_sock is ~2k. - */ - attr->value_size > U16_MAX - sizeof(struct bpf_sk_storage_elem)) + if (attr->value_size > MAX_VALUE_SIZE) return -E2BIG; return 0; @@ -910,3 +914,270 @@ const struct bpf_func_proto bpf_sk_storage_delete_proto = { .arg1_type = ARG_CONST_MAP_PTR, .arg2_type = ARG_PTR_TO_SOCKET, }; + +struct bpf_sk_storage_diag { + u32 nr_maps; + struct bpf_map *maps[]; +}; + +/* The reply will be like: + * INET_DIAG_BPF_SK_STORAGES (nla_nest) + * SK_DIAG_BPF_STORAGE (nla_nest) + * SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) + * SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) + * SK_DIAG_BPF_STORAGE (nla_nest) + * SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) + * SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) + * .... + */ +static int nla_value_size(u32 value_size) +{ + /* SK_DIAG_BPF_STORAGE (nla_nest) + * SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32) + * SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit) + */ + return nla_total_size(0) + nla_total_size(sizeof(u32)) + + nla_total_size_64bit(value_size); +} + +void bpf_sk_storage_diag_free(struct bpf_sk_storage_diag *diag) +{ + u32 i; + + if (!diag) + return; + + for (i = 0; i < diag->nr_maps; i++) + bpf_map_put(diag->maps[i]); + + kfree(diag); +} +EXPORT_SYMBOL_GPL(bpf_sk_storage_diag_free); + +static bool diag_check_dup(const struct bpf_sk_storage_diag *diag, + const struct bpf_map *map) +{ + u32 i; + + for (i = 0; i < diag->nr_maps; i++) { + if (diag->maps[i] == map) + return true; + } + + return false; +} + +struct bpf_sk_storage_diag * +bpf_sk_storage_diag_alloc(const struct nlattr *nla_stgs) +{ + struct bpf_sk_storage_diag *diag; + struct nlattr *nla; + u32 nr_maps = 0; + int rem, err; + + /* bpf_sk_storage_map is currently limited to CAP_SYS_ADMIN as + * the map_alloc_check() side also does. + */ + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + + nla_for_each_nested(nla, nla_stgs, rem) { + if (nla_type(nla) == SK_DIAG_BPF_STORAGE_REQ_MAP_FD) + nr_maps++; + } + + diag = kzalloc(sizeof(*diag) + sizeof(diag->maps[0]) * nr_maps, + GFP_KERNEL); + if (!diag) + return ERR_PTR(-ENOMEM); + + nla_for_each_nested(nla, nla_stgs, rem) { + struct bpf_map *map; + int map_fd; + + if (nla_type(nla) != SK_DIAG_BPF_STORAGE_REQ_MAP_FD) + continue; + + map_fd = nla_get_u32(nla); + map = bpf_map_get(map_fd); + if (IS_ERR(map)) { + err = PTR_ERR(map); + goto err_free; + } + if (map->map_type != BPF_MAP_TYPE_SK_STORAGE) { + bpf_map_put(map); + err = -EINVAL; + goto err_free; + } + if (diag_check_dup(diag, map)) { + bpf_map_put(map); + err = -EEXIST; + goto err_free; + } + diag->maps[diag->nr_maps++] = map; + } + + return diag; + +err_free: + bpf_sk_storage_diag_free(diag); + return ERR_PTR(err); +} +EXPORT_SYMBOL_GPL(bpf_sk_storage_diag_alloc); + +static int diag_get(struct bpf_sk_storage_data *sdata, struct sk_buff *skb) +{ + struct nlattr *nla_stg, *nla_value; + struct bpf_sk_storage_map *smap; + + /* It cannot exceed max nlattr's payload */ + BUILD_BUG_ON(U16_MAX - NLA_HDRLEN < MAX_VALUE_SIZE); + + nla_stg = nla_nest_start(skb, SK_DIAG_BPF_STORAGE); + if (!nla_stg) + return -EMSGSIZE; + + smap = rcu_dereference(sdata->smap); + if (nla_put_u32(skb, SK_DIAG_BPF_STORAGE_MAP_ID, smap->map.id)) + goto errout; + + nla_value = nla_reserve_64bit(skb, SK_DIAG_BPF_STORAGE_MAP_VALUE, + smap->map.value_size, + SK_DIAG_BPF_STORAGE_PAD); + if (!nla_value) + goto errout; + + if (map_value_has_spin_lock(&smap->map)) + copy_map_value_locked(&smap->map, nla_data(nla_value), + sdata->data, true); + else + copy_map_value(&smap->map, nla_data(nla_value), sdata->data); + + nla_nest_end(skb, nla_stg); + return 0; + +errout: + nla_nest_cancel(skb, nla_stg); + return -EMSGSIZE; +} + +static int bpf_sk_storage_diag_put_all(struct sock *sk, struct sk_buff *skb, + int stg_array_type, + unsigned int *res_diag_size) +{ + /* stg_array_type (e.g. INET_DIAG_BPF_SK_STORAGES) */ + unsigned int diag_size = nla_total_size(0); + struct bpf_sk_storage *sk_storage; + struct bpf_sk_storage_elem *selem; + struct bpf_sk_storage_map *smap; + struct nlattr *nla_stgs; + unsigned int saved_len; + int err = 0; + + rcu_read_lock(); + + sk_storage = rcu_dereference(sk->sk_bpf_storage); + if (!sk_storage || hlist_empty(&sk_storage->list)) { + rcu_read_unlock(); + return 0; + } + + nla_stgs = nla_nest_start(skb, stg_array_type); + if (!nla_stgs) + /* Continue to learn diag_size */ + err = -EMSGSIZE; + + saved_len = skb->len; + hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) { + smap = rcu_dereference(SDATA(selem)->smap); + diag_size += nla_value_size(smap->map.value_size); + + if (nla_stgs && diag_get(SDATA(selem), skb)) + /* Continue to learn diag_size */ + err = -EMSGSIZE; + } + + rcu_read_unlock(); + + if (nla_stgs) { + if (saved_len == skb->len) + nla_nest_cancel(skb, nla_stgs); + else + nla_nest_end(skb, nla_stgs); + } + + if (diag_size == nla_total_size(0)) { + *res_diag_size = 0; + return 0; + } + + *res_diag_size = diag_size; + return err; +} + +int bpf_sk_storage_diag_put(struct bpf_sk_storage_diag *diag, + struct sock *sk, struct sk_buff *skb, + int stg_array_type, + unsigned int *res_diag_size) +{ + /* stg_array_type (e.g. INET_DIAG_BPF_SK_STORAGES) */ + unsigned int diag_size = nla_total_size(0); + struct bpf_sk_storage *sk_storage; + struct bpf_sk_storage_data *sdata; + struct nlattr *nla_stgs; + unsigned int saved_len; + int err = 0; + u32 i; + + *res_diag_size = 0; + + /* No map has been specified. Dump all. */ + if (!diag->nr_maps) + return bpf_sk_storage_diag_put_all(sk, skb, stg_array_type, + res_diag_size); + + rcu_read_lock(); + sk_storage = rcu_dereference(sk->sk_bpf_storage); + if (!sk_storage || hlist_empty(&sk_storage->list)) { + rcu_read_unlock(); + return 0; + } + + nla_stgs = nla_nest_start(skb, stg_array_type); + if (!nla_stgs) + /* Continue to learn diag_size */ + err = -EMSGSIZE; + + saved_len = skb->len; + for (i = 0; i < diag->nr_maps; i++) { + sdata = __sk_storage_lookup(sk_storage, + (struct bpf_sk_storage_map *)diag->maps[i], + false); + + if (!sdata) + continue; + + diag_size += nla_value_size(diag->maps[i]->value_size); + + if (nla_stgs && diag_get(sdata, skb)) + /* Continue to learn diag_size */ + err = -EMSGSIZE; + } + rcu_read_unlock(); + + if (nla_stgs) { + if (saved_len == skb->len) + nla_nest_cancel(skb, nla_stgs); + else + nla_nest_end(skb, nla_stgs); + } + + if (diag_size == nla_total_size(0)) { + *res_diag_size = 0; + return 0; + } + + *res_diag_size = diag_size; + return err; +} +EXPORT_SYMBOL_GPL(bpf_sk_storage_diag_put); -- cgit v1.2.3-71-gd317 From 69879c01a0c3f70e0887cfb4d9ff439814361e46 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Thu, 20 Feb 2020 08:08:20 -0600 Subject: proc: Remove the now unnecessary internal mount of proc There remains no more code in the kernel using pids_ns->proc_mnt, therefore remove it from the kernel. The big benefit of this change is that one of the most error prone and tricky parts of the pid namespace implementation, maintaining kernel mounts of proc is removed. In addition removing the unnecessary complexity of the kernel mount fixes a regression that caused the proc mount options to be ignored. Now that the initial mount of proc comes from userspace, those mount options are again honored. This fixes Android's usage of the proc hidepid option. Reported-by: Alistair Strachan Fixes: e94591d0d90c ("proc: Convert proc_mount to use mount_ns.") Signed-off-by: "Eric W. Biederman" --- fs/proc/root.c | 36 ------------------------------------ include/linux/pid_namespace.h | 2 -- include/linux/proc_ns.h | 5 ----- kernel/pid.c | 8 -------- kernel/pid_namespace.c | 7 ------- 5 files changed, 58 deletions(-) (limited to 'kernel') diff --git a/fs/proc/root.c b/fs/proc/root.c index 608233dfd29c..2633f10446c3 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -292,39 +292,3 @@ struct proc_dir_entry proc_root = { .subdir = RB_ROOT, .name = "/proc", }; - -int pid_ns_prepare_proc(struct pid_namespace *ns) -{ - struct proc_fs_context *ctx; - struct fs_context *fc; - struct vfsmount *mnt; - - fc = fs_context_for_mount(&proc_fs_type, SB_KERNMOUNT); - if (IS_ERR(fc)) - return PTR_ERR(fc); - - if (fc->user_ns != ns->user_ns) { - put_user_ns(fc->user_ns); - fc->user_ns = get_user_ns(ns->user_ns); - } - - ctx = fc->fs_private; - if (ctx->pid_ns != ns) { - put_pid_ns(ctx->pid_ns); - get_pid_ns(ns); - ctx->pid_ns = ns; - } - - mnt = fc_mount(fc); - put_fs_context(fc); - if (IS_ERR(mnt)) - return PTR_ERR(mnt); - - ns->proc_mnt = mnt; - return 0; -} - -void pid_ns_release_proc(struct pid_namespace *ns) -{ - kern_unmount(ns->proc_mnt); -} diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h index 2ed6af88794b..4956e362e55e 100644 --- a/include/linux/pid_namespace.h +++ b/include/linux/pid_namespace.h @@ -33,7 +33,6 @@ struct pid_namespace { unsigned int level; struct pid_namespace *parent; #ifdef CONFIG_PROC_FS - struct vfsmount *proc_mnt; struct dentry *proc_self; struct dentry *proc_thread_self; #endif @@ -42,7 +41,6 @@ struct pid_namespace { #endif struct user_namespace *user_ns; struct ucounts *ucounts; - struct work_struct proc_work; kgid_t pid_gid; int hide_pid; int reboot; /* group exit code if this pidns was rebooted */ diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index 4626b1ac3b6c..e1106a077c1a 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -50,16 +50,11 @@ enum { #ifdef CONFIG_PROC_FS -extern int pid_ns_prepare_proc(struct pid_namespace *ns); -extern void pid_ns_release_proc(struct pid_namespace *ns); extern int proc_alloc_inum(unsigned int *pino); extern void proc_free_inum(unsigned int inum); #else /* CONFIG_PROC_FS */ -static inline int pid_ns_prepare_proc(struct pid_namespace *ns) { return 0; } -static inline void pid_ns_release_proc(struct pid_namespace *ns) {} - static inline int proc_alloc_inum(unsigned int *inum) { *inum = 1; diff --git a/kernel/pid.c b/kernel/pid.c index ca08d6a3aa77..60820e72634c 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -144,9 +144,6 @@ void free_pid(struct pid *pid) /* Handle a fork failure of the first process */ WARN_ON(ns->child_reaper); ns->pid_allocated = 0; - /* fall through */ - case 0: - schedule_work(&ns->proc_work); break; } @@ -247,11 +244,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, tmp = tmp->parent; } - if (unlikely(is_child_reaper(pid))) { - if (pid_ns_prepare_proc(ns)) - goto out_free; - } - get_pid_ns(ns); refcount_set(&pid->count, 1); for (type = 0; type < PIDTYPE_MAX; ++type) diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index d40017e79ebe..318fcc6ba301 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -57,12 +57,6 @@ static struct kmem_cache *create_pid_cachep(unsigned int level) return READ_ONCE(*pkc); } -static void proc_cleanup_work(struct work_struct *work) -{ - struct pid_namespace *ns = container_of(work, struct pid_namespace, proc_work); - pid_ns_release_proc(ns); -} - static struct ucounts *inc_pid_namespaces(struct user_namespace *ns) { return inc_ucount(ns, current_euid(), UCOUNT_PID_NAMESPACES); @@ -114,7 +108,6 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns ns->user_ns = get_user_ns(user_ns); ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING; - INIT_WORK(&ns->proc_work, proc_cleanup_work); return ns; -- cgit v1.2.3-71-gd317 From af9fe6d607c9f456fb6c1cb87e1dc66a43846efd Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 28 Feb 2020 16:29:12 -0600 Subject: pid: Improve the comment about waiting in zap_pid_ns_processes Oleg wrote a very informative comment, but with the removal of proc_cleanup_work it is no longer accurate. Rewrite the comment so that it only talks about the details that are still relevant, and hopefully is a little clearer. Signed-off-by: "Eric W. Biederman" --- kernel/pid_namespace.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 318fcc6ba301..01f8ba32cc0c 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -224,20 +224,27 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) } while (rc != -ECHILD); /* - * kernel_wait4() above can't reap the EXIT_DEAD children but we do not - * really care, we could reparent them to the global init. We could - * exit and reap ->child_reaper even if it is not the last thread in - * this pid_ns, free_pid(pid_allocated == 0) calls proc_cleanup_work(), - * pid_ns can not go away until proc_kill_sb() drops the reference. + * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE + * process whose parents processes are outside of the pid + * namespace. Such processes are created with setns()+fork(). * - * But this ns can also have other tasks injected by setns()+fork(). - * Again, ignoring the user visible semantics we do not really need - * to wait until they are all reaped, but they can be reparented to - * us and thus we need to ensure that pid->child_reaper stays valid - * until they all go away. See free_pid()->wake_up_process(). + * If those EXIT_ZOMBIE processes are not reaped by their + * parents before their parents exit, they will be reparented + * to pid_ns->child_reaper. Thus pidns->child_reaper needs to + * stay valid until they all go away. * - * We rely on ignored SIGCHLD, an injected zombie must be autoreaped - * if reparented. + * The code relies on the the pid_ns->child_reaper ignoring + * SIGCHILD to cause those EXIT_ZOMBIE processes to be + * autoreaped if reparented. + * + * Semantically it is also desirable to wait for EXIT_ZOMBIE + * processes before allowing the child_reaper to be reaped, as + * that gives the invariant that when the init process of a + * pid namespace is reaped all of the processes in the pid + * namespace are gone. + * + * Once all of the other tasks are gone from the pid_namespace + * free_pid() will awaken this task. */ for (;;) { set_current_state(TASK_INTERRUPTIBLE); -- cgit v1.2.3-71-gd317 From a2efdbf4fcb38ec1ae99c9d54d52c34fa867dc71 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 28 Feb 2020 11:08:45 -0600 Subject: posix-cpu-timers: cpu_clock_sample_group() no longer needs siglock As of e78c3496790e ("time, signal: Protect resource use statistics with seqlock") cpu_clock_sample_group() no longer needs siglock protection so remove the stale comment. Signed-off-by: "Eric W. Biederman" Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/87eeuevduq.fsf@x220.int.ebiederm.org --- kernel/time/posix-cpu-timers.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index 8ff6da77a01f..46cc188bf5ab 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -336,9 +336,7 @@ static void __thread_group_cputime(struct task_struct *tsk, u64 *samples) /* * Sample a process (thread group) clock for the given task clkid. If the * group's cputime accounting is already enabled, read the atomic - * store. Otherwise a full update is required. Task's sighand lock must be - * held to protect the task traversal on a full update. clkid is already - * validated. + * store. Otherwise a full update is required. clkid is already validated. */ static u64 cpu_clock_sample_group(const clockid_t clkid, struct task_struct *p, bool start) -- cgit v1.2.3-71-gd317 From 60f2ceaa8111d9ec51af87bde4a6809f2b3235e4 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 28 Feb 2020 11:09:19 -0600 Subject: posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group As of e78c3496790e ("time, signal: Protect resource use statistics with seqlock") cpu_clock_sample_group no longers needs siglock protection. Unfortunately no one realized it at the time. Remove the extra locking that is for cpu_clock_sample_group and not for cpu_clock_sample. This significantly simplifies the code. Signed-off-by: "Eric W. Biederman" Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/878skmvdts.fsf@x220.int.ebiederm.org --- kernel/time/posix-cpu-timers.c | 66 ++++++++---------------------------------- 1 file changed, 12 insertions(+), 54 deletions(-) (limited to 'kernel') diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index 46cc188bf5ab..40c2d8396bb9 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -718,31 +718,10 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec64 *itp /* * Sample the clock to take the difference with the expiry time. */ - if (CPUCLOCK_PERTHREAD(timer->it_clock)) { + if (CPUCLOCK_PERTHREAD(timer->it_clock)) now = cpu_clock_sample(clkid, p); - } else { - struct sighand_struct *sighand; - unsigned long flags; - - /* - * Protect against sighand release/switch in exit/exec and - * also make timer sampling safe if it ends up calling - * thread_group_cputime(). - */ - sighand = lock_task_sighand(p, &flags); - if (unlikely(sighand == NULL)) { - /* - * The process has been reaped. - * We can't even collect a sample any more. - * Disarm the timer, nothing else to do. - */ - cpu_timer_setexpires(ctmr, 0); - return; - } else { - now = cpu_clock_sample_group(clkid, p, false); - unlock_task_sighand(p, &flags); - } - } + else + now = cpu_clock_sample_group(clkid, p, false); if (now < expires) { itp->it_value = ns_to_timespec64(expires - now); @@ -986,43 +965,22 @@ static void posix_cpu_timer_rearm(struct k_itimer *timer) /* * Fetch the current sample and update the timer's expiry time. */ - if (CPUCLOCK_PERTHREAD(timer->it_clock)) { + if (CPUCLOCK_PERTHREAD(timer->it_clock)) now = cpu_clock_sample(clkid, p); - bump_cpu_timer(timer, now); - if (unlikely(p->exit_state)) - return; - - /* Protect timer list r/w in arm_timer() */ - sighand = lock_task_sighand(p, &flags); - if (!sighand) - return; - } else { - /* - * Protect arm_timer() and timer sampling in case of call to - * thread_group_cputime(). - */ - sighand = lock_task_sighand(p, &flags); - if (unlikely(sighand == NULL)) { - /* - * The process has been reaped. - * We can't even collect a sample any more. - */ - cpu_timer_setexpires(ctmr, 0); - return; - } else if (unlikely(p->exit_state) && thread_group_empty(p)) { - /* If the process is dying, no need to rearm */ - goto unlock; - } + else now = cpu_clock_sample_group(clkid, p, true); - bump_cpu_timer(timer, now); - /* Leave the sighand locked for the call below. */ - } + + bump_cpu_timer(timer, now); + + /* Protect timer list r/w in arm_timer() */ + sighand = lock_task_sighand(p, &flags); + if (unlikely(sighand == NULL)) + return; /* * Now re-arm for the new expiry time. */ arm_timer(timer); -unlock: unlock_task_sighand(p, &flags); } -- cgit v1.2.3-71-gd317 From beb41d9cbe4179058634e05d60235b6155c7b6c6 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 28 Feb 2020 11:09:46 -0600 Subject: posix-cpu-timers: Pass the task into arm_timer() The task has been already computed to take siglock before calling arm_timer. So pass the benefit of that labor into arm_timer(). Signed-off-by: "Eric W. Biederman" Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/8736auvdt1.fsf@x220.int.ebiederm.org --- kernel/time/posix-cpu-timers.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index 40c2d8396bb9..ef936c5a910b 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -482,12 +482,11 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk) * Insert the timer on the appropriate list before any timers that * expire later. This must be called with the sighand lock held. */ -static void arm_timer(struct k_itimer *timer) +static void arm_timer(struct k_itimer *timer, struct task_struct *p) { int clkidx = CPUCLOCK_WHICH(timer->it_clock); struct cpu_timer *ctmr = &timer->it.cpu; u64 newexp = cpu_timer_getexpires(ctmr); - struct task_struct *p = ctmr->task; struct posix_cputimer_base *base; if (CPUCLOCK_PERTHREAD(timer->it_clock)) @@ -660,7 +659,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, */ cpu_timer_setexpires(ctmr, new_expires); if (new_expires != 0 && val < new_expires) { - arm_timer(timer); + arm_timer(timer, p); } unlock_task_sighand(p, &flags); @@ -980,7 +979,7 @@ static void posix_cpu_timer_rearm(struct k_itimer *timer) /* * Now re-arm for the new expiry time. */ - arm_timer(timer); + arm_timer(timer, p); unlock_task_sighand(p, &flags); } -- cgit v1.2.3-71-gd317 From 6fb614920b38bbf3c1c7fcd944c6d9b5d746103d Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 18 Feb 2020 16:50:18 +0100 Subject: task_work_run: don't take ->pi_lock unconditionally As Peter pointed out, task_work() can avoid ->pi_lock and cmpxchg() if task->task_works == NULL && !PF_EXITING. And in fact the only reason why task_work_run() needs ->pi_lock is the possible race with task_work_cancel(), we can optimize this code and make the locking more clear. Signed-off-by: Oleg Nesterov Signed-off-by: Jens Axboe --- kernel/task_work.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/task_work.c b/kernel/task_work.c index 0fef395662a6..825f28259a19 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -97,16 +97,26 @@ void task_work_run(void) * work->func() can do task_work_add(), do not set * work_exited unless the list is empty. */ - raw_spin_lock_irq(&task->pi_lock); do { + head = NULL; work = READ_ONCE(task->task_works); - head = !work && (task->flags & PF_EXITING) ? - &work_exited : NULL; + if (!work) { + if (task->flags & PF_EXITING) + head = &work_exited; + else + break; + } } while (cmpxchg(&task->task_works, work, head) != work); - raw_spin_unlock_irq(&task->pi_lock); if (!work) break; + /* + * Synchronize with task_work_cancel(). It can not remove + * the first entry == work, cmpxchg(task_works) must fail. + * But it can remove another entry from the ->next list. + */ + raw_spin_lock_irq(&task->pi_lock); + raw_spin_unlock_irq(&task->pi_lock); do { next = work->next; -- cgit v1.2.3-71-gd317 From 70ed506c3bbcfa846d4636b23051ca79fa4781f7 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Mon, 2 Mar 2020 20:31:57 -0800 Subject: bpf: Introduce pinnable bpf_link abstraction Introduce bpf_link abstraction, representing an attachment of BPF program to a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates ownership of attached BPF program, reference counting of a link itself, when reference from multiple anonymous inodes, as well as ensures that release callback will be called from a process context, so that users can safely take mutex locks and sleep. Additionally, with a new abstraction it's now possible to generalize pinning of a link object in BPF FS, allowing to explicitly prevent BPF program detachment on process exit by pinning it in a BPF FS and let it open from independent other process to keep working with it. Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF program attachments) into utilizing bpf_link framework, making them pinnable in BPF FS. More FD-based bpf_links will be added in follow up patches. Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com --- include/linux/bpf.h | 13 +++ kernel/bpf/inode.c | 42 +++++++++- kernel/bpf/syscall.c | 223 +++++++++++++++++++++++++++++++++++++++++---------- 3 files changed, 232 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6015a4daf118..f13c78c6f29d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1056,6 +1056,19 @@ extern int sysctl_unprivileged_bpf_disabled; int bpf_map_new_fd(struct bpf_map *map, int flags); int bpf_prog_new_fd(struct bpf_prog *prog); +struct bpf_link; + +struct bpf_link_ops { + void (*release)(struct bpf_link *link); +}; + +void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, + struct bpf_prog *prog); +void bpf_link_inc(struct bpf_link *link); +void bpf_link_put(struct bpf_link *link); +int bpf_link_new_fd(struct bpf_link *link); +struct bpf_link *bpf_link_get_from_fd(u32 ufd); + int bpf_obj_pin_user(u32 ufd, const char __user *pathname); int bpf_obj_get_user(const char __user *pathname, int flags); diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 5e40e7fccc21..95087d9f4ed3 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -25,6 +25,7 @@ enum bpf_type { BPF_TYPE_UNSPEC = 0, BPF_TYPE_PROG, BPF_TYPE_MAP, + BPF_TYPE_LINK, }; static void *bpf_any_get(void *raw, enum bpf_type type) @@ -36,6 +37,9 @@ static void *bpf_any_get(void *raw, enum bpf_type type) case BPF_TYPE_MAP: bpf_map_inc_with_uref(raw); break; + case BPF_TYPE_LINK: + bpf_link_inc(raw); + break; default: WARN_ON_ONCE(1); break; @@ -53,6 +57,9 @@ static void bpf_any_put(void *raw, enum bpf_type type) case BPF_TYPE_MAP: bpf_map_put_with_uref(raw); break; + case BPF_TYPE_LINK: + bpf_link_put(raw); + break; default: WARN_ON_ONCE(1); break; @@ -63,20 +70,32 @@ static void *bpf_fd_probe_obj(u32 ufd, enum bpf_type *type) { void *raw; - *type = BPF_TYPE_MAP; raw = bpf_map_get_with_uref(ufd); - if (IS_ERR(raw)) { + if (!IS_ERR(raw)) { + *type = BPF_TYPE_MAP; + return raw; + } + + raw = bpf_prog_get(ufd); + if (!IS_ERR(raw)) { *type = BPF_TYPE_PROG; - raw = bpf_prog_get(ufd); + return raw; } - return raw; + raw = bpf_link_get_from_fd(ufd); + if (!IS_ERR(raw)) { + *type = BPF_TYPE_LINK; + return raw; + } + + return ERR_PTR(-EINVAL); } static const struct inode_operations bpf_dir_iops; static const struct inode_operations bpf_prog_iops = { }; static const struct inode_operations bpf_map_iops = { }; +static const struct inode_operations bpf_link_iops = { }; static struct inode *bpf_get_inode(struct super_block *sb, const struct inode *dir, @@ -114,6 +133,8 @@ static int bpf_inode_type(const struct inode *inode, enum bpf_type *type) *type = BPF_TYPE_PROG; else if (inode->i_op == &bpf_map_iops) *type = BPF_TYPE_MAP; + else if (inode->i_op == &bpf_link_iops) + *type = BPF_TYPE_LINK; else return -EACCES; @@ -335,6 +356,12 @@ static int bpf_mkmap(struct dentry *dentry, umode_t mode, void *arg) &bpffs_map_fops : &bpffs_obj_fops); } +static int bpf_mklink(struct dentry *dentry, umode_t mode, void *arg) +{ + return bpf_mkobj_ops(dentry, mode, arg, &bpf_link_iops, + &bpffs_obj_fops); +} + static struct dentry * bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags) { @@ -411,6 +438,9 @@ static int bpf_obj_do_pin(const char __user *pathname, void *raw, case BPF_TYPE_MAP: ret = vfs_mkobj(dentry, mode, bpf_mkmap, raw); break; + case BPF_TYPE_LINK: + ret = vfs_mkobj(dentry, mode, bpf_mklink, raw); + break; default: ret = -EPERM; } @@ -487,6 +517,8 @@ int bpf_obj_get_user(const char __user *pathname, int flags) ret = bpf_prog_new_fd(raw); else if (type == BPF_TYPE_MAP) ret = bpf_map_new_fd(raw, f_flags); + else if (type == BPF_TYPE_LINK) + ret = bpf_link_new_fd(raw); else return -ENOENT; @@ -504,6 +536,8 @@ static struct bpf_prog *__get_prog_inode(struct inode *inode, enum bpf_prog_type if (inode->i_op == &bpf_map_iops) return ERR_PTR(-EINVAL); + if (inode->i_op == &bpf_link_iops) + return ERR_PTR(-EINVAL); if (inode->i_op != &bpf_prog_iops) return ERR_PTR(-EACCES); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c536c65256ad..13de65363ba2 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2173,24 +2173,154 @@ static int bpf_obj_get(const union bpf_attr *attr) attr->file_flags); } -static int bpf_tracing_prog_release(struct inode *inode, struct file *filp) +struct bpf_link { + atomic64_t refcnt; + const struct bpf_link_ops *ops; + struct bpf_prog *prog; + struct work_struct work; +}; + +void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, + struct bpf_prog *prog) { - struct bpf_prog *prog = filp->private_data; + atomic64_set(&link->refcnt, 1); + link->ops = ops; + link->prog = prog; +} + +void bpf_link_inc(struct bpf_link *link) +{ + atomic64_inc(&link->refcnt); +} + +/* bpf_link_free is guaranteed to be called from process context */ +static void bpf_link_free(struct bpf_link *link) +{ + struct bpf_prog *prog; - WARN_ON_ONCE(bpf_trampoline_unlink_prog(prog)); + /* remember prog locally, because release below will free link memory */ + prog = link->prog; + /* extra clean up and kfree of container link struct */ + link->ops->release(link); + /* no more accesing of link members after this point */ bpf_prog_put(prog); +} + +static void bpf_link_put_deferred(struct work_struct *work) +{ + struct bpf_link *link = container_of(work, struct bpf_link, work); + + bpf_link_free(link); +} + +/* bpf_link_put can be called from atomic context, but ensures that resources + * are freed from process context + */ +void bpf_link_put(struct bpf_link *link) +{ + if (!atomic64_dec_and_test(&link->refcnt)) + return; + + if (in_atomic()) { + INIT_WORK(&link->work, bpf_link_put_deferred); + schedule_work(&link->work); + } else { + bpf_link_free(link); + } +} + +static int bpf_link_release(struct inode *inode, struct file *filp) +{ + struct bpf_link *link = filp->private_data; + + bpf_link_put(link); return 0; } -static const struct file_operations bpf_tracing_prog_fops = { - .release = bpf_tracing_prog_release, +#ifdef CONFIG_PROC_FS +static const struct bpf_link_ops bpf_raw_tp_lops; +static const struct bpf_link_ops bpf_tracing_link_lops; +static const struct bpf_link_ops bpf_xdp_link_lops; + +static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp) +{ + const struct bpf_link *link = filp->private_data; + const struct bpf_prog *prog = link->prog; + char prog_tag[sizeof(prog->tag) * 2 + 1] = { }; + const char *link_type; + + if (link->ops == &bpf_raw_tp_lops) + link_type = "raw_tracepoint"; + else if (link->ops == &bpf_tracing_link_lops) + link_type = "tracing"; + else + link_type = "unknown"; + + bin2hex(prog_tag, prog->tag, sizeof(prog->tag)); + seq_printf(m, + "link_type:\t%s\n" + "prog_tag:\t%s\n" + "prog_id:\t%u\n", + link_type, + prog_tag, + prog->aux->id); +} +#endif + +const struct file_operations bpf_link_fops = { +#ifdef CONFIG_PROC_FS + .show_fdinfo = bpf_link_show_fdinfo, +#endif + .release = bpf_link_release, .read = bpf_dummy_read, .write = bpf_dummy_write, }; +int bpf_link_new_fd(struct bpf_link *link) +{ + return anon_inode_getfd("bpf-link", &bpf_link_fops, link, O_CLOEXEC); +} + +struct bpf_link *bpf_link_get_from_fd(u32 ufd) +{ + struct fd f = fdget(ufd); + struct bpf_link *link; + + if (!f.file) + return ERR_PTR(-EBADF); + if (f.file->f_op != &bpf_link_fops) { + fdput(f); + return ERR_PTR(-EINVAL); + } + + link = f.file->private_data; + bpf_link_inc(link); + fdput(f); + + return link; +} + +struct bpf_tracing_link { + struct bpf_link link; +}; + +static void bpf_tracing_link_release(struct bpf_link *link) +{ + struct bpf_tracing_link *tr_link = + container_of(link, struct bpf_tracing_link, link); + + WARN_ON_ONCE(bpf_trampoline_unlink_prog(link->prog)); + kfree(tr_link); +} + +static const struct bpf_link_ops bpf_tracing_link_lops = { + .release = bpf_tracing_link_release, +}; + static int bpf_tracing_prog_attach(struct bpf_prog *prog) { - int tr_fd, err; + struct bpf_tracing_link *link; + int link_fd, err; if (prog->expected_attach_type != BPF_TRACE_FENTRY && prog->expected_attach_type != BPF_TRACE_FEXIT && @@ -2199,58 +2329,61 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog) goto out_put_prog; } + link = kzalloc(sizeof(*link), GFP_USER); + if (!link) { + err = -ENOMEM; + goto out_put_prog; + } + bpf_link_init(&link->link, &bpf_tracing_link_lops, prog); + err = bpf_trampoline_link_prog(prog); if (err) - goto out_put_prog; + goto out_free_link; - tr_fd = anon_inode_getfd("bpf-tracing-prog", &bpf_tracing_prog_fops, - prog, O_CLOEXEC); - if (tr_fd < 0) { + link_fd = bpf_link_new_fd(&link->link); + if (link_fd < 0) { WARN_ON_ONCE(bpf_trampoline_unlink_prog(prog)); - err = tr_fd; - goto out_put_prog; + err = link_fd; + goto out_free_link; } - return tr_fd; + return link_fd; +out_free_link: + kfree(link); out_put_prog: bpf_prog_put(prog); return err; } -struct bpf_raw_tracepoint { +struct bpf_raw_tp_link { + struct bpf_link link; struct bpf_raw_event_map *btp; - struct bpf_prog *prog; }; -static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp) +static void bpf_raw_tp_link_release(struct bpf_link *link) { - struct bpf_raw_tracepoint *raw_tp = filp->private_data; + struct bpf_raw_tp_link *raw_tp = + container_of(link, struct bpf_raw_tp_link, link); - if (raw_tp->prog) { - bpf_probe_unregister(raw_tp->btp, raw_tp->prog); - bpf_prog_put(raw_tp->prog); - } + bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog); bpf_put_raw_tracepoint(raw_tp->btp); kfree(raw_tp); - return 0; } -static const struct file_operations bpf_raw_tp_fops = { - .release = bpf_raw_tracepoint_release, - .read = bpf_dummy_read, - .write = bpf_dummy_write, +static const struct bpf_link_ops bpf_raw_tp_lops = { + .release = bpf_raw_tp_link_release, }; #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd static int bpf_raw_tracepoint_open(const union bpf_attr *attr) { - struct bpf_raw_tracepoint *raw_tp; + struct bpf_raw_tp_link *raw_tp; struct bpf_raw_event_map *btp; struct bpf_prog *prog; const char *tp_name; char buf[128]; - int tp_fd, err; + int link_fd, err; if (CHECK_ATTR(BPF_RAW_TRACEPOINT_OPEN)) return -EINVAL; @@ -2302,21 +2435,20 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) err = -ENOMEM; goto out_put_btp; } + bpf_link_init(&raw_tp->link, &bpf_raw_tp_lops, prog); raw_tp->btp = btp; - raw_tp->prog = prog; err = bpf_probe_register(raw_tp->btp, prog); if (err) goto out_free_tp; - tp_fd = anon_inode_getfd("bpf-raw-tracepoint", &bpf_raw_tp_fops, raw_tp, - O_CLOEXEC); - if (tp_fd < 0) { + link_fd = bpf_link_new_fd(&raw_tp->link); + if (link_fd < 0) { bpf_probe_unregister(raw_tp->btp, prog); - err = tp_fd; + err = link_fd; goto out_free_tp; } - return tp_fd; + return link_fd; out_free_tp: kfree(raw_tp); @@ -3266,15 +3398,21 @@ static int bpf_task_fd_query(const union bpf_attr *attr, if (err) goto out; - if (file->f_op == &bpf_raw_tp_fops) { - struct bpf_raw_tracepoint *raw_tp = file->private_data; - struct bpf_raw_event_map *btp = raw_tp->btp; + if (file->f_op == &bpf_link_fops) { + struct bpf_link *link = file->private_data; - err = bpf_task_fd_query_copy(attr, uattr, - raw_tp->prog->aux->id, - BPF_FD_TYPE_RAW_TRACEPOINT, - btp->tp->name, 0, 0); - goto put_file; + if (link->ops == &bpf_raw_tp_lops) { + struct bpf_raw_tp_link *raw_tp = + container_of(link, struct bpf_raw_tp_link, link); + struct bpf_raw_event_map *btp = raw_tp->btp; + + err = bpf_task_fd_query_copy(attr, uattr, + raw_tp->link.prog->aux->id, + BPF_FD_TYPE_RAW_TRACEPOINT, + btp->tp->name, 0, 0); + goto put_file; + } + goto out_not_supp; } event = perf_get_event(file); @@ -3294,6 +3432,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr, goto put_file; } +out_not_supp: err = -ENOTSUPP; put_file: fput(file); -- cgit v1.2.3-71-gd317 From b396bfdebffcc05a855137775e38f4652cbca454 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 12 Feb 2020 12:21:03 -0500 Subject: tracing: Have hwlat ts be first instance and record count of instances The hwlat tracer runs a loop of width time during a given window. It then reports the max latency over a given threshold and records a timestamp. But this timestamp is the time after the width has finished, and not the time it actually triggered. Record the actual time when the latency was greater than the threshold as well as the number of times it was greater in a given width per window. Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/ftrace.rst | 32 ++++++++++++++++++++++---------- kernel/trace/trace_entries.h | 4 +++- kernel/trace/trace_hwlat.c | 24 +++++++++++++++++------- kernel/trace/trace_output.c | 4 ++-- 4 files changed, 44 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index ff658e27d25b..99a0890e20ec 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -2126,6 +2126,8 @@ periodically make a CPU constantly busy with interrupts disabled. # cat trace # tracer: hwlat # + # entries-in-buffer/entries-written: 13/13 #P:8 + # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq @@ -2133,12 +2135,18 @@ periodically make a CPU constantly busy with interrupts disabled. # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | - <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940 - <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365 - <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062 - <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633 - <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961 - <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1 + <...>-1729 [001] d... 678.473449: #1 inner/outer(us): 11/12 ts:1581527483.343962693 count:6 + <...>-1729 [004] d... 689.556542: #2 inner/outer(us): 16/9 ts:1581527494.889008092 count:1 + <...>-1729 [005] d... 714.756290: #3 inner/outer(us): 16/16 ts:1581527519.678961629 count:5 + <...>-1729 [001] d... 718.788247: #4 inner/outer(us): 9/17 ts:1581527523.889012713 count:1 + <...>-1729 [002] d... 719.796341: #5 inner/outer(us): 13/9 ts:1581527524.912872606 count:1 + <...>-1729 [006] d... 844.787091: #6 inner/outer(us): 9/12 ts:1581527649.889048502 count:2 + <...>-1729 [003] d... 849.827033: #7 inner/outer(us): 18/9 ts:1581527654.889013793 count:1 + <...>-1729 [007] d... 853.859002: #8 inner/outer(us): 9/12 ts:1581527658.889065736 count:1 + <...>-1729 [001] d... 855.874978: #9 inner/outer(us): 9/11 ts:1581527660.861991877 count:1 + <...>-1729 [001] d... 863.938932: #10 inner/outer(us): 9/11 ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1 + <...>-1729 [007] d... 878.050780: #11 inner/outer(us): 9/12 ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1 + <...>-1729 [007] d... 886.114702: #12 inner/outer(us): 9/12 ts:1581527691.385001600 count:1 The above output is somewhat the same in the header. All events will have @@ -2148,7 +2156,7 @@ interrupts disabled 'd'. Under the FUNCTION title there is: This is the count of events recorded that were greater than the tracing_threshold (See below). - inner/outer(us): 12/14 + inner/outer(us): 11/11 This shows two numbers as "inner latency" and "outer latency". The test runs in a loop checking a timestamp twice. The latency detected within @@ -2156,11 +2164,15 @@ interrupts disabled 'd'. Under the FUNCTION title there is: after the previous timestamp and the next timestamp in the loop is the "outer latency". - ts:1499801089.066141940 + ts:1581527483.343962693 + + The absolute timestamp that the first latency was recorded in the window. + + count:6 - The absolute timestamp that the event happened. + The number of times a latency was detected during the window. - nmi-total:4 nmi-count:1 + nmi-total:7 nmi-count:1 On architectures that support it, if an NMI comes in during the test, the time spent in NMI is reported in "nmi-total" (in diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h index f22746f3c132..a523da0dae0a 100644 --- a/kernel/trace/trace_entries.h +++ b/kernel/trace/trace_entries.h @@ -325,14 +325,16 @@ FTRACE_ENTRY(hwlat, hwlat_entry, __field_desc( long, timestamp, tv_nsec ) __field( unsigned int, nmi_count ) __field( unsigned int, seqnum ) + __field( unsigned int, count ) ), - F_printk("cnt:%u\tts:%010llu.%010lu\tinner:%llu\touter:%llu\tnmi-ts:%llu\tnmi-count:%u\n", + F_printk("cnt:%u\tts:%010llu.%010lu\tinner:%llu\touter:%llu\tcount:%d\tnmi-ts:%llu\tnmi-count:%u\n", __entry->seqnum, __entry->tv_sec, __entry->tv_nsec, __entry->duration, __entry->outer_duration, + __entry->count, __entry->nmi_total_ts, __entry->nmi_count) ); diff --git a/kernel/trace/trace_hwlat.c b/kernel/trace/trace_hwlat.c index a48808c43249..e2be7bb7ef7e 100644 --- a/kernel/trace/trace_hwlat.c +++ b/kernel/trace/trace_hwlat.c @@ -83,6 +83,7 @@ struct hwlat_sample { u64 nmi_total_ts; /* Total time spent in NMIs */ struct timespec64 timestamp; /* wall time */ int nmi_count; /* # NMIs during this sample */ + int count; /* # of iteratons over threash */ }; /* keep the global state somewhere. */ @@ -124,6 +125,7 @@ static void trace_hwlat_sample(struct hwlat_sample *sample) entry->timestamp = sample->timestamp; entry->nmi_total_ts = sample->nmi_total_ts; entry->nmi_count = sample->nmi_count; + entry->count = sample->count; if (!call_filter_check_discard(call, entry, buffer, event)) trace_buffer_unlock_commit_nostack(buffer, event); @@ -167,12 +169,14 @@ void trace_hwlat_callback(bool enter) static int get_sample(void) { struct trace_array *tr = hwlat_trace; + struct hwlat_sample s; time_type start, t1, t2, last_t2; - s64 diff, total, last_total = 0; + s64 diff, outer_diff, total, last_total = 0; u64 sample = 0; u64 thresh = tracing_thresh; u64 outer_sample = 0; int ret = -1; + unsigned int count = 0; do_div(thresh, NSEC_PER_USEC); /* modifies interval value */ @@ -186,6 +190,7 @@ static int get_sample(void) init_time(last_t2, 0); start = time_get(); /* start timestamp */ + outer_diff = 0; do { @@ -194,14 +199,14 @@ static int get_sample(void) if (time_u64(last_t2)) { /* Check the delta from outer loop (t2 to next t1) */ - diff = time_to_us(time_sub(t1, last_t2)); + outer_diff = time_to_us(time_sub(t1, last_t2)); /* This shouldn't happen */ - if (diff < 0) { + if (outer_diff < 0) { pr_err(BANNER "time running backwards\n"); goto out; } - if (diff > outer_sample) - outer_sample = diff; + if (outer_diff > outer_sample) + outer_sample = outer_diff; } last_t2 = t2; @@ -217,6 +222,12 @@ static int get_sample(void) /* This checks the inner loop (t1 to t2) */ diff = time_to_us(time_sub(t2, t1)); /* current diff */ + if (diff > thresh || outer_diff > thresh) { + if (!count) + ktime_get_real_ts64(&s.timestamp); + count++; + } + /* This shouldn't happen */ if (diff < 0) { pr_err(BANNER "time running backwards\n"); @@ -236,7 +247,6 @@ static int get_sample(void) /* If we exceed the threshold value, we have found a hardware latency */ if (sample > thresh || outer_sample > thresh) { - struct hwlat_sample s; u64 latency; ret = 1; @@ -249,9 +259,9 @@ static int get_sample(void) s.seqnum = hwlat_data.count; s.duration = sample; s.outer_duration = outer_sample; - ktime_get_real_ts64(&s.timestamp); s.nmi_total_ts = nmi_total_ts; s.nmi_count = nmi_count; + s.count = count; trace_hwlat_sample(&s); latency = max(sample, outer_sample); diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index b4909082f6a4..e25a7da79c6b 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -1158,12 +1158,12 @@ trace_hwlat_print(struct trace_iterator *iter, int flags, trace_assign_type(field, entry); - trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%lld.%09ld", + trace_seq_printf(s, "#%-5u inner/outer(us): %4llu/%-5llu ts:%lld.%09ld count:%d", field->seqnum, field->duration, field->outer_duration, (long long)field->timestamp.tv_sec, - field->timestamp.tv_nsec); + field->timestamp.tv_nsec, field->count); if (field->nmi_count) { /* -- cgit v1.2.3-71-gd317 From 5412e0b763e0c46165fa353f695fa3b68a66ac91 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 14 Feb 2020 16:20:04 -0500 Subject: tracing: Remove unused TRACE_BUFFER bits Commit 567cd4da54ff ("ring-buffer: User context bit recursion checking") added the TRACE_BUFFER bits to be used in the current task's trace_recursion field. But the final submission of the logic removed the use of those bits, but never removed the bits themselves (they were never used in upstream Linux). These can be safely removed. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.h | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 99372dd7d168..c61e1b1c85a6 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -557,12 +557,7 @@ struct tracer { * caller, and we can skip the current check. */ enum { - TRACE_BUFFER_BIT, - TRACE_BUFFER_NMI_BIT, - TRACE_BUFFER_IRQ_BIT, - TRACE_BUFFER_SIRQ_BIT, - - /* Start of function recursion bits */ + /* Function recursion bits */ TRACE_FTRACE_BIT, TRACE_FTRACE_NMI_BIT, TRACE_FTRACE_IRQ_BIT, -- cgit v1.2.3-71-gd317 From a534e924c58d2e7c07509521e87d059dd029dca1 Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Tue, 25 Feb 2020 20:58:13 -0500 Subject: PM: QoS: annotate data races in pm_qos_*_value() The target_value field in struct pm_qos_constraints is used for lockless access to the effective constraint value of a given QoS list, so the readers of it cannot expect it to always reflect the most recent effective constraint value. However, they can and do expect it to be equal to a valid effective constraint value computed at a certain time in the past (event though it may not be the most recent one), so add READ|WRITE_ONCE() annotations around the target_value accesses to prevent the compiler from possibly causing that expectation to be unmet by generating code in an exceptionally convoluted way. Signed-off-by: Qian Cai [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki --- kernel/power/qos.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/power/qos.c b/kernel/power/qos.c index 32927682bcc4..db0bed2cae26 100644 --- a/kernel/power/qos.c +++ b/kernel/power/qos.c @@ -52,7 +52,7 @@ static DEFINE_SPINLOCK(pm_qos_lock); */ s32 pm_qos_read_value(struct pm_qos_constraints *c) { - return c->target_value; + return READ_ONCE(c->target_value); } static int pm_qos_get_value(struct pm_qos_constraints *c) @@ -75,7 +75,7 @@ static int pm_qos_get_value(struct pm_qos_constraints *c) static void pm_qos_set_value(struct pm_qos_constraints *c, s32 value) { - c->target_value = value; + WRITE_ONCE(c->target_value, value); } /** -- cgit v1.2.3-71-gd317 From 55e8c8eb2c7b6bf30e99423ccfe7ca032f498f59 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 28 Feb 2020 11:11:06 -0600 Subject: posix-cpu-timers: Store a reference to a pid not a task posix cpu timers do not handle the death of a process well. This is most clearly seen when a multi-threaded process calls exec from a thread that is not the leader of the thread group. The posix cpu timer code continues to pin the old thread group leader and is unable to find the siglock from there. This results in posix_cpu_timer_del being unable to delete a timer, posix_cpu_timer_set being unable to set a timer. Further to compensate for the problems in posix_cpu_timer_del on a multi-threaded exec all timers that point at the multi-threaded task are stopped. The code for the timers fundamentally needs to check if the target process/thread is alive. This needs an extra level of indirection. This level of indirection is already available in struct pid. So replace cpu.task with cpu.pid to get the needed extra layer of indirection. In addition to handling things more cleanly this reduces the amount of memory a timer can pin when a process exits and then is reaped from a task_struct to the vastly smaller struct pid. Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec") Signed-off-by: "Eric W. Biederman" Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/87wo86tz6d.fsf@x220.int.ebiederm.org --- include/linux/posix-timers.h | 2 +- kernel/time/posix-cpu-timers.c | 73 +++++++++++++++++++++++++++++++----------- 2 files changed, 56 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h index 3d10c84a97a9..e3f0f8585da4 100644 --- a/include/linux/posix-timers.h +++ b/include/linux/posix-timers.h @@ -69,7 +69,7 @@ static inline int clockid_to_fd(const clockid_t clk) struct cpu_timer { struct timerqueue_node node; struct timerqueue_head *head; - struct task_struct *task; + struct pid *pid; struct list_head elist; int firing; }; diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index ef936c5a910b..6df468a622fe 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -118,6 +118,16 @@ static inline int validate_clock_permissions(const clockid_t clock) return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL; } +static inline enum pid_type cpu_timer_pid_type(struct k_itimer *timer) +{ + return CPUCLOCK_PERTHREAD(timer->it_clock) ? PIDTYPE_PID : PIDTYPE_TGID; +} + +static inline struct task_struct *cpu_timer_task_rcu(struct k_itimer *timer) +{ + return pid_task(timer->it.cpu.pid, cpu_timer_pid_type(timer)); +} + /* * Update expiry time from increment, and increase overrun count, * given the current clock sample. @@ -391,7 +401,12 @@ static int posix_cpu_timer_create(struct k_itimer *new_timer) new_timer->kclock = &clock_posix_cpu; timerqueue_init(&new_timer->it.cpu.node); - new_timer->it.cpu.task = p; + new_timer->it.cpu.pid = get_task_pid(p, cpu_timer_pid_type(new_timer)); + /* + * get_task_for_clock() took a reference on @p. Drop it as the timer + * holds a reference on the pid of @p. + */ + put_task_struct(p); return 0; } @@ -404,13 +419,15 @@ static int posix_cpu_timer_create(struct k_itimer *new_timer) static int posix_cpu_timer_del(struct k_itimer *timer) { struct cpu_timer *ctmr = &timer->it.cpu; - struct task_struct *p = ctmr->task; struct sighand_struct *sighand; + struct task_struct *p; unsigned long flags; int ret = 0; - if (WARN_ON_ONCE(!p)) - return -EINVAL; + rcu_read_lock(); + p = cpu_timer_task_rcu(timer); + if (!p) + goto out; /* * Protect against sighand release/switch in exit/exec and process/ @@ -432,8 +449,10 @@ static int posix_cpu_timer_del(struct k_itimer *timer) unlock_task_sighand(p, &flags); } +out: + rcu_read_unlock(); if (!ret) - put_task_struct(p); + put_pid(ctmr->pid); return ret; } @@ -561,13 +580,21 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, clockid_t clkid = CPUCLOCK_WHICH(timer->it_clock); u64 old_expires, new_expires, old_incr, val; struct cpu_timer *ctmr = &timer->it.cpu; - struct task_struct *p = ctmr->task; struct sighand_struct *sighand; + struct task_struct *p; unsigned long flags; int ret = 0; - if (WARN_ON_ONCE(!p)) - return -EINVAL; + rcu_read_lock(); + p = cpu_timer_task_rcu(timer); + if (!p) { + /* + * If p has just been reaped, we can no + * longer get any information about it at all. + */ + rcu_read_unlock(); + return -ESRCH; + } /* * Use the to_ktime conversion because that clamps the maximum @@ -584,8 +611,10 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, * If p has just been reaped, we can no * longer get any information about it at all. */ - if (unlikely(sighand == NULL)) + if (unlikely(sighand == NULL)) { + rcu_read_unlock(); return -ESRCH; + } /* * Disarm any old timer after extracting its expiry time. @@ -690,6 +719,7 @@ static int posix_cpu_timer_set(struct k_itimer *timer, int timer_flags, ret = 0; out: + rcu_read_unlock(); if (old) old->it_interval = ns_to_timespec64(old_incr); @@ -701,10 +731,12 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec64 *itp clockid_t clkid = CPUCLOCK_WHICH(timer->it_clock); struct cpu_timer *ctmr = &timer->it.cpu; u64 now, expires = cpu_timer_getexpires(ctmr); - struct task_struct *p = ctmr->task; + struct task_struct *p; - if (WARN_ON_ONCE(!p)) - return; + rcu_read_lock(); + p = cpu_timer_task_rcu(timer); + if (!p) + goto out; /* * Easy part: convert the reload time. @@ -712,7 +744,7 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec64 *itp itp->it_interval = ktime_to_timespec64(timer->it_interval); if (!expires) - return; + goto out; /* * Sample the clock to take the difference with the expiry time. @@ -732,6 +764,8 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec64 *itp itp->it_value.tv_nsec = 1; itp->it_value.tv_sec = 0; } +out: + rcu_read_unlock(); } #define MAX_COLLECTED 20 @@ -952,14 +986,15 @@ static void check_process_timers(struct task_struct *tsk, static void posix_cpu_timer_rearm(struct k_itimer *timer) { clockid_t clkid = CPUCLOCK_WHICH(timer->it_clock); - struct cpu_timer *ctmr = &timer->it.cpu; - struct task_struct *p = ctmr->task; + struct task_struct *p; struct sighand_struct *sighand; unsigned long flags; u64 now; - if (WARN_ON_ONCE(!p)) - return; + rcu_read_lock(); + p = cpu_timer_task_rcu(timer); + if (!p) + goto out; /* * Fetch the current sample and update the timer's expiry time. @@ -974,13 +1009,15 @@ static void posix_cpu_timer_rearm(struct k_itimer *timer) /* Protect timer list r/w in arm_timer() */ sighand = lock_task_sighand(p, &flags); if (unlikely(sighand == NULL)) - return; + goto out; /* * Now re-arm for the new expiry time. */ arm_timer(timer, p); unlock_task_sighand(p, &flags); +out: + rcu_read_unlock(); } /** -- cgit v1.2.3-71-gd317 From b95e31c07c5eb4f5c0769f12b38b0343d7961040 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 28 Feb 2020 11:15:03 -0600 Subject: posix-cpu-timers: Stop disabling timers on mt-exec The reasons why the extra posix_cpu_timers_exit_group() invocation has been added are not entirely clear from the commit message. Today all that posix_cpu_timers_exit_group() does is stop timers that are tracking the task from firing. Every other operation on those timers is still allowed. The practical implication of this is posix_cpu_timer_del() which could not get the siglock after the thread group leader has exited (because sighand == NULL) would be able to run successfully because the timer was already dequeued. With that locking issue fixed there is no point in disabling all of the timers. So remove this ``tempoary'' hack. Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec") Signed-off-by: "Eric W. Biederman" Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org --- kernel/exit.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/exit.c b/kernel/exit.c index 2833ffb0c211..df546315f699 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -103,17 +103,8 @@ static void __exit_signal(struct task_struct *tsk) #ifdef CONFIG_POSIX_TIMERS posix_cpu_timers_exit(tsk); - if (group_dead) { + if (group_dead) posix_cpu_timers_exit_group(tsk); - } else { - /* - * This can only happen if the caller is de_thread(). - * FIXME: this is the temporary hack, we should teach - * posix-cpu-timers to handle this case correctly. - */ - if (unlikely(has_group_leader_pid(tsk))) - posix_cpu_timers_exit_group(tsk); - } #endif if (group_dead) { -- cgit v1.2.3-71-gd317 From 4cbbc3a0eeed675449b1a4d080008927121f3da3 Mon Sep 17 00:00:00 2001 From: Wen Yang Date: Mon, 20 Jan 2020 18:05:23 +0800 Subject: timekeeping: Prevent 32bit truncation in scale64_check_overflow() While unlikely the divisor in scale64_check_overflow() could be >= 32bit in scale64_check_overflow(). do_div() truncates the divisor to 32bit at least on 32bit platforms. Use div64_u64() instead to avoid the truncation to 32-bit. [ tglx: Massaged changelog ] Signed-off-by: Wen Yang Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com --- kernel/time/timekeeping.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index ca69290bee2a..4fc2af4367a7 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1005,9 +1005,8 @@ static int scale64_check_overflow(u64 mult, u64 div, u64 *base) ((int)sizeof(u64)*8 - fls64(mult) < fls64(rem))) return -EOVERFLOW; tmp *= mult; - rem *= mult; - do_div(rem, div); + rem = div64_u64(rem * mult, div); *base = tmp + rem; return 0; } -- cgit v1.2.3-71-gd317 From 38f7b0b1316d435f38ec3f2bb078897b7a1cfdea Mon Sep 17 00:00:00 2001 From: Wen Yang Date: Thu, 30 Jan 2020 21:08:51 +0800 Subject: hrtimer: Cast explicitely to u32t in __ktime_divns() do_div() does a 64-by-32 division at least on 32bit platforms, while the divisor 'div' is explicitly casted to unsigned long, thus 64-bit on 64-bit platforms. The code already ensures that the divisor is less than 2^32. Hence the proper cast type is u32. Signed-off-by: Wen Yang Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200130130851.29204-1-wenyang@linux.alibaba.com --- kernel/time/hrtimer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 3a609e7344f3..d74fdcd3ced4 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -311,7 +311,7 @@ s64 __ktime_divns(const ktime_t kt, s64 div) div >>= 1; } tmp >>= sft; - do_div(tmp, (unsigned long) div); + do_div(tmp, (u32) div); return dclc < 0 ? -tmp : tmp; } EXPORT_SYMBOL_GPL(__ktime_divns); -- cgit v1.2.3-71-gd317 From d441dceb5dce71150f28add80d36d91bbfccba99 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Fri, 7 Feb 2020 14:39:29 -0500 Subject: tick/common: Make tick_periodic() check for missing ticks The tick_periodic() function is used at the beginning part of the bootup process for time keeping while the other clock sources are being initialized. The current code assumes that all the timer interrupts are handled in a timely manner with no missing ticks. That is not actually true. Some ticks are missed and there are some discrepancies between the tick time (jiffies) and the timestamp reported in the kernel log. Some systems, however, are more prone to missing ticks than the others. In the extreme case, the discrepancy can actually cause a soft lockup message to be printed by the watchdog kthread. For example, on a Cavium ThunderX2 Sabre arm64 system: [ 25.496379] watchdog: BUG: soft lockup - CPU#14 stuck for 22s! On that system, the missing ticks are especially prevalent during the smp_init() phase of the boot process. With an instrumented kernel, it was found that it took about 24s as reported by the timestamp for the tick to accumulate 4s of time. Investigation and bisection done by others seemed to point to the commit 73f381660959 ("arm64: Advertise mitigation of Spectre-v2, or lack thereof") as the culprit. It could also be a firmware issue as new firmware was promised that would fix the issue. To properly address this problem, stop assuming that there will be no missing tick in tick_periodic(). Modify it to follow the example of tick_do_update_jiffies64() by using another reference clock to check for missing ticks. Since the watchdog timer uses running_clock(), it is used here as the reference. With this applied, the soft lockup problem in the affected arm64 system is gone and tick time tracks much more closely to the timestamp time. Signed-off-by: Waiman Long Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200207193929.27308-1-longman@redhat.com --- kernel/time/tick-common.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 7e5d3524e924..cce4ed1515c7 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -84,12 +85,41 @@ int tick_is_oneshot_available(void) static void tick_periodic(int cpu) { if (tick_do_timer_cpu == cpu) { + /* + * Use running_clock() as reference to check for missing ticks. + */ + static ktime_t last_update; + ktime_t now; + int ticks = 1; + + now = ns_to_ktime(running_clock()); write_seqlock(&jiffies_lock); - /* Keep track of the next tick event */ - tick_next_period = ktime_add(tick_next_period, tick_period); + if (last_update) { + u64 delta = ktime_sub(now, last_update); - do_timer(1); + /* + * Check for eventually missed ticks + * + * There is likely a persistent delta between + * last_update and tick_next_period. So they are + * updated separately. + */ + if (delta >= 2 * tick_period) { + s64 period = ktime_to_ns(tick_period); + + ticks = ktime_divns(delta, period); + } + last_update = ktime_add(last_update, + ticks * tick_period); + } else { + last_update = now; + } + + /* Keep track of the next tick event */ + tick_next_period = ktime_add(tick_next_period, + ticks * tick_period); + do_timer(ticks); write_sequnlock(&jiffies_lock); update_wall_time(); } -- cgit v1.2.3-71-gd317 From 2333e829952fb437db915bbb17f4d8c43127d438 Mon Sep 17 00:00:00 2001 From: Yu Chen Date: Sun, 23 Feb 2020 15:28:52 +0800 Subject: workqueue: Make workqueue_init*() return void The return values of workqueue_init() and workqueue_early_int() are always 0, and there is no usage of their return value. So just make them return void. Signed-off-by: Yu Chen Signed-off-by: Tejun Heo --- include/linux/workqueue.h | 4 ++-- kernel/workqueue.c | 8 ++------ 2 files changed, 4 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index 4261d1c6e87b..c86a7691e13c 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -649,7 +649,7 @@ int workqueue_online_cpu(unsigned int cpu); int workqueue_offline_cpu(unsigned int cpu); #endif -int __init workqueue_init_early(void); -int __init workqueue_init(void); +void __init workqueue_init_early(void); +void __init workqueue_init(void); #endif diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 301db4406bc3..5afa9ad45eba 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -5896,7 +5896,7 @@ static void __init wq_numa_init(void) * items. Actual work item execution starts only after kthreads can be * created and scheduled right before early initcalls. */ -int __init workqueue_init_early(void) +void __init workqueue_init_early(void) { int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL }; int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; @@ -5963,8 +5963,6 @@ int __init workqueue_init_early(void) !system_unbound_wq || !system_freezable_wq || !system_power_efficient_wq || !system_freezable_power_efficient_wq); - - return 0; } /** @@ -5976,7 +5974,7 @@ int __init workqueue_init_early(void) * are no kworkers executing the work items yet. Populate the worker pools * with the initial workers and enable future kworker creations. */ -int __init workqueue_init(void) +void __init workqueue_init(void) { struct workqueue_struct *wq; struct worker_pool *pool; @@ -6023,6 +6021,4 @@ int __init workqueue_init(void) wq_online = true; wq_watchdog_init(); - - return 0; } -- cgit v1.2.3-71-gd317 From 88fd9e5352fe05f7fe57778293aebd4cd106960b Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:47 +0100 Subject: bpf: Refactor trampoline update code As we need to introduce a third type of attachment for trampolines, the flattened signature of arch_prepare_bpf_trampoline gets even more complicated. Refactor the prog and count argument to arch_prepare_bpf_trampoline to use bpf_tramp_progs to simplify the addition and accounting for new attachment types. Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org --- arch/x86/net/bpf_jit_comp.c | 31 ++++++++++++----------- include/linux/bpf.h | 13 ++++++++-- kernel/bpf/bpf_struct_ops.c | 10 +++++++- kernel/bpf/trampoline.c | 62 +++++++++++++++++++++++++-------------------- 4 files changed, 71 insertions(+), 45 deletions(-) (limited to 'kernel') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 9ba08e9abc09..15c7d28bc05c 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1362,12 +1362,12 @@ static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args, } static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, - struct bpf_prog **progs, int prog_cnt, int stack_size) + struct bpf_tramp_progs *tp, int stack_size) { u8 *prog = *pprog; int cnt = 0, i; - for (i = 0; i < prog_cnt; i++) { + for (i = 0; i < tp->nr_progs; i++) { if (emit_call(&prog, __bpf_prog_enter, prog)) return -EINVAL; /* remember prog start time returned by __bpf_prog_enter */ @@ -1376,17 +1376,17 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, /* arg1: lea rdi, [rbp - stack_size] */ EMIT4(0x48, 0x8D, 0x7D, -stack_size); /* arg2: progs[i]->insnsi for interpreter */ - if (!progs[i]->jited) + if (!tp->progs[i]->jited) emit_mov_imm64(&prog, BPF_REG_2, - (long) progs[i]->insnsi >> 32, - (u32) (long) progs[i]->insnsi); + (long) tp->progs[i]->insnsi >> 32, + (u32) (long) tp->progs[i]->insnsi); /* call JITed bpf program or interpreter */ - if (emit_call(&prog, progs[i]->bpf_func, prog)) + if (emit_call(&prog, tp->progs[i]->bpf_func, prog)) return -EINVAL; /* arg1: mov rdi, progs[i] */ - emit_mov_imm64(&prog, BPF_REG_1, (long) progs[i] >> 32, - (u32) (long) progs[i]); + emit_mov_imm64(&prog, BPF_REG_1, (long) tp->progs[i] >> 32, + (u32) (long) tp->progs[i]); /* arg2: mov rsi, rbx <- start time in nsec */ emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); if (emit_call(&prog, __bpf_prog_exit, prog)) @@ -1458,12 +1458,13 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, */ int arch_prepare_bpf_trampoline(void *image, void *image_end, const struct btf_func_model *m, u32 flags, - struct bpf_prog **fentry_progs, int fentry_cnt, - struct bpf_prog **fexit_progs, int fexit_cnt, + struct bpf_tramp_progs *tprogs, void *orig_call) { int cnt = 0, nr_args = m->nr_args; int stack_size = nr_args * 8; + struct bpf_tramp_progs *fentry = &tprogs[BPF_TRAMP_FENTRY]; + struct bpf_tramp_progs *fexit = &tprogs[BPF_TRAMP_FEXIT]; u8 *prog; /* x86-64 supports up to 6 arguments. 7+ can be added in the future */ @@ -1492,12 +1493,12 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, save_regs(m, &prog, nr_args, stack_size); - if (fentry_cnt) - if (invoke_bpf(m, &prog, fentry_progs, fentry_cnt, stack_size)) + if (fentry->nr_progs) + if (invoke_bpf(m, &prog, fentry, stack_size)) return -EINVAL; if (flags & BPF_TRAMP_F_CALL_ORIG) { - if (fentry_cnt) + if (fentry->nr_progs) restore_regs(m, &prog, nr_args, stack_size); /* call original function */ @@ -1507,8 +1508,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); } - if (fexit_cnt) - if (invoke_bpf(m, &prog, fexit_progs, fexit_cnt, stack_size)) + if (fexit->nr_progs) + if (invoke_bpf(m, &prog, fexit, stack_size)) return -EINVAL; if (flags & BPF_TRAMP_F_RESTORE_REGS) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f13c78c6f29d..98ec10b23dbb 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -433,6 +433,16 @@ struct btf_func_model { */ #define BPF_TRAMP_F_SKIP_FRAME BIT(2) +/* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50 + * bytes on x86. Pick a number to fit into BPF_IMAGE_SIZE / 2 + */ +#define BPF_MAX_TRAMP_PROGS 40 + +struct bpf_tramp_progs { + struct bpf_prog *progs[BPF_MAX_TRAMP_PROGS]; + int nr_progs; +}; + /* Different use cases for BPF trampoline: * 1. replace nop at the function entry (kprobe equivalent) * flags = BPF_TRAMP_F_RESTORE_REGS @@ -455,8 +465,7 @@ struct btf_func_model { */ int arch_prepare_bpf_trampoline(void *image, void *image_end, const struct btf_func_model *m, u32 flags, - struct bpf_prog **fentry_progs, int fentry_cnt, - struct bpf_prog **fexit_progs, int fexit_cnt, + struct bpf_tramp_progs *tprogs, void *orig_call); /* these two functions are called from generated trampoline */ u64 notrace __bpf_prog_enter(void); diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index c498f0fffb40..ca5cc8cdb6eb 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -320,6 +320,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, struct bpf_struct_ops_value *uvalue, *kvalue; const struct btf_member *member; const struct btf_type *t = st_ops->type; + struct bpf_tramp_progs *tprogs = NULL; void *udata, *kdata; int prog_fd, err = 0; void *image; @@ -343,6 +344,10 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, if (uvalue->state || refcount_read(&uvalue->refcnt)) return -EINVAL; + tprogs = kcalloc(BPF_TRAMP_MAX, sizeof(*tprogs), GFP_KERNEL); + if (!tprogs) + return -ENOMEM; + uvalue = (struct bpf_struct_ops_value *)st_map->uvalue; kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue; @@ -425,10 +430,12 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, goto reset_unlock; } + tprogs[BPF_TRAMP_FENTRY].progs[0] = prog; + tprogs[BPF_TRAMP_FENTRY].nr_progs = 1; err = arch_prepare_bpf_trampoline(image, st_map->image + PAGE_SIZE, &st_ops->func_models[i], 0, - &prog, 1, NULL, 0, NULL); + tprogs, NULL); if (err < 0) goto reset_unlock; @@ -469,6 +476,7 @@ reset_unlock: memset(uvalue, 0, map->value_size); memset(kvalue, 0, map->value_size); unlock: + kfree(tprogs); mutex_unlock(&st_map->lock); return err; } diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 704fa787fec0..546198f6f307 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -190,40 +190,49 @@ static int register_fentry(struct bpf_trampoline *tr, void *new_addr) return ret; } -/* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50 - * bytes on x86. Pick a number to fit into BPF_IMAGE_SIZE / 2 - */ -#define BPF_MAX_TRAMP_PROGS 40 +static struct bpf_tramp_progs * +bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total) +{ + const struct bpf_prog_aux *aux; + struct bpf_tramp_progs *tprogs; + struct bpf_prog **progs; + int kind; + + *total = 0; + tprogs = kcalloc(BPF_TRAMP_MAX, sizeof(*tprogs), GFP_KERNEL); + if (!tprogs) + return ERR_PTR(-ENOMEM); + + for (kind = 0; kind < BPF_TRAMP_MAX; kind++) { + tprogs[kind].nr_progs = tr->progs_cnt[kind]; + *total += tr->progs_cnt[kind]; + progs = tprogs[kind].progs; + + hlist_for_each_entry(aux, &tr->progs_hlist[kind], tramp_hlist) + *progs++ = aux->prog; + } + return tprogs; +} static int bpf_trampoline_update(struct bpf_trampoline *tr) { void *old_image = tr->image + ((tr->selector + 1) & 1) * BPF_IMAGE_SIZE/2; void *new_image = tr->image + (tr->selector & 1) * BPF_IMAGE_SIZE/2; - struct bpf_prog *progs_to_run[BPF_MAX_TRAMP_PROGS]; - int fentry_cnt = tr->progs_cnt[BPF_TRAMP_FENTRY]; - int fexit_cnt = tr->progs_cnt[BPF_TRAMP_FEXIT]; - struct bpf_prog **progs, **fentry, **fexit; + struct bpf_tramp_progs *tprogs; u32 flags = BPF_TRAMP_F_RESTORE_REGS; - struct bpf_prog_aux *aux; - int err; + int err, total; - if (fentry_cnt + fexit_cnt == 0) { + tprogs = bpf_trampoline_get_progs(tr, &total); + if (IS_ERR(tprogs)) + return PTR_ERR(tprogs); + + if (total == 0) { err = unregister_fentry(tr, old_image); tr->selector = 0; goto out; } - /* populate fentry progs */ - fentry = progs = progs_to_run; - hlist_for_each_entry(aux, &tr->progs_hlist[BPF_TRAMP_FENTRY], tramp_hlist) - *progs++ = aux->prog; - - /* populate fexit progs */ - fexit = progs; - hlist_for_each_entry(aux, &tr->progs_hlist[BPF_TRAMP_FEXIT], tramp_hlist) - *progs++ = aux->prog; - - if (fexit_cnt) + if (tprogs[BPF_TRAMP_FEXIT].nr_progs) flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME; /* Though the second half of trampoline page is unused a task could be @@ -232,12 +241,11 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) * preempted task. Hence wait for tasks to voluntarily schedule or go * to userspace. */ + synchronize_rcu_tasks(); err = arch_prepare_bpf_trampoline(new_image, new_image + BPF_IMAGE_SIZE / 2, - &tr->func.model, flags, - fentry, fentry_cnt, - fexit, fexit_cnt, + &tr->func.model, flags, tprogs, tr->func.addr); if (err < 0) goto out; @@ -252,6 +260,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) goto out; tr->selector++; out: + kfree(tprogs); return err; } @@ -409,8 +418,7 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) int __weak arch_prepare_bpf_trampoline(void *image, void *image_end, const struct btf_func_model *m, u32 flags, - struct bpf_prog **fentry_progs, int fentry_cnt, - struct bpf_prog **fexit_progs, int fexit_cnt, + struct bpf_tramp_progs *tprogs, void *orig_call) { return -ENOTSUPP; -- cgit v1.2.3-71-gd317 From ae24082331d9bbaae283aafbe930a8f0eb85605a Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:49 +0100 Subject: bpf: Introduce BPF_MODIFY_RETURN When multiple programs are attached, each program receives the return value from the previous program on the stack and the last program provides the return value to the attached function. The fmod_ret bpf programs are run after the fentry programs and before the fexit programs. The original function is only called if all the fmod_ret programs return 0 to avoid any unintended side-effects. The success value, i.e. 0 is not currently configurable but can be made so where user-space can specify it at load time. For example: int func_to_be_attached(int a, int b) { <--- do_fentry do_fmod_ret: if (ret != 0) goto do_fexit; original_function: } <--- do_fexit The fmod_ret program attached to this function can be defined as: SEC("fmod_ret/func_to_be_attached") int BPF_PROG(func_name, int a, int b, int ret) { // This will skip the original function logic. return 1; } The first fmod_ret program is passed 0 in its return argument. Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org --- arch/x86/net/bpf_jit_comp.c | 130 +++++++++++++++++++++++++++++++++++++---- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 1 + kernel/bpf/btf.c | 3 +- kernel/bpf/syscall.c | 1 + kernel/bpf/trampoline.c | 5 +- kernel/bpf/verifier.c | 1 + tools/include/uapi/linux/bpf.h | 1 + 8 files changed, 130 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index d6349e930b06..b1fd000feb89 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1362,7 +1362,7 @@ static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args, } static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, - struct bpf_prog *p, int stack_size) + struct bpf_prog *p, int stack_size, bool mod_ret) { u8 *prog = *pprog; int cnt = 0; @@ -1383,6 +1383,13 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, if (emit_call(&prog, p->bpf_func, prog)) return -EINVAL; + /* BPF_TRAMP_MODIFY_RETURN trampolines can modify the return + * of the previous call which is then passed on the stack to + * the next BPF program. + */ + if (mod_ret) + emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); + /* arg1: mov rdi, progs[i] */ emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, (u32) (long) p); @@ -1442,6 +1449,23 @@ static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) return 0; } +static int emit_mod_ret_check_imm8(u8 **pprog, int value) +{ + u8 *prog = *pprog; + int cnt = 0; + + if (!is_imm8(value)) + return -EINVAL; + + if (value == 0) + EMIT2(0x85, add_2reg(0xC0, BPF_REG_0, BPF_REG_0)); + else + EMIT3(0x83, add_1reg(0xF8, BPF_REG_0), value); + + *pprog = prog; + return 0; +} + static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, struct bpf_tramp_progs *tp, int stack_size) { @@ -1449,9 +1473,49 @@ static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, u8 *prog = *pprog; for (i = 0; i < tp->nr_progs; i++) { - if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size)) + if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, false)) + return -EINVAL; + } + *pprog = prog; + return 0; +} + +static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog, + struct bpf_tramp_progs *tp, int stack_size, + u8 **branches) +{ + u8 *prog = *pprog; + int i; + + /* The first fmod_ret program will receive a garbage return value. + * Set this to 0 to avoid confusing the program. + */ + emit_mov_imm32(&prog, false, BPF_REG_0, 0); + emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); + for (i = 0; i < tp->nr_progs; i++) { + if (invoke_bpf_prog(m, &prog, tp->progs[i], stack_size, true)) return -EINVAL; + + /* Generate a branch: + * + * if (ret != 0) + * goto do_fexit; + * + * If needed this can be extended to any integer value which can + * be passed by user-space when the program is loaded. + */ + if (emit_mod_ret_check_imm8(&prog, 0)) + return -EINVAL; + + /* Save the location of the branch and Generate 6 nops + * (4 bytes for an offset and 2 bytes for the jump) These nops + * are replaced with a conditional jump once do_fexit (i.e. the + * start of the fexit invocation) is finalized. + */ + branches[i] = prog; + emit_nops(&prog, 4 + 2); } + *pprog = prog; return 0; } @@ -1521,10 +1585,12 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, struct bpf_tramp_progs *tprogs, void *orig_call) { - int cnt = 0, nr_args = m->nr_args; + int ret, i, cnt = 0, nr_args = m->nr_args; int stack_size = nr_args * 8; struct bpf_tramp_progs *fentry = &tprogs[BPF_TRAMP_FENTRY]; struct bpf_tramp_progs *fexit = &tprogs[BPF_TRAMP_FEXIT]; + struct bpf_tramp_progs *fmod_ret = &tprogs[BPF_TRAMP_MODIFY_RETURN]; + u8 **branches = NULL; u8 *prog; /* x86-64 supports up to 6 arguments. 7+ can be added in the future */ @@ -1557,24 +1623,60 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, if (invoke_bpf(m, &prog, fentry, stack_size)) return -EINVAL; + if (fmod_ret->nr_progs) { + branches = kcalloc(fmod_ret->nr_progs, sizeof(u8 *), + GFP_KERNEL); + if (!branches) + return -ENOMEM; + + if (invoke_bpf_mod_ret(m, &prog, fmod_ret, stack_size, + branches)) { + ret = -EINVAL; + goto cleanup; + } + } + if (flags & BPF_TRAMP_F_CALL_ORIG) { - if (fentry->nr_progs) + if (fentry->nr_progs || fmod_ret->nr_progs) restore_regs(m, &prog, nr_args, stack_size); /* call original function */ - if (emit_call(&prog, orig_call, prog)) - return -EINVAL; + if (emit_call(&prog, orig_call, prog)) { + ret = -EINVAL; + goto cleanup; + } /* remember return value in a stack for bpf prog to access */ emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); } + if (fmod_ret->nr_progs) { + /* From Intel 64 and IA-32 Architectures Optimization + * Reference Manual, 3.4.1.4 Code Alignment, Assembly/Compiler + * Coding Rule 11: All branch targets should be 16-byte + * aligned. + */ + emit_align(&prog, 16); + /* Update the branches saved in invoke_bpf_mod_ret with the + * aligned address of do_fexit. + */ + for (i = 0; i < fmod_ret->nr_progs; i++) + emit_cond_near_jump(&branches[i], prog, branches[i], + X86_JNE); + } + if (fexit->nr_progs) - if (invoke_bpf(m, &prog, fexit, stack_size)) - return -EINVAL; + if (invoke_bpf(m, &prog, fexit, stack_size)) { + ret = -EINVAL; + goto cleanup; + } if (flags & BPF_TRAMP_F_RESTORE_REGS) restore_regs(m, &prog, nr_args, stack_size); + /* This needs to be done regardless. If there were fmod_ret programs, + * the return value is only updated on the stack and still needs to be + * restored to R0. + */ if (flags & BPF_TRAMP_F_CALL_ORIG) /* restore original return value back into RAX */ emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8); @@ -1586,9 +1688,15 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */ EMIT1(0xC3); /* ret */ /* Make sure the trampoline generation logic doesn't overflow */ - if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY)) - return -EFAULT; - return prog - (u8 *)image; + if (WARN_ON_ONCE(prog > (u8 *)image_end - BPF_INSN_SAFETY)) { + ret = -EFAULT; + goto cleanup; + } + ret = prog - (u8 *)image; + +cleanup: + kfree(branches); + return ret; } static int emit_fallback_jump(u8 **pprog) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 98ec10b23dbb..f748b31e5888 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -474,6 +474,7 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start); enum bpf_tramp_prog_type { BPF_TRAMP_FENTRY, BPF_TRAMP_FEXIT, + BPF_TRAMP_MODIFY_RETURN, BPF_TRAMP_MAX, BPF_TRAMP_REPLACE, /* more than MAX */ }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index d6b33ea27bcc..40b2d9476268 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -210,6 +210,7 @@ enum bpf_attach_type { BPF_TRACE_RAW_TP, BPF_TRACE_FENTRY, BPF_TRACE_FEXIT, + BPF_MODIFY_RETURN, __MAX_BPF_ATTACH_TYPE }; diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 787140095e58..30841fb8b3c0 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3710,7 +3710,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, nr_args--; } - if (prog->expected_attach_type == BPF_TRACE_FEXIT && + if ((prog->expected_attach_type == BPF_TRACE_FEXIT || + prog->expected_attach_type == BPF_MODIFY_RETURN) && arg == nr_args) { if (!t) /* Default prog with 5 args. 6th arg is retval. */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 13de65363ba2..7ce0815793dd 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2324,6 +2324,7 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog) if (prog->expected_attach_type != BPF_TRACE_FENTRY && prog->expected_attach_type != BPF_TRACE_FEXIT && + prog->expected_attach_type != BPF_MODIFY_RETURN && prog->type != BPF_PROG_TYPE_EXT) { err = -EINVAL; goto out_put_prog; diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 546198f6f307..221a17af1f81 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -232,7 +232,8 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) goto out; } - if (tprogs[BPF_TRAMP_FEXIT].nr_progs) + if (tprogs[BPF_TRAMP_FEXIT].nr_progs || + tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs) flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME; /* Though the second half of trampoline page is unused a task could be @@ -269,6 +270,8 @@ static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(enum bpf_attach_type t) switch (t) { case BPF_TRACE_FENTRY: return BPF_TRAMP_FENTRY; + case BPF_MODIFY_RETURN: + return BPF_TRAMP_MODIFY_RETURN; case BPF_TRACE_FEXIT: return BPF_TRAMP_FEXIT; default: diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 289383edfc8c..2460c8e6b5be 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9950,6 +9950,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) if (!prog_extension) return -EINVAL; /* fallthrough */ + case BPF_MODIFY_RETURN: case BPF_TRACE_FENTRY: case BPF_TRACE_FEXIT: if (!btf_type_is_func(t)) { diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index d6b33ea27bcc..40b2d9476268 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -210,6 +210,7 @@ enum bpf_attach_type { BPF_TRACE_RAW_TP, BPF_TRACE_FENTRY, BPF_TRACE_FEXIT, + BPF_MODIFY_RETURN, __MAX_BPF_ATTACH_TYPE }; -- cgit v1.2.3-71-gd317 From 6ba43b761c41349140662e223401bec0e48950e7 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:50 +0100 Subject: bpf: Attachment verification for BPF_MODIFY_RETURN - Allow BPF_MODIFY_RETURN attachment only to functions that are: * Whitelisted for error injection by checking within_error_injection_list. Similar discussions happened for the bpf_override_return helper. * security hooks, this is expected to be cleaned up with the LSM changes after the KRSI patches introduce the LSM_HOOK macro: https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/ - The attachment is currently limited to functions that return an int. This can be extended later other types (e.g. PTR). Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-5-kpsingh@chromium.org --- kernel/bpf/btf.c | 28 ++++++++++++++++++++-------- kernel/bpf/verifier.c | 31 +++++++++++++++++++++++++++++++ 2 files changed, 51 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 30841fb8b3c0..50080add2ab9 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3710,14 +3710,26 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, nr_args--; } - if ((prog->expected_attach_type == BPF_TRACE_FEXIT || - prog->expected_attach_type == BPF_MODIFY_RETURN) && - arg == nr_args) { - if (!t) - /* Default prog with 5 args. 6th arg is retval. */ - return true; - /* function return type */ - t = btf_type_by_id(btf, t->type); + if (arg == nr_args) { + if (prog->expected_attach_type == BPF_TRACE_FEXIT) { + if (!t) + return true; + t = btf_type_by_id(btf, t->type); + } else if (prog->expected_attach_type == BPF_MODIFY_RETURN) { + /* For now the BPF_MODIFY_RETURN can only be attached to + * functions that return an int. + */ + if (!t) + return false; + + t = btf_type_skip_modifiers(btf, t->type, NULL); + if (!btf_type_is_int(t)) { + bpf_log(log, + "ret type %s not allowed for fmod_ret\n", + btf_kind_str[BTF_INFO_KIND(t->info)]); + return false; + } + } } else if (arg >= nr_args) { bpf_log(log, "func '%s' doesn't have %d-th argument\n", tname, arg + 1); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2460c8e6b5be..ae32517d4ccd 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "disasm.h" @@ -9800,6 +9801,33 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) return 0; } +#define SECURITY_PREFIX "security_" + +static int check_attach_modify_return(struct bpf_verifier_env *env) +{ + struct bpf_prog *prog = env->prog; + unsigned long addr = (unsigned long) prog->aux->trampoline->func.addr; + + if (within_error_injection_list(addr)) + return 0; + + /* This is expected to be cleaned up in the future with the KRSI effort + * introducing the LSM_HOOK macro for cleaning up lsm_hooks.h. + */ + if (!strncmp(SECURITY_PREFIX, prog->aux->attach_func_name, + sizeof(SECURITY_PREFIX) - 1)) { + + if (!capable(CAP_MAC_ADMIN)) + return -EPERM; + + return 0; + } + + verbose(env, "fmod_ret attach_btf_id %u (%s) is not modifiable\n", + prog->aux->attach_btf_id, prog->aux->attach_func_name); + + return -EINVAL; +} static int check_attach_btf_id(struct bpf_verifier_env *env) { @@ -10000,6 +10028,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) } tr->func.addr = (void *)addr; prog->aux->trampoline = tr; + + if (prog->expected_attach_type == BPF_MODIFY_RETURN) + ret = check_attach_modify_return(env); out: mutex_unlock(&tr->mutex); if (ret) -- cgit v1.2.3-71-gd317 From da00d2f117a08fbca262db5ea422c80a568b112b Mon Sep 17 00:00:00 2001 From: KP Singh Date: Wed, 4 Mar 2020 20:18:52 +0100 Subject: bpf: Add test ops for BPF_PROG_TYPE_TRACING The current fexit and fentry tests rely on a different program to exercise the functions they attach to. Instead of doing this, implement the test operations for tracing which will also be used for BPF_MODIFY_RETURN in a subsequent patch. Also, clean up the fexit test to use the generated skeleton. Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Acked-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org --- include/linux/bpf.h | 10 ++++ kernel/trace/bpf_trace.c | 1 + net/bpf/test_run.c | 37 +++++++++--- .../selftests/bpf/prog_tests/fentry_fexit.c | 12 +--- .../testing/selftests/bpf/prog_tests/fentry_test.c | 14 ++--- .../testing/selftests/bpf/prog_tests/fexit_test.c | 69 +++++++--------------- 6 files changed, 67 insertions(+), 76 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f748b31e5888..40c53924571d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1156,6 +1156,9 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); +int bpf_prog_test_run_tracing(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr); int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); @@ -1313,6 +1316,13 @@ static inline int bpf_prog_test_run_skb(struct bpf_prog *prog, return -ENOTSUPP; } +static inline int bpf_prog_test_run_tracing(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; +} + static inline int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 07764c761073..363e0a2c75cf 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1266,6 +1266,7 @@ const struct bpf_verifier_ops tracing_verifier_ops = { }; const struct bpf_prog_ops tracing_prog_ops = { + .test_run = bpf_prog_test_run_tracing, }; static bool raw_tp_writable_prog_is_valid_access(int off, int size, diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 1cd7a1c2f8b2..3600f098e7c6 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -160,18 +160,37 @@ static void *bpf_test_init(const union bpf_attr *kattr, u32 size, kfree(data); return ERR_PTR(-EFAULT); } - if (bpf_fentry_test1(1) != 2 || - bpf_fentry_test2(2, 3) != 5 || - bpf_fentry_test3(4, 5, 6) != 15 || - bpf_fentry_test4((void *)7, 8, 9, 10) != 34 || - bpf_fentry_test5(11, (void *)12, 13, 14, 15) != 65 || - bpf_fentry_test6(16, (void *)17, 18, 19, (void *)20, 21) != 111) { - kfree(data); - return ERR_PTR(-EFAULT); - } + return data; } +int bpf_prog_test_run_tracing(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + int err = -EFAULT; + + switch (prog->expected_attach_type) { + case BPF_TRACE_FENTRY: + case BPF_TRACE_FEXIT: + if (bpf_fentry_test1(1) != 2 || + bpf_fentry_test2(2, 3) != 5 || + bpf_fentry_test3(4, 5, 6) != 15 || + bpf_fentry_test4((void *)7, 8, 9, 10) != 34 || + bpf_fentry_test5(11, (void *)12, 13, 14, 15) != 65 || + bpf_fentry_test6(16, (void *)17, 18, 19, (void *)20, 21) != 111) + goto out; + break; + default: + goto out; + } + + err = 0; +out: + trace_bpf_test_finish(&err); + return err; +} + static void *bpf_ctx_init(const union bpf_attr *kattr, u32 max_size) { void __user *data_in = u64_to_user_ptr(kattr->test.ctx_in); diff --git a/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c b/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c index 235ac4f67f5b..83493bd5745c 100644 --- a/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c +++ b/tools/testing/selftests/bpf/prog_tests/fentry_fexit.c @@ -1,22 +1,17 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include -#include "test_pkt_access.skel.h" #include "fentry_test.skel.h" #include "fexit_test.skel.h" void test_fentry_fexit(void) { - struct test_pkt_access *pkt_skel = NULL; struct fentry_test *fentry_skel = NULL; struct fexit_test *fexit_skel = NULL; __u64 *fentry_res, *fexit_res; __u32 duration = 0, retval; - int err, pkt_fd, i; + int err, prog_fd, i; - pkt_skel = test_pkt_access__open_and_load(); - if (CHECK(!pkt_skel, "pkt_skel_load", "pkt_access skeleton failed\n")) - return; fentry_skel = fentry_test__open_and_load(); if (CHECK(!fentry_skel, "fentry_skel_load", "fentry skeleton failed\n")) goto close_prog; @@ -31,8 +26,8 @@ void test_fentry_fexit(void) if (CHECK(err, "fexit_attach", "fexit attach failed: %d\n", err)) goto close_prog; - pkt_fd = bpf_program__fd(pkt_skel->progs.test_pkt_access); - err = bpf_prog_test_run(pkt_fd, 1, &pkt_v6, sizeof(pkt_v6), + prog_fd = bpf_program__fd(fexit_skel->progs.test1); + err = bpf_prog_test_run(prog_fd, 1, NULL, 0, NULL, NULL, &retval, &duration); CHECK(err || retval, "ipv6", "err %d errno %d retval %d duration %d\n", @@ -49,7 +44,6 @@ void test_fentry_fexit(void) } close_prog: - test_pkt_access__destroy(pkt_skel); fentry_test__destroy(fentry_skel); fexit_test__destroy(fexit_skel); } diff --git a/tools/testing/selftests/bpf/prog_tests/fentry_test.c b/tools/testing/selftests/bpf/prog_tests/fentry_test.c index 5cc06021f27d..04ebbf1cb390 100644 --- a/tools/testing/selftests/bpf/prog_tests/fentry_test.c +++ b/tools/testing/selftests/bpf/prog_tests/fentry_test.c @@ -1,20 +1,15 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include -#include "test_pkt_access.skel.h" #include "fentry_test.skel.h" void test_fentry_test(void) { - struct test_pkt_access *pkt_skel = NULL; struct fentry_test *fentry_skel = NULL; - int err, pkt_fd, i; + int err, prog_fd, i; __u32 duration = 0, retval; __u64 *result; - pkt_skel = test_pkt_access__open_and_load(); - if (CHECK(!pkt_skel, "pkt_skel_load", "pkt_access skeleton failed\n")) - return; fentry_skel = fentry_test__open_and_load(); if (CHECK(!fentry_skel, "fentry_skel_load", "fentry skeleton failed\n")) goto cleanup; @@ -23,10 +18,10 @@ void test_fentry_test(void) if (CHECK(err, "fentry_attach", "fentry attach failed: %d\n", err)) goto cleanup; - pkt_fd = bpf_program__fd(pkt_skel->progs.test_pkt_access); - err = bpf_prog_test_run(pkt_fd, 1, &pkt_v6, sizeof(pkt_v6), + prog_fd = bpf_program__fd(fentry_skel->progs.test1); + err = bpf_prog_test_run(prog_fd, 1, NULL, 0, NULL, NULL, &retval, &duration); - CHECK(err || retval, "ipv6", + CHECK(err || retval, "test_run", "err %d errno %d retval %d duration %d\n", err, errno, retval, duration); @@ -39,5 +34,4 @@ void test_fentry_test(void) cleanup: fentry_test__destroy(fentry_skel); - test_pkt_access__destroy(pkt_skel); } diff --git a/tools/testing/selftests/bpf/prog_tests/fexit_test.c b/tools/testing/selftests/bpf/prog_tests/fexit_test.c index d2c3655dd7a3..78d7a2765c27 100644 --- a/tools/testing/selftests/bpf/prog_tests/fexit_test.c +++ b/tools/testing/selftests/bpf/prog_tests/fexit_test.c @@ -1,64 +1,37 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include +#include "fexit_test.skel.h" void test_fexit_test(void) { - struct bpf_prog_load_attr attr = { - .file = "./fexit_test.o", - }; - - char prog_name[] = "fexit/bpf_fentry_testX"; - struct bpf_object *obj = NULL, *pkt_obj; - int err, pkt_fd, kfree_skb_fd, i; - struct bpf_link *link[6] = {}; - struct bpf_program *prog[6]; + struct fexit_test *fexit_skel = NULL; + int err, prog_fd, i; __u32 duration = 0, retval; - struct bpf_map *data_map; - const int zero = 0; - u64 result[6]; + __u64 *result; - err = bpf_prog_load("./test_pkt_access.o", BPF_PROG_TYPE_SCHED_CLS, - &pkt_obj, &pkt_fd); - if (CHECK(err, "prog_load sched cls", "err %d errno %d\n", err, errno)) - return; - err = bpf_prog_load_xattr(&attr, &obj, &kfree_skb_fd); - if (CHECK(err, "prog_load fail", "err %d errno %d\n", err, errno)) - goto close_prog; + fexit_skel = fexit_test__open_and_load(); + if (CHECK(!fexit_skel, "fexit_skel_load", "fexit skeleton failed\n")) + goto cleanup; - for (i = 0; i < 6; i++) { - prog_name[sizeof(prog_name) - 2] = '1' + i; - prog[i] = bpf_object__find_program_by_title(obj, prog_name); - if (CHECK(!prog[i], "find_prog", "prog %s not found\n", prog_name)) - goto close_prog; - link[i] = bpf_program__attach_trace(prog[i]); - if (CHECK(IS_ERR(link[i]), "attach_trace", "failed to link\n")) - goto close_prog; - } - data_map = bpf_object__find_map_by_name(obj, "fexit_te.bss"); - if (CHECK(!data_map, "find_data_map", "data map not found\n")) - goto close_prog; + err = fexit_test__attach(fexit_skel); + if (CHECK(err, "fexit_attach", "fexit attach failed: %d\n", err)) + goto cleanup; - err = bpf_prog_test_run(pkt_fd, 1, &pkt_v6, sizeof(pkt_v6), + prog_fd = bpf_program__fd(fexit_skel->progs.test1); + err = bpf_prog_test_run(prog_fd, 1, NULL, 0, NULL, NULL, &retval, &duration); - CHECK(err || retval, "ipv6", + CHECK(err || retval, "test_run", "err %d errno %d retval %d duration %d\n", err, errno, retval, duration); - err = bpf_map_lookup_elem(bpf_map__fd(data_map), &zero, &result); - if (CHECK(err, "get_result", - "failed to get output data: %d\n", err)) - goto close_prog; - - for (i = 0; i < 6; i++) - if (CHECK(result[i] != 1, "result", "bpf_fentry_test%d failed err %ld\n", - i + 1, result[i])) - goto close_prog; + result = (__u64 *)fexit_skel->bss; + for (i = 0; i < 6; i++) { + if (CHECK(result[i] != 1, "result", + "fexit_test%d failed err %lld\n", i + 1, result[i])) + goto cleanup; + } -close_prog: - for (i = 0; i < 6; i++) - if (!IS_ERR_OR_NULL(link[i])) - bpf_link__destroy(link[i]); - bpf_object__close(obj); - bpf_object__close(pkt_obj); +cleanup: + fexit_test__destroy(fexit_skel); } -- cgit v1.2.3-71-gd317 From 51891498f2da78ee64dfad88fa53c9e85fb50abf Mon Sep 17 00:00:00 2001 From: Tycho Andersen Date: Wed, 4 Mar 2020 11:05:17 -0700 Subject: seccomp: allow TSYNC and USER_NOTIF together The restriction introduced in 7a0df7fbc145 ("seccomp: Make NEW_LISTENER and TSYNC flags exclusive") is mostly artificial: there is enough information in a seccomp user notification to tell which thread triggered a notification. The reason it was introduced is because TSYNC makes the syscall return a thread-id on failure, and NEW_LISTENER returns an fd, and there's no way to distinguish between these two cases (well, I suppose the caller could check all fds it has, then do the syscall, and if the return value was an fd that already existed, then it must be a thread id, but bleh). Matthew would like to use these two flags together in the Chrome sandbox which wants to use TSYNC for video drivers and NEW_LISTENER to proxy syscalls. So, let's fix this ugliness by adding another flag, TSYNC_ESRCH, which tells the kernel to just return -ESRCH on a TSYNC error. This way, NEW_LISTENER (and any subsequent seccomp() commands that want to return positive values) don't conflict with each other. Suggested-by: Matthew Denton Signed-off-by: Tycho Andersen Link: https://lore.kernel.org/r/20200304180517.23867-1-tycho@tycho.ws Signed-off-by: Kees Cook --- include/linux/seccomp.h | 3 +- include/uapi/linux/seccomp.h | 1 + kernel/seccomp.c | 14 +++-- tools/testing/selftests/seccomp/seccomp_bpf.c | 74 ++++++++++++++++++++++++++- 4 files changed, 86 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h index 03583b6d1416..4192369b8418 100644 --- a/include/linux/seccomp.h +++ b/include/linux/seccomp.h @@ -7,7 +7,8 @@ #define SECCOMP_FILTER_FLAG_MASK (SECCOMP_FILTER_FLAG_TSYNC | \ SECCOMP_FILTER_FLAG_LOG | \ SECCOMP_FILTER_FLAG_SPEC_ALLOW | \ - SECCOMP_FILTER_FLAG_NEW_LISTENER) + SECCOMP_FILTER_FLAG_NEW_LISTENER | \ + SECCOMP_FILTER_FLAG_TSYNC_ESRCH) #ifdef CONFIG_SECCOMP diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index be84d87f1f46..c1735455bc53 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -22,6 +22,7 @@ #define SECCOMP_FILTER_FLAG_LOG (1UL << 1) #define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2) #define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3) +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) /* * All BPF programs must return a 32-bit value. diff --git a/kernel/seccomp.c b/kernel/seccomp.c index b6ea3dcb57bf..29022c1bbe18 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -528,8 +528,12 @@ static long seccomp_attach_filter(unsigned int flags, int ret; ret = seccomp_can_sync_threads(); - if (ret) - return ret; + if (ret) { + if (flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) + return -ESRCH; + else + return ret; + } } /* Set log flag, if present. */ @@ -1288,10 +1292,12 @@ static long seccomp_set_mode_filter(unsigned int flags, * In the successful case, NEW_LISTENER returns the new listener fd. * But in the failure case, TSYNC returns the thread that died. If you * combine these two flags, there's no way to tell whether something - * succeeded or failed. So, let's disallow this combination. + * succeeded or failed. So, let's disallow this combination if the user + * has not explicitly requested no errors from TSYNC. */ if ((flags & SECCOMP_FILTER_FLAG_TSYNC) && - (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER)) + (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) && + ((flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH) == 0)) return -EINVAL; /* Prepare the new filter before holding any locks. */ diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index ee1b727ede04..a9ad3bd8b2ad 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -212,6 +212,10 @@ struct seccomp_notif_sizes { #define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001 #endif +#ifndef SECCOMP_FILTER_FLAG_TSYNC_ESRCH +#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4) +#endif + #ifndef seccomp int seccomp(unsigned int op, unsigned int flags, void *args) { @@ -2187,7 +2191,8 @@ TEST(detect_seccomp_filter_flags) unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC, SECCOMP_FILTER_FLAG_LOG, SECCOMP_FILTER_FLAG_SPEC_ALLOW, - SECCOMP_FILTER_FLAG_NEW_LISTENER }; + SECCOMP_FILTER_FLAG_NEW_LISTENER, + SECCOMP_FILTER_FLAG_TSYNC_ESRCH }; unsigned int exclusive[] = { SECCOMP_FILTER_FLAG_TSYNC, SECCOMP_FILTER_FLAG_NEW_LISTENER }; @@ -2645,6 +2650,55 @@ TEST_F(TSYNC, two_siblings_with_one_divergence) EXPECT_EQ(SIBLING_EXIT_UNKILLED, (long)status); } +TEST_F(TSYNC, two_siblings_with_one_divergence_no_tid_in_err) +{ + long ret, flags; + void *status; + + ASSERT_EQ(0, prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) { + TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!"); + } + + ret = seccomp(SECCOMP_SET_MODE_FILTER, 0, &self->root_prog); + ASSERT_NE(ENOSYS, errno) { + TH_LOG("Kernel does not support seccomp syscall!"); + } + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support SECCOMP_SET_MODE_FILTER!"); + } + self->sibling[0].diverge = 1; + tsync_start_sibling(&self->sibling[0]); + tsync_start_sibling(&self->sibling[1]); + + while (self->sibling_count < TSYNC_SIBLINGS) { + sem_wait(&self->started); + self->sibling_count++; + } + + flags = SECCOMP_FILTER_FLAG_TSYNC | \ + SECCOMP_FILTER_FLAG_TSYNC_ESRCH; + ret = seccomp(SECCOMP_SET_MODE_FILTER, flags, &self->apply_prog); + ASSERT_EQ(ESRCH, errno) { + TH_LOG("Did not return ESRCH for diverged sibling."); + } + ASSERT_EQ(-1, ret) { + TH_LOG("Did not fail on diverged sibling."); + } + + /* Wake the threads */ + pthread_mutex_lock(&self->mutex); + ASSERT_EQ(0, pthread_cond_broadcast(&self->cond)) { + TH_LOG("cond broadcast non-zero"); + } + pthread_mutex_unlock(&self->mutex); + + /* Ensure they are both unkilled. */ + PTHREAD_JOIN(self->sibling[0].tid, &status); + EXPECT_EQ(SIBLING_EXIT_UNKILLED, (long)status); + PTHREAD_JOIN(self->sibling[1].tid, &status); + EXPECT_EQ(SIBLING_EXIT_UNKILLED, (long)status); +} + TEST_F(TSYNC, two_siblings_not_under_filter) { long ret, sib; @@ -3196,6 +3250,24 @@ TEST(user_notification_basic) EXPECT_EQ(0, WEXITSTATUS(status)); } +TEST(user_notification_with_tsync) +{ + int ret; + unsigned int flags; + + /* these were exclusive */ + flags = SECCOMP_FILTER_FLAG_NEW_LISTENER | + SECCOMP_FILTER_FLAG_TSYNC; + ASSERT_EQ(-1, user_trap_syscall(__NR_getppid, flags)); + ASSERT_EQ(EINVAL, errno); + + /* but now they're not */ + flags |= SECCOMP_FILTER_FLAG_TSYNC_ESRCH; + ret = user_trap_syscall(__NR_getppid, flags); + close(ret); + ASSERT_LE(0, ret); +} + TEST(user_notification_kill_in_middle) { pid_t pid; -- cgit v1.2.3-71-gd317 From 69191754ff299a64575003d9e2a79db190d27115 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Thu, 5 Mar 2020 21:49:55 +0100 Subject: bpf: Remove unnecessary CAP_MAC_ADMIN check While well intentioned, checking CAP_MAC_ADMIN for attaching BPF_MODIFY_RETURN tracing programs to "security_" functions is not necessary as tracing BPF programs already require CAP_SYS_ADMIN. Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200305204955.31123-1-kpsingh@chromium.org --- kernel/bpf/verifier.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ae32517d4ccd..55d376c53f7d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -9808,20 +9808,13 @@ static int check_attach_modify_return(struct bpf_verifier_env *env) struct bpf_prog *prog = env->prog; unsigned long addr = (unsigned long) prog->aux->trampoline->func.addr; - if (within_error_injection_list(addr)) - return 0; - /* This is expected to be cleaned up in the future with the KRSI effort * introducing the LSM_HOOK macro for cleaning up lsm_hooks.h. */ - if (!strncmp(SECURITY_PREFIX, prog->aux->attach_func_name, - sizeof(SECURITY_PREFIX) - 1)) { - - if (!capable(CAP_MAC_ADMIN)) - return -EPERM; - + if (within_error_injection_list(addr) || + !strncmp(SECURITY_PREFIX, prog->aux->attach_func_name, + sizeof(SECURITY_PREFIX) - 1)) return 0; - } verbose(env, "fmod_ret attach_btf_id %u (%s) is not modifiable\n", prog->aux->attach_btf_id, prog->aux->attach_func_name); -- cgit v1.2.3-71-gd317 From 3e7c67d90e3ed2f34fce42699f11b150dd1d3999 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Thu, 5 Mar 2020 23:01:27 +0100 Subject: bpf: Fix bpf_prog_test_run_tracing for !CONFIG_NET test_run.o is not built when CONFIG_NET is not set and bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the linker error: ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to `bpf_prog_test_run_tracing' Add a __weak function in bpf_trace.c to handle this. Fixes: da00d2f117a0 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING") Signed-off-by: KP Singh Reported-by: Randy Dunlap Acked-by: Randy Dunlap Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org --- kernel/trace/bpf_trace.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 363e0a2c75cf..6a490d8ce9de 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1252,6 +1252,13 @@ static bool tracing_prog_is_valid_access(int off, int size, return btf_ctx_access(off, size, type, prog, info); } +int __weak bpf_prog_test_run_tracing(struct bpf_prog *prog, + const union bpf_attr *kattr, + union bpf_attr __user *uattr) +{ + return -ENOTSUPP; +} + const struct bpf_verifier_ops raw_tracepoint_verifier_ops = { .get_func_proto = raw_tp_prog_func_proto, .is_valid_access = raw_tp_prog_is_valid_access, -- cgit v1.2.3-71-gd317 From 07b24c7c08bdc2d36de10881a17145426f47742b Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Tue, 25 Feb 2020 20:59:22 -0800 Subject: crypto: pcrypt - simplify error handling in pcrypt_create_aead() Simplify the error handling in pcrypt_create_aead() by taking advantage of crypto_grab_aead() now handling an ERR_PTR() name and by taking advantage of crypto_drop_aead() now accepting (as a no-op) a spawn that hasn't been grabbed yet. This required also making padata_free_shell() accept a NULL argument. Signed-off-by: Eric Biggers Signed-off-by: Herbert Xu --- crypto/pcrypt.c | 33 +++++++++------------------------ kernel/padata.c | 7 ++++--- 2 files changed, 13 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c index 1b632139a8c1..8bddc65cd509 100644 --- a/crypto/pcrypt.c +++ b/crypto/pcrypt.c @@ -232,17 +232,12 @@ static int pcrypt_create_aead(struct crypto_template *tmpl, struct rtattr **tb, struct crypto_attr_type *algt; struct aead_instance *inst; struct aead_alg *alg; - const char *name; int err; algt = crypto_get_attr_type(tb); if (IS_ERR(algt)) return PTR_ERR(algt); - name = crypto_attr_alg_name(tb[1]); - if (IS_ERR(name)) - return PTR_ERR(name); - inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) return -ENOMEM; @@ -252,21 +247,21 @@ static int pcrypt_create_aead(struct crypto_template *tmpl, struct rtattr **tb, ctx = aead_instance_ctx(inst); ctx->psenc = padata_alloc_shell(pencrypt); if (!ctx->psenc) - goto out_free_inst; + goto err_free_inst; ctx->psdec = padata_alloc_shell(pdecrypt); if (!ctx->psdec) - goto out_free_psenc; + goto err_free_inst; err = crypto_grab_aead(&ctx->spawn, aead_crypto_instance(inst), - name, 0, 0); + crypto_attr_alg_name(tb[1]), 0, 0); if (err) - goto out_free_psdec; + goto err_free_inst; alg = crypto_spawn_aead_alg(&ctx->spawn); err = pcrypt_init_instance(aead_crypto_instance(inst), &alg->base); if (err) - goto out_drop_aead; + goto err_free_inst; inst->alg.base.cra_flags = CRYPTO_ALG_ASYNC; @@ -286,21 +281,11 @@ static int pcrypt_create_aead(struct crypto_template *tmpl, struct rtattr **tb, inst->free = pcrypt_free; err = aead_register_instance(tmpl, inst); - if (err) - goto out_drop_aead; - -out: + if (err) { +err_free_inst: + pcrypt_free(inst); + } return err; - -out_drop_aead: - crypto_drop_aead(&ctx->spawn); -out_free_psdec: - padata_free_shell(ctx->psdec); -out_free_psenc: - padata_free_shell(ctx->psenc); -out_free_inst: - kfree(inst); - goto out; } static int pcrypt_create(struct crypto_template *tmpl, struct rtattr **tb) diff --git a/kernel/padata.c b/kernel/padata.c index 62082597d4a2..a6afa12fb75e 100644 --- a/kernel/padata.c +++ b/kernel/padata.c @@ -1038,12 +1038,13 @@ EXPORT_SYMBOL(padata_alloc_shell); */ void padata_free_shell(struct padata_shell *ps) { - struct padata_instance *pinst = ps->pinst; + if (!ps) + return; - mutex_lock(&pinst->lock); + mutex_lock(&ps->pinst->lock); list_del(&ps->list); padata_free_pd(rcu_dereference_protected(ps->pd, 1)); - mutex_unlock(&pinst->lock); + mutex_unlock(&ps->pinst->lock); kfree(ps); } -- cgit v1.2.3-71-gd317 From 222993395ed38f3751287f4bd82ef46b3eb3a66d Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 4 Mar 2020 13:02:41 +0100 Subject: futex: Remove pointless mmgrap() + mmdrop() We always set 'key->private.mm' to 'current->mm', getting an extra reference on 'current->mm' is quite pointless, because as long as the task is blocked it isn't going to go away. Signed-off-by: Peter Zijlstra (Intel) --- kernel/futex.c | 14 +------------- 1 file changed, 1 insertion(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index e14f7cd45dbd..3463c916605a 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -331,17 +331,6 @@ static void compat_exit_robust_list(struct task_struct *curr); static inline void compat_exit_robust_list(struct task_struct *curr) { } #endif -static inline void futex_get_mm(union futex_key *key) -{ - mmgrab(key->private.mm); - /* - * Ensure futex_get_mm() implies a full barrier such that - * get_futex_key() implies a full barrier. This is relied upon - * as smp_mb(); (B), see the ordering comment above. - */ - smp_mb__after_atomic(); -} - /* * Reflects a new waiter being added to the waitqueue. */ @@ -432,7 +421,7 @@ static void get_futex_key_refs(union futex_key *key) smp_mb(); /* explicit smp_mb(); (B) */ break; case FUT_OFF_MMSHARED: - futex_get_mm(key); /* implies smp_mb(); (B) */ + smp_mb(); /* explicit smp_mb(); (B) */ break; default: /* @@ -465,7 +454,6 @@ static void drop_futex_key_refs(union futex_key *key) case FUT_OFF_INODE: break; case FUT_OFF_MMSHARED: - mmdrop(key->private.mm); break; } } -- cgit v1.2.3-71-gd317 From 4b39f99c222a2aff6a52fddfa6d8d4aef1771737 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 4 Mar 2020 13:24:24 +0100 Subject: futex: Remove {get,drop}_futex_key_refs() Now that {get,drop}_futex_key_refs() have become a glorified NOP, remove them entirely. The only thing get_futex_key_refs() is still doing is an smp_mb(), and now that we don't need to (ab)use existing atomic ops to obtain them, we can place it explicitly where we need it. Signed-off-by: Peter Zijlstra (Intel) --- kernel/futex.c | 90 ++++------------------------------------------------------ 1 file changed, 6 insertions(+), 84 deletions(-) (limited to 'kernel') diff --git a/kernel/futex.c b/kernel/futex.c index 3463c916605a..b62cf942e4b7 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -135,8 +135,7 @@ * * Where (A) orders the waiters increment and the futex value read through * atomic operations (see hb_waiters_inc) and where (B) orders the write - * to futex and the waiters read -- this is done by the barriers for both - * shared and private futexes in get_futex_key_refs(). + * to futex and the waiters read (see hb_waiters_pending()). * * This yields the following case (where X:=waiters, Y:=futex): * @@ -359,6 +358,10 @@ static inline void hb_waiters_dec(struct futex_hash_bucket *hb) static inline int hb_waiters_pending(struct futex_hash_bucket *hb) { #ifdef CONFIG_SMP + /* + * Full barrier (B), see the ordering comment above. + */ + smp_mb(); return atomic_read(&hb->waiters); #else return 1; @@ -396,68 +399,6 @@ static inline int match_futex(union futex_key *key1, union futex_key *key2) && key1->both.offset == key2->both.offset); } -/* - * Take a reference to the resource addressed by a key. - * Can be called while holding spinlocks. - * - */ -static void get_futex_key_refs(union futex_key *key) -{ - if (!key->both.ptr) - return; - - /* - * On MMU less systems futexes are always "private" as there is no per - * process address space. We need the smp wmb nevertheless - yes, - * arch/blackfin has MMU less SMP ... - */ - if (!IS_ENABLED(CONFIG_MMU)) { - smp_mb(); /* explicit smp_mb(); (B) */ - return; - } - - switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { - case FUT_OFF_INODE: - smp_mb(); /* explicit smp_mb(); (B) */ - break; - case FUT_OFF_MMSHARED: - smp_mb(); /* explicit smp_mb(); (B) */ - break; - default: - /* - * Private futexes do not hold reference on an inode or - * mm, therefore the only purpose of calling get_futex_key_refs - * is because we need the barrier for the lockless waiter check. - */ - smp_mb(); /* explicit smp_mb(); (B) */ - } -} - -/* - * Drop a reference to the resource addressed by a key. - * The hash bucket spinlock must not be held. This is - * a no-op for private futexes, see comment in the get - * counterpart. - */ -static void drop_futex_key_refs(union futex_key *key) -{ - if (!key->both.ptr) { - /* If we're here then we tried to put a key we failed to get */ - WARN_ON_ONCE(1); - return; - } - - if (!IS_ENABLED(CONFIG_MMU)) - return; - - switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { - case FUT_OFF_INODE: - break; - case FUT_OFF_MMSHARED: - break; - } -} - enum futex_access { FUTEX_READ, FUTEX_WRITE @@ -589,7 +530,6 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, enum futex_a if (!fshared) { key->private.mm = mm; key->private.address = address; - get_futex_key_refs(key); /* implies smp_mb(); (B) */ return 0; } @@ -729,8 +669,6 @@ again: rcu_read_unlock(); } - get_futex_key_refs(key); /* implies smp_mb(); (B) */ - out: put_page(page); return err; @@ -738,7 +676,6 @@ out: static inline void put_futex_key(union futex_key *key) { - drop_futex_key_refs(key); } /** @@ -1873,7 +1810,6 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1, plist_add(&q->list, &hb2->chain); q->lock_ptr = &hb2->lock; } - get_futex_key_refs(key2); q->key = *key2; } @@ -1895,7 +1831,6 @@ static inline void requeue_pi_wake_futex(struct futex_q *q, union futex_key *key, struct futex_hash_bucket *hb) { - get_futex_key_refs(key); q->key = *key; __unqueue_futex(q); @@ -2006,7 +1941,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, u32 *cmpval, int requeue_pi) { union futex_key key1 = FUTEX_KEY_INIT, key2 = FUTEX_KEY_INIT; - int drop_count = 0, task_count = 0, ret; + int task_count = 0, ret; struct futex_pi_state *pi_state = NULL; struct futex_hash_bucket *hb1, *hb2; struct futex_q *this, *next; @@ -2127,7 +2062,6 @@ retry_private: */ if (ret > 0) { WARN_ON(pi_state); - drop_count++; task_count++; /* * If we acquired the lock, then the user space value @@ -2247,7 +2181,6 @@ retry_private: * doing so. */ requeue_pi_wake_futex(this, &key2, hb2); - drop_count++; continue; } else if (ret) { /* @@ -2268,7 +2201,6 @@ retry_private: } } requeue_futex(this, hb1, hb2, &key2); - drop_count++; } /* @@ -2283,15 +2215,6 @@ out_unlock: wake_up_q(&wake_q); hb_waiters_dec(hb2); - /* - * drop_futex_key_refs() must be called outside the spinlocks. During - * the requeue we moved futex_q's from the hash bucket at key1 to the - * one at key2 and updated their key pointer. We no longer need to - * hold the references to key1. - */ - while (--drop_count >= 0) - drop_futex_key_refs(&key1); - out_put_keys: put_futex_key(&key2); out_put_key1: @@ -2421,7 +2344,6 @@ retry: ret = 1; } - drop_futex_key_refs(&q->key); return ret; } -- cgit v1.2.3-71-gd317 From ab6f824cfdf7363b5e529621cbc72ae6519c78d1 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 7 Aug 2019 11:17:00 +0200 Subject: perf/core: Unify {pinned,flexible}_sched_in() Less is more; unify the two very nearly identical function. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar --- kernel/events/core.c | 58 +++++++++++++++++++--------------------------------- 1 file changed, 21 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 3f1f77de7247..b713080eedcc 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1986,6 +1986,12 @@ static int perf_get_aux_event(struct perf_event *event, return 1; } +static inline struct list_head *get_event_list(struct perf_event *event) +{ + struct perf_event_context *ctx = event->ctx; + return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active; +} + static void perf_group_detach(struct perf_event *event) { struct perf_event *sibling, *tmp; @@ -2028,12 +2034,8 @@ static void perf_group_detach(struct perf_event *event) if (!RB_EMPTY_NODE(&event->group_node)) { add_event_to_groups(sibling, event->ctx); - if (sibling->state == PERF_EVENT_STATE_ACTIVE) { - struct list_head *list = sibling->attr.pinned ? - &ctx->pinned_active : &ctx->flexible_active; - - list_add_tail(&sibling->active_list, list); - } + if (sibling->state == PERF_EVENT_STATE_ACTIVE) + list_add_tail(&sibling->active_list, get_event_list(sibling)); } WARN_ON_ONCE(sibling->ctx != event->ctx); @@ -2350,6 +2352,8 @@ event_sched_in(struct perf_event *event, { int ret = 0; + WARN_ON_ONCE(event->ctx != ctx); + lockdep_assert_held(&ctx->lock); if (event->state <= PERF_EVENT_STATE_OFF) @@ -3425,10 +3429,12 @@ struct sched_in_data { int can_add_hw; }; -static int pinned_sched_in(struct perf_event *event, void *data) +static int merge_sched_in(struct perf_event *event, void *data) { struct sched_in_data *sid = data; + WARN_ON_ONCE(event->ctx != sid->ctx); + if (event->state <= PERF_EVENT_STATE_OFF) return 0; @@ -3437,37 +3443,15 @@ static int pinned_sched_in(struct perf_event *event, void *data) if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) { if (!group_sched_in(event, sid->cpuctx, sid->ctx)) - list_add_tail(&event->active_list, &sid->ctx->pinned_active); + list_add_tail(&event->active_list, get_event_list(event)); } - /* - * If this pinned group hasn't been scheduled, - * put it in error state. - */ - if (event->state == PERF_EVENT_STATE_INACTIVE) - perf_event_set_state(event, PERF_EVENT_STATE_ERROR); - - return 0; -} - -static int flexible_sched_in(struct perf_event *event, void *data) -{ - struct sched_in_data *sid = data; - - if (event->state <= PERF_EVENT_STATE_OFF) - return 0; - - if (!event_filter_match(event)) - return 0; + if (event->state == PERF_EVENT_STATE_INACTIVE) { + if (event->attr.pinned) + perf_event_set_state(event, PERF_EVENT_STATE_ERROR); - if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) { - int ret = group_sched_in(event, sid->cpuctx, sid->ctx); - if (ret) { - sid->can_add_hw = 0; - sid->ctx->rotate_necessary = 1; - return 0; - } - list_add_tail(&event->active_list, &sid->ctx->flexible_active); + sid->can_add_hw = 0; + sid->ctx->rotate_necessary = 1; } return 0; @@ -3485,7 +3469,7 @@ ctx_pinned_sched_in(struct perf_event_context *ctx, visit_groups_merge(&ctx->pinned_groups, smp_processor_id(), - pinned_sched_in, &sid); + merge_sched_in, &sid); } static void @@ -3500,7 +3484,7 @@ ctx_flexible_sched_in(struct perf_event_context *ctx, visit_groups_merge(&ctx->flexible_groups, smp_processor_id(), - flexible_sched_in, &sid); + merge_sched_in, &sid); } static void -- cgit v1.2.3-71-gd317 From 2c2366c7548ecee65adfd264517ddf50f9e2d029 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 7 Aug 2019 11:45:01 +0200 Subject: perf/core: Remove 'struct sched_in_data' We can deduce the ctx and cpuctx from the event, no need to pass them along. Remove the structure and pass in can_add_hw directly. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar --- kernel/events/core.c | 36 +++++++++++------------------------- 1 file changed, 11 insertions(+), 25 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index b713080eedcc..b7eaabaee76f 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3423,17 +3423,11 @@ static int visit_groups_merge(struct perf_event_groups *groups, int cpu, return 0; } -struct sched_in_data { - struct perf_event_context *ctx; - struct perf_cpu_context *cpuctx; - int can_add_hw; -}; - static int merge_sched_in(struct perf_event *event, void *data) { - struct sched_in_data *sid = data; - - WARN_ON_ONCE(event->ctx != sid->ctx); + struct perf_event_context *ctx = event->ctx; + struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); + int *can_add_hw = data; if (event->state <= PERF_EVENT_STATE_OFF) return 0; @@ -3441,8 +3435,8 @@ static int merge_sched_in(struct perf_event *event, void *data) if (!event_filter_match(event)) return 0; - if (group_can_go_on(event, sid->cpuctx, sid->can_add_hw)) { - if (!group_sched_in(event, sid->cpuctx, sid->ctx)) + if (group_can_go_on(event, cpuctx, *can_add_hw)) { + if (!group_sched_in(event, cpuctx, ctx)) list_add_tail(&event->active_list, get_event_list(event)); } @@ -3450,8 +3444,8 @@ static int merge_sched_in(struct perf_event *event, void *data) if (event->attr.pinned) perf_event_set_state(event, PERF_EVENT_STATE_ERROR); - sid->can_add_hw = 0; - sid->ctx->rotate_necessary = 1; + *can_add_hw = 0; + ctx->rotate_necessary = 1; } return 0; @@ -3461,30 +3455,22 @@ static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct perf_cpu_context *cpuctx) { - struct sched_in_data sid = { - .ctx = ctx, - .cpuctx = cpuctx, - .can_add_hw = 1, - }; + int can_add_hw = 1; visit_groups_merge(&ctx->pinned_groups, smp_processor_id(), - merge_sched_in, &sid); + merge_sched_in, &can_add_hw); } static void ctx_flexible_sched_in(struct perf_event_context *ctx, struct perf_cpu_context *cpuctx) { - struct sched_in_data sid = { - .ctx = ctx, - .cpuctx = cpuctx, - .can_add_hw = 1, - }; + int can_add_hw = 1; visit_groups_merge(&ctx->flexible_groups, smp_processor_id(), - merge_sched_in, &sid); + merge_sched_in, &can_add_hw); } static void -- cgit v1.2.3-71-gd317 From 98add2af89bbfe8241e189b490fd91e5751c7900 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 13 Feb 2020 23:51:28 -0800 Subject: perf/cgroup: Reorder perf_cgroup_connect() Move perf_cgroup_connect() after perf_event_alloc(), such that we can find/use the PMU's cpu context. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200214075133.181299-2-irogers@google.com --- kernel/events/core.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index b7eaabaee76f..dceeeb1d012a 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -10774,12 +10774,6 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, if (!has_branch_stack(event)) event->attr.branch_sample_type = 0; - if (cgroup_fd != -1) { - err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader); - if (err) - goto err_ns; - } - pmu = perf_init_event(event); if (IS_ERR(pmu)) { err = PTR_ERR(pmu); @@ -10801,6 +10795,12 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, goto err_pmu; } + if (cgroup_fd != -1) { + err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader); + if (err) + goto err_pmu; + } + err = exclusive_event_init(event); if (err) goto err_pmu; @@ -10861,12 +10861,12 @@ err_per_task: exclusive_event_destroy(event); err_pmu: + if (is_cgroup_event(event)) + perf_detach_cgroup(event); if (event->destroy) event->destroy(event); module_put(pmu->module); err_ns: - if (is_cgroup_event(event)) - perf_detach_cgroup(event); if (event->ns) put_pid_ns(event->ns); if (event->hw.target) -- cgit v1.2.3-71-gd317 From 6eef8a7116deae0706ba6d897c0d7dd887cd2be2 Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Thu, 13 Feb 2020 23:51:30 -0800 Subject: perf/core: Use min_heap in visit_groups_merge() visit_groups_merge will pick the next event based on when it was inserted in to the context (perf_event group_index). Events may be per CPU or for any CPU, but in the future we'd also like to have per cgroup events to avoid searching all events for the events to schedule for a cgroup. Introduce a min heap for the events that maintains a property that the earliest inserted event is always at the 0th element. Initialize the heap with per-CPU and any-CPU events for the context. Based-on-work-by: Peter Zijlstra (Intel) Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200214075133.181299-4-irogers@google.com --- kernel/events/core.c | 67 +++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 51 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index dceeeb1d012a..ddfb06c9f367 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -49,6 +49,7 @@ #include #include #include +#include #include "internal.h" @@ -3392,32 +3393,66 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx, ctx_sched_out(&cpuctx->ctx, cpuctx, event_type); } -static int visit_groups_merge(struct perf_event_groups *groups, int cpu, - int (*func)(struct perf_event *, void *), void *data) +static bool perf_less_group_idx(const void *l, const void *r) { - struct perf_event **evt, *evt1, *evt2; + const struct perf_event *le = l, *re = r; + + return le->group_index < re->group_index; +} + +static void swap_ptr(void *l, void *r) +{ + void **lp = l, **rp = r; + + swap(*lp, *rp); +} + +static const struct min_heap_callbacks perf_min_heap = { + .elem_size = sizeof(struct perf_event *), + .less = perf_less_group_idx, + .swp = swap_ptr, +}; + +static void __heap_add(struct min_heap *heap, struct perf_event *event) +{ + struct perf_event **itrs = heap->data; + + if (event) { + itrs[heap->nr] = event; + heap->nr++; + } +} + +static noinline int visit_groups_merge(struct perf_event_groups *groups, + int cpu, + int (*func)(struct perf_event *, void *), + void *data) +{ + /* Space for per CPU and/or any CPU event iterators. */ + struct perf_event *itrs[2]; + struct min_heap event_heap = { + .data = itrs, + .nr = 0, + .size = ARRAY_SIZE(itrs), + }; + struct perf_event **evt = event_heap.data; int ret; - evt1 = perf_event_groups_first(groups, -1); - evt2 = perf_event_groups_first(groups, cpu); + __heap_add(&event_heap, perf_event_groups_first(groups, -1)); + __heap_add(&event_heap, perf_event_groups_first(groups, cpu)); - while (evt1 || evt2) { - if (evt1 && evt2) { - if (evt1->group_index < evt2->group_index) - evt = &evt1; - else - evt = &evt2; - } else if (evt1) { - evt = &evt1; - } else { - evt = &evt2; - } + min_heapify_all(&event_heap, &perf_min_heap); + while (event_heap.nr) { ret = func(*evt, data); if (ret) return ret; *evt = perf_event_groups_next(*evt); + if (*evt) + min_heapify(&event_heap, 0, &perf_min_heap); + else + min_heap_pop(&event_heap, &perf_min_heap); } return 0; -- cgit v1.2.3-71-gd317 From 836196beb377e59e54ec9e04f7402076ef7a8bd8 Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Thu, 13 Feb 2020 23:51:31 -0800 Subject: perf/core: Add per perf_cpu_context min_heap storage The storage required for visit_groups_merge's min heap needs to vary in order to support more iterators, such as when multiple nested cgroups' events are being visited. This change allows for 2 iterators and doesn't support growth. Based-on-work-by: Peter Zijlstra (Intel) Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com --- include/linux/perf_event.h | 7 +++++++ kernel/events/core.c | 43 ++++++++++++++++++++++++++++++++----------- 2 files changed, 39 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 68e21e828893..8768a39b5258 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -862,6 +862,13 @@ struct perf_cpu_context { int sched_cb_usage; int online; + /* + * Per-CPU storage for iterators used in visit_groups_merge. The default + * storage is of size 2 to hold the CPU and any CPU event iterators. + */ + int heap_size; + struct perf_event **heap; + struct perf_event *heap_default[2]; }; struct perf_output_handle { diff --git a/kernel/events/core.c b/kernel/events/core.c index ddfb06c9f367..7529e76a2e36 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3423,22 +3423,34 @@ static void __heap_add(struct min_heap *heap, struct perf_event *event) } } -static noinline int visit_groups_merge(struct perf_event_groups *groups, - int cpu, +static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx, + struct perf_event_groups *groups, int cpu, int (*func)(struct perf_event *, void *), void *data) { /* Space for per CPU and/or any CPU event iterators. */ struct perf_event *itrs[2]; - struct min_heap event_heap = { - .data = itrs, - .nr = 0, - .size = ARRAY_SIZE(itrs), - }; - struct perf_event **evt = event_heap.data; + struct min_heap event_heap; + struct perf_event **evt; int ret; - __heap_add(&event_heap, perf_event_groups_first(groups, -1)); + if (cpuctx) { + event_heap = (struct min_heap){ + .data = cpuctx->heap, + .nr = 0, + .size = cpuctx->heap_size, + }; + } else { + event_heap = (struct min_heap){ + .data = itrs, + .nr = 0, + .size = ARRAY_SIZE(itrs), + }; + /* Events not within a CPU context may be on any CPU. */ + __heap_add(&event_heap, perf_event_groups_first(groups, -1)); + } + evt = event_heap.data; + __heap_add(&event_heap, perf_event_groups_first(groups, cpu)); min_heapify_all(&event_heap, &perf_min_heap); @@ -3492,7 +3504,10 @@ ctx_pinned_sched_in(struct perf_event_context *ctx, { int can_add_hw = 1; - visit_groups_merge(&ctx->pinned_groups, + if (ctx != &cpuctx->ctx) + cpuctx = NULL; + + visit_groups_merge(cpuctx, &ctx->pinned_groups, smp_processor_id(), merge_sched_in, &can_add_hw); } @@ -3503,7 +3518,10 @@ ctx_flexible_sched_in(struct perf_event_context *ctx, { int can_add_hw = 1; - visit_groups_merge(&ctx->flexible_groups, + if (ctx != &cpuctx->ctx) + cpuctx = NULL; + + visit_groups_merge(cpuctx, &ctx->flexible_groups, smp_processor_id(), merge_sched_in, &can_add_hw); } @@ -10364,6 +10382,9 @@ skip_type: cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask); __perf_mux_hrtimer_init(cpuctx, cpu); + + cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default); + cpuctx->heap = cpuctx->heap_default; } got_cpu_context: -- cgit v1.2.3-71-gd317 From c2283c9368d41063f2077cb58def02217360526d Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Thu, 13 Feb 2020 23:51:32 -0800 Subject: perf/cgroup: Grow per perf_cpu_context heap storage Allow the per-CPU min heap storage to have sufficient space for per-cgroup iterators. Based-on-work-by: Peter Zijlstra (Intel) Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200214075133.181299-6-irogers@google.com --- kernel/events/core.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 7529e76a2e36..8065949a865e 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -892,6 +892,47 @@ static inline void perf_cgroup_sched_in(struct task_struct *prev, rcu_read_unlock(); } +static int perf_cgroup_ensure_storage(struct perf_event *event, + struct cgroup_subsys_state *css) +{ + struct perf_cpu_context *cpuctx; + struct perf_event **storage; + int cpu, heap_size, ret = 0; + + /* + * Allow storage to have sufficent space for an iterator for each + * possibly nested cgroup plus an iterator for events with no cgroup. + */ + for (heap_size = 1; css; css = css->parent) + heap_size++; + + for_each_possible_cpu(cpu) { + cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu); + if (heap_size <= cpuctx->heap_size) + continue; + + storage = kmalloc_node(heap_size * sizeof(struct perf_event *), + GFP_KERNEL, cpu_to_node(cpu)); + if (!storage) { + ret = -ENOMEM; + break; + } + + raw_spin_lock_irq(&cpuctx->ctx.lock); + if (cpuctx->heap_size < heap_size) { + swap(cpuctx->heap, storage); + if (storage == cpuctx->heap_default) + storage = NULL; + cpuctx->heap_size = heap_size; + } + raw_spin_unlock_irq(&cpuctx->ctx.lock); + + kfree(storage); + } + + return ret; +} + static inline int perf_cgroup_connect(int fd, struct perf_event *event, struct perf_event_attr *attr, struct perf_event *group_leader) @@ -911,6 +952,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event, goto out; } + ret = perf_cgroup_ensure_storage(event, css); + if (ret) + goto out; + cgrp = container_of(css, struct perf_cgroup, css); event->cgrp = cgrp; @@ -3440,6 +3485,8 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx, .nr = 0, .size = cpuctx->heap_size, }; + + lockdep_assert_held(&cpuctx->ctx.lock); } else { event_heap = (struct min_heap){ .data = itrs, -- cgit v1.2.3-71-gd317 From 95ed6c707f26a727a29972b60469630ae10d579c Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Thu, 13 Feb 2020 23:51:33 -0800 Subject: perf/cgroup: Order events in RB tree by cgroup id If one is monitoring 6 events on 20 cgroups the per-CPU RB tree will hold 120 events. The scheduling in of the events currently iterates over all events looking to see which events match the task's cgroup or its cgroup hierarchy. If a task is in 1 cgroup with 6 events, then 114 events are considered unnecessarily. This change orders events in the RB tree by cgroup id if it is present. This means scheduling in may go directly to events associated with the task's cgroup if one is present. The per-CPU iterator storage in visit_groups_merge is sized sufficent for an iterator per cgroup depth, where different iterators are needed for the task's cgroup and parent cgroups. By considering the set of iterators when visiting, the lowest group_index event may be selected and the insertion order group_index property is maintained. This also allows event rotation to function correctly, as although events are grouped into a cgroup, rotation always selects the lowest group_index event to rotate (delete/insert into the tree) and the min heap of iterators make it so that the group_index order is maintained. Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20190724223746.153620-3-irogers@google.com --- kernel/events/core.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 84 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 8065949a865e..ccf8d4fc6374 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1577,6 +1577,30 @@ perf_event_groups_less(struct perf_event *left, struct perf_event *right) if (left->cpu > right->cpu) return false; +#ifdef CONFIG_CGROUP_PERF + if (left->cgrp != right->cgrp) { + if (!left->cgrp || !left->cgrp->css.cgroup) { + /* + * Left has no cgroup but right does, no cgroups come + * first. + */ + return true; + } + if (!right->cgrp || right->cgrp->css.cgroup) { + /* + * Right has no cgroup but left does, no cgroups come + * first. + */ + return false; + } + /* Two dissimilar cgroups, order by id. */ + if (left->cgrp->css.cgroup->kn->id < right->cgrp->css.cgroup->kn->id) + return true; + + return false; + } +#endif + if (left->group_index < right->group_index) return true; if (left->group_index > right->group_index) @@ -1656,25 +1680,48 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx) } /* - * Get the leftmost event in the @cpu subtree. + * Get the leftmost event in the cpu/cgroup subtree. */ static struct perf_event * -perf_event_groups_first(struct perf_event_groups *groups, int cpu) +perf_event_groups_first(struct perf_event_groups *groups, int cpu, + struct cgroup *cgrp) { struct perf_event *node_event = NULL, *match = NULL; struct rb_node *node = groups->tree.rb_node; +#ifdef CONFIG_CGROUP_PERF + u64 node_cgrp_id, cgrp_id = 0; + + if (cgrp) + cgrp_id = cgrp->kn->id; +#endif while (node) { node_event = container_of(node, struct perf_event, group_node); if (cpu < node_event->cpu) { node = node->rb_left; - } else if (cpu > node_event->cpu) { + continue; + } + if (cpu > node_event->cpu) { node = node->rb_right; - } else { - match = node_event; + continue; + } +#ifdef CONFIG_CGROUP_PERF + node_cgrp_id = 0; + if (node_event->cgrp && node_event->cgrp->css.cgroup) + node_cgrp_id = node_event->cgrp->css.cgroup->kn->id; + + if (cgrp_id < node_cgrp_id) { node = node->rb_left; + continue; + } + if (cgrp_id > node_cgrp_id) { + node = node->rb_right; + continue; } +#endif + match = node_event; + node = node->rb_left; } return match; @@ -1687,12 +1734,26 @@ static struct perf_event * perf_event_groups_next(struct perf_event *event) { struct perf_event *next; +#ifdef CONFIG_CGROUP_PERF + u64 curr_cgrp_id = 0; + u64 next_cgrp_id = 0; +#endif next = rb_entry_safe(rb_next(&event->group_node), typeof(*event), group_node); - if (next && next->cpu == event->cpu) - return next; + if (next == NULL || next->cpu != event->cpu) + return NULL; - return NULL; +#ifdef CONFIG_CGROUP_PERF + if (event->cgrp && event->cgrp->css.cgroup) + curr_cgrp_id = event->cgrp->css.cgroup->kn->id; + + if (next->cgrp && next->cgrp->css.cgroup) + next_cgrp_id = next->cgrp->css.cgroup->kn->id; + + if (curr_cgrp_id != next_cgrp_id) + return NULL; +#endif + return next; } /* @@ -3473,6 +3534,9 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx, int (*func)(struct perf_event *, void *), void *data) { +#ifdef CONFIG_CGROUP_PERF + struct cgroup_subsys_state *css = NULL; +#endif /* Space for per CPU and/or any CPU event iterators. */ struct perf_event *itrs[2]; struct min_heap event_heap; @@ -3487,6 +3551,11 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx, }; lockdep_assert_held(&cpuctx->ctx.lock); + +#ifdef CONFIG_CGROUP_PERF + if (cpuctx->cgrp) + css = &cpuctx->cgrp->css; +#endif } else { event_heap = (struct min_heap){ .data = itrs, @@ -3494,11 +3563,16 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx, .size = ARRAY_SIZE(itrs), }; /* Events not within a CPU context may be on any CPU. */ - __heap_add(&event_heap, perf_event_groups_first(groups, -1)); + __heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL)); } evt = event_heap.data; - __heap_add(&event_heap, perf_event_groups_first(groups, cpu)); + __heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL)); + +#ifdef CONFIG_CGROUP_PERF + for (; css; css = css->parent) + __heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup)); +#endif min_heapify_all(&event_heap, &perf_min_heap); -- cgit v1.2.3-71-gd317 From f1dfdab694eb3838ac26f4b73695929c07d92a33 Mon Sep 17 00:00:00 2001 From: Chris Wilson Date: Thu, 23 Jan 2020 19:08:49 +0100 Subject: sched/vtime: Prevent unstable evaluation of WARN(vtime->state) As the vtime is sampled under loose seqcount protection by kcpustat, the vtime fields may change as the code flows. Where logic dictates a field has a static value, use a READ_ONCE. Signed-off-by: Chris Wilson Signed-off-by: Frederic Weisbecker Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 74722bb223d0 ("sched/vtime: Bring up complete kcpustat accessor") Link: https://lkml.kernel.org/r/20200123180849.28486-1-frederic@kernel.org --- kernel/sched/cputime.c | 41 ++++++++++++++++++++++------------------- 1 file changed, 22 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index cff3e656566d..dac9104d126f 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -909,8 +909,10 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime) } while (read_seqcount_retry(&vtime->seqcount, seq)); } -static int vtime_state_check(struct vtime *vtime, int cpu) +static int vtime_state_fetch(struct vtime *vtime, int cpu) { + int state = READ_ONCE(vtime->state); + /* * We raced against a context switch, fetch the * kcpustat task again. @@ -927,10 +929,10 @@ static int vtime_state_check(struct vtime *vtime, int cpu) * * Case 1) is ok but 2) is not. So wait for a safe VTIME state. */ - if (vtime->state == VTIME_INACTIVE) + if (state == VTIME_INACTIVE) return -EAGAIN; - return 0; + return state; } static u64 kcpustat_user_vtime(struct vtime *vtime) @@ -949,14 +951,15 @@ static int kcpustat_field_vtime(u64 *cpustat, { struct vtime *vtime = &tsk->vtime; unsigned int seq; - int err; do { + int state; + seq = read_seqcount_begin(&vtime->seqcount); - err = vtime_state_check(vtime, cpu); - if (err < 0) - return err; + state = vtime_state_fetch(vtime, cpu); + if (state < 0) + return state; *val = cpustat[usage]; @@ -969,7 +972,7 @@ static int kcpustat_field_vtime(u64 *cpustat, */ switch (usage) { case CPUTIME_SYSTEM: - if (vtime->state == VTIME_SYS) + if (state == VTIME_SYS) *val += vtime->stime + vtime_delta(vtime); break; case CPUTIME_USER: @@ -981,11 +984,11 @@ static int kcpustat_field_vtime(u64 *cpustat, *val += kcpustat_user_vtime(vtime); break; case CPUTIME_GUEST: - if (vtime->state == VTIME_GUEST && task_nice(tsk) <= 0) + if (state == VTIME_GUEST && task_nice(tsk) <= 0) *val += vtime->gtime + vtime_delta(vtime); break; case CPUTIME_GUEST_NICE: - if (vtime->state == VTIME_GUEST && task_nice(tsk) > 0) + if (state == VTIME_GUEST && task_nice(tsk) > 0) *val += vtime->gtime + vtime_delta(vtime); break; default: @@ -1036,23 +1039,23 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst, { struct vtime *vtime = &tsk->vtime; unsigned int seq; - int err; do { u64 *cpustat; u64 delta; + int state; seq = read_seqcount_begin(&vtime->seqcount); - err = vtime_state_check(vtime, cpu); - if (err < 0) - return err; + state = vtime_state_fetch(vtime, cpu); + if (state < 0) + return state; *dst = *src; cpustat = dst->cpustat; /* Task is sleeping, dead or idle, nothing to add */ - if (vtime->state < VTIME_SYS) + if (state < VTIME_SYS) continue; delta = vtime_delta(vtime); @@ -1061,15 +1064,15 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst, * Task runs either in user (including guest) or kernel space, * add pending nohz time to the right place. */ - if (vtime->state == VTIME_SYS) { + if (state == VTIME_SYS) { cpustat[CPUTIME_SYSTEM] += vtime->stime + delta; - } else if (vtime->state == VTIME_USER) { + } else if (state == VTIME_USER) { if (task_nice(tsk) > 0) cpustat[CPUTIME_NICE] += vtime->utime + delta; else cpustat[CPUTIME_USER] += vtime->utime + delta; } else { - WARN_ON_ONCE(vtime->state != VTIME_GUEST); + WARN_ON_ONCE(state != VTIME_GUEST); if (task_nice(tsk) > 0) { cpustat[CPUTIME_GUEST_NICE] += vtime->gtime + delta; cpustat[CPUTIME_NICE] += vtime->gtime + delta; @@ -1080,7 +1083,7 @@ static int kcpustat_cpu_fetch_vtime(struct kernel_cpustat *dst, } } while (read_seqcount_retry(&vtime->seqcount, seq)); - return err; + return 0; } void kcpustat_cpu_fetch(struct kernel_cpustat *dst, int cpu) -- cgit v1.2.3-71-gd317 From 765047932f153265db6ef15be208d6cbfc03dc62 Mon Sep 17 00:00:00 2001 From: Thara Gopinath Date: Fri, 21 Feb 2020 19:52:05 -0500 Subject: sched/pelt: Add support to track thermal pressure Extrapolating on the existing framework to track rt/dl utilization using pelt signals, add a similar mechanism to track thermal pressure. The difference here from rt/dl utilization tracking is that, instead of tracking time spent by a CPU running a RT/DL task through util_avg, the average thermal pressure is tracked through load_avg. This is because thermal pressure signal is weighted time "delta" capacity unlike util_avg which is binary. "delta capacity" here means delta between the actual capacity of a CPU and the decreased capacity a CPU due to a thermal event. In order to track average thermal pressure, a new sched_avg variable avg_thermal is introduced. Function update_thermal_load_avg can be called to do the periodic bookkeeping (accumulate, decay and average) of the thermal pressure. Reviewed-by: Vincent Guittot Signed-off-by: Thara Gopinath Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200222005213.3873-2-thara.gopinath@linaro.org --- include/trace/events/sched.h | 4 ++++ init/Kconfig | 4 ++++ kernel/sched/pelt.c | 31 +++++++++++++++++++++++++++++++ kernel/sched/pelt.h | 31 +++++++++++++++++++++++++++++++ kernel/sched/sched.h | 3 +++ 5 files changed, 73 insertions(+) (limited to 'kernel') diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 9c3ebb7c83a5..ed168b0e2c53 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -618,6 +618,10 @@ DECLARE_TRACE(pelt_dl_tp, TP_PROTO(struct rq *rq), TP_ARGS(rq)); +DECLARE_TRACE(pelt_thermal_tp, + TP_PROTO(struct rq *rq), + TP_ARGS(rq)); + DECLARE_TRACE(pelt_irq_tp, TP_PROTO(struct rq *rq), TP_ARGS(rq)); diff --git a/init/Kconfig b/init/Kconfig index 20a6ac33761c..275c848b02c6 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -451,6 +451,10 @@ config HAVE_SCHED_AVG_IRQ depends on IRQ_TIME_ACCOUNTING || PARAVIRT_TIME_ACCOUNTING depends on SMP +config SCHED_THERMAL_PRESSURE + bool "Enable periodic averaging of thermal pressure" + depends on SMP + config BSD_PROCESS_ACCT bool "BSD Process Accounting" depends on MULTIUSER diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c index c40d57a2a248..b647d04d9c8b 100644 --- a/kernel/sched/pelt.c +++ b/kernel/sched/pelt.c @@ -368,6 +368,37 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) return 0; } +#ifdef CONFIG_SCHED_THERMAL_PRESSURE +/* + * thermal: + * + * load_sum = \Sum se->avg.load_sum but se->avg.load_sum is not tracked + * + * util_avg and runnable_load_avg are not supported and meaningless. + * + * Unlike rt/dl utilization tracking that track time spent by a cpu + * running a rt/dl task through util_avg, the average thermal pressure is + * tracked through load_avg. This is because thermal pressure signal is + * time weighted "delta" capacity unlike util_avg which is binary. + * "delta capacity" = actual capacity - + * capped capacity a cpu due to a thermal event. + */ + +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity) +{ + if (___update_load_sum(now, &rq->avg_thermal, + capacity, + capacity, + capacity)) { + ___update_load_avg(&rq->avg_thermal, 1); + trace_pelt_thermal_tp(rq); + return 1; + } + + return 0; +} +#endif + #ifdef CONFIG_HAVE_SCHED_AVG_IRQ /* * irq: diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index afff644da065..eb034d9f024d 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -7,6 +7,26 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq); int update_rt_rq_load_avg(u64 now, struct rq *rq, int running); int update_dl_rq_load_avg(u64 now, struct rq *rq, int running); +#ifdef CONFIG_SCHED_THERMAL_PRESSURE +int update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity); + +static inline u64 thermal_load_avg(struct rq *rq) +{ + return READ_ONCE(rq->avg_thermal.load_avg); +} +#else +static inline int +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity) +{ + return 0; +} + +static inline u64 thermal_load_avg(struct rq *rq) +{ + return 0; +} +#endif + #ifdef CONFIG_HAVE_SCHED_AVG_IRQ int update_irq_load_avg(struct rq *rq, u64 running); #else @@ -158,6 +178,17 @@ update_dl_rq_load_avg(u64 now, struct rq *rq, int running) return 0; } +static inline int +update_thermal_load_avg(u64 now, struct rq *rq, u64 capacity) +{ + return 0; +} + +static inline u64 thermal_load_avg(struct rq *rq) +{ + return 0; +} + static inline int update_irq_load_avg(struct rq *rq, u64 running) { diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2a0caf394dd4..6c839f829a25 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -960,6 +960,9 @@ struct rq { struct sched_avg avg_dl; #ifdef CONFIG_HAVE_SCHED_AVG_IRQ struct sched_avg avg_irq; +#endif +#ifdef CONFIG_SCHED_THERMAL_PRESSURE + struct sched_avg avg_thermal; #endif u64 idle_stamp; u64 avg_idle; -- cgit v1.2.3-71-gd317 From b4eccf5f8e1dcade112d97be86ad455a94501a0f Mon Sep 17 00:00:00 2001 From: Thara Gopinath Date: Fri, 21 Feb 2020 19:52:10 -0500 Subject: sched/fair: Enable periodic update of average thermal pressure Introduce support in scheduler periodic tick and other CFS bookkeeping APIs to trigger the process of computing average thermal pressure for a CPU. Also consider avg_thermal.load_avg in others_have_blocked which allows for decay of pelt signals. Signed-off-by: Thara Gopinath Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200222005213.3873-7-thara.gopinath@linaro.org --- kernel/sched/core.c | 3 +++ kernel/sched/fair.c | 7 +++++++ 2 files changed, 10 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 8e6f38073ab3..3e620fe8e3a0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3586,6 +3586,7 @@ void scheduler_tick(void) struct rq *rq = cpu_rq(cpu); struct task_struct *curr = rq->curr; struct rq_flags rf; + unsigned long thermal_pressure; arch_scale_freq_tick(); sched_clock_tick(); @@ -3593,6 +3594,8 @@ void scheduler_tick(void) rq_lock(rq, &rf); update_rq_clock(rq); + thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure); curr->sched_class->task_tick(rq, curr, 0); calc_global_load_tick(rq); psi_task_tick(rq); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4b5d5e5e701e..11f8488f83d7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7719,6 +7719,9 @@ static inline bool others_have_blocked(struct rq *rq) if (READ_ONCE(rq->avg_dl.util_avg)) return true; + if (thermal_load_avg(rq)) + return true; + #ifdef CONFIG_HAVE_SCHED_AVG_IRQ if (READ_ONCE(rq->avg_irq.util_avg)) return true; @@ -7744,6 +7747,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) { const struct sched_class *curr_class; u64 now = rq_clock_pelt(rq); + unsigned long thermal_pressure; bool decayed; /* @@ -7752,8 +7756,11 @@ static bool __update_blocked_others(struct rq *rq, bool *done) */ curr_class = rq->curr->sched_class; + thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); + decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | + update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) | update_irq_load_avg(rq, 0); if (others_have_blocked(rq)) -- cgit v1.2.3-71-gd317 From 467b7d01c469dc6aa492c17d1f1d1952632728f1 Mon Sep 17 00:00:00 2001 From: Thara Gopinath Date: Fri, 21 Feb 2020 19:52:11 -0500 Subject: sched/fair: Update cpu_capacity to reflect thermal pressure cpu_capacity initially reflects the maximum possible capacity of a CPU. Thermal pressure on a CPU means this maximum possible capacity is unavailable due to thermal events. This patch subtracts the average thermal pressure for a CPU from its maximum possible capacity so that cpu_capacity reflects the remaining maximum capacity. Signed-off-by: Thara Gopinath Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200222005213.3873-8-thara.gopinath@linaro.org --- kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 11f8488f83d7..aa51286c66da 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7984,8 +7984,15 @@ static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu) if (unlikely(irq >= max)) return 1; + /* + * avg_rt.util_avg and avg_dl.util_avg track binary signals + * (running and not running) with weights 0 and 1024 respectively. + * avg_thermal.load_avg tracks thermal pressure and the weighted + * average uses the actual delta max capacity(load). + */ used = READ_ONCE(rq->avg_rt.util_avg); used += READ_ONCE(rq->avg_dl.util_avg); + used += thermal_load_avg(rq); if (unlikely(used >= max)) return 1; -- cgit v1.2.3-71-gd317 From 05289b90c2e40ae80f5c70431cd0be4cc8a6038d Mon Sep 17 00:00:00 2001 From: Thara Gopinath Date: Fri, 21 Feb 2020 19:52:13 -0500 Subject: sched/fair: Enable tuning of decay period Thermal pressure follows pelt signals which means the decay period for thermal pressure is the default pelt decay period. Depending on SoC characteristics and thermal activity, it might be beneficial to decay thermal pressure slower, but still in-tune with the pelt signals. One way to achieve this is to provide a command line parameter to set a decay shift parameter to an integer between 0 and 10. Signed-off-by: Thara Gopinath Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200222005213.3873-10-thara.gopinath@linaro.org --- Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++++++++++ kernel/sched/core.c | 2 +- kernel/sched/fair.c | 15 ++++++++++++++- kernel/sched/sched.h | 18 ++++++++++++++++++ 4 files changed, 49 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index c07815d230bc..dac8245ef588 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4392,6 +4392,22 @@ incurs a small amount of overhead in the scheduler but is useful for debugging and performance tuning. + sched_thermal_decay_shift= + [KNL, SMP] Set a decay shift for scheduler thermal + pressure signal. Thermal pressure signal follows the + default decay period of other scheduler pelt + signals(usually 32 ms but configurable). Setting + sched_thermal_decay_shift will left shift the decay + period for the thermal pressure signal by the shift + value. + i.e. with the default pelt decay period of 32 ms + sched_thermal_decay_shift thermal pressure decay pr + 1 64 ms + 2 128 ms + and so on. + Format: integer between 0 and 10 + Default is 0. + skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3e620fe8e3a0..4d76df33418e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3595,7 +3595,7 @@ void scheduler_tick(void) update_rq_clock(rq); thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); - update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure); + update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); curr->sched_class->task_tick(rq, curr, 0); calc_global_load_tick(rq); psi_task_tick(rq); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index aa51286c66da..79bb423de1e6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -86,6 +86,19 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL; const_debug unsigned int sysctl_sched_migration_cost = 500000UL; +int sched_thermal_decay_shift; +static int __init setup_sched_thermal_decay_shift(char *str) +{ + int _shift = 0; + + if (kstrtoint(str, 0, &_shift)) + pr_warn("Unable to set scheduler thermal pressure decay shift parameter\n"); + + sched_thermal_decay_shift = clamp(_shift, 0, 10); + return 1; +} +__setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift); + #ifdef CONFIG_SMP /* * For asym packing, by default the lower numbered CPU has higher priority. @@ -7760,7 +7773,7 @@ static bool __update_blocked_others(struct rq *rq, bool *done) decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | - update_thermal_load_avg(rq_clock_task(rq), rq, thermal_pressure) | + update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) | update_irq_load_avg(rq, 0); if (others_have_blocked(rq)) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6c839f829a25..7f1a85bd540d 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1127,6 +1127,24 @@ static inline u64 rq_clock_task(struct rq *rq) return rq->clock_task; } +/** + * By default the decay is the default pelt decay period. + * The decay shift can change the decay period in + * multiples of 32. + * Decay shift Decay period(ms) + * 0 32 + * 1 64 + * 2 128 + * 3 256 + * 4 512 + */ +extern int sched_thermal_decay_shift; + +static inline u64 rq_clock_thermal(struct rq *rq) +{ + return rq_clock_task(rq) >> sched_thermal_decay_shift; +} + static inline void rq_clock_skip_update(struct rq *rq) { lockdep_assert_held(&rq->lock); -- cgit v1.2.3-71-gd317 From 76c389ab2b5e300698eab87f9d4b7916f14117ba Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Tue, 3 Mar 2020 11:02:57 +0000 Subject: sched/fair: Fix kernel build warning in test_idle_cores() for !SMT NUMA Building against the tip/sched/core as ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") with the arm64 defconfig (which doesn't have CONFIG_SCHED_SMT set) leads to: kernel/sched/fair.c:1525:20: warning: 'test_idle_cores' declared 'static' but never defined [-Wunused-function] static inline bool test_idle_cores(int cpu, bool def); ^~~~~~~~~~~~~~~ Rather than define it in its own CONFIG_SCHED_SMT #define island, bunch it up with test_idle_cores(). Reported-by: Anshuman Khandual Reported-by: Naresh Kamboju Reviewed-by: Lukasz Luba [mgorman@techsingularity.net: Edit changelog, minor style change] Signed-off-by: Valentin Schneider Signed-off-by: Mel Gorman Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") Link: https://lkml.kernel.org/r/20200303110258.1092-3-mgorman@techsingularity.net --- kernel/sched/fair.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 79bb423de1e6..bba945277205 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1533,9 +1533,6 @@ static inline bool is_core_idle(int cpu) return true; } -/* Forward declarations of select_idle_sibling helpers */ -static inline bool test_idle_cores(int cpu, bool def); - struct task_numa_env { struct task_struct *p; @@ -1571,9 +1568,11 @@ numa_type numa_classify(unsigned int imbalance_pct, return node_fully_busy; } +#ifdef CONFIG_SCHED_SMT +/* Forward declarations of select_idle_sibling helpers */ +static inline bool test_idle_cores(int cpu, bool def); static inline int numa_idle_core(int idle_core, int cpu) { -#ifdef CONFIG_SCHED_SMT if (!static_branch_likely(&sched_smt_present) || idle_core >= 0 || !test_idle_cores(cpu, false)) return idle_core; @@ -1584,10 +1583,15 @@ static inline int numa_idle_core(int idle_core, int cpu) */ if (is_core_idle(cpu)) idle_core = cpu; -#endif return idle_core; } +#else +static inline int numa_idle_core(int idle_core, int cpu) +{ + return idle_core; +} +#endif /* * Gather all necessary information to make NUMA balancing placement -- cgit v1.2.3-71-gd317 From 0621df315402dd7bc56f7272fae9778701289825 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 27 Feb 2020 19:18:04 +0000 Subject: sched/numa: Acquire RCU lock for checking idle cores during NUMA balancing Qian Cai reported the following bug: The linux-next commit ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") introduced a boot warning, [ 86.520534][ T1] WARNING: suspicious RCU usage [ 86.520540][ T1] 5.6.0-rc3-next-20200227 #7 Not tainted [ 86.520545][ T1] ----------------------------- [ 86.520551][ T1] kernel/sched/fair.c:5914 suspicious rcu_dereference_check() usage! [ 86.520555][ T1] [ 86.520555][ T1] other info that might help us debug this: [ 86.520555][ T1] [ 86.520561][ T1] [ 86.520561][ T1] rcu_scheduler_active = 2, debug_locks = 1 [ 86.520567][ T1] 1 lock held by systemd/1: [ 86.520571][ T1] #0: ffff8887f4b14848 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x1d2/0x998 [ 86.520594][ T1] [ 86.520594][ T1] stack backtrace: [ 86.520602][ T1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.6.0-rc3-next-20200227 #7 task_numa_migrate() checks for idle cores when updating NUMA-related statistics. This relies on reading a RCU-protected structure in test_idle_cores() via this call chain task_numa_migrate -> update_numa_stats -> numa_idle_core -> test_idle_cores While the locking could be fine-grained, it is more appropriate to acquire the RCU lock for the entire scan of the domain. This patch removes the warning triggered at boot time. Reported-by: Qian Cai Reviewed-by: Paul E. McKenney Signed-off-by: Mel Gorman Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: ff7db0bf24db ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks") Link: https://lkml.kernel.org/r/20200227191804.GJ3818@techsingularity.net --- kernel/sched/fair.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bba945277205..3887b7323abd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1608,6 +1608,7 @@ static void update_numa_stats(struct task_numa_env *env, memset(ns, 0, sizeof(*ns)); ns->idle_cpu = -1; + rcu_read_lock(); for_each_cpu(cpu, cpumask_of_node(nid)) { struct rq *rq = cpu_rq(cpu); @@ -1627,6 +1628,7 @@ static void update_numa_stats(struct task_numa_env *env, idle_core = numa_idle_core(idle_core, cpu); } } + rcu_read_unlock(); ns->weight = cpumask_weight(cpumask_of_node(nid)); -- cgit v1.2.3-71-gd317 From 38502ab4bf3c463081bfd53356980a9ec2f32d1d Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Thu, 27 Feb 2020 19:14:32 +0000 Subject: sched/topology: Don't enable EAS on SMT systems EAS already requires asymmetric CPU capacities to be enabled, and mixing this with SMT is an aberration, but better be safe than sorry. Reviewed-by: Dietmar Eggemann Acked-by: Quentin Perret Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200227191433.31994-2-valentin.schneider@arm.com --- kernel/sched/topology.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 00911884b7e7..8344757bba6e 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -317,8 +317,9 @@ static void sched_energy_set(bool has_eas) * EAS can be used on a root domain if it meets all the following conditions: * 1. an Energy Model (EM) is available; * 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy. - * 3. the EM complexity is low enough to keep scheduling overheads low; - * 4. schedutil is driving the frequency of all CPUs of the rd; + * 3. no SMT is detected. + * 4. the EM complexity is low enough to keep scheduling overheads low; + * 5. schedutil is driving the frequency of all CPUs of the rd; * * The complexity of the Energy Model is defined as: * @@ -360,6 +361,13 @@ static bool build_perf_domains(const struct cpumask *cpu_map) goto free; } + /* EAS definitely does *not* handle SMT */ + if (sched_smt_active()) { + pr_warn("rd %*pbl: Disabling EAS, SMT is not supported\n", + cpumask_pr_args(cpu_map)); + goto free; + } + for_each_cpu(i, cpu_map) { /* Skip already covered CPUs. */ if (find_pd(pd, i)) -- cgit v1.2.3-71-gd317 From ba4f7bc1dee318a0fd9c0e3bd46227aca21ac2f2 Mon Sep 17 00:00:00 2001 From: Yu Chen Date: Fri, 28 Feb 2020 18:03:29 +0800 Subject: sched/deadline: Make two functions static Since commit 06a76fe08d4 ("sched/deadline: Move DL related code from sched/core.c to sched/deadline.c"), DL related code moved to deadline.c. Make the following two functions static since they're only used in deadline.c: dl_change_utilization() init_dl_rq_bw_ratio() Signed-off-by: Yu Chen Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200228100329.16927-1-chen.yu@easystack.cn --- kernel/sched/deadline.c | 6 ++++-- kernel/sched/sched.h | 2 -- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 43323f875cb9..504d2f51b0d6 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -153,7 +153,7 @@ void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq) __sub_running_bw(dl_se->dl_bw, dl_rq); } -void dl_change_utilization(struct task_struct *p, u64 new_bw) +static void dl_change_utilization(struct task_struct *p, u64 new_bw) { struct rq *rq; @@ -334,6 +334,8 @@ static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq) return dl_rq->root.rb_leftmost == &dl_se->rb_node; } +static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq); + void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime) { raw_spin_lock_init(&dl_b->dl_runtime_lock); @@ -2496,7 +2498,7 @@ int sched_dl_global_validate(void) return ret; } -void init_dl_rq_bw_ratio(struct dl_rq *dl_rq) +static void init_dl_rq_bw_ratio(struct dl_rq *dl_rq) { if (global_rt_runtime() == RUNTIME_INF) { dl_rq->bw_ratio = 1 << RATIO_SHIFT; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7f1a85bd540d..9e173fad0425 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -305,7 +305,6 @@ bool __dl_overflow(struct dl_bw *dl_b, int cpus, u64 old_bw, u64 new_bw) dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw; } -extern void dl_change_utilization(struct task_struct *p, u64 new_bw); extern void init_dl_bw(struct dl_bw *dl_b); extern int sched_dl_global_validate(void); extern void sched_dl_do_global(void); @@ -1905,7 +1904,6 @@ extern struct dl_bandwidth def_dl_bandwidth; extern void init_dl_bandwidth(struct dl_bandwidth *dl_b, u64 period, u64 runtime); extern void init_dl_task_timer(struct sched_dl_entity *dl_se); extern void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se); -extern void init_dl_rq_bw_ratio(struct dl_rq *dl_rq); #define BW_SHIFT 20 #define BW_UNIT (1 << BW_SHIFT) -- cgit v1.2.3-71-gd317 From 6212437f0f6043e825e021e4afc5cd63e248a2b4 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Thu, 27 Feb 2020 16:41:15 +0100 Subject: sched/fair: Fix runnable_avg for throttled cfs When a cfs_rq is throttled, its group entity is dequeued and its running tasks are removed. We must update runnable_avg with the old h_nr_running and update group_se->runnable_weight with the new h_nr_running at each level of the hierarchy. Reviewed-by: Ben Segall Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 9f68395333ad ("sched/pelt: Add a new runnable average signal") Link: https://lkml.kernel.org/r/20200227154115.8332-1-vincent.guittot@linaro.org --- kernel/sched/fair.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3887b7323abd..54bd6280676e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4720,8 +4720,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq) if (!se->on_rq) break; - if (dequeue) + if (dequeue) { dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP); + } else { + update_load_avg(qcfs_rq, se, 0); + se_update_runnable(se); + } + qcfs_rq->h_nr_running -= task_delta; qcfs_rq->idle_h_nr_running -= idle_task_delta; @@ -4789,8 +4794,13 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) enqueue = 0; cfs_rq = cfs_rq_of(se); - if (enqueue) + if (enqueue) { enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP); + } else { + update_load_avg(cfs_rq, se, 0); + se_update_runnable(se); + } + cfs_rq->h_nr_running += task_delta; cfs_rq->idle_h_nr_running += idle_task_delta; -- cgit v1.2.3-71-gd317 From 5ab297bab984310267734dfbcc8104566658ebef Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Fri, 6 Mar 2020 09:42:08 +0100 Subject: sched/fair: Fix reordering of enqueue/dequeue_task_fair() Even when a cgroup is throttled, the group se of a child cgroup can still be enqueued and its gse->on_rq stays true. When a task is enqueued on such child, we still have to update the load_avg and increase h_nr_running of the throttled cfs. Nevertheless, the 1st for_each_sched_entity() loop is skipped because of gse->on_rq == true and the 2nd loop because the cfs is throttled whereas we have to update both load_avg with the old h_nr_running and increase h_nr_running in such case. The same sequence can happen during dequeue when se moves to parent before breaking in the 1st loop. Note that the update of load_avg will effectively happen only once in order to sync up to the throttled time. Next call for updating load_avg will stop early because the clock stays unchanged. Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 6d4d22468dae ("sched/fair: Reorder enqueue/dequeue_task_fair path") Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org --- kernel/sched/fair.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 54bd6280676e..1dea8554ead0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5460,16 +5460,16 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - /* end evaluation on encountering a throttled cfs_rq */ - if (cfs_rq_throttled(cfs_rq)) - goto enqueue_throttle; - update_load_avg(cfs_rq, se, UPDATE_TG); se_update_runnable(se); update_cfs_group(se); cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; + + /* end evaluation on encountering a throttled cfs_rq */ + if (cfs_rq_throttled(cfs_rq)) + goto enqueue_throttle; } enqueue_throttle: @@ -5558,16 +5558,17 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - /* end evaluation on encountering a throttled cfs_rq */ - if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; - update_load_avg(cfs_rq, se, UPDATE_TG); se_update_runnable(se); update_cfs_group(se); cfs_rq->h_nr_running--; cfs_rq->idle_h_nr_running -= idle_h_nr_running; + + /* end evaluation on encountering a throttled cfs_rq */ + if (cfs_rq_throttled(cfs_rq)) + goto dequeue_throttle; + } dequeue_throttle: -- cgit v1.2.3-71-gd317 From d9cb236b9429044dc694ea70a50163ddd283cea6 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 2 Mar 2020 13:27:16 +0000 Subject: sched/rt: cpupri_find: Implement fallback mechanism for !fit case When searching for the best lowest_mask with a fitness_fn passed, make sure we record the lowest_level that returns a valid lowest_mask so that we can use that as a fallback in case we fail to find a fitting CPU at all levels. The intention in the original patch was not to allow a down migration to unfitting CPU. But this missed the case where we are already running on unfitting one. With this change now RT tasks can still move between unfitting CPUs when they're already running on such CPU. And as Steve suggested; to adhere to the strict priority rules of RT, if a task is already running on a fitting CPU but due to priority it can't run on it, allow it to downmigrate to unfitting CPU so it can run. Reported-by: Pavan Kondeti Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-2-qais.yousef@arm.com Link: https://lore.kernel.org/lkml/20200203142712.a7yvlyo2y3le5cpn@e107158-lin/ --- kernel/sched/cpupri.c | 157 ++++++++++++++++++++++++++++++++------------------ 1 file changed, 101 insertions(+), 56 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index 1a2719e1350a..1bcfa1995550 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -41,6 +41,59 @@ static int convert_prio(int prio) return cpupri; } +static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, + struct cpumask *lowest_mask, int idx) +{ + struct cpupri_vec *vec = &cp->pri_to_cpu[idx]; + int skip = 0; + + if (!atomic_read(&(vec)->count)) + skip = 1; + /* + * When looking at the vector, we need to read the counter, + * do a memory barrier, then read the mask. + * + * Note: This is still all racey, but we can deal with it. + * Ideally, we only want to look at masks that are set. + * + * If a mask is not set, then the only thing wrong is that we + * did a little more work than necessary. + * + * If we read a zero count but the mask is set, because of the + * memory barriers, that can only happen when the highest prio + * task for a run queue has left the run queue, in which case, + * it will be followed by a pull. If the task we are processing + * fails to find a proper place to go, that pull request will + * pull this task if the run queue is running at a lower + * priority. + */ + smp_rmb(); + + /* Need to do the rmb for every iteration */ + if (skip) + return 0; + + if (cpumask_any_and(p->cpus_ptr, vec->mask) >= nr_cpu_ids) + return 0; + + if (lowest_mask) { + cpumask_and(lowest_mask, p->cpus_ptr, vec->mask); + + /* + * We have to ensure that we have at least one bit + * still set in the array, since the map could have + * been concurrently emptied between the first and + * second reads of vec->mask. If we hit this + * condition, simply act as though we never hit this + * priority level and continue on. + */ + if (cpumask_empty(lowest_mask)) + return 0; + } + + return 1; +} + /** * cpupri_find - find the best (lowest-pri) CPU in the system * @cp: The cpupri context @@ -62,80 +115,72 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask, bool (*fitness_fn)(struct task_struct *p, int cpu)) { - int idx = 0; int task_pri = convert_prio(p->prio); + int best_unfit_idx = -1; + int idx = 0, cpu; BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES); for (idx = 0; idx < task_pri; idx++) { - struct cpupri_vec *vec = &cp->pri_to_cpu[idx]; - int skip = 0; - if (!atomic_read(&(vec)->count)) - skip = 1; - /* - * When looking at the vector, we need to read the counter, - * do a memory barrier, then read the mask. - * - * Note: This is still all racey, but we can deal with it. - * Ideally, we only want to look at masks that are set. - * - * If a mask is not set, then the only thing wrong is that we - * did a little more work than necessary. - * - * If we read a zero count but the mask is set, because of the - * memory barriers, that can only happen when the highest prio - * task for a run queue has left the run queue, in which case, - * it will be followed by a pull. If the task we are processing - * fails to find a proper place to go, that pull request will - * pull this task if the run queue is running at a lower - * priority. - */ - smp_rmb(); - - /* Need to do the rmb for every iteration */ - if (skip) - continue; - - if (cpumask_any_and(p->cpus_ptr, vec->mask) >= nr_cpu_ids) + if (!__cpupri_find(cp, p, lowest_mask, idx)) continue; - if (lowest_mask) { - int cpu; + if (!lowest_mask || !fitness_fn) + return 1; - cpumask_and(lowest_mask, p->cpus_ptr, vec->mask); + /* Ensure the capacity of the CPUs fit the task */ + for_each_cpu(cpu, lowest_mask) { + if (!fitness_fn(p, cpu)) + cpumask_clear_cpu(cpu, lowest_mask); + } + /* + * If no CPU at the current priority can fit the task + * continue looking + */ + if (cpumask_empty(lowest_mask)) { /* - * We have to ensure that we have at least one bit - * still set in the array, since the map could have - * been concurrently emptied between the first and - * second reads of vec->mask. If we hit this - * condition, simply act as though we never hit this - * priority level and continue on. + * Store our fallback priority in case we + * didn't find a fitting CPU */ - if (cpumask_empty(lowest_mask)) - continue; + if (best_unfit_idx == -1) + best_unfit_idx = idx; - if (!fitness_fn) - return 1; - - /* Ensure the capacity of the CPUs fit the task */ - for_each_cpu(cpu, lowest_mask) { - if (!fitness_fn(p, cpu)) - cpumask_clear_cpu(cpu, lowest_mask); - } - - /* - * If no CPU at the current priority can fit the task - * continue looking - */ - if (cpumask_empty(lowest_mask)) - continue; + continue; } return 1; } + /* + * If we failed to find a fitting lowest_mask, make sure we fall back + * to the last known unfitting lowest_mask. + * + * Note that the map of the recorded idx might have changed since then, + * so we must ensure to do the full dance to make sure that level still + * holds a valid lowest_mask. + * + * As per above, the map could have been concurrently emptied while we + * were busy searching for a fitting lowest_mask at the other priority + * levels. + * + * This rule favours honouring priority over fitting the task in the + * correct CPU (Capacity Awareness being the only user now). + * The idea is that if a higher priority task can run, then it should + * run even if this ends up being on unfitting CPU. + * + * The cost of this trade-off is not entirely clear and will probably + * be good for some workloads and bad for others. + * + * The main idea here is that if some CPUs were overcommitted, we try + * to spread which is what the scheduler traditionally did. Sys admins + * must do proper RT planning to avoid overloading the system if they + * really care. + */ + if (best_unfit_idx != -1) + return __cpupri_find(cp, p, lowest_mask, best_unfit_idx); + return 0; } -- cgit v1.2.3-71-gd317 From b28bc1e002c23ff8a4999c4a2fb1d4d412bc6f5e Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 2 Mar 2020 13:27:17 +0000 Subject: sched/rt: Re-instate old behavior in select_task_rq_rt() When RT Capacity Aware support was added, the logic in select_task_rq_rt was modified to force a search for a fitting CPU if the task currently doesn't run on one. But if the search failed, and the search was only triggered to fulfill the fitness request; we could end up selecting a new CPU unnecessarily. Fix this and re-instate the original behavior by ensuring we bail out in that case. This behavior change only affected asymmetric systems that are using util_clamp to implement capacity aware. None asymmetric systems weren't affected. LINK: https://lore.kernel.org/lkml/20200218041620.GD28029@codeaurora.org/ Reported-by: Pavan Kondeti Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-3-qais.yousef@arm.com --- kernel/sched/rt.c | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 55a4a5042292..f0071fa01c1d 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1474,6 +1474,13 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags) if (test || !rt_task_fits_capacity(p, cpu)) { int target = find_lowest_rq(p); + /* + * Bail out if we were forcing a migration to find a better + * fitting CPU but our search failed. + */ + if (!test && target != -1 && !rt_task_fits_capacity(p, target)) + goto out_unlock; + /* * Don't bother moving it if the destination CPU is * not running a lower priority task. @@ -1482,6 +1489,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags) p->prio < cpu_rq(target)->rt.highest_prio.curr) cpu = target; } + +out_unlock: rcu_read_unlock(); out: -- cgit v1.2.3-71-gd317 From a1bd02e1f28b1939cac8c64072a0e578c3cbc345 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 2 Mar 2020 13:27:18 +0000 Subject: sched/rt: Optimize cpupri_find() on non-heterogenous systems By introducing a new cpupri_find_fitness() function that takes the fitness_fn as an argument and only called when asym_system static key is enabled. cpupri_find() is now a wrapper function that calls cpupri_find_fitness() passing NULL as a fitness_fn, hence disabling the logic that handles fitness by default. LINK: https://lore.kernel.org/lkml/c0772fca-0a4b-c88d-fdf2-5715fcf8447b@arm.com/ Reported-by: Dietmar Eggemann Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-4-qais.yousef@arm.com --- kernel/sched/cpupri.c | 10 ++++++++-- kernel/sched/cpupri.h | 6 ++++-- kernel/sched/rt.c | 23 +++++++++++++++++++---- 3 files changed, 31 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index 1bcfa1995550..dd3f16d1a04a 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -94,8 +94,14 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, return 1; } +int cpupri_find(struct cpupri *cp, struct task_struct *p, + struct cpumask *lowest_mask) +{ + return cpupri_find_fitness(cp, p, lowest_mask, NULL); +} + /** - * cpupri_find - find the best (lowest-pri) CPU in the system + * cpupri_find_fitness - find the best (lowest-pri) CPU in the system * @cp: The cpupri context * @p: The task * @lowest_mask: A mask to fill in with selected CPUs (or NULL) @@ -111,7 +117,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, * * Return: (int)bool - CPUs were found */ -int cpupri_find(struct cpupri *cp, struct task_struct *p, +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask, bool (*fitness_fn)(struct task_struct *p, int cpu)) { diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h index 32dd520db11f..efbb492bb94c 100644 --- a/kernel/sched/cpupri.h +++ b/kernel/sched/cpupri.h @@ -19,8 +19,10 @@ struct cpupri { #ifdef CONFIG_SMP int cpupri_find(struct cpupri *cp, struct task_struct *p, - struct cpumask *lowest_mask, - bool (*fitness_fn)(struct task_struct *p, int cpu)); + struct cpumask *lowest_mask); +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, + struct cpumask *lowest_mask, + bool (*fitness_fn)(struct task_struct *p, int cpu)); void cpupri_set(struct cpupri *cp, int cpu, int pri); int cpupri_init(struct cpupri *cp); void cpupri_cleanup(struct cpupri *cp); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index f0071fa01c1d..29a86959bda4 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1504,7 +1504,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) * let's hope p can move out. */ if (rq->curr->nr_cpus_allowed == 1 || - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL, NULL)) + !cpupri_find(&rq->rd->cpupri, rq->curr, NULL)) return; /* @@ -1512,7 +1512,7 @@ static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) * see if it is pushed or pulled somewhere else. */ if (p->nr_cpus_allowed != 1 && - cpupri_find(&rq->rd->cpupri, p, NULL, NULL)) + cpupri_find(&rq->rd->cpupri, p, NULL)) return; /* @@ -1691,6 +1691,7 @@ static int find_lowest_rq(struct task_struct *task) struct cpumask *lowest_mask = this_cpu_cpumask_var_ptr(local_cpu_mask); int this_cpu = smp_processor_id(); int cpu = task_cpu(task); + int ret; /* Make sure the mask is initialized first */ if (unlikely(!lowest_mask)) @@ -1699,8 +1700,22 @@ static int find_lowest_rq(struct task_struct *task) if (task->nr_cpus_allowed == 1) return -1; /* No other targets possible */ - if (!cpupri_find(&task_rq(task)->rd->cpupri, task, lowest_mask, - rt_task_fits_capacity)) + /* + * If we're on asym system ensure we consider the different capacities + * of the CPUs when searching for the lowest_mask. + */ + if (static_branch_unlikely(&sched_asym_cpucapacity)) { + + ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri, + task, lowest_mask, + rt_task_fits_capacity); + } else { + + ret = cpupri_find(&task_rq(task)->rd->cpupri, + task, lowest_mask); + } + + if (!ret) return -1; /* No targets found */ /* -- cgit v1.2.3-71-gd317 From 98ca645f824301bde72e0a51cdc8bdbbea6774a5 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 2 Mar 2020 13:27:19 +0000 Subject: sched/rt: Allow pulling unfitting task When implemented RT Capacity Awareness; the logic was done such that if a task was running on a fitting CPU, then it was sticky and we would try our best to keep it there. But as Steve suggested, to adhere to the strict priority rules of RT class; allow pulling an RT task to unfitting CPU to ensure it gets a chance to run ASAP. LINK: https://lore.kernel.org/lkml/20200203111451.0d1da58f@oasis.local.home/ Suggested-by: Steven Rostedt Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-5-qais.yousef@arm.com --- kernel/sched/rt.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 29a86959bda4..bcb143661a86 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1656,8 +1656,7 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) { if (!task_running(rq, p) && - cpumask_test_cpu(cpu, p->cpus_ptr) && - rt_task_fits_capacity(p, cpu)) + cpumask_test_cpu(cpu, p->cpus_ptr)) return 1; return 0; -- cgit v1.2.3-71-gd317 From d94a9df49069ba8ff7c4aaeca1229e6471a01a15 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 2 Mar 2020 13:27:20 +0000 Subject: sched/rt: Remove unnecessary push for unfit tasks In task_woken_rt() and switched_to_rto() we try trigger push-pull if the task is unfit. But the logic is found lacking because if the task was the only one running on the CPU, then rt_rq is not in overloaded state and won't trigger a push. The necessity of this logic was under a debate as well, a summary of the discussion can be found in the following thread: https://lore.kernel.org/lkml/20200226160247.iqvdakiqbakk2llz@e107158-lin.cambridge.arm.com/ Remove the logic for now until a better approach is agreed upon. Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Fixes: 804d402fb6f6 ("sched/rt: Make RT capacity-aware") Link: https://lkml.kernel.org/r/20200302132721.8353-6-qais.yousef@arm.com --- kernel/sched/rt.c | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index bcb143661a86..df11d88c9895 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -2225,7 +2225,7 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p) (rq->curr->nr_cpus_allowed < 2 || rq->curr->prio <= p->prio); - if (need_to_push || !rt_task_fits_capacity(p, cpu_of(rq))) + if (need_to_push) push_rt_tasks(rq); } @@ -2297,10 +2297,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p) */ if (task_on_rq_queued(p) && rq->curr != p) { #ifdef CONFIG_SMP - bool need_to_push = rq->rt.overloaded || - !rt_task_fits_capacity(p, cpu_of(rq)); - - if (p->nr_cpus_allowed > 1 && need_to_push) + if (p->nr_cpus_allowed > 1 && rq->rt.overloaded) rt_queue_push_tasks(rq); #endif /* CONFIG_SMP */ if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq))) -- cgit v1.2.3-71-gd317 From 5a18ceca63502546d6c0cab1f3f79cb6900f947a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 16 Dec 2019 16:31:23 -0500 Subject: smp: Allow smp_call_function_single_async() to insert locked csd Previously we will raise an warning if we want to insert a csd object which is with the LOCK flag set, and if it happens we'll also wait for the lock to be released. However, this operation does not match perfectly with how the function is named - the name with "_async" suffix hints that this function should not block, while we will. This patch changed this behavior by simply return -EBUSY instead of waiting, at the meantime we allow this operation to happen without warning the user to change this into a feature when the caller wants to "insert a csd object, if it's there, just wait for that one". This is pretty safe because in flush_smp_call_function_queue() for async csd objects (where csd->flags&SYNC is zero) we'll first do the unlock then we call the csd->func(). So if we see the csd->flags&LOCK is true in smp_call_function_single_async(), then it's guaranteed that csd->func() will be called after this smp_call_function_single_async() returns -EBUSY. Update the comment of the function too to refect this. Signed-off-by: Peter Xu Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20191216213125.9536-2-peterx@redhat.com --- kernel/smp.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/smp.c b/kernel/smp.c index d0ada39eb4d4..97f1d9765c94 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -329,6 +329,11 @@ EXPORT_SYMBOL(smp_call_function_single); * (ie: embedded in an object) and is responsible for synchronizing it * such that the IPIs performed on the @csd are strictly serialized. * + * If the function is called with one csd which has not yet been + * processed by previous call to smp_call_function_single_async(), the + * function will return immediately with -EBUSY showing that the csd + * object is still in progress. + * * NOTE: Be careful, there is unfortunately no current debugging facility to * validate the correctness of this serialization. */ @@ -338,14 +343,17 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd) preempt_disable(); - /* We could deadlock if we have to wait here with interrupts disabled! */ - if (WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK)) - csd_lock_wait(csd); + if (csd->flags & CSD_FLAG_LOCK) { + err = -EBUSY; + goto out; + } csd->flags = CSD_FLAG_LOCK; smp_wmb(); err = generic_exec_single(cpu, csd, csd->func, csd->info); + +out: preempt_enable(); return err; -- cgit v1.2.3-71-gd317 From fd3eafda8f146d4ad8f95f91a8c2b9a5319ff6b2 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Mon, 16 Dec 2019 16:31:25 -0500 Subject: sched/core: Remove rq.hrtick_csd_pending Now smp_call_function_single_async() provides the protection that we'll return with -EBUSY if the csd object is still pending, then we don't need the rq.hrtick_csd_pending any more. Signed-off-by: Peter Xu Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20191216213125.9536-4-peterx@redhat.com --- kernel/sched/core.c | 9 ++------- kernel/sched/sched.h | 1 - 2 files changed, 2 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1a9983da4408..b70ec3814ca5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -269,7 +269,6 @@ static void __hrtick_start(void *arg) rq_lock(rq, &rf); __hrtick_restart(rq); - rq->hrtick_csd_pending = 0; rq_unlock(rq, &rf); } @@ -293,12 +292,10 @@ void hrtick_start(struct rq *rq, u64 delay) hrtimer_set_expires(timer, time); - if (rq == this_rq()) { + if (rq == this_rq()) __hrtick_restart(rq); - } else if (!rq->hrtick_csd_pending) { + else smp_call_function_single_async(cpu_of(rq), &rq->hrtick_csd); - rq->hrtick_csd_pending = 1; - } } #else @@ -322,8 +319,6 @@ void hrtick_start(struct rq *rq, u64 delay) static void hrtick_rq_init(struct rq *rq) { #ifdef CONFIG_SMP - rq->hrtick_csd_pending = 0; - rq->hrtick_csd.flags = 0; rq->hrtick_csd.func = __hrtick_start; rq->hrtick_csd.info = rq; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9ea647835fd6..38e60b84e3b4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -967,7 +967,6 @@ struct rq { #ifdef CONFIG_SCHED_HRTICK #ifdef CONFIG_SMP - int hrtick_csd_pending; call_single_data_t hrtick_csd; #endif struct hrtimer hrtick_timer; -- cgit v1.2.3-71-gd317 From 14533a16c46db70b8a75eda8fa633c25ac446d81 Mon Sep 17 00:00:00 2001 From: Ingo Molnar Date: Fri, 6 Mar 2020 14:26:31 +0100 Subject: thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code drivers/base/arch_topology.c is only built if CONFIG_GENERIC_ARCH_TOPOLOGY=y, resulting in such build failures: cpufreq_cooling.c:(.text+0x1e7): undefined reference to `arch_set_thermal_pressure' Move it to sched/core.c instead, and keep it enabled on x86 despite us not having a arch_scale_thermal_pressure() facility there, to build-test this thing. Cc: Thara Gopinath Cc: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar --- drivers/base/arch_topology.c | 11 ----------- kernel/sched/core.c | 11 +++++++++++ 2 files changed, 11 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 68dfa49d3b63..6119e11a9f95 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -42,17 +42,6 @@ void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity) per_cpu(cpu_scale, cpu) = capacity; } -DEFINE_PER_CPU(unsigned long, thermal_pressure); - -void arch_set_thermal_pressure(struct cpumask *cpus, - unsigned long th_pressure) -{ - int cpu; - - for_each_cpu(cpu, cpus) - WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure); -} - static ssize_t cpu_capacity_show(struct device *dev, struct device_attribute *attr, char *buf) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 4d76df33418e..978bf6f6e0ff 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3576,6 +3576,17 @@ unsigned long long task_sched_runtime(struct task_struct *p) return ns; } +DEFINE_PER_CPU(unsigned long, thermal_pressure); + +void arch_set_thermal_pressure(struct cpumask *cpus, + unsigned long th_pressure) +{ + int cpu; + + for_each_cpu(cpu, cpus) + WRITE_ONCE(per_cpu(thermal_pressure, cpu), th_pressure); +} + /* * This function gets called by the timer code, with HZ frequency. * We call it with interrupts disabled. -- cgit v1.2.3-71-gd317 From eaee41727e6d8a7d2b94421c25e82b00cb2fded5 Mon Sep 17 00:00:00 2001 From: Dmitry Safonov Date: Mon, 2 Mar 2020 17:51:34 +0000 Subject: sysctl/sysrq: Remove __sysrq_enabled copy Many embedded boards have a disconnected TTL level serial which can generate some garbage that can lead to spurious false sysrq detects. Currently, sysrq can be either completely disabled for serial console or always disabled (with CONFIG_MAGIC_SYSRQ_SERIAL), since commit 732dbf3a6104 ("serial: do not accept sysrq characters via serial port") At Arista, we have such boards that can generate BREAK and random garbage. While disabling sysrq for serial console would solve the problem with spurious false sysrq triggers, it's also desirable to have a way to enable sysrq back. Having the way to enable sysrq was beneficial to debug lockups with a manual investigation in field and on the other side preventing false sysrq detections. As a preparation to add sysrq_toggle_support() call into uart, remove a private copy of sysrq_enabled from sysctl - it should reflect the actual status of sysrq. Furthermore, the private copy isn't correct already in case sysrq_always_enabled is true. So, remove __sysrq_enabled and use a getter-helper sysrq_mask() to check sysrq_key_op enabled status. Cc: Iurii Zaikin Cc: Jiri Slaby Cc: Luis Chamberlain Cc: Kees Cook Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Dmitry Safonov Link: https://lore.kernel.org/r/20200302175135.269397-2-dima@arista.com Signed-off-by: Greg Kroah-Hartman --- drivers/tty/sysrq.c | 12 ++++++++++++ include/linux/sysrq.h | 7 +++++++ kernel/sysctl.c | 41 ++++++++++++++++++++++------------------- 3 files changed, 41 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c index f724962a5906..5e0d0813da55 100644 --- a/drivers/tty/sysrq.c +++ b/drivers/tty/sysrq.c @@ -63,6 +63,18 @@ static bool sysrq_on(void) return sysrq_enabled || sysrq_always_enabled; } +/** + * sysrq_mask - Getter for sysrq_enabled mask. + * + * Return: 1 if sysrq is always enabled, enabled sysrq_key_op mask otherwise. + */ +int sysrq_mask(void) +{ + if (sysrq_always_enabled) + return 1; + return sysrq_enabled; +} + /* * A value of 1 means 'all', other nonzero values are an op mask: */ diff --git a/include/linux/sysrq.h b/include/linux/sysrq.h index 8c71874e8485..8e159e16850f 100644 --- a/include/linux/sysrq.h +++ b/include/linux/sysrq.h @@ -50,6 +50,7 @@ int unregister_sysrq_key(int key, struct sysrq_key_op *op); struct sysrq_key_op *__sysrq_get_key_op(int key); int sysrq_toggle_support(int enable_mask); +int sysrq_mask(void); #else @@ -71,6 +72,12 @@ static inline int unregister_sysrq_key(int key, struct sysrq_key_op *op) return -EINVAL; } +static inline int sysrq_mask(void) +{ + /* Magic SysRq disabled mask */ + return 0; +} + #endif #endif /* _LINUX_SYSRQ_H */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index ad5b88a53c5a..94638f695e60 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -229,25 +229,8 @@ static int proc_dopipe_max_size(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); #ifdef CONFIG_MAGIC_SYSRQ -/* Note: sysrq code uses its own private copy */ -static int __sysrq_enabled = CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE; - static int sysrq_sysctl_handler(struct ctl_table *table, int write, - void __user *buffer, size_t *lenp, - loff_t *ppos) -{ - int error; - - error = proc_dointvec(table, write, buffer, lenp, ppos); - if (error) - return error; - - if (write) - sysrq_toggle_support(__sysrq_enabled); - - return 0; -} - + void __user *buffer, size_t *lenp, loff_t *ppos); #endif static struct ctl_table kern_table[]; @@ -747,7 +730,7 @@ static struct ctl_table kern_table[] = { #ifdef CONFIG_MAGIC_SYSRQ { .procname = "sysrq", - .data = &__sysrq_enabled, + .data = NULL, .maxlen = sizeof (int), .mode = 0644, .proc_handler = sysrq_sysctl_handler, @@ -2835,6 +2818,26 @@ static int proc_dostring_coredump(struct ctl_table *table, int write, } #endif +#ifdef CONFIG_MAGIC_SYSRQ +static int sysrq_sysctl_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int tmp, ret; + + tmp = sysrq_mask(); + + ret = __do_proc_dointvec(&tmp, table, write, buffer, + lenp, ppos, NULL, NULL); + if (ret || !write) + return ret; + + if (write) + sysrq_toggle_support(tmp); + + return 0; +} +#endif + static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos, -- cgit v1.2.3-71-gd317 From b513df6780ec0265f6c6d8d618d43bd55fd64691 Mon Sep 17 00:00:00 2001 From: luanshi Date: Tue, 3 Mar 2020 09:48:45 +0800 Subject: irqdomain: Fix function documentation of __irq_domain_alloc_fwnode() The function got renamed at some point, but the kernel-doc was not updated. Signed-off-by: luanshi Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/1583200125-58806-1-git-send-email-zhangliguang@linux.alibaba.com --- kernel/irq/irqdomain.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 7527e5ef6fe5..fdfc2132d2a9 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -46,11 +46,11 @@ const struct fwnode_operations irqchip_fwnode_ops; EXPORT_SYMBOL_GPL(irqchip_fwnode_ops); /** - * irq_domain_alloc_fwnode - Allocate a fwnode_handle suitable for + * __irq_domain_alloc_fwnode - Allocate a fwnode_handle suitable for * identifying an irq domain * @type: Type of irqchip_fwnode. See linux/irqdomain.h - * @name: Optional user provided domain name * @id: Optional user provided id if name != NULL + * @name: Optional user provided domain name * @pa: Optional user-provided physical address * * Allocate a struct irqchip_fwid, and return a poiner to the embedded -- cgit v1.2.3-71-gd317 From a740a423c36932695b01a3e920f697bc55b05fec Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 6 Mar 2020 14:03:42 +0100 Subject: genirq/debugfs: Add missing sanity checks to interrupt injection Interrupts cannot be injected when the interrupt is not activated and when a replay is already in progress. Fixes: 536e2e34bd00 ("genirq/debugfs: Triggering of interrupts from userspace") Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200306130623.500019114@linutronix.de --- kernel/irq/debugfs.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index a949bd39e343..d44c8fd17609 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -206,8 +206,15 @@ static ssize_t irq_debug_write(struct file *file, const char __user *user_buf, chip_bus_lock(desc); raw_spin_lock_irqsave(&desc->lock, flags); - if (irq_settings_is_level(desc) || desc->istate & IRQS_NMI) { - /* Can't do level nor NMIs, sorry */ + /* + * Don't allow injection when the interrupt is: + * - Level or NMI type + * - not activated + * - replaying already + */ + if (irq_settings_is_level(desc) || + !irqd_is_activated(&desc->irq_data) || + (desc->istate & (IRQS_NMI | IRQS_REPLAY))) { err = -EINVAL; } else { desc->istate |= IRQS_PENDING; -- cgit v1.2.3-71-gd317 From c16816acd08697b02a53f56f8936497a9f6f6e7a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 6 Mar 2020 14:03:43 +0100 Subject: genirq: Add protection against unsafe usage of generic_handle_irq() In general calling generic_handle_irq() with interrupts disabled from non interrupt context is harmless. For some interrupt controllers like the x86 trainwrecks this is outright dangerous as it might corrupt state if an interrupt affinity change is pending. Add infrastructure which allows to mark interrupts as unsafe and catch such usage in generic_handle_irq(). Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Link: https://lkml.kernel.org/r/20200306130623.590923677@linutronix.de --- include/linux/irq.h | 13 +++++++++++++ kernel/irq/internals.h | 8 ++++++++ kernel/irq/irqdesc.c | 6 ++++++ kernel/irq/resend.c | 5 +++-- 4 files changed, 30 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/irq.h b/include/linux/irq.h index 3ed5a055b5f4..9315fbb87db3 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -211,6 +211,8 @@ struct irq_data { * IRQD_CAN_RESERVE - Can use reservation mode * IRQD_MSI_NOMASK_QUIRK - Non-maskable MSI quirk for affinity change * required + * IRQD_HANDLE_ENFORCE_IRQCTX - Enforce that handle_irq_*() is only invoked + * from actual interrupt context. */ enum { IRQD_TRIGGER_MASK = 0xf, @@ -234,6 +236,7 @@ enum { IRQD_DEFAULT_TRIGGER_SET = (1 << 25), IRQD_CAN_RESERVE = (1 << 26), IRQD_MSI_NOMASK_QUIRK = (1 << 27), + IRQD_HANDLE_ENFORCE_IRQCTX = (1 << 28), }; #define __irqd_to_state(d) ACCESS_PRIVATE((d)->common, state_use_accessors) @@ -303,6 +306,16 @@ static inline bool irqd_is_single_target(struct irq_data *d) return __irqd_to_state(d) & IRQD_SINGLE_TARGET; } +static inline void irqd_set_handle_enforce_irqctx(struct irq_data *d) +{ + __irqd_to_state(d) |= IRQD_HANDLE_ENFORCE_IRQCTX; +} + +static inline bool irqd_is_handle_enforce_irqctx(struct irq_data *d) +{ + return __irqd_to_state(d) & IRQD_HANDLE_ENFORCE_IRQCTX; +} + static inline bool irqd_is_wakeup_set(struct irq_data *d) { return __irqd_to_state(d) & IRQD_WAKEUP_STATE; diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index c9d8eb7f5c02..5be382fc9415 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -425,6 +425,10 @@ static inline struct cpumask *irq_desc_get_pending_mask(struct irq_desc *desc) { return desc->pending_mask; } +static inline bool handle_enforce_irqctx(struct irq_data *data) +{ + return irqd_is_handle_enforce_irqctx(data); +} bool irq_fixup_move_pending(struct irq_desc *desc, bool force_clear); #else /* CONFIG_GENERIC_PENDING_IRQ */ static inline bool irq_can_move_pcntxt(struct irq_data *data) @@ -451,6 +455,10 @@ static inline bool irq_fixup_move_pending(struct irq_desc *desc, bool fclear) { return false; } +static inline bool handle_enforce_irqctx(struct irq_data *data) +{ + return false; +} #endif /* !CONFIG_GENERIC_PENDING_IRQ */ #if !defined(CONFIG_IRQ_DOMAIN) || !defined(CONFIG_IRQ_DOMAIN_HIERARCHY) diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 98a5f10d1900..1a7723604399 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -638,9 +638,15 @@ void irq_init_desc(unsigned int irq) int generic_handle_irq(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); + struct irq_data *data; if (!desc) return -EINVAL; + + data = irq_desc_get_irq_data(desc); + if (WARN_ON_ONCE(!in_irq() && handle_enforce_irqctx(data))) + return -EPERM; + generic_handle_irq_desc(desc); return 0; } diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index 98c04ca5fa43..5064b13b80d6 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -72,8 +72,9 @@ void check_irq_resend(struct irq_desc *desc) desc->istate &= ~IRQS_PENDING; desc->istate |= IRQS_REPLAY; - if (!desc->irq_data.chip->irq_retrigger || - !desc->irq_data.chip->irq_retrigger(&desc->irq_data)) { + if ((!desc->irq_data.chip->irq_retrigger || + !desc->irq_data.chip->irq_retrigger(&desc->irq_data)) && + !handle_enforce_irqctx(&desc->irq_data)) { #ifdef CONFIG_HARDIRQS_SW_RESEND unsigned int irq = irq_desc_get_irq(desc); -- cgit v1.2.3-71-gd317 From 1f85b1f5e1f5541272abedc19ba7b6c5b564c228 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 6 Mar 2020 14:03:45 +0100 Subject: genirq: Add return value to check_irq_resend() In preparation for an interrupt injection interface which can be used safely by error injection mechanisms. e.g. PCI-E/ AER, add a return value to check_irq_resend() so errors can be propagated to the caller. Split out the software resend code so the ugly #ifdef in check_irq_resend() goes away and the whole thing becomes readable. Fix up the caller in debugfs. The caller in irq_startup() does not care about the return value as this is unconditionally invoked for all interrupts and the resend is best effort anyway. Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Link: https://lkml.kernel.org/r/20200306130623.775200917@linutronix.de --- kernel/irq/debugfs.c | 3 +- kernel/irq/internals.h | 2 +- kernel/irq/resend.c | 83 +++++++++++++++++++++++++++++--------------------- 3 files changed, 51 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index d44c8fd17609..0c607798f519 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -218,8 +218,7 @@ static ssize_t irq_debug_write(struct file *file, const char __user *user_buf, err = -EINVAL; } else { desc->istate |= IRQS_PENDING; - check_irq_resend(desc); - err = 0; + err = check_irq_resend(desc); } raw_spin_unlock_irqrestore(&desc->lock, flags); diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index 5be382fc9415..8980859bdf1e 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -108,7 +108,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc); irqreturn_t handle_irq_event(struct irq_desc *desc); /* Resending of interrupts :*/ -void check_irq_resend(struct irq_desc *desc); +int check_irq_resend(struct irq_desc *desc); bool irq_wait_for_poll(struct irq_desc *desc); void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action); diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index 5064b13b80d6..a21fc4eeedc7 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -47,6 +47,43 @@ static void resend_irqs(unsigned long arg) /* Tasklet to handle resend: */ static DECLARE_TASKLET(resend_tasklet, resend_irqs, 0); +static int irq_sw_resend(struct irq_desc *desc) +{ + unsigned int irq = irq_desc_get_irq(desc); + + /* + * Validate whether this interrupt can be safely injected from + * non interrupt context + */ + if (handle_enforce_irqctx(&desc->irq_data)) + return -EINVAL; + + /* + * If the interrupt is running in the thread context of the parent + * irq we need to be careful, because we cannot trigger it + * directly. + */ + if (irq_settings_is_nested_thread(desc)) { + /* + * If the parent_irq is valid, we retrigger the parent, + * otherwise we do nothing. + */ + if (!desc->parent_irq) + return -EINVAL; + irq = desc->parent_irq; + } + + /* Set it pending and activate the softirq: */ + set_bit(irq, irqs_resend); + tasklet_schedule(&resend_tasklet); + return 0; +} + +#else +static int irq_sw_resend(struct irq_desc *desc) +{ + return -EINVAL; +} #endif /* @@ -54,50 +91,28 @@ static DECLARE_TASKLET(resend_tasklet, resend_irqs, 0); * * Is called with interrupts disabled and desc->lock held. */ -void check_irq_resend(struct irq_desc *desc) +int check_irq_resend(struct irq_desc *desc) { /* - * We do not resend level type interrupts. Level type - * interrupts are resent by hardware when they are still - * active. Clear the pending bit so suspend/resume does not - * get confused. + * We do not resend level type interrupts. Level type interrupts + * are resent by hardware when they are still active. Clear the + * pending bit so suspend/resume does not get confused. */ if (irq_settings_is_level(desc)) { desc->istate &= ~IRQS_PENDING; - return; + return -EINVAL; } + if (desc->istate & IRQS_REPLAY) - return; + return -EBUSY; + if (desc->istate & IRQS_PENDING) { desc->istate &= ~IRQS_PENDING; desc->istate |= IRQS_REPLAY; - if ((!desc->irq_data.chip->irq_retrigger || - !desc->irq_data.chip->irq_retrigger(&desc->irq_data)) && - !handle_enforce_irqctx(&desc->irq_data)) { -#ifdef CONFIG_HARDIRQS_SW_RESEND - unsigned int irq = irq_desc_get_irq(desc); - - /* - * If the interrupt is running in the thread - * context of the parent irq we need to be - * careful, because we cannot trigger it - * directly. - */ - if (irq_settings_is_nested_thread(desc)) { - /* - * If the parent_irq is valid, we - * retrigger the parent, otherwise we - * do nothing. - */ - if (!desc->parent_irq) - return; - irq = desc->parent_irq; - } - /* Set it pending and activate the softirq: */ - set_bit(irq, irqs_resend); - tasklet_schedule(&resend_tasklet); -#endif - } + if (!desc->irq_data.chip->irq_retrigger || + !desc->irq_data.chip->irq_retrigger(&desc->irq_data)) + return irq_sw_resend(desc); } + return 0; } -- cgit v1.2.3-71-gd317 From da90921acc62c71d27729ae211ccfda5370bf75b Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 6 Mar 2020 14:03:46 +0100 Subject: genirq: Sanitize state handling in check_irq_resend() The code sets IRQS_REPLAY unconditionally whether the resend happens or not. That doesn't have bad side effects right now, but inconsistent state is always a latent source of problems. Signed-off-by: Thomas Gleixner Acked-by: Marc Zyngier Link: https://lkml.kernel.org/r/20200306130623.882129117@linutronix.de --- kernel/irq/resend.c | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index a21fc4eeedc7..bef72dcfb79b 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -93,6 +93,8 @@ static int irq_sw_resend(struct irq_desc *desc) */ int check_irq_resend(struct irq_desc *desc) { + int err = 0; + /* * We do not resend level type interrupts. Level type interrupts * are resent by hardware when they are still active. Clear the @@ -106,13 +108,17 @@ int check_irq_resend(struct irq_desc *desc) if (desc->istate & IRQS_REPLAY) return -EBUSY; - if (desc->istate & IRQS_PENDING) { - desc->istate &= ~IRQS_PENDING; - desc->istate |= IRQS_REPLAY; + if (!(desc->istate & IRQS_PENDING)) + return 0; - if (!desc->irq_data.chip->irq_retrigger || - !desc->irq_data.chip->irq_retrigger(&desc->irq_data)) - return irq_sw_resend(desc); - } - return 0; + desc->istate &= ~IRQS_PENDING; + + if (!desc->irq_data.chip->irq_retrigger || + !desc->irq_data.chip->irq_retrigger(&desc->irq_data)) + err = irq_sw_resend(desc); + + /* If the retrigger was successfull, mark it with the REPLAY bit */ + if (!err) + desc->istate |= IRQS_REPLAY; + return err; } -- cgit v1.2.3-71-gd317 From acd26bcf362708594ea081ef55140e37d0854ed2 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 6 Mar 2020 14:03:47 +0100 Subject: genirq: Provide interrupt injection mechanism Error injection mechanisms need a half ways safe way to inject interrupts as invoking generic_handle_irq() or the actual device interrupt handler directly from e.g. a debugfs write is not guaranteed to be safe. On x86 generic_handle_irq() is unsafe due to the hardware trainwreck which is the base of x86 interrupt delivery and affinity management. Move the irq debugfs injection code into a separate function which can be used by error injection code as well. The implementation prevents at least that state is corrupted, but it cannot close a very tiny race window on x86 which might result in a stale and not serviced device interrupt under very unlikely circumstances. This is explicitly for debugging and testing and not for production use or abuse in random driver code. Signed-off-by: Thomas Gleixner Tested-by: Kuppuswamy Sathyanarayanan Reviewed-by: Kuppuswamy Sathyanarayanan Acked-by: Marc Zyngier Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de --- include/linux/interrupt.h | 2 ++ kernel/irq/Kconfig | 5 +++++ kernel/irq/chip.c | 2 +- kernel/irq/debugfs.c | 34 +----------------------------- kernel/irq/internals.h | 2 +- kernel/irq/resend.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-- 6 files changed, 61 insertions(+), 37 deletions(-) (limited to 'kernel') diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index c5fe60ec6b84..80f637c3a6f3 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -248,6 +248,8 @@ extern void enable_percpu_nmi(unsigned int irq, unsigned int type); extern int prepare_percpu_nmi(unsigned int irq); extern void teardown_percpu_nmi(unsigned int irq); +extern int irq_inject_interrupt(unsigned int irq); + /* The following three functions are for the core kernel use only. */ extern void suspend_device_irqs(void); extern void resume_device_irqs(void); diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig index f92d9a687372..20d501af4f2e 100644 --- a/kernel/irq/Kconfig +++ b/kernel/irq/Kconfig @@ -43,6 +43,10 @@ config GENERIC_IRQ_MIGRATION config AUTO_IRQ_AFFINITY bool +# Interrupt injection mechanism +config GENERIC_IRQ_INJECTION + bool + # Tasklet based software resend for pending interrupts on enable_irq() config HARDIRQS_SW_RESEND bool @@ -127,6 +131,7 @@ config SPARSE_IRQ config GENERIC_IRQ_DEBUGFS bool "Expose irq internals in debugfs" depends on DEBUG_FS + select GENERIC_IRQ_INJECTION default n ---help--- diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c index b3fa2d87d2f3..41e7e37a0928 100644 --- a/kernel/irq/chip.c +++ b/kernel/irq/chip.c @@ -278,7 +278,7 @@ int irq_startup(struct irq_desc *desc, bool resend, bool force) } } if (resend) - check_irq_resend(desc); + check_irq_resend(desc, false); return ret; } diff --git a/kernel/irq/debugfs.c b/kernel/irq/debugfs.c index 0c607798f519..4f9f844074db 100644 --- a/kernel/irq/debugfs.c +++ b/kernel/irq/debugfs.c @@ -190,39 +190,7 @@ static ssize_t irq_debug_write(struct file *file, const char __user *user_buf, return -EFAULT; if (!strncmp(buf, "trigger", size)) { - unsigned long flags; - int err; - - /* Try the HW interface first */ - err = irq_set_irqchip_state(irq_desc_get_irq(desc), - IRQCHIP_STATE_PENDING, true); - if (!err) - return count; - - /* - * Otherwise, try to inject via the resend interface, - * which may or may not succeed. - */ - chip_bus_lock(desc); - raw_spin_lock_irqsave(&desc->lock, flags); - - /* - * Don't allow injection when the interrupt is: - * - Level or NMI type - * - not activated - * - replaying already - */ - if (irq_settings_is_level(desc) || - !irqd_is_activated(&desc->irq_data) || - (desc->istate & (IRQS_NMI | IRQS_REPLAY))) { - err = -EINVAL; - } else { - desc->istate |= IRQS_PENDING; - err = check_irq_resend(desc); - } - - raw_spin_unlock_irqrestore(&desc->lock, flags); - chip_bus_sync_unlock(desc); + int err = irq_inject_interrupt(irq_desc_get_irq(desc)); return err ? err : count; } diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h index 8980859bdf1e..7db284b10ac9 100644 --- a/kernel/irq/internals.h +++ b/kernel/irq/internals.h @@ -108,7 +108,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc); irqreturn_t handle_irq_event(struct irq_desc *desc); /* Resending of interrupts :*/ -int check_irq_resend(struct irq_desc *desc); +int check_irq_resend(struct irq_desc *desc, bool inject); bool irq_wait_for_poll(struct irq_desc *desc); void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action); diff --git a/kernel/irq/resend.c b/kernel/irq/resend.c index bef72dcfb79b..27634f4022d0 100644 --- a/kernel/irq/resend.c +++ b/kernel/irq/resend.c @@ -91,7 +91,7 @@ static int irq_sw_resend(struct irq_desc *desc) * * Is called with interrupts disabled and desc->lock held. */ -int check_irq_resend(struct irq_desc *desc) +int check_irq_resend(struct irq_desc *desc, bool inject) { int err = 0; @@ -108,7 +108,7 @@ int check_irq_resend(struct irq_desc *desc) if (desc->istate & IRQS_REPLAY) return -EBUSY; - if (!(desc->istate & IRQS_PENDING)) + if (!(desc->istate & IRQS_PENDING) && !inject) return 0; desc->istate &= ~IRQS_PENDING; @@ -122,3 +122,52 @@ int check_irq_resend(struct irq_desc *desc) desc->istate |= IRQS_REPLAY; return err; } + +#ifdef CONFIG_GENERIC_IRQ_INJECTION +/** + * irq_inject_interrupt - Inject an interrupt for testing/error injection + * @irq: The interrupt number + * + * This function must only be used for debug and testing purposes! + * + * Especially on x86 this can cause a premature completion of an interrupt + * affinity change causing the interrupt line to become stale. Very + * unlikely, but possible. + * + * The injection can fail for various reasons: + * - Interrupt is not activated + * - Interrupt is NMI type or currently replaying + * - Interrupt is level type + * - Interrupt does not support hardware retrigger and software resend is + * either not enabled or not possible for the interrupt. + */ +int irq_inject_interrupt(unsigned int irq) +{ + struct irq_desc *desc; + unsigned long flags; + int err; + + /* Try the state injection hardware interface first */ + if (!irq_set_irqchip_state(irq, IRQCHIP_STATE_PENDING, true)) + return 0; + + /* That failed, try via the resend mechanism */ + desc = irq_get_desc_buslock(irq, &flags, 0); + if (!desc) + return -EINVAL; + + /* + * Only try to inject when the interrupt is: + * - not NMI type + * - activated + */ + if ((desc->istate & IRQS_NMI) || !irqd_is_activated(&desc->irq_data)) + err = -EINVAL; + else + err = check_irq_resend(desc, true); + + irq_put_desc_busunlock(desc, flags); + return err; +} +EXPORT_SYMBOL_GPL(irq_inject_interrupt); +#endif -- cgit v1.2.3-71-gd317 From 87f2d1c662fa1761359fdf558246f97e484d177a Mon Sep 17 00:00:00 2001 From: Alexander Sverdlin Date: Fri, 6 Mar 2020 18:47:20 +0100 Subject: genirq/irqdomain: Check pointer in irq_domain_alloc_irqs_hierarchy() irq_domain_alloc_irqs_hierarchy() has 3 call sites in the compilation unit but only one of them checks for the pointer which is being dereferenced inside the called function. Move the check into the function. This allows for catching the error instead of the following crash: Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at 0x0 LR is at gpiochip_hierarchy_irq_domain_alloc+0x11f/0x140 ... [] (gpiochip_hierarchy_irq_domain_alloc) [] (__irq_domain_alloc_irqs) [] (irq_create_fwspec_mapping) [] (gpiochip_to_irq) [] (gpiod_to_irq) [] (gpio_irqs_init [gpio_irqs]) [] (gpio_irqs_exit+0xecc/0xe84 [gpio_irqs]) Code: bad PC value Signed-off-by: Alexander Sverdlin Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200306174720.82604-1-alexander.sverdlin@nokia.com --- kernel/irq/irqdomain.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index fdfc2132d2a9..35b8d97c3a1d 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -1310,6 +1310,11 @@ int irq_domain_alloc_irqs_hierarchy(struct irq_domain *domain, unsigned int irq_base, unsigned int nr_irqs, void *arg) { + if (!domain->ops->alloc) { + pr_debug("domain->ops->alloc() is NULL\n"); + return -ENOSYS; + } + return domain->ops->alloc(domain, irq_base, nr_irqs, arg); } @@ -1347,11 +1352,6 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base, return -EINVAL; } - if (!domain->ops->alloc) { - pr_debug("domain->ops->alloc() is NULL\n"); - return -ENOSYS; - } - if (realloc && irq_base >= 0) { virq = irq_base; } else { -- cgit v1.2.3-71-gd317 From babf3164095b0670435910340c2a1eec37757b57 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Mon, 9 Mar 2020 16:10:51 -0700 Subject: bpf: Add bpf_link_new_file that doesn't install FD Add bpf_link_new_file() API for cases when we need to ensure anon_inode is successfully created before we proceed with expensive BPF program attachment procedure, which will require equally (if not more so) expensive and potentially failing compensation detachment procedure just because anon_inode creation failed. This API allows to simplify code by ensuring first that anon_inode is created and after BPF program is attached proceed with fd_install() that can't fail. After anon_inode file is created, link can't be just kfree()'d anymore, because its destruction will be performed by deferred file_operations->release call. For this, bpf_link API required specifying two separate operations: release() and dealloc(), former performing detachment only, while the latter frees memory used by bpf_link itself. dealloc() needs to be specified, because struct bpf_link is frequently embedded into link type-specific container struct (e.g., struct bpf_raw_tp_link), so bpf_link itself doesn't know how to properly free the memory. In case when anon_inode file was successfully created, but subsequent BPF attachment failed, bpf_link needs to be marked as "defunct", so that file's release() callback will perform only memory deallocation, but no detachment. Convert raw tracepoint and tracing attachment to new API and eliminate detachment from error handling path. Signed-off-by: Andrii Nakryiko Signed-off-by: Daniel Borkmann Acked-by: John Fastabend Link: https://lore.kernel.org/bpf/20200309231051.1270337-1-andriin@fb.com --- include/linux/bpf.h | 3 ++ kernel/bpf/syscall.c | 122 +++++++++++++++++++++++++++++++++++++-------------- 2 files changed, 91 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 94a329b9da81..4fd91b7c95ea 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1070,13 +1070,16 @@ struct bpf_link; struct bpf_link_ops { void (*release)(struct bpf_link *link); + void (*dealloc)(struct bpf_link *link); }; void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, struct bpf_prog *prog); +void bpf_link_defunct(struct bpf_link *link); void bpf_link_inc(struct bpf_link *link); void bpf_link_put(struct bpf_link *link); int bpf_link_new_fd(struct bpf_link *link); +struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd); struct bpf_link *bpf_link_get_from_fd(u32 ufd); int bpf_obj_pin_user(u32 ufd, const char __user *pathname); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7ce0815793dd..b2f73ecacced 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2188,6 +2188,11 @@ void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, link->prog = prog; } +void bpf_link_defunct(struct bpf_link *link) +{ + link->prog = NULL; +} + void bpf_link_inc(struct bpf_link *link) { atomic64_inc(&link->refcnt); @@ -2196,14 +2201,13 @@ void bpf_link_inc(struct bpf_link *link) /* bpf_link_free is guaranteed to be called from process context */ static void bpf_link_free(struct bpf_link *link) { - struct bpf_prog *prog; - - /* remember prog locally, because release below will free link memory */ - prog = link->prog; - /* extra clean up and kfree of container link struct */ - link->ops->release(link); - /* no more accesing of link members after this point */ - bpf_prog_put(prog); + if (link->prog) { + /* detach BPF program, clean up used resources */ + link->ops->release(link); + bpf_prog_put(link->prog); + } + /* free bpf_link and its containing memory */ + link->ops->dealloc(link); } static void bpf_link_put_deferred(struct work_struct *work) @@ -2281,6 +2285,33 @@ int bpf_link_new_fd(struct bpf_link *link) return anon_inode_getfd("bpf-link", &bpf_link_fops, link, O_CLOEXEC); } +/* Similar to bpf_link_new_fd, create anon_inode for given bpf_link, but + * instead of immediately installing fd in fdtable, just reserve it and + * return. Caller then need to either install it with fd_install(fd, file) or + * release with put_unused_fd(fd). + * This is useful for cases when bpf_link attachment/detachment are + * complicated and expensive operations and should be delayed until all the fd + * reservation and anon_inode creation succeeds. + */ +struct file *bpf_link_new_file(struct bpf_link *link, int *reserved_fd) +{ + struct file *file; + int fd; + + fd = get_unused_fd_flags(O_CLOEXEC); + if (fd < 0) + return ERR_PTR(fd); + + file = anon_inode_getfile("bpf_link", &bpf_link_fops, link, O_CLOEXEC); + if (IS_ERR(file)) { + put_unused_fd(fd); + return file; + } + + *reserved_fd = fd; + return file; +} + struct bpf_link *bpf_link_get_from_fd(u32 ufd) { struct fd f = fdget(ufd); @@ -2305,21 +2336,27 @@ struct bpf_tracing_link { }; static void bpf_tracing_link_release(struct bpf_link *link) +{ + WARN_ON_ONCE(bpf_trampoline_unlink_prog(link->prog)); +} + +static void bpf_tracing_link_dealloc(struct bpf_link *link) { struct bpf_tracing_link *tr_link = container_of(link, struct bpf_tracing_link, link); - WARN_ON_ONCE(bpf_trampoline_unlink_prog(link->prog)); kfree(tr_link); } static const struct bpf_link_ops bpf_tracing_link_lops = { .release = bpf_tracing_link_release, + .dealloc = bpf_tracing_link_dealloc, }; static int bpf_tracing_prog_attach(struct bpf_prog *prog) { struct bpf_tracing_link *link; + struct file *link_file; int link_fd, err; if (prog->expected_attach_type != BPF_TRACE_FENTRY && @@ -2337,20 +2374,24 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog) } bpf_link_init(&link->link, &bpf_tracing_link_lops, prog); - err = bpf_trampoline_link_prog(prog); - if (err) - goto out_free_link; + link_file = bpf_link_new_file(&link->link, &link_fd); + if (IS_ERR(link_file)) { + kfree(link); + err = PTR_ERR(link_file); + goto out_put_prog; + } - link_fd = bpf_link_new_fd(&link->link); - if (link_fd < 0) { - WARN_ON_ONCE(bpf_trampoline_unlink_prog(prog)); - err = link_fd; - goto out_free_link; + err = bpf_trampoline_link_prog(prog); + if (err) { + bpf_link_defunct(&link->link); + fput(link_file); + put_unused_fd(link_fd); + goto out_put_prog; } + + fd_install(link_fd, link_file); return link_fd; -out_free_link: - kfree(link); out_put_prog: bpf_prog_put(prog); return err; @@ -2368,19 +2409,28 @@ static void bpf_raw_tp_link_release(struct bpf_link *link) bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog); bpf_put_raw_tracepoint(raw_tp->btp); +} + +static void bpf_raw_tp_link_dealloc(struct bpf_link *link) +{ + struct bpf_raw_tp_link *raw_tp = + container_of(link, struct bpf_raw_tp_link, link); + kfree(raw_tp); } static const struct bpf_link_ops bpf_raw_tp_lops = { .release = bpf_raw_tp_link_release, + .dealloc = bpf_raw_tp_link_dealloc, }; #define BPF_RAW_TRACEPOINT_OPEN_LAST_FIELD raw_tracepoint.prog_fd static int bpf_raw_tracepoint_open(const union bpf_attr *attr) { - struct bpf_raw_tp_link *raw_tp; + struct bpf_raw_tp_link *link; struct bpf_raw_event_map *btp; + struct file *link_file; struct bpf_prog *prog; const char *tp_name; char buf[128]; @@ -2431,28 +2481,32 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) goto out_put_prog; } - raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER); - if (!raw_tp) { + link = kzalloc(sizeof(*link), GFP_USER); + if (!link) { err = -ENOMEM; goto out_put_btp; } - bpf_link_init(&raw_tp->link, &bpf_raw_tp_lops, prog); - raw_tp->btp = btp; + bpf_link_init(&link->link, &bpf_raw_tp_lops, prog); + link->btp = btp; - err = bpf_probe_register(raw_tp->btp, prog); - if (err) - goto out_free_tp; + link_file = bpf_link_new_file(&link->link, &link_fd); + if (IS_ERR(link_file)) { + kfree(link); + err = PTR_ERR(link_file); + goto out_put_btp; + } - link_fd = bpf_link_new_fd(&raw_tp->link); - if (link_fd < 0) { - bpf_probe_unregister(raw_tp->btp, prog); - err = link_fd; - goto out_free_tp; + err = bpf_probe_register(link->btp, prog); + if (err) { + bpf_link_defunct(&link->link); + fput(link_file); + put_unused_fd(link_fd); + goto out_put_btp; } + + fd_install(link_fd, link_file); return link_fd; -out_free_tp: - kfree(raw_tp); out_put_btp: bpf_put_raw_tracepoint(btp); out_put_prog: -- cgit v1.2.3-71-gd317 From 1320a4052ea11eb2879eb7361da15a106a780972 Mon Sep 17 00:00:00 2001 From: Richard Guy Briggs Date: Tue, 10 Mar 2020 09:20:17 -0400 Subject: audit: trigger accompanying records when no rules present When there are no audit rules registered, mandatory records (config, etc.) are missing their accompanying records (syscall, proctitle, etc.). This is due to audit context dummy set on syscall entry based on absence of rules that signals that no other records are to be printed. Clear the dummy bit if any record is generated. The proctitle context and dummy checks are pointless since the proctitle record will not be printed if no syscall records are printed. Please see upstream github issue https://github.com/linux-audit/audit-kernel/issues/120 Signed-off-by: Richard Guy Briggs Signed-off-by: Paul Moore --- kernel/audit.c | 1 + kernel/audit.h | 8 ++++++++ kernel/auditsc.c | 3 --- 3 files changed, 9 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index 17b0d523afb3..b96331e1976d 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -1798,6 +1798,7 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask, } audit_get_stamp(ab->ctx, &t, &serial); + audit_clear_dummy(ab->ctx); audit_log_format(ab, "audit(%llu.%03lu:%u): ", (unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial); diff --git a/kernel/audit.h b/kernel/audit.h index 6fb7160412d4..2eed4d231624 100644 --- a/kernel/audit.h +++ b/kernel/audit.h @@ -290,6 +290,13 @@ extern int audit_signal_info_syscall(struct task_struct *t); extern void audit_filter_inodes(struct task_struct *tsk, struct audit_context *ctx); extern struct list_head *audit_killed_trees(void); + +static inline void audit_clear_dummy(struct audit_context *ctx) +{ + if (ctx) + ctx->dummy = 0; +} + #else /* CONFIG_AUDITSYSCALL */ #define auditsc_get_stamp(c, t, s) 0 #define audit_put_watch(w) {} @@ -323,6 +330,7 @@ static inline int audit_signal_info_syscall(struct task_struct *t) } #define audit_filter_inodes(t, c) AUDIT_DISABLED +#define audit_clear_dummy(c) {} #endif /* CONFIG_AUDITSYSCALL */ extern char *audit_unpack_string(void **bufp, size_t *remain, size_t len); diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 4effe01ebbe2..814406a35db1 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1406,9 +1406,6 @@ static void audit_log_proctitle(void) struct audit_context *context = audit_context(); struct audit_buffer *ab; - if (!context || context->dummy) - return; - ab = audit_log_start(context, GFP_KERNEL, AUDIT_PROCTITLE); if (!ab) return; /* audit_panic or being filtered */ -- cgit v1.2.3-71-gd317 From 00d5d15b0641f4ae463253eba06c836d56c2ce42 Mon Sep 17 00:00:00 2001 From: Chris Wilson Date: Tue, 10 Mar 2020 16:23:19 +0000 Subject: workqueue: Mark up unlocked access to wq->first_flusher [ 7329.671518] BUG: KCSAN: data-race in flush_workqueue / flush_workqueue [ 7329.671549] [ 7329.671572] write to 0xffff8881f65fb250 of 8 bytes by task 37173 on cpu 2: [ 7329.671607] flush_workqueue+0x3bc/0x9b0 (kernel/workqueue.c:2844) [ 7329.672527] [ 7329.672540] read to 0xffff8881f65fb250 of 8 bytes by task 37175 on cpu 0: [ 7329.672571] flush_workqueue+0x28d/0x9b0 (kernel/workqueue.c:2835) Signed-off-by: Chris Wilson Cc: Tejun Heo Cc: Lai Jiangshan Signed-off-by: Tejun Heo --- kernel/workqueue.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 5afa9ad45eba..40a4ba39ad5e 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2832,7 +2832,7 @@ void flush_workqueue(struct workqueue_struct *wq) * First flushers are responsible for cascading flushes and * handling overflow. Non-first flushers can simply return. */ - if (wq->first_flusher != &this_flusher) + if (READ_ONCE(wq->first_flusher) != &this_flusher) return; mutex_lock(&wq->mutex); @@ -2841,7 +2841,7 @@ void flush_workqueue(struct workqueue_struct *wq) if (wq->first_flusher != &this_flusher) goto out_unlock; - wq->first_flusher = NULL; + WRITE_ONCE(wq->first_flusher, NULL); WARN_ON_ONCE(!list_empty(&this_flusher.list)); WARN_ON_ONCE(wq->flush_color != this_flusher.flush_color); -- cgit v1.2.3-71-gd317 From e7b20d97967c2995700041f0348ea33047e5c942 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Thu, 12 Mar 2020 16:44:35 -0400 Subject: cgroup: Restructure release_agent_path handling cgrp->root->release_agent_path is protected by both cgroup_mutex and release_agent_path_lock and readers can hold either one. The dual-locking scheme was introduced while breaking a locking dependency issue around cgroup_mutex but doesn't make sense anymore given that the only remaining reader which uses cgroup_mutex is cgroup1_releaes_agent(). This patch updates cgroup1_release_agent() to use release_agent_path_lock so that release_agent_path is always protected only by release_agent_path_lock. While at it, convert strlen() based empty string checks to direct tests on the first character as suggested by Linus. Signed-off-by: Tejun Heo Cc: Linus Torvalds --- kernel/cgroup/cgroup-v1.c | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index f2d7cea86ffe..191c329e482a 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -38,10 +38,7 @@ static bool cgroup_no_v1_named; */ static struct workqueue_struct *cgroup_pidlist_destroy_wq; -/* - * Protects cgroup_subsys->release_agent_path. Modifying it also requires - * cgroup_mutex. Reading requires either cgroup_mutex or this spinlock. - */ +/* protects cgroup_subsys->release_agent_path */ static DEFINE_SPINLOCK(release_agent_path_lock); bool cgroup1_ssid_disabled(int ssid) @@ -775,22 +772,29 @@ void cgroup1_release_agent(struct work_struct *work) { struct cgroup *cgrp = container_of(work, struct cgroup, release_agent_work); - char *pathbuf = NULL, *agentbuf = NULL; + char *pathbuf, *agentbuf; char *argv[3], *envp[3]; int ret; - mutex_lock(&cgroup_mutex); + /* snoop agent path and exit early if empty */ + if (!cgrp->root->release_agent_path[0]) + return; + /* prepare argument buffers */ pathbuf = kmalloc(PATH_MAX, GFP_KERNEL); - agentbuf = kstrdup(cgrp->root->release_agent_path, GFP_KERNEL); - if (!pathbuf || !agentbuf || !strlen(agentbuf)) - goto out; + agentbuf = kmalloc(PATH_MAX, GFP_KERNEL); + if (!pathbuf || !agentbuf) + goto out_free; - spin_lock_irq(&css_set_lock); - ret = cgroup_path_ns_locked(cgrp, pathbuf, PATH_MAX, &init_cgroup_ns); - spin_unlock_irq(&css_set_lock); + spin_lock(&release_agent_path_lock); + strlcpy(agentbuf, cgrp->root->release_agent_path, PATH_MAX); + spin_unlock(&release_agent_path_lock); + if (!agentbuf[0]) + goto out_free; + + ret = cgroup_path_ns(cgrp, pathbuf, PATH_MAX, &init_cgroup_ns); if (ret < 0 || ret >= PATH_MAX) - goto out; + goto out_free; argv[0] = agentbuf; argv[1] = pathbuf; @@ -801,11 +805,7 @@ void cgroup1_release_agent(struct work_struct *work) envp[1] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin"; envp[2] = NULL; - mutex_unlock(&cgroup_mutex); call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC); - goto out_free; -out: - mutex_unlock(&cgroup_mutex); out_free: kfree(agentbuf); kfree(pathbuf); -- cgit v1.2.3-71-gd317 From b4490c5c4e023f09b7d27c9a9d3e7ad7d09ea6bf Mon Sep 17 00:00:00 2001 From: Carlos Neira Date: Wed, 4 Mar 2020 17:41:56 -0300 Subject: bpf: Added new helper bpf_get_ns_current_pid_tgid New bpf helper bpf_get_ns_current_pid_tgid, This helper will return pid and tgid from current task which namespace matches dev_t and inode number provided, this will allows us to instrument a process inside a container. Signed-off-by: Carlos Neira Signed-off-by: Alexei Starovoitov Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 20 ++++++++++++++++++- kernel/bpf/core.c | 1 + kernel/bpf/helpers.c | 45 ++++++++++++++++++++++++++++++++++++++++++ kernel/trace/bpf_trace.c | 2 ++ scripts/bpf_helpers_doc.py | 1 + tools/include/uapi/linux/bpf.h | 20 ++++++++++++++++++- 7 files changed, 88 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4fd91b7c95ea..4ec835334a1f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1497,6 +1497,7 @@ extern const struct bpf_func_proto bpf_strtol_proto; extern const struct bpf_func_proto bpf_strtoul_proto; extern const struct bpf_func_proto bpf_tcp_sock_proto; extern const struct bpf_func_proto bpf_jiffies64_proto; +extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto; /* Shared helpers among cBPF and eBPF. */ void bpf_user_rnd_init_once(void); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 40b2d9476268..15b239da775b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2914,6 +2914,19 @@ union bpf_attr { * of sizeof(struct perf_branch_entry). * * **-ENOENT** if architecture does not support branch records. + * + * int bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size) + * Description + * Returns 0 on success, values for *pid* and *tgid* as seen from the current + * *namespace* will be returned in *nsdata*. + * + * On failure, the returned value is one of the following: + * + * **-EINVAL** if dev and inum supplied don't match dev_t and inode number + * with nsfs of current task, or if dev conversion to dev_t lost high bits. + * + * **-ENOENT** if pidns does not exists for the current task. + * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3035,7 +3048,8 @@ union bpf_attr { FN(tcp_send_ack), \ FN(send_signal_thread), \ FN(jiffies64), \ - FN(read_branch_records), + FN(read_branch_records), \ + FN(get_ns_current_pid_tgid), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3829,4 +3843,8 @@ struct bpf_sockopt { __s32 retval; }; +struct bpf_pidns_info { + __u32 pid; + __u32 tgid; +}; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 973a20d49749..0f9ca46e1978 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2149,6 +2149,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak; const struct bpf_func_proto bpf_get_current_comm_proto __weak; const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak; const struct bpf_func_proto bpf_get_local_storage_proto __weak; +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto __weak; const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) { diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index d8b7b110a1c5..01878db15eaf 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -12,6 +12,8 @@ #include #include #include +#include +#include #include "../../lib/kstrtox.h" @@ -499,3 +501,46 @@ const struct bpf_func_proto bpf_strtoul_proto = { .arg4_type = ARG_PTR_TO_LONG, }; #endif + +BPF_CALL_4(bpf_get_ns_current_pid_tgid, u64, dev, u64, ino, + struct bpf_pidns_info *, nsdata, u32, size) +{ + struct task_struct *task = current; + struct pid_namespace *pidns; + int err = -EINVAL; + + if (unlikely(size != sizeof(struct bpf_pidns_info))) + goto clear; + + if (unlikely((u64)(dev_t)dev != dev)) + goto clear; + + if (unlikely(!task)) + goto clear; + + pidns = task_active_pid_ns(task); + if (unlikely(!pidns)) { + err = -ENOENT; + goto clear; + } + + if (!ns_match(&pidns->ns, (dev_t)dev, ino)) + goto clear; + + nsdata->pid = task_pid_nr_ns(task, pidns); + nsdata->tgid = task_tgid_nr_ns(task, pidns); + return 0; +clear: + memset((void *)nsdata, 0, (size_t) size); + return err; +} + +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = { + .func = bpf_get_ns_current_pid_tgid, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_UNINIT_MEM, + .arg4_type = ARG_CONST_SIZE, +}; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 6a490d8ce9de..b5071c7e93ca 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -843,6 +843,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_send_signal_thread_proto; case BPF_FUNC_perf_event_read_value: return &bpf_perf_event_read_value_proto; + case BPF_FUNC_get_ns_current_pid_tgid: + return &bpf_get_ns_current_pid_tgid_proto; default: return NULL; } diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py index cebed6fb5bbb..c1e2b5410faa 100755 --- a/scripts/bpf_helpers_doc.py +++ b/scripts/bpf_helpers_doc.py @@ -435,6 +435,7 @@ class PrinterHelpers(Printer): 'struct bpf_fib_lookup', 'struct bpf_perf_event_data', 'struct bpf_perf_event_value', + 'struct bpf_pidns_info', 'struct bpf_sock', 'struct bpf_sock_addr', 'struct bpf_sock_ops', diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 40b2d9476268..15b239da775b 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2914,6 +2914,19 @@ union bpf_attr { * of sizeof(struct perf_branch_entry). * * **-ENOENT** if architecture does not support branch records. + * + * int bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct bpf_pidns_info *nsdata, u32 size) + * Description + * Returns 0 on success, values for *pid* and *tgid* as seen from the current + * *namespace* will be returned in *nsdata*. + * + * On failure, the returned value is one of the following: + * + * **-EINVAL** if dev and inum supplied don't match dev_t and inode number + * with nsfs of current task, or if dev conversion to dev_t lost high bits. + * + * **-ENOENT** if pidns does not exists for the current task. + * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3035,7 +3048,8 @@ union bpf_attr { FN(tcp_send_ack), \ FN(send_signal_thread), \ FN(jiffies64), \ - FN(read_branch_records), + FN(read_branch_records), \ + FN(get_ns_current_pid_tgid), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3829,4 +3843,8 @@ struct bpf_sockopt { __s32 retval; }; +struct bpf_pidns_info { + __u32 pid; + __u32 tgid; +}; #endif /* _UAPI__LINUX_BPF_H__ */ -- cgit v1.2.3-71-gd317 From d831ee84bfc9173eecf30dbbc2553ae81b996c60 Mon Sep 17 00:00:00 2001 From: Eelco Chaudron Date: Fri, 6 Mar 2020 08:59:23 +0000 Subject: bpf: Add bpf_xdp_output() helper MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Introduce new helper that reuses existing xdp perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct xdp_buff *' as a tracepoint argument. Signed-off-by: Eelco Chaudron Signed-off-by: Alexei Starovoitov Acked-by: John Fastabend Acked-by: Toke Høiland-Jørgensen Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial --- include/uapi/linux/bpf.h | 26 ++++++++++- kernel/bpf/verifier.c | 4 +- kernel/trace/bpf_trace.c | 3 ++ net/core/filter.c | 16 ++++++- tools/include/uapi/linux/bpf.h | 26 ++++++++++- .../testing/selftests/bpf/prog_tests/xdp_bpf2bpf.c | 53 ++++++++++++++++++++++ .../testing/selftests/bpf/progs/test_xdp_bpf2bpf.c | 24 ++++++++++ 7 files changed, 148 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 15b239da775b..5d01c5c7e598 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2927,6 +2927,29 @@ union bpf_attr { * * **-ENOENT** if pidns does not exists for the current task. * + * int bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size) + * Description + * Write raw *data* blob into a special BPF perf event held by + * *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf + * event must have the following attributes: **PERF_SAMPLE_RAW** + * as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and + * **PERF_COUNT_SW_BPF_OUTPUT** as **config**. + * + * The *flags* are used to indicate the index in *map* for which + * the value must be put, masked with **BPF_F_INDEX_MASK**. + * Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU** + * to indicate that the index of the current CPU core should be + * used. + * + * The value to write, of *size*, is passed through eBPF stack and + * pointed by *data*. + * + * *ctx* is a pointer to in-kernel struct xdp_buff. + * + * This helper is similar to **bpf_perf_eventoutput**\ () but + * restricted to raw_tracepoint bpf programs. + * Return + * 0 on success, or a negative error in case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3049,7 +3072,8 @@ union bpf_attr { FN(send_signal_thread), \ FN(jiffies64), \ FN(read_branch_records), \ - FN(get_ns_current_pid_tgid), + FN(get_ns_current_pid_tgid), \ + FN(xdp_output), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 55d376c53f7d..745f3cfdf3b2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3650,7 +3650,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (func_id != BPF_FUNC_perf_event_read && func_id != BPF_FUNC_perf_event_output && func_id != BPF_FUNC_skb_output && - func_id != BPF_FUNC_perf_event_read_value) + func_id != BPF_FUNC_perf_event_read_value && + func_id != BPF_FUNC_xdp_output) goto error; break; case BPF_MAP_TYPE_STACK_TRACE: @@ -3740,6 +3741,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, case BPF_FUNC_perf_event_output: case BPF_FUNC_perf_event_read_value: case BPF_FUNC_skb_output: + case BPF_FUNC_xdp_output: if (map->map_type != BPF_MAP_TYPE_PERF_EVENT_ARRAY) goto error; break; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b5071c7e93ca..e619eedb5919 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1145,6 +1145,7 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = { }; extern const struct bpf_func_proto bpf_skb_output_proto; +extern const struct bpf_func_proto bpf_xdp_output_proto; BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args, struct bpf_map *, map, u64, flags) @@ -1220,6 +1221,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) #ifdef CONFIG_NET case BPF_FUNC_skb_output: return &bpf_skb_output_proto; + case BPF_FUNC_xdp_output: + return &bpf_xdp_output_proto; #endif default: return raw_tp_prog_func_proto(func_id, prog); diff --git a/net/core/filter.c b/net/core/filter.c index cd0a532db4e7..22219544410f 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4061,7 +4061,8 @@ BPF_CALL_5(bpf_xdp_event_output, struct xdp_buff *, xdp, struct bpf_map *, map, if (unlikely(flags & ~(BPF_F_CTXLEN_MASK | BPF_F_INDEX_MASK))) return -EINVAL; - if (unlikely(xdp_size > (unsigned long)(xdp->data_end - xdp->data))) + if (unlikely(!xdp || + xdp_size > (unsigned long)(xdp->data_end - xdp->data))) return -EFAULT; return bpf_event_output(map, flags, meta, meta_size, xdp->data, @@ -4079,6 +4080,19 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = { .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; +static int bpf_xdp_output_btf_ids[5]; +const struct bpf_func_proto bpf_xdp_output_proto = { + .func = bpf_xdp_event_output, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID, + .arg2_type = ARG_CONST_MAP_PTR, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE_OR_ZERO, + .btf_id = bpf_xdp_output_btf_ids, +}; + BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb) { return skb->sk ? sock_gen_cookie(skb->sk) : 0; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 15b239da775b..5d01c5c7e598 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2927,6 +2927,29 @@ union bpf_attr { * * **-ENOENT** if pidns does not exists for the current task. * + * int bpf_xdp_output(void *ctx, struct bpf_map *map, u64 flags, void *data, u64 size) + * Description + * Write raw *data* blob into a special BPF perf event held by + * *map* of type **BPF_MAP_TYPE_PERF_EVENT_ARRAY**. This perf + * event must have the following attributes: **PERF_SAMPLE_RAW** + * as **sample_type**, **PERF_TYPE_SOFTWARE** as **type**, and + * **PERF_COUNT_SW_BPF_OUTPUT** as **config**. + * + * The *flags* are used to indicate the index in *map* for which + * the value must be put, masked with **BPF_F_INDEX_MASK**. + * Alternatively, *flags* can be set to **BPF_F_CURRENT_CPU** + * to indicate that the index of the current CPU core should be + * used. + * + * The value to write, of *size*, is passed through eBPF stack and + * pointed by *data*. + * + * *ctx* is a pointer to in-kernel struct xdp_buff. + * + * This helper is similar to **bpf_perf_eventoutput**\ () but + * restricted to raw_tracepoint bpf programs. + * Return + * 0 on success, or a negative error in case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3049,7 +3072,8 @@ union bpf_attr { FN(send_signal_thread), \ FN(jiffies64), \ FN(read_branch_records), \ - FN(get_ns_current_pid_tgid), + FN(get_ns_current_pid_tgid), \ + FN(xdp_output), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_bpf2bpf.c b/tools/testing/selftests/bpf/prog_tests/xdp_bpf2bpf.c index 4ba011031d4c..a0f688c37023 100644 --- a/tools/testing/selftests/bpf/prog_tests/xdp_bpf2bpf.c +++ b/tools/testing/selftests/bpf/prog_tests/xdp_bpf2bpf.c @@ -4,17 +4,51 @@ #include "test_xdp.skel.h" #include "test_xdp_bpf2bpf.skel.h" +struct meta { + int ifindex; + int pkt_len; +}; + +static void on_sample(void *ctx, int cpu, void *data, __u32 size) +{ + int duration = 0; + struct meta *meta = (struct meta *)data; + struct ipv4_packet *trace_pkt_v4 = data + sizeof(*meta); + + if (CHECK(size < sizeof(pkt_v4) + sizeof(*meta), + "check_size", "size %u < %zu\n", + size, sizeof(pkt_v4) + sizeof(*meta))) + return; + + if (CHECK(meta->ifindex != if_nametoindex("lo"), "check_meta_ifindex", + "meta->ifindex = %d\n", meta->ifindex)) + return; + + if (CHECK(meta->pkt_len != sizeof(pkt_v4), "check_meta_pkt_len", + "meta->pkt_len = %zd\n", sizeof(pkt_v4))) + return; + + if (CHECK(memcmp(trace_pkt_v4, &pkt_v4, sizeof(pkt_v4)), + "check_packet_content", "content not the same\n")) + return; + + *(bool *)ctx = true; +} + void test_xdp_bpf2bpf(void) { __u32 duration = 0, retval, size; char buf[128]; int err, pkt_fd, map_fd; + bool passed = false; struct iphdr *iph = (void *)buf + sizeof(struct ethhdr); struct iptnl_info value4 = {.family = AF_INET}; struct test_xdp *pkt_skel = NULL; struct test_xdp_bpf2bpf *ftrace_skel = NULL; struct vip key4 = {.protocol = 6, .family = AF_INET}; struct bpf_program *prog; + struct perf_buffer *pb = NULL; + struct perf_buffer_opts pb_opts = {}; /* Load XDP program to introspect */ pkt_skel = test_xdp__open_and_load(); @@ -50,6 +84,14 @@ void test_xdp_bpf2bpf(void) if (CHECK(err, "ftrace_attach", "ftrace attach failed: %d\n", err)) goto out; + /* Set up perf buffer */ + pb_opts.sample_cb = on_sample; + pb_opts.ctx = &passed; + pb = perf_buffer__new(bpf_map__fd(ftrace_skel->maps.perf_buf_map), + 1, &pb_opts); + if (CHECK(IS_ERR(pb), "perf_buf__new", "err %ld\n", PTR_ERR(pb))) + goto out; + /* Run test program */ err = bpf_prog_test_run(pkt_fd, 1, &pkt_v4, sizeof(pkt_v4), buf, &size, &retval, &duration); @@ -60,6 +102,15 @@ void test_xdp_bpf2bpf(void) err, errno, retval, size)) goto out; + /* Make sure bpf_xdp_output() was triggered and it sent the expected + * data to the perf ring buffer. + */ + err = perf_buffer__poll(pb, 100); + if (CHECK(err < 0, "perf_buffer__poll", "err %d\n", err)) + goto out; + + CHECK_FAIL(!passed); + /* Verify test results */ if (CHECK(ftrace_skel->bss->test_result_fentry != if_nametoindex("lo"), "result", "fentry failed err %llu\n", @@ -70,6 +121,8 @@ void test_xdp_bpf2bpf(void) "fexit failed err %llu\n", ftrace_skel->bss->test_result_fexit); out: + if (pb) + perf_buffer__free(pb); test_xdp__destroy(pkt_skel); test_xdp_bpf2bpf__destroy(ftrace_skel); } diff --git a/tools/testing/selftests/bpf/progs/test_xdp_bpf2bpf.c b/tools/testing/selftests/bpf/progs/test_xdp_bpf2bpf.c index 42dd2fedd588..a038e827f850 100644 --- a/tools/testing/selftests/bpf/progs/test_xdp_bpf2bpf.c +++ b/tools/testing/selftests/bpf/progs/test_xdp_bpf2bpf.c @@ -3,6 +3,8 @@ #include #include +char _license[] SEC("license") = "GPL"; + struct net_device { /* Structure does not need to contain all entries, * as "preserve_access_index" will use BTF to fix this... @@ -27,10 +29,32 @@ struct xdp_buff { struct xdp_rxq_info *rxq; } __attribute__((preserve_access_index)); +struct meta { + int ifindex; + int pkt_len; +}; + +struct { + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); + __uint(key_size, sizeof(int)); + __uint(value_size, sizeof(int)); +} perf_buf_map SEC(".maps"); + __u64 test_result_fentry = 0; SEC("fentry/FUNC") int BPF_PROG(trace_on_entry, struct xdp_buff *xdp) { + struct meta meta; + void *data_end = (void *)(long)xdp->data_end; + void *data = (void *)(long)xdp->data; + + meta.ifindex = xdp->rxq->dev->ifindex; + meta.pkt_len = data_end - data; + bpf_xdp_output(xdp, &perf_buf_map, + ((__u64) meta.pkt_len << 32) | + BPF_F_CURRENT_CPU, + &meta, sizeof(meta)); + test_result_fentry = xdp->rxq->dev->ifindex; return 0; } -- cgit v1.2.3-71-gd317 From 98868668367b24487c0b0b3298d7ca98409baf07 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Thu, 12 Mar 2020 17:21:28 -0700 Subject: bpf: Abstract away entire bpf_link clean up procedure Instead of requiring users to do three steps for cleaning up bpf_link, its anon_inode file, and unused fd, abstract that away into bpf_link_cleanup() helper. bpf_link_defunct() is removed, as it shouldn't be needed as an individual operation anymore. v1->v2: - keep bpf_link_cleanup() static for now (Daniel). Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Acked-by: Martin KaFai Lau Link: https://lore.kernel.org/bpf/20200313002128.2028680-1-andriin@fb.com Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 1 - kernel/bpf/syscall.c | 18 +++++++++++------- 2 files changed, 11 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4ec835334a1f..c2f815e9f7d0 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1075,7 +1075,6 @@ struct bpf_link_ops { void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, struct bpf_prog *prog); -void bpf_link_defunct(struct bpf_link *link); void bpf_link_inc(struct bpf_link *link); void bpf_link_put(struct bpf_link *link); int bpf_link_new_fd(struct bpf_link *link); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b2f73ecacced..85567a6ea5f9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2188,9 +2188,17 @@ void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, link->prog = prog; } -void bpf_link_defunct(struct bpf_link *link) +/* Clean up bpf_link and corresponding anon_inode file and FD. After + * anon_inode is created, bpf_link can't be just kfree()'d due to deferred + * anon_inode's release() call. This helper manages marking bpf_link as + * defunct, releases anon_inode file and puts reserved FD. + */ +static void bpf_link_cleanup(struct bpf_link *link, struct file *link_file, + int link_fd) { link->prog = NULL; + fput(link_file); + put_unused_fd(link_fd); } void bpf_link_inc(struct bpf_link *link) @@ -2383,9 +2391,7 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog) err = bpf_trampoline_link_prog(prog); if (err) { - bpf_link_defunct(&link->link); - fput(link_file); - put_unused_fd(link_fd); + bpf_link_cleanup(&link->link, link_file, link_fd); goto out_put_prog; } @@ -2498,9 +2504,7 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) err = bpf_probe_register(link->btp, prog); if (err) { - bpf_link_defunct(&link->link); - fput(link_file); - put_unused_fd(link_fd); + bpf_link_cleanup(&link->link, link_file, link_fd); goto out_put_btp; } -- cgit v1.2.3-71-gd317 From 535911c80ad4f5801700e9d827a1985bbff41519 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:55:58 +0100 Subject: bpf: Add struct bpf_ksym Adding 'struct bpf_ksym' object that will carry the kallsym information for bpf symbol. Adding the start and end address to begin with. It will be used by bpf_prog, bpf_trampoline, bpf_dispatcher objects. The symbol_start/symbol_end values were originally used to sort bpf_prog objects. For the address displayed in /proc/kallsyms we are using prog->bpf_func value. I'm using the bpf_func value for program symbol start instead of the symbol_start, because it makes no difference for sorting bpf_prog objects and we can use it directly as an address to display it in /proc/kallsyms. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200312195610.346362-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 6 ++++++ kernel/bpf/core.c | 28 ++++++++++++---------------- 2 files changed, 18 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index fe1f8b075378..6ca3d5c8ccf3 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -471,6 +471,11 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, u64 notrace __bpf_prog_enter(void); void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start); +struct bpf_ksym { + unsigned long start; + unsigned long end; +}; + enum bpf_tramp_prog_type { BPF_TRAMP_FENTRY, BPF_TRAMP_FEXIT, @@ -653,6 +658,7 @@ struct bpf_prog_aux { u32 size_poke_tab; struct latch_tree_node ksym_tnode; struct list_head ksym_lnode; + struct bpf_ksym ksym; const struct bpf_prog_ops *ops; struct bpf_map **used_maps; struct bpf_prog *prog; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 0f9ca46e1978..e587d6306d7c 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -523,18 +523,16 @@ int bpf_jit_kallsyms __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_DEFAULT_ON); int bpf_jit_harden __read_mostly; long bpf_jit_limit __read_mostly; -static __always_inline void -bpf_get_prog_addr_region(const struct bpf_prog *prog, - unsigned long *symbol_start, - unsigned long *symbol_end) +static void +bpf_prog_ksym_set_addr(struct bpf_prog *prog) { const struct bpf_binary_header *hdr = bpf_jit_binary_hdr(prog); unsigned long addr = (unsigned long)hdr; WARN_ON_ONCE(!bpf_prog_ebpf_jited(prog)); - *symbol_start = addr; - *symbol_end = addr + hdr->pages * PAGE_SIZE; + prog->aux->ksym.start = (unsigned long) prog->bpf_func; + prog->aux->ksym.end = addr + hdr->pages * PAGE_SIZE; } void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) @@ -575,13 +573,10 @@ void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) static __always_inline unsigned long bpf_get_prog_addr_start(struct latch_tree_node *n) { - unsigned long symbol_start, symbol_end; const struct bpf_prog_aux *aux; aux = container_of(n, struct bpf_prog_aux, ksym_tnode); - bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end); - - return symbol_start; + return aux->ksym.start; } static __always_inline bool bpf_tree_less(struct latch_tree_node *a, @@ -593,15 +588,13 @@ static __always_inline bool bpf_tree_less(struct latch_tree_node *a, static __always_inline int bpf_tree_comp(void *key, struct latch_tree_node *n) { unsigned long val = (unsigned long)key; - unsigned long symbol_start, symbol_end; const struct bpf_prog_aux *aux; aux = container_of(n, struct bpf_prog_aux, ksym_tnode); - bpf_get_prog_addr_region(aux->prog, &symbol_start, &symbol_end); - if (val < symbol_start) + if (val < aux->ksym.start) return -1; - if (val >= symbol_end) + if (val >= aux->ksym.end) return 1; return 0; @@ -649,6 +642,8 @@ void bpf_prog_kallsyms_add(struct bpf_prog *fp) !capable(CAP_SYS_ADMIN)) return; + bpf_prog_ksym_set_addr(fp); + spin_lock_bh(&bpf_lock); bpf_prog_ksym_node_add(fp->aux); spin_unlock_bh(&bpf_lock); @@ -677,14 +672,15 @@ static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr) const char *__bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long *off, char *sym) { - unsigned long symbol_start, symbol_end; struct bpf_prog *prog; char *ret = NULL; rcu_read_lock(); prog = bpf_prog_kallsyms_find(addr); if (prog) { - bpf_get_prog_addr_region(prog, &symbol_start, &symbol_end); + unsigned long symbol_start = prog->aux->ksym.start; + unsigned long symbol_end = prog->aux->ksym.end; + bpf_get_prog_name(prog, sym); ret = sym; -- cgit v1.2.3-71-gd317 From bfea9a8574f34597581f74f792d044d38497b775 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:55:59 +0100 Subject: bpf: Add name to struct bpf_ksym Adding name to 'struct bpf_ksym' object to carry the name of the symbol for bpf_prog, bpf_trampoline, bpf_dispatcher objects. The current benefit is that name is now generated only when the symbol is added to the list, so we don't need to generate it every time it's accessed. The future benefit is that we will have all the bpf objects symbols represented by struct bpf_ksym. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200312195610.346362-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 2 ++ include/linux/filter.h | 6 ------ kernel/bpf/core.c | 9 ++++++--- kernel/events/core.c | 9 ++++----- 4 files changed, 12 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 6ca3d5c8ccf3..047b44deb3c5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -18,6 +18,7 @@ #include #include #include +#include struct bpf_verifier_env; struct bpf_verifier_log; @@ -474,6 +475,7 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start); struct bpf_ksym { unsigned long start; unsigned long end; + char name[KSYM_NAME_LEN]; }; enum bpf_tramp_prog_type { diff --git a/include/linux/filter.h b/include/linux/filter.h index 6249679275b3..9b5aa5c483cc 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1083,7 +1083,6 @@ bpf_address_lookup(unsigned long addr, unsigned long *size, void bpf_prog_kallsyms_add(struct bpf_prog *fp); void bpf_prog_kallsyms_del(struct bpf_prog *fp); -void bpf_get_prog_name(const struct bpf_prog *prog, char *sym); #else /* CONFIG_BPF_JIT */ @@ -1152,11 +1151,6 @@ static inline void bpf_prog_kallsyms_del(struct bpf_prog *fp) { } -static inline void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) -{ - sym[0] = '\0'; -} - #endif /* CONFIG_BPF_JIT */ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp); diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index e587d6306d7c..f6800c2d4b01 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -535,8 +535,10 @@ bpf_prog_ksym_set_addr(struct bpf_prog *prog) prog->aux->ksym.end = addr + hdr->pages * PAGE_SIZE; } -void bpf_get_prog_name(const struct bpf_prog *prog, char *sym) +static void +bpf_prog_ksym_set_name(struct bpf_prog *prog) { + char *sym = prog->aux->ksym.name; const char *end = sym + KSYM_NAME_LEN; const struct btf_type *type; const char *func_name; @@ -643,6 +645,7 @@ void bpf_prog_kallsyms_add(struct bpf_prog *fp) return; bpf_prog_ksym_set_addr(fp); + bpf_prog_ksym_set_name(fp); spin_lock_bh(&bpf_lock); bpf_prog_ksym_node_add(fp->aux); @@ -681,7 +684,7 @@ const char *__bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long symbol_start = prog->aux->ksym.start; unsigned long symbol_end = prog->aux->ksym.end; - bpf_get_prog_name(prog, sym); + strncpy(sym, prog->aux->ksym.name, KSYM_NAME_LEN); ret = sym; if (size) @@ -738,7 +741,7 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, if (it++ != symnum) continue; - bpf_get_prog_name(aux->prog, sym); + strncpy(sym, aux->ksym.name, KSYM_NAME_LEN); *value = (unsigned long)aux->prog->bpf_func; *type = BPF_SYM_ELF_TYPE; diff --git a/kernel/events/core.c b/kernel/events/core.c index bbdfac0182f4..9b89ef176247 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -8255,23 +8255,22 @@ static void perf_event_bpf_emit_ksymbols(struct bpf_prog *prog, enum perf_bpf_event_type type) { bool unregister = type == PERF_BPF_EVENT_PROG_UNLOAD; - char sym[KSYM_NAME_LEN]; int i; if (prog->aux->func_cnt == 0) { - bpf_get_prog_name(prog, sym); perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, (u64)(unsigned long)prog->bpf_func, - prog->jited_len, unregister, sym); + prog->jited_len, unregister, + prog->aux->ksym.name); } else { for (i = 0; i < prog->aux->func_cnt; i++) { struct bpf_prog *subprog = prog->aux->func[i]; - bpf_get_prog_name(subprog, sym); perf_event_ksymbol( PERF_RECORD_KSYMBOL_TYPE_BPF, (u64)(unsigned long)subprog->bpf_func, - subprog->jited_len, unregister, sym); + subprog->jited_len, unregister, + prog->aux->ksym.name); } } } -- cgit v1.2.3-71-gd317 From ecb60d1c670e9b205197d8e4381b19e77bc2d834 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:00 +0100 Subject: bpf: Move lnode list node to struct bpf_ksym Adding lnode list node to 'struct bpf_ksym' object, so the struct bpf_ksym itself can be chained and used in other objects like bpf_trampoline and bpf_dispatcher. Changing iterator to bpf_ksym in bpf_get_kallsym function. The ksym->start is holding the prog->bpf_func value, so it's ok to use it as value in bpf_get_kallsym. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200312195610.346362-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 2 +- kernel/bpf/core.c | 22 +++++++++++----------- 2 files changed, 12 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 047b44deb3c5..4fad2fa4135c 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -476,6 +476,7 @@ struct bpf_ksym { unsigned long start; unsigned long end; char name[KSYM_NAME_LEN]; + struct list_head lnode; }; enum bpf_tramp_prog_type { @@ -659,7 +660,6 @@ struct bpf_prog_aux { struct bpf_jit_poke_descriptor *poke_tab; u32 size_poke_tab; struct latch_tree_node ksym_tnode; - struct list_head ksym_lnode; struct bpf_ksym ksym; const struct bpf_prog_ops *ops; struct bpf_map **used_maps; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index f6800c2d4b01..5eb5d5bb7a95 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -97,7 +97,7 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag fp->aux->prog = fp; fp->jit_requested = ebpf_jit_enabled(); - INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode); + INIT_LIST_HEAD_RCU(&fp->aux->ksym.lnode); return fp; } @@ -613,18 +613,18 @@ static struct latch_tree_root bpf_tree __cacheline_aligned; static void bpf_prog_ksym_node_add(struct bpf_prog_aux *aux) { - WARN_ON_ONCE(!list_empty(&aux->ksym_lnode)); - list_add_tail_rcu(&aux->ksym_lnode, &bpf_kallsyms); + WARN_ON_ONCE(!list_empty(&aux->ksym.lnode)); + list_add_tail_rcu(&aux->ksym.lnode, &bpf_kallsyms); latch_tree_insert(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops); } static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux) { - if (list_empty(&aux->ksym_lnode)) + if (list_empty(&aux->ksym.lnode)) return; latch_tree_erase(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops); - list_del_rcu(&aux->ksym_lnode); + list_del_rcu(&aux->ksym.lnode); } static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp) @@ -634,8 +634,8 @@ static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp) static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp) { - return list_empty(&fp->aux->ksym_lnode) || - fp->aux->ksym_lnode.prev == LIST_POISON2; + return list_empty(&fp->aux->ksym.lnode) || + fp->aux->ksym.lnode.prev == LIST_POISON2; } void bpf_prog_kallsyms_add(struct bpf_prog *fp) @@ -729,7 +729,7 @@ out: int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *sym) { - struct bpf_prog_aux *aux; + struct bpf_ksym *ksym; unsigned int it = 0; int ret = -ERANGE; @@ -737,13 +737,13 @@ int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type, return ret; rcu_read_lock(); - list_for_each_entry_rcu(aux, &bpf_kallsyms, ksym_lnode) { + list_for_each_entry_rcu(ksym, &bpf_kallsyms, lnode) { if (it++ != symnum) continue; - strncpy(sym, aux->ksym.name, KSYM_NAME_LEN); + strncpy(sym, ksym->name, KSYM_NAME_LEN); - *value = (unsigned long)aux->prog->bpf_func; + *value = ksym->start; *type = BPF_SYM_ELF_TYPE; ret = 0; -- cgit v1.2.3-71-gd317 From ca4424c920f574b7246ff1b6d83cfdfd709e42c8 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:01 +0100 Subject: bpf: Move ksym_tnode to bpf_ksym Moving ksym_tnode list node to 'struct bpf_ksym' object, so the symbol itself can be chained and used in other objects like bpf_trampoline and bpf_dispatcher. We need bpf_ksym object to be linked both in bpf_kallsyms via lnode for /proc/kallsyms and in bpf_tree via tnode for bpf address lookup functions like __bpf_address_lookup or bpf_prog_kallsyms_find. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-7-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 2 +- kernel/bpf/core.c | 24 ++++++++++-------------- 2 files changed, 11 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4fad2fa4135c..68d66b0078df 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -477,6 +477,7 @@ struct bpf_ksym { unsigned long end; char name[KSYM_NAME_LEN]; struct list_head lnode; + struct latch_tree_node tnode; }; enum bpf_tramp_prog_type { @@ -659,7 +660,6 @@ struct bpf_prog_aux { void *jit_data; /* JIT specific data. arch dependent */ struct bpf_jit_poke_descriptor *poke_tab; u32 size_poke_tab; - struct latch_tree_node ksym_tnode; struct bpf_ksym ksym; const struct bpf_prog_ops *ops; struct bpf_map **used_maps; diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 5eb5d5bb7a95..ab1846c34167 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -572,31 +572,27 @@ bpf_prog_ksym_set_name(struct bpf_prog *prog) *sym = 0; } -static __always_inline unsigned long -bpf_get_prog_addr_start(struct latch_tree_node *n) +static unsigned long bpf_get_ksym_start(struct latch_tree_node *n) { - const struct bpf_prog_aux *aux; - - aux = container_of(n, struct bpf_prog_aux, ksym_tnode); - return aux->ksym.start; + return container_of(n, struct bpf_ksym, tnode)->start; } static __always_inline bool bpf_tree_less(struct latch_tree_node *a, struct latch_tree_node *b) { - return bpf_get_prog_addr_start(a) < bpf_get_prog_addr_start(b); + return bpf_get_ksym_start(a) < bpf_get_ksym_start(b); } static __always_inline int bpf_tree_comp(void *key, struct latch_tree_node *n) { unsigned long val = (unsigned long)key; - const struct bpf_prog_aux *aux; + const struct bpf_ksym *ksym; - aux = container_of(n, struct bpf_prog_aux, ksym_tnode); + ksym = container_of(n, struct bpf_ksym, tnode); - if (val < aux->ksym.start) + if (val < ksym->start) return -1; - if (val >= aux->ksym.end) + if (val >= ksym->end) return 1; return 0; @@ -615,7 +611,7 @@ static void bpf_prog_ksym_node_add(struct bpf_prog_aux *aux) { WARN_ON_ONCE(!list_empty(&aux->ksym.lnode)); list_add_tail_rcu(&aux->ksym.lnode, &bpf_kallsyms); - latch_tree_insert(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops); + latch_tree_insert(&aux->ksym.tnode, &bpf_tree, &bpf_tree_ops); } static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux) @@ -623,7 +619,7 @@ static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux) if (list_empty(&aux->ksym.lnode)) return; - latch_tree_erase(&aux->ksym_tnode, &bpf_tree, &bpf_tree_ops); + latch_tree_erase(&aux->ksym.tnode, &bpf_tree, &bpf_tree_ops); list_del_rcu(&aux->ksym.lnode); } @@ -668,7 +664,7 @@ static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr) n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops); return n ? - container_of(n, struct bpf_prog_aux, ksym_tnode)->prog : + container_of(n, struct bpf_prog_aux, ksym.tnode)->prog : NULL; } -- cgit v1.2.3-71-gd317 From eda0c92902b57bbde674c27882554b074e9180a6 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:02 +0100 Subject: bpf: Add bpf_ksym_find function Adding bpf_ksym_find function that is used bpf bpf address lookup functions: __bpf_address_lookup is_bpf_text_address while keeping bpf_prog_kallsyms_find to be used only for lookup of bpf_prog objects (will happen in following changes). Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-8-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- kernel/bpf/core.c | 22 +++++++++++++++------- 1 file changed, 15 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index ab1846c34167..cd380f7f015c 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -668,19 +668,27 @@ static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr) NULL; } +static struct bpf_ksym *bpf_ksym_find(unsigned long addr) +{ + struct latch_tree_node *n; + + n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops); + return n ? container_of(n, struct bpf_ksym, tnode) : NULL; +} + const char *__bpf_address_lookup(unsigned long addr, unsigned long *size, unsigned long *off, char *sym) { - struct bpf_prog *prog; + struct bpf_ksym *ksym; char *ret = NULL; rcu_read_lock(); - prog = bpf_prog_kallsyms_find(addr); - if (prog) { - unsigned long symbol_start = prog->aux->ksym.start; - unsigned long symbol_end = prog->aux->ksym.end; + ksym = bpf_ksym_find(addr); + if (ksym) { + unsigned long symbol_start = ksym->start; + unsigned long symbol_end = ksym->end; - strncpy(sym, prog->aux->ksym.name, KSYM_NAME_LEN); + strncpy(sym, ksym->name, KSYM_NAME_LEN); ret = sym; if (size) @@ -698,7 +706,7 @@ bool is_bpf_text_address(unsigned long addr) bool ret; rcu_read_lock(); - ret = bpf_prog_kallsyms_find(addr) != NULL; + ret = bpf_ksym_find(addr) != NULL; rcu_read_unlock(); return ret; -- cgit v1.2.3-71-gd317 From cbd76f8d5ac9c4e99c4ffe5e39a1e907cdf5a76f Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:03 +0100 Subject: bpf: Add prog flag to struct bpf_ksym object Adding 'prog' bool flag to 'struct bpf_ksym' to mark that this object belongs to bpf_prog object. This change allows having bpf_prog objects together with other types (trampolines and dispatchers) in the single bpf_tree. It's used when searching for bpf_prog exception tables by the bpf_prog_ksym_find function, where we need to get the bpf_prog pointer. >From now we can safely add bpf_ksym support for trampoline or dispatcher objects, because we can differentiate them from bpf_prog objects. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-9-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 1 + kernel/bpf/core.c | 22 +++++++++++----------- 2 files changed, 12 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 68d66b0078df..a0cef664c1a9 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -478,6 +478,7 @@ struct bpf_ksym { char name[KSYM_NAME_LEN]; struct list_head lnode; struct latch_tree_node tnode; + bool prog; }; enum bpf_tramp_prog_type { diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index cd380f7f015c..7516cbc65996 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -642,6 +642,7 @@ void bpf_prog_kallsyms_add(struct bpf_prog *fp) bpf_prog_ksym_set_addr(fp); bpf_prog_ksym_set_name(fp); + fp->aux->ksym.prog = true; spin_lock_bh(&bpf_lock); bpf_prog_ksym_node_add(fp->aux); @@ -658,16 +659,6 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp) spin_unlock_bh(&bpf_lock); } -static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr) -{ - struct latch_tree_node *n; - - n = latch_tree_find((void *)addr, &bpf_tree, &bpf_tree_ops); - return n ? - container_of(n, struct bpf_prog_aux, ksym.tnode)->prog : - NULL; -} - static struct bpf_ksym *bpf_ksym_find(unsigned long addr) { struct latch_tree_node *n; @@ -712,13 +703,22 @@ bool is_bpf_text_address(unsigned long addr) return ret; } +static struct bpf_prog *bpf_prog_ksym_find(unsigned long addr) +{ + struct bpf_ksym *ksym = bpf_ksym_find(addr); + + return ksym && ksym->prog ? + container_of(ksym, struct bpf_prog_aux, ksym)->prog : + NULL; +} + const struct exception_table_entry *search_bpf_extables(unsigned long addr) { const struct exception_table_entry *e = NULL; struct bpf_prog *prog; rcu_read_lock(); - prog = bpf_prog_kallsyms_find(addr); + prog = bpf_prog_ksym_find(addr); if (!prog) goto out; if (!prog->aux->num_exentries) -- cgit v1.2.3-71-gd317 From dba122fb5e122e8e07e2f11cdebc10ba4f425cf7 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:04 +0100 Subject: bpf: Add bpf_ksym_add/del functions Separating /proc/kallsyms add/del code and adding bpf_ksym_add/del functions for that. Moving bpf_prog_ksym_node_add/del functions to __bpf_ksym_add/del and changing their argument to 'struct bpf_ksym' object. This way we can call them for other bpf objects types like trampoline and dispatcher. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-10-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 3 +++ kernel/bpf/core.c | 33 +++++++++++++++++++-------------- 2 files changed, 22 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index a0cef664c1a9..ec1de88b8487 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -584,6 +584,9 @@ struct bpf_image { #define BPF_IMAGE_SIZE (PAGE_SIZE - sizeof(struct bpf_image)) bool is_bpf_image_address(unsigned long address); void *bpf_image_alloc(void); +/* Called only from JIT-enabled code, so there's no need for stubs. */ +void bpf_ksym_add(struct bpf_ksym *ksym); +void bpf_ksym_del(struct bpf_ksym *ksym); #else static inline struct bpf_trampoline *bpf_trampoline_lookup(u64 key) { diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 7516cbc65996..914f3463aa41 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -607,20 +607,29 @@ static DEFINE_SPINLOCK(bpf_lock); static LIST_HEAD(bpf_kallsyms); static struct latch_tree_root bpf_tree __cacheline_aligned; -static void bpf_prog_ksym_node_add(struct bpf_prog_aux *aux) +void bpf_ksym_add(struct bpf_ksym *ksym) { - WARN_ON_ONCE(!list_empty(&aux->ksym.lnode)); - list_add_tail_rcu(&aux->ksym.lnode, &bpf_kallsyms); - latch_tree_insert(&aux->ksym.tnode, &bpf_tree, &bpf_tree_ops); + spin_lock_bh(&bpf_lock); + WARN_ON_ONCE(!list_empty(&ksym->lnode)); + list_add_tail_rcu(&ksym->lnode, &bpf_kallsyms); + latch_tree_insert(&ksym->tnode, &bpf_tree, &bpf_tree_ops); + spin_unlock_bh(&bpf_lock); } -static void bpf_prog_ksym_node_del(struct bpf_prog_aux *aux) +static void __bpf_ksym_del(struct bpf_ksym *ksym) { - if (list_empty(&aux->ksym.lnode)) + if (list_empty(&ksym->lnode)) return; - latch_tree_erase(&aux->ksym.tnode, &bpf_tree, &bpf_tree_ops); - list_del_rcu(&aux->ksym.lnode); + latch_tree_erase(&ksym->tnode, &bpf_tree, &bpf_tree_ops); + list_del_rcu(&ksym->lnode); +} + +void bpf_ksym_del(struct bpf_ksym *ksym) +{ + spin_lock_bh(&bpf_lock); + __bpf_ksym_del(ksym); + spin_unlock_bh(&bpf_lock); } static bool bpf_prog_kallsyms_candidate(const struct bpf_prog *fp) @@ -644,9 +653,7 @@ void bpf_prog_kallsyms_add(struct bpf_prog *fp) bpf_prog_ksym_set_name(fp); fp->aux->ksym.prog = true; - spin_lock_bh(&bpf_lock); - bpf_prog_ksym_node_add(fp->aux); - spin_unlock_bh(&bpf_lock); + bpf_ksym_add(&fp->aux->ksym); } void bpf_prog_kallsyms_del(struct bpf_prog *fp) @@ -654,9 +661,7 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp) if (!bpf_prog_kallsyms_candidate(fp)) return; - spin_lock_bh(&bpf_lock); - bpf_prog_ksym_node_del(fp->aux); - spin_unlock_bh(&bpf_lock); + bpf_ksym_del(&fp->aux->ksym); } static struct bpf_ksym *bpf_ksym_find(unsigned long addr) -- cgit v1.2.3-71-gd317 From a108f7dcfa010e3da825af90d77ac0a6a0240992 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:05 +0100 Subject: bpf: Add trampolines to kallsyms Adding trampolines to kallsyms. It's displayed as bpf_trampoline_ [bpf] where ID is the BTF id of the trampoline function. Adding bpf_image_ksym_add/del functions that setup the start/end values and call KSYMBOL perf events handlers. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-11-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 3 +++ kernel/bpf/trampoline.c | 28 ++++++++++++++++++++++++++++ 2 files changed, 31 insertions(+) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index ec1de88b8487..083860be1944 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -513,6 +513,7 @@ struct bpf_trampoline { /* Executable image of trampoline */ void *image; u64 selector; + struct bpf_ksym ksym; }; #define BPF_DISPATCHER_MAX 48 /* Fits in 2048B */ @@ -585,6 +586,8 @@ struct bpf_image { bool is_bpf_image_address(unsigned long address); void *bpf_image_alloc(void); /* Called only from JIT-enabled code, so there's no need for stubs. */ +void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym); +void bpf_image_ksym_del(struct bpf_ksym *ksym); void bpf_ksym_add(struct bpf_ksym *ksym); void bpf_ksym_del(struct bpf_ksym *ksym); #else diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 221a17af1f81..36549c9afec4 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -5,6 +5,7 @@ #include #include #include +#include /* dummy _ops. The verifier will operate on target program's ops. */ const struct bpf_verifier_ops bpf_extension_verifier_ops = { @@ -96,6 +97,30 @@ bool is_bpf_image_address(unsigned long addr) return ret; } +void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym) +{ + ksym->start = (unsigned long) data; + ksym->end = ksym->start + BPF_IMAGE_SIZE; + bpf_ksym_add(ksym); + perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, ksym->start, + BPF_IMAGE_SIZE, false, ksym->name); +} + +void bpf_image_ksym_del(struct bpf_ksym *ksym) +{ + bpf_ksym_del(ksym); + perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, ksym->start, + BPF_IMAGE_SIZE, true, ksym->name); +} + +static void bpf_trampoline_ksym_add(struct bpf_trampoline *tr) +{ + struct bpf_ksym *ksym = &tr->ksym; + + snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu", tr->key); + bpf_image_ksym_add(tr->image, ksym); +} + struct bpf_trampoline *bpf_trampoline_lookup(u64 key) { struct bpf_trampoline *tr; @@ -131,6 +156,8 @@ struct bpf_trampoline *bpf_trampoline_lookup(u64 key) for (i = 0; i < BPF_TRAMP_MAX; i++) INIT_HLIST_HEAD(&tr->progs_hlist[i]); tr->image = image; + INIT_LIST_HEAD_RCU(&tr->ksym.lnode); + bpf_trampoline_ksym_add(tr); out: mutex_unlock(&trampoline_mutex); return tr; @@ -368,6 +395,7 @@ void bpf_trampoline_put(struct bpf_trampoline *tr) goto out; if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT]))) goto out; + bpf_image_ksym_del(&tr->ksym); image = container_of(tr->image, struct bpf_image, data); latch_tree_erase(&image->tnode, &image_tree, &image_tree_ops); /* wait for tasks to get out of trampoline before freeing it */ -- cgit v1.2.3-71-gd317 From 517b75e44c7be9c776aa5f7beaa85baff3868f80 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:06 +0100 Subject: bpf: Add dispatchers to kallsyms Adding dispatchers to kallsyms. It's displayed as bpf_dispatcher_ where NAME is the name of dispatcher. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-12-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 19 ++++++++++++------- kernel/bpf/dispatcher.c | 1 + 2 files changed, 13 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 083860be1944..86cacb54ba23 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -531,6 +531,7 @@ struct bpf_dispatcher { int num_progs; void *image; u32 image_off; + struct bpf_ksym ksym; }; static __always_inline unsigned int bpf_dispatcher_nop_func( @@ -546,13 +547,17 @@ struct bpf_trampoline *bpf_trampoline_lookup(u64 key); int bpf_trampoline_link_prog(struct bpf_prog *prog); int bpf_trampoline_unlink_prog(struct bpf_prog *prog); void bpf_trampoline_put(struct bpf_trampoline *tr); -#define BPF_DISPATCHER_INIT(name) { \ - .mutex = __MUTEX_INITIALIZER(name.mutex), \ - .func = &name##_func, \ - .progs = {}, \ - .num_progs = 0, \ - .image = NULL, \ - .image_off = 0 \ +#define BPF_DISPATCHER_INIT(_name) { \ + .mutex = __MUTEX_INITIALIZER(_name.mutex), \ + .func = &_name##_func, \ + .progs = {}, \ + .num_progs = 0, \ + .image = NULL, \ + .image_off = 0, \ + .ksym = { \ + .name = #_name, \ + .lnode = LIST_HEAD_INIT(_name.ksym.lnode), \ + }, \ } #define DEFINE_BPF_DISPATCHER(name) \ diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c index b3e5b214fed8..a2679bae9e73 100644 --- a/kernel/bpf/dispatcher.c +++ b/kernel/bpf/dispatcher.c @@ -143,6 +143,7 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, d->image = bpf_image_alloc(); if (!d->image) goto out; + bpf_image_ksym_add(d->image, &d->ksym); } prev_num_progs = d->num_progs; -- cgit v1.2.3-71-gd317 From 7ac88eba185b4d0e06a71678e54bc092edcd3af3 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Thu, 12 Mar 2020 20:56:07 +0100 Subject: bpf: Remove bpf_image tree Now that we have all the objects (bpf_prog, bpf_trampoline, bpf_dispatcher) linked in bpf_tree, there's no need to have separate bpf_image tree for images. Reverting the bpf_image tree together with struct bpf_image, because it's no longer needed. Also removing bpf_image_alloc function and adding the original bpf_jit_alloc_exec_page interface instead. The kernel_text_address function can now rely only on is_bpf_text_address, because it checks the bpf_tree that contains all the objects. Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are useful wrappers with perf's ksymbol interface calls. Signed-off-by: Jiri Olsa Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 8 +---- kernel/bpf/dispatcher.c | 4 +-- kernel/bpf/trampoline.c | 83 ++++++------------------------------------------- kernel/extable.c | 2 -- 4 files changed, 13 insertions(+), 84 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 86cacb54ba23..bdb981c204fa 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -583,14 +583,8 @@ void bpf_trampoline_put(struct bpf_trampoline *tr); #define BPF_DISPATCHER_PTR(name) (&bpf_dispatcher_##name) void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, struct bpf_prog *to); -struct bpf_image { - struct latch_tree_node tnode; - unsigned char data[]; -}; -#define BPF_IMAGE_SIZE (PAGE_SIZE - sizeof(struct bpf_image)) -bool is_bpf_image_address(unsigned long address); -void *bpf_image_alloc(void); /* Called only from JIT-enabled code, so there's no need for stubs. */ +void *bpf_jit_alloc_exec_page(void); void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym); void bpf_image_ksym_del(struct bpf_ksym *ksym); void bpf_ksym_add(struct bpf_ksym *ksym); diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c index a2679bae9e73..2444bd15cc2d 100644 --- a/kernel/bpf/dispatcher.c +++ b/kernel/bpf/dispatcher.c @@ -113,7 +113,7 @@ static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs) noff = 0; } else { old = d->image + d->image_off; - noff = d->image_off ^ (BPF_IMAGE_SIZE / 2); + noff = d->image_off ^ (PAGE_SIZE / 2); } new = d->num_progs ? d->image + noff : NULL; @@ -140,7 +140,7 @@ void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, mutex_lock(&d->mutex); if (!d->image) { - d->image = bpf_image_alloc(); + d->image = bpf_jit_alloc_exec_page(); if (!d->image) goto out; bpf_image_ksym_add(d->image, &d->ksym); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 36549c9afec4..f42f700c1d28 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -18,12 +18,11 @@ const struct bpf_prog_ops bpf_extension_prog_ops = { #define TRAMPOLINE_TABLE_SIZE (1 << TRAMPOLINE_HASH_BITS) static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE]; -static struct latch_tree_root image_tree __cacheline_aligned; -/* serializes access to trampoline_table and image_tree */ +/* serializes access to trampoline_table */ static DEFINE_MUTEX(trampoline_mutex); -static void *bpf_jit_alloc_exec_page(void) +void *bpf_jit_alloc_exec_page(void) { void *image; @@ -39,78 +38,20 @@ static void *bpf_jit_alloc_exec_page(void) return image; } -static __always_inline bool image_tree_less(struct latch_tree_node *a, - struct latch_tree_node *b) -{ - struct bpf_image *ia = container_of(a, struct bpf_image, tnode); - struct bpf_image *ib = container_of(b, struct bpf_image, tnode); - - return ia < ib; -} - -static __always_inline int image_tree_comp(void *addr, struct latch_tree_node *n) -{ - void *image = container_of(n, struct bpf_image, tnode); - - if (addr < image) - return -1; - if (addr >= image + PAGE_SIZE) - return 1; - - return 0; -} - -static const struct latch_tree_ops image_tree_ops = { - .less = image_tree_less, - .comp = image_tree_comp, -}; - -static void *__bpf_image_alloc(bool lock) -{ - struct bpf_image *image; - - image = bpf_jit_alloc_exec_page(); - if (!image) - return NULL; - - if (lock) - mutex_lock(&trampoline_mutex); - latch_tree_insert(&image->tnode, &image_tree, &image_tree_ops); - if (lock) - mutex_unlock(&trampoline_mutex); - return image->data; -} - -void *bpf_image_alloc(void) -{ - return __bpf_image_alloc(true); -} - -bool is_bpf_image_address(unsigned long addr) -{ - bool ret; - - rcu_read_lock(); - ret = latch_tree_find((void *) addr, &image_tree, &image_tree_ops) != NULL; - rcu_read_unlock(); - - return ret; -} - void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym) { ksym->start = (unsigned long) data; - ksym->end = ksym->start + BPF_IMAGE_SIZE; + ksym->end = ksym->start + PAGE_SIZE; bpf_ksym_add(ksym); perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, ksym->start, - BPF_IMAGE_SIZE, false, ksym->name); + PAGE_SIZE, false, ksym->name); } void bpf_image_ksym_del(struct bpf_ksym *ksym) { bpf_ksym_del(ksym); perf_event_ksymbol(PERF_RECORD_KSYMBOL_TYPE_BPF, ksym->start, - BPF_IMAGE_SIZE, true, ksym->name); + PAGE_SIZE, true, ksym->name); } static void bpf_trampoline_ksym_add(struct bpf_trampoline *tr) @@ -141,7 +82,7 @@ struct bpf_trampoline *bpf_trampoline_lookup(u64 key) goto out; /* is_root was checked earlier. No need for bpf_jit_charge_modmem() */ - image = __bpf_image_alloc(false); + image = bpf_jit_alloc_exec_page(); if (!image) { kfree(tr); tr = NULL; @@ -243,8 +184,8 @@ bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total) static int bpf_trampoline_update(struct bpf_trampoline *tr) { - void *old_image = tr->image + ((tr->selector + 1) & 1) * BPF_IMAGE_SIZE/2; - void *new_image = tr->image + (tr->selector & 1) * BPF_IMAGE_SIZE/2; + void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2; + void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2; struct bpf_tramp_progs *tprogs; u32 flags = BPF_TRAMP_F_RESTORE_REGS; int err, total; @@ -272,7 +213,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) synchronize_rcu_tasks(); - err = arch_prepare_bpf_trampoline(new_image, new_image + BPF_IMAGE_SIZE / 2, + err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2, &tr->func.model, flags, tprogs, tr->func.addr); if (err < 0) @@ -383,8 +324,6 @@ out: void bpf_trampoline_put(struct bpf_trampoline *tr) { - struct bpf_image *image; - if (!tr) return; mutex_lock(&trampoline_mutex); @@ -396,11 +335,9 @@ void bpf_trampoline_put(struct bpf_trampoline *tr) if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT]))) goto out; bpf_image_ksym_del(&tr->ksym); - image = container_of(tr->image, struct bpf_image, data); - latch_tree_erase(&image->tnode, &image_tree, &image_tree_ops); /* wait for tasks to get out of trampoline before freeing it */ synchronize_rcu_tasks(); - bpf_jit_free_exec(image); + bpf_jit_free_exec(tr->image); hlist_del(&tr->hlist); kfree(tr); out: diff --git a/kernel/extable.c b/kernel/extable.c index a0024f27d3a1..7681f87e89dd 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -149,8 +149,6 @@ int kernel_text_address(unsigned long addr) goto out; if (is_bpf_text_address(addr)) goto out; - if (is_bpf_image_address(addr)) - goto out; ret = 0; out: if (no_rcu) -- cgit v1.2.3-71-gd317 From dcce11d545cc5d04c3fb529a8e2a0da987389139 Mon Sep 17 00:00:00 2001 From: Jules Irenge Date: Wed, 11 Mar 2020 01:09:01 +0000 Subject: bpf: Add missing annotations for __bpf_prog_enter() and __bpf_prog_exit() Sparse reports a warning at __bpf_prog_enter() and __bpf_prog_exit() warning: context imbalance in __bpf_prog_enter() - wrong count at exit warning: context imbalance in __bpf_prog_exit() - unexpected unlock The root cause is the missing annotation at __bpf_prog_enter() and __bpf_prog_exit() Add the missing __acquires(RCU) annotation Add the missing __releases(RCU) annotation Signed-off-by: Jules Irenge Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200311010908.42366-2-jbi.octave@gmail.com --- kernel/bpf/trampoline.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index f42f700c1d28..f30bca2a4d01 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -352,6 +352,7 @@ out: * call __bpf_prog_exit */ u64 notrace __bpf_prog_enter(void) + __acquires(RCU) { u64 start = 0; @@ -363,6 +364,7 @@ u64 notrace __bpf_prog_enter(void) } void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) + __releases(RCU) { struct bpf_prog_stats *stats; -- cgit v1.2.3-71-gd317 From fba616a49fe8929f4b4066f4384039db274cb73e Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Sat, 7 Mar 2020 19:27:01 -0800 Subject: PM / hibernate: Remove unnecessary compat ioctl overrides Since the SNAPSHOT_GET_IMAGE_SIZE, SNAPSHOT_AVAIL_SWAP_SIZE, and SNAPSHOT_ALLOC_SWAP_PAGE ioctls produce an loff_t result, and loff_t is always 64-bit even in the compat case, there's no reason to have the special compat handling for these ioctls. Just remove this unneeded code so that these ioctls call into snapshot_ioctl() directly, doing just the compat_ptr() conversion on the argument. (This unnecessary code was also causing a KMSAN false positive.) Signed-off-by: Eric Biggers Signed-off-by: Rafael J. Wysocki --- kernel/power/user.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/power/user.c b/kernel/power/user.c index 77438954cc2b..58ed9478787f 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -409,21 +409,7 @@ snapshot_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) switch (cmd) { case SNAPSHOT_GET_IMAGE_SIZE: case SNAPSHOT_AVAIL_SWAP_SIZE: - case SNAPSHOT_ALLOC_SWAP_PAGE: { - compat_loff_t __user *uoffset = compat_ptr(arg); - loff_t offset; - mm_segment_t old_fs; - int err; - - old_fs = get_fs(); - set_fs(KERNEL_DS); - err = snapshot_ioctl(file, cmd, (unsigned long) &offset); - set_fs(old_fs); - if (!err && put_user(offset, uoffset)) - err = -EFAULT; - return err; - } - + case SNAPSHOT_ALLOC_SWAP_PAGE: case SNAPSHOT_CREATE_IMAGE: return snapshot_ioctl(file, cmd, (unsigned long) compat_ptr(arg)); -- cgit v1.2.3-71-gd317 From 286c21de32b904131f8cf6a36ce40b8b0c9c5da3 Mon Sep 17 00:00:00 2001 From: Kevin Grandemange Date: Thu, 12 Mar 2020 15:41:45 +0000 Subject: dma-coherent: fix integer overflow in the reserved-memory dma allocation pageno is an int and the PAGE_SHIFT shift is done on an int, overflowing if the memory is bigger than 2G This can be reproduced using for example a reserved-memory of 4G reserved-memory { #address-cells = <2>; #size-cells = <2>; ranges; reserved_dma: buffer@0 { compatible = "shared-dma-pool"; no-map; reg = <0x5 0x00000000 0x1 0x0>; }; }; Signed-off-by: Kevin Grandemange Signed-off-by: Christoph Hellwig --- kernel/dma/coherent.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 551b0eb7028a..2a0c4985f38e 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -134,7 +134,7 @@ static void *__dma_alloc_from_coherent(struct device *dev, spin_lock_irqsave(&mem->spinlock, flags); - if (unlikely(size > (mem->size << PAGE_SHIFT))) + if (unlikely(size > ((dma_addr_t)mem->size << PAGE_SHIFT))) goto err; pageno = bitmap_find_free_region(mem->bitmap, mem->size, order); @@ -144,8 +144,9 @@ static void *__dma_alloc_from_coherent(struct device *dev, /* * Memory was found in the coherent area. */ - *dma_handle = dma_get_device_base(dev, mem) + (pageno << PAGE_SHIFT); - ret = mem->virt_base + (pageno << PAGE_SHIFT); + *dma_handle = dma_get_device_base(dev, mem) + + ((dma_addr_t)pageno << PAGE_SHIFT); + ret = mem->virt_base + ((dma_addr_t)pageno << PAGE_SHIFT); spin_unlock_irqrestore(&mem->spinlock, flags); memset(ret, 0, size); return ret; @@ -194,7 +195,7 @@ static int __dma_release_from_coherent(struct dma_coherent_mem *mem, int order, void *vaddr) { if (mem && vaddr >= mem->virt_base && vaddr < - (mem->virt_base + (mem->size << PAGE_SHIFT))) { + (mem->virt_base + ((dma_addr_t)mem->size << PAGE_SHIFT))) { int page = (vaddr - mem->virt_base) >> PAGE_SHIFT; unsigned long flags; @@ -238,10 +239,10 @@ static int __dma_mmap_from_coherent(struct dma_coherent_mem *mem, struct vm_area_struct *vma, void *vaddr, size_t size, int *ret) { if (mem && vaddr >= mem->virt_base && vaddr + size <= - (mem->virt_base + (mem->size << PAGE_SHIFT))) { + (mem->virt_base + ((dma_addr_t)mem->size << PAGE_SHIFT))) { unsigned long off = vma->vm_pgoff; int start = (vaddr - mem->virt_base) >> PAGE_SHIFT; - int user_count = vma_pages(vma); + unsigned long user_count = vma_pages(vma); int count = PAGE_ALIGN(size) >> PAGE_SHIFT; *ret = -ENXIO; -- cgit v1.2.3-71-gd317 From 3d0fc341c4bb66b2c41c0d1ec954a6d300e100b7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 21 Feb 2020 12:26:00 -0800 Subject: dma-direct: consolidate the error handling in dma_direct_alloc_pages Use a goto label to merge two error return cases. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- kernel/dma/direct.c | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 6af7ae83c4ad..650580fbbff3 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -169,11 +169,8 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, ret = dma_common_contiguous_remap(page, PAGE_ALIGN(size), dma_pgprot(dev, PAGE_KERNEL, attrs), __builtin_return_address(0)); - if (!ret) { - dma_free_contiguous(dev, page, size); - return ret; - } - + if (!ret) + goto out_free_pages; memset(ret, 0, size); goto done; } @@ -186,8 +183,7 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, * so log an error and fail. */ dev_info(dev, "Rejecting highmem page from CMA.\n"); - dma_free_contiguous(dev, page, size); - return NULL; + goto out_free_pages; } ret = page_address(page); @@ -207,6 +203,9 @@ done: else *dma_handle = phys_to_dma(dev, page_to_phys(page)); return ret; +out_free_pages: + dma_free_contiguous(dev, page, size); + return NULL; } void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, -- cgit v1.2.3-71-gd317 From fa7e2247c5729f990c7456fe09f3af99c8f2571b Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 21 Feb 2020 15:55:43 -0800 Subject: dma-direct: make uncached_kernel_address more general Rename the symbol to arch_dma_set_uncached, and pass a size to it as well as allow an error return. That will allow reusing this hook for in-place pagetable remapping. As the in-place remap doesn't always require an explicit cache flush, also detangle ARCH_HAS_DMA_PREP_COHERENT from ARCH_HAS_DMA_SET_UNCACHED. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- arch/Kconfig | 8 ++++---- arch/microblaze/Kconfig | 2 +- arch/microblaze/mm/consistent.c | 2 +- arch/mips/Kconfig | 3 ++- arch/mips/mm/dma-noncoherent.c | 2 +- arch/nios2/Kconfig | 3 ++- arch/nios2/mm/dma-mapping.c | 2 +- arch/xtensa/Kconfig | 2 +- arch/xtensa/kernel/pci-dma.c | 2 +- include/linux/dma-noncoherent.h | 2 +- kernel/dma/direct.c | 10 ++++++---- 11 files changed, 21 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/arch/Kconfig b/arch/Kconfig index 7994b239f155..090cfe0c82a7 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -248,11 +248,11 @@ config ARCH_HAS_SET_DIRECT_MAP bool # -# Select if arch has an uncached kernel segment and provides the -# uncached_kernel_address symbol to use it +# Select if the architecture provides the arch_dma_set_uncached symbol to +# either provide an uncached segement alias for a DMA allocation, or +# to remap the page tables in place. # -config ARCH_HAS_UNCACHED_SEGMENT - select ARCH_HAS_DMA_PREP_COHERENT +config ARCH_HAS_DMA_SET_UNCACHED bool # Select if arch init_task must go in the __init_task_data section diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index 6a331bd57ea8..9606c244b5b8 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -8,7 +8,7 @@ config MICROBLAZE select ARCH_HAS_GCOV_PROFILE_ALL select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE - select ARCH_HAS_UNCACHED_SEGMENT if !MMU + select ARCH_HAS_DMA_SET_UNCACHED if !MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_WANT_IPC_PARSE_VERSION select BUILDTIME_TABLE_SORT diff --git a/arch/microblaze/mm/consistent.c b/arch/microblaze/mm/consistent.c index cede7c5e8135..e09b66e43cb6 100644 --- a/arch/microblaze/mm/consistent.c +++ b/arch/microblaze/mm/consistent.c @@ -40,7 +40,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size) #define UNCACHED_SHADOW_MASK 0 #endif /* CONFIG_XILINX_UNCACHED_SHADOW */ -void *uncached_kernel_address(void *ptr) +void *arch_dma_set_uncached(void *ptr, size_t size) { unsigned long addr = (unsigned long)ptr; diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 797d7f1ad5fe..489185db501e 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -1187,8 +1187,9 @@ config DMA_NONCOHERENT # significant advantages. # select ARCH_HAS_DMA_WRITE_COMBINE + select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_SYNC_DMA_FOR_DEVICE - select ARCH_HAS_UNCACHED_SEGMENT + select ARCH_HAS_DMA_SET_UNCACHED select DMA_NONCOHERENT_MMAP select DMA_NONCOHERENT_CACHE_SYNC select NEED_DMA_MAP_STATE diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c index 77dce28ad0a0..fcea92d95d86 100644 --- a/arch/mips/mm/dma-noncoherent.c +++ b/arch/mips/mm/dma-noncoherent.c @@ -49,7 +49,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size) dma_cache_wback_inv((unsigned long)page_address(page), size); } -void *uncached_kernel_address(void *addr) +void *arch_dma_set_uncached(void *addr, size_t size) { return (void *)(__pa(addr) + UNCAC_BASE); } diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig index 44b5da37e8bd..2fc4ed210b5f 100644 --- a/arch/nios2/Kconfig +++ b/arch/nios2/Kconfig @@ -2,9 +2,10 @@ config NIOS2 def_bool y select ARCH_32BIT_OFF_T + select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_SYNC_DMA_FOR_CPU select ARCH_HAS_SYNC_DMA_FOR_DEVICE - select ARCH_HAS_UNCACHED_SEGMENT + select ARCH_HAS_DMA_SET_UNCACHED select ARCH_NO_SWAP select TIMER_OF select GENERIC_ATOMIC64 diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c index f30f2749257c..fd887d5f3f9a 100644 --- a/arch/nios2/mm/dma-mapping.c +++ b/arch/nios2/mm/dma-mapping.c @@ -67,7 +67,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size) flush_dcache_range(start, start + size); } -void *uncached_kernel_address(void *ptr) +void *arch_dma_set_uncached(void *ptr, size_t size) { unsigned long addr = (unsigned long)ptr; diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig index 32ee759a3fda..de229424b659 100644 --- a/arch/xtensa/Kconfig +++ b/arch/xtensa/Kconfig @@ -6,7 +6,7 @@ config XTENSA select ARCH_HAS_DMA_PREP_COHERENT if MMU select ARCH_HAS_SYNC_DMA_FOR_CPU if MMU select ARCH_HAS_SYNC_DMA_FOR_DEVICE if MMU - select ARCH_HAS_UNCACHED_SEGMENT if MMU + select ARCH_HAS_DMA_SET_UNCACHED if MMU select ARCH_USE_QUEUED_RWLOCKS select ARCH_USE_QUEUED_SPINLOCKS select ARCH_WANT_FRAME_POINTERS diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c index 6a685545d5c9..17c4384f8495 100644 --- a/arch/xtensa/kernel/pci-dma.c +++ b/arch/xtensa/kernel/pci-dma.c @@ -92,7 +92,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size) * coherent DMA memory operations when CONFIG_MMU is not enabled. */ #ifdef CONFIG_MMU -void *uncached_kernel_address(void *p) +void *arch_dma_set_uncached(void *p, size_t size) { return p + XCHAL_KSEG_BYPASS_VADDR - XCHAL_KSEG_CACHED_VADDR; } diff --git a/include/linux/dma-noncoherent.h b/include/linux/dma-noncoherent.h index b6b72e19b0cd..1a4039506673 100644 --- a/include/linux/dma-noncoherent.h +++ b/include/linux/dma-noncoherent.h @@ -108,6 +108,6 @@ static inline void arch_dma_prep_coherent(struct page *page, size_t size) } #endif /* CONFIG_ARCH_HAS_DMA_PREP_COHERENT */ -void *uncached_kernel_address(void *addr); +void *arch_dma_set_uncached(void *addr, size_t size); #endif /* _LINUX_DMA_NONCOHERENT_H */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 650580fbbff3..baf4e93735c3 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -192,10 +192,12 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size, memset(ret, 0, size); - if (IS_ENABLED(CONFIG_ARCH_HAS_UNCACHED_SEGMENT) && + if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) && dma_alloc_need_uncached(dev, attrs)) { arch_dma_prep_coherent(page, size); - ret = uncached_kernel_address(ret); + ret = arch_dma_set_uncached(ret, size); + if (IS_ERR(ret)) + goto out_free_pages; } done: if (force_dma_unencrypted(dev)) @@ -236,7 +238,7 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs) { - if (!IS_ENABLED(CONFIG_ARCH_HAS_UNCACHED_SEGMENT) && + if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) && !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && dma_alloc_need_uncached(dev, attrs)) return arch_dma_alloc(dev, size, dma_handle, gfp, attrs); @@ -246,7 +248,7 @@ void *dma_direct_alloc(struct device *dev, size_t size, void dma_direct_free(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs) { - if (!IS_ENABLED(CONFIG_ARCH_HAS_UNCACHED_SEGMENT) && + if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) && !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && dma_alloc_need_uncached(dev, attrs)) arch_dma_free(dev, size, cpu_addr, dma_addr, attrs); -- cgit v1.2.3-71-gd317 From 999a5d1203baa7cff00586361feae263ee3f23a5 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Fri, 21 Feb 2020 12:35:05 -0800 Subject: dma-direct: provide a arch_dma_clear_uncached hook This allows the arch code to reset the page tables to cached access when freeing a dma coherent allocation that was set to uncached using arch_dma_set_uncached. Signed-off-by: Christoph Hellwig Reviewed-by: Robin Murphy --- arch/Kconfig | 7 +++++++ include/linux/dma-noncoherent.h | 1 + kernel/dma/direct.c | 2 ++ 3 files changed, 10 insertions(+) (limited to 'kernel') diff --git a/arch/Kconfig b/arch/Kconfig index 090cfe0c82a7..c26302f90c96 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -255,6 +255,13 @@ config ARCH_HAS_SET_DIRECT_MAP config ARCH_HAS_DMA_SET_UNCACHED bool +# +# Select if the architectures provides the arch_dma_clear_uncached symbol +# to undo an in-place page table remap for uncached access. +# +config ARCH_HAS_DMA_CLEAR_UNCACHED + bool + # Select if arch init_task must go in the __init_task_data section config ARCH_TASK_STRUCT_ON_STACK bool diff --git a/include/linux/dma-noncoherent.h b/include/linux/dma-noncoherent.h index 1a4039506673..b59f1b6be3e9 100644 --- a/include/linux/dma-noncoherent.h +++ b/include/linux/dma-noncoherent.h @@ -109,5 +109,6 @@ static inline void arch_dma_prep_coherent(struct page *page, size_t size) #endif /* CONFIG_ARCH_HAS_DMA_PREP_COHERENT */ void *arch_dma_set_uncached(void *addr, size_t size); +void arch_dma_clear_uncached(void *addr, size_t size); #endif /* _LINUX_DMA_NONCOHERENT_H */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index baf4e93735c3..412f560dc69f 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -231,6 +231,8 @@ void dma_direct_free_pages(struct device *dev, size_t size, void *cpu_addr, if (IS_ENABLED(CONFIG_DMA_REMAP) && is_vmalloc_addr(cpu_addr)) vunmap(cpu_addr); + else if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED)) + arch_dma_clear_uncached(cpu_addr, size); dma_free_contiguous(dev, dma_direct_to_page(dev, dma_addr), size); } -- cgit v1.2.3-71-gd317 From 38aca3071cebc90e6b07abd697cba5c9d7b37a94 Mon Sep 17 00:00:00 2001 From: Daniel Xu Date: Thu, 12 Mar 2020 13:03:17 -0700 Subject: cgroupfs: Support user xattrs This patch turns on xattr support for cgroupfs. This is useful for letting non-root owners of delegated subtrees attach metadata to cgroups. One use case is for subtree owners to tell a userspace out of memory killer to bias away from killing specific subtrees. Tests: [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice --remove user.name$i; done setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute setfattr: workload.slice: No such attribute [/sys/fs/cgroup]# for i in $(seq 0 130); \ do setfattr workload.slice -n user.name$i -v wow; done setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device setfattr: workload.slice: No space left on device `seq 0 130` is inclusive, and 131 - 128 = 3, which is the number of errors we expect to see. [/data]# cat testxattr.c #include #include #include #include int main() { char name[256]; char *buf = malloc(64 << 10); if (!buf) { perror("malloc"); return 1; } for (int i = 0; i < 4; ++i) { snprintf(name, 256, "user.bigone%d", i); if (setxattr("/sys/fs/cgroup/system.slice", name, buf, 64 << 10, 0)) { printf("setxattr failed on iteration=%d\n", i); return 1; } } return 0; } [/data]# ./a.out setxattr failed on iteration=2 [/data]# ./a.out setxattr failed on iteration=0 [/sys/fs/cgroup]# setfattr -x user.bigone0 system.slice/ [/sys/fs/cgroup]# setfattr -x user.bigone1 system.slice/ [/data]# ./a.out setxattr failed on iteration=2 Signed-off-by: Daniel Xu Acked-by: Chris Down Reviewed-by: Greg Kroah-Hartman Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 114dcbf8d4f4..33ff9ec4a523 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1954,7 +1954,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask) root->kf_root = kernfs_create_root(kf_sops, KERNFS_ROOT_CREATE_DEACTIVATED | - KERNFS_ROOT_SUPPORT_EXPORTOP, + KERNFS_ROOT_SUPPORT_EXPORTOP | + KERNFS_ROOT_SUPPORT_USER_XATTR, root_cgrp); if (IS_ERR(root->kf_root)) { ret = PTR_ERR(root->kf_root); -- cgit v1.2.3-71-gd317 From 17c4a2ae15a7aaefe84bdb271952678c5c9cd8e1 Mon Sep 17 00:00:00 2001 From: Thomas Hellstrom Date: Wed, 4 Mar 2020 12:45:27 +0100 Subject: dma-mapping: Fix dma_pgprot() for unencrypted coherent pages When dma_mmap_coherent() sets up a mapping to unencrypted coherent memory under SEV encryption and sometimes under SME encryption, it will actually set up an encrypted mapping rather than an unencrypted, causing devices that DMAs from that memory to read encrypted contents. Fix this. When force_dma_unencrypted() returns true, the linear kernel map of the coherent pages have had the encryption bit explicitly cleared and the page content is unencrypted. Make sure that any additional PTEs we set up to these pages also have the encryption bit cleared by having dma_pgprot() return a protection with the encryption bit cleared in this case. Signed-off-by: Thomas Hellstrom Signed-off-by: Borislav Petkov Reviewed-by: Christoph Hellwig Acked-by: Tom Lendacky Link: https://lkml.kernel.org/r/20200304114527.3636-3-thomas_os@shipmail.org --- kernel/dma/mapping.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index 12ff766ec1fa..98e3d873792e 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -154,6 +154,8 @@ EXPORT_SYMBOL(dma_get_sgtable_attrs); */ pgprot_t dma_pgprot(struct device *dev, pgprot_t prot, unsigned long attrs) { + if (force_dma_unencrypted(dev)) + prot = pgprot_decrypted(prot); if (dev_is_dma_coherent(dev) || (IS_ENABLED(CONFIG_DMA_NONCOHERENT_CACHE_SYNC) && (attrs & DMA_ATTR_NON_CONSISTENT))) -- cgit v1.2.3-71-gd317 From 6ce6ae7c178b95f83ca0e15bd2ac961425a3af5c Mon Sep 17 00:00:00 2001 From: Zhenzhong Duan Date: Wed, 11 Mar 2020 15:16:53 +0800 Subject: misc: cleanup minor number definitions in c file into miscdevice.h HWRNG_MINOR and RNG_MISCDEV_MINOR are duplicate definitions, use unified HWRNG_MINOR instead and moved into miscdevice.h ANSLCD_MINOR and LCD_MINOR are duplicate definitions, use unified LCD_MINOR instead and moved into miscdevice.h MISCDEV_MINOR is renamed to PXA3XX_GCU_MINOR and moved into miscdevice.h Other definitions are just moved without any change. Link: https://lore.kernel.org/lkml/20200120221323.GJ15860@mit.edu/t/ Suggested-by: Arnd Bergmann Build-tested-by: Willy TARREAU Build-tested-by: Miguel Ojeda Signed-off-by: Zhenzhong Duan Acked-by: Miguel Ojeda Acked-by: Arnd Bergmann Acked-by: Herbert Xu Link: https://lore.kernel.org/r/20200311071654.335-2-zhenzhong.duan@gmail.com Signed-off-by: Greg Kroah-Hartman --- arch/um/drivers/random.c | 4 +--- drivers/auxdisplay/charlcd.c | 2 -- drivers/auxdisplay/panel.c | 2 -- drivers/char/applicom.c | 1 - drivers/char/nwbutton.h | 1 - drivers/char/toshiba.c | 2 -- drivers/macintosh/ans-lcd.c | 2 +- drivers/macintosh/ans-lcd.h | 2 -- drivers/macintosh/via-pmu.c | 3 --- drivers/sbus/char/envctrl.c | 2 -- drivers/sbus/char/uctrl.c | 2 -- drivers/video/fbdev/pxa3xx-gcu.c | 7 +++---- include/linux/miscdevice.h | 10 ++++++++++ kernel/power/user.c | 2 -- 14 files changed, 15 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/arch/um/drivers/random.c b/arch/um/drivers/random.c index 1d5d3057e6f1..ce115fce52f0 100644 --- a/arch/um/drivers/random.c +++ b/arch/um/drivers/random.c @@ -23,8 +23,6 @@ #define RNG_VERSION "1.0.0" #define RNG_MODULE_NAME "hw_random" -#define RNG_MISCDEV_MINOR 183 /* official */ - /* Changed at init time, in the non-modular case, and at module load * time, in the module case. Presumably, the module subsystem * protects against a module being loaded twice at the same time. @@ -104,7 +102,7 @@ static const struct file_operations rng_chrdev_ops = { /* rng_init shouldn't be called more than once at boot time */ static struct miscdevice rng_miscdev = { - RNG_MISCDEV_MINOR, + HWRNG_MINOR, RNG_MODULE_NAME, &rng_chrdev_ops, }; diff --git a/drivers/auxdisplay/charlcd.c b/drivers/auxdisplay/charlcd.c index 874c259a8829..e7048658cb5e 100644 --- a/drivers/auxdisplay/charlcd.c +++ b/drivers/auxdisplay/charlcd.c @@ -22,8 +22,6 @@ #include "charlcd.h" -#define LCD_MINOR 156 - #define DEFAULT_LCD_BWIDTH 40 #define DEFAULT_LCD_HWIDTH 64 diff --git a/drivers/auxdisplay/panel.c b/drivers/auxdisplay/panel.c index 85965953683e..99980aa3644b 100644 --- a/drivers/auxdisplay/panel.c +++ b/drivers/auxdisplay/panel.c @@ -57,8 +57,6 @@ #include "charlcd.h" -#define KEYPAD_MINOR 185 - #define LCD_MAXBYTES 256 /* max burst write */ #define KEYPAD_BUFFER 64 diff --git a/drivers/char/applicom.c b/drivers/char/applicom.c index 51121a4b82c7..14b2d8034c51 100644 --- a/drivers/char/applicom.c +++ b/drivers/char/applicom.c @@ -53,7 +53,6 @@ #define MAX_BOARD 8 /* maximum of pc board possible */ #define MAX_ISA_BOARD 4 #define LEN_RAM_IO 0x800 -#define AC_MINOR 157 #ifndef PCI_VENDOR_ID_APPLICOM #define PCI_VENDOR_ID_APPLICOM 0x1389 diff --git a/drivers/char/nwbutton.h b/drivers/char/nwbutton.h index 9dedfd7adc0e..f2b9fdc1f9ea 100644 --- a/drivers/char/nwbutton.h +++ b/drivers/char/nwbutton.h @@ -14,7 +14,6 @@ #define NUM_PRESSES_REBOOT 2 /* How many presses to activate shutdown */ #define BUTTON_DELAY 30 /* How many jiffies for sequence to end */ #define VERSION "0.3" /* Driver version number */ -#define BUTTON_MINOR 158 /* Major 10, Minor 158, /dev/nwbutton */ /* Structure definitions: */ diff --git a/drivers/char/toshiba.c b/drivers/char/toshiba.c index 98f3150e0048..aff0a8e44fff 100644 --- a/drivers/char/toshiba.c +++ b/drivers/char/toshiba.c @@ -61,8 +61,6 @@ #include #include -#define TOSH_MINOR_DEV 181 - MODULE_LICENSE("GPL"); MODULE_AUTHOR("Jonathan Buzzard "); MODULE_DESCRIPTION("Toshiba laptop SMM driver"); diff --git a/drivers/macintosh/ans-lcd.c b/drivers/macintosh/ans-lcd.c index b1314d104b06..b4821c751d04 100644 --- a/drivers/macintosh/ans-lcd.c +++ b/drivers/macintosh/ans-lcd.c @@ -142,7 +142,7 @@ const struct file_operations anslcd_fops = { }; static struct miscdevice anslcd_dev = { - ANSLCD_MINOR, + LCD_MINOR, "anslcd", &anslcd_fops }; diff --git a/drivers/macintosh/ans-lcd.h b/drivers/macintosh/ans-lcd.h index f0a6e4c68557..bca7d76d441b 100644 --- a/drivers/macintosh/ans-lcd.h +++ b/drivers/macintosh/ans-lcd.h @@ -2,8 +2,6 @@ #ifndef _PPC_ANS_LCD_H #define _PPC_ANS_LCD_H -#define ANSLCD_MINOR 156 - #define ANSLCD_CLEAR 0x01 #define ANSLCD_SENDCTRL 0x02 #define ANSLCD_SETSHORTDELAY 0x03 diff --git a/drivers/macintosh/via-pmu.c b/drivers/macintosh/via-pmu.c index d38fb78a3b23..83eb05bf85ff 100644 --- a/drivers/macintosh/via-pmu.c +++ b/drivers/macintosh/via-pmu.c @@ -75,9 +75,6 @@ /* Some compile options */ #undef DEBUG_SLEEP -/* Misc minor number allocated for /dev/pmu */ -#define PMU_MINOR 154 - /* How many iterations between battery polls */ #define BATTERY_POLLING_COUNT 2 diff --git a/drivers/sbus/char/envctrl.c b/drivers/sbus/char/envctrl.c index 12d66aa61ede..843e830b5f87 100644 --- a/drivers/sbus/char/envctrl.c +++ b/drivers/sbus/char/envctrl.c @@ -37,8 +37,6 @@ #define DRIVER_NAME "envctrl" #define PFX DRIVER_NAME ": " -#define ENVCTRL_MINOR 162 - #define PCF8584_ADDRESS 0x55 #define CONTROL_PIN 0x80 diff --git a/drivers/sbus/char/uctrl.c b/drivers/sbus/char/uctrl.c index 7173a2e4e8cf..37d252f2548d 100644 --- a/drivers/sbus/char/uctrl.c +++ b/drivers/sbus/char/uctrl.c @@ -23,8 +23,6 @@ #include #include -#define UCTRL_MINOR 174 - #define DEBUG 1 #ifdef DEBUG #define dprintk(x) printk x diff --git a/drivers/video/fbdev/pxa3xx-gcu.c b/drivers/video/fbdev/pxa3xx-gcu.c index 74ffb446e00c..4279e13a3b58 100644 --- a/drivers/video/fbdev/pxa3xx-gcu.c +++ b/drivers/video/fbdev/pxa3xx-gcu.c @@ -36,7 +36,6 @@ #include "pxa3xx-gcu.h" #define DRV_NAME "pxa3xx-gcu" -#define MISCDEV_MINOR 197 #define REG_GCCR 0x00 #define GCCR_SYNC_CLR (1 << 9) @@ -595,7 +594,7 @@ static int pxa3xx_gcu_probe(struct platform_device *pdev) * container_of(). This isn't really necessary as we have a fixed minor * number anyway, but this is to avoid statics. */ - priv->misc_dev.minor = MISCDEV_MINOR, + priv->misc_dev.minor = PXA3XX_GCU_MINOR, priv->misc_dev.name = DRV_NAME, priv->misc_dev.fops = &pxa3xx_gcu_miscdev_fops; @@ -638,7 +637,7 @@ static int pxa3xx_gcu_probe(struct platform_device *pdev) ret = misc_register(&priv->misc_dev); if (ret < 0) { dev_err(dev, "misc_register() for minor %d failed\n", - MISCDEV_MINOR); + PXA3XX_GCU_MINOR); goto err_free_dma; } @@ -714,7 +713,7 @@ module_platform_driver(pxa3xx_gcu_driver); MODULE_DESCRIPTION("PXA3xx graphics controller unit driver"); MODULE_LICENSE("GPL"); -MODULE_ALIAS_MISCDEV(MISCDEV_MINOR); +MODULE_ALIAS_MISCDEV(PXA3XX_GCU_MINOR); MODULE_AUTHOR("Janine Kropp , " "Denis Oliver Kropp , " "Daniel Mack "); diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h index becde6981a95..42360fcd7342 100644 --- a/include/linux/miscdevice.h +++ b/include/linux/miscdevice.h @@ -31,14 +31,23 @@ #define DMAPI_MINOR 140 /* unused */ #define NVRAM_MINOR 144 #define SGI_MMTIMER 153 +#define PMU_MINOR 154 #define STORE_QUEUE_MINOR 155 /* unused */ +#define LCD_MINOR 156 +#define AC_MINOR 157 +#define BUTTON_MINOR 158 /* Major 10, Minor 158, /dev/nwbutton */ +#define ENVCTRL_MINOR 162 #define I2O_MINOR 166 +#define UCTRL_MINOR 174 #define AGPGART_MINOR 175 +#define TOSH_MINOR_DEV 181 #define HWRNG_MINOR 183 #define MICROCODE_MINOR 184 +#define KEYPAD_MINOR 185 #define IRNET_MINOR 187 #define D7S_MINOR 193 #define VFIO_MINOR 196 +#define PXA3XX_GCU_MINOR 197 #define TUN_MINOR 200 #define CUSE_MINOR 203 #define MWAVE_MINOR 219 /* ACP/Mwave Modem */ @@ -49,6 +58,7 @@ #define MISC_MCELOG_MINOR 227 #define HPET_MINOR 228 #define FUSE_MINOR 229 +#define SNAPSHOT_MINOR 231 #define KVM_MINOR 232 #define BTRFS_MINOR 234 #define AUTOFS_MINOR 235 diff --git a/kernel/power/user.c b/kernel/power/user.c index 77438954cc2b..98fb65970b6b 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -27,8 +27,6 @@ #include "power.h" -#define SNAPSHOT_MINOR 231 - static struct snapshot_data { struct snapshot_handle handle; int swap; -- cgit v1.2.3-71-gd317 From 2c8bd58812ee3dbf0d78b566822f7eacd34bdd7b Mon Sep 17 00:00:00 2001 From: "Ahmed S. Darwish" Date: Mon, 9 Mar 2020 18:15:29 +0000 Subject: time/sched_clock: Expire timer in hardirq context To minimize latency, PREEMPT_RT kernels expires hrtimers in preemptible softirq context by default. This can be overriden by marking the timer's expiry with HRTIMER_MODE_HARD. sched_clock_timer is missing this annotation: if its callback is preempted and the duration of the preemption exceeds the wrap around time of the underlying clocksource, sched clock will get out of sync. Mark the sched_clock_timer for expiry in hard interrupt context. Signed-off-by: Ahmed S. Darwish Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200309181529.26558-1-a.darwish@linutronix.de --- kernel/time/sched_clock.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index e4332e3e2d56..fa3f800d7d76 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -208,7 +208,8 @@ sched_clock_register(u64 (*read)(void), int bits, unsigned long rate) if (sched_clock_timer.function != NULL) { /* update timeout for clock wrap */ - hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL); + hrtimer_start(&sched_clock_timer, cd.wrap_kt, + HRTIMER_MODE_REL_HARD); } r = rate; @@ -254,9 +255,9 @@ void __init generic_sched_clock_init(void) * Start the timer to keep sched_clock() properly updated and * sets the initial epoch. */ - hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD); sched_clock_timer.function = sched_clock_poll; - hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL); + hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL_HARD); } /* @@ -293,7 +294,7 @@ void sched_clock_resume(void) struct clock_read_data *rd = &cd.read_data[0]; rd->epoch_cyc = cd.actual_read_sched_clock(); - hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL); + hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL_HARD); rd->read_sched_clock = cd.actual_read_sched_clock; } -- cgit v1.2.3-71-gd317 From 90ceddcb495008ac8ba7a3dce297841efcd7d584 Mon Sep 17 00:00:00 2001 From: Fangrui Song Date: Wed, 18 Mar 2020 15:27:46 -0700 Subject: bpf: Support llvm-objcopy for vmlinux BTF Simplify gen_btf logic to make it work with llvm-objcopy. The existing 'file format' and 'architecture' parsing logic is brittle and does not work with llvm-objcopy/llvm-objdump. 'file format' output of llvm-objdump>=11 will match GNU objdump, but 'architecture' (bfdarch) may not. .BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag because it is part of vmlinux image used for introspection. C code can reference the section via linker script defined __start_BTF and __stop_BTF. This fixes a small problem that previous .BTF had the SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data). Additionally, `objcopy -I binary` synthesized symbols _binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not used elsewhere) are replaced with more commonplace __start_BTF and __stop_BTF. Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns "empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?" We use a dd command to change the e_type field in the ELF header from ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting ET_EXEC as an input file is an extremely rare GNU ld feature that lld does not intend to support, because this is error-prone. The output section description .BTF in include/asm-generic/vmlinux.lds.h avoids potential subtle orphan section placement issues and suppresses --orphan-handling=warn warnings. Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux") Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section") Reported-by: Nathan Chancellor Signed-off-by: Fangrui Song Signed-off-by: Daniel Borkmann Tested-by: Stanislav Fomichev Tested-by: Andrii Nakryiko Reviewed-by: Stanislav Fomichev Reviewed-by: Kees Cook Acked-by: Andrii Nakryiko Acked-by: Michael Ellerman (powerpc) Link: https://github.com/ClangBuiltLinux/linux/issues/871 Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com --- arch/powerpc/kernel/vmlinux.lds.S | 6 ------ include/asm-generic/vmlinux.lds.h | 15 +++++++++++++++ kernel/bpf/btf.c | 9 ++++----- kernel/bpf/sysfs_btf.c | 11 +++++------ scripts/link-vmlinux.sh | 24 ++++++++++-------------- 5 files changed, 34 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S index a32d478a7f41..b4c89a1acebb 100644 --- a/arch/powerpc/kernel/vmlinux.lds.S +++ b/arch/powerpc/kernel/vmlinux.lds.S @@ -303,12 +303,6 @@ SECTIONS *(.branch_lt) } -#ifdef CONFIG_DEBUG_INFO_BTF - .BTF : AT(ADDR(.BTF) - LOAD_OFFSET) { - *(.BTF) - } -#endif - .opd : AT(ADDR(.opd) - LOAD_OFFSET) { __start_opd = .; KEEP(*(.opd)) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index e00f41aa8ec4..39da8d8b561d 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -535,6 +535,7 @@ \ RO_EXCEPTION_TABLE \ NOTES \ + BTF \ \ . = ALIGN((align)); \ __end_rodata = .; @@ -621,6 +622,20 @@ __stop___ex_table = .; \ } +/* + * .BTF + */ +#ifdef CONFIG_DEBUG_INFO_BTF +#define BTF \ + .BTF : AT(ADDR(.BTF) - LOAD_OFFSET) { \ + __start_BTF = .; \ + *(.BTF) \ + __stop_BTF = .; \ + } +#else +#define BTF +#endif + /* * Init task */ diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 50080add2ab9..6f397c4da05e 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3477,8 +3477,8 @@ errout: return ERR_PTR(err); } -extern char __weak _binary__btf_vmlinux_bin_start[]; -extern char __weak _binary__btf_vmlinux_bin_end[]; +extern char __weak __start_BTF[]; +extern char __weak __stop_BTF[]; extern struct btf *btf_vmlinux; #define BPF_MAP_TYPE(_id, _ops) @@ -3605,9 +3605,8 @@ struct btf *btf_parse_vmlinux(void) } env->btf = btf; - btf->data = _binary__btf_vmlinux_bin_start; - btf->data_size = _binary__btf_vmlinux_bin_end - - _binary__btf_vmlinux_bin_start; + btf->data = __start_BTF; + btf->data_size = __stop_BTF - __start_BTF; err = btf_parse_hdr(env); if (err) diff --git a/kernel/bpf/sysfs_btf.c b/kernel/bpf/sysfs_btf.c index 7ae5dddd1fe6..3b495773de5a 100644 --- a/kernel/bpf/sysfs_btf.c +++ b/kernel/bpf/sysfs_btf.c @@ -9,15 +9,15 @@ #include /* See scripts/link-vmlinux.sh, gen_btf() func for details */ -extern char __weak _binary__btf_vmlinux_bin_start[]; -extern char __weak _binary__btf_vmlinux_bin_end[]; +extern char __weak __start_BTF[]; +extern char __weak __stop_BTF[]; static ssize_t btf_vmlinux_read(struct file *file, struct kobject *kobj, struct bin_attribute *bin_attr, char *buf, loff_t off, size_t len) { - memcpy(buf, _binary__btf_vmlinux_bin_start + off, len); + memcpy(buf, __start_BTF + off, len); return len; } @@ -30,15 +30,14 @@ static struct kobject *btf_kobj; static int __init btf_vmlinux_init(void) { - if (!_binary__btf_vmlinux_bin_start) + if (!__start_BTF) return 0; btf_kobj = kobject_create_and_add("btf", kernel_kobj); if (!btf_kobj) return -ENOMEM; - bin_attr_btf_vmlinux.size = _binary__btf_vmlinux_bin_end - - _binary__btf_vmlinux_bin_start; + bin_attr_btf_vmlinux.size = __stop_BTF - __start_BTF; return sysfs_create_bin_file(btf_kobj, &bin_attr_btf_vmlinux); } diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh index ac569e197bfa..d09ab4afbda4 100755 --- a/scripts/link-vmlinux.sh +++ b/scripts/link-vmlinux.sh @@ -113,9 +113,6 @@ vmlinux_link() gen_btf() { local pahole_ver - local bin_arch - local bin_format - local bin_file if ! [ -x "$(command -v ${PAHOLE})" ]; then echo >&2 "BTF: ${1}: pahole (${PAHOLE}) is not available" @@ -133,17 +130,16 @@ gen_btf() info "BTF" ${2} LLVM_OBJCOPY=${OBJCOPY} ${PAHOLE} -J ${1} - # dump .BTF section into raw binary file to link with final vmlinux - bin_arch=$(LANG=C ${OBJDUMP} -f ${1} | grep architecture | \ - cut -d, -f1 | cut -d' ' -f2) - bin_format=$(LANG=C ${OBJDUMP} -f ${1} | grep 'file format' | \ - awk '{print $4}') - bin_file=.btf.vmlinux.bin - ${OBJCOPY} --change-section-address .BTF=0 \ - --set-section-flags .BTF=alloc -O binary \ - --only-section=.BTF ${1} $bin_file - ${OBJCOPY} -I binary -O ${bin_format} -B ${bin_arch} \ - --rename-section .data=.BTF $bin_file ${2} + # Create ${2} which contains just .BTF section but no symbols. Add + # SHF_ALLOC because .BTF will be part of the vmlinux image. --strip-all + # deletes all symbols including __start_BTF and __stop_BTF, which will + # be redefined in the linker script. Add 2>/dev/null to suppress GNU + # objcopy warnings: "empty loadable segment detected at ..." + ${OBJCOPY} --only-section=.BTF --set-section-flags .BTF=alloc,readonly \ + --strip-all ${1} ${2} 2>/dev/null + # Change e_type to ET_REL so that it can be used to link final vmlinux. + # Unlike GNU ld, lld does not allow an ET_EXEC input. + printf '\1' | dd of=${2} conv=notrunc bs=1 seek=16 status=none } # Create ${2} .o file with all symbols from the ${1} object file -- cgit v1.2.3-71-gd317 From 52da479a9aee630d2cdf37d05edfe5bcfff3e17f Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Thu, 19 Mar 2020 19:47:06 +0100 Subject: Revert "tick/common: Make tick_periodic() check for missing ticks" This reverts commit d441dceb5dce71150f28add80d36d91bbfccba99 due to boot failures. Reported-by: Qian Cai Signed-off-by: Thomas Gleixner Cc: Waiman Long --- kernel/time/tick-common.c | 36 +++--------------------------------- 1 file changed, 3 insertions(+), 33 deletions(-) (limited to 'kernel') diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index cce4ed1515c7..7e5d3524e924 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -16,7 +16,6 @@ #include #include #include -#include #include #include @@ -85,41 +84,12 @@ int tick_is_oneshot_available(void) static void tick_periodic(int cpu) { if (tick_do_timer_cpu == cpu) { - /* - * Use running_clock() as reference to check for missing ticks. - */ - static ktime_t last_update; - ktime_t now; - int ticks = 1; - - now = ns_to_ktime(running_clock()); write_seqlock(&jiffies_lock); - if (last_update) { - u64 delta = ktime_sub(now, last_update); - - /* - * Check for eventually missed ticks - * - * There is likely a persistent delta between - * last_update and tick_next_period. So they are - * updated separately. - */ - if (delta >= 2 * tick_period) { - s64 period = ktime_to_ns(tick_period); - - ticks = ktime_divns(delta, period); - } - last_update = ktime_add(last_update, - ticks * tick_period); - } else { - last_update = now; - } - /* Keep track of the next tick event */ - tick_next_period = ktime_add(tick_next_period, - ticks * tick_period); - do_timer(ticks); + tick_next_period = ktime_add(tick_next_period, tick_period); + + do_timer(1); write_sequnlock(&jiffies_lock); update_wall_time(); } -- cgit v1.2.3-71-gd317 From bf2cbe044da275021b2de5917240411a19e5c50d Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Wed, 19 Feb 2020 22:10:12 -0700 Subject: tracing: Use address-of operator on section symbols Clang warns: ../kernel/trace/trace.c:9335:33: warning: array comparison always evaluates to true [-Wtautological-compare] if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt) ^ 1 warning generated. These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the runtime result of the check (tested with some print statements compiled in with clang + ld.lld and gcc + ld.bfd in QEMU). Link: http://lkml.kernel.org/r/20200220051011.26113-1-natechancellor@gmail.com Link: https://github.com/ClangBuiltLinux/linux/issues/893 Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 6b11e4e2150c..02be4ddd4ad5 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -9334,7 +9334,7 @@ __init static int tracer_alloc_buffers(void) goto out_free_buffer_mask; /* Only allocate trace_printk buffers if a trace_printk exists */ - if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt) + if (&__stop___trace_bprintk_fmt != &__start___trace_bprintk_fmt) /* Must be called before global_trace.buffer is allocated */ trace_printk_init_buffers(); -- cgit v1.2.3-71-gd317 From ff895103a84abc85a5f43ecabc7f67cf36e1348f Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:23 -0400 Subject: tracing: Save off entry when peeking at next entry In order to have the iterator read the buffer even when it's still updating, it requires that the ring buffer iterator saves each event in a separate location outside the ring buffer such that its use is immutable. There's one use case that saves off the event returned from the ring buffer interator and calls it again to look at the next event, before going back to use the first event. As the ring buffer iterator will only have a single copy, this use case will no longer be supported. Instead, have the one use case create its own buffer to store the first event when looking at the next event. This way, when looking at the first event again, it wont be corrupted by the second read. Link: http://lkml.kernel.org/r/20200317213415.722539921@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- include/linux/trace_events.h | 2 ++ kernel/trace/trace.c | 40 +++++++++++++++++++++++++++++++++++++++- kernel/trace/trace_output.c | 15 ++++++--------- 3 files changed, 47 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 6c7a10a6d71e..5c6943354049 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -85,6 +85,8 @@ struct trace_iterator { struct mutex mutex; struct ring_buffer_iter **buffer_iter; unsigned long iter_flags; + void *temp; /* temp holder */ + unsigned int temp_size; /* trace_seq for __print_flags() and __print_symbolic() etc. */ struct trace_seq tmp_seq; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 02be4ddd4ad5..819e31d0d66c 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3466,7 +3466,31 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu, struct trace_entry *trace_find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts) { - return __find_next_entry(iter, ent_cpu, NULL, ent_ts); + /* __find_next_entry will reset ent_size */ + int ent_size = iter->ent_size; + struct trace_entry *entry; + + /* + * The __find_next_entry() may call peek_next_entry(), which may + * call ring_buffer_peek() that may make the contents of iter->ent + * undefined. Need to copy iter->ent now. + */ + if (iter->ent && iter->ent != iter->temp) { + if (!iter->temp || iter->temp_size < iter->ent_size) { + kfree(iter->temp); + iter->temp = kmalloc(iter->ent_size, GFP_KERNEL); + if (!iter->temp) + return NULL; + } + memcpy(iter->temp, iter->ent, iter->ent_size); + iter->temp_size = iter->ent_size; + iter->ent = iter->temp; + } + entry = __find_next_entry(iter, ent_cpu, NULL, ent_ts); + /* Put back the original ent_size */ + iter->ent_size = ent_size; + + return entry; } /* Find the next real entry, and increment the iterator to the next entry */ @@ -4197,6 +4221,18 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) if (!iter->buffer_iter) goto release; + /* + * trace_find_next_entry() may need to save off iter->ent. + * It will place it into the iter->temp buffer. As most + * events are less than 128, allocate a buffer of that size. + * If one is greater, then trace_find_next_entry() will + * allocate a new buffer to adjust for the bigger iter->ent. + * It's not critical if it fails to get allocated here. + */ + iter->temp = kmalloc(128, GFP_KERNEL); + if (iter->temp) + iter->temp_size = 128; + /* * We make a copy of the current tracer to avoid concurrent * changes on it while we are reading. @@ -4269,6 +4305,7 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) fail: mutex_unlock(&trace_types_lock); kfree(iter->trace); + kfree(iter->temp); kfree(iter->buffer_iter); release: seq_release_private(inode, file); @@ -4344,6 +4381,7 @@ static int tracing_release(struct inode *inode, struct file *file) mutex_destroy(&iter->mutex); free_cpumask_var(iter->started); + kfree(iter->temp); kfree(iter->trace); kfree(iter->buffer_iter); seq_release_private(inode, file); diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index e25a7da79c6b..9a121e147102 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -617,22 +617,19 @@ int trace_print_context(struct trace_iterator *iter) int trace_print_lat_context(struct trace_iterator *iter) { + struct trace_entry *entry, *next_entry; struct trace_array *tr = iter->tr; - /* trace_find_next_entry will reset ent_size */ - int ent_size = iter->ent_size; struct trace_seq *s = &iter->seq; - u64 next_ts; - struct trace_entry *entry = iter->ent, - *next_entry = trace_find_next_entry(iter, NULL, - &next_ts); unsigned long verbose = (tr->trace_flags & TRACE_ITER_VERBOSE); + u64 next_ts; - /* Restore the original ent_size */ - iter->ent_size = ent_size; - + next_entry = trace_find_next_entry(iter, NULL, &next_ts); if (!next_entry) next_ts = iter->ts; + /* trace_find_next_entry() may change iter->ent */ + entry = iter->ent; + if (verbose) { char comm[TASK_COMM_LEN]; -- cgit v1.2.3-71-gd317 From ead6ecfddea54f4754c97f64ab7198cc1d8c0daa Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:24 -0400 Subject: ring-buffer: Have ring_buffer_empty() not depend on tracing stopped It was complained about that when the trace file is read, that the tracing is disabled, as the iterator expects writing to the buffer it reads is not updated. Several steps are needed to make the iterator handle a writer, by testing if things have changed as it reads. This step is to make ring_buffer_empty() expect the buffer to be changing. Note if the current location of the iterator is overwritten, then it will return false as new data is being added. Note, that this means that data will be skipped. Link: http://lkml.kernel.org/r/20200317213415.870741809@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 61f0e92ace99..1718520a2809 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -3590,16 +3590,37 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter) struct buffer_page *reader; struct buffer_page *head_page; struct buffer_page *commit_page; + struct buffer_page *curr_commit_page; unsigned commit; + u64 curr_commit_ts; + u64 commit_ts; cpu_buffer = iter->cpu_buffer; - - /* Remember, trace recording is off when iterator is in use */ reader = cpu_buffer->reader_page; head_page = cpu_buffer->head_page; commit_page = cpu_buffer->commit_page; + commit_ts = commit_page->page->time_stamp; + + /* + * When the writer goes across pages, it issues a cmpxchg which + * is a mb(), which will synchronize with the rmb here. + * (see rb_tail_page_update()) + */ + smp_rmb(); commit = rb_page_commit(commit_page); + /* We want to make sure that the commit page doesn't change */ + smp_rmb(); + + /* Make sure commit page didn't change */ + curr_commit_page = READ_ONCE(cpu_buffer->commit_page); + curr_commit_ts = READ_ONCE(curr_commit_page->page->time_stamp); + + /* If the commit page changed, then there's more data */ + if (curr_commit_page != commit_page || + curr_commit_ts != commit_ts) + return 0; + /* Still racy, as it may return a false positive, but that's OK */ return ((iter->head_page == commit_page && iter->head == commit) || (iter->head_page == reader && commit_page == head_page && head_page->read == commit && -- cgit v1.2.3-71-gd317 From bc1a72afdc4a91844928831cac85731566e03bc6 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:25 -0400 Subject: ring-buffer: Rename ring_buffer_read() to read_buffer_iter_advance() When the ring buffer was first created, the iterator followed the normal producer/consumer operations where it had both a peek() operation, that just returned the event at the current location, and a read(), that would return the event at the current location and also increment the iterator such that the next peek() or read() will return the next event. The only use of the ring_buffer_read() is currently to move the iterator to the next location and nothing now actually reads the event it returns. Rename this function to its actual use case to ring_buffer_iter_advance(), which also adds the "iter" part to the name, which is more meaningful. As the timestamp returned by ring_buffer_read() was never used, there's no reason that this new version should bother having returning it. It will also become a void function. Link: http://lkml.kernel.org/r/20200317213416.018928618@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- include/linux/ring_buffer.h | 3 +-- kernel/trace/ring_buffer.c | 23 ++++++----------------- kernel/trace/trace.c | 4 ++-- kernel/trace/trace_functions_graph.c | 2 +- 4 files changed, 10 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index df0124eabece..0ae603b79b0e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -135,8 +135,7 @@ void ring_buffer_read_finish(struct ring_buffer_iter *iter); struct ring_buffer_event * ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts); -struct ring_buffer_event * -ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts); +void ring_buffer_iter_advance(struct ring_buffer_iter *iter); void ring_buffer_iter_reset(struct ring_buffer_iter *iter); int ring_buffer_iter_empty(struct ring_buffer_iter *iter); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 1718520a2809..f57eeaa80e3e 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4318,35 +4318,24 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) EXPORT_SYMBOL_GPL(ring_buffer_read_finish); /** - * ring_buffer_read - read the next item in the ring buffer by the iterator + * ring_buffer_iter_advance - advance the iterator to the next location * @iter: The ring buffer iterator - * @ts: The time stamp of the event read. * - * This reads the next event in the ring buffer and increments the iterator. + * Move the location of the iterator such that the next read will + * be the next location of the iterator. */ -struct ring_buffer_event * -ring_buffer_read(struct ring_buffer_iter *iter, u64 *ts) +void ring_buffer_iter_advance(struct ring_buffer_iter *iter) { - struct ring_buffer_event *event; struct ring_buffer_per_cpu *cpu_buffer = iter->cpu_buffer; unsigned long flags; raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags); - again: - event = rb_iter_peek(iter, ts); - if (!event) - goto out; - - if (event->type_len == RINGBUF_TYPE_PADDING) - goto again; rb_advance_iter(iter); - out: - raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - return event; + raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); } -EXPORT_SYMBOL_GPL(ring_buffer_read); +EXPORT_SYMBOL_GPL(ring_buffer_iter_advance); /** * ring_buffer_size - return the size of the ring buffer (in bytes) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 819e31d0d66c..47889123be7f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3378,7 +3378,7 @@ static void trace_iterator_increment(struct trace_iterator *iter) iter->idx++; if (buf_iter) - ring_buffer_read(buf_iter, NULL); + ring_buffer_iter_advance(buf_iter); } static struct trace_entry * @@ -3562,7 +3562,7 @@ void tracing_iter_reset(struct trace_iterator *iter, int cpu) if (ts >= iter->array_buffer->time_start) break; entries++; - ring_buffer_read(buf_iter, NULL); + ring_buffer_iter_advance(buf_iter); } per_cpu_ptr(iter->array_buffer->data, cpu)->skipped_entries = entries; diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c index 7d71546ba00a..4a9c49c08ec9 100644 --- a/kernel/trace/trace_functions_graph.c +++ b/kernel/trace/trace_functions_graph.c @@ -482,7 +482,7 @@ get_return_for_leaf(struct trace_iterator *iter, /* this is a leaf, now advance the iterator */ if (ring_iter) - ring_buffer_read(ring_iter, NULL); + ring_buffer_iter_advance(ring_iter); return next; } -- cgit v1.2.3-71-gd317 From 28e3fc56a471bbac39d24571e11dde64b15de988 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:26 -0400 Subject: ring-buffer: Add page_stamp to iterator for synchronization Have the ring_buffer_iter structure contain a page_stamp, such that it can be used to see if the writer entered the page the iterator is on. When going to a new page, the iterator will record the time stamp of that page. When reading events, it can copy the event to an internal buffer on the iterator (to be implemented later), then check the page's time stamp with its own to see if the writer entered the page. If so, it will need to try to read the event again. Link: http://lkml.kernel.org/r/20200317213416.163549674@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index f57eeaa80e3e..e689bdcb53e8 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -507,6 +507,7 @@ struct ring_buffer_iter { struct buffer_page *cache_reader_page; unsigned long cache_read; u64 read_stamp; + u64 page_stamp; }; /** @@ -1959,7 +1960,7 @@ static void rb_inc_iter(struct ring_buffer_iter *iter) else rb_inc_page(cpu_buffer, &iter->head_page); - iter->read_stamp = iter->head_page->page->time_stamp; + iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp; iter->head = 0; } @@ -3551,10 +3552,13 @@ static void rb_iter_reset(struct ring_buffer_iter *iter) iter->cache_reader_page = iter->head_page; iter->cache_read = cpu_buffer->read; - if (iter->head) + if (iter->head) { iter->read_stamp = cpu_buffer->read_stamp; - else + iter->page_stamp = cpu_buffer->reader_page->page->time_stamp; + } else { iter->read_stamp = iter->head_page->page->time_stamp; + iter->page_stamp = iter->read_stamp; + } } /** -- cgit v1.2.3-71-gd317 From 785888c544e0433f601df18ff98a3215b380b9c3 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:27 -0400 Subject: ring-buffer: Have rb_iter_head_event() handle concurrent writer Have the ring_buffer_iter structure have a place to store an event, such that it can not be overwritten by a writer, and load it in such a way via rb_iter_head_event() that it will return NULL and reset the iter to the start of the current page if a writer updated the page. Link: http://lkml.kernel.org/r/20200317213416.306959216@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 106 ++++++++++++++++++++++++++++++++------------- 1 file changed, 75 insertions(+), 31 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index e689bdcb53e8..3d718add73c1 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -503,11 +503,13 @@ struct trace_buffer { struct ring_buffer_iter { struct ring_buffer_per_cpu *cpu_buffer; unsigned long head; + unsigned long next_event; struct buffer_page *head_page; struct buffer_page *cache_reader_page; unsigned long cache_read; u64 read_stamp; u64 page_stamp; + struct ring_buffer_event *event; }; /** @@ -1914,15 +1916,59 @@ rb_reader_event(struct ring_buffer_per_cpu *cpu_buffer) cpu_buffer->reader_page->read); } -static __always_inline struct ring_buffer_event * -rb_iter_head_event(struct ring_buffer_iter *iter) +static __always_inline unsigned rb_page_commit(struct buffer_page *bpage) { - return __rb_page_index(iter->head_page, iter->head); + return local_read(&bpage->page->commit); } -static __always_inline unsigned rb_page_commit(struct buffer_page *bpage) +static struct ring_buffer_event * +rb_iter_head_event(struct ring_buffer_iter *iter) { - return local_read(&bpage->page->commit); + struct ring_buffer_event *event; + struct buffer_page *iter_head_page = iter->head_page; + unsigned long commit; + unsigned length; + + /* + * When the writer goes across pages, it issues a cmpxchg which + * is a mb(), which will synchronize with the rmb here. + * (see rb_tail_page_update() and __rb_reserve_next()) + */ + commit = rb_page_commit(iter_head_page); + smp_rmb(); + event = __rb_page_index(iter_head_page, iter->head); + length = rb_event_length(event); + + /* + * READ_ONCE() doesn't work on functions and we don't want the + * compiler doing any crazy optimizations with length. + */ + barrier(); + + if ((iter->head + length) > commit || length > BUF_MAX_DATA_SIZE) + /* Writer corrupted the read? */ + goto reset; + + memcpy(iter->event, event, length); + /* + * If the page stamp is still the same after this rmb() then the + * event was safely copied without the writer entering the page. + */ + smp_rmb(); + + /* Make sure the page didn't change since we read this */ + if (iter->page_stamp != iter_head_page->page->time_stamp || + commit > rb_page_commit(iter_head_page)) + goto reset; + + iter->next_event = iter->head + length; + return iter->event; + reset: + /* Reset to the beginning */ + iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp; + iter->head = 0; + iter->next_event = 0; + return NULL; } /* Size is determined by what has been committed */ @@ -1962,6 +2008,7 @@ static void rb_inc_iter(struct ring_buffer_iter *iter) iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp; iter->head = 0; + iter->next_event = 0; } /* @@ -3548,6 +3595,7 @@ static void rb_iter_reset(struct ring_buffer_iter *iter) /* Iterator usage is expected to have record disabled */ iter->head_page = cpu_buffer->reader_page; iter->head = cpu_buffer->reader_page->read; + iter->next_event = iter->head; iter->cache_reader_page = iter->head_page; iter->cache_read = cpu_buffer->read; @@ -3625,7 +3673,7 @@ int ring_buffer_iter_empty(struct ring_buffer_iter *iter) return 0; /* Still racy, as it may return a false positive, but that's OK */ - return ((iter->head_page == commit_page && iter->head == commit) || + return ((iter->head_page == commit_page && iter->head >= commit) || (iter->head_page == reader && commit_page == head_page && head_page->read == commit && iter->head == rb_page_commit(cpu_buffer->reader_page))); @@ -3853,15 +3901,22 @@ static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer) static void rb_advance_iter(struct ring_buffer_iter *iter) { struct ring_buffer_per_cpu *cpu_buffer; - struct ring_buffer_event *event; - unsigned length; cpu_buffer = iter->cpu_buffer; + /* If head == next_event then we need to jump to the next event */ + if (iter->head == iter->next_event) { + /* If the event gets overwritten again, there's nothing to do */ + if (rb_iter_head_event(iter) == NULL) + return; + } + + iter->head = iter->next_event; + /* * Check if we are at the end of the buffer. */ - if (iter->head >= rb_page_size(iter->head_page)) { + if (iter->next_event >= rb_page_size(iter->head_page)) { /* discarded commits can make the page empty */ if (iter->head_page == cpu_buffer->commit_page) return; @@ -3869,27 +3924,7 @@ static void rb_advance_iter(struct ring_buffer_iter *iter) return; } - event = rb_iter_head_event(iter); - - length = rb_event_length(event); - - /* - * This should not be called to advance the header if we are - * at the tail of the buffer. - */ - if (RB_WARN_ON(cpu_buffer, - (iter->head_page == cpu_buffer->commit_page) && - (iter->head + length > rb_commit_index(cpu_buffer)))) - return; - - rb_update_iter_read_stamp(iter, event); - - iter->head += length; - - /* check for end of page padding */ - if ((iter->head >= rb_page_size(iter->head_page)) && - (iter->head_page != cpu_buffer->commit_page)) - rb_inc_iter(iter); + rb_update_iter_read_stamp(iter, iter->event); } static int rb_lost_events(struct ring_buffer_per_cpu *cpu_buffer) @@ -4017,6 +4052,8 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) } event = rb_iter_head_event(iter); + if (!event) + goto again; switch (event->type_len) { case RINGBUF_TYPE_PADDING: @@ -4233,10 +4270,16 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags) if (!cpumask_test_cpu(cpu, buffer->cpumask)) return NULL; - iter = kmalloc(sizeof(*iter), flags); + iter = kzalloc(sizeof(*iter), flags); if (!iter) return NULL; + iter->event = kmalloc(BUF_MAX_DATA_SIZE, flags); + if (!iter->event) { + kfree(iter); + return NULL; + } + cpu_buffer = buffer->buffers[cpu]; iter->cpu_buffer = cpu_buffer; @@ -4317,6 +4360,7 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) atomic_dec(&cpu_buffer->record_disabled); atomic_dec(&cpu_buffer->buffer->resize_disabled); + kfree(iter->event); kfree(iter); } EXPORT_SYMBOL_GPL(ring_buffer_read_finish); -- cgit v1.2.3-71-gd317 From ff84c50cfb4b8dc68c982fb6c05a524e1539ee2f Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:28 -0400 Subject: ring-buffer: Do not die if rb_iter_peek() fails more than thrice As the iterator will be reading a live buffer, and if the event being read is on a page that a writer crosses, it will fail and try again, the condition in rb_iter_peek() that only allows a retry to happen three times is no longer valid. Allow rb_iter_peek() to retry more than three times without killing the ring buffer, but only if rb_iter_head_event() had failed at least once. Link: http://lkml.kernel.org/r/20200317213416.452888193@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 3d718add73c1..475338fda969 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4012,6 +4012,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) struct ring_buffer_per_cpu *cpu_buffer; struct ring_buffer_event *event; int nr_loops = 0; + bool failed = false; if (ts) *ts = 0; @@ -4038,10 +4039,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) * to a data event, we should never loop more than three times. * Once for going to next page, once on time extend, and * finally once to get the event. - * (We never hit the following condition more than thrice). + * We should never hit the following condition more than thrice, + * unless the buffer is very small, and there's a writer + * that is causing the reader to fail getting an event. */ - if (RB_WARN_ON(cpu_buffer, ++nr_loops > 3)) + if (++nr_loops > 3) { + RB_WARN_ON(cpu_buffer, !failed); return NULL; + } if (rb_per_cpu_empty(cpu_buffer)) return NULL; @@ -4052,8 +4057,10 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) } event = rb_iter_head_event(iter); - if (!event) + if (!event) { + failed = true; goto again; + } switch (event->type_len) { case RINGBUF_TYPE_PADDING: -- cgit v1.2.3-71-gd317 From 153368ce1bd0ccb47812a3185e824445a7024ea5 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:29 -0400 Subject: ring-buffer: Optimize rb_iter_head_event() As it is fine to perform several "peeks" of event data in the ring buffer via the iterator before moving it forward, do not re-read the event, just return what was read before. Otherwise, it can cause inconsistent results, especially when testing multiple CPU buffers to interleave them. Link: http://lkml.kernel.org/r/20200317213416.592032170@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 475338fda969..5979327254f9 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1929,6 +1929,9 @@ rb_iter_head_event(struct ring_buffer_iter *iter) unsigned long commit; unsigned length; + if (iter->head != iter->next_event) + return iter->event; + /* * When the writer goes across pages, it issues a cmpxchg which * is a mb(), which will synchronize with the rmb here. -- cgit v1.2.3-71-gd317 From fe61468b2cbc2b7ce5f8d3bf32ae5001d4c434e9 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Fri, 6 Mar 2020 14:52:57 +0100 Subject: sched/fair: Fix enqueue_task_fair warning When a cfs rq is throttled, the latter and its child are removed from the leaf list but their nr_running is not changed which includes staying higher than 1. When a task is enqueued in this throttled branch, the cfs rqs must be added back in order to ensure correct ordering in the list but this can only happens if nr_running == 1. When cfs bandwidth is used, we call unconditionnaly list_add_leaf_cfs_rq() when enqueuing an entity to make sure that the complete branch will be added. Similarly unthrottle_cfs_rq() can stop adding cfs in the list when a parent is throttled. Iterate the remaining entity to ensure that the complete branch will be added in the list. Reported-by: Christian Borntraeger Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Tested-by: Christian Borntraeger Tested-by: Dietmar Eggemann Cc: stable@vger.kernel.org Cc: stable@vger.kernel.org #v5.1+ Link: https://lkml.kernel.org/r/20200306135257.25044-1-vincent.guittot@linaro.org --- kernel/sched/fair.c | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1dea8554ead0..c7aaae2b1030 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4136,6 +4136,7 @@ static inline void check_schedstat_required(void) #endif } +static inline bool cfs_bandwidth_used(void); /* * MIGRATION @@ -4214,10 +4215,16 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) __enqueue_entity(cfs_rq, se); se->on_rq = 1; - if (cfs_rq->nr_running == 1) { + /* + * When bandwidth control is enabled, cfs might have been removed + * because of a parent been throttled but cfs->nr_running > 1. Try to + * add it unconditionnally. + */ + if (cfs_rq->nr_running == 1 || cfs_bandwidth_used()) list_add_leaf_cfs_rq(cfs_rq); + + if (cfs_rq->nr_running == 1) check_enqueue_throttle(cfs_rq); - } } static void __clear_buddies_last(struct sched_entity *se) @@ -4808,11 +4815,22 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) break; } - assert_list_leaf_cfs_rq(rq); - if (!se) add_nr_running(rq, task_delta); + /* + * The cfs_rq_throttled() breaks in the above iteration can result in + * incomplete leaf list maintenance, resulting in triggering the + * assertion below. + */ + for_each_sched_entity(se) { + cfs_rq = cfs_rq_of(se); + + list_add_leaf_cfs_rq(cfs_rq); + } + + assert_list_leaf_cfs_rq(rq); + /* Determine whether we need to wake up potentially idle CPU: */ if (rq->curr == rq->idle && rq->cfs.nr_running) resched_curr(rq); -- cgit v1.2.3-71-gd317 From 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 Mon Sep 17 00:00:00 2001 From: Paul Turner Date: Tue, 10 Mar 2020 18:01:13 -0700 Subject: sched/core: Distribute tasks within affinity masks Currently, when updating the affinity of tasks via either cpusets.cpus, or, sched_setaffinity(); tasks not currently running within the newly specified mask will be arbitrarily assigned to the first CPU within the mask. This (particularly in the case that we are restricting masks) can result in many tasks being assigned to the first CPUs of their new masks. This: 1) Can induce scheduling delays while the load-balancer has a chance to spread them between their new CPUs. 2) Can antogonize a poor load-balancer behavior where it has a difficult time recognizing that a cross-socket imbalance has been forced by an affinity mask. This change adds a new cpumask interface to allow iterated calls to distribute within the intersection of the provided masks. The cases that this mainly affects are: - modifying cpuset.cpus - when tasks join a cpuset - when modifying a task's affinity via sched_setaffinity(2) Signed-off-by: Paul Turner Signed-off-by: Josh Don Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Qais Yousef Tested-by: Qais Yousef Link: https://lkml.kernel.org/r/20200311010113.136465-1-joshdon@google.com --- include/linux/cpumask.h | 7 +++++++ kernel/sched/core.c | 7 ++++++- lib/cpumask.c | 29 +++++++++++++++++++++++++++++ 3 files changed, 42 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index d5cc88514aee..f0d895d6ac39 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node) return 0; } +static inline int cpumask_any_and_distribute(const struct cpumask *src1p, + const struct cpumask *src2p) { + return cpumask_next_and(-1, src1p, src2p); +} + #define for_each_cpu(cpu, mask) \ for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask) #define for_each_cpu_not(cpu, mask) \ @@ -245,6 +250,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp) int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *); int cpumask_any_but(const struct cpumask *mask, unsigned int cpu); unsigned int cpumask_local_spread(unsigned int i, int node); +int cpumask_any_and_distribute(const struct cpumask *src1p, + const struct cpumask *src2p); /** * for_each_cpu - iterate over every cpu in a mask diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 978bf6f6e0ff..014d4f793313 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1650,7 +1650,12 @@ static int __set_cpus_allowed_ptr(struct task_struct *p, if (cpumask_equal(p->cpus_ptr, new_mask)) goto out; - dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask); + /* + * Picking a ~random cpu helps in cases where we are changing affinity + * for groups of tasks (ie. cpuset), so that load balancing is not + * immediately required to distribute the tasks within their new mask. + */ + dest_cpu = cpumask_any_and_distribute(cpu_valid_mask, new_mask); if (dest_cpu >= nr_cpu_ids) { ret = -EINVAL; goto out; diff --git a/lib/cpumask.c b/lib/cpumask.c index 0cb672eb107c..fb22fb266f93 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -232,3 +232,32 @@ unsigned int cpumask_local_spread(unsigned int i, int node) BUG(); } EXPORT_SYMBOL(cpumask_local_spread); + +static DEFINE_PER_CPU(int, distribute_cpu_mask_prev); + +/** + * Returns an arbitrary cpu within srcp1 & srcp2. + * + * Iterated calls using the same srcp1 and srcp2 will be distributed within + * their intersection. + * + * Returns >= nr_cpu_ids if the intersection is empty. + */ +int cpumask_any_and_distribute(const struct cpumask *src1p, + const struct cpumask *src2p) +{ + int next, prev; + + /* NOTE: our first selection will skip 0. */ + prev = __this_cpu_read(distribute_cpu_mask_prev); + + next = cpumask_next_and(prev, src1p, src2p); + if (next >= nr_cpu_ids) + next = cpumask_first_and(src1p, src2p); + + if (next < nr_cpu_ids) + __this_cpu_write(distribute_cpu_mask_prev, next); + + return next; +} +EXPORT_SYMBOL(cpumask_any_and_distribute); -- cgit v1.2.3-71-gd317 From b05e75d611380881e73edc58a20fd8c6bb71720b Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 16 Mar 2020 15:13:31 -0400 Subject: psi: Fix cpu.pressure for cpu.max and competing cgroups For simplicity, cpu pressure is defined as having more than one runnable task on a given CPU. This works on the system-level, but it has limitations in a cgrouped reality: When cpu.max is in use, it doesn't capture the time in which a task is not executing on the CPU due to throttling. Likewise, it doesn't capture the time in which a competing cgroup is occupying the CPU - meaning it only reflects cgroup-internal competitive pressure, not outside pressure. Enable tracking of currently executing tasks, and then change the definition of cpu pressure in a cgroup from NR_RUNNING > 1 to NR_RUNNING > ON_CPU which will capture the effects of cpu.max as well as competition from outside the cgroup. After this patch, a cgroup running `stress -c 1` with a cpu.max setting of 5000 10000 shows ~50% continuous CPU pressure. Signed-off-by: Johannes Weiner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org --- include/linux/psi_types.h | 10 +++++++++- kernel/sched/core.c | 2 ++ kernel/sched/psi.c | 12 +++++++----- kernel/sched/stats.h | 28 ++++++++++++++++++++++++++++ 4 files changed, 46 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 07aaf9b82241..4b7258495a04 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -14,13 +14,21 @@ enum psi_task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, - NR_PSI_TASK_COUNTS = 3, + /* + * This can't have values other than 0 or 1 and could be + * implemented as a bit flag. But for now we still have room + * in the first cacheline of psi_group_cpu, and this way we + * don't have to special case any state tracking for it. + */ + NR_ONCPU, + NR_PSI_TASK_COUNTS = 4, }; /* Task state bitmasks */ #define TSK_IOWAIT (1 << NR_IOWAIT) #define TSK_MEMSTALL (1 << NR_MEMSTALL) #define TSK_RUNNING (1 << NR_RUNNING) +#define TSK_ONCPU (1 << NR_ONCPU) /* Resources that workloads could be stalled on */ enum psi_res { diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 014d4f793313..c1f923d647ee 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4091,6 +4091,8 @@ static void __sched notrace __schedule(bool preempt) */ ++*switch_count; + psi_sched_switch(prev, next, !task_on_rq_queued(prev)); + trace_sched_switch(preempt, prev, next); /* Also unlocks the rq: */ diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 028520702717..50128297a4f9 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -225,7 +225,7 @@ static bool test_state(unsigned int *tasks, enum psi_states state) case PSI_MEM_FULL: return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]; case PSI_CPU_SOME: - return tasks[NR_RUNNING] > 1; + return tasks[NR_RUNNING] > tasks[NR_ONCPU]; case PSI_NONIDLE: return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || tasks[NR_RUNNING]; @@ -695,10 +695,10 @@ static u32 psi_group_change(struct psi_group *group, int cpu, if (!(m & (1 << t))) continue; if (groupc->tasks[t] == 0 && !psi_bug) { - printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u] clear=%x set=%x\n", + printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n", cpu, t, groupc->tasks[0], groupc->tasks[1], groupc->tasks[2], - clear, set); + groupc->tasks[3], clear, set); psi_bug = 1; } groupc->tasks[t]--; @@ -916,9 +916,11 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) rq = task_rq_lock(task, &rf); - if (task_on_rq_queued(task)) + if (task_on_rq_queued(task)) { task_flags = TSK_RUNNING; - else if (task->in_iowait) + if (task_current(rq, task)) + task_flags |= TSK_ONCPU; + } else if (task->in_iowait) task_flags = TSK_IOWAIT; if (task->flags & PF_MEMSTALL) diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index ba683fe81a6e..6ff0ac1a803f 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -93,6 +93,14 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep) if (p->flags & PF_MEMSTALL) clear |= TSK_MEMSTALL; } else { + /* + * When a task sleeps, schedule() dequeues it before + * switching to the next one. Merge the clearing of + * TSK_RUNNING and TSK_ONCPU to save an unnecessary + * psi_task_change() call in psi_sched_switch(). + */ + clear |= TSK_ONCPU; + if (p->in_iowait) set |= TSK_IOWAIT; } @@ -126,6 +134,23 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) } } +static inline void psi_sched_switch(struct task_struct *prev, + struct task_struct *next, + bool sleep) +{ + if (static_branch_likely(&psi_disabled)) + return; + + /* + * Clear the TSK_ONCPU state if the task was preempted. If + * it's a voluntary sleep, dequeue will have taken care of it. + */ + if (!sleep) + psi_task_change(prev, TSK_ONCPU, 0); + + psi_task_change(next, 0, TSK_ONCPU); +} + static inline void psi_task_tick(struct rq *rq) { if (static_branch_likely(&psi_disabled)) @@ -138,6 +163,9 @@ static inline void psi_task_tick(struct rq *rq) static inline void psi_enqueue(struct task_struct *p, bool wakeup) {} static inline void psi_dequeue(struct task_struct *p, bool sleep) {} static inline void psi_ttwu_dequeue(struct task_struct *p) {} +static inline void psi_sched_switch(struct task_struct *prev, + struct task_struct *next, + bool sleep) {} static inline void psi_task_tick(struct rq *rq) {} #endif /* CONFIG_PSI */ -- cgit v1.2.3-71-gd317 From 36b238d5717279163859fb6ba0f4360abcafab83 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 16 Mar 2020 15:13:32 -0400 Subject: psi: Optimize switching tasks inside shared cgroups When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup. This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all. We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task. The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change(). Suggested-by: Peter Zijlstra Signed-off-by: Johannes Weiner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org --- include/linux/psi.h | 2 ++ kernel/sched/psi.c | 87 ++++++++++++++++++++++++++++++++++++++++------------ kernel/sched/stats.h | 9 +----- 3 files changed, 70 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/include/linux/psi.h b/include/linux/psi.h index 7b3de7321219..7361023f3fdd 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -17,6 +17,8 @@ extern struct psi_group psi_system; void psi_init(void); void psi_task_change(struct task_struct *task, int clear, int set); +void psi_task_switch(struct task_struct *prev, struct task_struct *next, + bool sleep); void psi_memstall_tick(struct task_struct *task, int cpu); void psi_memstall_enter(unsigned long *flags); diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 50128297a4f9..955a124bae81 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -669,13 +669,14 @@ static void record_times(struct psi_group_cpu *groupc, int cpu, groupc->times[PSI_NONIDLE] += delta; } -static u32 psi_group_change(struct psi_group *group, int cpu, - unsigned int clear, unsigned int set) +static void psi_group_change(struct psi_group *group, int cpu, + unsigned int clear, unsigned int set, + bool wake_clock) { struct psi_group_cpu *groupc; + u32 state_mask = 0; unsigned int t, m; enum psi_states s; - u32 state_mask = 0; groupc = per_cpu_ptr(group->pcpu, cpu); @@ -717,7 +718,11 @@ static u32 psi_group_change(struct psi_group *group, int cpu, write_seqcount_end(&groupc->seq); - return state_mask; + if (state_mask & group->poll_states) + psi_schedule_poll_work(group, 1); + + if (wake_clock && !delayed_work_pending(&group->avgs_work)) + schedule_delayed_work(&group->avgs_work, PSI_FREQ); } static struct psi_group *iterate_groups(struct task_struct *task, void **iter) @@ -744,27 +749,32 @@ static struct psi_group *iterate_groups(struct task_struct *task, void **iter) return &psi_system; } -void psi_task_change(struct task_struct *task, int clear, int set) +static void psi_flags_change(struct task_struct *task, int clear, int set) { - int cpu = task_cpu(task); - struct psi_group *group; - bool wake_clock = true; - void *iter = NULL; - - if (!task->pid) - return; - if (((task->psi_flags & set) || (task->psi_flags & clear) != clear) && !psi_bug) { printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n", - task->pid, task->comm, cpu, + task->pid, task->comm, task_cpu(task), task->psi_flags, clear, set); psi_bug = 1; } task->psi_flags &= ~clear; task->psi_flags |= set; +} + +void psi_task_change(struct task_struct *task, int clear, int set) +{ + int cpu = task_cpu(task); + struct psi_group *group; + bool wake_clock = true; + void *iter = NULL; + + if (!task->pid) + return; + + psi_flags_change(task, clear, set); /* * Periodic aggregation shuts off if there is a period of no @@ -777,14 +787,51 @@ void psi_task_change(struct task_struct *task, int clear, int set) wq_worker_last_func(task) == psi_avgs_work)) wake_clock = false; - while ((group = iterate_groups(task, &iter))) { - u32 state_mask = psi_group_change(group, cpu, clear, set); + while ((group = iterate_groups(task, &iter))) + psi_group_change(group, cpu, clear, set, wake_clock); +} + +void psi_task_switch(struct task_struct *prev, struct task_struct *next, + bool sleep) +{ + struct psi_group *group, *common = NULL; + int cpu = task_cpu(prev); + void *iter; + + if (next->pid) { + psi_flags_change(next, 0, TSK_ONCPU); + /* + * When moving state between tasks, the group that + * contains them both does not change: we can stop + * updating the tree once we reach the first common + * ancestor. Iterate @next's ancestors until we + * encounter @prev's state. + */ + iter = NULL; + while ((group = iterate_groups(next, &iter))) { + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + common = group; + break; + } + + psi_group_change(group, cpu, 0, TSK_ONCPU, true); + } + } + + /* + * If this is a voluntary sleep, dequeue will have taken care + * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We + * only need to deal with it during preemption. + */ + if (sleep) + return; - if (state_mask & group->poll_states) - psi_schedule_poll_work(group, 1); + if (prev->pid) { + psi_flags_change(prev, TSK_ONCPU, 0); - if (wake_clock && !delayed_work_pending(&group->avgs_work)) - schedule_delayed_work(&group->avgs_work, PSI_FREQ); + iter = NULL; + while ((group = iterate_groups(prev, &iter)) && group != common) + psi_group_change(group, cpu, TSK_ONCPU, 0, true); } } diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 6ff0ac1a803f..1339f5bfe513 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -141,14 +141,7 @@ static inline void psi_sched_switch(struct task_struct *prev, if (static_branch_likely(&psi_disabled)) return; - /* - * Clear the TSK_ONCPU state if the task was preempted. If - * it's a voluntary sleep, dequeue will have taken care of it. - */ - if (!sleep) - psi_task_change(prev, TSK_ONCPU, 0); - - psi_task_change(next, 0, TSK_ONCPU); + psi_task_switch(prev, next, sleep); } static inline void psi_task_tick(struct rq *rq) -- cgit v1.2.3-71-gd317 From 1066d1b6974e095d5a6c472ad9180a957b496cd6 Mon Sep 17 00:00:00 2001 From: Yafang Shao Date: Mon, 16 Mar 2020 21:28:05 -0400 Subject: psi: Move PF_MEMSTALL out of task->flags The task->flags is a 32-bits flag, in which 31 bits have already been consumed. So it is hardly to introduce other new per process flag. Currently there're still enough spaces in the bit-field section of task_struct, so we can define the memstall state as a single bit in task_struct instead. This patch also removes an out-of-date comment pointed by Matthew. Suggested-by: Johannes Weiner Signed-off-by: Yafang Shao Signed-off-by: Peter Zijlstra (Intel) Acked-by: Johannes Weiner Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com --- include/linux/sched.h | 6 ++++-- kernel/sched/psi.c | 12 ++++++------ kernel/sched/stats.h | 10 +++++----- 3 files changed, 15 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/include/linux/sched.h b/include/linux/sched.h index 2e9199bf947b..09bddd9e69a2 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -785,9 +785,12 @@ struct task_struct { unsigned frozen:1; #endif #ifdef CONFIG_BLK_CGROUP - /* to be used once the psi infrastructure lands upstream. */ unsigned use_memdelay:1; #endif +#ifdef CONFIG_PSI + /* Stalled due to lack of memory */ + unsigned in_memstall:1; +#endif unsigned long atomic_flags; /* Flags requiring atomic access. */ @@ -1480,7 +1483,6 @@ extern struct pid *cad_pid; #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ -#define PF_MEMSTALL 0x01000000 /* Stalled due to lack of memory */ #define PF_UMH 0x02000000 /* I'm an Usermodehelper process */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 955a124bae81..8f45cdb6463b 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -865,17 +865,17 @@ void psi_memstall_enter(unsigned long *flags) if (static_branch_likely(&psi_disabled)) return; - *flags = current->flags & PF_MEMSTALL; + *flags = current->in_memstall; if (*flags) return; /* - * PF_MEMSTALL setting & accounting needs to be atomic wrt + * in_memstall setting & accounting needs to be atomic wrt * changes to the task's scheduling state, otherwise we can * race with CPU migration. */ rq = this_rq_lock_irq(&rf); - current->flags |= PF_MEMSTALL; + current->in_memstall = 1; psi_task_change(current, 0, TSK_MEMSTALL); rq_unlock_irq(rq, &rf); @@ -898,13 +898,13 @@ void psi_memstall_leave(unsigned long *flags) if (*flags) return; /* - * PF_MEMSTALL clearing & accounting needs to be atomic wrt + * in_memstall clearing & accounting needs to be atomic wrt * changes to the task's scheduling state, otherwise we could * race with CPU migration. */ rq = this_rq_lock_irq(&rf); - current->flags &= ~PF_MEMSTALL; + current->in_memstall = 0; psi_task_change(current, TSK_MEMSTALL, 0); rq_unlock_irq(rq, &rf); @@ -970,7 +970,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to) } else if (task->in_iowait) task_flags = TSK_IOWAIT; - if (task->flags & PF_MEMSTALL) + if (task->in_memstall) task_flags |= TSK_MEMSTALL; if (task_flags) diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h index 1339f5bfe513..33d0daf83842 100644 --- a/kernel/sched/stats.h +++ b/kernel/sched/stats.h @@ -70,7 +70,7 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup) return; if (!wakeup || p->sched_psi_wake_requeue) { - if (p->flags & PF_MEMSTALL) + if (p->in_memstall) set |= TSK_MEMSTALL; if (p->sched_psi_wake_requeue) p->sched_psi_wake_requeue = 0; @@ -90,7 +90,7 @@ static inline void psi_dequeue(struct task_struct *p, bool sleep) return; if (!sleep) { - if (p->flags & PF_MEMSTALL) + if (p->in_memstall) clear |= TSK_MEMSTALL; } else { /* @@ -117,14 +117,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) * deregister its sleep-persistent psi states from the old * queue, and let psi_enqueue() know it has to requeue. */ - if (unlikely(p->in_iowait || (p->flags & PF_MEMSTALL))) { + if (unlikely(p->in_iowait || p->in_memstall)) { struct rq_flags rf; struct rq *rq; int clear = 0; if (p->in_iowait) clear |= TSK_IOWAIT; - if (p->flags & PF_MEMSTALL) + if (p->in_memstall) clear |= TSK_MEMSTALL; rq = __task_rq_lock(p, &rf); @@ -149,7 +149,7 @@ static inline void psi_task_tick(struct rq *rq) if (static_branch_likely(&psi_disabled)) return; - if (unlikely(rq->curr->flags & PF_MEMSTALL)) + if (unlikely(rq->curr->in_memstall)) psi_memstall_tick(rq->curr, cpu_of(rq)); } #else /* CONFIG_PSI */ -- cgit v1.2.3-71-gd317 From 26cf52229efc87e2effa9d788f9b33c40fb3358a Mon Sep 17 00:00:00 2001 From: Michael Wang Date: Wed, 18 Mar 2020 10:15:15 +0800 Subject: sched: Avoid scale real weight down to zero During our testing, we found a case that shares no longer working correctly, the cgroup topology is like: /sys/fs/cgroup/cpu/A (shares=102400) /sys/fs/cgroup/cpu/A/B (shares=2) /sys/fs/cgroup/cpu/A/B/C (shares=1024) /sys/fs/cgroup/cpu/D (shares=1024) /sys/fs/cgroup/cpu/D/E (shares=1024) /sys/fs/cgroup/cpu/D/E/F (shares=1024) The same benchmark is running in group C & F, no other tasks are running, the benchmark is capable to consumed all the CPUs. We suppose the group C will win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more. The reason is because we have group B with shares as 2, since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, so A->cfs_rq.load.weight become very small. And in calc_group_shares() we calculate shares as: load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg); shares = (tg_shares * load) / tg_weight; Since the 'cfs_rq->load.weight' is too small, the load become 0 after scale down, although 'tg_shares' is 102400, shares of the se which stand for group A on root cfs_rq become 2. While the se of D on root cfs_rq is far more bigger than 2, so it wins the battle. Thus when scale_load_down() scale real weight down to 0, it's no longer telling the real story, the caller will have the wrong information and the calculation will be buggy. This patch add check in scale_load_down(), so the real weight will be >= MIN_SHARES after scale, after applied the group C wins as expected. Suggested-by: Peter Zijlstra Signed-off-by: Michael Wang Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Vincent Guittot Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com --- kernel/sched/sched.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9e173fad0425..1e72d1b3d3ce 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -118,7 +118,13 @@ extern long calc_load_fold_active(struct rq *this_rq, long adjust); #ifdef CONFIG_64BIT # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT) -# define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT) +# define scale_load_down(w) \ +({ \ + unsigned long __w = (w); \ + if (__w) \ + __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT); \ + __w; \ +}) #else # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT) # define scale_load(w) (w) -- cgit v1.2.3-71-gd317 From c32b4308295aaaaedd5beae56cb42e205ae63e58 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Thu, 12 Mar 2020 17:54:29 +0100 Subject: sched/fair: Improve spreading of utilization During load_balancing, a group with spare capacity will try to pull some utilizations from an overloaded group. In such case, the load balance looks for the runqueue with the highest utilization. Nevertheless, it should also ensure that there are some pending tasks to pull otherwise the load balance will fail to pull a task and the spread of the load will be delayed. This situation is quite transient but it's possible to highlight the effect with a short run of sysbench test so the time to spread task impacts the global result significantly. Below are the average results for 15 iterations on an arm64 octo core: sysbench --test=cpu --num-threads=8 --max-requests=1000 run tip/sched/core +patchset total time: 172ms 158ms per-request statistics: avg: 1.337ms 1.244ms max: 21.191ms 10.753ms The average max doesn't fully reflect the wide spread of the value which ranges from 1.350ms to more than 41ms for the tip/sched/core and from 1.350ms to 21ms with the patch. Other factors like waiting for an idle load balance or cache hotness can delay the spreading of the tasks which explains why we can still have up to 21ms with the patch. Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200312165429.990-1-vincent.guittot@linaro.org --- kernel/sched/fair.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c7aaae2b1030..783356f96b7b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9313,6 +9313,14 @@ static struct rq *find_busiest_queue(struct lb_env *env, case migrate_util: util = cpu_util(cpu_of(rq)); + /* + * Don't try to pull utilization from a CPU with one + * running task. Whatever its utilization, we will fail + * detach the task. + */ + if (nr_running <= 1) + continue; + if (busiest_util < util) { busiest_util = util; busiest = rq; -- cgit v1.2.3-71-gd317 From 26c7295be0c5e6da3fa45970e9748be983175b1b Mon Sep 17 00:00:00 2001 From: Liang Chen Date: Fri, 6 Mar 2020 15:01:33 +0800 Subject: kthread: Do not preempt current task if it is going to call schedule() when we create a kthread with ktrhead_create_on_cpu(),the child thread entry is ktread.c:ktrhead() which will be preempted by the parent after call complete(done) while schedule() is not called yet,then the parent will call wait_task_inactive(child) but the child is still on the runqueue, so the parent will schedule_hrtimeout() for 1 jiffy,it will waste a lot of time,especially on startup. parent child ktrhead_create_on_cpu() wait_fo_completion(&done) -----> ktread.c:ktrhead() |----- complete(done);--wakeup and preempted by parent kthread_bind() <------------| |-> schedule();--dequeue here wait_task_inactive(child) | schedule_hrtimeout(1 jiffy) -| So we hope the child just wakeup parent but not preempted by parent, and the child is going to call schedule() soon,then the parent will not call schedule_hrtimeout(1 jiffy) as the child is already dequeue. The same issue for ktrhead_park()&&kthread_parkme(). This patch can save 120ms on rk312x startup with CONFIG_HZ=300. Signed-off-by: Liang Chen Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Steven Rostedt (VMware) Link: https://lkml.kernel.org/r/20200306070133.18335-2-cl@rock-chips.com --- kernel/kthread.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kthread.c b/kernel/kthread.c index b262f47046ca..bfbfa481be3a 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -199,8 +199,15 @@ static void __kthread_parkme(struct kthread *self) if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags)) break; + /* + * Thread is going to call schedule(), do not preempt it, + * or the caller of kthread_park() may spend more time in + * wait_task_inactive(). + */ + preempt_disable(); complete(&self->parked); - schedule(); + schedule_preempt_disabled(); + preempt_enable(); } __set_current_state(TASK_RUNNING); } @@ -245,8 +252,14 @@ static int kthread(void *_create) /* OK, tell user we're spawned, wait for stop or wakeup */ __set_current_state(TASK_UNINTERRUPTIBLE); create->result = current; + /* + * Thread is going to call schedule(), do not preempt it, + * or the creator may spend more time in wait_task_inactive(). + */ + preempt_disable(); complete(done); - schedule(); + schedule_preempt_disabled(); + preempt_enable(); ret = -EINTR; if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) { -- cgit v1.2.3-71-gd317 From e94f80f6c49020008e6fa0f3d4b806b8595d17d8 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Thu, 5 Mar 2020 10:24:50 +0000 Subject: sched/rt: cpupri_find: Trigger a full search as fallback If we failed to find a fitting CPU, in cpupri_find(), we only fallback to the level we found a hit at. But Steve suggested to fallback to a second full scan instead as this could be a better effort. https://lore.kernel.org/lkml/20200304135404.146c56eb@gandalf.local.home/ We trigger the 2nd search unconditionally since the argument about triggering a full search is that the recorded fall back level might have become empty by then. Which means storing any data about what happened would be meaningless and stale. I had a humble try at timing it and it seemed okay for the small 6 CPUs system I was running on https://lore.kernel.org/lkml/20200305124324.42x6ehjxbnjkklnh@e107158-lin.cambridge.arm.com/ On large system this second full scan could be expensive. But there are no users outside capacity awareness for this fitness function at the moment. Heterogeneous systems tend to be small with 8cores in total. Suggested-by: Steven Rostedt Signed-off-by: Qais Yousef Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Steven Rostedt (VMware) Link: https://lkml.kernel.org/r/20200310142219.syxzn5ljpdxqtbgx@e107158-lin.cambridge.arm.com --- kernel/sched/cpupri.c | 29 ++++++----------------------- 1 file changed, 6 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index dd3f16d1a04a..0033731a0797 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -122,8 +122,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, bool (*fitness_fn)(struct task_struct *p, int cpu)) { int task_pri = convert_prio(p->prio); - int best_unfit_idx = -1; - int idx = 0, cpu; + int idx, cpu; BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES); @@ -145,31 +144,15 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, * If no CPU at the current priority can fit the task * continue looking */ - if (cpumask_empty(lowest_mask)) { - /* - * Store our fallback priority in case we - * didn't find a fitting CPU - */ - if (best_unfit_idx == -1) - best_unfit_idx = idx; - + if (cpumask_empty(lowest_mask)) continue; - } return 1; } /* - * If we failed to find a fitting lowest_mask, make sure we fall back - * to the last known unfitting lowest_mask. - * - * Note that the map of the recorded idx might have changed since then, - * so we must ensure to do the full dance to make sure that level still - * holds a valid lowest_mask. - * - * As per above, the map could have been concurrently emptied while we - * were busy searching for a fitting lowest_mask at the other priority - * levels. + * If we failed to find a fitting lowest_mask, kick off a new search + * but without taking into account any fitness criteria this time. * * This rule favours honouring priority over fitting the task in the * correct CPU (Capacity Awareness being the only user now). @@ -184,8 +167,8 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, * must do proper RT planning to avoid overloading the system if they * really care. */ - if (best_unfit_idx != -1) - return __cpupri_find(cp, p, lowest_mask, best_unfit_idx); + if (fitness_fn) + return cpupri_find(cp, p, lowest_mask); return 0; } -- cgit v1.2.3-71-gd317 From 6c8116c914b65be5e4d6f66d69c8142eb0648c22 Mon Sep 17 00:00:00 2001 From: Tao Zhou Date: Thu, 19 Mar 2020 11:39:20 +0800 Subject: sched/fair: Fix condition of avg_load calculation In update_sg_wakeup_stats(), the comment says: Computing avg_load makes sense only when group is fully busy or overloaded. But, the code below this comment does not check like this. From reading the code about avg_load in other functions, I confirm that avg_load should be calculated in fully busy or overloaded case. The comment is correct and the checking condition is wrong. So, change that condition. Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Signed-off-by: Tao Zhou Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Vincent Guittot Acked-by: Mel Gorman Link: https://lkml.kernel.org/r/Message-ID: --- kernel/sched/fair.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 783356f96b7b..d7fb20adabeb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8631,7 +8631,8 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd, * Computing avg_load makes sense only when group is fully busy or * overloaded */ - if (sgs->group_type < group_fully_busy) + if (sgs->group_type == group_fully_busy || + sgs->group_type == group_overloaded) sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity; } -- cgit v1.2.3-71-gd317 From 90c91dfb86d0ff545bd329d3ddd72c147e2ae198 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 5 Mar 2020 13:38:51 +0100 Subject: perf/core: Fix endless multiplex timer Kan and Andi reported that we fail to kill rotation when the flexible events go empty, but the context does not. XXX moar Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily") Reported-by: Andi Kleen Reported-by: Kan Liang Tested-by: Kan Liang Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200305123851.GX2596@hirez.programming.kicks-ass.net --- kernel/events/core.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index ccf8d4fc6374..b5a68d26f0b9 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2291,6 +2291,7 @@ __perf_remove_from_context(struct perf_event *event, if (!ctx->nr_events && ctx->is_active) { ctx->is_active = 0; + ctx->rotate_necessary = 0; if (ctx->task) { WARN_ON_ONCE(cpuctx->task_ctx != ctx); cpuctx->task_ctx = NULL; @@ -3188,12 +3189,6 @@ static void ctx_sched_out(struct perf_event_context *ctx, if (!ctx->nr_active || !(is_active & EVENT_ALL)) return; - /* - * If we had been multiplexing, no rotations are necessary, now no events - * are active. - */ - ctx->rotate_necessary = 0; - perf_pmu_disable(ctx->pmu); if (is_active & EVENT_PINNED) { list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list) @@ -3203,6 +3198,13 @@ static void ctx_sched_out(struct perf_event_context *ctx, if (is_active & EVENT_FLEXIBLE) { list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list) group_sched_out(event, cpuctx, ctx); + + /* + * Since we cleared EVENT_FLEXIBLE, also clear + * rotate_necessary, is will be reset by + * ctx_flexible_sched_in() when needed. + */ + ctx->rotate_necessary = 0; } perf_pmu_enable(ctx->pmu); } @@ -3985,6 +3987,12 @@ ctx_event_to_rotate(struct perf_event_context *ctx) typeof(*event), group_node); } + /* + * Unconditionally clear rotate_necessary; if ctx_flexible_sched_in() + * finds there are unschedulable events, it will set it again. + */ + ctx->rotate_necessary = 0; + return event; } -- cgit v1.2.3-71-gd317 From a6763625ae6f8aa5ee82fcd8fa4e5e38db20dbc6 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Thu, 12 Mar 2020 13:56:37 +0300 Subject: perf/core: Fix reversed NULL check in perf_event_groups_less() This NULL check is reversed so it leads to a Smatch warning and presumably a NULL dereference. kernel/events/core.c:1598 perf_event_groups_less() error: we previously assumed 'right->cgrp->css.cgroup' could be null (see line 1590) Fixes: 95ed6c707f26 ("perf/cgroup: Order events in RB tree by cgroup id") Signed-off-by: Dan Carpenter Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200312105637.GA8960@mwanda --- kernel/events/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index b5a68d26f0b9..d22e4ba59dfa 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1586,7 +1586,7 @@ perf_event_groups_less(struct perf_event *left, struct perf_event *right) */ return true; } - if (!right->cgrp || right->cgrp->css.cgroup) { + if (!right->cgrp || !right->cgrp->css.cgroup) { /* * Right has no cgroup but left does, no cgroups come * first. -- cgit v1.2.3-71-gd317 From 25016bd7f4caf5fc983bbab7403d08e64cba3004 Mon Sep 17 00:00:00 2001 From: Boqun Feng Date: Thu, 12 Mar 2020 23:12:55 +0800 Subject: locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps() Qian Cai reported a bug when PROVE_RCU_LIST=y, and read on /proc/lockdep triggered a warning: [ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) ... [ ] Call Trace: [ ] lock_is_held_type+0x5d/0x150 [ ] ? rcu_lockdep_current_cpu_online+0x64/0x80 [ ] rcu_read_lock_any_held+0xac/0x100 [ ] ? rcu_read_lock_held+0xc0/0xc0 [ ] ? __slab_free+0x421/0x540 [ ] ? kasan_kmalloc+0x9/0x10 [ ] ? __kmalloc_node+0x1d7/0x320 [ ] ? kvmalloc_node+0x6f/0x80 [ ] __bfs+0x28a/0x3c0 [ ] ? class_equal+0x30/0x30 [ ] lockdep_count_forward_deps+0x11a/0x1a0 The warning got triggered because lockdep_count_forward_deps() call __bfs() without current->lockdep_recursion being set, as a result a lockdep internal function (__bfs()) is checked by lockdep, which is unexpected, and the inconsistency between the irq-off state and the state traced by lockdep caused the warning. Apart from this warning, lockdep internal functions like __bfs() should always be protected by current->lockdep_recursion to avoid potential deadlocks and data inconsistency, therefore add the current->lockdep_recursion on-and-off section to protect __bfs() in both lockdep_count_forward_deps() and lockdep_count_backward_deps() Reported-by: Qian Cai Signed-off-by: Boqun Feng Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200312151258.128036-1-boqun.feng@gmail.com --- kernel/locking/lockdep.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index e55c4ee14e64..2564950a402c 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -1723,9 +1723,11 @@ unsigned long lockdep_count_forward_deps(struct lock_class *class) this.class = class; raw_local_irq_save(flags); + current->lockdep_recursion = 1; arch_spin_lock(&lockdep_lock); ret = __lockdep_count_forward_deps(&this); arch_spin_unlock(&lockdep_lock); + current->lockdep_recursion = 0; raw_local_irq_restore(flags); return ret; @@ -1750,9 +1752,11 @@ unsigned long lockdep_count_backward_deps(struct lock_class *class) this.class = class; raw_local_irq_save(flags); + current->lockdep_recursion = 1; arch_spin_lock(&lockdep_lock); ret = __lockdep_count_backward_deps(&this); arch_spin_unlock(&lockdep_lock); + current->lockdep_recursion = 0; raw_local_irq_restore(flags); return ret; -- cgit v1.2.3-71-gd317 From 10476e6304222ced7df9b3d5fb0a043b3c2a1ad8 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 13 Mar 2020 09:56:38 +0100 Subject: locking/lockdep: Fix bad recursion pattern There were two patterns for lockdep_recursion: Pattern-A: if (current->lockdep_recursion) return current->lockdep_recursion = 1; /* do stuff */ current->lockdep_recursion = 0; Pattern-B: current->lockdep_recursion++; /* do stuff */ current->lockdep_recursion--; But a third pattern has emerged: Pattern-C: current->lockdep_recursion = 1; /* do stuff */ current->lockdep_recursion = 0; And while this isn't broken per-se, it is highly dangerous because it doesn't nest properly. Get rid of all Pattern-C instances and shore up Pattern-A with a warning. Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200313093325.GW12561@hirez.programming.kicks-ass.net --- kernel/locking/lockdep.c | 74 ++++++++++++++++++++++++++---------------------- 1 file changed, 40 insertions(+), 34 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 2564950a402c..64ea69f94b3a 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -390,6 +390,12 @@ void lockdep_on(void) } EXPORT_SYMBOL(lockdep_on); +static inline void lockdep_recursion_finish(void) +{ + if (WARN_ON_ONCE(--current->lockdep_recursion)) + current->lockdep_recursion = 0; +} + void lockdep_set_selftest_task(struct task_struct *task) { lockdep_selftest_task_struct = task; @@ -1723,11 +1729,11 @@ unsigned long lockdep_count_forward_deps(struct lock_class *class) this.class = class; raw_local_irq_save(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; arch_spin_lock(&lockdep_lock); ret = __lockdep_count_forward_deps(&this); arch_spin_unlock(&lockdep_lock); - current->lockdep_recursion = 0; + current->lockdep_recursion--; raw_local_irq_restore(flags); return ret; @@ -1752,11 +1758,11 @@ unsigned long lockdep_count_backward_deps(struct lock_class *class) this.class = class; raw_local_irq_save(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; arch_spin_lock(&lockdep_lock); ret = __lockdep_count_backward_deps(&this); arch_spin_unlock(&lockdep_lock); - current->lockdep_recursion = 0; + current->lockdep_recursion--; raw_local_irq_restore(flags); return ret; @@ -3668,9 +3674,9 @@ void lockdep_hardirqs_on(unsigned long ip) if (DEBUG_LOCKS_WARN_ON(current->hardirq_context)) return; - current->lockdep_recursion = 1; + current->lockdep_recursion++; __trace_hardirqs_on_caller(ip); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); } NOKPROBE_SYMBOL(lockdep_hardirqs_on); @@ -3726,7 +3732,7 @@ void trace_softirqs_on(unsigned long ip) return; } - current->lockdep_recursion = 1; + current->lockdep_recursion++; /* * We'll do an OFF -> ON transition: */ @@ -3741,7 +3747,7 @@ void trace_softirqs_on(unsigned long ip) */ if (curr->hardirqs_enabled) mark_held_locks(curr, LOCK_ENABLED_SOFTIRQ); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); } /* @@ -3995,9 +4001,9 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name, return; raw_local_irq_save(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; register_lock_class(lock, subclass, 1); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } } @@ -4677,11 +4683,11 @@ void lock_set_class(struct lockdep_map *lock, const char *name, return; raw_local_irq_save(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; check_flags(flags); if (__lock_set_class(lock, name, key, subclass, ip)) check_chain_key(current); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_set_class); @@ -4694,11 +4700,11 @@ void lock_downgrade(struct lockdep_map *lock, unsigned long ip) return; raw_local_irq_save(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; check_flags(flags); if (__lock_downgrade(lock, ip)) check_chain_key(current); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_downgrade); @@ -4719,11 +4725,11 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass, raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock, ip); __lock_acquire(lock, subclass, trylock, read, check, irqs_disabled_flags(flags), nest_lock, ip, 0, 0); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_acquire); @@ -4737,11 +4743,11 @@ void lock_release(struct lockdep_map *lock, unsigned long ip) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; trace_lock_release(lock, ip); if (__lock_release(lock, ip)) check_chain_key(current); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_release); @@ -4757,9 +4763,9 @@ int lock_is_held_type(const struct lockdep_map *lock, int read) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; ret = __lock_is_held(lock, read); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); return ret; @@ -4778,9 +4784,9 @@ struct pin_cookie lock_pin_lock(struct lockdep_map *lock) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; cookie = __lock_pin_lock(lock); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); return cookie; @@ -4797,9 +4803,9 @@ void lock_repin_lock(struct lockdep_map *lock, struct pin_cookie cookie) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; __lock_repin_lock(lock, cookie); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_repin_lock); @@ -4814,9 +4820,9 @@ void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie cookie) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; __lock_unpin_lock(lock, cookie); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_unpin_lock); @@ -4952,10 +4958,10 @@ void lock_contended(struct lockdep_map *lock, unsigned long ip) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; trace_lock_contended(lock, ip); __lock_contended(lock, ip); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_contended); @@ -4972,9 +4978,9 @@ void lock_acquired(struct lockdep_map *lock, unsigned long ip) raw_local_irq_save(flags); check_flags(flags); - current->lockdep_recursion = 1; + current->lockdep_recursion++; __lock_acquired(lock, ip); - current->lockdep_recursion = 0; + lockdep_recursion_finish(); raw_local_irq_restore(flags); } EXPORT_SYMBOL_GPL(lock_acquired); @@ -5176,7 +5182,7 @@ static void free_zapped_rcu(struct rcu_head *ch) raw_local_irq_save(flags); arch_spin_lock(&lockdep_lock); - current->lockdep_recursion = 1; + current->lockdep_recursion++; /* closed head */ pf = delayed_free.pf + (delayed_free.index ^ 1); @@ -5188,7 +5194,7 @@ static void free_zapped_rcu(struct rcu_head *ch) */ call_rcu_zapped(delayed_free.pf + delayed_free.index); - current->lockdep_recursion = 0; + current->lockdep_recursion--; arch_spin_unlock(&lockdep_lock); raw_local_irq_restore(flags); } @@ -5235,11 +5241,11 @@ static void lockdep_free_key_range_reg(void *start, unsigned long size) raw_local_irq_save(flags); arch_spin_lock(&lockdep_lock); - current->lockdep_recursion = 1; + current->lockdep_recursion++; pf = get_pending_free(); __lockdep_free_key_range(pf, start, size); call_rcu_zapped(pf); - current->lockdep_recursion = 0; + current->lockdep_recursion--; arch_spin_unlock(&lockdep_lock); raw_local_irq_restore(flags); -- cgit v1.2.3-71-gd317 From 248efb2158f1e23750728e92ad9db3ab60c14485 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 13 Mar 2020 11:09:49 +0100 Subject: locking/lockdep: Rework lockdep_lock A few sites want to assert we own the graph_lock/lockdep_lock, provide a more conventional lock interface for it with a number of trivial debug checks. Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200313102107.GX12561@hirez.programming.kicks-ass.net --- kernel/locking/lockdep.c | 89 ++++++++++++++++++++++++++---------------------- 1 file changed, 48 insertions(+), 41 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 64ea69f94b3a..47e3acb6b5c9 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -84,12 +84,39 @@ module_param(lock_stat, int, 0644); * to use a raw spinlock - we really dont want the spinlock * code to recurse back into the lockdep code... */ -static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; +static arch_spinlock_t __lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; +static struct task_struct *__owner; + +static inline void lockdep_lock(void) +{ + DEBUG_LOCKS_WARN_ON(!irqs_disabled()); + + arch_spin_lock(&__lock); + __owner = current; + current->lockdep_recursion++; +} + +static inline void lockdep_unlock(void) +{ + if (debug_locks && DEBUG_LOCKS_WARN_ON(__owner != current)) + return; + + current->lockdep_recursion--; + __owner = NULL; + arch_spin_unlock(&__lock); +} + +static inline bool lockdep_assert_locked(void) +{ + return DEBUG_LOCKS_WARN_ON(__owner != current); +} + static struct task_struct *lockdep_selftest_task_struct; + static int graph_lock(void) { - arch_spin_lock(&lockdep_lock); + lockdep_lock(); /* * Make sure that if another CPU detected a bug while * walking the graph we dont change it (while the other @@ -97,27 +124,15 @@ static int graph_lock(void) * dropped already) */ if (!debug_locks) { - arch_spin_unlock(&lockdep_lock); + lockdep_unlock(); return 0; } - /* prevent any recursions within lockdep from causing deadlocks */ - current->lockdep_recursion++; return 1; } -static inline int graph_unlock(void) +static inline void graph_unlock(void) { - if (debug_locks && !arch_spin_is_locked(&lockdep_lock)) { - /* - * The lockdep graph lock isn't locked while we expect it to - * be, we're confused now, bye! - */ - return DEBUG_LOCKS_WARN_ON(1); - } - - current->lockdep_recursion--; - arch_spin_unlock(&lockdep_lock); - return 0; + lockdep_unlock(); } /* @@ -128,7 +143,7 @@ static inline int debug_locks_off_graph_unlock(void) { int ret = debug_locks_off(); - arch_spin_unlock(&lockdep_lock); + lockdep_unlock(); return ret; } @@ -1479,6 +1494,8 @@ static int __bfs(struct lock_list *source_entry, struct circular_queue *cq = &lock_cq; int ret = 1; + lockdep_assert_locked(); + if (match(source_entry, data)) { *target_entry = source_entry; ret = 0; @@ -1501,8 +1518,6 @@ static int __bfs(struct lock_list *source_entry, head = get_dep_list(lock, offset); - DEBUG_LOCKS_WARN_ON(!irqs_disabled()); - list_for_each_entry_rcu(entry, head, entry) { if (!lock_accessed(entry)) { unsigned int cq_depth; @@ -1729,11 +1744,9 @@ unsigned long lockdep_count_forward_deps(struct lock_class *class) this.class = class; raw_local_irq_save(flags); - current->lockdep_recursion++; - arch_spin_lock(&lockdep_lock); + lockdep_lock(); ret = __lockdep_count_forward_deps(&this); - arch_spin_unlock(&lockdep_lock); - current->lockdep_recursion--; + lockdep_unlock(); raw_local_irq_restore(flags); return ret; @@ -1758,11 +1771,9 @@ unsigned long lockdep_count_backward_deps(struct lock_class *class) this.class = class; raw_local_irq_save(flags); - current->lockdep_recursion++; - arch_spin_lock(&lockdep_lock); + lockdep_lock(); ret = __lockdep_count_backward_deps(&this); - arch_spin_unlock(&lockdep_lock); - current->lockdep_recursion--; + lockdep_unlock(); raw_local_irq_restore(flags); return ret; @@ -3046,7 +3057,7 @@ static inline int add_chain_cache(struct task_struct *curr, * disabled to make this an IRQ-safe lock.. for recursion reasons * lockdep won't complain about its own locking errors. */ - if (DEBUG_LOCKS_WARN_ON(!irqs_disabled())) + if (lockdep_assert_locked()) return 0; chain = alloc_lock_chain(); @@ -5181,8 +5192,7 @@ static void free_zapped_rcu(struct rcu_head *ch) return; raw_local_irq_save(flags); - arch_spin_lock(&lockdep_lock); - current->lockdep_recursion++; + lockdep_lock(); /* closed head */ pf = delayed_free.pf + (delayed_free.index ^ 1); @@ -5194,8 +5204,7 @@ static void free_zapped_rcu(struct rcu_head *ch) */ call_rcu_zapped(delayed_free.pf + delayed_free.index); - current->lockdep_recursion--; - arch_spin_unlock(&lockdep_lock); + lockdep_unlock(); raw_local_irq_restore(flags); } @@ -5240,13 +5249,11 @@ static void lockdep_free_key_range_reg(void *start, unsigned long size) init_data_structures_once(); raw_local_irq_save(flags); - arch_spin_lock(&lockdep_lock); - current->lockdep_recursion++; + lockdep_lock(); pf = get_pending_free(); __lockdep_free_key_range(pf, start, size); call_rcu_zapped(pf); - current->lockdep_recursion--; - arch_spin_unlock(&lockdep_lock); + lockdep_unlock(); raw_local_irq_restore(flags); /* @@ -5268,10 +5275,10 @@ static void lockdep_free_key_range_imm(void *start, unsigned long size) init_data_structures_once(); raw_local_irq_save(flags); - arch_spin_lock(&lockdep_lock); + lockdep_lock(); __lockdep_free_key_range(pf, start, size); __free_zapped_classes(pf); - arch_spin_unlock(&lockdep_lock); + lockdep_unlock(); raw_local_irq_restore(flags); } @@ -5367,10 +5374,10 @@ static void lockdep_reset_lock_imm(struct lockdep_map *lock) unsigned long flags; raw_local_irq_save(flags); - arch_spin_lock(&lockdep_lock); + lockdep_lock(); __lockdep_reset_lock(pf, lock); __free_zapped_classes(pf); - arch_spin_unlock(&lockdep_lock); + lockdep_unlock(); raw_local_irq_restore(flags); } -- cgit v1.2.3-71-gd317 From f6f48e18040402136874a6a71611e081b4d0788a Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 20 Feb 2020 09:45:02 +0100 Subject: lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions nmi_enter() does lockdep_off() and hence lockdep ignores everything. And NMI context makes it impossible to do full IN-NMI tracking like we do IN-HARDIRQ, that could result in graph_lock recursion. However, since look_up_lock_class() is lockless, we can find the class of a lock that has prior use and detect IN-NMI after USED, just not USED after IN-NMI. NOTE: By shifting the lockdep_off() recursion count to bit-16, we can easily differentiate between actual recursion and off. Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Frederic Weisbecker Reviewed-by: Joel Fernandes (Google) Link: https://lkml.kernel.org/r/20200221134215.090538203@infradead.org --- kernel/locking/lockdep.c | 62 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 59 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 47e3acb6b5c9..4c3b1ccc6c2d 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -393,15 +393,22 @@ void lockdep_init_task(struct task_struct *task) task->lockdep_recursion = 0; } +/* + * Split the recrursion counter in two to readily detect 'off' vs recursion. + */ +#define LOCKDEP_RECURSION_BITS 16 +#define LOCKDEP_OFF (1U << LOCKDEP_RECURSION_BITS) +#define LOCKDEP_RECURSION_MASK (LOCKDEP_OFF - 1) + void lockdep_off(void) { - current->lockdep_recursion++; + current->lockdep_recursion += LOCKDEP_OFF; } EXPORT_SYMBOL(lockdep_off); void lockdep_on(void) { - current->lockdep_recursion--; + current->lockdep_recursion -= LOCKDEP_OFF; } EXPORT_SYMBOL(lockdep_on); @@ -597,6 +604,7 @@ static const char *usage_str[] = #include "lockdep_states.h" #undef LOCKDEP_STATE [LOCK_USED] = "INITIAL USE", + [LOCK_USAGE_STATES] = "IN-NMI", }; #endif @@ -809,6 +817,7 @@ static int count_matching_names(struct lock_class *new_class) return count + 1; } +/* used from NMI context -- must be lockless */ static inline struct lock_class * look_up_lock_class(const struct lockdep_map *lock, unsigned int subclass) { @@ -4720,6 +4729,36 @@ void lock_downgrade(struct lockdep_map *lock, unsigned long ip) } EXPORT_SYMBOL_GPL(lock_downgrade); +/* NMI context !!! */ +static void verify_lock_unused(struct lockdep_map *lock, struct held_lock *hlock, int subclass) +{ +#ifdef CONFIG_PROVE_LOCKING + struct lock_class *class = look_up_lock_class(lock, subclass); + + /* if it doesn't have a class (yet), it certainly hasn't been used yet */ + if (!class) + return; + + if (!(class->usage_mask & LOCK_USED)) + return; + + hlock->class_idx = class - lock_classes; + + print_usage_bug(current, hlock, LOCK_USED, LOCK_USAGE_STATES); +#endif +} + +static bool lockdep_nmi(void) +{ + if (current->lockdep_recursion & LOCKDEP_RECURSION_MASK) + return false; + + if (!in_nmi()) + return false; + + return true; +} + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: @@ -4730,8 +4769,25 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass, { unsigned long flags; - if (unlikely(current->lockdep_recursion)) + if (unlikely(current->lockdep_recursion)) { + /* XXX allow trylock from NMI ?!? */ + if (lockdep_nmi() && !trylock) { + struct held_lock hlock; + + hlock.acquire_ip = ip; + hlock.instance = lock; + hlock.nest_lock = nest_lock; + hlock.irq_context = 2; // XXX + hlock.trylock = trylock; + hlock.read = read; + hlock.check = check; + hlock.hardirqs_off = true; + hlock.references = 0; + + verify_lock_unused(lock, &hlock, subclass); + } return; + } raw_local_irq_save(flags); check_flags(flags); -- cgit v1.2.3-71-gd317 From 80fbaf1c3f2926c834f8ca915441dfe27ce5487e Mon Sep 17 00:00:00 2001 From: "Peter Zijlstra (Intel)" Date: Sat, 21 Mar 2020 12:25:55 +0100 Subject: rcuwait: Add @state argument to rcuwait_wait_event() Extend rcuwait_wait_event() with a state variable so that it is not restricted to UNINTERRUPTIBLE waits. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113241.824030968@linutronix.de --- include/linux/rcuwait.h | 12 ++++++++++-- kernel/locking/percpu-rwsem.c | 2 +- 2 files changed, 11 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/include/linux/rcuwait.h b/include/linux/rcuwait.h index 75c97e4bbc57..2ffe1ee6d482 100644 --- a/include/linux/rcuwait.h +++ b/include/linux/rcuwait.h @@ -3,6 +3,7 @@ #define _LINUX_RCUWAIT_H_ #include +#include /* * rcuwait provides a way of blocking and waking up a single @@ -30,23 +31,30 @@ extern void rcuwait_wake_up(struct rcuwait *w); * The caller is responsible for locking around rcuwait_wait_event(), * such that writes to @task are properly serialized. */ -#define rcuwait_wait_event(w, condition) \ +#define rcuwait_wait_event(w, condition, state) \ ({ \ + int __ret = 0; \ rcu_assign_pointer((w)->task, current); \ for (;;) { \ /* \ * Implicit barrier (A) pairs with (B) in \ * rcuwait_wake_up(). \ */ \ - set_current_state(TASK_UNINTERRUPTIBLE); \ + set_current_state(state); \ if (condition) \ break; \ \ + if (signal_pending_state(state, current)) { \ + __ret = -EINTR; \ + break; \ + } \ + \ schedule(); \ } \ \ WRITE_ONCE((w)->task, NULL); \ __set_current_state(TASK_RUNNING); \ + __ret; \ }) #endif /* _LINUX_RCUWAIT_H_ */ diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index 183a3aaa8509..a008a1ba21a7 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -234,7 +234,7 @@ void percpu_down_write(struct percpu_rw_semaphore *sem) */ /* Wait for all active readers to complete. */ - rcuwait_wait_event(&sem->writer, readers_active_check(sem)); + rcuwait_wait_event(&sem->writer, readers_active_check(sem), TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL_GPL(percpu_down_write); -- cgit v1.2.3-71-gd317 From e5d4d1756b07d9490a0269a9e68c1e05ee1feb9b Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 21 Mar 2020 12:25:58 +0100 Subject: timekeeping: Split jiffies seqlock seqlock consists of a sequence counter and a spinlock_t which is used to serialize the writers. spinlock_t is substituted by a "sleeping" spinlock on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping code as the writers are executed in hard interrupt and therefore non-preemptible context even on PREEMPT_RT. The spinlock in seqlock cannot be unconditionally replaced by a raw_spinlock_t as many seqlock users have nesting spinlock sections or other code which is not suitable to run in truly atomic context on RT. Instead of providing a raw_seqlock API for a single use case, open code the seqlock for the jiffies use case and implement it with a raw_spinlock_t and a sequence counter. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113242.120587764@linutronix.de --- kernel/time/jiffies.c | 7 ++++--- kernel/time/tick-common.c | 10 ++++++---- kernel/time/tick-sched.c | 19 ++++++++++++------- kernel/time/timekeeping.c | 6 ++++-- kernel/time/timekeeping.h | 3 ++- 5 files changed, 28 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c index d23b434c2ca7..eddcf4970444 100644 --- a/kernel/time/jiffies.c +++ b/kernel/time/jiffies.c @@ -58,7 +58,8 @@ static struct clocksource clocksource_jiffies = { .max_cycles = 10, }; -__cacheline_aligned_in_smp DEFINE_SEQLOCK(jiffies_lock); +__cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(jiffies_lock); +__cacheline_aligned_in_smp seqcount_t jiffies_seq; #if (BITS_PER_LONG < 64) u64 get_jiffies_64(void) @@ -67,9 +68,9 @@ u64 get_jiffies_64(void) u64 ret; do { - seq = read_seqbegin(&jiffies_lock); + seq = read_seqcount_begin(&jiffies_seq); ret = jiffies_64; - } while (read_seqretry(&jiffies_lock, seq)); + } while (read_seqcount_retry(&jiffies_seq, seq)); return ret; } EXPORT_SYMBOL(get_jiffies_64); diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c index 7e5d3524e924..6c9c342dd0e5 100644 --- a/kernel/time/tick-common.c +++ b/kernel/time/tick-common.c @@ -84,13 +84,15 @@ int tick_is_oneshot_available(void) static void tick_periodic(int cpu) { if (tick_do_timer_cpu == cpu) { - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); /* Keep track of the next tick event */ tick_next_period = ktime_add(tick_next_period, tick_period); do_timer(1); - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); update_wall_time(); } @@ -162,9 +164,9 @@ void tick_setup_periodic(struct clock_event_device *dev, int broadcast) ktime_t next; do { - seq = read_seqbegin(&jiffies_lock); + seq = read_seqcount_begin(&jiffies_seq); next = tick_next_period; - } while (read_seqretry(&jiffies_lock, seq)); + } while (read_seqcount_retry(&jiffies_seq, seq)); clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index a792d21cac64..4be756b88a48 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -65,7 +65,8 @@ static void tick_do_update_jiffies64(ktime_t now) return; /* Reevaluate with jiffies_lock held */ - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); delta = ktime_sub(now, last_jiffies_update); if (delta >= tick_period) { @@ -91,10 +92,12 @@ static void tick_do_update_jiffies64(ktime_t now) /* Keep the tick_next_period variable up to date */ tick_next_period = ktime_add(last_jiffies_update, tick_period); } else { - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); return; } - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); update_wall_time(); } @@ -105,12 +108,14 @@ static ktime_t tick_init_jiffy_update(void) { ktime_t period; - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); /* Did we start the jiffies update yet ? */ if (last_jiffies_update == 0) last_jiffies_update = tick_next_period; period = last_jiffies_update; - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); return period; } @@ -676,10 +681,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu) /* Read jiffies and the time when jiffies were updated last */ do { - seq = read_seqbegin(&jiffies_lock); + seq = read_seqcount_begin(&jiffies_seq); basemono = last_jiffies_update; basejiff = jiffies; - } while (read_seqretry(&jiffies_lock, seq)); + } while (read_seqcount_retry(&jiffies_seq, seq)); ts->last_jiffies = basejiff; ts->timer_expires_base = basemono; diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index ca69290bee2a..856280d2cbd4 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -2397,8 +2397,10 @@ EXPORT_SYMBOL(hardpps); */ void xtime_update(unsigned long ticks) { - write_seqlock(&jiffies_lock); + raw_spin_lock(&jiffies_lock); + write_seqcount_begin(&jiffies_seq); do_timer(ticks); - write_sequnlock(&jiffies_lock); + write_seqcount_end(&jiffies_seq); + raw_spin_unlock(&jiffies_lock); update_wall_time(); } diff --git a/kernel/time/timekeeping.h b/kernel/time/timekeeping.h index 141ab3ab0354..099737f6f10c 100644 --- a/kernel/time/timekeeping.h +++ b/kernel/time/timekeeping.h @@ -25,7 +25,8 @@ static inline void sched_clock_resume(void) { } extern void do_timer(unsigned long ticks); extern void update_wall_time(void); -extern seqlock_t jiffies_lock; +extern raw_spinlock_t jiffies_lock; +extern seqcount_t jiffies_seq; #define CS_NAME_LEN 32 -- cgit v1.2.3-71-gd317 From b3212fe2bc06fa1014b3063b85b2bac4332a1c28 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 21 Mar 2020 12:25:59 +0100 Subject: sched/swait: Prepare usage in completions As a preparation to use simple wait queues for completions: - Provide swake_up_all_locked() to support complete_all() - Make __prepare_to_swait() public available This is done to enable the usage of complete() within truly atomic contexts on a PREEMPT_RT enabled kernel. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113242.228481202@linutronix.de --- kernel/sched/sched.h | 3 +++ kernel/sched/swait.c | 15 ++++++++++++++- 2 files changed, 17 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 9ea647835fd6..fdc77e796324 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2492,3 +2492,6 @@ static inline bool is_per_cpu_kthread(struct task_struct *p) return true; } #endif + +void swake_up_all_locked(struct swait_queue_head *q); +void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait); diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c index e83a3f8449f6..e1c655f928c7 100644 --- a/kernel/sched/swait.c +++ b/kernel/sched/swait.c @@ -32,6 +32,19 @@ void swake_up_locked(struct swait_queue_head *q) } EXPORT_SYMBOL(swake_up_locked); +/* + * Wake up all waiters. This is an interface which is solely exposed for + * completions and not for general usage. + * + * It is intentionally different from swake_up_all() to allow usage from + * hard interrupt context and interrupt disabled regions. + */ +void swake_up_all_locked(struct swait_queue_head *q) +{ + while (!list_empty(&q->task_list)) + swake_up_locked(q); +} + void swake_up_one(struct swait_queue_head *q) { unsigned long flags; @@ -69,7 +82,7 @@ void swake_up_all(struct swait_queue_head *q) } EXPORT_SYMBOL(swake_up_all); -static void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait) +void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait) { wait->task = current; if (list_empty(&wait->task_list)) -- cgit v1.2.3-71-gd317 From a5c6234e10280b3ec65e2410ce34904a2580e5f8 Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Sat, 21 Mar 2020 12:26:00 +0100 Subject: completion: Use simple wait queues completion uses a wait_queue_head_t to enqueue waiters. wait_queue_head_t contains a spinlock_t to protect the list of waiters which excludes it from being used in truly atomic context on a PREEMPT_RT enabled kernel. The spinlock in the wait queue head cannot be replaced by a raw_spinlock because: - wait queues can have custom wakeup callbacks, which acquire other spinlock_t locks and have potentially long execution times - wake_up() walks an unbounded number of list entries during the wake up and may wake an unbounded number of waiters. For simplicity and performance reasons complete() should be usable on PREEMPT_RT enabled kernels. completions do not use custom wakeup callbacks and are usually single waiter, except for a few corner cases. Replace the wait queue in the completion with a simple wait queue (swait), which uses a raw_spinlock_t for protecting the waiter list and therefore is safe to use inside truly atomic regions on PREEMPT_RT. There is no semantical or functional change: - completions use the exclusive wait mode which is what swait provides - complete() wakes one exclusive waiter - complete_all() wakes all waiters while holding the lock which protects the wait queue against newly incoming waiters. The conversion to swait preserves this behaviour. complete_all() might cause unbound latencies with a large number of waiters being woken at once, but most complete_all() usage sites are either in testing or initialization code or have only a really small number of concurrent waiters which for now does not cause a latency problem. Keep it simple for now. The fixup of the warning check in the USB gadget driver is just a straight forward conversion of the lockless waiter check from one waitqueue type to the other. Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Greg Kroah-Hartman Reviewed-by: Davidlohr Bueso Reviewed-by: Joel Fernandes (Google) Acked-by: Linus Torvalds Link: https://lkml.kernel.org/r/20200321113242.317954042@linutronix.de --- drivers/usb/gadget/function/f_fs.c | 2 +- include/linux/completion.h | 8 ++++---- kernel/sched/completion.c | 36 +++++++++++++++++++----------------- 3 files changed, 24 insertions(+), 22 deletions(-) (limited to 'kernel') diff --git a/drivers/usb/gadget/function/f_fs.c b/drivers/usb/gadget/function/f_fs.c index 571917677d35..234177dd1ed5 100644 --- a/drivers/usb/gadget/function/f_fs.c +++ b/drivers/usb/gadget/function/f_fs.c @@ -1703,7 +1703,7 @@ static void ffs_data_put(struct ffs_data *ffs) pr_info("%s(): freeing\n", __func__); ffs_data_clear(ffs); BUG_ON(waitqueue_active(&ffs->ev.waitq) || - waitqueue_active(&ffs->ep0req_completion.wait) || + swait_active(&ffs->ep0req_completion.wait) || waitqueue_active(&ffs->wait)); destroy_workqueue(ffs->io_completion_wq); kfree(ffs->dev_name); diff --git a/include/linux/completion.h b/include/linux/completion.h index 519e94915d18..bf8e77001f18 100644 --- a/include/linux/completion.h +++ b/include/linux/completion.h @@ -9,7 +9,7 @@ * See kernel/sched/completion.c for details. */ -#include +#include /* * struct completion - structure used to maintain state for a "completion" @@ -25,7 +25,7 @@ */ struct completion { unsigned int done; - wait_queue_head_t wait; + struct swait_queue_head wait; }; #define init_completion_map(x, m) __init_completion(x) @@ -34,7 +34,7 @@ static inline void complete_acquire(struct completion *x) {} static inline void complete_release(struct completion *x) {} #define COMPLETION_INITIALIZER(work) \ - { 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) } + { 0, __SWAIT_QUEUE_HEAD_INITIALIZER((work).wait) } #define COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) \ (*({ init_completion_map(&(work), &(map)); &(work); })) @@ -85,7 +85,7 @@ static inline void complete_release(struct completion *x) {} static inline void __init_completion(struct completion *x) { x->done = 0; - init_waitqueue_head(&x->wait); + init_swait_queue_head(&x->wait); } /** diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c index a1ad5b7d5521..f15e96164ff1 100644 --- a/kernel/sched/completion.c +++ b/kernel/sched/completion.c @@ -29,12 +29,12 @@ void complete(struct completion *x) { unsigned long flags; - spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); if (x->done != UINT_MAX) x->done++; - __wake_up_locked(&x->wait, TASK_NORMAL, 1); - spin_unlock_irqrestore(&x->wait.lock, flags); + swake_up_locked(&x->wait); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); } EXPORT_SYMBOL(complete); @@ -58,10 +58,12 @@ void complete_all(struct completion *x) { unsigned long flags; - spin_lock_irqsave(&x->wait.lock, flags); + WARN_ON(irqs_disabled()); + + raw_spin_lock_irqsave(&x->wait.lock, flags); x->done = UINT_MAX; - __wake_up_locked(&x->wait, TASK_NORMAL, 0); - spin_unlock_irqrestore(&x->wait.lock, flags); + swake_up_all_locked(&x->wait); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); } EXPORT_SYMBOL(complete_all); @@ -70,20 +72,20 @@ do_wait_for_common(struct completion *x, long (*action)(long), long timeout, int state) { if (!x->done) { - DECLARE_WAITQUEUE(wait, current); + DECLARE_SWAITQUEUE(wait); - __add_wait_queue_entry_tail_exclusive(&x->wait, &wait); do { if (signal_pending_state(state, current)) { timeout = -ERESTARTSYS; break; } + __prepare_to_swait(&x->wait, &wait); __set_current_state(state); - spin_unlock_irq(&x->wait.lock); + raw_spin_unlock_irq(&x->wait.lock); timeout = action(timeout); - spin_lock_irq(&x->wait.lock); + raw_spin_lock_irq(&x->wait.lock); } while (!x->done && timeout); - __remove_wait_queue(&x->wait, &wait); + __finish_swait(&x->wait, &wait); if (!x->done) return timeout; } @@ -100,9 +102,9 @@ __wait_for_common(struct completion *x, complete_acquire(x); - spin_lock_irq(&x->wait.lock); + raw_spin_lock_irq(&x->wait.lock); timeout = do_wait_for_common(x, action, timeout, state); - spin_unlock_irq(&x->wait.lock); + raw_spin_unlock_irq(&x->wait.lock); complete_release(x); @@ -291,12 +293,12 @@ bool try_wait_for_completion(struct completion *x) if (!READ_ONCE(x->done)) return false; - spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); if (!x->done) ret = false; else if (x->done != UINT_MAX) x->done--; - spin_unlock_irqrestore(&x->wait.lock, flags); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); return ret; } EXPORT_SYMBOL(try_wait_for_completion); @@ -322,8 +324,8 @@ bool completion_done(struct completion *x) * otherwise we can end up freeing the completion before complete() * is done referencing it. */ - spin_lock_irqsave(&x->wait.lock, flags); - spin_unlock_irqrestore(&x->wait.lock, flags); + raw_spin_lock_irqsave(&x->wait.lock, flags); + raw_spin_unlock_irqrestore(&x->wait.lock, flags); return true; } EXPORT_SYMBOL(completion_done); -- cgit v1.2.3-71-gd317 From de8f5e4f2dc1f032b46afda0a78cab5456974f89 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Sat, 21 Mar 2020 12:26:01 +0100 Subject: lockdep: Introduce wait-type checks MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Extend lockdep to validate lock wait-type context. The current wait-types are: LD_WAIT_FREE, /* wait free, rcu etc.. */ LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */ LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */ LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */ Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, its a more fancy might_sleep(). Obviously RCU made the entire ordeal more complex than a simple single value test because RCU can be acquired in (pretty much) any context and while it presents a context to nested locks it is not the same as it got acquired in. Therefore its necessary to split the wait_type into two values, one representing the acquire (outer) and one representing the nested context (inner). For most 'normal' locks these two are the same. [ To make static initialization easier we have the rule that: .outer == INV means .outer == .inner; because INV == 0. ] It further means that its required to find the minimal .inner of the held stack to compare against the outer of the new lock; because while 'normal' RCU presents a CONFIG type to nested locks, if it is taken while already holding a SPIN type it obviously doesn't relax the rules. Below is an example output generated by the trivial test code: raw_spin_lock(&foo); spin_lock(&bar); spin_unlock(&bar); raw_spin_unlock(&foo); [ BUG: Invalid wait context ] ----------------------------- swapper/0/1 is trying to lock: ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187 other info that might help us debug this: 1 lock held by swapper/0/1: #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187 The way to read it is to look at the new -{n,m} part in the lock description; -{3:3} for the attempted lock, and try and match that up to the held locks, which in this case is the one: -{2,2}. This tells that the acquiring lock requires a more relaxed environment than presented by the lock stack. Currently only the normal locks and RCU are converted, the rest of the lockdep users defaults to .inner = INV which is ignored. More conversions can be done when desired. The check for spinlock_t nesting is not enabled by default. It's a separate config option for now as there are known problems which are currently addressed. The config option allows to identify these problems and to verify that the solutions found are indeed solving them. The config switch will be removed and the checks will permanently enabled once the vast majority of issues has been addressed. [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP] [ tglx: Add the config option ] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de --- include/linux/irqflags.h | 8 ++- include/linux/lockdep.h | 71 +++++++++++++++++---- include/linux/mutex.h | 7 +- include/linux/rwlock_types.h | 6 +- include/linux/rwsem.h | 6 +- include/linux/sched.h | 1 + include/linux/spinlock.h | 35 +++++++--- include/linux/spinlock_types.h | 24 +++++-- kernel/irq/handle.c | 7 ++ kernel/locking/lockdep.c | 138 ++++++++++++++++++++++++++++++++++++++-- kernel/locking/mutex-debug.c | 2 +- kernel/locking/rwsem.c | 2 +- kernel/locking/spinlock_debug.c | 6 +- kernel/rcu/update.c | 24 +++++-- lib/Kconfig.debug | 17 +++++ 15 files changed, 307 insertions(+), 47 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 21619c92c377..fdaf28601cbe 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -37,7 +37,12 @@ # define trace_softirqs_enabled(p) ((p)->softirqs_enabled) # define trace_hardirq_enter() \ do { \ - current->hardirq_context++; \ + if (!current->hardirq_context++) \ + current->hardirq_threaded = 0; \ +} while (0) +# define trace_hardirq_threaded() \ +do { \ + current->hardirq_threaded = 1; \ } while (0) # define trace_hardirq_exit() \ do { \ @@ -59,6 +64,7 @@ do { \ # define trace_hardirqs_enabled(p) 0 # define trace_softirqs_enabled(p) 0 # define trace_hardirq_enter() do { } while (0) +# define trace_hardirq_threaded() do { } while (0) # define trace_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) # define lockdep_softirq_exit() do { } while (0) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 664f52c6dd4c..425b4ceb7cd0 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -21,6 +21,22 @@ extern int lock_stat; #include +enum lockdep_wait_type { + LD_WAIT_INV = 0, /* not checked, catch all */ + + LD_WAIT_FREE, /* wait free, rcu etc.. */ + LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */ + +#ifdef CONFIG_PROVE_RAW_LOCK_NESTING + LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */ +#else + LD_WAIT_CONFIG = LD_WAIT_SPIN, +#endif + LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */ + + LD_WAIT_MAX, /* must be last */ +}; + #ifdef CONFIG_LOCKDEP #include @@ -111,6 +127,9 @@ struct lock_class { int name_version; const char *name; + short wait_type_inner; + short wait_type_outer; + #ifdef CONFIG_LOCK_STAT unsigned long contention_point[LOCKSTAT_POINTS]; unsigned long contending_point[LOCKSTAT_POINTS]; @@ -158,6 +177,8 @@ struct lockdep_map { struct lock_class_key *key; struct lock_class *class_cache[NR_LOCKDEP_CACHING_CLASSES]; const char *name; + short wait_type_outer; /* can be taken in this context */ + short wait_type_inner; /* presents this context */ #ifdef CONFIG_LOCK_STAT int cpu; unsigned long ip; @@ -299,8 +320,21 @@ extern void lockdep_unregister_key(struct lock_class_key *key); * to lockdep: */ -extern void lockdep_init_map(struct lockdep_map *lock, const char *name, - struct lock_class_key *key, int subclass); +extern void lockdep_init_map_waits(struct lockdep_map *lock, const char *name, + struct lock_class_key *key, int subclass, short inner, short outer); + +static inline void +lockdep_init_map_wait(struct lockdep_map *lock, const char *name, + struct lock_class_key *key, int subclass, short inner) +{ + lockdep_init_map_waits(lock, name, key, subclass, inner, LD_WAIT_INV); +} + +static inline void lockdep_init_map(struct lockdep_map *lock, const char *name, + struct lock_class_key *key, int subclass) +{ + lockdep_init_map_wait(lock, name, key, subclass, LD_WAIT_INV); +} /* * Reinitialize a lock key - for cases where there is special locking or @@ -308,18 +342,29 @@ extern void lockdep_init_map(struct lockdep_map *lock, const char *name, * of dependencies wrong: they are either too broad (they need a class-split) * or they are too narrow (they suffer from a false class-split): */ -#define lockdep_set_class(lock, key) \ - lockdep_init_map(&(lock)->dep_map, #key, key, 0) -#define lockdep_set_class_and_name(lock, key, name) \ - lockdep_init_map(&(lock)->dep_map, name, key, 0) -#define lockdep_set_class_and_subclass(lock, key, sub) \ - lockdep_init_map(&(lock)->dep_map, #key, key, sub) -#define lockdep_set_subclass(lock, sub) \ - lockdep_init_map(&(lock)->dep_map, #lock, \ - (lock)->dep_map.key, sub) +#define lockdep_set_class(lock, key) \ + lockdep_init_map_waits(&(lock)->dep_map, #key, key, 0, \ + (lock)->dep_map.wait_type_inner, \ + (lock)->dep_map.wait_type_outer) + +#define lockdep_set_class_and_name(lock, key, name) \ + lockdep_init_map_waits(&(lock)->dep_map, name, key, 0, \ + (lock)->dep_map.wait_type_inner, \ + (lock)->dep_map.wait_type_outer) + +#define lockdep_set_class_and_subclass(lock, key, sub) \ + lockdep_init_map_waits(&(lock)->dep_map, #key, key, sub,\ + (lock)->dep_map.wait_type_inner, \ + (lock)->dep_map.wait_type_outer) + +#define lockdep_set_subclass(lock, sub) \ + lockdep_init_map_waits(&(lock)->dep_map, #lock, (lock)->dep_map.key, sub,\ + (lock)->dep_map.wait_type_inner, \ + (lock)->dep_map.wait_type_outer) #define lockdep_set_novalidate_class(lock) \ lockdep_set_class_and_name(lock, &__lockdep_no_validate__, #lock) + /* * Compare locking classes */ @@ -432,6 +477,10 @@ static inline void lockdep_set_selftest_task(struct task_struct *task) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i) do { } while (0) # define lockdep_init() do { } while (0) +# define lockdep_init_map_waits(lock, name, key, sub, inner, outer) \ + do { (void)(name); (void)(key); } while (0) +# define lockdep_init_map_wait(lock, name, key, sub, inner) \ + do { (void)(name); (void)(key); } while (0) # define lockdep_init_map(lock, name, key, sub) \ do { (void)(name); (void)(key); } while (0) # define lockdep_set_class(lock, key) do { (void)(key); } while (0) diff --git a/include/linux/mutex.h b/include/linux/mutex.h index aca8f36dfac9..ae197cc00cc8 100644 --- a/include/linux/mutex.h +++ b/include/linux/mutex.h @@ -109,8 +109,11 @@ do { \ } while (0) #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ - , .dep_map = { .name = #lockname } +# define __DEP_MAP_MUTEX_INITIALIZER(lockname) \ + , .dep_map = { \ + .name = #lockname, \ + .wait_type_inner = LD_WAIT_SLEEP, \ + } #else # define __DEP_MAP_MUTEX_INITIALIZER(lockname) #endif diff --git a/include/linux/rwlock_types.h b/include/linux/rwlock_types.h index 857a72ceb794..3bd03e18061c 100644 --- a/include/linux/rwlock_types.h +++ b/include/linux/rwlock_types.h @@ -22,7 +22,11 @@ typedef struct { #define RWLOCK_MAGIC 0xdeaf1eed #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define RW_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname } +# define RW_DEP_MAP_INIT(lockname) \ + .dep_map = { \ + .name = #lockname, \ + .wait_type_inner = LD_WAIT_CONFIG, \ + } #else # define RW_DEP_MAP_INIT(lockname) #endif diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h index 8a418d9eeb7a..7e5b2a4eb560 100644 --- a/include/linux/rwsem.h +++ b/include/linux/rwsem.h @@ -65,7 +65,11 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem) /* Common initializer macros and functions */ #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname } +# define __RWSEM_DEP_MAP_INIT(lockname) \ + , .dep_map = { \ + .name = #lockname, \ + .wait_type_inner = LD_WAIT_SLEEP, \ + } #else # define __RWSEM_DEP_MAP_INIT(lockname) #endif diff --git a/include/linux/sched.h b/include/linux/sched.h index 04278493bf15..4d3b9ecce074 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -970,6 +970,7 @@ struct task_struct { #ifdef CONFIG_TRACE_IRQFLAGS unsigned int irq_events; + unsigned int hardirq_threaded; unsigned long hardirq_enable_ip; unsigned long hardirq_disable_ip; unsigned int hardirq_enable_event; diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h index 031ce8617df8..d3770b3f9d9a 100644 --- a/include/linux/spinlock.h +++ b/include/linux/spinlock.h @@ -93,12 +93,13 @@ #ifdef CONFIG_DEBUG_SPINLOCK extern void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, - struct lock_class_key *key); -# define raw_spin_lock_init(lock) \ -do { \ - static struct lock_class_key __key; \ - \ - __raw_spin_lock_init((lock), #lock, &__key); \ + struct lock_class_key *key, short inner); + +# define raw_spin_lock_init(lock) \ +do { \ + static struct lock_class_key __key; \ + \ + __raw_spin_lock_init((lock), #lock, &__key, LD_WAIT_SPIN); \ } while (0) #else @@ -327,12 +328,26 @@ static __always_inline raw_spinlock_t *spinlock_check(spinlock_t *lock) return &lock->rlock; } -#define spin_lock_init(_lock) \ -do { \ - spinlock_check(_lock); \ - raw_spin_lock_init(&(_lock)->rlock); \ +#ifdef CONFIG_DEBUG_SPINLOCK + +# define spin_lock_init(lock) \ +do { \ + static struct lock_class_key __key; \ + \ + __raw_spin_lock_init(spinlock_check(lock), \ + #lock, &__key, LD_WAIT_CONFIG); \ +} while (0) + +#else + +# define spin_lock_init(_lock) \ +do { \ + spinlock_check(_lock); \ + *(_lock) = __SPIN_LOCK_UNLOCKED(_lock); \ } while (0) +#endif + static __always_inline void spin_lock(spinlock_t *lock) { raw_spin_lock(&lock->rlock); diff --git a/include/linux/spinlock_types.h b/include/linux/spinlock_types.h index 24b4e6f2c1a2..6102e6bff3ae 100644 --- a/include/linux/spinlock_types.h +++ b/include/linux/spinlock_types.h @@ -33,8 +33,18 @@ typedef struct raw_spinlock { #define SPINLOCK_OWNER_INIT ((void *)-1L) #ifdef CONFIG_DEBUG_LOCK_ALLOC -# define SPIN_DEP_MAP_INIT(lockname) .dep_map = { .name = #lockname } +# define RAW_SPIN_DEP_MAP_INIT(lockname) \ + .dep_map = { \ + .name = #lockname, \ + .wait_type_inner = LD_WAIT_SPIN, \ + } +# define SPIN_DEP_MAP_INIT(lockname) \ + .dep_map = { \ + .name = #lockname, \ + .wait_type_inner = LD_WAIT_CONFIG, \ + } #else +# define RAW_SPIN_DEP_MAP_INIT(lockname) # define SPIN_DEP_MAP_INIT(lockname) #endif @@ -51,7 +61,7 @@ typedef struct raw_spinlock { { \ .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED, \ SPIN_DEBUG_INIT(lockname) \ - SPIN_DEP_MAP_INIT(lockname) } + RAW_SPIN_DEP_MAP_INIT(lockname) } #define __RAW_SPIN_LOCK_UNLOCKED(lockname) \ (raw_spinlock_t) __RAW_SPIN_LOCK_INITIALIZER(lockname) @@ -72,11 +82,17 @@ typedef struct spinlock { }; } spinlock_t; +#define ___SPIN_LOCK_INITIALIZER(lockname) \ + { \ + .raw_lock = __ARCH_SPIN_LOCK_UNLOCKED, \ + SPIN_DEBUG_INIT(lockname) \ + SPIN_DEP_MAP_INIT(lockname) } + #define __SPIN_LOCK_INITIALIZER(lockname) \ - { { .rlock = __RAW_SPIN_LOCK_INITIALIZER(lockname) } } + { { .rlock = ___SPIN_LOCK_INITIALIZER(lockname) } } #define __SPIN_LOCK_UNLOCKED(lockname) \ - (spinlock_t ) __SPIN_LOCK_INITIALIZER(lockname) + (spinlock_t) __SPIN_LOCK_INITIALIZER(lockname) #define DEFINE_SPINLOCK(x) spinlock_t x = __SPIN_LOCK_UNLOCKED(x) diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c index a4ace611f47f..16ee716e8291 100644 --- a/kernel/irq/handle.c +++ b/kernel/irq/handle.c @@ -145,6 +145,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags for_each_action_of_desc(desc, action) { irqreturn_t res; + /* + * If this IRQ would be threaded under force_irqthreads, mark it so. + */ + if (irq_settings_can_thread(desc) && + !(action->flags & (IRQF_NO_THREAD | IRQF_PERCPU | IRQF_ONESHOT))) + trace_hardirq_threaded(); + trace_irq_handler_entry(irq, action); res = action->handler(irq, action->dev_id); trace_irq_handler_exit(irq, action, res); diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 4c3b1ccc6c2d..6b9f9f321e6d 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -683,7 +683,9 @@ static void print_lock_name(struct lock_class *class) printk(KERN_CONT " ("); __print_lock_name(class); - printk(KERN_CONT "){%s}", usage); + printk(KERN_CONT "){%s}-{%hd:%hd}", usage, + class->wait_type_outer ?: class->wait_type_inner, + class->wait_type_inner); } static void print_lockdep_cache(struct lockdep_map *lock) @@ -1264,6 +1266,8 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) WARN_ON_ONCE(!list_empty(&class->locks_before)); WARN_ON_ONCE(!list_empty(&class->locks_after)); class->name_version = count_matching_names(class); + class->wait_type_inner = lock->wait_type_inner; + class->wait_type_outer = lock->wait_type_outer; /* * We use RCU's safe list-add method to make * parallel walking of the hash-list safe: @@ -3948,6 +3952,113 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, return ret; } +static int +print_lock_invalid_wait_context(struct task_struct *curr, + struct held_lock *hlock) +{ + if (!debug_locks_off()) + return 0; + if (debug_locks_silent) + return 0; + + pr_warn("\n"); + pr_warn("=============================\n"); + pr_warn("[ BUG: Invalid wait context ]\n"); + print_kernel_ident(); + pr_warn("-----------------------------\n"); + + pr_warn("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr)); + print_lock(hlock); + + pr_warn("other info that might help us debug this:\n"); + lockdep_print_held_locks(curr); + + pr_warn("stack backtrace:\n"); + dump_stack(); + + return 0; +} + +/* + * Verify the wait_type context. + * + * This check validates we takes locks in the right wait-type order; that is it + * ensures that we do not take mutexes inside spinlocks and do not attempt to + * acquire spinlocks inside raw_spinlocks and the sort. + * + * The entire thing is slightly more complex because of RCU, RCU is a lock that + * can be taken from (pretty much) any context but also has constraints. + * However when taken in a stricter environment the RCU lock does not loosen + * the constraints. + * + * Therefore we must look for the strictest environment in the lock stack and + * compare that to the lock we're trying to acquire. + */ +static int check_wait_context(struct task_struct *curr, struct held_lock *next) +{ + short next_inner = hlock_class(next)->wait_type_inner; + short next_outer = hlock_class(next)->wait_type_outer; + short curr_inner; + int depth; + + if (!curr->lockdep_depth || !next_inner || next->trylock) + return 0; + + if (!next_outer) + next_outer = next_inner; + + /* + * Find start of current irq_context.. + */ + for (depth = curr->lockdep_depth - 1; depth >= 0; depth--) { + struct held_lock *prev = curr->held_locks + depth; + if (prev->irq_context != next->irq_context) + break; + } + depth++; + + /* + * Set appropriate wait type for the context; for IRQs we have to take + * into account force_irqthread as that is implied by PREEMPT_RT. + */ + if (curr->hardirq_context) { + /* + * Check if force_irqthreads will run us threaded. + */ + if (curr->hardirq_threaded) + curr_inner = LD_WAIT_CONFIG; + else + curr_inner = LD_WAIT_SPIN; + } else if (curr->softirq_context) { + /* + * Softirqs are always threaded. + */ + curr_inner = LD_WAIT_CONFIG; + } else { + curr_inner = LD_WAIT_MAX; + } + + for (; depth < curr->lockdep_depth; depth++) { + struct held_lock *prev = curr->held_locks + depth; + short prev_inner = hlock_class(prev)->wait_type_inner; + + if (prev_inner) { + /* + * We can have a bigger inner than a previous one + * when outer is smaller than inner, as with RCU. + * + * Also due to trylocks. + */ + curr_inner = min(curr_inner, prev_inner); + } + } + + if (next_outer > curr_inner) + return print_lock_invalid_wait_context(curr, next); + + return 0; +} + #else /* CONFIG_PROVE_LOCKING */ static inline int @@ -3967,13 +4078,20 @@ static inline int separate_irq_context(struct task_struct *curr, return 0; } +static inline int check_wait_context(struct task_struct *curr, + struct held_lock *next) +{ + return 0; +} + #endif /* CONFIG_PROVE_LOCKING */ /* * Initialize a lock instance's lock-class mapping info: */ -void lockdep_init_map(struct lockdep_map *lock, const char *name, - struct lock_class_key *key, int subclass) +void lockdep_init_map_waits(struct lockdep_map *lock, const char *name, + struct lock_class_key *key, int subclass, + short inner, short outer) { int i; @@ -3994,6 +4112,9 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name, lock->name = name; + lock->wait_type_outer = outer; + lock->wait_type_inner = inner; + /* * No key, no joy, we need to hash something. */ @@ -4027,7 +4148,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name, raw_local_irq_restore(flags); } } -EXPORT_SYMBOL_GPL(lockdep_init_map); +EXPORT_SYMBOL_GPL(lockdep_init_map_waits); struct lock_class_key __lockdep_no_validate__; EXPORT_SYMBOL_GPL(__lockdep_no_validate__); @@ -4128,7 +4249,7 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, class_idx = class - lock_classes; - if (depth) { + if (depth) { /* we're holding locks */ hlock = curr->held_locks + depth - 1; if (hlock->class_idx == class_idx && nest_lock) { if (!references) @@ -4170,6 +4291,9 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass, #endif hlock->pin_count = pin_count; + if (check_wait_context(curr, hlock)) + return 0; + /* Initialize the lock usage bit */ if (!mark_usage(curr, hlock, check)) return 0; @@ -4405,7 +4529,9 @@ __lock_set_class(struct lockdep_map *lock, const char *name, return 0; } - lockdep_init_map(lock, name, key, 0); + lockdep_init_map_waits(lock, name, key, 0, + lock->wait_type_inner, + lock->wait_type_outer); class = register_lock_class(lock, subclass, 0); hlock->class_idx = class - lock_classes; diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c index 771d4ca96dda..a7276aaf2abc 100644 --- a/kernel/locking/mutex-debug.c +++ b/kernel/locking/mutex-debug.c @@ -85,7 +85,7 @@ void debug_mutex_init(struct mutex *lock, const char *name, * Make sure we are not reinitializing a held lock: */ debug_check_no_locks_freed((void *)lock, sizeof(*lock)); - lockdep_init_map(&lock->dep_map, name, key, 0); + lockdep_init_map_wait(&lock->dep_map, name, key, 0, LD_WAIT_SLEEP); #endif lock->magic = lock; } diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index e6f437bafb23..f11b9bd3431d 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -328,7 +328,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name, * Make sure we are not reinitializing a held semaphore: */ debug_check_no_locks_freed((void *)sem, sizeof(*sem)); - lockdep_init_map(&sem->dep_map, name, key, 0); + lockdep_init_map_wait(&sem->dep_map, name, key, 0, LD_WAIT_SLEEP); #endif #ifdef CONFIG_DEBUG_RWSEMS sem->magic = sem; diff --git a/kernel/locking/spinlock_debug.c b/kernel/locking/spinlock_debug.c index 472dd462a40c..b9d93087ee66 100644 --- a/kernel/locking/spinlock_debug.c +++ b/kernel/locking/spinlock_debug.c @@ -14,14 +14,14 @@ #include void __raw_spin_lock_init(raw_spinlock_t *lock, const char *name, - struct lock_class_key *key) + struct lock_class_key *key, short inner) { #ifdef CONFIG_DEBUG_LOCK_ALLOC /* * Make sure we are not reinitializing a held lock: */ debug_check_no_locks_freed((void *)lock, sizeof(*lock)); - lockdep_init_map(&lock->dep_map, name, key, 0); + lockdep_init_map_wait(&lock->dep_map, name, key, 0, inner); #endif lock->raw_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED; lock->magic = SPINLOCK_MAGIC; @@ -39,7 +39,7 @@ void __rwlock_init(rwlock_t *lock, const char *name, * Make sure we are not reinitializing a held lock: */ debug_check_no_locks_freed((void *)lock, sizeof(*lock)); - lockdep_init_map(&lock->dep_map, name, key, 0); + lockdep_init_map_wait(&lock->dep_map, name, key, 0, LD_WAIT_CONFIG); #endif lock->raw_lock = (arch_rwlock_t) __ARCH_RW_LOCK_UNLOCKED; lock->magic = RWLOCK_MAGIC; diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 6c4b862f57d6..8d3eb2fe20ae 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -227,18 +227,30 @@ core_initcall(rcu_set_runtime_mode); #ifdef CONFIG_DEBUG_LOCK_ALLOC static struct lock_class_key rcu_lock_key; -struct lockdep_map rcu_lock_map = - STATIC_LOCKDEP_MAP_INIT("rcu_read_lock", &rcu_lock_key); +struct lockdep_map rcu_lock_map = { + .name = "rcu_read_lock", + .key = &rcu_lock_key, + .wait_type_outer = LD_WAIT_FREE, + .wait_type_inner = LD_WAIT_CONFIG, /* XXX PREEMPT_RCU ? */ +}; EXPORT_SYMBOL_GPL(rcu_lock_map); static struct lock_class_key rcu_bh_lock_key; -struct lockdep_map rcu_bh_lock_map = - STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_bh", &rcu_bh_lock_key); +struct lockdep_map rcu_bh_lock_map = { + .name = "rcu_read_lock_bh", + .key = &rcu_bh_lock_key, + .wait_type_outer = LD_WAIT_FREE, + .wait_type_inner = LD_WAIT_CONFIG, /* PREEMPT_LOCK also makes BH preemptible */ +}; EXPORT_SYMBOL_GPL(rcu_bh_lock_map); static struct lock_class_key rcu_sched_lock_key; -struct lockdep_map rcu_sched_lock_map = - STATIC_LOCKDEP_MAP_INIT("rcu_read_lock_sched", &rcu_sched_lock_key); +struct lockdep_map rcu_sched_lock_map = { + .name = "rcu_read_lock_sched", + .key = &rcu_sched_lock_key, + .wait_type_outer = LD_WAIT_FREE, + .wait_type_inner = LD_WAIT_SPIN, +}; EXPORT_SYMBOL_GPL(rcu_sched_lock_map); static struct lock_class_key rcu_callback_key; diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 69def4a9df00..70813e39f911 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1086,6 +1086,23 @@ config PROVE_LOCKING For more details, see Documentation/locking/lockdep-design.rst. +config PROVE_RAW_LOCK_NESTING + bool "Enable raw_spinlock - spinlock nesting checks" + depends on PROVE_LOCKING + default n + help + Enable the raw_spinlock vs. spinlock nesting checks which ensure + that the lock nesting rules for PREEMPT_RT enabled kernels are + not violated. + + NOTE: There are known nesting problems. So if you enable this + option expect lockdep splats until these problems have been fully + addressed which is work in progress. This config switch allows to + identify and analyze these problems. It will be removed and the + check permanentely enabled once the main issues have been fixed. + + If unsure, select N. + config LOCK_STAT bool "Lock usage statistics" depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT -- cgit v1.2.3-71-gd317 From 40db173965c05a1d803451240ed41707d5bd978d Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Sat, 21 Mar 2020 12:26:02 +0100 Subject: lockdep: Add hrtimer context tracing bits Set current->irq_config = 1 for hrtimers which are not marked to expire in hard interrupt context during hrtimer_init(). These timers will expire in softirq context on PREEMPT_RT. Setting this allows lockdep to differentiate these timers. If a timer is marked to expire in hard interrupt context then the timer callback is not supposed to acquire a regular spinlock instead of a raw_spinlock in the expiry callback. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de --- include/linux/irqflags.h | 15 +++++++++++++++ include/linux/sched.h | 1 + kernel/locking/lockdep.c | 2 +- kernel/time/hrtimer.c | 6 +++++- 4 files changed, 22 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index fdaf28601cbe..9c17f9c827aa 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -56,6 +56,19 @@ do { \ do { \ current->softirq_context--; \ } while (0) + +# define lockdep_hrtimer_enter(__hrtimer) \ + do { \ + if (!__hrtimer->is_hard) \ + current->irq_config = 1; \ + } while (0) + +# define lockdep_hrtimer_exit(__hrtimer) \ + do { \ + if (!__hrtimer->is_hard) \ + current->irq_config = 0; \ + } while (0) + #else # define trace_hardirqs_on() do { } while (0) # define trace_hardirqs_off() do { } while (0) @@ -68,6 +81,8 @@ do { \ # define trace_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) # define lockdep_softirq_exit() do { } while (0) +# define lockdep_hrtimer_enter(__hrtimer) do { } while (0) +# define lockdep_hrtimer_exit(__hrtimer) do { } while (0) #endif #if defined(CONFIG_IRQSOFF_TRACER) || \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 4d3b9ecce074..933914cdf219 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -983,6 +983,7 @@ struct task_struct { unsigned int softirq_enable_event; int softirqs_enabled; int softirq_context; + int irq_config; #endif #ifdef CONFIG_LOCKDEP diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 6b9f9f321e6d..0ebf9807d971 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4025,7 +4025,7 @@ static int check_wait_context(struct task_struct *curr, struct held_lock *next) /* * Check if force_irqthreads will run us threaded. */ - if (curr->hardirq_threaded) + if (curr->hardirq_threaded || curr->irq_config) curr_inner = LD_WAIT_CONFIG; else curr_inner = LD_WAIT_SPIN; diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 3a609e7344f3..8cce72501aea 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1404,7 +1404,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id, base = softtimer ? HRTIMER_MAX_CLOCK_BASES / 2 : 0; base += hrtimer_clockid_to_base(clock_id); timer->is_soft = softtimer; - timer->is_hard = !softtimer; + timer->is_hard = !!(mode & HRTIMER_MODE_HARD); timer->base = &cpu_base->clock_base[base]; timerqueue_init(&timer->node); } @@ -1514,7 +1514,11 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, */ raw_spin_unlock_irqrestore(&cpu_base->lock, flags); trace_hrtimer_expire_entry(timer, now); + lockdep_hrtimer_enter(timer); + restart = fn(timer); + + lockdep_hrtimer_exit(timer); trace_hrtimer_expire_exit(timer); raw_spin_lock_irq(&cpu_base->lock); -- cgit v1.2.3-71-gd317 From 49915ac35ca7b07c54295a72d905be5064afb89e Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Sat, 21 Mar 2020 12:26:03 +0100 Subject: lockdep: Annotate irq_work Mark irq_work items with IRQ_WORK_HARD_IRQ which should be invoked in hardirq context even on PREEMPT_RT. IRQ_WORK without this flag will be invoked in softirq context on PREEMPT_RT. Set ->irq_config to 1 for the IRQ_WORK items which are invoked in softirq context so lockdep knows that these can safely acquire a spinlock_t. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113242.643576700@linutronix.de --- include/linux/irq_work.h | 2 ++ include/linux/irqflags.h | 13 +++++++++++++ kernel/irq_work.c | 2 ++ kernel/rcu/tree.c | 1 + kernel/time/tick-sched.c | 1 + 5 files changed, 19 insertions(+) (limited to 'kernel') diff --git a/include/linux/irq_work.h b/include/linux/irq_work.h index 02da997ad12c..3b752e80c017 100644 --- a/include/linux/irq_work.h +++ b/include/linux/irq_work.h @@ -18,6 +18,8 @@ /* Doesn't want IPI, wait for tick: */ #define IRQ_WORK_LAZY BIT(2) +/* Run hard IRQ context, even on RT */ +#define IRQ_WORK_HARD_IRQ BIT(3) #define IRQ_WORK_CLAIMED (IRQ_WORK_PENDING | IRQ_WORK_BUSY) diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 9c17f9c827aa..f23f540e0ebb 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -69,6 +69,17 @@ do { \ current->irq_config = 0; \ } while (0) +# define lockdep_irq_work_enter(__work) \ + do { \ + if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\ + current->irq_config = 1; \ + } while (0) +# define lockdep_irq_work_exit(__work) \ + do { \ + if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\ + current->irq_config = 0; \ + } while (0) + #else # define trace_hardirqs_on() do { } while (0) # define trace_hardirqs_off() do { } while (0) @@ -83,6 +94,8 @@ do { \ # define lockdep_softirq_exit() do { } while (0) # define lockdep_hrtimer_enter(__hrtimer) do { } while (0) # define lockdep_hrtimer_exit(__hrtimer) do { } while (0) +# define lockdep_irq_work_enter(__work) do { } while (0) +# define lockdep_irq_work_exit(__work) do { } while (0) #endif #if defined(CONFIG_IRQSOFF_TRACER) || \ diff --git a/kernel/irq_work.c b/kernel/irq_work.c index 828cc30774bc..48b5d1b6af4d 100644 --- a/kernel/irq_work.c +++ b/kernel/irq_work.c @@ -153,7 +153,9 @@ static void irq_work_run_list(struct llist_head *list) */ flags = atomic_fetch_andnot(IRQ_WORK_PENDING, &work->flags); + lockdep_irq_work_enter(work); work->func(work); + lockdep_irq_work_exit(work); /* * Clear the BUSY bit and return to the free state if * no-one else claimed it meanwhile. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d91c9156fab2..5066d1dd3077 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1113,6 +1113,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) !rdp->rcu_iw_pending && rdp->rcu_iw_gp_seq != rnp->gp_seq && (rnp->ffmask & rdp->grpmask)) { init_irq_work(&rdp->rcu_iw, rcu_iw_handler); + atomic_set(&rdp->rcu_iw.flags, IRQ_WORK_HARD_IRQ); rdp->rcu_iw_pending = true; rdp->rcu_iw_gp_seq = rnp->gp_seq; irq_work_queue_on(&rdp->rcu_iw, rdp->cpu); diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 4be756b88a48..3e2dc9b8858c 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -245,6 +245,7 @@ static void nohz_full_kick_func(struct irq_work *work) static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = { .func = nohz_full_kick_func, + .flags = ATOMIC_INIT(IRQ_WORK_HARD_IRQ), }; /* -- cgit v1.2.3-71-gd317 From d53f2b62fcb63f6547c10d8c62bca19e957b0eef Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Sat, 21 Mar 2020 12:26:04 +0100 Subject: lockdep: Add posixtimer context tracing bits Splitting run_posix_cpu_timers() into two parts is work in progress which is stuck on other entry code related problems. The heavy lifting which involves locking of sighand lock will be moved into task context so the necessary execution time is burdened on the task and not on interrupt context. Until this work completes lockdep with the spinlock nesting rules enabled would emit warnings for this known context. Prevent it by setting "->irq_config = 1" for the invocation of run_posix_cpu_timers() so lockdep does not complain when sighand lock is acquried. This will be removed once the split is completed. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200321113242.751182723@linutronix.de --- include/linux/irqflags.h | 12 ++++++++++++ kernel/time/posix-cpu-timers.c | 6 +++++- 2 files changed, 17 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index f23f540e0ebb..a16adbb58f66 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -69,6 +69,16 @@ do { \ current->irq_config = 0; \ } while (0) +# define lockdep_posixtimer_enter() \ + do { \ + current->irq_config = 1; \ + } while (0) + +# define lockdep_posixtimer_exit() \ + do { \ + current->irq_config = 0; \ + } while (0) + # define lockdep_irq_work_enter(__work) \ do { \ if (!(atomic_read(&__work->flags) & IRQ_WORK_HARD_IRQ))\ @@ -94,6 +104,8 @@ do { \ # define lockdep_softirq_exit() do { } while (0) # define lockdep_hrtimer_enter(__hrtimer) do { } while (0) # define lockdep_hrtimer_exit(__hrtimer) do { } while (0) +# define lockdep_posixtimer_enter() do { } while (0) +# define lockdep_posixtimer_exit() do { } while (0) # define lockdep_irq_work_enter(__work) do { } while (0) # define lockdep_irq_work_exit(__work) do { } while (0) #endif diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index 8ff6da77a01f..2c48a7233b19 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -1126,8 +1126,11 @@ void run_posix_cpu_timers(void) if (!fastpath_timer_check(tsk)) return; - if (!lock_task_sighand(tsk, &flags)) + lockdep_posixtimer_enter(); + if (!lock_task_sighand(tsk, &flags)) { + lockdep_posixtimer_exit(); return; + } /* * Here we take off tsk->signal->cpu_timers[N] and * tsk->cpu_timers[N] all the timers that are firing, and @@ -1169,6 +1172,7 @@ void run_posix_cpu_timers(void) cpu_timer_fire(timer); spin_unlock(&timer->it_lock); } + lockdep_posixtimer_exit(); } /* -- cgit v1.2.3-71-gd317 From 2502ec37a7b228b34c1e2e89480f98b92f53046a Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 20 Mar 2020 12:56:40 +0100 Subject: lockdep: Rename trace_hardirq_{enter,exit}() Continue what commit: d820ac4c2fa8 ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]") started, rename these to avoid confusing them with tracepoints. Signed-off-by: Thomas Gleixner Reviewed-by: Thomas Gleixner Acked-by: Will Deacon Link: https://lkml.kernel.org/r/20200320115859.060481361@infradead.org --- include/linux/hardirq.h | 8 ++++---- include/linux/irqflags.h | 8 ++++---- kernel/softirq.c | 7 ++++--- tools/include/linux/irqflags.h | 4 ++-- 4 files changed, 14 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index da0af631ded5..7c8b82f69288 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -37,7 +37,7 @@ extern void rcu_nmi_exit(void); do { \ account_irq_enter_time(current); \ preempt_count_add(HARDIRQ_OFFSET); \ - trace_hardirq_enter(); \ + lockdep_hardirq_enter(); \ } while (0) /* @@ -50,7 +50,7 @@ extern void irq_enter(void); */ #define __irq_exit() \ do { \ - trace_hardirq_exit(); \ + lockdep_hardirq_exit(); \ account_irq_exit_time(current); \ preempt_count_sub(HARDIRQ_OFFSET); \ } while (0) @@ -74,12 +74,12 @@ extern void irq_exit(void); BUG_ON(in_nmi()); \ preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \ rcu_nmi_enter(); \ - trace_hardirq_enter(); \ + lockdep_hardirq_enter(); \ } while (0) #define nmi_exit() \ do { \ - trace_hardirq_exit(); \ + lockdep_hardirq_exit(); \ rcu_nmi_exit(); \ BUG_ON(!in_nmi()); \ preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET); \ diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 21619c92c377..7c4e64589250 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -35,11 +35,11 @@ # define trace_softirq_context(p) ((p)->softirq_context) # define trace_hardirqs_enabled(p) ((p)->hardirqs_enabled) # define trace_softirqs_enabled(p) ((p)->softirqs_enabled) -# define trace_hardirq_enter() \ +# define lockdep_hardirq_enter() \ do { \ current->hardirq_context++; \ } while (0) -# define trace_hardirq_exit() \ +# define lockdep_hardirq_exit() \ do { \ current->hardirq_context--; \ } while (0) @@ -58,8 +58,8 @@ do { \ # define trace_softirq_context(p) 0 # define trace_hardirqs_enabled(p) 0 # define trace_softirqs_enabled(p) 0 -# define trace_hardirq_enter() do { } while (0) -# define trace_hardirq_exit() do { } while (0) +# define lockdep_hardirq_enter() do { } while (0) +# define lockdep_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) # define lockdep_softirq_exit() do { } while (0) #endif diff --git a/kernel/softirq.c b/kernel/softirq.c index 0427a86743a4..b3286892abac 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -226,7 +226,7 @@ static inline bool lockdep_softirq_start(void) if (trace_hardirq_context(current)) { in_hardirq = true; - trace_hardirq_exit(); + lockdep_hardirq_exit(); } lockdep_softirq_enter(); @@ -239,7 +239,7 @@ static inline void lockdep_softirq_end(bool in_hardirq) lockdep_softirq_exit(); if (in_hardirq) - trace_hardirq_enter(); + lockdep_hardirq_enter(); } #else static inline bool lockdep_softirq_start(void) { return false; } @@ -414,7 +414,8 @@ void irq_exit(void) tick_irq_exit(); rcu_irq_exit(); - trace_hardirq_exit(); /* must be last! */ + /* must be last! */ + lockdep_hardirq_exit(); } /* diff --git a/tools/include/linux/irqflags.h b/tools/include/linux/irqflags.h index e734da3e5b33..ced6f64a7367 100644 --- a/tools/include/linux/irqflags.h +++ b/tools/include/linux/irqflags.h @@ -6,8 +6,8 @@ # define trace_softirq_context(p) 0 # define trace_hardirqs_enabled(p) 0 # define trace_softirqs_enabled(p) 0 -# define trace_hardirq_enter() do { } while (0) -# define trace_hardirq_exit() do { } while (0) +# define lockdep_hardirq_enter() do { } while (0) +# define lockdep_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) # define lockdep_softirq_exit() do { } while (0) # define INIT_TRACE_IRQFLAGS -- cgit v1.2.3-71-gd317 From 0d38453c85b426e47375346812d2271680c47988 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 20 Mar 2020 12:56:41 +0100 Subject: lockdep: Rename trace_softirqs_{on,off}() Continue what commit: d820ac4c2fa8 ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]") started, rename these to avoid confusing them with tracepoints. git grep -l "trace_softirqs_\(on\|off\)" | while read file; do sed -ie 's/trace_softirqs_\(on\|off\)/lockdep_softirqs_\1/g' $file; done Reported-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Thomas Gleixner Acked-by: Will Deacon Link: https://lkml.kernel.org/r/20200320115859.119434738@infradead.org --- include/linux/irqflags.h | 10 +++++----- kernel/locking/lockdep.c | 4 ++-- kernel/softirq.c | 6 +++--- 3 files changed, 10 insertions(+), 10 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 7c4e64589250..7ca1f2126ae1 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -15,15 +15,15 @@ #include #include -/* Currently trace_softirqs_on/off is used only by lockdep */ +/* Currently lockdep_softirqs_on/off is used only by lockdep */ #ifdef CONFIG_PROVE_LOCKING - extern void trace_softirqs_on(unsigned long ip); - extern void trace_softirqs_off(unsigned long ip); + extern void lockdep_softirqs_on(unsigned long ip); + extern void lockdep_softirqs_off(unsigned long ip); extern void lockdep_hardirqs_on(unsigned long ip); extern void lockdep_hardirqs_off(unsigned long ip); #else - static inline void trace_softirqs_on(unsigned long ip) { } - static inline void trace_softirqs_off(unsigned long ip) { } + static inline void lockdep_softirqs_on(unsigned long ip) { } + static inline void lockdep_softirqs_off(unsigned long ip) { } static inline void lockdep_hardirqs_on(unsigned long ip) { } static inline void lockdep_hardirqs_off(unsigned long ip) { } #endif diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 32406ef0d6a2..26ef41296fb3 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3468,7 +3468,7 @@ NOKPROBE_SYMBOL(lockdep_hardirqs_off); /* * Softirqs will be enabled: */ -void trace_softirqs_on(unsigned long ip) +void lockdep_softirqs_on(unsigned long ip) { struct task_struct *curr = current; @@ -3508,7 +3508,7 @@ void trace_softirqs_on(unsigned long ip) /* * Softirqs were disabled: */ -void trace_softirqs_off(unsigned long ip) +void lockdep_softirqs_off(unsigned long ip) { struct task_struct *curr = current; diff --git a/kernel/softirq.c b/kernel/softirq.c index b3286892abac..0112ca01694b 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -126,7 +126,7 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt) * Were softirqs turned off above: */ if (softirq_count() == (cnt & SOFTIRQ_MASK)) - trace_softirqs_off(ip); + lockdep_softirqs_off(ip); raw_local_irq_restore(flags); if (preempt_count() == cnt) { @@ -147,7 +147,7 @@ static void __local_bh_enable(unsigned int cnt) trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip()); if (softirq_count() == (cnt & SOFTIRQ_MASK)) - trace_softirqs_on(_RET_IP_); + lockdep_softirqs_on(_RET_IP_); __preempt_count_sub(cnt); } @@ -174,7 +174,7 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt) * Are softirqs going to be turned on now: */ if (softirq_count() == SOFTIRQ_DISABLE_OFFSET) - trace_softirqs_on(ip); + lockdep_softirqs_on(ip); /* * Keep preemption disabled until we are done with * softirq processing: -- cgit v1.2.3-71-gd317 From ef996916e78e03d25e56c2d372e5e21fdb471882 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 20 Mar 2020 12:56:42 +0100 Subject: lockdep: Rename trace_{hard,soft}{irq_context,irqs_enabled}() Continue what commit: d820ac4c2fa8 ("locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]") started, rename these to avoid confusing them with tracepoints. git grep -l "trace_\(soft\|hard\)\(irq_context\|irqs_enabled\)" | while read file; do sed -ie 's/trace_\(soft\|hard\)\(irq_context\|irqs_enabled\)/lockdep_\1\2/g' $file; done Reported-by: Thomas Gleixner Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Thomas Gleixner Acked-by: Will Deacon Link: https://lkml.kernel.org/r/20200320115859.178626842@infradead.org --- include/linux/irqflags.h | 16 ++++++++-------- kernel/locking/lockdep.c | 8 ++++---- kernel/softirq.c | 2 +- tools/include/linux/irqflags.h | 8 ++++---- 4 files changed, 17 insertions(+), 17 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 7ca1f2126ae1..f4c3907e241a 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -31,10 +31,10 @@ #ifdef CONFIG_TRACE_IRQFLAGS extern void trace_hardirqs_on(void); extern void trace_hardirqs_off(void); -# define trace_hardirq_context(p) ((p)->hardirq_context) -# define trace_softirq_context(p) ((p)->softirq_context) -# define trace_hardirqs_enabled(p) ((p)->hardirqs_enabled) -# define trace_softirqs_enabled(p) ((p)->softirqs_enabled) +# define lockdep_hardirq_context(p) ((p)->hardirq_context) +# define lockdep_softirq_context(p) ((p)->softirq_context) +# define lockdep_hardirqs_enabled(p) ((p)->hardirqs_enabled) +# define lockdep_softirqs_enabled(p) ((p)->softirqs_enabled) # define lockdep_hardirq_enter() \ do { \ current->hardirq_context++; \ @@ -54,10 +54,10 @@ do { \ #else # define trace_hardirqs_on() do { } while (0) # define trace_hardirqs_off() do { } while (0) -# define trace_hardirq_context(p) 0 -# define trace_softirq_context(p) 0 -# define trace_hardirqs_enabled(p) 0 -# define trace_softirqs_enabled(p) 0 +# define lockdep_hardirq_context(p) 0 +# define lockdep_softirq_context(p) 0 +# define lockdep_hardirqs_enabled(p) 0 +# define lockdep_softirqs_enabled(p) 0 # define lockdep_hardirq_enter() do { } while (0) # define lockdep_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 26ef41296fb3..4075e3ec460b 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3081,10 +3081,10 @@ print_usage_bug(struct task_struct *curr, struct held_lock *this, pr_warn("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n", curr->comm, task_pid_nr(curr), - trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT, - trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT, - trace_hardirqs_enabled(curr), - trace_softirqs_enabled(curr)); + lockdep_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT, + lockdep_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT, + lockdep_hardirqs_enabled(curr), + lockdep_softirqs_enabled(curr)); print_lock(this); pr_warn("{%s} state was registered at:\n", usage_str[prev_bit]); diff --git a/kernel/softirq.c b/kernel/softirq.c index 0112ca01694b..a47c6dd57452 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -224,7 +224,7 @@ static inline bool lockdep_softirq_start(void) { bool in_hardirq = false; - if (trace_hardirq_context(current)) { + if (lockdep_hardirq_context(current)) { in_hardirq = true; lockdep_hardirq_exit(); } diff --git a/tools/include/linux/irqflags.h b/tools/include/linux/irqflags.h index ced6f64a7367..67e01bbadbfe 100644 --- a/tools/include/linux/irqflags.h +++ b/tools/include/linux/irqflags.h @@ -2,10 +2,10 @@ #ifndef _LIBLOCKDEP_LINUX_TRACE_IRQFLAGS_H_ #define _LIBLOCKDEP_LINUX_TRACE_IRQFLAGS_H_ -# define trace_hardirq_context(p) 0 -# define trace_softirq_context(p) 0 -# define trace_hardirqs_enabled(p) 0 -# define trace_softirqs_enabled(p) 0 +# define lockdep_hardirq_context(p) 0 +# define lockdep_softirq_context(p) 0 +# define lockdep_hardirqs_enabled(p) 0 +# define lockdep_softirqs_enabled(p) 0 # define lockdep_hardirq_enter() do { } while (0) # define lockdep_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) -- cgit v1.2.3-71-gd317 From 0f11ad323dd3d316830152de63b5737a6261ad14 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Mon, 10 Feb 2020 09:58:37 -0800 Subject: rcu: Mark rcu_state.gp_seq to detect concurrent writes The rcu_state structure's gp_seq field is only to be modified by the RCU grace-period kthread, which is single-threaded. This commit therefore enlists KCSAN's help in enforcing this restriction. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 6c624814ee2d..739788ff0f6e 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1209,7 +1209,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp_start, struct rcu_data *rdp, trace_rcu_this_gp(rnp, rdp, gp_seq_req, TPS("NoGPkthread")); goto unlock_out; } - trace_rcu_grace_period(rcu_state.name, READ_ONCE(rcu_state.gp_seq), TPS("newreq")); + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("newreq")); ret = true; /* Caller must wake GP kthread. */ unlock_out: /* Push furthest requested GP to leaf node and rcu_data structure. */ @@ -1657,8 +1657,7 @@ static void rcu_gp_fqs_loop(void) WRITE_ONCE(rcu_state.jiffies_kick_kthreads, jiffies + (j ? 3 * j : 2)); } - trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqswait")); rcu_state.gp_state = RCU_GP_WAIT_FQS; ret = swait_event_idle_timeout_exclusive( @@ -1672,13 +1671,11 @@ static void rcu_gp_fqs_loop(void) /* If time for quiescent-state forcing, do it. */ if (ULONG_CMP_GE(jiffies, rcu_state.jiffies_force_qs) || (gf & RCU_GP_FLAG_FQS)) { - trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsstart")); rcu_gp_fqs(first_gp_fqs); first_gp_fqs = false; - trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqsend")); cond_resched_tasks_rcu_qs(); WRITE_ONCE(rcu_state.gp_activity, jiffies); @@ -1689,8 +1686,7 @@ static void rcu_gp_fqs_loop(void) cond_resched_tasks_rcu_qs(); WRITE_ONCE(rcu_state.gp_activity, jiffies); WARN_ON(signal_pending(current)); - trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("fqswaitsig")); ret = 1; /* Keep old FQS timing. */ j = jiffies; @@ -1782,7 +1778,7 @@ static void rcu_gp_cleanup(void) WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); WRITE_ONCE(rcu_state.gp_req_activity, jiffies); trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + rcu_state.gp_seq, TPS("newreq")); } else { WRITE_ONCE(rcu_state.gp_flags, @@ -1801,8 +1797,7 @@ static int __noreturn rcu_gp_kthread(void *unused) /* Handle grace-period start. */ for (;;) { - trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("reqwait")); rcu_state.gp_state = RCU_GP_WAIT_GPS; swait_event_idle_exclusive(rcu_state.gp_wq, @@ -1815,8 +1810,7 @@ static int __noreturn rcu_gp_kthread(void *unused) cond_resched_tasks_rcu_qs(); WRITE_ONCE(rcu_state.gp_activity, jiffies); WARN_ON(signal_pending(current)); - trace_rcu_grace_period(rcu_state.name, - READ_ONCE(rcu_state.gp_seq), + trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("reqwaitsig")); } -- cgit v1.2.3-71-gd317 From 127e29815b4b2206c0a97ac1d83f92ffc0e25c34 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Tue, 11 Feb 2020 06:17:33 -0800 Subject: rcu: Make rcu_barrier() account for offline no-CBs CPUs Currently, rcu_barrier() ignores offline CPUs, However, it is possible for an offline no-CBs CPU to have callbacks queued, and rcu_barrier() must wait for those callbacks. This commit therefore makes rcu_barrier() directly invoke the rcu_barrier_func() with interrupts disabled for such CPUs. This requires passing the CPU number into this function so that it can entrain the rcu_barrier() callback onto the correct CPU's callback list, given that the code must instead execute on the current CPU. While in the area, this commit fixes a bug where the first CPU's callback might have been invoked before rcu_segcblist_entrain() returned, which would also result in an early wakeup. Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs") Signed-off-by: Paul E. McKenney [ paulmck: Apply optimization feedback from Boqun Feng. ] Cc: # 5.5.x --- include/trace/events/rcu.h | 1 + kernel/rcu/tree.c | 36 ++++++++++++++++++++++++------------ 2 files changed, 25 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 5e49b06e8104..d56d54c17497 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -712,6 +712,7 @@ TRACE_EVENT_RCU(rcu_torture_read, * "Begin": rcu_barrier() started. * "EarlyExit": rcu_barrier() piggybacked, thus early exit. * "Inc1": rcu_barrier() piggyback check counter incremented. + * "OfflineNoCBQ": rcu_barrier() found offline no-CBs CPU with callbacks. * "OnlineQ": rcu_barrier() found online CPU with callbacks. * "OnlineNQ": rcu_barrier() found online CPU, no callbacks. * "IRQ": An rcu_barrier_callback() callback posted on remote CPU. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 739788ff0f6e..35292c8ffd4a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3097,9 +3097,10 @@ static void rcu_barrier_callback(struct rcu_head *rhp) /* * Called with preemption disabled, and from cross-cpu IRQ context. */ -static void rcu_barrier_func(void *unused) +static void rcu_barrier_func(void *cpu_in) { - struct rcu_data *rdp = raw_cpu_ptr(&rcu_data); + uintptr_t cpu = (uintptr_t)cpu_in; + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); rcu_barrier_trace(TPS("IRQ"), -1, rcu_state.barrier_sequence); rdp->barrier_head.func = rcu_barrier_callback; @@ -3126,7 +3127,7 @@ static void rcu_barrier_func(void *unused) */ void rcu_barrier(void) { - int cpu; + uintptr_t cpu; struct rcu_data *rdp; unsigned long s = rcu_seq_snap(&rcu_state.barrier_sequence); @@ -3149,13 +3150,14 @@ void rcu_barrier(void) rcu_barrier_trace(TPS("Inc1"), -1, rcu_state.barrier_sequence); /* - * Initialize the count to one rather than to zero in order to - * avoid a too-soon return to zero in case of a short grace period - * (or preemption of this task). Exclude CPU-hotplug operations - * to ensure that no offline CPU has callbacks queued. + * Initialize the count to two rather than to zero in order + * to avoid a too-soon return to zero in case of an immediate + * invocation of the just-enqueued callback (or preemption of + * this task). Exclude CPU-hotplug operations to ensure that no + * offline non-offloaded CPU has callbacks queued. */ init_completion(&rcu_state.barrier_completion); - atomic_set(&rcu_state.barrier_cpu_count, 1); + atomic_set(&rcu_state.barrier_cpu_count, 2); get_online_cpus(); /* @@ -3165,13 +3167,23 @@ void rcu_barrier(void) */ for_each_possible_cpu(cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); - if (!cpu_online(cpu) && + if (cpu_is_offline(cpu) && !rcu_segcblist_is_offloaded(&rdp->cblist)) continue; - if (rcu_segcblist_n_cbs(&rdp->cblist)) { + if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) { rcu_barrier_trace(TPS("OnlineQ"), cpu, rcu_state.barrier_sequence); - smp_call_function_single(cpu, rcu_barrier_func, NULL, 1); + smp_call_function_single(cpu, rcu_barrier_func, (void *)cpu, 1); + } else if (rcu_segcblist_n_cbs(&rdp->cblist) && + cpu_is_offline(cpu)) { + rcu_barrier_trace(TPS("OfflineNoCBQ"), cpu, + rcu_state.barrier_sequence); + local_irq_disable(); + rcu_barrier_func((void *)cpu); + local_irq_enable(); + } else if (cpu_is_offline(cpu)) { + rcu_barrier_trace(TPS("OfflineNoCBNoQ"), cpu, + rcu_state.barrier_sequence); } else { rcu_barrier_trace(TPS("OnlineNQ"), cpu, rcu_state.barrier_sequence); @@ -3183,7 +3195,7 @@ void rcu_barrier(void) * Now that we have an rcu_barrier_callback() callback on each * CPU, and thus each counted, remove the initial count. */ - if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) + if (atomic_sub_and_test(2, &rcu_state.barrier_cpu_count)) complete(&rcu_state.barrier_completion); /* Wait for all rcu_barrier_callback() callbacks to be invoked. */ -- cgit v1.2.3-71-gd317 From 086b2d78375cffe58f5341359bebec0650793811 Mon Sep 17 00:00:00 2001 From: Heiko Carstens Date: Wed, 18 Mar 2020 20:55:20 +0100 Subject: PM: remove s390 specific callbacks ARCH_SAVE_PAGE_KEYS has been introduced in order to be able to save and restore s390 specific storage keys into a hibernation image. With hibernation support removed from s390 there is no point in keeping the callbacks. Acked-by: Christian Borntraeger Acked-by: Peter Oberparleiter Signed-off-by: Heiko Carstens Signed-off-by: Vasily Gorbik --- include/linux/suspend.h | 34 ---------------------------------- kernel/power/Kconfig | 3 --- kernel/power/snapshot.c | 18 ------------------ 3 files changed, 55 deletions(-) (limited to 'kernel') diff --git a/include/linux/suspend.h b/include/linux/suspend.h index 2b2055b035ee..4fcc6fd0cbd6 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -566,38 +566,4 @@ static inline void queue_up_suspend_work(void) {} #endif /* !CONFIG_PM_AUTOSLEEP */ -#ifdef CONFIG_ARCH_SAVE_PAGE_KEYS -/* - * The ARCH_SAVE_PAGE_KEYS functions can be used by an architecture - * to save/restore additional information to/from the array of page - * frame numbers in the hibernation image. For s390 this is used to - * save and restore the storage key for each page that is included - * in the hibernation image. - */ -unsigned long page_key_additional_pages(unsigned long pages); -int page_key_alloc(unsigned long pages); -void page_key_free(void); -void page_key_read(unsigned long *pfn); -void page_key_memorize(unsigned long *pfn); -void page_key_write(void *address); - -#else /* !CONFIG_ARCH_SAVE_PAGE_KEYS */ - -static inline unsigned long page_key_additional_pages(unsigned long pages) -{ - return 0; -} - -static inline int page_key_alloc(unsigned long pages) -{ - return 0; -} - -static inline void page_key_free(void) {} -static inline void page_key_read(unsigned long *pfn) {} -static inline void page_key_memorize(unsigned long *pfn) {} -static inline void page_key_write(void *address) {} - -#endif /* !CONFIG_ARCH_SAVE_PAGE_KEYS */ - #endif /* _LINUX_SUSPEND_H */ diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 7cbfbeacd68a..c208566c844b 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -80,9 +80,6 @@ config HIBERNATION For more information take a look at . -config ARCH_SAVE_PAGE_KEYS - bool - config PM_STD_PARTITION string "Default resume partition" depends on HIBERNATION diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c index ddade80ad276..e99d13b0b8fc 100644 --- a/kernel/power/snapshot.c +++ b/kernel/power/snapshot.c @@ -1744,9 +1744,6 @@ int hibernate_preallocate_memory(void) count += highmem; count -= totalreserve_pages; - /* Add number of pages required for page keys (s390 only). */ - size += page_key_additional_pages(saveable); - /* Compute the maximum number of saveable pages to leave in memory. */ max_size = (count - (size + PAGES_FOR_IO)) / 2 - 2 * DIV_ROUND_UP(reserved_size, PAGE_SIZE); @@ -2075,8 +2072,6 @@ static inline void pack_pfns(unsigned long *buf, struct memory_bitmap *bm) buf[j] = memory_bm_next_pfn(bm); if (unlikely(buf[j] == BM_END_OF_MAP)) break; - /* Save page key for data page (s390 only). */ - page_key_read(buf + j); } } @@ -2226,9 +2221,6 @@ static int unpack_orig_pfns(unsigned long *buf, struct memory_bitmap *bm) if (unlikely(buf[j] == BM_END_OF_MAP)) break; - /* Extract and buffer page key for data page (s390 only). */ - page_key_memorize(buf + j); - if (pfn_valid(buf[j]) && memory_bm_pfn_present(bm, buf[j])) memory_bm_set_bit(bm, buf[j]); else @@ -2623,11 +2615,6 @@ int snapshot_write_next(struct snapshot_handle *handle) if (error) return error; - /* Allocate buffer for page keys. */ - error = page_key_alloc(nr_copy_pages); - if (error) - return error; - hibernate_restore_protection_begin(); } else if (handle->cur <= nr_meta_pages + 1) { error = unpack_orig_pfns(buffer, ©_bm); @@ -2649,8 +2636,6 @@ int snapshot_write_next(struct snapshot_handle *handle) } } else { copy_last_highmem_page(); - /* Restore page key for data page (s390 only). */ - page_key_write(handle->buffer); hibernate_restore_protect_page(handle->buffer); handle->buffer = get_buffer(&orig_bm, &ca); if (IS_ERR(handle->buffer)) @@ -2673,9 +2658,6 @@ int snapshot_write_next(struct snapshot_handle *handle) void snapshot_write_finalize(struct snapshot_handle *handle) { copy_last_highmem_page(); - /* Restore page key for data page (s390 only). */ - page_key_write(handle->buffer); - page_key_free(); hibernate_restore_protect_page(handle->buffer); /* Do that only if we have loaded the image entirely */ if (handle->cur > 1 && handle->cur > nr_meta_pages + nr_copy_pages) { -- cgit v1.2.3-71-gd317 From aa93bdc5500cc93ba31afeda1a61610d117947ad Mon Sep 17 00:00:00 2001 From: Amir Goldstein Date: Thu, 19 Mar 2020 17:10:12 +0200 Subject: fsnotify: use helpers to access data by data_type Create helpers to access path and inode from different data types. Link: https://lore.kernel.org/r/20200319151022.31456-5-amir73il@gmail.com Signed-off-by: Amir Goldstein Signed-off-by: Jan Kara --- fs/notify/fanotify/fanotify.c | 18 ++++++++---------- fs/notify/fsnotify.c | 5 +++-- fs/notify/inotify/inotify_fsnotify.c | 8 +++----- include/linux/fsnotify_backend.h | 34 ++++++++++++++++++++++++++++++---- kernel/audit_fsnotify.c | 13 ++----------- kernel/audit_watch.c | 16 ++-------------- 6 files changed, 48 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index 5778d1347b35..19ec7a4f4d50 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -151,7 +151,7 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group, { __u32 marks_mask = 0, marks_ignored_mask = 0; __u32 test_mask, user_mask = FANOTIFY_OUTGOING_EVENTS; - const struct path *path = data; + const struct path *path = fsnotify_data_path(data, data_type); struct fsnotify_mark *mark; int type; @@ -160,7 +160,7 @@ static u32 fanotify_group_event_mask(struct fsnotify_group *group, if (!FAN_GROUP_FLAG(group, FAN_REPORT_FID)) { /* Do we have path to open a file descriptor? */ - if (data_type != FSNOTIFY_EVENT_PATH) + if (!path) return 0; /* Path type events are only relevant for files and dirs */ if (!d_is_reg(path->dentry) && !d_can_lookup(path->dentry)) @@ -269,11 +269,8 @@ static struct inode *fanotify_fid_inode(struct inode *to_tell, u32 event_mask, { if (event_mask & ALL_FSNOTIFY_DIRENT_EVENTS) return to_tell; - else if (data_type == FSNOTIFY_EVENT_INODE) - return (struct inode *)data; - else if (data_type == FSNOTIFY_EVENT_PATH) - return d_inode(((struct path *)data)->dentry); - return NULL; + + return (struct inode *)fsnotify_data_inode(data, data_type); } struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group, @@ -284,6 +281,7 @@ struct fanotify_event *fanotify_alloc_event(struct fsnotify_group *group, struct fanotify_event *event = NULL; gfp_t gfp = GFP_KERNEL_ACCOUNT; struct inode *id = fanotify_fid_inode(inode, mask, data, data_type); + const struct path *path = fsnotify_data_path(data, data_type); /* * For queues with unlimited length lost events are not expected and @@ -324,10 +322,10 @@ init: __maybe_unused if (id && FAN_GROUP_FLAG(group, FAN_REPORT_FID)) { /* Report the event without a file identifier on encode error */ event->fh_type = fanotify_encode_fid(event, id, gfp, fsid); - } else if (data_type == FSNOTIFY_EVENT_PATH) { + } else if (path) { event->fh_type = FILEID_ROOT; - event->path = *((struct path *)data); - path_get(&event->path); + event->path = *path; + path_get(path); } else { event->fh_type = FILEID_INVALID; event->path.mnt = NULL; diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 46f225580009..a5d6467f89a0 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -318,6 +318,7 @@ static void fsnotify_iter_next(struct fsnotify_iter_info *iter_info) int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, const struct qstr *file_name, u32 cookie) { + const struct path *path = fsnotify_data_path(data, data_is); struct fsnotify_iter_info iter_info = {}; struct super_block *sb = to_tell->i_sb; struct mount *mnt = NULL; @@ -325,8 +326,8 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, int ret = 0; __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS); - if (data_is == FSNOTIFY_EVENT_PATH) { - mnt = real_mount(((const struct path *)data)->mnt); + if (path) { + mnt = real_mount(path->mnt); mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; } /* An event "on child" is not intended for a mount/sb mark */ diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c index d510223d302c..6bb98522bbfd 100644 --- a/fs/notify/inotify/inotify_fsnotify.c +++ b/fs/notify/inotify/inotify_fsnotify.c @@ -61,6 +61,7 @@ int inotify_handle_event(struct fsnotify_group *group, const struct qstr *file_name, u32 cookie, struct fsnotify_iter_info *iter_info) { + const struct path *path = fsnotify_data_path(data, data_type); struct fsnotify_mark *inode_mark = fsnotify_iter_inode_mark(iter_info); struct inotify_inode_mark *i_mark; struct inotify_event_info *event; @@ -73,12 +74,9 @@ int inotify_handle_event(struct fsnotify_group *group, return 0; if ((inode_mark->mask & FS_EXCL_UNLINK) && - (data_type == FSNOTIFY_EVENT_PATH)) { - const struct path *path = data; + path && d_unlinked(path->dentry)) + return 0; - if (d_unlinked(path->dentry)) - return 0; - } if (file_name) { len = file_name->len; alloc_len += len + 1; diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h index db3cabb4600e..5cc838db422a 100644 --- a/include/linux/fsnotify_backend.h +++ b/include/linux/fsnotify_backend.h @@ -212,10 +212,36 @@ struct fsnotify_group { }; }; -/* when calling fsnotify tell it if the data is a path or inode */ -#define FSNOTIFY_EVENT_NONE 0 -#define FSNOTIFY_EVENT_PATH 1 -#define FSNOTIFY_EVENT_INODE 2 +/* When calling fsnotify tell it if the data is a path or inode */ +enum fsnotify_data_type { + FSNOTIFY_EVENT_NONE, + FSNOTIFY_EVENT_PATH, + FSNOTIFY_EVENT_INODE, +}; + +static inline const struct inode *fsnotify_data_inode(const void *data, + int data_type) +{ + switch (data_type) { + case FSNOTIFY_EVENT_INODE: + return data; + case FSNOTIFY_EVENT_PATH: + return d_inode(((const struct path *)data)->dentry); + default: + return NULL; + } +} + +static inline const struct path *fsnotify_data_path(const void *data, + int data_type) +{ + switch (data_type) { + case FSNOTIFY_EVENT_PATH: + return data; + default: + return NULL; + } +} enum fsnotify_obj_type { FSNOTIFY_OBJ_TYPE_INODE, diff --git a/kernel/audit_fsnotify.c b/kernel/audit_fsnotify.c index f0d243318452..3596448bfdab 100644 --- a/kernel/audit_fsnotify.c +++ b/kernel/audit_fsnotify.c @@ -160,23 +160,14 @@ static int audit_mark_handle_event(struct fsnotify_group *group, { struct fsnotify_mark *inode_mark = fsnotify_iter_inode_mark(iter_info); struct audit_fsnotify_mark *audit_mark; - const struct inode *inode = NULL; + const struct inode *inode = fsnotify_data_inode(data, data_type); audit_mark = container_of(inode_mark, struct audit_fsnotify_mark, mark); BUG_ON(group != audit_fsnotify_group); - switch (data_type) { - case (FSNOTIFY_EVENT_PATH): - inode = ((const struct path *)data)->dentry->d_inode; - break; - case (FSNOTIFY_EVENT_INODE): - inode = (const struct inode *)data; - break; - default: - BUG(); + if (WARN_ON(!inode)) return 0; - } if (mask & (FS_CREATE|FS_MOVED_TO|FS_DELETE|FS_MOVED_FROM)) { if (audit_compare_dname_path(dname, audit_mark->path, AUDIT_NAME_FULL)) diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 4508d5e0cf69..dcfbb44c6720 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -473,25 +473,13 @@ static int audit_watch_handle_event(struct fsnotify_group *group, struct fsnotify_iter_info *iter_info) { struct fsnotify_mark *inode_mark = fsnotify_iter_inode_mark(iter_info); - const struct inode *inode; + const struct inode *inode = fsnotify_data_inode(data, data_type); struct audit_parent *parent; parent = container_of(inode_mark, struct audit_parent, mark); BUG_ON(group != audit_watch_group); - - switch (data_type) { - case (FSNOTIFY_EVENT_PATH): - inode = d_backing_inode(((const struct path *)data)->dentry); - break; - case (FSNOTIFY_EVENT_INODE): - inode = (const struct inode *)data; - break; - default: - BUG(); - inode = NULL; - break; - } + WARN_ON(!inode); if (mask & (FS_CREATE|FS_MOVED_TO) && inode) audit_update_watch(parent, dname, inode->i_sb->s_dev, inode->i_ino, 0); -- cgit v1.2.3-71-gd317 From 8bf6c677ddb9c922423ea3bf494fe7c508bfbb8c Mon Sep 17 00:00:00 2001 From: Sebastian Siewior Date: Mon, 23 Mar 2020 16:20:19 +0100 Subject: completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() The warning was intended to spot complete_all() users from hardirq context on PREEMPT_RT. The warning as-is will also trigger in interrupt handlers, which are threaded on PREEMPT_RT, which was not intended. Use lockdep_assert_RT_in_threaded_ctx() which triggers in non-preemptive context on PREEMPT_RT. Fixes: a5c6234e1028 ("completion: Use simple wait queues") Reported-by: kernel test robot Suggested-by: Peter Zijlstra Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200323152019.4qjwluldohuh3by5@linutronix.de --- include/linux/lockdep.h | 15 +++++++++++++++ kernel/sched/completion.c | 2 +- 2 files changed, 16 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 425b4ceb7cd0..206774ac6946 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -711,6 +711,21 @@ do { \ # define lockdep_assert_in_irq() do { } while (0) #endif +#ifdef CONFIG_PROVE_RAW_LOCK_NESTING + +# define lockdep_assert_RT_in_threaded_ctx() do { \ + WARN_ONCE(debug_locks && !current->lockdep_recursion && \ + current->hardirq_context && \ + !(current->hardirq_threaded || current->irq_config), \ + "Not in threaded context on PREEMPT_RT as expected\n"); \ +} while (0) + +#else + +# define lockdep_assert_RT_in_threaded_ctx() do { } while (0) + +#endif + #ifdef CONFIG_LOCKDEP void lockdep_rcu_suspicious(const char *file, const int line, const char *s); #else diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c index f15e96164ff1..a778554f9dad 100644 --- a/kernel/sched/completion.c +++ b/kernel/sched/completion.c @@ -58,7 +58,7 @@ void complete_all(struct completion *x) { unsigned long flags; - WARN_ON(irqs_disabled()); + lockdep_assert_RT_in_threaded_ctx(); raw_spin_lock_irqsave(&x->wait.lock, flags); x->done = UINT_MAX; -- cgit v1.2.3-71-gd317 From 2985bed68083f3da5f6d79c3dbb9196dbc04d02a Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Tue, 3 Mar 2020 22:35:58 +0900 Subject: .gitignore: remove too obvious comments Some .gitignore files have comments like "Generated files", "Ignore generated files" at the header part, but they are too obvious. Signed-off-by: Masahiro Yamada Signed-off-by: Greg Kroah-Hartman --- certs/.gitignore | 3 --- drivers/atm/.gitignore | 1 - drivers/video/logo/.gitignore | 3 --- kernel/.gitignore | 3 --- lib/.gitignore | 3 --- scripts/.gitignore | 3 --- scripts/kconfig/.gitignore | 3 --- scripts/selinux/mdp/.gitignore | 1 - security/apparmor/.gitignore | 3 --- sound/oss/.gitignore | 1 - 10 files changed, 24 deletions(-) (limited to 'kernel') diff --git a/certs/.gitignore b/certs/.gitignore index f51aea4a71ec..4d58ba042b37 100644 --- a/certs/.gitignore +++ b/certs/.gitignore @@ -1,4 +1 @@ -# -# Generated files -# x509_certificate_list diff --git a/drivers/atm/.gitignore b/drivers/atm/.gitignore index fc0ae5eb05d8..19f3ffbd1d65 100644 --- a/drivers/atm/.gitignore +++ b/drivers/atm/.gitignore @@ -1,4 +1,3 @@ -# Ignore generated files fore200e_mkfirm fore200e_pca_fw.c pca200e.bin diff --git a/drivers/video/logo/.gitignore b/drivers/video/logo/.gitignore index 9dda1b26b2e4..1551a75afdbd 100644 --- a/drivers/video/logo/.gitignore +++ b/drivers/video/logo/.gitignore @@ -1,6 +1,3 @@ -# -# Generated files -# *_mono.c *_vga16.c *_clut224.c diff --git a/kernel/.gitignore b/kernel/.gitignore index 34d1e77ee9df..0a423a3ca2e1 100644 --- a/kernel/.gitignore +++ b/kernel/.gitignore @@ -1,6 +1,3 @@ -# -# Generated files -# kheaders.md5 timeconst.h hz.bc diff --git a/lib/.gitignore b/lib/.gitignore index f2a39c9e5485..9af73655a239 100644 --- a/lib/.gitignore +++ b/lib/.gitignore @@ -1,6 +1,3 @@ -# -# Generated files -# gen_crc32table gen_crc64table crc32table.h diff --git a/scripts/.gitignore b/scripts/.gitignore index ef45f96cd7a5..9fe29efbcb95 100644 --- a/scripts/.gitignore +++ b/scripts/.gitignore @@ -1,6 +1,3 @@ -# -# Generated files -# bin2c kallsyms unifdef diff --git a/scripts/kconfig/.gitignore b/scripts/kconfig/.gitignore index b5bf92f66d11..588988711e07 100644 --- a/scripts/kconfig/.gitignore +++ b/scripts/kconfig/.gitignore @@ -1,6 +1,3 @@ -# -# Generated files -# *.moc *conf-cfg diff --git a/scripts/selinux/mdp/.gitignore b/scripts/selinux/mdp/.gitignore index 654546d8dffd..0d9f827dc14b 100644 --- a/scripts/selinux/mdp/.gitignore +++ b/scripts/selinux/mdp/.gitignore @@ -1,2 +1 @@ -# Generated file mdp diff --git a/security/apparmor/.gitignore b/security/apparmor/.gitignore index d5b291e94264..0ace1d1dec44 100644 --- a/security/apparmor/.gitignore +++ b/security/apparmor/.gitignore @@ -1,6 +1,3 @@ -# -# Generated include files -# net_names.h capability_names.h rlim_names.h diff --git a/sound/oss/.gitignore b/sound/oss/.gitignore index 12a3920d6fb6..8fd8fd3eff62 100644 --- a/sound/oss/.gitignore +++ b/sound/oss/.gitignore @@ -1,3 +1,2 @@ -#Ignore generated files pss_boot.h trix_boot.h -- cgit v1.2.3-71-gd317 From d198b34f3855eee2571dda03eea75a09c7c31480 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Tue, 3 Mar 2020 22:35:59 +0900 Subject: .gitignore: add SPDX License Identifier Add SPDX License Identifier to all .gitignore files. Signed-off-by: Masahiro Yamada Signed-off-by: Greg Kroah-Hartman --- .gitignore | 1 + Documentation/.gitignore | 1 + Documentation/devicetree/bindings/.gitignore | 1 + Documentation/vm/.gitignore | 1 + arch/.gitignore | 1 + arch/alpha/kernel/.gitignore | 1 + arch/arc/boot/.gitignore | 1 + arch/arc/kernel/.gitignore | 1 + arch/arm/boot/.gitignore | 1 + arch/arm/boot/compressed/.gitignore | 1 + arch/arm/crypto/.gitignore | 1 + arch/arm/kernel/.gitignore | 1 + arch/arm/mach-at91/.gitignore | 1 + arch/arm/mach-omap2/.gitignore | 1 + arch/arm/vdso/.gitignore | 1 + arch/arm64/boot/.gitignore | 1 + arch/arm64/crypto/.gitignore | 1 + arch/arm64/kernel/.gitignore | 1 + arch/arm64/kernel/vdso/.gitignore | 1 + arch/arm64/kernel/vdso32/.gitignore | 1 + arch/ia64/kernel/.gitignore | 1 + arch/m68k/kernel/.gitignore | 1 + arch/microblaze/boot/.gitignore | 1 + arch/microblaze/kernel/.gitignore | 1 + arch/mips/boot/.gitignore | 1 + arch/mips/boot/compressed/.gitignore | 1 + arch/mips/boot/tools/.gitignore | 1 + arch/mips/kernel/.gitignore | 1 + arch/mips/tools/.gitignore | 1 + arch/mips/vdso/.gitignore | 1 + arch/nds32/kernel/.gitignore | 1 + arch/nds32/kernel/vdso/.gitignore | 1 + arch/nios2/boot/.gitignore | 1 + arch/nios2/kernel/.gitignore | 1 + arch/openrisc/kernel/.gitignore | 1 + arch/parisc/boot/.gitignore | 1 + arch/parisc/boot/compressed/.gitignore | 1 + arch/parisc/kernel/.gitignore | 1 + arch/powerpc/boot/.gitignore | 1 + arch/powerpc/kernel/.gitignore | 1 + arch/powerpc/kernel/vdso32/.gitignore | 1 + arch/powerpc/kernel/vdso64/.gitignore | 1 + arch/powerpc/platforms/cell/spufs/.gitignore | 1 + arch/powerpc/purgatory/.gitignore | 1 + arch/riscv/boot/.gitignore | 1 + arch/riscv/kernel/.gitignore | 1 + arch/riscv/kernel/vdso/.gitignore | 1 + arch/s390/boot/.gitignore | 1 + arch/s390/boot/compressed/.gitignore | 1 + arch/s390/kernel/.gitignore | 1 + arch/s390/kernel/vdso64/.gitignore | 1 + arch/s390/purgatory/.gitignore | 1 + arch/s390/tools/.gitignore | 1 + arch/sh/boot/.gitignore | 1 + arch/sh/boot/compressed/.gitignore | 1 + arch/sh/kernel/.gitignore | 1 + arch/sh/kernel/vsyscall/.gitignore | 1 + arch/sparc/boot/.gitignore | 1 + arch/sparc/kernel/.gitignore | 1 + arch/sparc/vdso/.gitignore | 1 + arch/sparc/vdso/vdso32/.gitignore | 1 + arch/um/.gitignore | 1 + arch/unicore32/.gitignore | 1 + arch/x86/.gitignore | 1 + arch/x86/boot/.gitignore | 1 + arch/x86/boot/compressed/.gitignore | 1 + arch/x86/boot/tools/.gitignore | 1 + arch/x86/crypto/.gitignore | 1 + arch/x86/entry/vdso/.gitignore | 1 + arch/x86/entry/vdso/vdso32/.gitignore | 1 + arch/x86/kernel/.gitignore | 1 + arch/x86/kernel/cpu/.gitignore | 1 + arch/x86/lib/.gitignore | 1 + arch/x86/realmode/rm/.gitignore | 1 + arch/x86/tools/.gitignore | 1 + arch/x86/um/vdso/.gitignore | 1 + arch/xtensa/boot/.gitignore | 1 + arch/xtensa/boot/boot-elf/.gitignore | 1 + arch/xtensa/boot/lib/.gitignore | 1 + arch/xtensa/kernel/.gitignore | 1 + certs/.gitignore | 1 + drivers/atm/.gitignore | 1 + drivers/crypto/vmx/.gitignore | 1 + drivers/eisa/.gitignore | 1 + drivers/gpu/drm/i915/.gitignore | 1 + drivers/gpu/drm/radeon/.gitignore | 1 + drivers/memory/.gitignore | 1 + drivers/net/wan/.gitignore | 1 + drivers/scsi/.gitignore | 1 + drivers/scsi/aic7xxx/.gitignore | 1 + drivers/staging/comedi/drivers/ni_routing/tools/.gitignore | 1 + drivers/staging/greybus/tools/.gitignore | 1 + drivers/video/logo/.gitignore | 1 + drivers/zorro/.gitignore | 1 + fs/unicode/.gitignore | 1 + kernel/.gitignore | 1 + kernel/debug/kdb/.gitignore | 1 + lib/.gitignore | 1 + lib/raid6/.gitignore | 1 + net/bpfilter/.gitignore | 1 + net/wireless/.gitignore | 1 + samples/auxdisplay/.gitignore | 1 + samples/bpf/.gitignore | 1 + samples/connector/.gitignore | 1 + samples/hidraw/.gitignore | 1 + samples/mei/.gitignore | 1 + samples/mic/mpssd/.gitignore | 1 + samples/pidfd/.gitignore | 1 + samples/seccomp/.gitignore | 1 + samples/timers/.gitignore | 1 + samples/vfs/.gitignore | 1 + samples/watchdog/.gitignore | 1 + scripts/.gitignore | 1 + scripts/basic/.gitignore | 1 + scripts/dtc/.gitignore | 1 + scripts/gcc-plugins/.gitignore | 1 + scripts/gdb/linux/.gitignore | 1 + scripts/genksyms/.gitignore | 1 + scripts/kconfig/.gitignore | 1 + scripts/mod/.gitignore | 1 + scripts/selinux/genheaders/.gitignore | 1 + scripts/selinux/mdp/.gitignore | 1 + security/apparmor/.gitignore | 1 + security/selinux/.gitignore | 1 + security/tomoyo/.gitignore | 1 + sound/oss/.gitignore | 1 + tools/accounting/.gitignore | 1 + tools/bootconfig/.gitignore | 1 + tools/bpf/.gitignore | 1 + tools/bpf/bpftool/.gitignore | 1 + tools/bpf/runqslower/.gitignore | 1 + tools/build/.gitignore | 1 + tools/build/feature/.gitignore | 1 + tools/cgroup/.gitignore | 1 + tools/gpio/.gitignore | 1 + tools/iio/.gitignore | 1 + tools/laptop/dslm/.gitignore | 1 + tools/leds/.gitignore | 1 + tools/lib/bpf/.gitignore | 1 + tools/lib/lockdep/.gitignore | 1 + tools/lib/traceevent/.gitignore | 1 + tools/memory-model/.gitignore | 1 + tools/memory-model/litmus-tests/.gitignore | 1 + tools/objtool/.gitignore | 1 + tools/pcmcia/.gitignore | 1 + tools/perf/.gitignore | 1 + tools/perf/tests/.gitignore | 1 + tools/power/acpi/.gitignore | 1 + tools/power/cpupower/.gitignore | 1 + tools/power/x86/intel-speed-select/.gitignore | 1 + tools/power/x86/turbostat/.gitignore | 1 + tools/spi/.gitignore | 1 + tools/testing/kunit/.gitignore | 1 + tools/testing/radix-tree/.gitignore | 1 + tools/testing/selftests/.gitignore | 1 + tools/testing/selftests/android/ion/.gitignore | 1 + tools/testing/selftests/arm64/signal/.gitignore | 1 + tools/testing/selftests/arm64/tags/.gitignore | 1 + tools/testing/selftests/bpf/.gitignore | 1 + tools/testing/selftests/bpf/map_tests/.gitignore | 1 + tools/testing/selftests/bpf/prog_tests/.gitignore | 1 + tools/testing/selftests/bpf/verifier/.gitignore | 1 + tools/testing/selftests/breakpoints/.gitignore | 1 + tools/testing/selftests/capabilities/.gitignore | 1 + tools/testing/selftests/cgroup/.gitignore | 1 + tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/drivers/.gitignore | 1 + tools/testing/selftests/efivarfs/.gitignore | 1 + tools/testing/selftests/exec/.gitignore | 1 + tools/testing/selftests/filesystems/.gitignore | 1 + tools/testing/selftests/filesystems/binderfs/.gitignore | 1 + tools/testing/selftests/filesystems/epoll/.gitignore | 1 + tools/testing/selftests/ftrace/.gitignore | 1 + tools/testing/selftests/futex/functional/.gitignore | 1 + tools/testing/selftests/gpio/.gitignore | 1 + tools/testing/selftests/ia64/.gitignore | 1 + tools/testing/selftests/intel_pstate/.gitignore | 1 + tools/testing/selftests/ipc/.gitignore | 1 + tools/testing/selftests/ir/.gitignore | 1 + tools/testing/selftests/kcmp/.gitignore | 1 + tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/media_tests/.gitignore | 1 + tools/testing/selftests/membarrier/.gitignore | 1 + tools/testing/selftests/memfd/.gitignore | 1 + tools/testing/selftests/mount/.gitignore | 1 + tools/testing/selftests/mqueue/.gitignore | 1 + tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/forwarding/.gitignore | 1 + tools/testing/selftests/net/mptcp/.gitignore | 1 + tools/testing/selftests/networking/timestamping/.gitignore | 1 + tools/testing/selftests/nsfs/.gitignore | 1 + tools/testing/selftests/openat2/.gitignore | 1 + tools/testing/selftests/pidfd/.gitignore | 1 + tools/testing/selftests/powerpc/alignment/.gitignore | 1 + tools/testing/selftests/powerpc/benchmarks/.gitignore | 1 + tools/testing/selftests/powerpc/cache_shape/.gitignore | 1 + tools/testing/selftests/powerpc/copyloops/.gitignore | 1 + tools/testing/selftests/powerpc/dscr/.gitignore | 1 + tools/testing/selftests/powerpc/math/.gitignore | 1 + tools/testing/selftests/powerpc/mm/.gitignore | 1 + tools/testing/selftests/powerpc/pmu/.gitignore | 1 + tools/testing/selftests/powerpc/pmu/ebb/.gitignore | 1 + tools/testing/selftests/powerpc/primitives/.gitignore | 1 + tools/testing/selftests/powerpc/ptrace/.gitignore | 1 + tools/testing/selftests/powerpc/security/.gitignore | 1 + tools/testing/selftests/powerpc/signal/.gitignore | 1 + tools/testing/selftests/powerpc/stringloops/.gitignore | 1 + tools/testing/selftests/powerpc/switch_endian/.gitignore | 1 + tools/testing/selftests/powerpc/syscalls/.gitignore | 1 + tools/testing/selftests/powerpc/tm/.gitignore | 1 + tools/testing/selftests/powerpc/vphn/.gitignore | 1 + tools/testing/selftests/prctl/.gitignore | 1 + tools/testing/selftests/proc/.gitignore | 1 + tools/testing/selftests/pstore/.gitignore | 1 + tools/testing/selftests/ptp/.gitignore | 1 + tools/testing/selftests/ptrace/.gitignore | 1 + tools/testing/selftests/rcutorture/.gitignore | 1 + tools/testing/selftests/rcutorture/formal/srcu-cbmc/.gitignore | 1 + .../selftests/rcutorture/formal/srcu-cbmc/include/linux/.gitignore | 1 + .../rcutorture/formal/srcu-cbmc/tests/store_buffering/.gitignore | 1 + tools/testing/selftests/rseq/.gitignore | 1 + tools/testing/selftests/rtc/.gitignore | 1 + tools/testing/selftests/safesetid/.gitignore | 1 + tools/testing/selftests/seccomp/.gitignore | 1 + tools/testing/selftests/sigaltstack/.gitignore | 1 + tools/testing/selftests/size/.gitignore | 1 + tools/testing/selftests/sparc64/drivers/.gitignore | 1 + tools/testing/selftests/splice/.gitignore | 1 + tools/testing/selftests/sync/.gitignore | 1 + tools/testing/selftests/tc-testing/.gitignore | 1 + tools/testing/selftests/timens/.gitignore | 1 + tools/testing/selftests/timers/.gitignore | 1 + tools/testing/selftests/tmpfs/.gitignore | 1 + tools/testing/selftests/vDSO/.gitignore | 1 + tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/watchdog/.gitignore | 1 + tools/testing/selftests/wireguard/qemu/.gitignore | 1 + tools/testing/selftests/x86/.gitignore | 1 + tools/testing/vsock/.gitignore | 1 + tools/thermal/tmon/.gitignore | 1 + tools/usb/.gitignore | 1 + tools/usb/usbip/.gitignore | 1 + tools/virtio/.gitignore | 1 + tools/vm/.gitignore | 1 + usr/.gitignore | 1 + usr/include/.gitignore | 1 + 246 files changed, 246 insertions(+) (limited to 'kernel') diff --git a/.gitignore b/.gitignore index 72ef86a5570d..2258e906f01c 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only # # NOTE! Don't add files that are generated in specific # subdirectories here. Add them in the ".gitignore" file diff --git a/Documentation/.gitignore b/Documentation/.gitignore index e74fec8693b2..d6dc7c9b8e25 100644 --- a/Documentation/.gitignore +++ b/Documentation/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only output *.pyc diff --git a/Documentation/devicetree/bindings/.gitignore b/Documentation/devicetree/bindings/.gitignore index ef82fcfcccab..66878559822b 100644 --- a/Documentation/devicetree/bindings/.gitignore +++ b/Documentation/devicetree/bindings/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only *.example.dts processed-schema.yaml diff --git a/Documentation/vm/.gitignore b/Documentation/vm/.gitignore index 09b164a5700f..bc74f5643008 100644 --- a/Documentation/vm/.gitignore +++ b/Documentation/vm/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only page-types slabinfo diff --git a/arch/.gitignore b/arch/.gitignore index 741468920320..4191da401dbb 100644 --- a/arch/.gitignore +++ b/arch/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only i386 x86_64 diff --git a/arch/alpha/kernel/.gitignore b/arch/alpha/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/alpha/kernel/.gitignore +++ b/arch/alpha/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/arc/boot/.gitignore b/arch/arc/boot/.gitignore index c4c5fd529c25..675db1494028 100644 --- a/arch/arc/boot/.gitignore +++ b/arch/arc/boot/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only uImage diff --git a/arch/arc/kernel/.gitignore b/arch/arc/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/arc/kernel/.gitignore +++ b/arch/arc/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/arm/boot/.gitignore b/arch/arm/boot/.gitignore index ce1c5ff746e7..8c759326baf4 100644 --- a/arch/arm/boot/.gitignore +++ b/arch/arm/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only Image zImage xipImage diff --git a/arch/arm/boot/compressed/.gitignore b/arch/arm/boot/compressed/.gitignore index 86b2f5d28240..db05c6ef3e31 100644 --- a/arch/arm/boot/compressed/.gitignore +++ b/arch/arm/boot/compressed/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only ashldi3.S bswapsdi2.S font.c diff --git a/arch/arm/crypto/.gitignore b/arch/arm/crypto/.gitignore index 31e1f538df7d..790e204050ba 100644 --- a/arch/arm/crypto/.gitignore +++ b/arch/arm/crypto/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only aesbs-core.S sha256-core.S sha512-core.S diff --git a/arch/arm/kernel/.gitignore b/arch/arm/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/arm/kernel/.gitignore +++ b/arch/arm/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/arm/mach-at91/.gitignore b/arch/arm/mach-at91/.gitignore index 2ecd6f51c8a9..f6d47389675e 100644 --- a/arch/arm/mach-at91/.gitignore +++ b/arch/arm/mach-at91/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only pm_data-offsets.h diff --git a/arch/arm/mach-omap2/.gitignore b/arch/arm/mach-omap2/.gitignore index 79a8d6ea7152..dc7be7556736 100644 --- a/arch/arm/mach-omap2/.gitignore +++ b/arch/arm/mach-omap2/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only pm-asm-offsets.h diff --git a/arch/arm/vdso/.gitignore b/arch/arm/vdso/.gitignore index 6b47f6e0b032..dfa06f5365cf 100644 --- a/arch/arm/vdso/.gitignore +++ b/arch/arm/vdso/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds vdso.so.raw vdsomunge diff --git a/arch/arm64/boot/.gitignore b/arch/arm64/boot/.gitignore index 8dab0bb6ae66..9a7a9009d43a 100644 --- a/arch/arm64/boot/.gitignore +++ b/arch/arm64/boot/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only Image Image.gz diff --git a/arch/arm64/crypto/.gitignore b/arch/arm64/crypto/.gitignore index 879df8781ed5..a11a86217ffb 100644 --- a/arch/arm64/crypto/.gitignore +++ b/arch/arm64/crypto/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only sha256-core.S sha512-core.S diff --git a/arch/arm64/kernel/.gitignore b/arch/arm64/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/arm64/kernel/.gitignore +++ b/arch/arm64/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/arm64/kernel/vdso/.gitignore b/arch/arm64/kernel/vdso/.gitignore index f8b69d84238e..652e31d82582 100644 --- a/arch/arm64/kernel/vdso/.gitignore +++ b/arch/arm64/kernel/vdso/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds diff --git a/arch/arm64/kernel/vdso32/.gitignore b/arch/arm64/kernel/vdso32/.gitignore index 4fea950fa5ed..3542fa24e26b 100644 --- a/arch/arm64/kernel/vdso32/.gitignore +++ b/arch/arm64/kernel/vdso32/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds vdso.so.raw diff --git a/arch/ia64/kernel/.gitignore b/arch/ia64/kernel/.gitignore index 21cb0da5ded8..0374827206e7 100644 --- a/arch/ia64/kernel/.gitignore +++ b/arch/ia64/kernel/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only gate.lds vmlinux.lds diff --git a/arch/m68k/kernel/.gitignore b/arch/m68k/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/m68k/kernel/.gitignore +++ b/arch/m68k/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/microblaze/boot/.gitignore b/arch/microblaze/boot/.gitignore index 679502d64a97..11a9e229f3c0 100644 --- a/arch/microblaze/boot/.gitignore +++ b/arch/microblaze/boot/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only linux.bin* simpleImage.* diff --git a/arch/microblaze/kernel/.gitignore b/arch/microblaze/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/microblaze/kernel/.gitignore +++ b/arch/microblaze/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/mips/boot/.gitignore b/arch/mips/boot/.gitignore index a73d6e2c4f64..2adc8581a175 100644 --- a/arch/mips/boot/.gitignore +++ b/arch/mips/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only mkboot elf2ecoff vmlinux.* diff --git a/arch/mips/boot/compressed/.gitignore b/arch/mips/boot/compressed/.gitignore index ebae133f1d00..d358395614c9 100644 --- a/arch/mips/boot/compressed/.gitignore +++ b/arch/mips/boot/compressed/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only ashldi3.c bswapsi.c diff --git a/arch/mips/boot/tools/.gitignore b/arch/mips/boot/tools/.gitignore index be0ed065249b..d36dc7cf9115 100644 --- a/arch/mips/boot/tools/.gitignore +++ b/arch/mips/boot/tools/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only relocs diff --git a/arch/mips/kernel/.gitignore b/arch/mips/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/mips/kernel/.gitignore +++ b/arch/mips/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/mips/tools/.gitignore b/arch/mips/tools/.gitignore index b0209450d9ff..794817dfb389 100644 --- a/arch/mips/tools/.gitignore +++ b/arch/mips/tools/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only elf-entry loongson3-llsc-check diff --git a/arch/mips/vdso/.gitignore b/arch/mips/vdso/.gitignore index 5286a7d73d79..1f43f6dd8142 100644 --- a/arch/mips/vdso/.gitignore +++ b/arch/mips/vdso/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.so* vdso-*image.c genvdso diff --git a/arch/nds32/kernel/.gitignore b/arch/nds32/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/nds32/kernel/.gitignore +++ b/arch/nds32/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/nds32/kernel/vdso/.gitignore b/arch/nds32/kernel/vdso/.gitignore index f8b69d84238e..652e31d82582 100644 --- a/arch/nds32/kernel/vdso/.gitignore +++ b/arch/nds32/kernel/vdso/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds diff --git a/arch/nios2/boot/.gitignore b/arch/nios2/boot/.gitignore index 64386a8dedd8..ef37cac5bcc0 100644 --- a/arch/nios2/boot/.gitignore +++ b/arch/nios2/boot/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmImage diff --git a/arch/nios2/kernel/.gitignore b/arch/nios2/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/nios2/kernel/.gitignore +++ b/arch/nios2/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/openrisc/kernel/.gitignore b/arch/openrisc/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/openrisc/kernel/.gitignore +++ b/arch/openrisc/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/parisc/boot/.gitignore b/arch/parisc/boot/.gitignore index 017d5912ad2d..adf2ae0e7eda 100644 --- a/arch/parisc/boot/.gitignore +++ b/arch/parisc/boot/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only image bzImage diff --git a/arch/parisc/boot/compressed/.gitignore b/arch/parisc/boot/compressed/.gitignore index 926cd41c1069..b9853a356ab2 100644 --- a/arch/parisc/boot/compressed/.gitignore +++ b/arch/parisc/boot/compressed/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only firmware.c real2.S sizes.h diff --git a/arch/parisc/kernel/.gitignore b/arch/parisc/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/parisc/kernel/.gitignore +++ b/arch/parisc/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/powerpc/boot/.gitignore b/arch/powerpc/boot/.gitignore index 6610665fcf5e..1eee61b82341 100644 --- a/arch/powerpc/boot/.gitignore +++ b/arch/powerpc/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only addnote decompress_inflate.c empty.c diff --git a/arch/powerpc/kernel/.gitignore b/arch/powerpc/kernel/.gitignore index 67ebd3003c05..d71179d3ffe9 100644 --- a/arch/powerpc/kernel/.gitignore +++ b/arch/powerpc/kernel/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only prom_init_check vmlinux.lds diff --git a/arch/powerpc/kernel/vdso32/.gitignore b/arch/powerpc/kernel/vdso32/.gitignore index fea5809857a5..824b863ec6bd 100644 --- a/arch/powerpc/kernel/vdso32/.gitignore +++ b/arch/powerpc/kernel/vdso32/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso32.lds vdso32.so.dbg diff --git a/arch/powerpc/kernel/vdso64/.gitignore b/arch/powerpc/kernel/vdso64/.gitignore index 77a0b423642c..84151a7ba31d 100644 --- a/arch/powerpc/kernel/vdso64/.gitignore +++ b/arch/powerpc/kernel/vdso64/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso64.lds vdso64.so.dbg diff --git a/arch/powerpc/platforms/cell/spufs/.gitignore b/arch/powerpc/platforms/cell/spufs/.gitignore index a09ee8d84d6c..5f3eb224f653 100644 --- a/arch/powerpc/platforms/cell/spufs/.gitignore +++ b/arch/powerpc/platforms/cell/spufs/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only spu_save_dump.h spu_restore_dump.h diff --git a/arch/powerpc/purgatory/.gitignore b/arch/powerpc/purgatory/.gitignore index e9e66f178a6d..b8dc6ff34254 100644 --- a/arch/powerpc/purgatory/.gitignore +++ b/arch/powerpc/purgatory/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only kexec-purgatory.c purgatory.ro diff --git a/arch/riscv/boot/.gitignore b/arch/riscv/boot/.gitignore index 8a45a37d2af4..574c10f8ff68 100644 --- a/arch/riscv/boot/.gitignore +++ b/arch/riscv/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only Image Image.gz loader diff --git a/arch/riscv/kernel/.gitignore b/arch/riscv/kernel/.gitignore index b51634f6a7cd..e052ed331cc1 100644 --- a/arch/riscv/kernel/.gitignore +++ b/arch/riscv/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only /vmlinux.lds diff --git a/arch/riscv/kernel/vdso/.gitignore b/arch/riscv/kernel/vdso/.gitignore index 97c2d69d0289..11ebee9e4c1d 100644 --- a/arch/riscv/kernel/vdso/.gitignore +++ b/arch/riscv/kernel/vdso/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds *.tmp diff --git a/arch/s390/boot/.gitignore b/arch/s390/boot/.gitignore index 16ff906e4610..b265bfede188 100644 --- a/arch/s390/boot/.gitignore +++ b/arch/s390/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only image bzImage section_cmp.* diff --git a/arch/s390/boot/compressed/.gitignore b/arch/s390/boot/compressed/.gitignore index e72fcd7ecebb..765a08f1bd77 100644 --- a/arch/s390/boot/compressed/.gitignore +++ b/arch/s390/boot/compressed/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux vmlinux.lds diff --git a/arch/s390/kernel/.gitignore b/arch/s390/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/s390/kernel/.gitignore +++ b/arch/s390/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/s390/kernel/vdso64/.gitignore b/arch/s390/kernel/vdso64/.gitignore index 3fd18cf9fec2..4ec80685fecc 100644 --- a/arch/s390/kernel/vdso64/.gitignore +++ b/arch/s390/kernel/vdso64/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso64.lds diff --git a/arch/s390/purgatory/.gitignore b/arch/s390/purgatory/.gitignore index c82157f46b18..97ca52779457 100644 --- a/arch/s390/purgatory/.gitignore +++ b/arch/s390/purgatory/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only purgatory purgatory.chk purgatory.lds diff --git a/arch/s390/tools/.gitignore b/arch/s390/tools/.gitignore index 71bd6f8eebaf..ea62f37b79ef 100644 --- a/arch/s390/tools/.gitignore +++ b/arch/s390/tools/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only gen_facilities gen_opcode_table diff --git a/arch/sh/boot/.gitignore b/arch/sh/boot/.gitignore index f50fdd9975c5..6603bbbc917d 100644 --- a/arch/sh/boot/.gitignore +++ b/arch/sh/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only zImage vmlinux* uImage* diff --git a/arch/sh/boot/compressed/.gitignore b/arch/sh/boot/compressed/.gitignore index edff113f1b85..37aa53057369 100644 --- a/arch/sh/boot/compressed/.gitignore +++ b/arch/sh/boot/compressed/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only ashiftrt.S ashldi3.c ashlsi3.S diff --git a/arch/sh/kernel/.gitignore b/arch/sh/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/sh/kernel/.gitignore +++ b/arch/sh/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/sh/kernel/vsyscall/.gitignore b/arch/sh/kernel/vsyscall/.gitignore index 40836ad9079c..530a3031a88d 100644 --- a/arch/sh/kernel/vsyscall/.gitignore +++ b/arch/sh/kernel/vsyscall/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vsyscall.lds diff --git a/arch/sparc/boot/.gitignore b/arch/sparc/boot/.gitignore index fc6f3986c76c..f3d8569a21d1 100644 --- a/arch/sparc/boot/.gitignore +++ b/arch/sparc/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only btfix.S btfixupprep image diff --git a/arch/sparc/kernel/.gitignore b/arch/sparc/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/sparc/kernel/.gitignore +++ b/arch/sparc/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/arch/sparc/vdso/.gitignore b/arch/sparc/vdso/.gitignore index ef925b998222..8d4ebc990bf3 100644 --- a/arch/sparc/vdso/.gitignore +++ b/arch/sparc/vdso/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds vdso-image-*.c vdso2c diff --git a/arch/sparc/vdso/vdso32/.gitignore b/arch/sparc/vdso/vdso32/.gitignore index e45fba9d0ced..5167384843b9 100644 --- a/arch/sparc/vdso/vdso32/.gitignore +++ b/arch/sparc/vdso/vdso32/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso32.lds diff --git a/arch/um/.gitignore b/arch/um/.gitignore index a73d3a1cc746..6323e5571887 100644 --- a/arch/um/.gitignore +++ b/arch/um/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only kernel/config.c kernel/config.tmp kernel/vmlinux.lds diff --git a/arch/unicore32/.gitignore b/arch/unicore32/.gitignore index 947e99c2a957..e82f3fb57ba0 100644 --- a/arch/unicore32/.gitignore +++ b/arch/unicore32/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only # # Generated include files # diff --git a/arch/x86/.gitignore b/arch/x86/.gitignore index 5a82bac5e0bc..677111acbaa3 100644 --- a/arch/x86/.gitignore +++ b/arch/x86/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only boot/compressed/vmlinux tools/test_get_len tools/insn_sanity diff --git a/arch/x86/boot/.gitignore b/arch/x86/boot/.gitignore index 09d25dd09307..9cc7f1357b9b 100644 --- a/arch/x86/boot/.gitignore +++ b/arch/x86/boot/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only bootsect bzImage cpustr.h diff --git a/arch/x86/boot/compressed/.gitignore b/arch/x86/boot/compressed/.gitignore index 4a46fab7162e..25805199a506 100644 --- a/arch/x86/boot/compressed/.gitignore +++ b/arch/x86/boot/compressed/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only relocs vmlinux.bin.all vmlinux.relocs diff --git a/arch/x86/boot/tools/.gitignore b/arch/x86/boot/tools/.gitignore index 378eac25d311..ae91f4d0d78b 100644 --- a/arch/x86/boot/tools/.gitignore +++ b/arch/x86/boot/tools/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only build diff --git a/arch/x86/crypto/.gitignore b/arch/x86/crypto/.gitignore index 30be0400a439..580c839bb177 100644 --- a/arch/x86/crypto/.gitignore +++ b/arch/x86/crypto/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only poly1305-x86_64-cryptogams.S diff --git a/arch/x86/entry/vdso/.gitignore b/arch/x86/entry/vdso/.gitignore index aae8ffdd5880..37a6129d597b 100644 --- a/arch/x86/entry/vdso/.gitignore +++ b/arch/x86/entry/vdso/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds vdsox32.lds vdso32-syscall-syms.lds diff --git a/arch/x86/entry/vdso/vdso32/.gitignore b/arch/x86/entry/vdso/vdso32/.gitignore index e45fba9d0ced..5167384843b9 100644 --- a/arch/x86/entry/vdso/vdso32/.gitignore +++ b/arch/x86/entry/vdso/vdso32/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso32.lds diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore index 08f4fd731469..ef66569e7e22 100644 --- a/arch/x86/kernel/.gitignore +++ b/arch/x86/kernel/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only vsyscall.lds vsyscall_32.lds vmlinux.lds diff --git a/arch/x86/kernel/cpu/.gitignore b/arch/x86/kernel/cpu/.gitignore index 667df55a4399..0bca7ef7426a 100644 --- a/arch/x86/kernel/cpu/.gitignore +++ b/arch/x86/kernel/cpu/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only capflags.c diff --git a/arch/x86/lib/.gitignore b/arch/x86/lib/.gitignore index 8df89f0a3fe6..8ae0f93ecbfd 100644 --- a/arch/x86/lib/.gitignore +++ b/arch/x86/lib/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only inat-tables.c diff --git a/arch/x86/realmode/rm/.gitignore b/arch/x86/realmode/rm/.gitignore index b6ed3a2555cb..6c3464f46166 100644 --- a/arch/x86/realmode/rm/.gitignore +++ b/arch/x86/realmode/rm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only pasyms.h realmode.lds realmode.relocs diff --git a/arch/x86/tools/.gitignore b/arch/x86/tools/.gitignore index be0ed065249b..d36dc7cf9115 100644 --- a/arch/x86/tools/.gitignore +++ b/arch/x86/tools/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only relocs diff --git a/arch/x86/um/vdso/.gitignore b/arch/x86/um/vdso/.gitignore index f8b69d84238e..652e31d82582 100644 --- a/arch/x86/um/vdso/.gitignore +++ b/arch/x86/um/vdso/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso.lds diff --git a/arch/xtensa/boot/.gitignore b/arch/xtensa/boot/.gitignore index 38177c7ebcab..615f1f741a03 100644 --- a/arch/xtensa/boot/.gitignore +++ b/arch/xtensa/boot/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only uImage zImage.redboot diff --git a/arch/xtensa/boot/boot-elf/.gitignore b/arch/xtensa/boot/boot-elf/.gitignore index 5ff8fbb8561b..7473404500cc 100644 --- a/arch/xtensa/boot/boot-elf/.gitignore +++ b/arch/xtensa/boot/boot-elf/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only boot.lds diff --git a/arch/xtensa/boot/lib/.gitignore b/arch/xtensa/boot/lib/.gitignore index 1629a6167755..805a8249252a 100644 --- a/arch/xtensa/boot/lib/.gitignore +++ b/arch/xtensa/boot/lib/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only inffast.c inflate.c inftrees.c diff --git a/arch/xtensa/kernel/.gitignore b/arch/xtensa/kernel/.gitignore index c5f676c3c224..bbb90f92d051 100644 --- a/arch/xtensa/kernel/.gitignore +++ b/arch/xtensa/kernel/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only vmlinux.lds diff --git a/certs/.gitignore b/certs/.gitignore index 4d58ba042b37..2a2483990686 100644 --- a/certs/.gitignore +++ b/certs/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only x509_certificate_list diff --git a/drivers/atm/.gitignore b/drivers/atm/.gitignore index 19f3ffbd1d65..ddd374e91965 100644 --- a/drivers/atm/.gitignore +++ b/drivers/atm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only fore200e_mkfirm fore200e_pca_fw.c pca200e.bin diff --git a/drivers/crypto/vmx/.gitignore b/drivers/crypto/vmx/.gitignore index af4a7ce4738d..7aa71d83f739 100644 --- a/drivers/crypto/vmx/.gitignore +++ b/drivers/crypto/vmx/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only aesp8-ppc.S ghashp8-ppc.S diff --git a/drivers/eisa/.gitignore b/drivers/eisa/.gitignore index 4b335c0aedb0..7d0a2ad5abe2 100644 --- a/drivers/eisa/.gitignore +++ b/drivers/eisa/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only devlist.h diff --git a/drivers/gpu/drm/i915/.gitignore b/drivers/gpu/drm/i915/.gitignore index d9a77f3b59b2..81972dce1aff 100644 --- a/drivers/gpu/drm/i915/.gitignore +++ b/drivers/gpu/drm/i915/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only *.hdrtest diff --git a/drivers/gpu/drm/radeon/.gitignore b/drivers/gpu/drm/radeon/.gitignore index 403eb3a5891f..9c1a94153983 100644 --- a/drivers/gpu/drm/radeon/.gitignore +++ b/drivers/gpu/drm/radeon/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only mkregtable *_reg_safe.h diff --git a/drivers/memory/.gitignore b/drivers/memory/.gitignore index cbca8b028437..caedc4c7d2db 100644 --- a/drivers/memory/.gitignore +++ b/drivers/memory/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only ti-emif-asm-offsets.h diff --git a/drivers/net/wan/.gitignore b/drivers/net/wan/.gitignore index dae3ea6bb18c..247bfbf10912 100644 --- a/drivers/net/wan/.gitignore +++ b/drivers/net/wan/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only wanxlfw.inc diff --git a/drivers/scsi/.gitignore b/drivers/scsi/.gitignore index e2956741fbd1..5f65cb75f534 100644 --- a/drivers/scsi/.gitignore +++ b/drivers/scsi/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only 53c700_d.h scsi_devinfo_tbl.c diff --git a/drivers/scsi/aic7xxx/.gitignore b/drivers/scsi/aic7xxx/.gitignore index b8ee24d5748a..9aa780221718 100644 --- a/drivers/scsi/aic7xxx/.gitignore +++ b/drivers/scsi/aic7xxx/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only aic79xx_reg.h aic79xx_reg_print.c aic79xx_seq.h diff --git a/drivers/staging/comedi/drivers/ni_routing/tools/.gitignore b/drivers/staging/comedi/drivers/ni_routing/tools/.gitignore index ef38008280a9..e3ebffcd900e 100644 --- a/drivers/staging/comedi/drivers/ni_routing/tools/.gitignore +++ b/drivers/staging/comedi/drivers/ni_routing/tools/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only comedi_h.py *.pyc ni_values.py diff --git a/drivers/staging/greybus/tools/.gitignore b/drivers/staging/greybus/tools/.gitignore index 023654c83068..1fd364aba774 100644 --- a/drivers/staging/greybus/tools/.gitignore +++ b/drivers/staging/greybus/tools/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only loopback_test diff --git a/drivers/video/logo/.gitignore b/drivers/video/logo/.gitignore index 1551a75afdbd..5311d207c0b9 100644 --- a/drivers/video/logo/.gitignore +++ b/drivers/video/logo/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *_mono.c *_vga16.c *_clut224.c diff --git a/drivers/zorro/.gitignore b/drivers/zorro/.gitignore index 34f980bd8ff6..acd6ffb8d77d 100644 --- a/drivers/zorro/.gitignore +++ b/drivers/zorro/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only devlist.h gen-devlist diff --git a/fs/unicode/.gitignore b/fs/unicode/.gitignore index 0381e2221480..9b2467e77b2d 100644 --- a/fs/unicode/.gitignore +++ b/fs/unicode/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only mkutf8data utf8data.h diff --git a/kernel/.gitignore b/kernel/.gitignore index 0a423a3ca2e1..78701ea37c97 100644 --- a/kernel/.gitignore +++ b/kernel/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only kheaders.md5 timeconst.h hz.bc diff --git a/kernel/debug/kdb/.gitignore b/kernel/debug/kdb/.gitignore index 396d12eda9e8..df259542a236 100644 --- a/kernel/debug/kdb/.gitignore +++ b/kernel/debug/kdb/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only gen-kdb_cmds.c diff --git a/lib/.gitignore b/lib/.gitignore index 9af73655a239..327cb2c7f2c9 100644 --- a/lib/.gitignore +++ b/lib/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only gen_crc32table gen_crc64table crc32table.h diff --git a/lib/raid6/.gitignore b/lib/raid6/.gitignore index 3de0d8921286..6be57745afd1 100644 --- a/lib/raid6/.gitignore +++ b/lib/raid6/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only mktables altivec*.c int*.c diff --git a/net/bpfilter/.gitignore b/net/bpfilter/.gitignore index e97084e3eea2..f34e85ee8204 100644 --- a/net/bpfilter/.gitignore +++ b/net/bpfilter/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only bpfilter_umh diff --git a/net/wireless/.gitignore b/net/wireless/.gitignore index 61cbc304a3d3..1a29cd69d6cf 100644 --- a/net/wireless/.gitignore +++ b/net/wireless/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only shipped-certs.c extra-certs.c diff --git a/samples/auxdisplay/.gitignore b/samples/auxdisplay/.gitignore index 7af222860a96..2ed744c0e741 100644 --- a/samples/auxdisplay/.gitignore +++ b/samples/auxdisplay/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only cfag12864b-example diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore index 74d31fd3c99c..23837f2ed458 100644 --- a/samples/bpf/.gitignore +++ b/samples/bpf/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only cpustat fds_example hbm diff --git a/samples/connector/.gitignore b/samples/connector/.gitignore index d2b9c32accd4..d86f2ff9c947 100644 --- a/samples/connector/.gitignore +++ b/samples/connector/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only ucon diff --git a/samples/hidraw/.gitignore b/samples/hidraw/.gitignore index 05e51a685242..d7a6074ebcf9 100644 --- a/samples/hidraw/.gitignore +++ b/samples/hidraw/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only hid-example diff --git a/samples/mei/.gitignore b/samples/mei/.gitignore index f356b81ca1ec..db5e802f041e 100644 --- a/samples/mei/.gitignore +++ b/samples/mei/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only mei-amt-version diff --git a/samples/mic/mpssd/.gitignore b/samples/mic/mpssd/.gitignore index 8b7c72f07c92..aa03f1eb37a0 100644 --- a/samples/mic/mpssd/.gitignore +++ b/samples/mic/mpssd/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only mpssd diff --git a/samples/pidfd/.gitignore b/samples/pidfd/.gitignore index be52b3ba6e4b..eea857fca736 100644 --- a/samples/pidfd/.gitignore +++ b/samples/pidfd/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only pidfd-metadata diff --git a/samples/seccomp/.gitignore b/samples/seccomp/.gitignore index d1e2e817d556..4a5a5b7db30b 100644 --- a/samples/seccomp/.gitignore +++ b/samples/seccomp/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only bpf-direct bpf-fancy dropper diff --git a/samples/timers/.gitignore b/samples/timers/.gitignore index c5c45d7ec0df..40510c33cf08 100644 --- a/samples/timers/.gitignore +++ b/samples/timers/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only hpet_example diff --git a/samples/vfs/.gitignore b/samples/vfs/.gitignore index 0806eb0be62d..8fdabf7e5373 100644 --- a/samples/vfs/.gitignore +++ b/samples/vfs/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only test-fsmount test-statx diff --git a/samples/watchdog/.gitignore b/samples/watchdog/.gitignore index ff0ebb540333..74153b831244 100644 --- a/samples/watchdog/.gitignore +++ b/samples/watchdog/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only watchdog-simple diff --git a/scripts/.gitignore b/scripts/.gitignore index 9fe29efbcb95..0d1c8e217cd7 100644 --- a/scripts/.gitignore +++ b/scripts/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only bin2c kallsyms unifdef diff --git a/scripts/basic/.gitignore b/scripts/basic/.gitignore index a776371a3502..98ae1f509592 100644 --- a/scripts/basic/.gitignore +++ b/scripts/basic/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only fixdep diff --git a/scripts/dtc/.gitignore b/scripts/dtc/.gitignore index 2e6e60d64ede..b814e6076bdb 100644 --- a/scripts/dtc/.gitignore +++ b/scripts/dtc/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only dtc diff --git a/scripts/gcc-plugins/.gitignore b/scripts/gcc-plugins/.gitignore index de92ed9e3d83..b04e0f0f033e 100644 --- a/scripts/gcc-plugins/.gitignore +++ b/scripts/gcc-plugins/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only randomize_layout_seed.h diff --git a/scripts/gdb/linux/.gitignore b/scripts/gdb/linux/.gitignore index 2573543842d0..43234cbcb529 100644 --- a/scripts/gdb/linux/.gitignore +++ b/scripts/gdb/linux/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.pyc *.pyo constants.py diff --git a/scripts/genksyms/.gitignore b/scripts/genksyms/.gitignore index b119c7da2863..999af710f83d 100644 --- a/scripts/genksyms/.gitignore +++ b/scripts/genksyms/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only genksyms diff --git a/scripts/kconfig/.gitignore b/scripts/kconfig/.gitignore index 588988711e07..12a67fdab541 100644 --- a/scripts/kconfig/.gitignore +++ b/scripts/kconfig/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.moc *conf-cfg diff --git a/scripts/mod/.gitignore b/scripts/mod/.gitignore index 3bd11b603173..07e4a39f90a6 100644 --- a/scripts/mod/.gitignore +++ b/scripts/mod/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only elfconfig.h mk_elfconfig modpost diff --git a/scripts/selinux/genheaders/.gitignore b/scripts/selinux/genheaders/.gitignore index 4c0b646ff8d5..5fcadd307908 100644 --- a/scripts/selinux/genheaders/.gitignore +++ b/scripts/selinux/genheaders/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only genheaders diff --git a/scripts/selinux/mdp/.gitignore b/scripts/selinux/mdp/.gitignore index 0d9f827dc14b..a7482287e77f 100644 --- a/scripts/selinux/mdp/.gitignore +++ b/scripts/selinux/mdp/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only mdp diff --git a/security/apparmor/.gitignore b/security/apparmor/.gitignore index 0ace1d1dec44..6d1eb1c15c18 100644 --- a/security/apparmor/.gitignore +++ b/security/apparmor/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only net_names.h capability_names.h rlim_names.h diff --git a/security/selinux/.gitignore b/security/selinux/.gitignore index 2e5040a3d48b..168fae13ca5a 100644 --- a/security/selinux/.gitignore +++ b/security/selinux/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only av_permissions.h flask.h diff --git a/security/tomoyo/.gitignore b/security/tomoyo/.gitignore index dc0f220a210b..9f300cdce362 100644 --- a/security/tomoyo/.gitignore +++ b/security/tomoyo/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only builtin-policy.h policy/*.conf diff --git a/sound/oss/.gitignore b/sound/oss/.gitignore index 8fd8fd3eff62..ac678430408b 100644 --- a/sound/oss/.gitignore +++ b/sound/oss/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only pss_boot.h trix_boot.h diff --git a/tools/accounting/.gitignore b/tools/accounting/.gitignore index 86485203c4ae..c45fb4ed4309 100644 --- a/tools/accounting/.gitignore +++ b/tools/accounting/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only getdelays diff --git a/tools/bootconfig/.gitignore b/tools/bootconfig/.gitignore index e7644dfaa4a7..b77513cae685 100644 --- a/tools/bootconfig/.gitignore +++ b/tools/bootconfig/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only bootconfig diff --git a/tools/bpf/.gitignore b/tools/bpf/.gitignore index 59024197e71d..cf53342175e7 100644 --- a/tools/bpf/.gitignore +++ b/tools/bpf/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only FEATURE-DUMP.bpf feature bpf_asm diff --git a/tools/bpf/bpftool/.gitignore b/tools/bpf/bpftool/.gitignore index b13926432b84..5c232659a98b 100644 --- a/tools/bpf/bpftool/.gitignore +++ b/tools/bpf/bpftool/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.d /bpftool bpftool*.8 diff --git a/tools/bpf/runqslower/.gitignore b/tools/bpf/runqslower/.gitignore index 90a456a2a72f..ffdb70230c8b 100644 --- a/tools/bpf/runqslower/.gitignore +++ b/tools/bpf/runqslower/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only /.output diff --git a/tools/build/.gitignore b/tools/build/.gitignore index a776371a3502..98ae1f509592 100644 --- a/tools/build/.gitignore +++ b/tools/build/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only fixdep diff --git a/tools/build/feature/.gitignore b/tools/build/feature/.gitignore index 09b335b98842..15fcd34acdb9 100644 --- a/tools/build/feature/.gitignore +++ b/tools/build/feature/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.d *.bin *.output diff --git a/tools/cgroup/.gitignore b/tools/cgroup/.gitignore index 633cd9b874f9..46a82775f2ca 100644 --- a/tools/cgroup/.gitignore +++ b/tools/cgroup/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only cgroup_event_listener diff --git a/tools/gpio/.gitignore b/tools/gpio/.gitignore index a94c0e83b209..d0a66c48865c 100644 --- a/tools/gpio/.gitignore +++ b/tools/gpio/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only gpio-event-mon gpio-hammer lsgpio diff --git a/tools/iio/.gitignore b/tools/iio/.gitignore index 3758202618bd..5bd6f4df98b7 100644 --- a/tools/iio/.gitignore +++ b/tools/iio/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only iio_event_monitor iio_generic_buffer lsiio diff --git a/tools/laptop/dslm/.gitignore b/tools/laptop/dslm/.gitignore index 9fc984e64386..f7f1296b96ae 100644 --- a/tools/laptop/dslm/.gitignore +++ b/tools/laptop/dslm/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only dslm diff --git a/tools/leds/.gitignore b/tools/leds/.gitignore index ac96d9f53dfc..06bd3ee1b7c9 100644 --- a/tools/leds/.gitignore +++ b/tools/leds/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only uledmon diff --git a/tools/lib/bpf/.gitignore b/tools/lib/bpf/.gitignore index e97c2ebcf447..8a81b3679d2b 100644 --- a/tools/lib/bpf/.gitignore +++ b/tools/lib/bpf/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only libbpf_version.h libbpf.pc FEATURE-DUMP.libbpf diff --git a/tools/lib/lockdep/.gitignore b/tools/lib/lockdep/.gitignore index cc0e7a9f99e3..6c308ac4388c 100644 --- a/tools/lib/lockdep/.gitignore +++ b/tools/lib/lockdep/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only liblockdep.so.* diff --git a/tools/lib/traceevent/.gitignore b/tools/lib/traceevent/.gitignore index 9e9f25fb1922..7123c70b9ebc 100644 --- a/tools/lib/traceevent/.gitignore +++ b/tools/lib/traceevent/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only TRACEEVENT-CFLAGS libtraceevent-dynamic-list libtraceevent.so.* diff --git a/tools/memory-model/.gitignore b/tools/memory-model/.gitignore index b1d34c52f3c3..cf4cd66d8fbf 100644 --- a/tools/memory-model/.gitignore +++ b/tools/memory-model/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only litmus diff --git a/tools/memory-model/litmus-tests/.gitignore b/tools/memory-model/litmus-tests/.gitignore index 6e2ddc54152f..c492a1ddad91 100644 --- a/tools/memory-model/litmus-tests/.gitignore +++ b/tools/memory-model/litmus-tests/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only *.litmus.out diff --git a/tools/objtool/.gitignore b/tools/objtool/.gitignore index 914cff12899b..45cefda24c7b 100644 --- a/tools/objtool/.gitignore +++ b/tools/objtool/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only arch/x86/lib/inat-tables.c objtool fixdep diff --git a/tools/pcmcia/.gitignore b/tools/pcmcia/.gitignore index 53d081336757..94cb97b77f06 100644 --- a/tools/pcmcia/.gitignore +++ b/tools/pcmcia/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only crc32hash diff --git a/tools/perf/.gitignore b/tools/perf/.gitignore index bf1252dc2cb0..f3f84781fd74 100644 --- a/tools/perf/.gitignore +++ b/tools/perf/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only PERF-CFLAGS PERF-GUI-VARS PERF-VERSION-FILE diff --git a/tools/perf/tests/.gitignore b/tools/perf/tests/.gitignore index 8cc30e731c73..d053b325f728 100644 --- a/tools/perf/tests/.gitignore +++ b/tools/perf/tests/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only llvm-src-base.c llvm-src-kbuild.c llvm-src-prologue.c diff --git a/tools/power/acpi/.gitignore b/tools/power/acpi/.gitignore index f698a0e5bfa6..0b319fc8bb17 100644 --- a/tools/power/acpi/.gitignore +++ b/tools/power/acpi/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only /acpidbg /acpidump /ec diff --git a/tools/power/cpupower/.gitignore b/tools/power/cpupower/.gitignore index 1f9977cc609c..7677329c42a6 100644 --- a/tools/power/cpupower/.gitignore +++ b/tools/power/cpupower/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only .libs libcpupower.so libcpupower.so.* diff --git a/tools/power/x86/intel-speed-select/.gitignore b/tools/power/x86/intel-speed-select/.gitignore index f61145925ce9..a814f89fe75f 100644 --- a/tools/power/x86/intel-speed-select/.gitignore +++ b/tools/power/x86/intel-speed-select/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only include/ intel-speed-select diff --git a/tools/power/x86/turbostat/.gitignore b/tools/power/x86/turbostat/.gitignore index 7521370d3568..e13109b43cd1 100644 --- a/tools/power/x86/turbostat/.gitignore +++ b/tools/power/x86/turbostat/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only turbostat diff --git a/tools/spi/.gitignore b/tools/spi/.gitignore index 4280576397e8..14ddba3d2195 100644 --- a/tools/spi/.gitignore +++ b/tools/spi/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only spidev_fdx spidev_test diff --git a/tools/testing/kunit/.gitignore b/tools/testing/kunit/.gitignore index c791ff59a37a..1c63e31f7edf 100644 --- a/tools/testing/kunit/.gitignore +++ b/tools/testing/kunit/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] \ No newline at end of file diff --git a/tools/testing/radix-tree/.gitignore b/tools/testing/radix-tree/.gitignore index 3834899b6693..d971516401e6 100644 --- a/tools/testing/radix-tree/.gitignore +++ b/tools/testing/radix-tree/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only generated/map-shift.h idr.c idr-test diff --git a/tools/testing/selftests/.gitignore b/tools/testing/selftests/.gitignore index 61df01cdf0b2..ac0505cef4ae 100644 --- a/tools/testing/selftests/.gitignore +++ b/tools/testing/selftests/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only gpiogpio-event-mon gpiogpio-hammer gpioinclude/ diff --git a/tools/testing/selftests/android/ion/.gitignore b/tools/testing/selftests/android/ion/.gitignore index 95e8f4561474..78eae9972bb1 100644 --- a/tools/testing/selftests/android/ion/.gitignore +++ b/tools/testing/selftests/android/ion/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only ionapp_export ionapp_import ionmap_test diff --git a/tools/testing/selftests/arm64/signal/.gitignore b/tools/testing/selftests/arm64/signal/.gitignore index 3c5b4e8ff894..78c902045ca7 100644 --- a/tools/testing/selftests/arm64/signal/.gitignore +++ b/tools/testing/selftests/arm64/signal/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only mangle_* fake_sigreturn_* !*.[ch] diff --git a/tools/testing/selftests/arm64/tags/.gitignore b/tools/testing/selftests/arm64/tags/.gitignore index e8fae8d61ed6..f4f6c5112463 100644 --- a/tools/testing/selftests/arm64/tags/.gitignore +++ b/tools/testing/selftests/arm64/tags/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only tags_test diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index ec464859c6b6..e759d7eb1297 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only test_verifier test_maps test_lru_map diff --git a/tools/testing/selftests/bpf/map_tests/.gitignore b/tools/testing/selftests/bpf/map_tests/.gitignore index 45984a364647..89c4a3d37544 100644 --- a/tools/testing/selftests/bpf/map_tests/.gitignore +++ b/tools/testing/selftests/bpf/map_tests/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only tests.h diff --git a/tools/testing/selftests/bpf/prog_tests/.gitignore b/tools/testing/selftests/bpf/prog_tests/.gitignore index 45984a364647..89c4a3d37544 100644 --- a/tools/testing/selftests/bpf/prog_tests/.gitignore +++ b/tools/testing/selftests/bpf/prog_tests/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only tests.h diff --git a/tools/testing/selftests/bpf/verifier/.gitignore b/tools/testing/selftests/bpf/verifier/.gitignore index 45984a364647..89c4a3d37544 100644 --- a/tools/testing/selftests/bpf/verifier/.gitignore +++ b/tools/testing/selftests/bpf/verifier/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only tests.h diff --git a/tools/testing/selftests/breakpoints/.gitignore b/tools/testing/selftests/breakpoints/.gitignore index a23bb4a6f06c..def2e97dab9a 100644 --- a/tools/testing/selftests/breakpoints/.gitignore +++ b/tools/testing/selftests/breakpoints/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only breakpoint_test step_after_suspend_test diff --git a/tools/testing/selftests/capabilities/.gitignore b/tools/testing/selftests/capabilities/.gitignore index b732dd0d4738..426d9adca67c 100644 --- a/tools/testing/selftests/capabilities/.gitignore +++ b/tools/testing/selftests/capabilities/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only test_execve validate_cap diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore index 7f9835624793..aa6de65b0838 100644 --- a/tools/testing/selftests/cgroup/.gitignore +++ b/tools/testing/selftests/cgroup/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only test_memcontrol test_core test_freezer diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore index 0dc4f32c6cb8..a81085742d40 100644 --- a/tools/testing/selftests/clone3/.gitignore +++ b/tools/testing/selftests/clone3/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only clone3 clone3_clear_sighand clone3_set_tid diff --git a/tools/testing/selftests/drivers/.gitignore b/tools/testing/selftests/drivers/.gitignore index f6aebcc27b76..ca74f2e1c719 100644 --- a/tools/testing/selftests/drivers/.gitignore +++ b/tools/testing/selftests/drivers/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only /dma-buf/udmabuf diff --git a/tools/testing/selftests/efivarfs/.gitignore b/tools/testing/selftests/efivarfs/.gitignore index 33618493562b..807407f7f58b 100644 --- a/tools/testing/selftests/efivarfs/.gitignore +++ b/tools/testing/selftests/efivarfs/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only create-read open-unlink diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore index b02279da6fa1..c078ece12ff0 100644 --- a/tools/testing/selftests/exec/.gitignore +++ b/tools/testing/selftests/exec/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only subdir* script* execveat diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore index 8449cf6716ce..f0c0ff20d6cf 100644 --- a/tools/testing/selftests/filesystems/.gitignore +++ b/tools/testing/selftests/filesystems/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only dnotify_test devpts_pts diff --git a/tools/testing/selftests/filesystems/binderfs/.gitignore b/tools/testing/selftests/filesystems/binderfs/.gitignore index 8a5d9bf63dd4..8e5cf9084894 100644 --- a/tools/testing/selftests/filesystems/binderfs/.gitignore +++ b/tools/testing/selftests/filesystems/binderfs/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only binderfs_test diff --git a/tools/testing/selftests/filesystems/epoll/.gitignore b/tools/testing/selftests/filesystems/epoll/.gitignore index 9ae8db44ec14..9090157258b1 100644 --- a/tools/testing/selftests/filesystems/epoll/.gitignore +++ b/tools/testing/selftests/filesystems/epoll/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only epoll_wakeup_test diff --git a/tools/testing/selftests/ftrace/.gitignore b/tools/testing/selftests/ftrace/.gitignore index 98d8a5a63049..2659417cb2c7 100644 --- a/tools/testing/selftests/ftrace/.gitignore +++ b/tools/testing/selftests/ftrace/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only logs diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore index a09f57061902..0efcd494daab 100644 --- a/tools/testing/selftests/futex/functional/.gitignore +++ b/tools/testing/selftests/futex/functional/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only futex_requeue_pi futex_requeue_pi_mismatched_ops futex_requeue_pi_signal_restart diff --git a/tools/testing/selftests/gpio/.gitignore b/tools/testing/selftests/gpio/.gitignore index 7d14f743d1a4..4c69408f3e84 100644 --- a/tools/testing/selftests/gpio/.gitignore +++ b/tools/testing/selftests/gpio/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only gpio-mockup-chardev diff --git a/tools/testing/selftests/ia64/.gitignore b/tools/testing/selftests/ia64/.gitignore index ab806edc8732..e962fb2a08d5 100644 --- a/tools/testing/selftests/ia64/.gitignore +++ b/tools/testing/selftests/ia64/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only aliasing-test diff --git a/tools/testing/selftests/intel_pstate/.gitignore b/tools/testing/selftests/intel_pstate/.gitignore index 3bfcbae5fa13..862de222a3f3 100644 --- a/tools/testing/selftests/intel_pstate/.gitignore +++ b/tools/testing/selftests/intel_pstate/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only aperf msr diff --git a/tools/testing/selftests/ipc/.gitignore b/tools/testing/selftests/ipc/.gitignore index 9af04c9353c0..9ed280e4c704 100644 --- a/tools/testing/selftests/ipc/.gitignore +++ b/tools/testing/selftests/ipc/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only msgque_test msgque diff --git a/tools/testing/selftests/ir/.gitignore b/tools/testing/selftests/ir/.gitignore index 070ea0c75fb8..0bbada8c1811 100644 --- a/tools/testing/selftests/ir/.gitignore +++ b/tools/testing/selftests/ir/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only ir_loopback diff --git a/tools/testing/selftests/kcmp/.gitignore b/tools/testing/selftests/kcmp/.gitignore index 5a9b3732b2de..38ccdfe80ef7 100644 --- a/tools/testing/selftests/kcmp/.gitignore +++ b/tools/testing/selftests/kcmp/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only kcmp_test kcmp-test-file diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index 30072c3f52fb..2f85dc944fbd 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only /s390x/sync_regs_test /s390x/memop /x86_64/cr4_cpuid_sync_test diff --git a/tools/testing/selftests/media_tests/.gitignore b/tools/testing/selftests/media_tests/.gitignore index 8745eba39012..da438e780ffe 100644 --- a/tools/testing/selftests/media_tests/.gitignore +++ b/tools/testing/selftests/media_tests/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only media_device_test media_device_open video_device_test diff --git a/tools/testing/selftests/membarrier/.gitignore b/tools/testing/selftests/membarrier/.gitignore index f2f7ec0a99b4..f2fbba178601 100644 --- a/tools/testing/selftests/membarrier/.gitignore +++ b/tools/testing/selftests/membarrier/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only membarrier_test_multi_thread membarrier_test_single_thread diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore index afe87c40ac80..dd9a051f608e 100644 --- a/tools/testing/selftests/memfd/.gitignore +++ b/tools/testing/selftests/memfd/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only fuse_mnt fuse_test memfd_test diff --git a/tools/testing/selftests/mount/.gitignore b/tools/testing/selftests/mount/.gitignore index 856ad4107eb3..0bc64a6d4c18 100644 --- a/tools/testing/selftests/mount/.gitignore +++ b/tools/testing/selftests/mount/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only unprivileged-remount-test diff --git a/tools/testing/selftests/mqueue/.gitignore b/tools/testing/selftests/mqueue/.gitignore index d8d42377205a..72ad8ca691c9 100644 --- a/tools/testing/selftests/mqueue/.gitignore +++ b/tools/testing/selftests/mqueue/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only mq_open_tests mq_perf_tests diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index ecc52d4c034d..08ad1a7109e2 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only msg_zerocopy socket psock_fanout diff --git a/tools/testing/selftests/net/forwarding/.gitignore b/tools/testing/selftests/net/forwarding/.gitignore index a793eef5b876..2dea317f12e7 100644 --- a/tools/testing/selftests/net/forwarding/.gitignore +++ b/tools/testing/selftests/net/forwarding/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only forwarding.config diff --git a/tools/testing/selftests/net/mptcp/.gitignore b/tools/testing/selftests/net/mptcp/.gitignore index d72f07642738..beea6541fb21 100644 --- a/tools/testing/selftests/net/mptcp/.gitignore +++ b/tools/testing/selftests/net/mptcp/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only mptcp_connect *.pcap diff --git a/tools/testing/selftests/networking/timestamping/.gitignore b/tools/testing/selftests/networking/timestamping/.gitignore index d9355035e746..f4f031db8bbf 100644 --- a/tools/testing/selftests/networking/timestamping/.gitignore +++ b/tools/testing/selftests/networking/timestamping/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only timestamping rxtimestamp txtimestamp diff --git a/tools/testing/selftests/nsfs/.gitignore b/tools/testing/selftests/nsfs/.gitignore index 2ab2c824ce86..ed79ebdf286e 100644 --- a/tools/testing/selftests/nsfs/.gitignore +++ b/tools/testing/selftests/nsfs/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only owner pidns diff --git a/tools/testing/selftests/openat2/.gitignore b/tools/testing/selftests/openat2/.gitignore index bd68f6c3fd07..82a4846cbc4b 100644 --- a/tools/testing/selftests/openat2/.gitignore +++ b/tools/testing/selftests/openat2/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only /*_test diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore index 39559d723c41..2d4db5afb142 100644 --- a/tools/testing/selftests/pidfd/.gitignore +++ b/tools/testing/selftests/pidfd/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only pidfd_open_test pidfd_poll_test pidfd_test diff --git a/tools/testing/selftests/powerpc/alignment/.gitignore b/tools/testing/selftests/powerpc/alignment/.gitignore index 6d4fd014511c..28bc6ca13cc6 100644 --- a/tools/testing/selftests/powerpc/alignment/.gitignore +++ b/tools/testing/selftests/powerpc/alignment/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only copy_first_unaligned alignment_handler diff --git a/tools/testing/selftests/powerpc/benchmarks/.gitignore b/tools/testing/selftests/powerpc/benchmarks/.gitignore index 9161679b1e1a..c9ce13983c99 100644 --- a/tools/testing/selftests/powerpc/benchmarks/.gitignore +++ b/tools/testing/selftests/powerpc/benchmarks/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only gettimeofday context_switch fork diff --git a/tools/testing/selftests/powerpc/cache_shape/.gitignore b/tools/testing/selftests/powerpc/cache_shape/.gitignore index ec1848434be5..b385eee3012c 100644 --- a/tools/testing/selftests/powerpc/cache_shape/.gitignore +++ b/tools/testing/selftests/powerpc/cache_shape/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only cache_shape diff --git a/tools/testing/selftests/powerpc/copyloops/.gitignore b/tools/testing/selftests/powerpc/copyloops/.gitignore index 12ef5b031974..ddaf140b8255 100644 --- a/tools/testing/selftests/powerpc/copyloops/.gitignore +++ b/tools/testing/selftests/powerpc/copyloops/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only copyuser_64_t0 copyuser_64_t1 copyuser_64_t2 diff --git a/tools/testing/selftests/powerpc/dscr/.gitignore b/tools/testing/selftests/powerpc/dscr/.gitignore index b585c6c1564a..1d08b15af697 100644 --- a/tools/testing/selftests/powerpc/dscr/.gitignore +++ b/tools/testing/selftests/powerpc/dscr/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only dscr_default_test dscr_explicit_test dscr_inherit_exec_test diff --git a/tools/testing/selftests/powerpc/math/.gitignore b/tools/testing/selftests/powerpc/math/.gitignore index 50ded63e25b7..e31ca6f453ed 100644 --- a/tools/testing/selftests/powerpc/math/.gitignore +++ b/tools/testing/selftests/powerpc/math/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only fpu_syscall vmx_syscall fpu_preempt diff --git a/tools/testing/selftests/powerpc/mm/.gitignore b/tools/testing/selftests/powerpc/mm/.gitignore index 0ebeaea22641..7cf7ad261d02 100644 --- a/tools/testing/selftests/powerpc/mm/.gitignore +++ b/tools/testing/selftests/powerpc/mm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only hugetlb_vs_thp_test subpage_prot tempfile diff --git a/tools/testing/selftests/powerpc/pmu/.gitignore b/tools/testing/selftests/powerpc/pmu/.gitignore index e748f336eed3..ff7896903d7b 100644 --- a/tools/testing/selftests/powerpc/pmu/.gitignore +++ b/tools/testing/selftests/powerpc/pmu/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only count_instructions l3_bank_test per_event_excludes diff --git a/tools/testing/selftests/powerpc/pmu/ebb/.gitignore b/tools/testing/selftests/powerpc/pmu/ebb/.gitignore index 42bddbed8b64..2920fb39439b 100644 --- a/tools/testing/selftests/powerpc/pmu/ebb/.gitignore +++ b/tools/testing/selftests/powerpc/pmu/ebb/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only reg_access_test event_attributes_test cycles_test diff --git a/tools/testing/selftests/powerpc/primitives/.gitignore b/tools/testing/selftests/powerpc/primitives/.gitignore index 4cc4e31bed1d..1e5c04e24254 100644 --- a/tools/testing/selftests/powerpc/primitives/.gitignore +++ b/tools/testing/selftests/powerpc/primitives/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only load_unaligned_zeropad diff --git a/tools/testing/selftests/powerpc/ptrace/.gitignore b/tools/testing/selftests/powerpc/ptrace/.gitignore index dce19f221c46..0e96150b7c7e 100644 --- a/tools/testing/selftests/powerpc/ptrace/.gitignore +++ b/tools/testing/selftests/powerpc/ptrace/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only ptrace-gpr ptrace-tm-gpr ptrace-tm-spd-gpr diff --git a/tools/testing/selftests/powerpc/security/.gitignore b/tools/testing/selftests/powerpc/security/.gitignore index 0b969fba3beb..f795e06f5ae3 100644 --- a/tools/testing/selftests/powerpc/security/.gitignore +++ b/tools/testing/selftests/powerpc/security/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only rfi_flush diff --git a/tools/testing/selftests/powerpc/signal/.gitignore b/tools/testing/selftests/powerpc/signal/.gitignore index dca5852a1546..f897b55a44dd 100644 --- a/tools/testing/selftests/powerpc/signal/.gitignore +++ b/tools/testing/selftests/powerpc/signal/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only signal signal_tm sigfuz diff --git a/tools/testing/selftests/powerpc/stringloops/.gitignore b/tools/testing/selftests/powerpc/stringloops/.gitignore index 31a17e0ba884..b0dfc74aa57e 100644 --- a/tools/testing/selftests/powerpc/stringloops/.gitignore +++ b/tools/testing/selftests/powerpc/stringloops/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only memcmp_64 memcmp_32 strlen diff --git a/tools/testing/selftests/powerpc/switch_endian/.gitignore b/tools/testing/selftests/powerpc/switch_endian/.gitignore index 89e762eab676..30e962cf84d1 100644 --- a/tools/testing/selftests/powerpc/switch_endian/.gitignore +++ b/tools/testing/selftests/powerpc/switch_endian/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only switch_endian_test check-reversed.S diff --git a/tools/testing/selftests/powerpc/syscalls/.gitignore b/tools/testing/selftests/powerpc/syscalls/.gitignore index f0f3fcc9d802..b00cab225476 100644 --- a/tools/testing/selftests/powerpc/syscalls/.gitignore +++ b/tools/testing/selftests/powerpc/syscalls/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only ipc_unmuxed diff --git a/tools/testing/selftests/powerpc/tm/.gitignore b/tools/testing/selftests/powerpc/tm/.gitignore index 98f2708d86cc..7baf2a46002f 100644 --- a/tools/testing/selftests/powerpc/tm/.gitignore +++ b/tools/testing/selftests/powerpc/tm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only tm-resched-dscr tm-syscall tm-signal-msr-resv diff --git a/tools/testing/selftests/powerpc/vphn/.gitignore b/tools/testing/selftests/powerpc/vphn/.gitignore index 7c04395010cb..b744aedfd1f2 100644 --- a/tools/testing/selftests/powerpc/vphn/.gitignore +++ b/tools/testing/selftests/powerpc/vphn/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only test-vphn diff --git a/tools/testing/selftests/prctl/.gitignore b/tools/testing/selftests/prctl/.gitignore index 0b5c27447bf6..91af2b631bc9 100644 --- a/tools/testing/selftests/prctl/.gitignore +++ b/tools/testing/selftests/prctl/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test disable-tsc-test diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore index 66fab4c58ed4..4bca5a9327a4 100644 --- a/tools/testing/selftests/proc/.gitignore +++ b/tools/testing/selftests/proc/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only /fd-001-lookup /fd-002-posix-eq /fd-003-kthread diff --git a/tools/testing/selftests/pstore/.gitignore b/tools/testing/selftests/pstore/.gitignore index 5a4a26e5464b..9938fb406389 100644 --- a/tools/testing/selftests/pstore/.gitignore +++ b/tools/testing/selftests/pstore/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only logs *uuid diff --git a/tools/testing/selftests/ptp/.gitignore b/tools/testing/selftests/ptp/.gitignore index f562e49d6917..534ca26eee48 100644 --- a/tools/testing/selftests/ptp/.gitignore +++ b/tools/testing/selftests/ptp/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only testptp diff --git a/tools/testing/selftests/ptrace/.gitignore b/tools/testing/selftests/ptrace/.gitignore index cfcc49a7def7..7bebf9534a86 100644 --- a/tools/testing/selftests/ptrace/.gitignore +++ b/tools/testing/selftests/ptrace/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only get_syscall_info peeksiginfo diff --git a/tools/testing/selftests/rcutorture/.gitignore b/tools/testing/selftests/rcutorture/.gitignore index ccc240275d1c..f6cbce77460b 100644 --- a/tools/testing/selftests/rcutorture/.gitignore +++ b/tools/testing/selftests/rcutorture/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only initrd b[0-9]* res diff --git a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/.gitignore b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/.gitignore index 712a3d41a325..24e27957efcc 100644 --- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/.gitignore +++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only srcu.c diff --git a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/.gitignore b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/.gitignore index 1d016e66980a..57d296341304 100644 --- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/.gitignore +++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/include/linux/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only srcu.h diff --git a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/tests/store_buffering/.gitignore b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/tests/store_buffering/.gitignore index f47cb2045f13..d65462d64816 100644 --- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/tests/store_buffering/.gitignore +++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/tests/store_buffering/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only *.out diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore index cc610da7e369..5910888ebfe1 100644 --- a/tools/testing/selftests/rseq/.gitignore +++ b/tools/testing/selftests/rseq/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only basic_percpu_ops_test basic_test basic_rseq_op_test diff --git a/tools/testing/selftests/rtc/.gitignore b/tools/testing/selftests/rtc/.gitignore index d0ad44f6294a..fb2d533aa575 100644 --- a/tools/testing/selftests/rtc/.gitignore +++ b/tools/testing/selftests/rtc/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only rtctest setdate diff --git a/tools/testing/selftests/safesetid/.gitignore b/tools/testing/selftests/safesetid/.gitignore index 9c1a629bca01..25d3db172907 100644 --- a/tools/testing/selftests/safesetid/.gitignore +++ b/tools/testing/selftests/safesetid/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only safesetid-test diff --git a/tools/testing/selftests/seccomp/.gitignore b/tools/testing/selftests/seccomp/.gitignore index 5af29d3a1b0a..dec678577f9c 100644 --- a/tools/testing/selftests/seccomp/.gitignore +++ b/tools/testing/selftests/seccomp/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only seccomp_bpf seccomp_benchmark diff --git a/tools/testing/selftests/sigaltstack/.gitignore b/tools/testing/selftests/sigaltstack/.gitignore index 35897b0a3f44..50a19a8888ce 100644 --- a/tools/testing/selftests/sigaltstack/.gitignore +++ b/tools/testing/selftests/sigaltstack/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only sas diff --git a/tools/testing/selftests/size/.gitignore b/tools/testing/selftests/size/.gitignore index 189b7818de34..923e18eed1a0 100644 --- a/tools/testing/selftests/size/.gitignore +++ b/tools/testing/selftests/size/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only get_size diff --git a/tools/testing/selftests/sparc64/drivers/.gitignore b/tools/testing/selftests/sparc64/drivers/.gitignore index 90e835ed74e6..0331f77373b5 100644 --- a/tools/testing/selftests/sparc64/drivers/.gitignore +++ b/tools/testing/selftests/sparc64/drivers/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only adi-test diff --git a/tools/testing/selftests/splice/.gitignore b/tools/testing/selftests/splice/.gitignore index 1e23fefd68e8..d5a2da428752 100644 --- a/tools/testing/selftests/splice/.gitignore +++ b/tools/testing/selftests/splice/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only default_file_splice_read diff --git a/tools/testing/selftests/sync/.gitignore b/tools/testing/selftests/sync/.gitignore index f5091e7792f2..f1152357712f 100644 --- a/tools/testing/selftests/sync/.gitignore +++ b/tools/testing/selftests/sync/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only sync_test diff --git a/tools/testing/selftests/tc-testing/.gitignore b/tools/testing/selftests/tc-testing/.gitignore index c26d72e0166f..d52f65de23b4 100644 --- a/tools/testing/selftests/tc-testing/.gitignore +++ b/tools/testing/selftests/tc-testing/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only __pycache__/ *.pyc plugins/ diff --git a/tools/testing/selftests/timens/.gitignore b/tools/testing/selftests/timens/.gitignore index 789f21e81028..2e43851b47c1 100644 --- a/tools/testing/selftests/timens/.gitignore +++ b/tools/testing/selftests/timens/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only clock_nanosleep exec gettime_perf diff --git a/tools/testing/selftests/timers/.gitignore b/tools/testing/selftests/timers/.gitignore index 32a9eadb2d4e..bb5326ff900b 100644 --- a/tools/testing/selftests/timers/.gitignore +++ b/tools/testing/selftests/timers/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only alarmtimer-suspend change_skew clocksource-switch diff --git a/tools/testing/selftests/tmpfs/.gitignore b/tools/testing/selftests/tmpfs/.gitignore index a96838fad74d..b1afaa925905 100644 --- a/tools/testing/selftests/tmpfs/.gitignore +++ b/tools/testing/selftests/tmpfs/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only /bug-link-o-tmpfile diff --git a/tools/testing/selftests/vDSO/.gitignore b/tools/testing/selftests/vDSO/.gitignore index 133bf9ee986c..382cfb39a1a3 100644 --- a/tools/testing/selftests/vDSO/.gitignore +++ b/tools/testing/selftests/vDSO/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only vdso_test vdso_standalone_test_x86 diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 31b3c98b6d34..876c48d7f41d 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only hugepage-mmap hugepage-shm map_hugetlb diff --git a/tools/testing/selftests/watchdog/.gitignore b/tools/testing/selftests/watchdog/.gitignore index 5aac51575c7e..61d7b89cdbca 100644 --- a/tools/testing/selftests/watchdog/.gitignore +++ b/tools/testing/selftests/watchdog/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only watchdog-test diff --git a/tools/testing/selftests/wireguard/qemu/.gitignore b/tools/testing/selftests/wireguard/qemu/.gitignore index 415b542a9d59..bfa15e6feb2f 100644 --- a/tools/testing/selftests/wireguard/qemu/.gitignore +++ b/tools/testing/selftests/wireguard/qemu/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only build/ distfiles/ diff --git a/tools/testing/selftests/x86/.gitignore b/tools/testing/selftests/x86/.gitignore index 7757f73ff9a3..022a1f3b64ef 100644 --- a/tools/testing/selftests/x86/.gitignore +++ b/tools/testing/selftests/x86/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *_32 *_64 single_step_syscall diff --git a/tools/testing/vsock/.gitignore b/tools/testing/vsock/.gitignore index 7f7a2ccc30c4..87ca2731cff9 100644 --- a/tools/testing/vsock/.gitignore +++ b/tools/testing/vsock/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.d vsock_test vsock_diag_test diff --git a/tools/thermal/tmon/.gitignore b/tools/thermal/tmon/.gitignore index 06e96be65276..d9e97a0308f5 100644 --- a/tools/thermal/tmon/.gitignore +++ b/tools/thermal/tmon/.gitignore @@ -1 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only /tmon diff --git a/tools/usb/.gitignore b/tools/usb/.gitignore index 1b7448981435..fce1ef5a9267 100644 --- a/tools/usb/.gitignore +++ b/tools/usb/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only ffs-test testusb diff --git a/tools/usb/usbip/.gitignore b/tools/usb/usbip/.gitignore index 03b892c8bd8c..597361a96dbb 100644 --- a/tools/usb/usbip/.gitignore +++ b/tools/usb/usbip/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only Makefile Makefile.in aclocal.m4 diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore index 1cfbb0157a46..075588c4da08 100644 --- a/tools/virtio/.gitignore +++ b/tools/virtio/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only *.d virtio_test vringh_test diff --git a/tools/vm/.gitignore b/tools/vm/.gitignore index 44f095fa2604..79bb92ae1bb3 100644 --- a/tools/vm/.gitignore +++ b/tools/vm/.gitignore @@ -1,2 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only slabinfo page-types diff --git a/usr/.gitignore b/usr/.gitignore index 610de736b75e..935442ed1eb2 100644 --- a/usr/.gitignore +++ b/usr/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only gen_init_cpio initramfs_data.cpio /initramfs_inc_data diff --git a/usr/include/.gitignore b/usr/include/.gitignore index a0991ff4402b..d2fab782cb7d 100644 --- a/usr/include/.gitignore +++ b/usr/include/.gitignore @@ -1,3 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only * !.gitignore !Makefile -- cgit v1.2.3-71-gd317 From 93ef1429e556f739a208e3968883ed51480580e8 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:50:54 +0000 Subject: cpu/hotplug: Add new {add,remove}_cpu() functions The new functions use device_{online,offline}() which are userspace safe. This is in preparation to move cpu_{up, down} kernel users to use a safer interface that is not racy with userspace. Suggested-by: "Paul E. McKenney" Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Reviewed-by: Paul E. McKenney Link: https://lkml.kernel.org/r/20200323135110.30522-2-qais.yousef@arm.com --- include/linux/cpu.h | 2 ++ kernel/cpu.c | 24 ++++++++++++++++++++++++ 2 files changed, 26 insertions(+) (limited to 'kernel') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 1ca2baf817ed..cf8cf38dca43 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -89,6 +89,7 @@ extern ssize_t arch_cpu_release(const char *, size_t); #ifdef CONFIG_SMP extern bool cpuhp_tasks_frozen; int cpu_up(unsigned int cpu); +int add_cpu(unsigned int cpu); void notify_cpu_starting(unsigned int cpu); extern void cpu_maps_update_begin(void); extern void cpu_maps_update_done(void); @@ -118,6 +119,7 @@ extern void cpu_hotplug_disable(void); extern void cpu_hotplug_enable(void); void clear_tasks_mm_cpumask(int cpu); int cpu_down(unsigned int cpu); +int remove_cpu(unsigned int cpu); #else /* CONFIG_HOTPLUG_CPU */ diff --git a/kernel/cpu.c b/kernel/cpu.c index 9c706af713fb..069802f7010f 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1057,6 +1057,18 @@ int cpu_down(unsigned int cpu) } EXPORT_SYMBOL(cpu_down); +int remove_cpu(unsigned int cpu) +{ + int ret; + + lock_device_hotplug(); + ret = device_offline(get_cpu_device(cpu)); + unlock_device_hotplug(); + + return ret; +} +EXPORT_SYMBOL_GPL(remove_cpu); + #else #define takedown_cpu NULL #endif /*CONFIG_HOTPLUG_CPU*/ @@ -1209,6 +1221,18 @@ int cpu_up(unsigned int cpu) } EXPORT_SYMBOL_GPL(cpu_up); +int add_cpu(unsigned int cpu) +{ + int ret; + + lock_device_hotplug(); + ret = device_online(get_cpu_device(cpu)); + unlock_device_hotplug(); + + return ret; +} +EXPORT_SYMBOL_GPL(add_cpu); + #ifdef CONFIG_PM_SLEEP_SMP static cpumask_var_t frozen_cpus; -- cgit v1.2.3-71-gd317 From 0441a5597c5d02ed7abed1d989eaaa1595dedac5 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:50:55 +0000 Subject: cpu/hotplug: Create a new function to shutdown nonboot cpus This function will be used later in machine_shutdown() for some architectures. disable_nonboot_cpus() is not safe to use when doing machine_down(), because it relies on freeze_secondary_cpus() which in turn is a suspend/resume related freeze and could abort if the logic detects any pending activities that can prevent finishing the offlining process. Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323135110.30522-3-qais.yousef@arm.com --- include/linux/cpu.h | 2 ++ kernel/cpu.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 44 insertions(+) (limited to 'kernel') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index cf8cf38dca43..64a246e9c8db 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -120,6 +120,7 @@ extern void cpu_hotplug_enable(void); void clear_tasks_mm_cpumask(int cpu); int cpu_down(unsigned int cpu); int remove_cpu(unsigned int cpu); +extern void smp_shutdown_nonboot_cpus(unsigned int primary_cpu); #else /* CONFIG_HOTPLUG_CPU */ @@ -131,6 +132,7 @@ static inline int cpus_read_trylock(void) { return true; } static inline void lockdep_assert_cpus_held(void) { } static inline void cpu_hotplug_disable(void) { } static inline void cpu_hotplug_enable(void) { } +static inline void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) { } #endif /* !CONFIG_HOTPLUG_CPU */ /* Wrappers which go away once all code is converted */ diff --git a/kernel/cpu.c b/kernel/cpu.c index 069802f7010f..03c727195b65 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1069,6 +1069,48 @@ int remove_cpu(unsigned int cpu) } EXPORT_SYMBOL_GPL(remove_cpu); +void smp_shutdown_nonboot_cpus(unsigned int primary_cpu) +{ + unsigned int cpu; + int error; + + cpu_maps_update_begin(); + + /* + * Make certain the cpu I'm about to reboot on is online. + * + * This is inline to what migrate_to_reboot_cpu() already do. + */ + if (!cpu_online(primary_cpu)) + primary_cpu = cpumask_first(cpu_online_mask); + + for_each_online_cpu(cpu) { + if (cpu == primary_cpu) + continue; + + error = cpu_down_maps_locked(cpu, CPUHP_OFFLINE); + if (error) { + pr_err("Failed to offline CPU%d - error=%d", + cpu, error); + break; + } + } + + /* + * Ensure all but the reboot CPU are offline. + */ + BUG_ON(num_online_cpus() > 1); + + /* + * Make sure the CPUs won't be enabled by someone else after this + * point. Kexec will reboot to a new kernel shortly resetting + * everything along the way. + */ + cpu_hotplug_disabled++; + + cpu_maps_update_done(); +} + #else #define takedown_cpu NULL #endif /*CONFIG_HOTPLUG_CPU*/ -- cgit v1.2.3-71-gd317 From d720f98604391dab6aa3cb4c1bc005ed1aba4703 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:51:01 +0000 Subject: cpu/hotplug: Provide bringup_hibernate_cpu() arm64 uses cpu_up() in the resume from hibernation code to ensure that the CPU on which the system hibernated is online. Provide a core function for this. [ tglx: Split out from the combo arm64 patch ] Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Acked-by: Catalin Marinas Cc: Will Deacon Link: https://lkml.kernel.org/r/20200323135110.30522-9-qais.yousef@arm.com --- include/linux/cpu.h | 1 + kernel/cpu.c | 23 +++++++++++++++++++++++ 2 files changed, 24 insertions(+) (limited to 'kernel') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 64a246e9c8db..9dc1e892e193 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -93,6 +93,7 @@ int add_cpu(unsigned int cpu); void notify_cpu_starting(unsigned int cpu); extern void cpu_maps_update_begin(void); extern void cpu_maps_update_done(void); +int bringup_hibernate_cpu(unsigned int sleep_cpu); #else /* CONFIG_SMP */ #define cpuhp_tasks_frozen 0 diff --git a/kernel/cpu.c b/kernel/cpu.c index 03c727195b65..f803678684e9 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1275,6 +1275,29 @@ int add_cpu(unsigned int cpu) } EXPORT_SYMBOL_GPL(add_cpu); +/** + * bringup_hibernate_cpu - Bring up the CPU that we hibernated on + * @sleep_cpu: The cpu we hibernated on and should be brought up. + * + * On some architectures like arm64, we can hibernate on any CPU, but on + * wake up the CPU we hibernated on might be offline as a side effect of + * using maxcpus= for example. + */ +int bringup_hibernate_cpu(unsigned int sleep_cpu) +{ + int ret; + + if (!cpu_online(sleep_cpu)) { + pr_info("Hibernated on a CPU that is offline! Bringing CPU up.\n"); + ret = cpu_up(sleep_cpu); + if (ret) { + pr_err("Failed to bring hibernate-CPU up!\n"); + return ret; + } + } + return 0; +} + #ifdef CONFIG_PM_SLEEP_SMP static cpumask_var_t frozen_cpus; -- cgit v1.2.3-71-gd317 From 457bc8ed3ec7edc567200302d9312ac8bbc31316 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:51:08 +0000 Subject: torture: Replace cpu_up/down() with add/remove_cpu() The core device API performs extra housekeeping bits that are missing from directly calling cpu_up/down(). See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and serialization during LPM") for an example description of what might go wrong. This also prepares to make cpu_up/down() a private interface of the CPU subsystem. Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Acked-by: "Paul E. McKenney" Link: https://lkml.kernel.org/r/20200323135110.30522-16-qais.yousef@arm.com --- kernel/torture.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/torture.c b/kernel/torture.c index 7c13f5558b71..a479689eac73 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -97,7 +97,7 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes, torture_type, cpu); starttime = jiffies; (*n_offl_attempts)++; - ret = cpu_down(cpu); + ret = remove_cpu(cpu); if (ret) { if (verbose) pr_alert("%s" TORTURE_FLAG @@ -148,7 +148,7 @@ bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes, torture_type, cpu); starttime = jiffies; (*n_onl_attempts)++; - ret = cpu_up(cpu); + ret = add_cpu(cpu); if (ret) { if (verbose) pr_alert("%s" TORTURE_FLAG @@ -192,17 +192,18 @@ torture_onoff(void *arg) for_each_online_cpu(cpu) maxcpu = cpu; WARN_ON(maxcpu < 0); - if (!IS_MODULE(CONFIG_TORTURE_TEST)) + if (!IS_MODULE(CONFIG_TORTURE_TEST)) { for_each_possible_cpu(cpu) { if (cpu_online(cpu)) continue; - ret = cpu_up(cpu); + ret = add_cpu(cpu); if (ret && verbose) { pr_alert("%s" TORTURE_FLAG "%s: Initial online %d: errno %d\n", __func__, torture_type, cpu, ret); } } + } if (maxcpu == 0) { VERBOSE_TOROUT_STRING("Only one CPU, so CPU-hotplug testing is disabled"); -- cgit v1.2.3-71-gd317 From b99a26593b5190fac6b5c1f81a7f8cc128a25c98 Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:51:09 +0000 Subject: cpu/hotplug: Move bringup of secondary CPUs out of smp_init() This is the last direct user of cpu_up() before it can become an internal implementation detail of the cpu subsystem. Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323135110.30522-17-qais.yousef@arm.com --- include/linux/cpu.h | 1 + kernel/cpu.c | 12 ++++++++++++ kernel/smp.c | 9 +-------- 3 files changed, 14 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 9dc1e892e193..8b295f78da25 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -94,6 +94,7 @@ void notify_cpu_starting(unsigned int cpu); extern void cpu_maps_update_begin(void); extern void cpu_maps_update_done(void); int bringup_hibernate_cpu(unsigned int sleep_cpu); +void bringup_nonboot_cpus(unsigned int setup_max_cpus); #else /* CONFIG_SMP */ #define cpuhp_tasks_frozen 0 diff --git a/kernel/cpu.c b/kernel/cpu.c index f803678684e9..4783d811ccf8 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1298,6 +1298,18 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu) return 0; } +void bringup_nonboot_cpus(unsigned int setup_max_cpus) +{ + unsigned int cpu; + + for_each_present_cpu(cpu) { + if (num_online_cpus() >= setup_max_cpus) + break; + if (!cpu_online(cpu)) + cpu_up(cpu); + } +} + #ifdef CONFIG_PM_SLEEP_SMP static cpumask_var_t frozen_cpus; diff --git a/kernel/smp.c b/kernel/smp.c index 97f1d9765c94..786092aabdcd 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -597,20 +597,13 @@ void __init setup_nr_cpu_ids(void) void __init smp_init(void) { int num_nodes, num_cpus; - unsigned int cpu; idle_threads_init(); cpuhp_threads_init(); pr_info("Bringing up secondary CPUs ...\n"); - /* FIXME: This should be done in userspace --RR */ - for_each_present_cpu(cpu) { - if (num_online_cpus() >= setup_max_cpus) - break; - if (!cpu_online(cpu)) - cpu_up(cpu); - } + bringup_nonboot_cpus(setup_max_cpus); num_nodes = num_online_nodes(); num_cpus = num_online_cpus(); -- cgit v1.2.3-71-gd317 From 33c3736ec88811b9b6f6ce2cc8967f6b97c3db5e Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Mon, 23 Mar 2020 13:51:10 +0000 Subject: cpu/hotplug: Hide cpu_up/down() Use separate functions for the device core to bring a CPU up and down. Users outside the device core must use add/remove_cpu() which will take care of extra housekeeping work like keeping sysfs in sync. Make cpu_up/down() static and replace the extra layer of indirection. [ tglx: Removed the extra wrapper functions and adjusted function names ] Signed-off-by: Qais Yousef Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200323135110.30522-18-qais.yousef@arm.com --- drivers/base/cpu.c | 4 ++-- include/linux/cpu.h | 4 ++-- kernel/cpu.c | 42 ++++++++++++++++++++++++++++-------------- 3 files changed, 32 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c index 6265871a4af2..b93a8f82f215 100644 --- a/drivers/base/cpu.c +++ b/drivers/base/cpu.c @@ -55,7 +55,7 @@ static int cpu_subsys_online(struct device *dev) if (from_nid == NUMA_NO_NODE) return -ENODEV; - ret = cpu_up(cpuid); + ret = cpu_device_up(dev); /* * When hot adding memory to memoryless node and enabling a cpu * on the node, node number of the cpu may internally change. @@ -69,7 +69,7 @@ static int cpu_subsys_online(struct device *dev) static int cpu_subsys_offline(struct device *dev) { - return cpu_down(dev->id); + return cpu_device_down(dev); } void unregister_cpu(struct cpu *cpu) diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 8b295f78da25..9ead281157d3 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -88,8 +88,8 @@ extern ssize_t arch_cpu_release(const char *, size_t); #ifdef CONFIG_SMP extern bool cpuhp_tasks_frozen; -int cpu_up(unsigned int cpu); int add_cpu(unsigned int cpu); +int cpu_device_up(struct device *dev); void notify_cpu_starting(unsigned int cpu); extern void cpu_maps_update_begin(void); extern void cpu_maps_update_done(void); @@ -120,8 +120,8 @@ extern void lockdep_assert_cpus_held(void); extern void cpu_hotplug_disable(void); extern void cpu_hotplug_enable(void); void clear_tasks_mm_cpumask(int cpu); -int cpu_down(unsigned int cpu); int remove_cpu(unsigned int cpu); +int cpu_device_down(struct device *dev); extern void smp_shutdown_nonboot_cpus(unsigned int primary_cpu); #else /* CONFIG_HOTPLUG_CPU */ diff --git a/kernel/cpu.c b/kernel/cpu.c index 4783d811ccf8..30848496cbc7 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1041,7 +1041,7 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target) return _cpu_down(cpu, 0, target); } -static int do_cpu_down(unsigned int cpu, enum cpuhp_state target) +static int cpu_down(unsigned int cpu, enum cpuhp_state target) { int err; @@ -1051,11 +1051,18 @@ static int do_cpu_down(unsigned int cpu, enum cpuhp_state target) return err; } -int cpu_down(unsigned int cpu) +/** + * cpu_device_down - Bring down a cpu device + * @dev: Pointer to the cpu device to offline + * + * This function is meant to be used by device core cpu subsystem only. + * + * Other subsystems should use remove_cpu() instead. + */ +int cpu_device_down(struct device *dev) { - return do_cpu_down(cpu, CPUHP_OFFLINE); + return cpu_down(dev->id, CPUHP_OFFLINE); } -EXPORT_SYMBOL(cpu_down); int remove_cpu(unsigned int cpu) { @@ -1178,8 +1185,8 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target) } /* - * The caller of do_cpu_up might have raced with another - * caller. Ignore it for now. + * The caller of cpu_up() might have raced with another + * caller. Nothing to do. */ if (st->state >= target) goto out; @@ -1223,7 +1230,7 @@ out: return ret; } -static int do_cpu_up(unsigned int cpu, enum cpuhp_state target) +static int cpu_up(unsigned int cpu, enum cpuhp_state target) { int err = 0; @@ -1257,11 +1264,18 @@ out: return err; } -int cpu_up(unsigned int cpu) +/** + * cpu_device_up - Bring up a cpu device + * @dev: Pointer to the cpu device to online + * + * This function is meant to be used by device core cpu subsystem only. + * + * Other subsystems should use add_cpu() instead. + */ +int cpu_device_up(struct device *dev) { - return do_cpu_up(cpu, CPUHP_ONLINE); + return cpu_up(dev->id, CPUHP_ONLINE); } -EXPORT_SYMBOL_GPL(cpu_up); int add_cpu(unsigned int cpu) { @@ -1289,7 +1303,7 @@ int bringup_hibernate_cpu(unsigned int sleep_cpu) if (!cpu_online(sleep_cpu)) { pr_info("Hibernated on a CPU that is offline! Bringing CPU up.\n"); - ret = cpu_up(sleep_cpu); + ret = cpu_up(sleep_cpu, CPUHP_ONLINE); if (ret) { pr_err("Failed to bring hibernate-CPU up!\n"); return ret; @@ -1306,7 +1320,7 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus) if (num_online_cpus() >= setup_max_cpus) break; if (!cpu_online(cpu)) - cpu_up(cpu); + cpu_up(cpu, CPUHP_ONLINE); } } @@ -2129,9 +2143,9 @@ static ssize_t write_cpuhp_target(struct device *dev, goto out; if (st->state < target) - ret = do_cpu_up(dev->id, target); + ret = cpu_up(dev->id, target); else - ret = do_cpu_down(dev->id, target); + ret = cpu_down(dev->id, target); out: unlock_device_hotplug(); return ret ? ret : count; -- cgit v1.2.3-71-gd317 From eea9673250db4e854e9998ef9da6d4584857f0ea Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 25 Mar 2020 10:03:36 -0500 Subject: exec: Add exec_update_mutex to replace cred_guard_mutex The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai Signed-off-by: "Eric W. Biederman" Signed-off-by: Bernd Edlinger Signed-off-by: Eric W. Biederman --- fs/exec.c | 22 +++++++++++++++++++--- include/linux/binfmts.h | 8 +++++++- include/linux/sched/signal.h | 9 ++++++++- init/init_task.c | 1 + kernel/fork.c | 1 + 5 files changed, 36 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/fs/exec.c b/fs/exec.c index d820a7272a76..0e46ec57fe0a 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1010,16 +1010,26 @@ ssize_t read_code(struct file *file, unsigned long addr, loff_t pos, size_t len) } EXPORT_SYMBOL(read_code); +/* + * Maps the mm_struct mm into the current task struct. + * On success, this function returns with the mutex + * exec_update_mutex locked. + */ static int exec_mmap(struct mm_struct *mm) { struct task_struct *tsk; struct mm_struct *old_mm, *active_mm; + int ret; /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; exec_mm_release(tsk, old_mm); + ret = mutex_lock_killable(&tsk->signal->exec_update_mutex); + if (ret) + return ret; + if (old_mm) { sync_mm_rss(old_mm); /* @@ -1031,9 +1041,11 @@ static int exec_mmap(struct mm_struct *mm) down_read(&old_mm->mmap_sem); if (unlikely(old_mm->core_state)) { up_read(&old_mm->mmap_sem); + mutex_unlock(&tsk->signal->exec_update_mutex); return -EINTR; } } + task_lock(tsk); active_mm = tsk->active_mm; membarrier_exec_mmap(mm); @@ -1288,11 +1300,12 @@ int flush_old_exec(struct linux_binprm * bprm) goto out; /* - * After clearing bprm->mm (to mark that current is using the - * prepared mm now), we have nothing left of the original + * After setting bprm->called_exec_mmap (to mark that current is + * using the prepared mm now), we have nothing left of the original * process. If anything from here on returns an error, the check * in search_binary_handler() will SEGV current. */ + bprm->called_exec_mmap = 1; bprm->mm = NULL; #ifdef CONFIG_POSIX_TIMERS @@ -1438,6 +1451,8 @@ static void free_bprm(struct linux_binprm *bprm) { free_arg_pages(bprm); if (bprm->cred) { + if (bprm->called_exec_mmap) + mutex_unlock(¤t->signal->exec_update_mutex); mutex_unlock(¤t->signal->cred_guard_mutex); abort_creds(bprm->cred); } @@ -1487,6 +1502,7 @@ void install_exec_creds(struct linux_binprm *bprm) * credentials; any time after this it may be unlocked. */ security_bprm_committed_creds(bprm); + mutex_unlock(¤t->signal->exec_update_mutex); mutex_unlock(¤t->signal->cred_guard_mutex); } EXPORT_SYMBOL(install_exec_creds); @@ -1678,7 +1694,7 @@ int search_binary_handler(struct linux_binprm *bprm) read_lock(&binfmt_lock); put_binfmt(fmt); - if (retval < 0 && !bprm->mm) { + if (retval < 0 && bprm->called_exec_mmap) { /* we got to flush_old_exec() and failed after it */ read_unlock(&binfmt_lock); force_sigsegv(SIGSEGV); diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h index b40fc633f3be..a345d9fed3d8 100644 --- a/include/linux/binfmts.h +++ b/include/linux/binfmts.h @@ -44,7 +44,13 @@ struct linux_binprm { * exec has happened. Used to sanitize execution environment * and to set AT_SECURE auxv for glibc. */ - secureexec:1; + secureexec:1, + /* + * Set by flush_old_exec, when exec_mmap has been called. + * This is past the point of no return, when the + * exec_update_mutex has been taken. + */ + called_exec_mmap:1; #ifdef __alpha__ unsigned int taso:1; #endif diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 88050259c466..a29df79540ce 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -224,7 +224,14 @@ struct signal_struct { struct mutex cred_guard_mutex; /* guard against foreign influences on * credential calculations - * (notably. ptrace) */ + * (notably. ptrace) + * Deprecated do not use in new code. + * Use exec_update_mutex instead. + */ + struct mutex exec_update_mutex; /* Held while task_struct is being + * updated during exec, and may have + * inconsistent permissions. + */ } __randomize_layout; /* diff --git a/init/init_task.c b/init/init_task.c index 9e5cbe5eab7b..bd403ed3e418 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -26,6 +26,7 @@ static struct signal_struct init_signals = { .multiprocess = HLIST_HEAD_INIT, .rlim = INIT_RLIMITS, .cred_guard_mutex = __MUTEX_INITIALIZER(init_signals.cred_guard_mutex), + .exec_update_mutex = __MUTEX_INITIALIZER(init_signals.exec_update_mutex), #ifdef CONFIG_POSIX_TIMERS .posix_timers = LIST_HEAD_INIT(init_signals.posix_timers), .cputimer = { diff --git a/kernel/fork.c b/kernel/fork.c index 60a1295f4384..12896a6ecee6 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1594,6 +1594,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->oom_score_adj_min = current->signal->oom_score_adj_min; mutex_init(&sig->cred_guard_mutex); + mutex_init(&sig->exec_update_mutex); return 0; } -- cgit v1.2.3-71-gd317 From 3e74fabd39710ee29fa25618d2c2b40cfa7d76c7 Mon Sep 17 00:00:00 2001 From: Bernd Edlinger Date: Fri, 20 Mar 2020 21:26:04 +0100 Subject: exec: Fix a deadlock in strace This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running. I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access. The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex. This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/ Signed-off-by: Bernd Edlinger Reviewed-by: Kees Cook Signed-off-by: Eric W. Biederman --- kernel/fork.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 12896a6ecee6..e0668a79bca1 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1224,7 +1224,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode) struct mm_struct *mm; int err; - err = mutex_lock_killable(&task->signal->cred_guard_mutex); + err = mutex_lock_killable(&task->signal->exec_update_mutex); if (err) return ERR_PTR(err); @@ -1234,7 +1234,7 @@ struct mm_struct *mm_access(struct task_struct *task, unsigned int mode) mmput(mm); mm = ERR_PTR(-EACCES); } - mutex_unlock(&task->signal->cred_guard_mutex); + mutex_unlock(&task->signal->exec_update_mutex); return mm; } -- cgit v1.2.3-71-gd317 From aa884c11313656fee7b12972614b6333f154655c Mon Sep 17 00:00:00 2001 From: Bernd Edlinger Date: Fri, 20 Mar 2020 21:26:50 +0100 Subject: kernel: doc: remove outdated comment cred.c This removes an outdated comment in prepare_kernel_cred. There is no "cred_replace_mutex" any more, so the comment must go away. Signed-off-by: Bernd Edlinger Reviewed-by: Kees Cook Signed-off-by: Eric W. Biederman --- kernel/cred.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cred.c b/kernel/cred.c index 809a985b1793..71a792616917 100644 --- a/kernel/cred.c +++ b/kernel/cred.c @@ -675,8 +675,6 @@ void __init cred_init(void) * The caller may change these controls afterwards if desired. * * Returns the new credentials or NULL if out of memory. - * - * Does not take, and does not return holding current->cred_replace_mutex. */ struct cred *prepare_kernel_cred(struct task_struct *daemon) { -- cgit v1.2.3-71-gd317 From 454e3126cb842388e22df6b3ac3da44062c00765 Mon Sep 17 00:00:00 2001 From: Bernd Edlinger Date: Fri, 20 Mar 2020 21:27:05 +0100 Subject: kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve This changes kcmp_epoll_target to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials are only used for reading, and furthermore ->mm and ->sighand are updated on execve, but only under the new exec_update_mutex. Signed-off-by: Bernd Edlinger Signed-off-by: Eric W. Biederman --- kernel/kcmp.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/kcmp.c b/kernel/kcmp.c index a0e3d7a0e8b8..b3ff9288c6cc 100644 --- a/kernel/kcmp.c +++ b/kernel/kcmp.c @@ -173,8 +173,8 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, /* * One should have enough rights to inspect task details. */ - ret = kcmp_lock(&task1->signal->cred_guard_mutex, - &task2->signal->cred_guard_mutex); + ret = kcmp_lock(&task1->signal->exec_update_mutex, + &task2->signal->exec_update_mutex); if (ret) goto err; if (!ptrace_may_access(task1, PTRACE_MODE_READ_REALCREDS) || @@ -229,8 +229,8 @@ SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type, } err_unlock: - kcmp_unlock(&task1->signal->cred_guard_mutex, - &task2->signal->cred_guard_mutex); + kcmp_unlock(&task1->signal->exec_update_mutex, + &task2->signal->exec_update_mutex); err: put_task_struct(task1); put_task_struct(task2); -- cgit v1.2.3-71-gd317 From 6914303824bb572278568330d72fc1f8f9814e67 Mon Sep 17 00:00:00 2001 From: Bernd Edlinger Date: Fri, 20 Mar 2020 21:27:55 +0100 Subject: perf: Use new infrastructure to fix deadlocks in execve This changes perf_event_set_clock to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials are only used for reading. Signed-off-by: Bernd Edlinger Signed-off-by: Eric W. Biederman --- kernel/events/core.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index e453589da97c..71cba8cfccbc 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1249,7 +1249,7 @@ static void put_ctx(struct perf_event_context *ctx) * function. * * Lock order: - * cred_guard_mutex + * exec_update_mutex * task_struct::perf_event_mutex * perf_event_context::mutex * perf_event::child_mutex; @@ -11263,14 +11263,14 @@ SYSCALL_DEFINE5(perf_event_open, } if (task) { - err = mutex_lock_interruptible(&task->signal->cred_guard_mutex); + err = mutex_lock_interruptible(&task->signal->exec_update_mutex); if (err) goto err_task; /* * Reuse ptrace permission checks for now. * - * We must hold cred_guard_mutex across this and any potential + * We must hold exec_update_mutex across this and any potential * perf_install_in_context() call for this new event to * serialize against exec() altering our credentials (and the * perf_event_exit_task() that could imply). @@ -11559,7 +11559,7 @@ SYSCALL_DEFINE5(perf_event_open, mutex_unlock(&ctx->mutex); if (task) { - mutex_unlock(&task->signal->cred_guard_mutex); + mutex_unlock(&task->signal->exec_update_mutex); put_task_struct(task); } @@ -11595,7 +11595,7 @@ err_alloc: free_event(event); err_cred: if (task) - mutex_unlock(&task->signal->cred_guard_mutex); + mutex_unlock(&task->signal->exec_update_mutex); err_task: if (task) put_task_struct(task); @@ -11900,7 +11900,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn) /* * When a child task exits, feed back event values to parent events. * - * Can be called with cred_guard_mutex held when called from + * Can be called with exec_update_mutex held when called from * install_exec_creds(). */ void perf_event_exit_task(struct task_struct *child) -- cgit v1.2.3-71-gd317 From 501f9328bf5c6b5e4863da4b50e0e86792de3aa9 Mon Sep 17 00:00:00 2001 From: Bernd Edlinger Date: Sat, 21 Mar 2020 02:46:16 +0000 Subject: pidfd: Use new infrastructure to fix deadlocks in execve This changes __pidfd_fget to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials do not change before exec_update_mutex is locked. Therefore whatever file access is possible with holding the cred_guard_mutex here is also possbile with the exec_update_mutex. Signed-off-by: Bernd Edlinger Signed-off-by: Eric W. Biederman --- kernel/pid.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/pid.c b/kernel/pid.c index 60820e72634c..efd34874b3d1 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -577,7 +577,7 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd) struct file *file; int ret; - ret = mutex_lock_killable(&task->signal->cred_guard_mutex); + ret = mutex_lock_killable(&task->signal->exec_update_mutex); if (ret) return ERR_PTR(ret); @@ -586,7 +586,7 @@ static struct file *__pidfd_fget(struct task_struct *task, int fd) else file = ERR_PTR(-EPERM); - mutex_unlock(&task->signal->cred_guard_mutex); + mutex_unlock(&task->signal->exec_update_mutex); return file ?: ERR_PTR(-EBADF); } -- cgit v1.2.3-71-gd317 From 07cd263148a53ebcc911b4c96d89a72df31c8f49 Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Tue, 24 Mar 2020 10:38:15 -0700 Subject: bpf: Verifer, refactor adjust_scalar_min_max_vals Pull per op ALU logic into individual functions. We are about to add u32 versions of each of these by pull them out the code gets a bit more readable here and nicer in the next patch. Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/158507149518.15666.15672349629329072411.stgit@john-Precision-5820-Tower --- kernel/bpf/verifier.c | 403 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 239 insertions(+), 164 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 745f3cfdf3b2..559c9afd673c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4843,6 +4843,237 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, return 0; } +static void scalar_min_max_add(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s64 smin_val = src_reg->smin_value; + s64 smax_val = src_reg->smax_value; + u64 umin_val = src_reg->umin_value; + u64 umax_val = src_reg->umax_value; + + if (signed_add_overflows(dst_reg->smin_value, smin_val) || + signed_add_overflows(dst_reg->smax_value, smax_val)) { + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + } else { + dst_reg->smin_value += smin_val; + dst_reg->smax_value += smax_val; + } + if (dst_reg->umin_value + umin_val < umin_val || + dst_reg->umax_value + umax_val < umax_val) { + dst_reg->umin_value = 0; + dst_reg->umax_value = U64_MAX; + } else { + dst_reg->umin_value += umin_val; + dst_reg->umax_value += umax_val; + } + dst_reg->var_off = tnum_add(dst_reg->var_off, src_reg->var_off); +} + +static void scalar_min_max_sub(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s64 smin_val = src_reg->smin_value; + s64 smax_val = src_reg->smax_value; + u64 umin_val = src_reg->umin_value; + u64 umax_val = src_reg->umax_value; + + if (signed_sub_overflows(dst_reg->smin_value, smax_val) || + signed_sub_overflows(dst_reg->smax_value, smin_val)) { + /* Overflow possible, we know nothing */ + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + } else { + dst_reg->smin_value -= smax_val; + dst_reg->smax_value -= smin_val; + } + if (dst_reg->umin_value < umax_val) { + /* Overflow possible, we know nothing */ + dst_reg->umin_value = 0; + dst_reg->umax_value = U64_MAX; + } else { + /* Cannot overflow (as long as bounds are consistent) */ + dst_reg->umin_value -= umax_val; + dst_reg->umax_value -= umin_val; + } + dst_reg->var_off = tnum_sub(dst_reg->var_off, src_reg->var_off); +} + +static void scalar_min_max_mul(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s64 smin_val = src_reg->smin_value; + u64 umin_val = src_reg->umin_value; + u64 umax_val = src_reg->umax_value; + + dst_reg->var_off = tnum_mul(dst_reg->var_off, src_reg->var_off); + if (smin_val < 0 || dst_reg->smin_value < 0) { + /* Ain't nobody got time to multiply that sign */ + __mark_reg_unbounded(dst_reg); + __update_reg_bounds(dst_reg); + return; + } + /* Both values are positive, so we can work with unsigned and + * copy the result to signed (unless it exceeds S64_MAX). + */ + if (umax_val > U32_MAX || dst_reg->umax_value > U32_MAX) { + /* Potential overflow, we know nothing */ + __mark_reg_unbounded(dst_reg); + /* (except what we can learn from the var_off) */ + __update_reg_bounds(dst_reg); + return; + } + dst_reg->umin_value *= umin_val; + dst_reg->umax_value *= umax_val; + if (dst_reg->umax_value > S64_MAX) { + /* Overflow possible, we know nothing */ + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + } else { + dst_reg->smin_value = dst_reg->umin_value; + dst_reg->smax_value = dst_reg->umax_value; + } +} + +static void scalar_min_max_and(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s64 smin_val = src_reg->smin_value; + u64 umax_val = src_reg->umax_value; + + /* We get our minimum from the var_off, since that's inherently + * bitwise. Our maximum is the minimum of the operands' maxima. + */ + dst_reg->var_off = tnum_and(dst_reg->var_off, src_reg->var_off); + dst_reg->umin_value = dst_reg->var_off.value; + dst_reg->umax_value = min(dst_reg->umax_value, umax_val); + if (dst_reg->smin_value < 0 || smin_val < 0) { + /* Lose signed bounds when ANDing negative numbers, + * ain't nobody got time for that. + */ + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + } else { + /* ANDing two positives gives a positive, so safe to + * cast result into s64. + */ + dst_reg->smin_value = dst_reg->umin_value; + dst_reg->smax_value = dst_reg->umax_value; + } + /* We may learn something more from the var_off */ + __update_reg_bounds(dst_reg); +} + +static void scalar_min_max_or(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s64 smin_val = src_reg->smin_value; + u64 umin_val = src_reg->umin_value; + + /* We get our maximum from the var_off, and our minimum is the + * maximum of the operands' minima + */ + dst_reg->var_off = tnum_or(dst_reg->var_off, src_reg->var_off); + dst_reg->umin_value = max(dst_reg->umin_value, umin_val); + dst_reg->umax_value = dst_reg->var_off.value | dst_reg->var_off.mask; + if (dst_reg->smin_value < 0 || smin_val < 0) { + /* Lose signed bounds when ORing negative numbers, + * ain't nobody got time for that. + */ + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + } else { + /* ORing two positives gives a positive, so safe to + * cast result into s64. + */ + dst_reg->smin_value = dst_reg->umin_value; + dst_reg->smax_value = dst_reg->umax_value; + } + /* We may learn something more from the var_off */ + __update_reg_bounds(dst_reg); +} + +static void scalar_min_max_lsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + u64 umax_val = src_reg->umax_value; + u64 umin_val = src_reg->umin_value; + + /* We lose all sign bit information (except what we can pick + * up from var_off) + */ + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + /* If we might shift our top bit out, then we know nothing */ + if (dst_reg->umax_value > 1ULL << (63 - umax_val)) { + dst_reg->umin_value = 0; + dst_reg->umax_value = U64_MAX; + } else { + dst_reg->umin_value <<= umin_val; + dst_reg->umax_value <<= umax_val; + } + dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val); + /* We may learn something more from the var_off */ + __update_reg_bounds(dst_reg); +} + +static void scalar_min_max_rsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + u64 umax_val = src_reg->umax_value; + u64 umin_val = src_reg->umin_value; + + /* BPF_RSH is an unsigned shift. If the value in dst_reg might + * be negative, then either: + * 1) src_reg might be zero, so the sign bit of the result is + * unknown, so we lose our signed bounds + * 2) it's known negative, thus the unsigned bounds capture the + * signed bounds + * 3) the signed bounds cross zero, so they tell us nothing + * about the result + * If the value in dst_reg is known nonnegative, then again the + * unsigned bounts capture the signed bounds. + * Thus, in all cases it suffices to blow away our signed bounds + * and rely on inferring new ones from the unsigned bounds and + * var_off of the result. + */ + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val); + dst_reg->umin_value >>= umax_val; + dst_reg->umax_value >>= umin_val; + /* We may learn something more from the var_off */ + __update_reg_bounds(dst_reg); +} + +static void scalar_min_max_arsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg, + u64 insn_bitness) +{ + u64 umin_val = src_reg->umin_value; + + /* Upon reaching here, src_known is true and + * umax_val is equal to umin_val. + */ + if (insn_bitness == 32) { + dst_reg->smin_value = (u32)(((s32)dst_reg->smin_value) >> umin_val); + dst_reg->smax_value = (u32)(((s32)dst_reg->smax_value) >> umin_val); + } else { + dst_reg->smin_value >>= umin_val; + dst_reg->smax_value >>= umin_val; + } + + dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val, + insn_bitness); + + /* blow away the dst_reg umin_value/umax_value and rely on + * dst_reg var_off to refine the result. + */ + dst_reg->umin_value = 0; + dst_reg->umax_value = U64_MAX; + __update_reg_bounds(dst_reg); +} + /* WARNING: This function does calculations on 64-bit values, but the actual * execution may occur on 32-bit values. Therefore, things like bitshifts * need extra checks in the 32-bit case. @@ -4899,23 +5130,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, verbose(env, "R%d tried to add from different pointers or scalars\n", dst); return ret; } - if (signed_add_overflows(dst_reg->smin_value, smin_val) || - signed_add_overflows(dst_reg->smax_value, smax_val)) { - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - } else { - dst_reg->smin_value += smin_val; - dst_reg->smax_value += smax_val; - } - if (dst_reg->umin_value + umin_val < umin_val || - dst_reg->umax_value + umax_val < umax_val) { - dst_reg->umin_value = 0; - dst_reg->umax_value = U64_MAX; - } else { - dst_reg->umin_value += umin_val; - dst_reg->umax_value += umax_val; - } - dst_reg->var_off = tnum_add(dst_reg->var_off, src_reg.var_off); + scalar_min_max_add(dst_reg, &src_reg); break; case BPF_SUB: ret = sanitize_val_alu(env, insn); @@ -4923,54 +5138,10 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, verbose(env, "R%d tried to sub from different pointers or scalars\n", dst); return ret; } - if (signed_sub_overflows(dst_reg->smin_value, smax_val) || - signed_sub_overflows(dst_reg->smax_value, smin_val)) { - /* Overflow possible, we know nothing */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - } else { - dst_reg->smin_value -= smax_val; - dst_reg->smax_value -= smin_val; - } - if (dst_reg->umin_value < umax_val) { - /* Overflow possible, we know nothing */ - dst_reg->umin_value = 0; - dst_reg->umax_value = U64_MAX; - } else { - /* Cannot overflow (as long as bounds are consistent) */ - dst_reg->umin_value -= umax_val; - dst_reg->umax_value -= umin_val; - } - dst_reg->var_off = tnum_sub(dst_reg->var_off, src_reg.var_off); + scalar_min_max_sub(dst_reg, &src_reg); break; case BPF_MUL: - dst_reg->var_off = tnum_mul(dst_reg->var_off, src_reg.var_off); - if (smin_val < 0 || dst_reg->smin_value < 0) { - /* Ain't nobody got time to multiply that sign */ - __mark_reg_unbounded(dst_reg); - __update_reg_bounds(dst_reg); - break; - } - /* Both values are positive, so we can work with unsigned and - * copy the result to signed (unless it exceeds S64_MAX). - */ - if (umax_val > U32_MAX || dst_reg->umax_value > U32_MAX) { - /* Potential overflow, we know nothing */ - __mark_reg_unbounded(dst_reg); - /* (except what we can learn from the var_off) */ - __update_reg_bounds(dst_reg); - break; - } - dst_reg->umin_value *= umin_val; - dst_reg->umax_value *= umax_val; - if (dst_reg->umax_value > S64_MAX) { - /* Overflow possible, we know nothing */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - } else { - dst_reg->smin_value = dst_reg->umin_value; - dst_reg->smax_value = dst_reg->umax_value; - } + scalar_min_max_mul(dst_reg, &src_reg); break; case BPF_AND: if (src_known && dst_known) { @@ -4978,27 +5149,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, src_reg.var_off.value); break; } - /* We get our minimum from the var_off, since that's inherently - * bitwise. Our maximum is the minimum of the operands' maxima. - */ - dst_reg->var_off = tnum_and(dst_reg->var_off, src_reg.var_off); - dst_reg->umin_value = dst_reg->var_off.value; - dst_reg->umax_value = min(dst_reg->umax_value, umax_val); - if (dst_reg->smin_value < 0 || smin_val < 0) { - /* Lose signed bounds when ANDing negative numbers, - * ain't nobody got time for that. - */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - } else { - /* ANDing two positives gives a positive, so safe to - * cast result into s64. - */ - dst_reg->smin_value = dst_reg->umin_value; - dst_reg->smax_value = dst_reg->umax_value; - } - /* We may learn something more from the var_off */ - __update_reg_bounds(dst_reg); + scalar_min_max_and(dst_reg, &src_reg); break; case BPF_OR: if (src_known && dst_known) { @@ -5006,28 +5157,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, src_reg.var_off.value); break; } - /* We get our maximum from the var_off, and our minimum is the - * maximum of the operands' minima - */ - dst_reg->var_off = tnum_or(dst_reg->var_off, src_reg.var_off); - dst_reg->umin_value = max(dst_reg->umin_value, umin_val); - dst_reg->umax_value = dst_reg->var_off.value | - dst_reg->var_off.mask; - if (dst_reg->smin_value < 0 || smin_val < 0) { - /* Lose signed bounds when ORing negative numbers, - * ain't nobody got time for that. - */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - } else { - /* ORing two positives gives a positive, so safe to - * cast result into s64. - */ - dst_reg->smin_value = dst_reg->umin_value; - dst_reg->smax_value = dst_reg->umax_value; - } - /* We may learn something more from the var_off */ - __update_reg_bounds(dst_reg); + scalar_min_max_or(dst_reg, &src_reg); break; case BPF_LSH: if (umax_val >= insn_bitness) { @@ -5037,22 +5167,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, mark_reg_unknown(env, regs, insn->dst_reg); break; } - /* We lose all sign bit information (except what we can pick - * up from var_off) - */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - /* If we might shift our top bit out, then we know nothing */ - if (dst_reg->umax_value > 1ULL << (63 - umax_val)) { - dst_reg->umin_value = 0; - dst_reg->umax_value = U64_MAX; - } else { - dst_reg->umin_value <<= umin_val; - dst_reg->umax_value <<= umax_val; - } - dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val); - /* We may learn something more from the var_off */ - __update_reg_bounds(dst_reg); + scalar_min_max_lsh(dst_reg, &src_reg); break; case BPF_RSH: if (umax_val >= insn_bitness) { @@ -5062,27 +5177,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, mark_reg_unknown(env, regs, insn->dst_reg); break; } - /* BPF_RSH is an unsigned shift. If the value in dst_reg might - * be negative, then either: - * 1) src_reg might be zero, so the sign bit of the result is - * unknown, so we lose our signed bounds - * 2) it's known negative, thus the unsigned bounds capture the - * signed bounds - * 3) the signed bounds cross zero, so they tell us nothing - * about the result - * If the value in dst_reg is known nonnegative, then again the - * unsigned bounts capture the signed bounds. - * Thus, in all cases it suffices to blow away our signed bounds - * and rely on inferring new ones from the unsigned bounds and - * var_off of the result. - */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; - dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val); - dst_reg->umin_value >>= umax_val; - dst_reg->umax_value >>= umin_val; - /* We may learn something more from the var_off */ - __update_reg_bounds(dst_reg); + scalar_min_max_rsh(dst_reg, &src_reg); break; case BPF_ARSH: if (umax_val >= insn_bitness) { @@ -5092,27 +5187,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, mark_reg_unknown(env, regs, insn->dst_reg); break; } - - /* Upon reaching here, src_known is true and - * umax_val is equal to umin_val. - */ - if (insn_bitness == 32) { - dst_reg->smin_value = (u32)(((s32)dst_reg->smin_value) >> umin_val); - dst_reg->smax_value = (u32)(((s32)dst_reg->smax_value) >> umin_val); - } else { - dst_reg->smin_value >>= umin_val; - dst_reg->smax_value >>= umin_val; - } - - dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val, - insn_bitness); - - /* blow away the dst_reg umin_value/umax_value and rely on - * dst_reg var_off to refine the result. - */ - dst_reg->umin_value = 0; - dst_reg->umax_value = U64_MAX; - __update_reg_bounds(dst_reg); + scalar_min_max_arsh(dst_reg, &src_reg, insn_bitness); break; default: mark_reg_unknown(env, regs, insn->dst_reg); -- cgit v1.2.3-71-gd317 From 294f2fc6da27620a506e6c050241655459ccd6bd Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Tue, 24 Mar 2020 10:38:37 -0700 Subject: bpf: Verifer, adjust_scalar_min_max_vals to always call update_reg_bounds() Currently, for all op verification we call __red_deduce_bounds() and __red_bound_offset() but we only call __update_reg_bounds() in bitwise ops. However, we could benefit from calling __update_reg_bounds() in BPF_ADD, BPF_SUB, and BPF_MUL cases as well. For example, a register with state 'R1_w=invP0' when we subtract from it, w1 -= 2 Before coerce we will now have an smin_value=S64_MIN, smax_value=U64_MAX and unsigned bounds umin_value=0, umax_value=U64_MAX. These will then be clamped to S32_MIN, U32_MAX values by coerce in the case of alu32 op as done in above example. However tnum will be a constant because the ALU op is done on a constant. Without update_reg_bounds() we have a scenario where tnum is a const but our unsigned bounds do not reflect this. By calling update_reg_bounds after coerce to 32bit we further refine the umin_value to U64_MAX in the alu64 case or U32_MAX in the alu32 case above. Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/158507151689.15666.566796274289413203.stgit@john-Precision-5820-Tower --- kernel/bpf/verifier.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 559c9afd673c..2ea2a868324e 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5199,6 +5199,7 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, coerce_reg_to_size(dst_reg, 4); } + __update_reg_bounds(dst_reg); __reg_deduce_bounds(dst_reg); __reg_bound_offset(dst_reg); return 0; -- cgit v1.2.3-71-gd317 From 00c4eddf7ee5cb4941d669d605815454dc9a5419 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Tue, 24 Mar 2020 23:57:41 -0700 Subject: bpf: Factor out cgroup storages operations Refactor cgroup attach/detach code to abstract away common operations performed on all types of cgroup storages. This makes high-level logic more apparent, plus allows to reuse more code across multiple functions. Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200325065746.640559-2-andriin@fb.com --- kernel/bpf/cgroup.c | 118 ++++++++++++++++++++++++++++++++-------------------- 1 file changed, 72 insertions(+), 46 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 9a500fadbef5..9c8472823a7f 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -28,6 +28,58 @@ void cgroup_bpf_offline(struct cgroup *cgrp) percpu_ref_kill(&cgrp->bpf.refcnt); } +static void bpf_cgroup_storages_free(struct bpf_cgroup_storage *storages[]) +{ + enum bpf_cgroup_storage_type stype; + + for_each_cgroup_storage_type(stype) + bpf_cgroup_storage_free(storages[stype]); +} + +static int bpf_cgroup_storages_alloc(struct bpf_cgroup_storage *storages[], + struct bpf_prog *prog) +{ + enum bpf_cgroup_storage_type stype; + + for_each_cgroup_storage_type(stype) { + storages[stype] = bpf_cgroup_storage_alloc(prog, stype); + if (IS_ERR(storages[stype])) { + storages[stype] = NULL; + bpf_cgroup_storages_free(storages); + return -ENOMEM; + } + } + + return 0; +} + +static void bpf_cgroup_storages_assign(struct bpf_cgroup_storage *dst[], + struct bpf_cgroup_storage *src[]) +{ + enum bpf_cgroup_storage_type stype; + + for_each_cgroup_storage_type(stype) + dst[stype] = src[stype]; +} + +static void bpf_cgroup_storages_link(struct bpf_cgroup_storage *storages[], + struct cgroup* cgrp, + enum bpf_attach_type attach_type) +{ + enum bpf_cgroup_storage_type stype; + + for_each_cgroup_storage_type(stype) + bpf_cgroup_storage_link(storages[stype], cgrp, attach_type); +} + +static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[]) +{ + enum bpf_cgroup_storage_type stype; + + for_each_cgroup_storage_type(stype) + bpf_cgroup_storage_unlink(storages[stype]); +} + /** * cgroup_bpf_release() - put references of all bpf programs and * release all cgroup bpf data @@ -37,7 +89,6 @@ static void cgroup_bpf_release(struct work_struct *work) { struct cgroup *p, *cgrp = container_of(work, struct cgroup, bpf.release_work); - enum bpf_cgroup_storage_type stype; struct bpf_prog_array *old_array; unsigned int type; @@ -50,10 +101,8 @@ static void cgroup_bpf_release(struct work_struct *work) list_for_each_entry_safe(pl, tmp, progs, node) { list_del(&pl->node); bpf_prog_put(pl->prog); - for_each_cgroup_storage_type(stype) { - bpf_cgroup_storage_unlink(pl->storage[stype]); - bpf_cgroup_storage_free(pl->storage[stype]); - } + bpf_cgroup_storages_unlink(pl->storage); + bpf_cgroup_storages_free(pl->storage); kfree(pl); static_branch_dec(&cgroup_bpf_enabled_key); } @@ -138,7 +187,7 @@ static int compute_effective_progs(struct cgroup *cgrp, enum bpf_attach_type type, struct bpf_prog_array **array) { - enum bpf_cgroup_storage_type stype; + struct bpf_prog_array_item *item; struct bpf_prog_array *progs; struct bpf_prog_list *pl; struct cgroup *p = cgrp; @@ -166,10 +215,10 @@ static int compute_effective_progs(struct cgroup *cgrp, if (!pl->prog) continue; - progs->items[cnt].prog = pl->prog; - for_each_cgroup_storage_type(stype) - progs->items[cnt].cgroup_storage[stype] = - pl->storage[stype]; + item = &progs->items[cnt]; + item->prog = pl->prog; + bpf_cgroup_storages_assign(item->cgroup_storage, + pl->storage); cnt++; } } while ((p = cgroup_parent(p))); @@ -305,7 +354,6 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE], *old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {NULL}; struct bpf_prog_list *pl, *replace_pl = NULL; - enum bpf_cgroup_storage_type stype; int err; if (((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) || @@ -341,37 +389,25 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, replace_pl = list_first_entry(progs, typeof(*pl), node); } - for_each_cgroup_storage_type(stype) { - storage[stype] = bpf_cgroup_storage_alloc(prog, stype); - if (IS_ERR(storage[stype])) { - storage[stype] = NULL; - for_each_cgroup_storage_type(stype) - bpf_cgroup_storage_free(storage[stype]); - return -ENOMEM; - } - } + if (bpf_cgroup_storages_alloc(storage, prog)) + return -ENOMEM; if (replace_pl) { pl = replace_pl; old_prog = pl->prog; - for_each_cgroup_storage_type(stype) { - old_storage[stype] = pl->storage[stype]; - bpf_cgroup_storage_unlink(old_storage[stype]); - } + bpf_cgroup_storages_unlink(pl->storage); + bpf_cgroup_storages_assign(old_storage, pl->storage); } else { pl = kmalloc(sizeof(*pl), GFP_KERNEL); if (!pl) { - for_each_cgroup_storage_type(stype) - bpf_cgroup_storage_free(storage[stype]); + bpf_cgroup_storages_free(storage); return -ENOMEM; } list_add_tail(&pl->node, progs); } pl->prog = prog; - for_each_cgroup_storage_type(stype) - pl->storage[stype] = storage[stype]; - + bpf_cgroup_storages_assign(pl->storage, storage); cgrp->bpf.flags[type] = saved_flags; err = update_effective_progs(cgrp, type); @@ -379,27 +415,20 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, goto cleanup; static_branch_inc(&cgroup_bpf_enabled_key); - for_each_cgroup_storage_type(stype) { - if (!old_storage[stype]) - continue; - bpf_cgroup_storage_free(old_storage[stype]); - } + bpf_cgroup_storages_free(old_storage); if (old_prog) { bpf_prog_put(old_prog); static_branch_dec(&cgroup_bpf_enabled_key); } - for_each_cgroup_storage_type(stype) - bpf_cgroup_storage_link(storage[stype], cgrp, type); + bpf_cgroup_storages_link(storage, cgrp, type); return 0; cleanup: /* and cleanup the prog list */ pl->prog = old_prog; - for_each_cgroup_storage_type(stype) { - bpf_cgroup_storage_free(pl->storage[stype]); - pl->storage[stype] = old_storage[stype]; - bpf_cgroup_storage_link(old_storage[stype], cgrp, type); - } + bpf_cgroup_storages_free(pl->storage); + bpf_cgroup_storages_assign(pl->storage, old_storage); + bpf_cgroup_storages_link(pl->storage, cgrp, type); if (!replace_pl) { list_del(&pl->node); kfree(pl); @@ -420,7 +449,6 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, enum bpf_attach_type type) { struct list_head *progs = &cgrp->bpf.progs[type]; - enum bpf_cgroup_storage_type stype; u32 flags = cgrp->bpf.flags[type]; struct bpf_prog *old_prog = NULL; struct bpf_prog_list *pl; @@ -467,10 +495,8 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, /* now can actually delete it from this cgroup list */ list_del(&pl->node); - for_each_cgroup_storage_type(stype) { - bpf_cgroup_storage_unlink(pl->storage[stype]); - bpf_cgroup_storage_free(pl->storage[stype]); - } + bpf_cgroup_storages_unlink(pl->storage); + bpf_cgroup_storages_free(pl->storage); kfree(pl); if (list_empty(progs)) /* last program was detached, reset flags to zero */ -- cgit v1.2.3-71-gd317 From e28784e3781e19f546bd2c2cd7c1c4e7c54e7f73 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Tue, 24 Mar 2020 23:57:42 -0700 Subject: bpf: Factor out attach_type to prog_type mapping for attach/detach Factor out logic mapping expected program attach type to program type and subsequent handling of program attach/detach. Also list out all supported cgroup BPF program types explicitly to prevent accidental bugs once more program types are added to a mapping. Do the same for prog_query API. Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200325065746.640559-3-andriin@fb.com --- kernel/bpf/syscall.c | 153 ++++++++++++++++++++++----------------------------- 1 file changed, 66 insertions(+), 87 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 85567a6ea5f9..fd4181939064 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2535,36 +2535,18 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog, } } -#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd - -#define BPF_F_ATTACH_MASK \ - (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE) - -static int bpf_prog_attach(const union bpf_attr *attr) +static enum bpf_prog_type +attach_type_to_prog_type(enum bpf_attach_type attach_type) { - enum bpf_prog_type ptype; - struct bpf_prog *prog; - int ret; - - if (!capable(CAP_NET_ADMIN)) - return -EPERM; - - if (CHECK_ATTR(BPF_PROG_ATTACH)) - return -EINVAL; - - if (attr->attach_flags & ~BPF_F_ATTACH_MASK) - return -EINVAL; - - switch (attr->attach_type) { + switch (attach_type) { case BPF_CGROUP_INET_INGRESS: case BPF_CGROUP_INET_EGRESS: - ptype = BPF_PROG_TYPE_CGROUP_SKB; + return BPF_PROG_TYPE_CGROUP_SKB; break; case BPF_CGROUP_INET_SOCK_CREATE: case BPF_CGROUP_INET4_POST_BIND: case BPF_CGROUP_INET6_POST_BIND: - ptype = BPF_PROG_TYPE_CGROUP_SOCK; - break; + return BPF_PROG_TYPE_CGROUP_SOCK; case BPF_CGROUP_INET4_BIND: case BPF_CGROUP_INET6_BIND: case BPF_CGROUP_INET4_CONNECT: @@ -2573,37 +2555,53 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_CGROUP_UDP6_SENDMSG: case BPF_CGROUP_UDP4_RECVMSG: case BPF_CGROUP_UDP6_RECVMSG: - ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR; - break; + return BPF_PROG_TYPE_CGROUP_SOCK_ADDR; case BPF_CGROUP_SOCK_OPS: - ptype = BPF_PROG_TYPE_SOCK_OPS; - break; + return BPF_PROG_TYPE_SOCK_OPS; case BPF_CGROUP_DEVICE: - ptype = BPF_PROG_TYPE_CGROUP_DEVICE; - break; + return BPF_PROG_TYPE_CGROUP_DEVICE; case BPF_SK_MSG_VERDICT: - ptype = BPF_PROG_TYPE_SK_MSG; - break; + return BPF_PROG_TYPE_SK_MSG; case BPF_SK_SKB_STREAM_PARSER: case BPF_SK_SKB_STREAM_VERDICT: - ptype = BPF_PROG_TYPE_SK_SKB; - break; + return BPF_PROG_TYPE_SK_SKB; case BPF_LIRC_MODE2: - ptype = BPF_PROG_TYPE_LIRC_MODE2; - break; + return BPF_PROG_TYPE_LIRC_MODE2; case BPF_FLOW_DISSECTOR: - ptype = BPF_PROG_TYPE_FLOW_DISSECTOR; - break; + return BPF_PROG_TYPE_FLOW_DISSECTOR; case BPF_CGROUP_SYSCTL: - ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; - break; + return BPF_PROG_TYPE_CGROUP_SYSCTL; case BPF_CGROUP_GETSOCKOPT: case BPF_CGROUP_SETSOCKOPT: - ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT; - break; + return BPF_PROG_TYPE_CGROUP_SOCKOPT; default: - return -EINVAL; + return BPF_PROG_TYPE_UNSPEC; } +} + +#define BPF_PROG_ATTACH_LAST_FIELD replace_bpf_fd + +#define BPF_F_ATTACH_MASK \ + (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI | BPF_F_REPLACE) + +static int bpf_prog_attach(const union bpf_attr *attr) +{ + enum bpf_prog_type ptype; + struct bpf_prog *prog; + int ret; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (CHECK_ATTR(BPF_PROG_ATTACH)) + return -EINVAL; + + if (attr->attach_flags & ~BPF_F_ATTACH_MASK) + return -EINVAL; + + ptype = attach_type_to_prog_type(attr->attach_type); + if (ptype == BPF_PROG_TYPE_UNSPEC) + return -EINVAL; prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype); if (IS_ERR(prog)) @@ -2625,8 +2623,17 @@ static int bpf_prog_attach(const union bpf_attr *attr) case BPF_PROG_TYPE_FLOW_DISSECTOR: ret = skb_flow_dissector_bpf_prog_attach(attr, prog); break; - default: + case BPF_PROG_TYPE_CGROUP_DEVICE: + case BPF_PROG_TYPE_CGROUP_SKB: + case BPF_PROG_TYPE_CGROUP_SOCK: + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + case BPF_PROG_TYPE_CGROUP_SOCKOPT: + case BPF_PROG_TYPE_CGROUP_SYSCTL: + case BPF_PROG_TYPE_SOCK_OPS: ret = cgroup_bpf_prog_attach(attr, ptype, prog); + break; + default: + ret = -EINVAL; } if (ret) @@ -2646,53 +2653,27 @@ static int bpf_prog_detach(const union bpf_attr *attr) if (CHECK_ATTR(BPF_PROG_DETACH)) return -EINVAL; - switch (attr->attach_type) { - case BPF_CGROUP_INET_INGRESS: - case BPF_CGROUP_INET_EGRESS: - ptype = BPF_PROG_TYPE_CGROUP_SKB; - break; - case BPF_CGROUP_INET_SOCK_CREATE: - case BPF_CGROUP_INET4_POST_BIND: - case BPF_CGROUP_INET6_POST_BIND: - ptype = BPF_PROG_TYPE_CGROUP_SOCK; - break; - case BPF_CGROUP_INET4_BIND: - case BPF_CGROUP_INET6_BIND: - case BPF_CGROUP_INET4_CONNECT: - case BPF_CGROUP_INET6_CONNECT: - case BPF_CGROUP_UDP4_SENDMSG: - case BPF_CGROUP_UDP6_SENDMSG: - case BPF_CGROUP_UDP4_RECVMSG: - case BPF_CGROUP_UDP6_RECVMSG: - ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR; - break; - case BPF_CGROUP_SOCK_OPS: - ptype = BPF_PROG_TYPE_SOCK_OPS; - break; - case BPF_CGROUP_DEVICE: - ptype = BPF_PROG_TYPE_CGROUP_DEVICE; - break; - case BPF_SK_MSG_VERDICT: - return sock_map_get_from_fd(attr, NULL); - case BPF_SK_SKB_STREAM_PARSER: - case BPF_SK_SKB_STREAM_VERDICT: + ptype = attach_type_to_prog_type(attr->attach_type); + + switch (ptype) { + case BPF_PROG_TYPE_SK_MSG: + case BPF_PROG_TYPE_SK_SKB: return sock_map_get_from_fd(attr, NULL); - case BPF_LIRC_MODE2: + case BPF_PROG_TYPE_LIRC_MODE2: return lirc_prog_detach(attr); - case BPF_FLOW_DISSECTOR: + case BPF_PROG_TYPE_FLOW_DISSECTOR: return skb_flow_dissector_bpf_prog_detach(attr); - case BPF_CGROUP_SYSCTL: - ptype = BPF_PROG_TYPE_CGROUP_SYSCTL; - break; - case BPF_CGROUP_GETSOCKOPT: - case BPF_CGROUP_SETSOCKOPT: - ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT; - break; + case BPF_PROG_TYPE_CGROUP_DEVICE: + case BPF_PROG_TYPE_CGROUP_SKB: + case BPF_PROG_TYPE_CGROUP_SOCK: + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + case BPF_PROG_TYPE_CGROUP_SOCKOPT: + case BPF_PROG_TYPE_CGROUP_SYSCTL: + case BPF_PROG_TYPE_SOCK_OPS: + return cgroup_bpf_prog_detach(attr, ptype); default: return -EINVAL; } - - return cgroup_bpf_prog_detach(attr, ptype); } #define BPF_PROG_QUERY_LAST_FIELD query.prog_cnt @@ -2726,7 +2707,7 @@ static int bpf_prog_query(const union bpf_attr *attr, case BPF_CGROUP_SYSCTL: case BPF_CGROUP_GETSOCKOPT: case BPF_CGROUP_SETSOCKOPT: - break; + return cgroup_bpf_prog_query(attr, uattr); case BPF_LIRC_MODE2: return lirc_prog_query(attr, uattr); case BPF_FLOW_DISSECTOR: @@ -2734,8 +2715,6 @@ static int bpf_prog_query(const union bpf_attr *attr, default: return -EINVAL; } - - return cgroup_bpf_prog_query(attr, uattr); } #define BPF_PROG_TEST_RUN_LAST_FIELD test.ctx_out -- cgit v1.2.3-71-gd317 From f54a5bba120398e4d404e9553e6b92e6822eade0 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Thu, 26 Mar 2020 11:16:13 +0800 Subject: bpf: Remove unused vairable 'bpf_xdp_link_lops' kernel/bpf/syscall.c:2263:34: warning: 'bpf_xdp_link_lops' defined but not used [-Wunused-const-variable=] static const struct bpf_link_ops bpf_xdp_link_lops; ^~~~~~~~~~~~~~~~~ commit 70ed506c3bbc ("bpf: Introduce pinnable bpf_link abstraction") involded this unused variable, remove it. Reported-by: Hulk Robot Signed-off-by: YueHaibing Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20200326031613.19372-1-yuehaibing@huawei.com --- kernel/bpf/syscall.c | 1 - 1 file changed, 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index fd4181939064..b2584b25748c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2252,7 +2252,6 @@ static int bpf_link_release(struct inode *inode, struct file *filp) #ifdef CONFIG_PROC_FS static const struct bpf_link_ops bpf_raw_tp_lops; static const struct bpf_link_ops bpf_tracing_link_lops; -static const struct bpf_link_ops bpf_xdp_link_lops; static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp) { -- cgit v1.2.3-71-gd317 From 96aaab686505c449e24d76e76507290dcc30e008 Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Wed, 25 Mar 2020 21:45:28 +0900 Subject: perf/core: Add PERF_RECORD_CGROUP event To support cgroup tracking, add CGROUP event to save a link between cgroup path and id number. This is needed since cgroups can go away when userspace tries to read the cgroup info (from the id) later. The attr.cgroup bit was also added to enable cgroup tracking from userspace. This event will be generated when a new cgroup becomes active. Userspace might need to synthesize those events for existing cgroups. Committer testing: From the resulting kernel, using /sys/kernel/btf/vmlinux: $ pahole perf_event_attr | grep -w cgroup -B5 -A1 __u64 write_backward:1; /* 40:27 8 */ __u64 namespaces:1; /* 40:28 8 */ __u64 ksymbol:1; /* 40:29 8 */ __u64 bpf_event:1; /* 40:30 8 */ __u64 aux_output:1; /* 40:31 8 */ __u64 cgroup:1; /* 40:32 8 */ __u64 __reserved_1:31; /* 40:33 8 */ $ Reported-by: kbuild test robot Signed-off-by: Namhyung Kim Tested-by: Arnaldo Carvalho de Melo Acked-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo [staticize perf_event_cgroup function] Cc: Adrian Hunter Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Johannes Weiner Cc: Mark Rutland Cc: Zefan Li Link: http://lore.kernel.org/lkml/20200325124536.2800725-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- include/uapi/linux/perf_event.h | 13 ++++- kernel/events/core.c | 111 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 123 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 397cfd65b3fe..de95f6c7b273 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -381,7 +381,8 @@ struct perf_event_attr { ksymbol : 1, /* include ksymbol events */ bpf_event : 1, /* include bpf events */ aux_output : 1, /* generate AUX records instead of events */ - __reserved_1 : 32; + cgroup : 1, /* include cgroup events */ + __reserved_1 : 31; union { __u32 wakeup_events; /* wakeup every n events */ @@ -1012,6 +1013,16 @@ enum perf_event_type { */ PERF_RECORD_BPF_EVENT = 18, + /* + * struct { + * struct perf_event_header header; + * u64 id; + * char path[]; + * struct sample_id sample_id; + * }; + */ + PERF_RECORD_CGROUP = 19, + PERF_RECORD_MAX, /* non-ABI */ }; diff --git a/kernel/events/core.c b/kernel/events/core.c index d22e4ba59dfa..994932d5e474 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -387,6 +387,7 @@ static atomic_t nr_freq_events __read_mostly; static atomic_t nr_switch_events __read_mostly; static atomic_t nr_ksymbol_events __read_mostly; static atomic_t nr_bpf_events __read_mostly; +static atomic_t nr_cgroup_events __read_mostly; static LIST_HEAD(pmus); static DEFINE_MUTEX(pmus_lock); @@ -4608,6 +4609,8 @@ static void unaccount_event(struct perf_event *event) atomic_dec(&nr_comm_events); if (event->attr.namespaces) atomic_dec(&nr_namespaces_events); + if (event->attr.cgroup) + atomic_dec(&nr_cgroup_events); if (event->attr.task) atomic_dec(&nr_task_events); if (event->attr.freq) @@ -7735,6 +7738,105 @@ void perf_event_namespaces(struct task_struct *task) NULL); } +/* + * cgroup tracking + */ +#ifdef CONFIG_CGROUP_PERF + +struct perf_cgroup_event { + char *path; + int path_size; + struct { + struct perf_event_header header; + u64 id; + char path[]; + } event_id; +}; + +static int perf_event_cgroup_match(struct perf_event *event) +{ + return event->attr.cgroup; +} + +static void perf_event_cgroup_output(struct perf_event *event, void *data) +{ + struct perf_cgroup_event *cgroup_event = data; + struct perf_output_handle handle; + struct perf_sample_data sample; + u16 header_size = cgroup_event->event_id.header.size; + int ret; + + if (!perf_event_cgroup_match(event)) + return; + + perf_event_header__init_id(&cgroup_event->event_id.header, + &sample, event); + ret = perf_output_begin(&handle, event, + cgroup_event->event_id.header.size); + if (ret) + goto out; + + perf_output_put(&handle, cgroup_event->event_id); + __output_copy(&handle, cgroup_event->path, cgroup_event->path_size); + + perf_event__output_id_sample(event, &handle, &sample); + + perf_output_end(&handle); +out: + cgroup_event->event_id.header.size = header_size; +} + +static void perf_event_cgroup(struct cgroup *cgrp) +{ + struct perf_cgroup_event cgroup_event; + char path_enomem[16] = "//enomem"; + char *pathname; + size_t size; + + if (!atomic_read(&nr_cgroup_events)) + return; + + cgroup_event = (struct perf_cgroup_event){ + .event_id = { + .header = { + .type = PERF_RECORD_CGROUP, + .misc = 0, + .size = sizeof(cgroup_event.event_id), + }, + .id = cgroup_id(cgrp), + }, + }; + + pathname = kmalloc(PATH_MAX, GFP_KERNEL); + if (pathname == NULL) { + cgroup_event.path = path_enomem; + } else { + /* just to be sure to have enough space for alignment */ + cgroup_path(cgrp, pathname, PATH_MAX - sizeof(u64)); + cgroup_event.path = pathname; + } + + /* + * Since our buffer works in 8 byte units we need to align our string + * size to a multiple of 8. However, we must guarantee the tail end is + * zero'd out to avoid leaking random bits to userspace. + */ + size = strlen(cgroup_event.path) + 1; + while (!IS_ALIGNED(size, sizeof(u64))) + cgroup_event.path[size++] = '\0'; + + cgroup_event.event_id.header.size += size; + cgroup_event.path_size = size; + + perf_iterate_sb(perf_event_cgroup_output, + &cgroup_event, + NULL); + + kfree(pathname); +} + +#endif + /* * mmap tracking */ @@ -10781,6 +10883,8 @@ static void account_event(struct perf_event *event) atomic_inc(&nr_comm_events); if (event->attr.namespaces) atomic_inc(&nr_namespaces_events); + if (event->attr.cgroup) + atomic_inc(&nr_cgroup_events); if (event->attr.task) atomic_inc(&nr_task_events); if (event->attr.freq) @@ -12757,6 +12861,12 @@ static void perf_cgroup_css_free(struct cgroup_subsys_state *css) kfree(jc); } +static int perf_cgroup_css_online(struct cgroup_subsys_state *css) +{ + perf_event_cgroup(css->cgroup); + return 0; +} + static int __perf_cgroup_move(void *info) { struct task_struct *task = info; @@ -12778,6 +12888,7 @@ static void perf_cgroup_attach(struct cgroup_taskset *tset) struct cgroup_subsys perf_event_cgrp_subsys = { .css_alloc = perf_cgroup_css_alloc, .css_free = perf_cgroup_css_free, + .css_online = perf_cgroup_css_online, .attach = perf_cgroup_attach, /* * Implicitly enable on dfl hierarchy so that perf events can -- cgit v1.2.3-71-gd317 From 6546b19f95acc986807de981402bbac6b3a94b0f Mon Sep 17 00:00:00 2001 From: Namhyung Kim Date: Wed, 25 Mar 2020 21:45:29 +0900 Subject: perf/core: Add PERF_SAMPLE_CGROUP feature The PERF_SAMPLE_CGROUP bit is to save (perf_event) cgroup information in the sample. It will add a 64-bit id to identify current cgroup and it's the file handle in the cgroup file system. Userspace should use this information with PERF_RECORD_CGROUP event to match which cgroup it belongs. I put it before PERF_SAMPLE_AUX for simplicity since it just needs a 64-bit word. But if we want bigger samples, I can work on that direction too. Committer testing: $ pahole perf_sample_data | grep -w cgroup -B5 -A5 /* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */ struct perf_regs regs_intr; /* 312 16 */ /* --- cacheline 5 boundary (320 bytes) was 8 bytes ago --- */ u64 stack_user_size; /* 328 8 */ u64 phys_addr; /* 336 8 */ u64 cgroup; /* 344 8 */ /* size: 384, cachelines: 6, members: 22 */ /* padding: 32 */ }; $ Signed-off-by: Namhyung Kim Tested-by: Arnaldo Carvalho de Melo Acked-by: Peter Zijlstra (Intel) Acked-by: Tejun Heo Cc: Alexander Shishkin Cc: Jiri Olsa Cc: Johannes Weiner Cc: Mark Rutland Cc: Zefan Li Link: http://lore.kernel.org/lkml/20200325124536.2800725-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- include/linux/perf_event.h | 1 + include/uapi/linux/perf_event.h | 3 ++- init/Kconfig | 3 ++- kernel/events/core.c | 22 ++++++++++++++++++++++ 4 files changed, 27 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 8768a39b5258..9c3e7619c929 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1020,6 +1020,7 @@ struct perf_sample_data { u64 stack_user_size; u64 phys_addr; + u64 cgroup; } ____cacheline_aligned; /* default value for data source */ diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index de95f6c7b273..7b2d6fc9e6ed 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -142,8 +142,9 @@ enum perf_event_sample_format { PERF_SAMPLE_REGS_INTR = 1U << 18, PERF_SAMPLE_PHYS_ADDR = 1U << 19, PERF_SAMPLE_AUX = 1U << 20, + PERF_SAMPLE_CGROUP = 1U << 21, - PERF_SAMPLE_MAX = 1U << 21, /* non-ABI */ + PERF_SAMPLE_MAX = 1U << 22, /* non-ABI */ __PERF_SAMPLE_CALLCHAIN_EARLY = 1ULL << 63, /* non-ABI; internal use */ }; diff --git a/init/Kconfig b/init/Kconfig index 20a6ac33761c..7766b06a0038 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1027,7 +1027,8 @@ config CGROUP_PERF help This option extends the perf per-cpu mode to restrict monitoring to threads which belong to the cgroup specified and run on the - designated cpu. + designated cpu. Or this can be used to have cgroup ID in samples + so that it can monitor performance events among cgroups. Say N if unsure. diff --git a/kernel/events/core.c b/kernel/events/core.c index 994932d5e474..1569979c8912 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1862,6 +1862,9 @@ static void __perf_event_header_size(struct perf_event *event, u64 sample_type) if (sample_type & PERF_SAMPLE_PHYS_ADDR) size += sizeof(data->phys_addr); + if (sample_type & PERF_SAMPLE_CGROUP) + size += sizeof(data->cgroup); + event->header_size = size; } @@ -6867,6 +6870,9 @@ void perf_output_sample(struct perf_output_handle *handle, if (sample_type & PERF_SAMPLE_PHYS_ADDR) perf_output_put(handle, data->phys_addr); + if (sample_type & PERF_SAMPLE_CGROUP) + perf_output_put(handle, data->cgroup); + if (sample_type & PERF_SAMPLE_AUX) { perf_output_put(handle, data->aux_size); @@ -7066,6 +7072,16 @@ void perf_prepare_sample(struct perf_event_header *header, if (sample_type & PERF_SAMPLE_PHYS_ADDR) data->phys_addr = perf_virt_to_phys(data->addr); +#ifdef CONFIG_CGROUP_PERF + if (sample_type & PERF_SAMPLE_CGROUP) { + struct cgroup *cgrp; + + /* protected by RCU */ + cgrp = task_css_check(current, perf_event_cgrp_id, 1)->cgroup; + data->cgroup = cgroup_id(cgrp); + } +#endif + if (sample_type & PERF_SAMPLE_AUX) { u64 size; @@ -11264,6 +11280,12 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr, if (attr->sample_type & PERF_SAMPLE_REGS_INTR) ret = perf_reg_validate(attr->sample_regs_intr); + +#ifndef CONFIG_CGROUP_PERF + if (attr->sample_type & PERF_SAMPLE_CGROUP) + return -EINVAL; +#endif + out: return ret; -- cgit v1.2.3-71-gd317 From 07b8b10ec94f852502db739047a2803ed36ccf46 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Fri, 27 Mar 2020 16:21:22 -0400 Subject: ring-buffer: Make resize disable per cpu buffer instead of total buffer When the ring buffer becomes writable for even when the trace file is read, it must still not be resized. But since tracers can be activated while the trace file is being read, the irqsoff tracer can modify the per CPU buffers, and this can cause the reader of the trace file to update the wrong buffer's resize disable bit, as the irqsoff tracer swaps out cpu buffers. By making the resize disable per cpu_buffer, it makes the update follow the per cpu_buffer even if it's swapped out with the snapshot buffer and keeps the release of the trace file modifying the same data as the open did. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 43 ++++++++++++++++++++++++++++++------------- 1 file changed, 30 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 5979327254f9..e2de5b448c91 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -441,6 +441,7 @@ enum { struct ring_buffer_per_cpu { int cpu; atomic_t record_disabled; + atomic_t resize_disabled; struct trace_buffer *buffer; raw_spinlock_t reader_lock; /* serialize readers */ arch_spinlock_t lock; @@ -484,7 +485,6 @@ struct trace_buffer { unsigned flags; int cpus; atomic_t record_disabled; - atomic_t resize_disabled; cpumask_var_t cpumask; struct lock_class_key *reader_lock_key; @@ -1740,18 +1740,24 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, size = nr_pages * BUF_PAGE_SIZE; - /* - * Don't succeed if resizing is disabled, as a reader might be - * manipulating the ring buffer and is expecting a sane state while - * this is true. - */ - if (atomic_read(&buffer->resize_disabled)) - return -EBUSY; - /* prevent another thread from changing buffer sizes */ mutex_lock(&buffer->mutex); + if (cpu_id == RING_BUFFER_ALL_CPUS) { + /* + * Don't succeed if resizing is disabled, as a reader might be + * manipulating the ring buffer and is expecting a sane state while + * this is true. + */ + for_each_buffer_cpu(buffer, cpu) { + cpu_buffer = buffer->buffers[cpu]; + if (atomic_read(&cpu_buffer->resize_disabled)) { + err = -EBUSY; + goto out_err_unlock; + } + } + /* calculate the pages to update */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; @@ -1819,6 +1825,16 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, if (nr_pages == cpu_buffer->nr_pages) goto out; + /* + * Don't succeed if resizing is disabled, as a reader might be + * manipulating the ring buffer and is expecting a sane state while + * this is true. + */ + if (atomic_read(&cpu_buffer->resize_disabled)) { + err = -EBUSY; + goto out_err_unlock; + } + cpu_buffer->nr_pages_to_update = nr_pages - cpu_buffer->nr_pages; @@ -1888,6 +1904,7 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, free_buffer_page(bpage); } } + out_err_unlock: mutex_unlock(&buffer->mutex); return err; } @@ -4294,7 +4311,7 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags) iter->cpu_buffer = cpu_buffer; - atomic_inc(&buffer->resize_disabled); + atomic_inc(&cpu_buffer->resize_disabled); atomic_inc(&cpu_buffer->record_disabled); return iter; @@ -4369,7 +4386,7 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); atomic_dec(&cpu_buffer->record_disabled); - atomic_dec(&cpu_buffer->buffer->resize_disabled); + atomic_dec(&cpu_buffer->resize_disabled); kfree(iter->event); kfree(iter); } @@ -4474,7 +4491,7 @@ void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu) if (!cpumask_test_cpu(cpu, buffer->cpumask)) return; - atomic_inc(&buffer->resize_disabled); + atomic_inc(&cpu_buffer->resize_disabled); atomic_inc(&cpu_buffer->record_disabled); /* Make sure all commits have finished */ @@ -4495,7 +4512,7 @@ void ring_buffer_reset_cpu(struct trace_buffer *buffer, int cpu) raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); atomic_dec(&cpu_buffer->record_disabled); - atomic_dec(&buffer->resize_disabled); + atomic_dec(&cpu_buffer->resize_disabled); } EXPORT_SYMBOL_GPL(ring_buffer_reset_cpu); -- cgit v1.2.3-71-gd317 From 1039221cc2787dee51a7ffbf9b0e79d192dadf76 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:30 -0400 Subject: ring-buffer: Do not disable recording when there is an iterator Now that the iterator can handle a concurrent writer, do not disable writing to the ring buffer when there is an iterator present. Link: http://lkml.kernel.org/r/20200317213416.759770696@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index e2de5b448c91..af2f10d9f3f1 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4312,7 +4312,6 @@ ring_buffer_read_prepare(struct trace_buffer *buffer, int cpu, gfp_t flags) iter->cpu_buffer = cpu_buffer; atomic_inc(&cpu_buffer->resize_disabled); - atomic_inc(&cpu_buffer->record_disabled); return iter; } @@ -4385,7 +4384,6 @@ ring_buffer_read_finish(struct ring_buffer_iter *iter) rb_check_pages(cpu_buffer); raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags); - atomic_dec(&cpu_buffer->record_disabled); atomic_dec(&cpu_buffer->resize_disabled); kfree(iter->event); kfree(iter); -- cgit v1.2.3-71-gd317 From 06e0a548bad0f43a21e036db018e4dadb501ce8b Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:31 -0400 Subject: tracing: Do not disable tracing when reading the trace file When opening the "trace" file, it is no longer necessary to disable tracing. Note, a new option is created called "pause-on-trace", when set, will cause the trace file to emulate its original behavior. Link: http://lkml.kernel.org/r/20200317213416.903351225@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- Documentation/trace/ftrace.rst | 6 ++++++ kernel/trace/trace.c | 9 ++++++--- kernel/trace/trace.h | 1 + 3 files changed, 13 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index 99a0890e20ec..c33950a35d65 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -1125,6 +1125,12 @@ Here are the available options: the trace displays additional information about the latency, as described in "Latency trace format". + pause-on-trace + When set, opening the trace file for read, will pause + writing to the ring buffer (as if tracing_on was set to zero). + This simulates the original behavior of the trace file. + When the file is closed, tracing will be enabled again. + record-cmd When any event or tracer is enabled, a hook is enabled in the sched_switch trace point to fill comm cache diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 47889123be7f..650fa81fffe8 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4273,8 +4273,11 @@ __tracing_open(struct inode *inode, struct file *file, bool snapshot) if (trace_clocks[tr->clock_id].in_ns) iter->iter_flags |= TRACE_FILE_TIME_IN_NS; - /* stop the trace while dumping if we are not opening "snapshot" */ - if (!iter->snapshot) + /* + * If pause-on-trace is enabled, then stop the trace while + * dumping, unless this is the "snapshot" file + */ + if (!iter->snapshot && (tr->trace_flags & TRACE_ITER_PAUSE_ON_TRACE)) tracing_stop_tr(tr); if (iter->cpu_file == RING_BUFFER_ALL_CPUS) { @@ -4371,7 +4374,7 @@ static int tracing_release(struct inode *inode, struct file *file) if (iter->trace && iter->trace->close) iter->trace->close(iter); - if (!iter->snapshot) + if (!iter->snapshot && tr->stop_count) /* reenable tracing if it was previously enabled */ tracing_start_tr(tr); diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index c61e1b1c85a6..f37e05135986 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -1302,6 +1302,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf, C(IRQ_INFO, "irq-info"), \ C(MARKERS, "markers"), \ C(EVENT_FORK, "event-fork"), \ + C(PAUSE_ON_TRACE, "pause-on-trace"), \ FUNCTION_FLAGS \ FGRAPH_FLAGS \ STACK_FLAGS \ -- cgit v1.2.3-71-gd317 From c9b7a4a72ff64e67b7e877a99fd652230dc26058 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Tue, 17 Mar 2020 17:32:32 -0400 Subject: ring-buffer/tracing: Have iterator acknowledge dropped events Have the ring_buffer_iterator set a flag if events were dropped as it were to go and peek at the next event. Have the trace file display this fact if it happened with a "LOST EVENTS" message. Link: http://lkml.kernel.org/r/20200317213417.045858900@goodmis.org Signed-off-by: Steven Rostedt (VMware) --- include/linux/ring_buffer.h | 1 + kernel/trace/ring_buffer.c | 16 ++++++++++++++++ kernel/trace/trace.c | 16 ++++++++++++---- 3 files changed, 29 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 0ae603b79b0e..c76b2f3b3ac4 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -138,6 +138,7 @@ ring_buffer_iter_peek(struct ring_buffer_iter *iter, u64 *ts); void ring_buffer_iter_advance(struct ring_buffer_iter *iter); void ring_buffer_iter_reset(struct ring_buffer_iter *iter); int ring_buffer_iter_empty(struct ring_buffer_iter *iter); +bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter); unsigned long ring_buffer_size(struct trace_buffer *buffer, int cpu); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index af2f10d9f3f1..6f0b42ceeb00 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -510,6 +510,7 @@ struct ring_buffer_iter { u64 read_stamp; u64 page_stamp; struct ring_buffer_event *event; + int missed_events; }; /** @@ -1988,6 +1989,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter) iter->page_stamp = iter->read_stamp = iter->head_page->page->time_stamp; iter->head = 0; iter->next_event = 0; + iter->missed_events = 1; return NULL; } @@ -4191,6 +4193,20 @@ ring_buffer_peek(struct trace_buffer *buffer, int cpu, u64 *ts, return event; } +/** ring_buffer_iter_dropped - report if there are dropped events + * @iter: The ring buffer iterator + * + * Returns true if there was dropped events since the last peek. + */ +bool ring_buffer_iter_dropped(struct ring_buffer_iter *iter) +{ + bool ret = iter->missed_events != 0; + + iter->missed_events = 0; + return ret; +} +EXPORT_SYMBOL_GPL(ring_buffer_iter_dropped); + /** * ring_buffer_iter_peek - peek at the next event to be read * @iter: The ring buffer iterator diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 650fa81fffe8..5e634b9c1e0a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3388,11 +3388,15 @@ peek_next_entry(struct trace_iterator *iter, int cpu, u64 *ts, struct ring_buffer_event *event; struct ring_buffer_iter *buf_iter = trace_buffer_iter(iter, cpu); - if (buf_iter) + if (buf_iter) { event = ring_buffer_iter_peek(buf_iter, ts); - else + if (lost_events) + *lost_events = ring_buffer_iter_dropped(buf_iter) ? + (unsigned long)-1 : 0; + } else { event = ring_buffer_peek(iter->array_buffer->buffer, cpu, ts, lost_events); + } if (event) { iter->ent_size = ring_buffer_event_length(event); @@ -4005,8 +4009,12 @@ enum print_line_t print_trace_line(struct trace_iterator *iter) enum print_line_t ret; if (iter->lost_events) { - trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n", - iter->cpu, iter->lost_events); + if (iter->lost_events == (unsigned long)-1) + trace_seq_printf(&iter->seq, "CPU:%d [LOST EVENTS]\n", + iter->cpu); + else + trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n", + iter->cpu, iter->lost_events); if (trace_seq_has_overflowed(&iter->seq)) return TRACE_TYPE_PARTIAL_LINE; } -- cgit v1.2.3-71-gd317 From 6a13a0d7b4d1171ef9b80ad69abc37e1daa941b3 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Tue, 24 Mar 2020 16:34:48 +0900 Subject: ftrace/kprobe: Show the maxactive number on kprobe_events Show maxactive parameter on kprobe_events. This allows user to save the current configuration and restore it without losing maxactive parameter. Link: http://lkml.kernel.org/r/4762764a-6df7-bc93-ed60-e336146dce1f@gmail.com Link: http://lkml.kernel.org/r/158503528846.22706.5549974121212526020.stgit@devnote2 Cc: stable@vger.kernel.org Fixes: 696ced4fb1d76 ("tracing/kprobes: expose maxactive for kretprobe in kprobe_events") Reported-by: Taeung Song Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 362cca52f5de..d0568af4a0ef 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -1078,6 +1078,8 @@ static int trace_kprobe_show(struct seq_file *m, struct dyn_event *ev) int i; seq_putc(m, trace_kprobe_is_return(tk) ? 'r' : 'p'); + if (trace_kprobe_is_return(tk) && tk->rp.maxactive) + seq_printf(m, "%d", tk->rp.maxactive); seq_printf(m, ":%s/%s", trace_probe_group_name(&tk->tp), trace_probe_name(&tk->tp)); -- cgit v1.2.3-71-gd317 From 717e3f5ebc823e5ecfdd15155b0fd81af9fc58d6 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 19 Mar 2020 23:40:40 -0400 Subject: ftrace: Make function trace pid filtering a bit more exact The set_ftrace_pid file is used to filter function tracing to only trace tasks that are listed in that file. Instead of testing the pids listed in that file (it's a bitmask) at each function trace event, the logic is done via a sched_switch hook. A flag is set when the next task to run is in the list of pids in the set_ftrace_pid file. But the sched_switch hook is not at the exact location of when the task switches, and the flag gets set before the task to be traced actually runs. This leaves a residue of traced functions that do not belong to the pid that should be filtered on. By changing the logic slightly, where instead of having a boolean flag to test, record the pid that should be traced, with special values for not to trace and always trace. Then at each function call, a check will be made to see if the function should be ignored, or if the current pid matches the function that should be traced, and only trace if it matches (or if it has the special value to always trace). Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ftrace.c | 32 +++++++++++++++++++++++++------- kernel/trace/trace.h | 4 ++-- 2 files changed, 27 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 3f7ee102868a..34ae736cb1f8 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -139,13 +139,23 @@ static inline void ftrace_ops_init(struct ftrace_ops *ops) #endif } +#define FTRACE_PID_IGNORE -1 +#define FTRACE_PID_TRACE -2 + static void ftrace_pid_func(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *op, struct pt_regs *regs) { struct trace_array *tr = op->private; + int pid; - if (tr && this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid)) - return; + if (tr) { + pid = this_cpu_read(tr->array_buffer.data->ftrace_ignore_pid); + if (pid == FTRACE_PID_IGNORE) + return; + if (pid != FTRACE_PID_TRACE && + pid != current->pid) + return; + } op->saved_func(ip, parent_ip, op, regs); } @@ -6924,8 +6934,12 @@ ftrace_filter_pid_sched_switch_probe(void *data, bool preempt, pid_list = rcu_dereference_sched(tr->function_pids); - this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, - trace_ignore_this_task(pid_list, next)); + if (trace_ignore_this_task(pid_list, next)) + this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, + FTRACE_PID_IGNORE); + else + this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, + next->pid); } static void @@ -6978,7 +6992,7 @@ static void clear_ftrace_pids(struct trace_array *tr) unregister_trace_sched_switch(ftrace_filter_pid_sched_switch_probe, tr); for_each_possible_cpu(cpu) - per_cpu_ptr(tr->array_buffer.data, cpu)->ftrace_ignore_pid = false; + per_cpu_ptr(tr->array_buffer.data, cpu)->ftrace_ignore_pid = FTRACE_PID_TRACE; rcu_assign_pointer(tr->function_pids, NULL); @@ -7103,8 +7117,12 @@ static void ignore_task_cpu(void *data) pid_list = rcu_dereference_protected(tr->function_pids, mutex_is_locked(&ftrace_lock)); - this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, - trace_ignore_this_task(pid_list, current)); + if (trace_ignore_this_task(pid_list, current)) + this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, + FTRACE_PID_IGNORE); + else + this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, + current->pid); } static ssize_t diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index f37e05135986..fdc72f5f0bb0 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -178,10 +178,10 @@ struct trace_array_cpu { kuid_t uid; char comm[TASK_COMM_LEN]; - bool ignore_pid; #ifdef CONFIG_FUNCTION_TRACER - bool ftrace_ignore_pid; + int ftrace_ignore_pid; #endif + bool ignore_pid; }; struct tracer; -- cgit v1.2.3-71-gd317 From b3b1e6ededa4337940adba6cf06e8351056e3097 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 19 Mar 2020 23:19:06 -0400 Subject: ftrace: Create set_ftrace_notrace_pid to not trace tasks There's currently a way to select a task that should only be traced by functions, but there's no way to select a task not to be traced by the function tracer. Add a set_ftrace_notrace_pid file that acts the same as set_ftrace_pid (and is also affected by function-fork), but the task pids in this file will not be traced even if they are listed in the set_ftrace_pid file. This makes it easy for tools like trace-cmd to "hide" itself from the function tracer when it is recording other tasks. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ftrace.c | 181 ++++++++++++++++++++++++++++++++++++++------ kernel/trace/trace.c | 20 +++-- kernel/trace/trace.h | 2 + kernel/trace/trace_events.c | 12 +-- 4 files changed, 180 insertions(+), 35 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 34ae736cb1f8..7239d9acd09f 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -102,7 +102,7 @@ static bool ftrace_pids_enabled(struct ftrace_ops *ops) tr = ops->private; - return tr->function_pids != NULL; + return tr->function_pids != NULL || tr->function_no_pids != NULL; } static void ftrace_update_trampoline(struct ftrace_ops *ops); @@ -6931,10 +6931,12 @@ ftrace_filter_pid_sched_switch_probe(void *data, bool preempt, { struct trace_array *tr = data; struct trace_pid_list *pid_list; + struct trace_pid_list *no_pid_list; pid_list = rcu_dereference_sched(tr->function_pids); + no_pid_list = rcu_dereference_sched(tr->function_no_pids); - if (trace_ignore_this_task(pid_list, next)) + if (trace_ignore_this_task(pid_list, no_pid_list, next)) this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, FTRACE_PID_IGNORE); else @@ -6952,6 +6954,9 @@ ftrace_pid_follow_sched_process_fork(void *data, pid_list = rcu_dereference_sched(tr->function_pids); trace_filter_add_remove_task(pid_list, self, task); + + pid_list = rcu_dereference_sched(tr->function_no_pids); + trace_filter_add_remove_task(pid_list, self, task); } static void @@ -6962,6 +6967,9 @@ ftrace_pid_follow_sched_process_exit(void *data, struct task_struct *task) pid_list = rcu_dereference_sched(tr->function_pids); trace_filter_add_remove_task(pid_list, NULL, task); + + pid_list = rcu_dereference_sched(tr->function_no_pids); + trace_filter_add_remove_task(pid_list, NULL, task); } void ftrace_pid_follow_fork(struct trace_array *tr, bool enable) @@ -6979,42 +6987,64 @@ void ftrace_pid_follow_fork(struct trace_array *tr, bool enable) } } -static void clear_ftrace_pids(struct trace_array *tr) +enum { + TRACE_PIDS = BIT(0), + TRACE_NO_PIDS = BIT(1), +}; + +static void clear_ftrace_pids(struct trace_array *tr, int type) { struct trace_pid_list *pid_list; + struct trace_pid_list *no_pid_list; int cpu; pid_list = rcu_dereference_protected(tr->function_pids, lockdep_is_held(&ftrace_lock)); - if (!pid_list) + no_pid_list = rcu_dereference_protected(tr->function_no_pids, + lockdep_is_held(&ftrace_lock)); + + /* Make sure there's something to do */ + if (!(((type & TRACE_PIDS) && pid_list) || + ((type & TRACE_NO_PIDS) && no_pid_list))) return; - unregister_trace_sched_switch(ftrace_filter_pid_sched_switch_probe, tr); + /* See if the pids still need to be checked after this */ + if (!((!(type & TRACE_PIDS) && pid_list) || + (!(type & TRACE_NO_PIDS) && no_pid_list))) { + unregister_trace_sched_switch(ftrace_filter_pid_sched_switch_probe, tr); + for_each_possible_cpu(cpu) + per_cpu_ptr(tr->array_buffer.data, cpu)->ftrace_ignore_pid = FTRACE_PID_TRACE; + } - for_each_possible_cpu(cpu) - per_cpu_ptr(tr->array_buffer.data, cpu)->ftrace_ignore_pid = FTRACE_PID_TRACE; + if (type & TRACE_PIDS) + rcu_assign_pointer(tr->function_pids, NULL); - rcu_assign_pointer(tr->function_pids, NULL); + if (type & TRACE_NO_PIDS) + rcu_assign_pointer(tr->function_no_pids, NULL); /* Wait till all users are no longer using pid filtering */ synchronize_rcu(); - trace_free_pid_list(pid_list); + if ((type & TRACE_PIDS) && pid_list) + trace_free_pid_list(pid_list); + + if ((type & TRACE_NO_PIDS) && no_pid_list) + trace_free_pid_list(no_pid_list); } void ftrace_clear_pids(struct trace_array *tr) { mutex_lock(&ftrace_lock); - clear_ftrace_pids(tr); + clear_ftrace_pids(tr, TRACE_PIDS | TRACE_NO_PIDS); mutex_unlock(&ftrace_lock); } -static void ftrace_pid_reset(struct trace_array *tr) +static void ftrace_pid_reset(struct trace_array *tr, int type) { mutex_lock(&ftrace_lock); - clear_ftrace_pids(tr); + clear_ftrace_pids(tr, type); ftrace_update_pid_func(); ftrace_startup_all(0); @@ -7078,9 +7108,45 @@ static const struct seq_operations ftrace_pid_sops = { .show = fpid_show, }; -static int -ftrace_pid_open(struct inode *inode, struct file *file) +static void *fnpid_start(struct seq_file *m, loff_t *pos) + __acquires(RCU) +{ + struct trace_pid_list *pid_list; + struct trace_array *tr = m->private; + + mutex_lock(&ftrace_lock); + rcu_read_lock_sched(); + + pid_list = rcu_dereference_sched(tr->function_no_pids); + + if (!pid_list) + return !(*pos) ? FTRACE_NO_PIDS : NULL; + + return trace_pid_start(pid_list, pos); +} + +static void *fnpid_next(struct seq_file *m, void *v, loff_t *pos) { + struct trace_array *tr = m->private; + struct trace_pid_list *pid_list = rcu_dereference_sched(tr->function_no_pids); + + if (v == FTRACE_NO_PIDS) { + (*pos)++; + return NULL; + } + return trace_pid_next(pid_list, v, pos); +} + +static const struct seq_operations ftrace_no_pid_sops = { + .start = fnpid_start, + .next = fnpid_next, + .stop = fpid_stop, + .show = fpid_show, +}; + +static int pid_open(struct inode *inode, struct file *file, int type) +{ + const struct seq_operations *seq_ops; struct trace_array *tr = inode->i_private; struct seq_file *m; int ret = 0; @@ -7091,9 +7157,18 @@ ftrace_pid_open(struct inode *inode, struct file *file) if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) - ftrace_pid_reset(tr); + ftrace_pid_reset(tr, type); + + switch (type) { + case TRACE_PIDS: + seq_ops = &ftrace_pid_sops; + break; + case TRACE_NO_PIDS: + seq_ops = &ftrace_no_pid_sops; + break; + } - ret = seq_open(file, &ftrace_pid_sops); + ret = seq_open(file, seq_ops); if (ret < 0) { trace_array_put(tr); } else { @@ -7105,10 +7180,23 @@ ftrace_pid_open(struct inode *inode, struct file *file) return ret; } +static int +ftrace_pid_open(struct inode *inode, struct file *file) +{ + return pid_open(inode, file, TRACE_PIDS); +} + +static int +ftrace_no_pid_open(struct inode *inode, struct file *file) +{ + return pid_open(inode, file, TRACE_NO_PIDS); +} + static void ignore_task_cpu(void *data) { struct trace_array *tr = data; struct trace_pid_list *pid_list; + struct trace_pid_list *no_pid_list; /* * This function is called by on_each_cpu() while the @@ -7116,8 +7204,10 @@ static void ignore_task_cpu(void *data) */ pid_list = rcu_dereference_protected(tr->function_pids, mutex_is_locked(&ftrace_lock)); + no_pid_list = rcu_dereference_protected(tr->function_no_pids, + mutex_is_locked(&ftrace_lock)); - if (trace_ignore_this_task(pid_list, current)) + if (trace_ignore_this_task(pid_list, no_pid_list, current)) this_cpu_write(tr->array_buffer.data->ftrace_ignore_pid, FTRACE_PID_IGNORE); else @@ -7126,12 +7216,13 @@ static void ignore_task_cpu(void *data) } static ssize_t -ftrace_pid_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos) +pid_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos, int type) { struct seq_file *m = filp->private_data; struct trace_array *tr = m->private; - struct trace_pid_list *filtered_pids = NULL; + struct trace_pid_list *filtered_pids; + struct trace_pid_list *other_pids; struct trace_pid_list *pid_list; ssize_t ret; @@ -7140,19 +7231,39 @@ ftrace_pid_write(struct file *filp, const char __user *ubuf, mutex_lock(&ftrace_lock); - filtered_pids = rcu_dereference_protected(tr->function_pids, + switch (type) { + case TRACE_PIDS: + filtered_pids = rcu_dereference_protected(tr->function_pids, lockdep_is_held(&ftrace_lock)); + other_pids = rcu_dereference_protected(tr->function_no_pids, + lockdep_is_held(&ftrace_lock)); + break; + case TRACE_NO_PIDS: + filtered_pids = rcu_dereference_protected(tr->function_no_pids, + lockdep_is_held(&ftrace_lock)); + other_pids = rcu_dereference_protected(tr->function_pids, + lockdep_is_held(&ftrace_lock)); + break; + } ret = trace_pid_write(filtered_pids, &pid_list, ubuf, cnt); if (ret < 0) goto out; - rcu_assign_pointer(tr->function_pids, pid_list); + switch (type) { + case TRACE_PIDS: + rcu_assign_pointer(tr->function_pids, pid_list); + break; + case TRACE_NO_PIDS: + rcu_assign_pointer(tr->function_no_pids, pid_list); + break; + } + if (filtered_pids) { synchronize_rcu(); trace_free_pid_list(filtered_pids); - } else if (pid_list) { + } else if (pid_list && !other_pids) { /* Register a probe to set whether to ignore the tracing of a task */ register_trace_sched_switch(ftrace_filter_pid_sched_switch_probe, tr); } @@ -7175,6 +7286,20 @@ ftrace_pid_write(struct file *filp, const char __user *ubuf, return ret; } +static ssize_t +ftrace_pid_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return pid_write(filp, ubuf, cnt, ppos, TRACE_PIDS); +} + +static ssize_t +ftrace_no_pid_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return pid_write(filp, ubuf, cnt, ppos, TRACE_NO_PIDS); +} + static int ftrace_pid_release(struct inode *inode, struct file *file) { @@ -7193,10 +7318,20 @@ static const struct file_operations ftrace_pid_fops = { .release = ftrace_pid_release, }; +static const struct file_operations ftrace_no_pid_fops = { + .open = ftrace_no_pid_open, + .write = ftrace_no_pid_write, + .read = seq_read, + .llseek = tracing_lseek, + .release = ftrace_pid_release, +}; + void ftrace_init_tracefs(struct trace_array *tr, struct dentry *d_tracer) { trace_create_file("set_ftrace_pid", 0644, d_tracer, tr, &ftrace_pid_fops); + trace_create_file("set_ftrace_notrace_pid", 0644, d_tracer, + tr, &ftrace_no_pid_fops); } void __init ftrace_init_tracefs_toplevel(struct trace_array *tr, diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 5e634b9c1e0a..6519b7afc499 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -386,16 +386,22 @@ trace_find_filtered_pid(struct trace_pid_list *filtered_pids, pid_t search_pid) * Returns false if @task should be traced. */ bool -trace_ignore_this_task(struct trace_pid_list *filtered_pids, struct task_struct *task) +trace_ignore_this_task(struct trace_pid_list *filtered_pids, + struct trace_pid_list *filtered_no_pids, + struct task_struct *task) { /* - * Return false, because if filtered_pids does not exist, - * all pids are good to trace. + * If filterd_no_pids is not empty, and the task's pid is listed + * in filtered_no_pids, then return true. + * Otherwise, if filtered_pids is empty, that means we can + * trace all tasks. If it has content, then only trace pids + * within filtered_pids. */ - if (!filtered_pids) - return false; - return !trace_find_filtered_pid(filtered_pids, task->pid); + return (filtered_pids && + !trace_find_filtered_pid(filtered_pids, task->pid)) || + (filtered_no_pids && + trace_find_filtered_pid(filtered_no_pids, task->pid)); } /** @@ -5013,6 +5019,8 @@ static const char readme_msg[] = #ifdef CONFIG_FUNCTION_TRACER " set_ftrace_pid\t- Write pid(s) to only function trace those pids\n" "\t\t (function)\n" + " set_ftrace_notrace_pid\t- Write pid(s) to not function trace those pids\n" + "\t\t (function)\n" #endif #ifdef CONFIG_FUNCTION_GRAPH_TRACER " set_graph_function\t- Trace the nested calls of a function (function_graph)\n" diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index fdc72f5f0bb0..6b5ff5adb4ad 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -331,6 +331,7 @@ struct trace_array { #ifdef CONFIG_FUNCTION_TRACER struct ftrace_ops *ops; struct trace_pid_list __rcu *function_pids; + struct trace_pid_list __rcu *function_no_pids; #ifdef CONFIG_DYNAMIC_FTRACE /* All of these are protected by the ftrace_lock */ struct list_head func_probes; @@ -782,6 +783,7 @@ extern int pid_max; bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids, pid_t search_pid); bool trace_ignore_this_task(struct trace_pid_list *filtered_pids, + struct trace_pid_list *filtered_no_pids, struct task_struct *task); void trace_filter_add_remove_task(struct trace_pid_list *pid_list, struct task_struct *self, diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index f38234ecea18..c196d0dc5871 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -549,8 +549,8 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt, pid_list = rcu_dereference_sched(tr->filtered_pids); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, prev) && - trace_ignore_this_task(pid_list, next)); + trace_ignore_this_task(pid_list, NULL, prev) && + trace_ignore_this_task(pid_list, NULL, next)); } static void @@ -563,7 +563,7 @@ event_filter_pid_sched_switch_probe_post(void *data, bool preempt, pid_list = rcu_dereference_sched(tr->filtered_pids); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, next)); + trace_ignore_this_task(pid_list, NULL, next)); } static void @@ -579,7 +579,7 @@ event_filter_pid_sched_wakeup_probe_pre(void *data, struct task_struct *task) pid_list = rcu_dereference_sched(tr->filtered_pids); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, task)); + trace_ignore_this_task(pid_list, NULL, task)); } static void @@ -596,7 +596,7 @@ event_filter_pid_sched_wakeup_probe_post(void *data, struct task_struct *task) /* Set tracing if current is enabled */ this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, current)); + trace_ignore_this_task(pid_list, NULL, current)); } static void __ftrace_clear_event_pids(struct trace_array *tr) @@ -1597,7 +1597,7 @@ static void ignore_task_cpu(void *data) mutex_is_locked(&event_mutex)); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, current)); + trace_ignore_this_task(pid_list, NULL, current)); } static ssize_t -- cgit v1.2.3-71-gd317 From 2768362603018da2be44ae4d01f22406152db05a Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 25 Mar 2020 19:51:19 -0400 Subject: tracing: Create set_event_notrace_pid to not trace tasks There's currently a way to select a task that should only have its events traced, but there's no way to select a task not to have itsevents traced. Add a set_event_notrace_pid file that acts the same as set_event_pid (and is also affected by event-fork), but the task pids in this file will not be traced even if they are listed in the set_event_pid file. This makes it easy for tools like trace-cmd to "hide" itself from beint traced by events when it is recording other tasks. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ftrace.c | 11 +- kernel/trace/trace.h | 25 ++++ kernel/trace/trace_events.c | 280 ++++++++++++++++++++++++++++++++++---------- 3 files changed, 244 insertions(+), 72 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 7239d9acd09f..0bb62e64280c 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -6987,11 +6987,6 @@ void ftrace_pid_follow_fork(struct trace_array *tr, bool enable) } } -enum { - TRACE_PIDS = BIT(0), - TRACE_NO_PIDS = BIT(1), -}; - static void clear_ftrace_pids(struct trace_array *tr, int type) { struct trace_pid_list *pid_list; @@ -7004,13 +6999,11 @@ static void clear_ftrace_pids(struct trace_array *tr, int type) lockdep_is_held(&ftrace_lock)); /* Make sure there's something to do */ - if (!(((type & TRACE_PIDS) && pid_list) || - ((type & TRACE_NO_PIDS) && no_pid_list))) + if (!pid_type_enabled(type, pid_list, no_pid_list)) return; /* See if the pids still need to be checked after this */ - if (!((!(type & TRACE_PIDS) && pid_list) || - (!(type & TRACE_NO_PIDS) && no_pid_list))) { + if (!still_need_pid_events(type, pid_list, no_pid_list)) { unregister_trace_sched_switch(ftrace_filter_pid_sched_switch_probe, tr); for_each_possible_cpu(cpu) per_cpu_ptr(tr->array_buffer.data, cpu)->ftrace_ignore_pid = FTRACE_PID_TRACE; diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index 6b5ff5adb4ad..4eb1d004d5f2 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -207,6 +207,30 @@ struct trace_pid_list { unsigned long *pids; }; +enum { + TRACE_PIDS = BIT(0), + TRACE_NO_PIDS = BIT(1), +}; + +static inline bool pid_type_enabled(int type, struct trace_pid_list *pid_list, + struct trace_pid_list *no_pid_list) +{ + /* Return true if the pid list in type has pids */ + return ((type & TRACE_PIDS) && pid_list) || + ((type & TRACE_NO_PIDS) && no_pid_list); +} + +static inline bool still_need_pid_events(int type, struct trace_pid_list *pid_list, + struct trace_pid_list *no_pid_list) +{ + /* + * Turning off what is in @type, return true if the "other" + * pid list, still has pids in it. + */ + return (!(type & TRACE_PIDS) && pid_list) || + (!(type & TRACE_NO_PIDS) && no_pid_list); +} + typedef bool (*cond_update_fn_t)(struct trace_array *tr, void *cond_data); /** @@ -285,6 +309,7 @@ struct trace_array { #endif #endif struct trace_pid_list __rcu *filtered_pids; + struct trace_pid_list __rcu *filtered_no_pids; /* * max_lock is used to protect the swapping of buffers * when taking a max snapshot. The buffers themselves are diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index c196d0dc5871..242f59e7f17d 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -232,10 +232,13 @@ bool trace_event_ignore_this_pid(struct trace_event_file *trace_file) { struct trace_array *tr = trace_file->tr; struct trace_array_cpu *data; + struct trace_pid_list *no_pid_list; struct trace_pid_list *pid_list; pid_list = rcu_dereference_raw(tr->filtered_pids); - if (!pid_list) + no_pid_list = rcu_dereference_raw(tr->filtered_no_pids); + + if (!pid_list && !no_pid_list) return false; data = this_cpu_ptr(tr->array_buffer.data); @@ -510,6 +513,9 @@ event_filter_pid_sched_process_exit(void *data, struct task_struct *task) pid_list = rcu_dereference_raw(tr->filtered_pids); trace_filter_add_remove_task(pid_list, NULL, task); + + pid_list = rcu_dereference_raw(tr->filtered_no_pids); + trace_filter_add_remove_task(pid_list, NULL, task); } static void @@ -522,6 +528,9 @@ event_filter_pid_sched_process_fork(void *data, pid_list = rcu_dereference_sched(tr->filtered_pids); trace_filter_add_remove_task(pid_list, self, task); + + pid_list = rcu_dereference_sched(tr->filtered_no_pids); + trace_filter_add_remove_task(pid_list, self, task); } void trace_event_follow_fork(struct trace_array *tr, bool enable) @@ -544,13 +553,23 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt, struct task_struct *prev, struct task_struct *next) { struct trace_array *tr = data; + struct trace_pid_list *no_pid_list; struct trace_pid_list *pid_list; + bool ret; pid_list = rcu_dereference_sched(tr->filtered_pids); + no_pid_list = rcu_dereference_sched(tr->filtered_no_pids); - this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, NULL, prev) && - trace_ignore_this_task(pid_list, NULL, next)); + /* + * Sched switch is funny, as we only want to ignore it + * in the notrace case if both prev and next should be ignored. + */ + ret = trace_ignore_this_task(NULL, no_pid_list, prev) && + trace_ignore_this_task(NULL, no_pid_list, next); + + this_cpu_write(tr->array_buffer.data->ignore_pid, ret || + (trace_ignore_this_task(pid_list, NULL, prev) && + trace_ignore_this_task(pid_list, NULL, next))); } static void @@ -558,18 +577,21 @@ event_filter_pid_sched_switch_probe_post(void *data, bool preempt, struct task_struct *prev, struct task_struct *next) { struct trace_array *tr = data; + struct trace_pid_list *no_pid_list; struct trace_pid_list *pid_list; pid_list = rcu_dereference_sched(tr->filtered_pids); + no_pid_list = rcu_dereference_sched(tr->filtered_no_pids); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, NULL, next)); + trace_ignore_this_task(pid_list, no_pid_list, next)); } static void event_filter_pid_sched_wakeup_probe_pre(void *data, struct task_struct *task) { struct trace_array *tr = data; + struct trace_pid_list *no_pid_list; struct trace_pid_list *pid_list; /* Nothing to do if we are already tracing */ @@ -577,15 +599,17 @@ event_filter_pid_sched_wakeup_probe_pre(void *data, struct task_struct *task) return; pid_list = rcu_dereference_sched(tr->filtered_pids); + no_pid_list = rcu_dereference_sched(tr->filtered_no_pids); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, NULL, task)); + trace_ignore_this_task(pid_list, no_pid_list, task)); } static void event_filter_pid_sched_wakeup_probe_post(void *data, struct task_struct *task) { struct trace_array *tr = data; + struct trace_pid_list *no_pid_list; struct trace_pid_list *pid_list; /* Nothing to do if we are not tracing */ @@ -593,23 +617,15 @@ event_filter_pid_sched_wakeup_probe_post(void *data, struct task_struct *task) return; pid_list = rcu_dereference_sched(tr->filtered_pids); + no_pid_list = rcu_dereference_sched(tr->filtered_no_pids); /* Set tracing if current is enabled */ this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, NULL, current)); + trace_ignore_this_task(pid_list, no_pid_list, current)); } -static void __ftrace_clear_event_pids(struct trace_array *tr) +static void unregister_pid_events(struct trace_array *tr) { - struct trace_pid_list *pid_list; - struct trace_event_file *file; - int cpu; - - pid_list = rcu_dereference_protected(tr->filtered_pids, - lockdep_is_held(&event_mutex)); - if (!pid_list) - return; - unregister_trace_sched_switch(event_filter_pid_sched_switch_probe_pre, tr); unregister_trace_sched_switch(event_filter_pid_sched_switch_probe_post, tr); @@ -621,26 +637,55 @@ static void __ftrace_clear_event_pids(struct trace_array *tr) unregister_trace_sched_waking(event_filter_pid_sched_wakeup_probe_pre, tr); unregister_trace_sched_waking(event_filter_pid_sched_wakeup_probe_post, tr); +} - list_for_each_entry(file, &tr->events, list) { - clear_bit(EVENT_FILE_FL_PID_FILTER_BIT, &file->flags); +static void __ftrace_clear_event_pids(struct trace_array *tr, int type) +{ + struct trace_pid_list *pid_list; + struct trace_pid_list *no_pid_list; + struct trace_event_file *file; + int cpu; + + pid_list = rcu_dereference_protected(tr->filtered_pids, + lockdep_is_held(&event_mutex)); + no_pid_list = rcu_dereference_protected(tr->filtered_no_pids, + lockdep_is_held(&event_mutex)); + + /* Make sure there's something to do */ + if (!pid_type_enabled(type, pid_list, no_pid_list)) + return; + + if (!still_need_pid_events(type, pid_list, no_pid_list)) { + unregister_pid_events(tr); + + list_for_each_entry(file, &tr->events, list) { + clear_bit(EVENT_FILE_FL_PID_FILTER_BIT, &file->flags); + } + + for_each_possible_cpu(cpu) + per_cpu_ptr(tr->array_buffer.data, cpu)->ignore_pid = false; } - for_each_possible_cpu(cpu) - per_cpu_ptr(tr->array_buffer.data, cpu)->ignore_pid = false; + if (type & TRACE_PIDS) + rcu_assign_pointer(tr->filtered_pids, NULL); - rcu_assign_pointer(tr->filtered_pids, NULL); + if (type & TRACE_NO_PIDS) + rcu_assign_pointer(tr->filtered_no_pids, NULL); /* Wait till all users are no longer using pid filtering */ tracepoint_synchronize_unregister(); - trace_free_pid_list(pid_list); + if ((type & TRACE_PIDS) && pid_list) + trace_free_pid_list(pid_list); + + if ((type & TRACE_NO_PIDS) && no_pid_list) + trace_free_pid_list(no_pid_list); } -static void ftrace_clear_event_pids(struct trace_array *tr) +static void ftrace_clear_event_pids(struct trace_array *tr, int type) { mutex_lock(&event_mutex); - __ftrace_clear_event_pids(tr); + __ftrace_clear_event_pids(tr, type); mutex_unlock(&event_mutex); } @@ -1013,15 +1058,32 @@ static void t_stop(struct seq_file *m, void *p) } static void * -p_next(struct seq_file *m, void *v, loff_t *pos) +__next(struct seq_file *m, void *v, loff_t *pos, int type) { struct trace_array *tr = m->private; - struct trace_pid_list *pid_list = rcu_dereference_sched(tr->filtered_pids); + struct trace_pid_list *pid_list; + + if (type == TRACE_PIDS) + pid_list = rcu_dereference_sched(tr->filtered_pids); + else + pid_list = rcu_dereference_sched(tr->filtered_no_pids); return trace_pid_next(pid_list, v, pos); } -static void *p_start(struct seq_file *m, loff_t *pos) +static void * +p_next(struct seq_file *m, void *v, loff_t *pos) +{ + return __next(m, v, pos, TRACE_PIDS); +} + +static void * +np_next(struct seq_file *m, void *v, loff_t *pos) +{ + return __next(m, v, pos, TRACE_NO_PIDS); +} + +static void *__start(struct seq_file *m, loff_t *pos, int type) __acquires(RCU) { struct trace_pid_list *pid_list; @@ -1036,7 +1098,10 @@ static void *p_start(struct seq_file *m, loff_t *pos) mutex_lock(&event_mutex); rcu_read_lock_sched(); - pid_list = rcu_dereference_sched(tr->filtered_pids); + if (type == TRACE_PIDS) + pid_list = rcu_dereference_sched(tr->filtered_pids); + else + pid_list = rcu_dereference_sched(tr->filtered_no_pids); if (!pid_list) return NULL; @@ -1044,6 +1109,18 @@ static void *p_start(struct seq_file *m, loff_t *pos) return trace_pid_start(pid_list, pos); } +static void *p_start(struct seq_file *m, loff_t *pos) + __acquires(RCU) +{ + return __start(m, pos, TRACE_PIDS); +} + +static void *np_start(struct seq_file *m, loff_t *pos) + __acquires(RCU) +{ + return __start(m, pos, TRACE_NO_PIDS); +} + static void p_stop(struct seq_file *m, void *p) __releases(RCU) { @@ -1588,6 +1665,7 @@ static void ignore_task_cpu(void *data) { struct trace_array *tr = data; struct trace_pid_list *pid_list; + struct trace_pid_list *no_pid_list; /* * This function is called by on_each_cpu() while the @@ -1595,18 +1673,50 @@ static void ignore_task_cpu(void *data) */ pid_list = rcu_dereference_protected(tr->filtered_pids, mutex_is_locked(&event_mutex)); + no_pid_list = rcu_dereference_protected(tr->filtered_no_pids, + mutex_is_locked(&event_mutex)); this_cpu_write(tr->array_buffer.data->ignore_pid, - trace_ignore_this_task(pid_list, NULL, current)); + trace_ignore_this_task(pid_list, no_pid_list, current)); +} + +static void register_pid_events(struct trace_array *tr) +{ + /* + * Register a probe that is called before all other probes + * to set ignore_pid if next or prev do not match. + * Register a probe this is called after all other probes + * to only keep ignore_pid set if next pid matches. + */ + register_trace_prio_sched_switch(event_filter_pid_sched_switch_probe_pre, + tr, INT_MAX); + register_trace_prio_sched_switch(event_filter_pid_sched_switch_probe_post, + tr, 0); + + register_trace_prio_sched_wakeup(event_filter_pid_sched_wakeup_probe_pre, + tr, INT_MAX); + register_trace_prio_sched_wakeup(event_filter_pid_sched_wakeup_probe_post, + tr, 0); + + register_trace_prio_sched_wakeup_new(event_filter_pid_sched_wakeup_probe_pre, + tr, INT_MAX); + register_trace_prio_sched_wakeup_new(event_filter_pid_sched_wakeup_probe_post, + tr, 0); + + register_trace_prio_sched_waking(event_filter_pid_sched_wakeup_probe_pre, + tr, INT_MAX); + register_trace_prio_sched_waking(event_filter_pid_sched_wakeup_probe_post, + tr, 0); } static ssize_t -ftrace_event_pid_write(struct file *filp, const char __user *ubuf, - size_t cnt, loff_t *ppos) +event_pid_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos, int type) { struct seq_file *m = filp->private_data; struct trace_array *tr = m->private; struct trace_pid_list *filtered_pids = NULL; + struct trace_pid_list *other_pids = NULL; struct trace_pid_list *pid_list; struct trace_event_file *file; ssize_t ret; @@ -1620,14 +1730,26 @@ ftrace_event_pid_write(struct file *filp, const char __user *ubuf, mutex_lock(&event_mutex); - filtered_pids = rcu_dereference_protected(tr->filtered_pids, - lockdep_is_held(&event_mutex)); + if (type == TRACE_PIDS) { + filtered_pids = rcu_dereference_protected(tr->filtered_pids, + lockdep_is_held(&event_mutex)); + other_pids = rcu_dereference_protected(tr->filtered_no_pids, + lockdep_is_held(&event_mutex)); + } else { + filtered_pids = rcu_dereference_protected(tr->filtered_no_pids, + lockdep_is_held(&event_mutex)); + other_pids = rcu_dereference_protected(tr->filtered_pids, + lockdep_is_held(&event_mutex)); + } ret = trace_pid_write(filtered_pids, &pid_list, ubuf, cnt); if (ret < 0) goto out; - rcu_assign_pointer(tr->filtered_pids, pid_list); + if (type == TRACE_PIDS) + rcu_assign_pointer(tr->filtered_pids, pid_list); + else + rcu_assign_pointer(tr->filtered_no_pids, pid_list); list_for_each_entry(file, &tr->events, list) { set_bit(EVENT_FILE_FL_PID_FILTER_BIT, &file->flags); @@ -1636,32 +1758,8 @@ ftrace_event_pid_write(struct file *filp, const char __user *ubuf, if (filtered_pids) { tracepoint_synchronize_unregister(); trace_free_pid_list(filtered_pids); - } else if (pid_list) { - /* - * Register a probe that is called before all other probes - * to set ignore_pid if next or prev do not match. - * Register a probe this is called after all other probes - * to only keep ignore_pid set if next pid matches. - */ - register_trace_prio_sched_switch(event_filter_pid_sched_switch_probe_pre, - tr, INT_MAX); - register_trace_prio_sched_switch(event_filter_pid_sched_switch_probe_post, - tr, 0); - - register_trace_prio_sched_wakeup(event_filter_pid_sched_wakeup_probe_pre, - tr, INT_MAX); - register_trace_prio_sched_wakeup(event_filter_pid_sched_wakeup_probe_post, - tr, 0); - - register_trace_prio_sched_wakeup_new(event_filter_pid_sched_wakeup_probe_pre, - tr, INT_MAX); - register_trace_prio_sched_wakeup_new(event_filter_pid_sched_wakeup_probe_post, - tr, 0); - - register_trace_prio_sched_waking(event_filter_pid_sched_wakeup_probe_pre, - tr, INT_MAX); - register_trace_prio_sched_waking(event_filter_pid_sched_wakeup_probe_post, - tr, 0); + } else if (pid_list && !other_pids) { + register_pid_events(tr); } /* @@ -1680,9 +1778,24 @@ ftrace_event_pid_write(struct file *filp, const char __user *ubuf, return ret; } +static ssize_t +ftrace_event_pid_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return event_pid_write(filp, ubuf, cnt, ppos, TRACE_PIDS); +} + +static ssize_t +ftrace_event_npid_write(struct file *filp, const char __user *ubuf, + size_t cnt, loff_t *ppos) +{ + return event_pid_write(filp, ubuf, cnt, ppos, TRACE_NO_PIDS); +} + static int ftrace_event_avail_open(struct inode *inode, struct file *file); static int ftrace_event_set_open(struct inode *inode, struct file *file); static int ftrace_event_set_pid_open(struct inode *inode, struct file *file); +static int ftrace_event_set_npid_open(struct inode *inode, struct file *file); static int ftrace_event_release(struct inode *inode, struct file *file); static const struct seq_operations show_event_seq_ops = { @@ -1706,6 +1819,13 @@ static const struct seq_operations show_set_pid_seq_ops = { .stop = p_stop, }; +static const struct seq_operations show_set_no_pid_seq_ops = { + .start = np_start, + .next = np_next, + .show = trace_pid_show, + .stop = p_stop, +}; + static const struct file_operations ftrace_avail_fops = { .open = ftrace_event_avail_open, .read = seq_read, @@ -1729,6 +1849,14 @@ static const struct file_operations ftrace_set_event_pid_fops = { .release = ftrace_event_release, }; +static const struct file_operations ftrace_set_event_notrace_pid_fops = { + .open = ftrace_event_set_npid_open, + .read = seq_read, + .write = ftrace_event_npid_write, + .llseek = seq_lseek, + .release = ftrace_event_release, +}; + static const struct file_operations ftrace_enable_fops = { .open = tracing_open_generic, .read = event_enable_read, @@ -1858,7 +1986,28 @@ ftrace_event_set_pid_open(struct inode *inode, struct file *file) if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC)) - ftrace_clear_event_pids(tr); + ftrace_clear_event_pids(tr, TRACE_PIDS); + + ret = ftrace_event_open(inode, file, seq_ops); + if (ret < 0) + trace_array_put(tr); + return ret; +} + +static int +ftrace_event_set_npid_open(struct inode *inode, struct file *file) +{ + const struct seq_operations *seq_ops = &show_set_no_pid_seq_ops; + struct trace_array *tr = inode->i_private; + int ret; + + ret = tracing_check_open_get_tr(tr); + if (ret) + return ret; + + if ((file->f_mode & FMODE_WRITE) && + (file->f_flags & O_TRUNC)) + ftrace_clear_event_pids(tr, TRACE_NO_PIDS); ret = ftrace_event_open(inode, file, seq_ops); if (ret < 0) @@ -3075,6 +3224,11 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr) if (!entry) pr_warn("Could not create tracefs 'set_event_pid' entry\n"); + entry = tracefs_create_file("set_event_notrace_pid", 0644, parent, + tr, &ftrace_set_event_notrace_pid_fops); + if (!entry) + pr_warn("Could not create tracefs 'set_event_notrace_pid' entry\n"); + /* ring buffer internal formats */ entry = trace_create_file("header_page", 0444, d_events, ring_buffer_print_page_header, @@ -3158,7 +3312,7 @@ int event_trace_del_tracer(struct trace_array *tr) clear_event_triggers(tr); /* Clear the pid list */ - __ftrace_clear_event_pids(tr); + __ftrace_clear_event_pids(tr, TRACE_PIDS | TRACE_NO_PIDS); /* Disable any running events */ __ftrace_set_clr_event_nolock(tr, NULL, NULL, NULL, 0); -- cgit v1.2.3-71-gd317 From f318903c0bf42448b4c884732df2bbb0ef7a2284 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 27 Mar 2020 16:58:52 +0100 Subject: bpf: Add netns cookie and enable it for bpf cgroup hooks In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness. In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between initial network namespaces and non-initial one. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace as one example. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic. On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need. (*) Both externalTrafficPolicy={Local|Cluster} types [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net --- include/linux/bpf.h | 1 + include/net/net_namespace.h | 10 ++++++++++ include/uapi/linux/bpf.h | 16 +++++++++++++++- kernel/bpf/verifier.c | 16 ++++++++++------ net/core/filter.c | 37 +++++++++++++++++++++++++++++++++++++ net/core/net_namespace.c | 15 +++++++++++++++ tools/include/uapi/linux/bpf.h | 16 +++++++++++++++- 7 files changed, 103 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index bdb981c204fa..78046c570596 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -233,6 +233,7 @@ enum bpf_arg_type { ARG_CONST_SIZE_OR_ZERO, /* number of bytes accessed from memory or 0 */ ARG_PTR_TO_CTX, /* pointer to context */ + ARG_PTR_TO_CTX_OR_NULL, /* pointer to context or NULL */ ARG_ANYTHING, /* any (initialized) argument is ok */ ARG_PTR_TO_SPIN_LOCK, /* pointer to bpf_spin_lock */ ARG_PTR_TO_SOCK_COMMON, /* pointer to sock_common */ diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 854d39ef1ca3..1c6edfdb9a2c 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -168,6 +168,9 @@ struct net { #ifdef CONFIG_XFRM struct netns_xfrm xfrm; #endif + + atomic64_t net_cookie; /* written once */ + #if IS_ENABLED(CONFIG_IP_VS) struct netns_ipvs *ipvs; #endif @@ -273,6 +276,8 @@ static inline int check_net(const struct net *net) void net_drop_ns(void *); +u64 net_gen_cookie(struct net *net); + #else static inline struct net *get_net(struct net *net) @@ -300,6 +305,11 @@ static inline int check_net(const struct net *net) return 1; } +static inline u64 net_gen_cookie(struct net *net) +{ + return 0; +} + #define net_drop_ns NULL #endif diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 5d01c5c7e598..bd81c4555206 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2950,6 +2950,19 @@ union bpf_attr { * restricted to raw_tracepoint bpf programs. * Return * 0 on success, or a negative error in case of failure. + * + * u64 bpf_get_netns_cookie(void *ctx) + * Description + * Retrieve the cookie (generated by the kernel) of the network + * namespace the input *ctx* is associated with. The network + * namespace cookie remains stable for its lifetime and provides + * a global identifier that can be assumed unique. If *ctx* is + * NULL, then the helper returns the cookie for the initial + * network namespace. The cookie itself is very similar to that + * of bpf_get_socket_cookie() helper, but for network namespaces + * instead of sockets. + * Return + * A 8-byte long opaque number. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3073,7 +3086,8 @@ union bpf_attr { FN(jiffies64), \ FN(read_branch_records), \ FN(get_ns_current_pid_tgid), \ - FN(xdp_output), + FN(xdp_output), \ + FN(get_netns_cookie), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2ea2a868324e..46ba86c540e2 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, expected_type = CONST_PTR_TO_MAP; if (type != expected_type) goto err_type; - } else if (arg_type == ARG_PTR_TO_CTX) { + } else if (arg_type == ARG_PTR_TO_CTX || + arg_type == ARG_PTR_TO_CTX_OR_NULL) { expected_type = PTR_TO_CTX; - if (type != expected_type) - goto err_type; - err = check_ctx_reg(env, reg, regno); - if (err < 0) - return err; + if (!(register_is_null(reg) && + arg_type == ARG_PTR_TO_CTX_OR_NULL)) { + if (type != expected_type) + goto err_type; + err = check_ctx_reg(env, reg, regno); + if (err < 0) + return err; + } } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) { expected_type = PTR_TO_SOCK_COMMON; /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ diff --git a/net/core/filter.c b/net/core/filter.c index 6cb7e0e24473..e249a499cbe5 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4141,6 +4141,39 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_ops_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +static u64 __bpf_get_netns_cookie(struct sock *sk) +{ +#ifdef CONFIG_NET_NS + return net_gen_cookie(sk ? sk->sk_net.net : &init_net); +#else + return 0; +#endif +} + +BPF_CALL_1(bpf_get_netns_cookie_sock, struct sock *, ctx) +{ + return __bpf_get_netns_cookie(ctx); +} + +static const struct bpf_func_proto bpf_get_netns_cookie_sock_proto = { + .func = bpf_get_netns_cookie_sock, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, +}; + +BPF_CALL_1(bpf_get_netns_cookie_sock_addr, struct bpf_sock_addr_kern *, ctx) +{ + return __bpf_get_netns_cookie(ctx ? ctx->sk : NULL); +} + +static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = { + .func = bpf_get_netns_cookie_sock_addr, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, +}; + BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb) { struct sock *sk = sk_to_full_sk(skb->sk); @@ -5968,6 +6001,8 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_local_storage_proto; case BPF_FUNC_get_socket_cookie: return &bpf_get_socket_cookie_sock_proto; + case BPF_FUNC_get_netns_cookie: + return &bpf_get_netns_cookie_sock_proto; case BPF_FUNC_perf_event_output: return &bpf_event_output_data_proto; default: @@ -5994,6 +6029,8 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) } case BPF_FUNC_get_socket_cookie: return &bpf_get_socket_cookie_sock_addr_proto; + case BPF_FUNC_get_netns_cookie: + return &bpf_get_netns_cookie_sock_addr_proto; case BPF_FUNC_get_local_storage: return &bpf_get_local_storage_proto; case BPF_FUNC_perf_event_output: diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 757cc1d084e7..190ca66a383b 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -69,6 +69,20 @@ EXPORT_SYMBOL_GPL(pernet_ops_rwsem); static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS; +static atomic64_t cookie_gen; + +u64 net_gen_cookie(struct net *net) +{ + while (1) { + u64 res = atomic64_read(&net->net_cookie); + + if (res) + return res; + res = atomic64_inc_return(&cookie_gen); + atomic64_cmpxchg(&net->net_cookie, 0, res); + } +} + static struct net_generic *net_alloc_generic(void) { struct net_generic *ng; @@ -1087,6 +1101,7 @@ static int __init net_ns_init(void) panic("Could not allocate generic netns"); rcu_assign_pointer(init_net.gen, ng); + net_gen_cookie(&init_net); down_write(&pernet_ops_rwsem); if (setup_net(&init_net, &init_user_ns)) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 5d01c5c7e598..bd81c4555206 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2950,6 +2950,19 @@ union bpf_attr { * restricted to raw_tracepoint bpf programs. * Return * 0 on success, or a negative error in case of failure. + * + * u64 bpf_get_netns_cookie(void *ctx) + * Description + * Retrieve the cookie (generated by the kernel) of the network + * namespace the input *ctx* is associated with. The network + * namespace cookie remains stable for its lifetime and provides + * a global identifier that can be assumed unique. If *ctx* is + * NULL, then the helper returns the cookie for the initial + * network namespace. The cookie itself is very similar to that + * of bpf_get_socket_cookie() helper, but for network namespaces + * instead of sockets. + * Return + * A 8-byte long opaque number. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3073,7 +3086,8 @@ union bpf_attr { FN(jiffies64), \ FN(read_branch_records), \ FN(get_ns_current_pid_tgid), \ - FN(xdp_output), + FN(xdp_output), \ + FN(get_netns_cookie), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call -- cgit v1.2.3-71-gd317 From 0f09abd105da6c37713d2b253730a86cb45e127a Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 27 Mar 2020 16:58:54 +0100 Subject: bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(), recvmsg() and bind-related hooks in order to retrieve the cgroup v2 context which can then be used as part of the key for BPF map lookups, for example. Given these hooks operate in process context 'current' is always valid and pointing to the app that is performing mentioned syscalls if it's subject to a v2 cgroup. Also with same motivation of commit 7723628101aa ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") enable retrieval of ancestor from current so the cgroup id can be used for policy lookups which can then forbid connect() / bind(), for example. Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 21 ++++++++++++++++++++- kernel/bpf/core.c | 1 + kernel/bpf/helpers.c | 18 ++++++++++++++++++ net/core/filter.c | 12 ++++++++++++ tools/include/uapi/linux/bpf.h | 21 ++++++++++++++++++++- 6 files changed, 72 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 78046c570596..372708eeaecd 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1501,6 +1501,7 @@ extern const struct bpf_func_proto bpf_get_stack_proto; extern const struct bpf_func_proto bpf_sock_map_update_proto; extern const struct bpf_func_proto bpf_sock_hash_update_proto; extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto; +extern const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto; extern const struct bpf_func_proto bpf_msg_redirect_hash_proto; extern const struct bpf_func_proto bpf_msg_redirect_map_proto; extern const struct bpf_func_proto bpf_sk_redirect_hash_proto; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index bd81c4555206..222ba11966e3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2963,6 +2963,24 @@ union bpf_attr { * instead of sockets. * Return * A 8-byte long opaque number. + * + * u64 bpf_get_current_ancestor_cgroup_id(int ancestor_level) + * Description + * Return id of cgroup v2 that is ancestor of the cgroup associated + * with the current task at the *ancestor_level*. The root cgroup + * is at *ancestor_level* zero and each step down the hierarchy + * increments the level. If *ancestor_level* == level of cgroup + * associated with the current task, then return value will be the + * same as that of **bpf_get_current_cgroup_id**\ (). + * + * The helper is useful to implement policies based on cgroups + * that are upper in hierarchy than immediate cgroup associated + * with the current task. + * + * The format of returned id and helper limitations are same as in + * **bpf_get_current_cgroup_id**\ (). + * Return + * The id is returned or 0 in case the id could not be retrieved. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3087,7 +3105,8 @@ union bpf_attr { FN(read_branch_records), \ FN(get_ns_current_pid_tgid), \ FN(xdp_output), \ - FN(get_netns_cookie), + FN(get_netns_cookie), \ + FN(get_current_ancestor_cgroup_id), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 914f3463aa41..916f5132a984 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2156,6 +2156,7 @@ const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak; const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak; const struct bpf_func_proto bpf_get_current_comm_proto __weak; const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak; +const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto __weak; const struct bpf_func_proto bpf_get_local_storage_proto __weak; const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto __weak; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 01878db15eaf..bafc53ddd350 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -340,6 +340,24 @@ const struct bpf_func_proto bpf_get_current_cgroup_id_proto = { .ret_type = RET_INTEGER, }; +BPF_CALL_1(bpf_get_current_ancestor_cgroup_id, int, ancestor_level) +{ + struct cgroup *cgrp = task_dfl_cgroup(current); + struct cgroup *ancestor; + + ancestor = cgroup_ancestor(cgrp, ancestor_level); + if (!ancestor) + return 0; + return cgroup_id(ancestor); +} + +const struct bpf_func_proto bpf_get_current_ancestor_cgroup_id_proto = { + .func = bpf_get_current_ancestor_cgroup_id, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, +}; + #ifdef CONFIG_CGROUP_BPF DECLARE_PER_CPU(struct bpf_cgroup_storage*, bpf_cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]); diff --git a/net/core/filter.c b/net/core/filter.c index 3083c7746ee0..5cec3ac9e3dd 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -6018,6 +6018,12 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_netns_cookie_sock_proto; case BPF_FUNC_perf_event_output: return &bpf_event_output_data_proto; +#ifdef CONFIG_CGROUPS + case BPF_FUNC_get_current_cgroup_id: + return &bpf_get_current_cgroup_id_proto; + case BPF_FUNC_get_current_ancestor_cgroup_id: + return &bpf_get_current_ancestor_cgroup_id_proto; +#endif #ifdef CONFIG_CGROUP_NET_CLASSID case BPF_FUNC_get_cgroup_classid: return &bpf_get_cgroup_classid_curr_proto; @@ -6052,6 +6058,12 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_local_storage_proto; case BPF_FUNC_perf_event_output: return &bpf_event_output_data_proto; +#ifdef CONFIG_CGROUPS + case BPF_FUNC_get_current_cgroup_id: + return &bpf_get_current_cgroup_id_proto; + case BPF_FUNC_get_current_ancestor_cgroup_id: + return &bpf_get_current_ancestor_cgroup_id_proto; +#endif #ifdef CONFIG_CGROUP_NET_CLASSID case BPF_FUNC_get_cgroup_classid: return &bpf_get_cgroup_classid_curr_proto; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index bd81c4555206..222ba11966e3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -2963,6 +2963,24 @@ union bpf_attr { * instead of sockets. * Return * A 8-byte long opaque number. + * + * u64 bpf_get_current_ancestor_cgroup_id(int ancestor_level) + * Description + * Return id of cgroup v2 that is ancestor of the cgroup associated + * with the current task at the *ancestor_level*. The root cgroup + * is at *ancestor_level* zero and each step down the hierarchy + * increments the level. If *ancestor_level* == level of cgroup + * associated with the current task, then return value will be the + * same as that of **bpf_get_current_cgroup_id**\ (). + * + * The helper is useful to implement policies based on cgroups + * that are upper in hierarchy than immediate cgroup associated + * with the current task. + * + * The format of returned id and helper limitations are same as in + * **bpf_get_current_cgroup_id**\ (). + * Return + * The id is returned or 0 in case the id could not be retrieved. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3087,7 +3105,8 @@ union bpf_attr { FN(read_branch_records), \ FN(get_ns_current_pid_tgid), \ FN(xdp_output), \ - FN(get_netns_cookie), + FN(get_netns_cookie), \ + FN(get_current_ancestor_cgroup_id), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call -- cgit v1.2.3-71-gd317 From a08971e9488d12a10a46eb433612229767b61fd5 Mon Sep 17 00:00:00 2001 From: Al Viro Date: Sun, 16 Feb 2020 10:17:27 -0500 Subject: futex: arch_futex_atomic_op_inuser() calling conventions change Move access_ok() in and pagefault_enable()/pagefault_disable() out. Mechanical conversion only - some instances don't really need a separate access_ok() at all (e.g. the ones only using get_user()/put_user(), or architectures where access_ok() is always true); we'll deal with that in followups. Signed-off-by: Al Viro --- arch/alpha/include/asm/futex.h | 5 ++--- arch/arc/include/asm/futex.h | 5 +++-- arch/arm/include/asm/futex.h | 5 +++-- arch/arm64/include/asm/futex.h | 5 ++--- arch/hexagon/include/asm/futex.h | 5 ++--- arch/ia64/include/asm/futex.h | 5 ++--- arch/microblaze/include/asm/futex.h | 5 ++--- arch/mips/include/asm/futex.h | 5 ++--- arch/nds32/include/asm/futex.h | 6 ++---- arch/openrisc/include/asm/futex.h | 5 ++--- arch/parisc/include/asm/futex.h | 5 +++-- arch/powerpc/include/asm/futex.h | 5 ++--- arch/riscv/include/asm/futex.h | 5 ++--- arch/s390/include/asm/futex.h | 4 ++-- arch/sh/include/asm/futex.h | 5 ++--- arch/sparc/include/asm/futex_64.h | 6 ++---- arch/x86/include/asm/futex.h | 5 ++--- arch/xtensa/include/asm/futex.h | 5 ++--- include/asm-generic/futex.h | 4 ++-- kernel/futex.c | 5 ++--- 20 files changed, 43 insertions(+), 57 deletions(-) (limited to 'kernel') diff --git a/arch/alpha/include/asm/futex.h b/arch/alpha/include/asm/futex.h index bfd3c01038f8..da67afd578fd 100644 --- a/arch/alpha/include/asm/futex.h +++ b/arch/alpha/include/asm/futex.h @@ -31,7 +31,8 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, { int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -53,8 +54,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/arc/include/asm/futex.h b/arch/arc/include/asm/futex.h index 9d0d070e6c22..607d1c16d4dd 100644 --- a/arch/arc/include/asm/futex.h +++ b/arch/arc/include/asm/futex.h @@ -75,10 +75,12 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, { int oldval = 0, ret; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; + #ifndef CONFIG_ARC_HAS_LLSC preempt_disable(); /* to guarantee atomic r-m-w of futex op */ #endif - pagefault_disable(); switch (op) { case FUTEX_OP_SET: @@ -101,7 +103,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); #ifndef CONFIG_ARC_HAS_LLSC preempt_enable(); #endif diff --git a/arch/arm/include/asm/futex.h b/arch/arm/include/asm/futex.h index 83c391b597d4..e133da303a98 100644 --- a/arch/arm/include/asm/futex.h +++ b/arch/arm/include/asm/futex.h @@ -134,10 +134,12 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret, tmp; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; + #ifndef CONFIG_SMP preempt_disable(); #endif - pagefault_disable(); switch (op) { case FUTEX_OP_SET: @@ -159,7 +161,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); #ifndef CONFIG_SMP preempt_enable(); #endif diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h index 6cc26a127819..97f6a63810ec 100644 --- a/arch/arm64/include/asm/futex.h +++ b/arch/arm64/include/asm/futex.h @@ -48,7 +48,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *_uaddr) int oldval = 0, ret, tmp; u32 __user *uaddr = __uaccess_mask_ptr(_uaddr); - pagefault_disable(); + if (!access_ok(_uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -75,8 +76,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *_uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/hexagon/include/asm/futex.h b/arch/hexagon/include/asm/futex.h index 0191f7c7193e..6b9c554aee78 100644 --- a/arch/hexagon/include/asm/futex.h +++ b/arch/hexagon/include/asm/futex.h @@ -36,7 +36,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -62,8 +63,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/ia64/include/asm/futex.h b/arch/ia64/include/asm/futex.h index 2e106d462196..1db26b432d8c 100644 --- a/arch/ia64/include/asm/futex.h +++ b/arch/ia64/include/asm/futex.h @@ -50,7 +50,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -74,8 +75,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/microblaze/include/asm/futex.h b/arch/microblaze/include/asm/futex.h index 8c90357e5983..86131ed84c9a 100644 --- a/arch/microblaze/include/asm/futex.h +++ b/arch/microblaze/include/asm/futex.h @@ -34,7 +34,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -56,8 +57,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/mips/include/asm/futex.h b/arch/mips/include/asm/futex.h index 110220705e97..2bf8f6014579 100644 --- a/arch/mips/include/asm/futex.h +++ b/arch/mips/include/asm/futex.h @@ -89,7 +89,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -116,8 +117,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/nds32/include/asm/futex.h b/arch/nds32/include/asm/futex.h index 5213c65c2e0b..4223f473bd36 100644 --- a/arch/nds32/include/asm/futex.h +++ b/arch/nds32/include/asm/futex.h @@ -66,8 +66,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret; - - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: __futex_atomic_op("move %0, %3", ret, oldval, tmp, uaddr, @@ -93,8 +93,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/openrisc/include/asm/futex.h b/arch/openrisc/include/asm/futex.h index fe894e6331ae..865e9cd0d97b 100644 --- a/arch/openrisc/include/asm/futex.h +++ b/arch/openrisc/include/asm/futex.h @@ -35,7 +35,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -57,8 +58,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/parisc/include/asm/futex.h b/arch/parisc/include/asm/futex.h index d2c3e4106851..c10cc9010cc1 100644 --- a/arch/parisc/include/asm/futex.h +++ b/arch/parisc/include/asm/futex.h @@ -39,8 +39,10 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) int oldval, ret; u32 tmp; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; + _futex_spin_lock_irqsave(uaddr, &flags); - pagefault_disable(); ret = -EFAULT; if (unlikely(get_user(oldval, uaddr) != 0)) @@ -73,7 +75,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -EFAULT; out_pagefault_enable: - pagefault_enable(); _futex_spin_unlock_irqrestore(uaddr, &flags); if (!ret) diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h index bc7d9d06a6d9..f187bb5e524e 100644 --- a/arch/powerpc/include/asm/futex.h +++ b/arch/powerpc/include/asm/futex.h @@ -35,8 +35,9 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, { int oldval = 0, ret; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; allow_read_write_user(uaddr, uaddr, sizeof(*uaddr)); - pagefault_disable(); switch (op) { case FUTEX_OP_SET: @@ -58,8 +59,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); - *oval = oldval; prevent_read_write_user(uaddr, uaddr, sizeof(*uaddr)); diff --git a/arch/riscv/include/asm/futex.h b/arch/riscv/include/asm/futex.h index fdfaf7f3df7c..1b00badb9f87 100644 --- a/arch/riscv/include/asm/futex.h +++ b/arch/riscv/include/asm/futex.h @@ -46,7 +46,8 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) { int oldval = 0, ret = 0; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -73,8 +74,6 @@ arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr) ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/s390/include/asm/futex.h b/arch/s390/include/asm/futex.h index 5e97a4353147..ed965c3ecd5b 100644 --- a/arch/s390/include/asm/futex.h +++ b/arch/s390/include/asm/futex.h @@ -28,8 +28,9 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, int oldval = 0, newval, ret; mm_segment_t old_fs; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; old_fs = enable_sacf_uaccess(); - pagefault_disable(); switch (op) { case FUTEX_OP_SET: __futex_atomic_op("lr %2,%5\n", @@ -54,7 +55,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, default: ret = -ENOSYS; } - pagefault_enable(); disable_sacf_uaccess(old_fs); if (!ret) diff --git a/arch/sh/include/asm/futex.h b/arch/sh/include/asm/futex.h index 3190ec89df81..324fa680b13d 100644 --- a/arch/sh/include/asm/futex.h +++ b/arch/sh/include/asm/futex.h @@ -34,7 +34,8 @@ static inline int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 oldval, newval, prev; int ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; do { ret = get_user(oldval, uaddr); @@ -67,8 +68,6 @@ static inline int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, ret = futex_atomic_cmpxchg_inatomic(&prev, uaddr, oldval, newval); } while (!ret && prev != oldval); - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/sparc/include/asm/futex_64.h b/arch/sparc/include/asm/futex_64.h index 0865ce77ec00..84fffaaf59d3 100644 --- a/arch/sparc/include/asm/futex_64.h +++ b/arch/sparc/include/asm/futex_64.h @@ -35,11 +35,11 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, { int oldval = 0, ret, tem; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; if (unlikely((((unsigned long) uaddr) & 0x3UL))) return -EINVAL; - pagefault_disable(); - switch (op) { case FUTEX_OP_SET: __futex_cas_op("mov\t%4, %1", ret, oldval, uaddr, oparg); @@ -60,8 +60,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/x86/include/asm/futex.h b/arch/x86/include/asm/futex.h index 13c83fe97988..6bcd1c1486d9 100644 --- a/arch/x86/include/asm/futex.h +++ b/arch/x86/include/asm/futex.h @@ -47,7 +47,8 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, { int oldval = 0, ret, tem; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -70,8 +71,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/arch/xtensa/include/asm/futex.h b/arch/xtensa/include/asm/futex.h index 964611083224..a1a27b2ea460 100644 --- a/arch/xtensa/include/asm/futex.h +++ b/arch/xtensa/include/asm/futex.h @@ -72,7 +72,8 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, #if XCHAL_HAVE_S32C1I || XCHAL_HAVE_EXCLUSIVE int oldval = 0, ret; - pagefault_disable(); + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; switch (op) { case FUTEX_OP_SET: @@ -99,8 +100,6 @@ static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, ret = -ENOSYS; } - pagefault_enable(); - if (!ret) *oval = oldval; diff --git a/include/asm-generic/futex.h b/include/asm-generic/futex.h index 02970b11f71f..3eab7ba912fc 100644 --- a/include/asm-generic/futex.h +++ b/include/asm-generic/futex.h @@ -33,8 +33,9 @@ arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr) int oldval, ret; u32 tmp; + if (!access_ok(uaddr, sizeof(u32))) + return -EFAULT; preempt_disable(); - pagefault_disable(); ret = -EFAULT; if (unlikely(get_user(oldval, uaddr) != 0)) @@ -67,7 +68,6 @@ arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr) ret = -EFAULT; out_pagefault_enable: - pagefault_enable(); preempt_enable(); if (ret == 0) diff --git a/kernel/futex.c b/kernel/futex.c index 0cf84c8664f2..7fdd2c949487 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -1723,10 +1723,9 @@ static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr) oparg = 1 << oparg; } - if (!access_ok(uaddr, sizeof(u32))) - return -EFAULT; - + pagefault_disable(); ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr); + pagefault_enable(); if (ret) return ret; -- cgit v1.2.3-71-gd317 From e98eac6ff1b45e4e73f2e6031b37c256ccb5d36b Mon Sep 17 00:00:00 2001 From: Thomas Gleixner Date: Fri, 27 Mar 2020 12:06:44 +0100 Subject: cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus() A recent change to freeze_secondary_cpus() which added an early abort if a wakeup is pending missed the fact that the function is also invoked for shutdown, reboot and kexec via disable_nonboot_cpus(). In case of disable_nonboot_cpus() the wakeup event needs to be ignored as the purpose is to terminate the currently running kernel. Add a 'suspend' argument which is only set when the freeze is in context of a suspend operation. If not set then an eventually pending wakeup event is ignored. Fixes: a66d955e910a ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending") Reported-by: Boqun Feng Signed-off-by: Thomas Gleixner Cc: Pavankumar Kondeti Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de --- include/linux/cpu.h | 12 +++++++++--- kernel/cpu.c | 4 ++-- 2 files changed, 11 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/include/linux/cpu.h b/include/linux/cpu.h index 9ead281157d3..beaed2dc269e 100644 --- a/include/linux/cpu.h +++ b/include/linux/cpu.h @@ -144,12 +144,18 @@ static inline void get_online_cpus(void) { cpus_read_lock(); } static inline void put_online_cpus(void) { cpus_read_unlock(); } #ifdef CONFIG_PM_SLEEP_SMP -extern int freeze_secondary_cpus(int primary); +int __freeze_secondary_cpus(int primary, bool suspend); +static inline int freeze_secondary_cpus(int primary) +{ + return __freeze_secondary_cpus(primary, true); +} + static inline int disable_nonboot_cpus(void) { - return freeze_secondary_cpus(0); + return __freeze_secondary_cpus(0, false); } -extern void enable_nonboot_cpus(void); + +void enable_nonboot_cpus(void); static inline int suspend_disable_secondary_cpus(void) { diff --git a/kernel/cpu.c b/kernel/cpu.c index 30848496cbc7..12ae636e9cb6 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1327,7 +1327,7 @@ void bringup_nonboot_cpus(unsigned int setup_max_cpus) #ifdef CONFIG_PM_SLEEP_SMP static cpumask_var_t frozen_cpus; -int freeze_secondary_cpus(int primary) +int __freeze_secondary_cpus(int primary, bool suspend) { int cpu, error = 0; @@ -1352,7 +1352,7 @@ int freeze_secondary_cpus(int primary) if (cpu == primary) continue; - if (pm_wakeup_pending()) { + if (suspend && pm_wakeup_pending()) { pr_info("Wakeup pending. Abort CPU freeze\n"); error = -EBUSY; break; -- cgit v1.2.3-71-gd317 From fc611f47f2188ade2b48ff6902d5cce8baac0c58 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Sun, 29 Mar 2020 01:43:49 +0100 Subject: bpf: Introduce BPF_PROG_TYPE_LSM Introduce types and configs for bpf programs that can be attached to LSM hooks. The programs can be enabled by the config option CONFIG_BPF_LSM. Signed-off-by: KP Singh Signed-off-by: Daniel Borkmann Reviewed-by: Brendan Jackman Reviewed-by: Florent Revest Reviewed-by: Thomas Garnier Acked-by: Yonghong Song Acked-by: Andrii Nakryiko Acked-by: James Morris Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org --- MAINTAINERS | 1 + include/linux/bpf.h | 3 +++ include/linux/bpf_types.h | 4 ++++ include/uapi/linux/bpf.h | 2 ++ init/Kconfig | 12 ++++++++++++ kernel/bpf/Makefile | 1 + kernel/bpf/bpf_lsm.c | 17 +++++++++++++++++ kernel/trace/bpf_trace.c | 12 ++++++------ tools/include/uapi/linux/bpf.h | 2 ++ tools/lib/bpf/libbpf_probes.c | 1 + 10 files changed, 49 insertions(+), 6 deletions(-) create mode 100644 kernel/bpf/bpf_lsm.c (limited to 'kernel') diff --git a/MAINTAINERS b/MAINTAINERS index 5dbee41045bc..3197fe9256b2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3147,6 +3147,7 @@ R: Martin KaFai Lau R: Song Liu R: Yonghong Song R: Andrii Nakryiko +R: KP Singh L: netdev@vger.kernel.org L: bpf@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 372708eeaecd..3bde59a8453b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1515,6 +1515,9 @@ extern const struct bpf_func_proto bpf_tcp_sock_proto; extern const struct bpf_func_proto bpf_jiffies64_proto; extern const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto; +const struct bpf_func_proto *bpf_tracing_func_proto( + enum bpf_func_id func_id, const struct bpf_prog *prog); + /* Shared helpers among cBPF and eBPF. */ void bpf_user_rnd_init_once(void); u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index c81d4ece79a4..ba0c2d56f8a3 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -70,6 +70,10 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_STRUCT_OPS, bpf_struct_ops, void *, void *) BPF_PROG_TYPE(BPF_PROG_TYPE_EXT, bpf_extension, void *, void *) +#ifdef CONFIG_BPF_LSM +BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm, + void *, void *) +#endif /* CONFIG_BPF_LSM */ #endif BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 222ba11966e3..f1fbc36f58d3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -181,6 +181,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_STRUCT_OPS, BPF_PROG_TYPE_EXT, + BPF_PROG_TYPE_LSM, }; enum bpf_attach_type { @@ -211,6 +212,7 @@ enum bpf_attach_type { BPF_TRACE_FENTRY, BPF_TRACE_FEXIT, BPF_MODIFY_RETURN, + BPF_LSM_MAC, __MAX_BPF_ATTACH_TYPE }; diff --git a/init/Kconfig b/init/Kconfig index 20a6ac33761c..deae572d1927 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1616,6 +1616,18 @@ config KALLSYMS_BASE_RELATIVE # end of the "standard kernel features (expert users)" menu # syscall, maps, verifier + +config BPF_LSM + bool "LSM Instrumentation with BPF" + depends on BPF_SYSCALL + depends on SECURITY + depends on BPF_JIT + help + Enables instrumentation of the security hooks with eBPF programs for + implementing dynamic MAC and Audit Policies. + + If you are unsure how to answer this question, answer N. + config BPF_SYSCALL bool "Enable bpf() system call" select BPF diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 046ce5d98033..f2d7be596966 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -29,4 +29,5 @@ obj-$(CONFIG_DEBUG_INFO_BTF) += sysfs_btf.o endif ifeq ($(CONFIG_BPF_JIT),y) obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o +obj-${CONFIG_BPF_LSM} += bpf_lsm.o endif diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c new file mode 100644 index 000000000000..82875039ca90 --- /dev/null +++ b/kernel/bpf/bpf_lsm.c @@ -0,0 +1,17 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (C) 2020 Google LLC. + */ + +#include +#include +#include + +const struct bpf_prog_ops lsm_prog_ops = { +}; + +const struct bpf_verifier_ops lsm_verifier_ops = { + .get_func_proto = bpf_tracing_func_proto, + .is_valid_access = btf_ctx_access, +}; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index e619eedb5919..37ffceab608f 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -779,8 +779,8 @@ static const struct bpf_func_proto bpf_send_signal_thread_proto = { .arg1_type = ARG_ANYTHING, }; -static const struct bpf_func_proto * -tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +const struct bpf_func_proto * +bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { switch (func_id) { case BPF_FUNC_map_lookup_elem: @@ -865,7 +865,7 @@ kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_override_return_proto; #endif default: - return tracing_func_proto(func_id, prog); + return bpf_tracing_func_proto(func_id, prog); } } @@ -975,7 +975,7 @@ tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_get_stack: return &bpf_get_stack_proto_tp; default: - return tracing_func_proto(func_id, prog); + return bpf_tracing_func_proto(func_id, prog); } } @@ -1082,7 +1082,7 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_read_branch_records: return &bpf_read_branch_records_proto; default: - return tracing_func_proto(func_id, prog); + return bpf_tracing_func_proto(func_id, prog); } } @@ -1210,7 +1210,7 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_get_stack: return &bpf_get_stack_proto_raw_tp; default: - return tracing_func_proto(func_id, prog); + return bpf_tracing_func_proto(func_id, prog); } } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 222ba11966e3..f1fbc36f58d3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -181,6 +181,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_STRUCT_OPS, BPF_PROG_TYPE_EXT, + BPF_PROG_TYPE_LSM, }; enum bpf_attach_type { @@ -211,6 +212,7 @@ enum bpf_attach_type { BPF_TRACE_FENTRY, BPF_TRACE_FEXIT, BPF_MODIFY_RETURN, + BPF_LSM_MAC, __MAX_BPF_ATTACH_TYPE }; diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index b782ebef6ac9..2c92059c0c90 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -108,6 +108,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, case BPF_PROG_TYPE_TRACING: case BPF_PROG_TYPE_STRUCT_OPS: case BPF_PROG_TYPE_EXT: + case BPF_PROG_TYPE_LSM: default: break; } -- cgit v1.2.3-71-gd317 From 9d3fdea789c8fab51381c2d609932fabe94c0517 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Sun, 29 Mar 2020 01:43:51 +0100 Subject: bpf: lsm: Provide attachment points for BPF LSM programs When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_, are generated for each LSM hook. These functions are initialized as LSM hooks in a subsequent patch. Signed-off-by: KP Singh Signed-off-by: Daniel Borkmann Reviewed-by: Brendan Jackman Reviewed-by: Florent Revest Reviewed-by: Kees Cook Acked-by: Yonghong Song Acked-by: James Morris Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org --- include/linux/bpf_lsm.h | 22 ++++++++++++++++++++++ kernel/bpf/bpf_lsm.c | 14 ++++++++++++++ 2 files changed, 36 insertions(+) create mode 100644 include/linux/bpf_lsm.h (limited to 'kernel') diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h new file mode 100644 index 000000000000..83b96895829f --- /dev/null +++ b/include/linux/bpf_lsm.h @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* + * Copyright (C) 2020 Google LLC. + */ + +#ifndef _LINUX_BPF_LSM_H +#define _LINUX_BPF_LSM_H + +#include +#include + +#ifdef CONFIG_BPF_LSM + +#define LSM_HOOK(RET, DEFAULT, NAME, ...) \ + RET bpf_lsm_##NAME(__VA_ARGS__); +#include +#undef LSM_HOOK + +#endif /* CONFIG_BPF_LSM */ + +#endif /* _LINUX_BPF_LSM_H */ diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index 82875039ca90..3b3bbb28603e 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -7,6 +7,20 @@ #include #include #include +#include +#include + +/* For every LSM hook that allows attachment of BPF programs, declare a nop + * function where a BPF program can be attached. + */ +#define LSM_HOOK(RET, DEFAULT, NAME, ...) \ +noinline RET bpf_lsm_##NAME(__VA_ARGS__) \ +{ \ + return DEFAULT; \ +} + +#include +#undef LSM_HOOK const struct bpf_prog_ops lsm_prog_ops = { }; -- cgit v1.2.3-71-gd317 From 9e4e01dfd3254c7f04f24b7c6b29596bc12332f3 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Sun, 29 Mar 2020 01:43:52 +0100 Subject: bpf: lsm: Implement attach, detach and execution JITed BPF programs are dynamically attached to the LSM hooks using BPF trampolines. The trampoline prologue generates code to handle conversion of the signature of the hook to the appropriate BPF context. The allocated trampoline programs are attached to the nop functions initialized as LSM hooks. BPF_PROG_TYPE_LSM programs must have a GPL compatible license and and need CAP_SYS_ADMIN (required for loading eBPF programs). Upon attachment: * A BPF fexit trampoline is used for LSM hooks with a void return type. * A BPF fmod_ret trampoline is used for LSM hooks which return an int. The attached programs can override the return value of the bpf LSM hook to indicate a MAC Policy decision. Signed-off-by: KP Singh Signed-off-by: Daniel Borkmann Reviewed-by: Brendan Jackman Reviewed-by: Florent Revest Acked-by: Andrii Nakryiko Acked-by: James Morris Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org --- include/linux/bpf_lsm.h | 11 ++++++++++ kernel/bpf/bpf_lsm.c | 23 ++++++++++++++++++++ kernel/bpf/btf.c | 16 +++++++++++++- kernel/bpf/syscall.c | 57 +++++++++++++++++++++++++++++++++---------------- kernel/bpf/trampoline.c | 17 +++++++++++---- kernel/bpf/verifier.c | 19 +++++++++++++---- 6 files changed, 116 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h index 83b96895829f..af74712af585 100644 --- a/include/linux/bpf_lsm.h +++ b/include/linux/bpf_lsm.h @@ -17,6 +17,17 @@ #include #undef LSM_HOOK +int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, + const struct bpf_prog *prog); + +#else /* !CONFIG_BPF_LSM */ + +static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, + const struct bpf_prog *prog) +{ + return -EOPNOTSUPP; +} + #endif /* CONFIG_BPF_LSM */ #endif /* _LINUX_BPF_LSM_H */ diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index 3b3bbb28603e..19636703b24e 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -9,6 +9,8 @@ #include #include #include +#include +#include /* For every LSM hook that allows attachment of BPF programs, declare a nop * function where a BPF program can be attached. @@ -22,6 +24,27 @@ noinline RET bpf_lsm_##NAME(__VA_ARGS__) \ #include #undef LSM_HOOK +#define BPF_LSM_SYM_PREFX "bpf_lsm_" + +int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, + const struct bpf_prog *prog) +{ + if (!prog->gpl_compatible) { + bpf_log(vlog, + "LSM programs must have a GPL compatible license\n"); + return -EINVAL; + } + + if (strncmp(BPF_LSM_SYM_PREFX, prog->aux->attach_func_name, + sizeof(BPF_LSM_SYM_PREFX) - 1)) { + bpf_log(vlog, "attach_btf_id %u points to wrong type name %s\n", + prog->aux->attach_btf_id, prog->aux->attach_func_name); + return -EINVAL; + } + + return 0; +} + const struct bpf_prog_ops lsm_prog_ops = { }; diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 6f397c4da05e..de335cd386f0 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3710,7 +3710,21 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, } if (arg == nr_args) { - if (prog->expected_attach_type == BPF_TRACE_FEXIT) { + if (prog->expected_attach_type == BPF_TRACE_FEXIT || + prog->expected_attach_type == BPF_LSM_MAC) { + /* When LSM programs are attached to void LSM hooks + * they use FEXIT trampolines and when attached to + * int LSM hooks, they use MODIFY_RETURN trampolines. + * + * While the LSM programs are BPF_MODIFY_RETURN-like + * the check: + * + * if (ret_type != 'int') + * return -EINVAL; + * + * is _not_ done here. This is still safe as LSM hooks + * have only void and int return types. + */ if (!t) return true; t = btf_type_by_id(btf, t->type); diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b2584b25748c..a616b63f23b4 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -25,6 +25,7 @@ #include #include #include +#include #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ @@ -1935,6 +1936,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, switch (prog_type) { case BPF_PROG_TYPE_TRACING: + case BPF_PROG_TYPE_LSM: case BPF_PROG_TYPE_STRUCT_OPS: case BPF_PROG_TYPE_EXT: break; @@ -2366,10 +2368,28 @@ static int bpf_tracing_prog_attach(struct bpf_prog *prog) struct file *link_file; int link_fd, err; - if (prog->expected_attach_type != BPF_TRACE_FENTRY && - prog->expected_attach_type != BPF_TRACE_FEXIT && - prog->expected_attach_type != BPF_MODIFY_RETURN && - prog->type != BPF_PROG_TYPE_EXT) { + switch (prog->type) { + case BPF_PROG_TYPE_TRACING: + if (prog->expected_attach_type != BPF_TRACE_FENTRY && + prog->expected_attach_type != BPF_TRACE_FEXIT && + prog->expected_attach_type != BPF_MODIFY_RETURN) { + err = -EINVAL; + goto out_put_prog; + } + break; + case BPF_PROG_TYPE_EXT: + if (prog->expected_attach_type != 0) { + err = -EINVAL; + goto out_put_prog; + } + break; + case BPF_PROG_TYPE_LSM: + if (prog->expected_attach_type != BPF_LSM_MAC) { + err = -EINVAL; + goto out_put_prog; + } + break; + default: err = -EINVAL; goto out_put_prog; } @@ -2448,16 +2468,10 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) if (IS_ERR(prog)) return PTR_ERR(prog); - if (prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT && - prog->type != BPF_PROG_TYPE_TRACING && - prog->type != BPF_PROG_TYPE_EXT && - prog->type != BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE) { - err = -EINVAL; - goto out_put_prog; - } - - if (prog->type == BPF_PROG_TYPE_TRACING || - prog->type == BPF_PROG_TYPE_EXT) { + switch (prog->type) { + case BPF_PROG_TYPE_TRACING: + case BPF_PROG_TYPE_EXT: + case BPF_PROG_TYPE_LSM: if (attr->raw_tracepoint.name) { /* The attach point for this category of programs * should be specified via btf_id during program load. @@ -2465,11 +2479,14 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) err = -EINVAL; goto out_put_prog; } - if (prog->expected_attach_type == BPF_TRACE_RAW_TP) + if (prog->type == BPF_PROG_TYPE_TRACING && + prog->expected_attach_type == BPF_TRACE_RAW_TP) { tp_name = prog->aux->attach_func_name; - else - return bpf_tracing_prog_attach(prog); - } else { + break; + } + return bpf_tracing_prog_attach(prog); + case BPF_PROG_TYPE_RAW_TRACEPOINT: + case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE: if (strncpy_from_user(buf, u64_to_user_ptr(attr->raw_tracepoint.name), sizeof(buf) - 1) < 0) { @@ -2478,6 +2495,10 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) } buf[sizeof(buf) - 1] = 0; tp_name = buf; + break; + default: + err = -EINVAL; + goto out_put_prog; } btp = bpf_get_raw_tracepoint(tp_name); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index f30bca2a4d01..9be85aa4ec5f 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -6,6 +6,7 @@ #include #include #include +#include /* dummy _ops. The verifier will operate on target program's ops. */ const struct bpf_verifier_ops bpf_extension_verifier_ops = { @@ -233,15 +234,23 @@ out: return err; } -static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(enum bpf_attach_type t) +static enum bpf_tramp_prog_type bpf_attach_type_to_tramp(struct bpf_prog *prog) { - switch (t) { + switch (prog->expected_attach_type) { case BPF_TRACE_FENTRY: return BPF_TRAMP_FENTRY; case BPF_MODIFY_RETURN: return BPF_TRAMP_MODIFY_RETURN; case BPF_TRACE_FEXIT: return BPF_TRAMP_FEXIT; + case BPF_LSM_MAC: + if (!prog->aux->attach_func_proto->type) + /* The function returns void, we cannot modify its + * return value. + */ + return BPF_TRAMP_FEXIT; + else + return BPF_TRAMP_MODIFY_RETURN; default: return BPF_TRAMP_REPLACE; } @@ -255,7 +264,7 @@ int bpf_trampoline_link_prog(struct bpf_prog *prog) int cnt; tr = prog->aux->trampoline; - kind = bpf_attach_type_to_tramp(prog->expected_attach_type); + kind = bpf_attach_type_to_tramp(prog); mutex_lock(&tr->mutex); if (tr->extension_prog) { /* cannot attach fentry/fexit if extension prog is attached. @@ -305,7 +314,7 @@ int bpf_trampoline_unlink_prog(struct bpf_prog *prog) int err; tr = prog->aux->trampoline; - kind = bpf_attach_type_to_tramp(prog->expected_attach_type); + kind = bpf_attach_type_to_tramp(prog); mutex_lock(&tr->mutex); if (kind == BPF_TRAMP_REPLACE) { WARN_ON_ONCE(!tr->extension_prog); diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 46ba86c540e2..047b2e876399 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "disasm.h" @@ -6492,8 +6493,9 @@ static int check_return_code(struct bpf_verifier_env *env) struct tnum range = tnum_range(0, 1); int err; - /* The struct_ops func-ptr's return type could be "void" */ - if (env->prog->type == BPF_PROG_TYPE_STRUCT_OPS && + /* LSM and struct_ops func-ptr's return type could be "void" */ + if ((env->prog->type == BPF_PROG_TYPE_STRUCT_OPS || + env->prog->type == BPF_PROG_TYPE_LSM) && !prog->aux->attach_func_proto->type) return 0; @@ -9923,7 +9925,9 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) return check_struct_ops_btf_id(env); - if (prog->type != BPF_PROG_TYPE_TRACING && !prog_extension) + if (prog->type != BPF_PROG_TYPE_TRACING && + prog->type != BPF_PROG_TYPE_LSM && + !prog_extension) return 0; if (!btf_id) { @@ -10054,8 +10058,16 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) return -EINVAL; /* fallthrough */ case BPF_MODIFY_RETURN: + case BPF_LSM_MAC: case BPF_TRACE_FENTRY: case BPF_TRACE_FEXIT: + prog->aux->attach_func_name = tname; + if (prog->type == BPF_PROG_TYPE_LSM) { + ret = bpf_lsm_verify_prog(&env->log, prog); + if (ret < 0) + return ret; + } + if (!btf_type_is_func(t)) { verbose(env, "attach_btf_id %u is not a function\n", btf_id); @@ -10070,7 +10082,6 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) tr = bpf_trampoline_lookup(key); if (!tr) return -ENOMEM; - prog->aux->attach_func_name = tname; /* t is either vmlinux type or another program's type */ prog->aux->attach_func_proto = t; mutex_lock(&tr->mutex); -- cgit v1.2.3-71-gd317 From 3db81afd99494a33f1c3839103f0429c8f30cb9d Mon Sep 17 00:00:00 2001 From: Sven Schnelle Date: Tue, 10 Mar 2020 13:33:32 +0100 Subject: seccomp: Add missing compat_ioctl for notify Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit userland (both s390 and x86) doesn't work because there's no compat_ioctl handler defined. Add the handler. Signed-off-by: Sven Schnelle Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com Signed-off-by: Kees Cook --- kernel/seccomp.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 29022c1bbe18..ec5c606bc3a1 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -1225,6 +1225,7 @@ static const struct file_operations seccomp_notify_ops = { .poll = seccomp_notify_poll, .release = seccomp_notify_release, .unlocked_ioctl = seccomp_notify_ioctl, + .compat_ioctl = seccomp_notify_ioctl, }; static struct file *init_listener(struct seccomp_filter *filter) -- cgit v1.2.3-71-gd317 From f2d67fec0b43edce8c416101cdc52e71145b5fef Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Mon, 30 Mar 2020 18:03:22 +0200 Subject: bpf: Undo incorrect __reg_bound_offset32 handling Anatoly has been fuzzing with kBdysch harness and reported a hang in one of the outcomes: 0: (b7) r0 = 808464432 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0 9: (76) if w0 s>= 0x303030 goto pc+2 12: (95) exit from 8 to 9: safe from 5 to 6: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020;0x10000001f)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umin_value=271581184,umax_value=271581311,var_off=(0x10300000;0x7f)) R1_w=invP808464432 R10=fp0 9: safe from 8 to 9: safe verification time 589 usec stack depth 0 processed 17 insns (limit 1000000) [...] The underlying program was xlated as follows: # bpftool p d x i 9 0: (b7) r0 = 808464432 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 9: (76) if w0 s>= 0x303030 goto pc+2 10: (05) goto pc-1 11: (05) goto pc-1 12: (95) exit The verifier rewrote original instructions it recognized as dead code with 'goto pc-1', but reality differs from verifier simulation in that we're actually able to trigger a hang due to hitting the 'goto pc-1' instructions. Taking different examples to make the issue more obvious: in this example we're probing bounds on a completely unknown scalar variable in r1: [...] 5: R0_w=inv1 R1_w=inv(id=0) R10=fp0 5: (18) r2 = 0x4000000000 7: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R10=fp0 7: (18) r3 = 0x2000000000 9: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R10=fp0 9: (18) r4 = 0x400 11: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R10=fp0 11: (18) r5 = 0x200 13: R0_w=inv1 R1_w=inv(id=0) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 13: (2d) if r1 > r2 goto pc+4 R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 14: R0_w=inv1 R1_w=inv(id=0,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 14: (ad) if r1 < r3 goto pc+3 R0_w=inv1 R1_w=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2_w=inv274877906944 R3_w=inv137438953472 R4_w=inv1024 R5_w=inv512 R10=fp0 15: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7fffffffff)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 15: (2e) if w1 > w4 goto pc+2 R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 16: R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 16: (ae) if w1 < w5 goto pc+1 R0=inv1 R1=inv(id=0,umin_value=137438953472,umax_value=274877906944,var_off=(0x0; 0x7f00000000)) R2=inv274877906944 R3=inv137438953472 R4=inv1024 R5=inv512 R10=fp0 [...] We're first probing lower/upper bounds via jmp64, later we do a similar check via jmp32 and examine the resulting var_off there. After fall-through in insn 14, we get the following bounded r1 with 0x7fffffffff unknown marked bits in the variable section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111111111111111111111111111111111111 / 0x7fffffffff min: 0b010000000000000000000000000000000000000 / 0x2000000000 Now, in insn 15 and 16, we perform a similar probe with lower/upper bounds in jmp32. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and w1 <= 0x400 and w1 >= 0x200: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111100000000000000000000000000000000 / 0x7f00000000 min: 0b010000000000000000000000000000000000000 / 0x2000000000 The lower/upper bounds haven't changed since they have high bits set in u64 space and the jmp32 tests can only refine bounds in the low bits. However, for the var part the expectation would have been 0x7f000007ff or something less precise up to 0x7fffffffff. A outcome of 0x7f00000000 is not correct since it would contradict the earlier probed bounds where we know that the result should have been in [0x200,0x400] in u32 space. Therefore, tests with such info will lead to wrong verifier assumptions later on like falsely predicting conditional jumps to be always taken, etc. The issue here is that __reg_bound_offset32()'s implementation from commit 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions") makes an incorrect range assumption: static void __reg_bound_offset32(struct bpf_reg_state *reg) { u64 mask = 0xffffFFFF; struct tnum range = tnum_range(reg->umin_value & mask, reg->umax_value & mask); struct tnum lo32 = tnum_cast(reg->var_off, 4); struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); } In the above walk-through example, __reg_bound_offset32() as-is chose a range after masking with 0xffffffff of [0x0,0x0] since umin:0x2000000000 and umax:0x4000000000 and therefore the lo32 part was clamped to 0x0 as well. However, in the umin:0x2000000000 and umax:0x4000000000 range above we'd end up with an actual possible interval of [0x0,0xffffffff] for u32 space instead. In case of the original reproducer, the situation looked as follows at insn 5 for r0: [...] 5: R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0 0x30303030 0x13030302f 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x30303020; 0x10000001f)) R1_w=invP808464432 R10=fp0 0x30303030 0x13030302f [...] After the fall-through, we similarly forced the var_off result into the wrong range [0x30303030,0x3030302f] suggesting later on that fixed bits must only be of 0x30303020 with 0x10000001f unknowns whereas such assumption can only be made when both bounds in hi32 range match. Originally, I was thinking to fix this by moving reg into a temp reg and use proper coerce_reg_to_size() helper on the temp reg where we can then based on that define the range tnum for later intersection: static void __reg_bound_offset32(struct bpf_reg_state *reg) { struct bpf_reg_state tmp = *reg; struct tnum lo32, hi32, range; coerce_reg_to_size(&tmp, 4); range = tnum_range(tmp.umin_value, tmp.umax_value); lo32 = tnum_cast(reg->var_off, 4); hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); } In the case of the concrete example, this gives us a more conservative unknown section. Thus, after knowing r1 <= 0x4000000000 and r1 >= 0x2000000000 and w1 <= 0x400 and w1 >= 0x200: max: 0b100000000000000000000000000000000000000 / 0x4000000000 var: 0b111111111111111111111111111111111111111 / 0x7fffffffff min: 0b010000000000000000000000000000000000000 / 0x2000000000 However, above new __reg_bound_offset32() has no effect on refining the knowledge of the register contents. Meaning, if the bounds in hi32 range mismatch we'll get the identity function given the range reg spans [0x0,0xffffffff] and we cast var_off into lo32 only to later on binary or it again with the hi32. Likewise, if the bounds in hi32 range match, then we mask both bounds with 0xffffffff, use the resulting umin/umax for the range to later intersect the lo32 with it. However, _prior_ called __reg_bound_offset() did already such intersection on the full reg and we therefore would only repeat the same operation on the lo32 part twice. Given this has no effect and the original commit had false assumptions, this patch reverts the code entirely which is also more straight forward for stable trees: apparently 581738a681b6 got auto-selected by Sasha's ML system and misclassified as a fix, so it got sucked into v5.4 where it should never have landed. A revert is low-risk also from a user PoV since it requires a recent kernel and llc to opt-into -mcpu=v3 BPF CPU to generate jmp32 instructions. A proper bounds refinement would need a significantly more complex approach which is currently being worked, but no stable material [0]. Hence revert is best option for stable. After the revert, the original reported program gets rejected as follows: 1: (7f) r0 >>= r0 2: (14) w0 -= 808464432 3: (07) r0 += 808464432 4: (b7) r1 = 808464432 5: (de) if w1 s<= w0 goto pc+0 R0_w=invP(id=0,umin_value=808464432,umax_value=5103431727,var_off=(0x0; 0x1ffffffff)) R1_w=invP808464432 R10=fp0 6: (07) r0 += -2144337872 7: (14) w0 -= -1607454672 8: (25) if r0 > 0x30303030 goto pc+0 R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x3fffffff)) R1_w=invP808464432 R10=fp0 9: (76) if w0 s>= 0x303030 goto pc+2 R0=invP(id=0,umax_value=3158063,var_off=(0x0; 0x3fffff)) R1=invP808464432 R10=fp0 10: (30) r0 = *(u8 *)skb[808464432] BPF_LD_[ABS|IND] uses reserved fields processed 11 insns (limit 1000000) [...] [0] https://lore.kernel.org/bpf/158507130343.15666.8018068546764556975.stgit@john-Precision-5820-Tower/T/ Fixes: 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions") Reported-by: Anatoly Trosinenko Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200330160324.15259-2-daniel@iogearbox.net --- kernel/bpf/verifier.c | 19 ------------------- 1 file changed, 19 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 047b2e876399..2a84f73a93a1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1036,17 +1036,6 @@ static void __reg_bound_offset(struct bpf_reg_state *reg) reg->umax_value)); } -static void __reg_bound_offset32(struct bpf_reg_state *reg) -{ - u64 mask = 0xffffFFFF; - struct tnum range = tnum_range(reg->umin_value & mask, - reg->umax_value & mask); - struct tnum lo32 = tnum_cast(reg->var_off, 4); - struct tnum hi32 = tnum_lshift(tnum_rshift(reg->var_off, 32), 32); - - reg->var_off = tnum_or(hi32, tnum_intersect(lo32, range)); -} - /* Reset the min/max bounds of a register */ static void __mark_reg_unbounded(struct bpf_reg_state *reg) { @@ -5805,10 +5794,6 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, /* We might have learned some bits from the bounds. */ __reg_bound_offset(false_reg); __reg_bound_offset(true_reg); - if (is_jmp32) { - __reg_bound_offset32(false_reg); - __reg_bound_offset32(true_reg); - } /* Intersecting with the old var_off might have improved our bounds * slightly. e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), * then new var_off is (0; 0x7f...fc) which improves our umax. @@ -5918,10 +5903,6 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, /* We might have learned some bits from the bounds. */ __reg_bound_offset(false_reg); __reg_bound_offset(true_reg); - if (is_jmp32) { - __reg_bound_offset32(false_reg); - __reg_bound_offset32(true_reg); - } /* Intersecting with the old var_off might have improved our bounds * slightly. e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), * then new var_off is (0; 0x7f...fc) which improves our umax. -- cgit v1.2.3-71-gd317 From 604dca5e3af1db98bd123b7bfc02b017af99e3a0 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Mon, 30 Mar 2020 18:03:23 +0200 Subject: bpf: Fix tnum constraints for 32-bit comparisons The BPF verifier tried to track values based on 32-bit comparisons by (ab)using the tnum state via 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions"). The idea is that after a check like this: if ((u32)r0 > 3) exit We can't meaningfully constrain the arithmetic-range-based tracking, but we can update the tnum state to (value=0,mask=0xffff'ffff'0000'0003). However, the implementation from 581738a681b6 didn't compute the tnum constraint based on the fixed operand, but instead derives it from the arithmetic-range-based tracking. This means that after the following sequence of operations: if (r0 >= 0x1'0000'0001) exit if ((u32)r0 > 7) exit The verifier assumed that the lower half of r0 is in the range (0, 0) and apply the tnum constraint (value=0,mask=0xffff'ffff'0000'0000) thus causing the overall tnum to be (value=0,mask=0x1'0000'0000), which was incorrect. Provide a fixed implementation. Fixes: 581738a681b6 ("bpf: Provide better register bounds after jmp32 instructions") Signed-off-by: Jann Horn Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200330160324.15259-3-daniel@iogearbox.net --- kernel/bpf/verifier.c | 108 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 72 insertions(+), 36 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 2a84f73a93a1..6fce6f096c16 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5678,6 +5678,70 @@ static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg) reg->smax_value <= 0 && reg->smin_value >= S32_MIN); } +/* Constrain the possible values of @reg with unsigned upper bound @bound. + * If @is_exclusive, @bound is an exclusive limit, otherwise it is inclusive. + * If @is_jmp32, @bound is a 32-bit value that only constrains the low 32 bits + * of @reg. + */ +static void set_upper_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, + bool is_exclusive) +{ + if (is_exclusive) { + /* There are no values for `reg` that make `reg<0` true. */ + if (bound == 0) + return; + bound--; + } + if (is_jmp32) { + /* Constrain the register's value in the tnum representation. + * For 64-bit comparisons this happens later in + * __reg_bound_offset(), but for 32-bit comparisons, we can be + * more precise than what can be derived from the updated + * numeric bounds. + */ + struct tnum t = tnum_range(0, bound); + + t.mask |= ~0xffffffffULL; /* upper half is unknown */ + reg->var_off = tnum_intersect(reg->var_off, t); + + /* Compute the 64-bit bound from the 32-bit bound. */ + bound += gen_hi_max(reg->var_off); + } + reg->umax_value = min(reg->umax_value, bound); +} + +/* Constrain the possible values of @reg with unsigned lower bound @bound. + * If @is_exclusive, @bound is an exclusive limit, otherwise it is inclusive. + * If @is_jmp32, @bound is a 32-bit value that only constrains the low 32 bits + * of @reg. + */ +static void set_lower_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, + bool is_exclusive) +{ + if (is_exclusive) { + /* There are no values for `reg` that make `reg>MAX` true. */ + if (bound == (is_jmp32 ? U32_MAX : U64_MAX)) + return; + bound++; + } + if (is_jmp32) { + /* Constrain the register's value in the tnum representation. + * For 64-bit comparisons this happens later in + * __reg_bound_offset(), but for 32-bit comparisons, we can be + * more precise than what can be derived from the updated + * numeric bounds. + */ + struct tnum t = tnum_range(bound, U32_MAX); + + t.mask |= ~0xffffffffULL; /* upper half is unknown */ + reg->var_off = tnum_intersect(reg->var_off, t); + + /* Compute the 64-bit bound from the 32-bit bound. */ + bound += gen_hi_min(reg->var_off); + } + reg->umin_value = max(reg->umin_value, bound); +} + /* Adjusts the register min/max values in the case that the dst_reg is the * variable register that we are working on, and src_reg is a constant or we're * simply doing a BPF_K check. @@ -5733,15 +5797,8 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, case BPF_JGE: case BPF_JGT: { - u64 false_umax = opcode == BPF_JGT ? val : val - 1; - u64 true_umin = opcode == BPF_JGT ? val + 1 : val; - - if (is_jmp32) { - false_umax += gen_hi_max(false_reg->var_off); - true_umin += gen_hi_min(true_reg->var_off); - } - false_reg->umax_value = min(false_reg->umax_value, false_umax); - true_reg->umin_value = max(true_reg->umin_value, true_umin); + set_upper_bound(false_reg, val, is_jmp32, opcode == BPF_JGE); + set_lower_bound(true_reg, val, is_jmp32, opcode == BPF_JGT); break; } case BPF_JSGE: @@ -5762,15 +5819,8 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, case BPF_JLE: case BPF_JLT: { - u64 false_umin = opcode == BPF_JLT ? val : val + 1; - u64 true_umax = opcode == BPF_JLT ? val - 1 : val; - - if (is_jmp32) { - false_umin += gen_hi_min(false_reg->var_off); - true_umax += gen_hi_max(true_reg->var_off); - } - false_reg->umin_value = max(false_reg->umin_value, false_umin); - true_reg->umax_value = min(true_reg->umax_value, true_umax); + set_lower_bound(false_reg, val, is_jmp32, opcode == BPF_JLE); + set_upper_bound(true_reg, val, is_jmp32, opcode == BPF_JLT); break; } case BPF_JSLE: @@ -5845,15 +5895,8 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, case BPF_JGE: case BPF_JGT: { - u64 false_umin = opcode == BPF_JGT ? val : val + 1; - u64 true_umax = opcode == BPF_JGT ? val - 1 : val; - - if (is_jmp32) { - false_umin += gen_hi_min(false_reg->var_off); - true_umax += gen_hi_max(true_reg->var_off); - } - false_reg->umin_value = max(false_reg->umin_value, false_umin); - true_reg->umax_value = min(true_reg->umax_value, true_umax); + set_lower_bound(false_reg, val, is_jmp32, opcode == BPF_JGE); + set_upper_bound(true_reg, val, is_jmp32, opcode == BPF_JGT); break; } case BPF_JSGE: @@ -5871,15 +5914,8 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, case BPF_JLE: case BPF_JLT: { - u64 false_umax = opcode == BPF_JLT ? val : val - 1; - u64 true_umin = opcode == BPF_JLT ? val + 1 : val; - - if (is_jmp32) { - false_umax += gen_hi_max(false_reg->var_off); - true_umin += gen_hi_min(true_reg->var_off); - } - false_reg->umax_value = min(false_reg->umax_value, false_umax); - true_reg->umin_value = max(true_reg->umin_value, true_umin); + set_upper_bound(false_reg, val, is_jmp32, opcode == BPF_JLE); + set_lower_bound(true_reg, val, is_jmp32, opcode == BPF_JLT); break; } case BPF_JSLE: -- cgit v1.2.3-71-gd317 From 0fc31b10cfb7e5158bafb9d30839fbc9241c40c4 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Mon, 30 Mar 2020 18:03:24 +0200 Subject: bpf: Simplify reg_set_min_max_inv handling reg_set_min_max_inv() contains exactly the same logic as reg_set_min_max(), just flipped around. While this makes sense in a cBPF verifier (where ALU operations are not symmetric), it does not make sense for eBPF. Replace reg_set_min_max_inv() with a helper that flips the opcode around, then lets reg_set_min_max() do the complicated work. Signed-off-by: Jann Horn Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200330160324.15259-4-daniel@iogearbox.net --- kernel/bpf/verifier.c | 108 ++++++++++---------------------------------------- 1 file changed, 22 insertions(+), 86 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 6fce6f096c16..b55842033073 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -5836,7 +5836,7 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, break; } default: - break; + return; } __reg_deduce_bounds(false_reg); @@ -5859,92 +5859,28 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, struct bpf_reg_state *false_reg, u64 val, u8 opcode, bool is_jmp32) { - s64 sval; - - if (__is_pointer_value(false, false_reg)) - return; - - val = is_jmp32 ? (u32)val : val; - sval = is_jmp32 ? (s64)(s32)val : (s64)val; - - switch (opcode) { - case BPF_JEQ: - case BPF_JNE: - { - struct bpf_reg_state *reg = - opcode == BPF_JEQ ? true_reg : false_reg; - - if (is_jmp32) { - u64 old_v = reg->var_off.value; - u64 hi_mask = ~0xffffffffULL; - - reg->var_off.value = (old_v & hi_mask) | val; - reg->var_off.mask &= hi_mask; - } else { - __mark_reg_known(reg, val); - } - break; - } - case BPF_JSET: - false_reg->var_off = tnum_and(false_reg->var_off, - tnum_const(~val)); - if (is_power_of_2(val)) - true_reg->var_off = tnum_or(true_reg->var_off, - tnum_const(val)); - break; - case BPF_JGE: - case BPF_JGT: - { - set_lower_bound(false_reg, val, is_jmp32, opcode == BPF_JGE); - set_upper_bound(true_reg, val, is_jmp32, opcode == BPF_JGT); - break; - } - case BPF_JSGE: - case BPF_JSGT: - { - s64 false_smin = opcode == BPF_JSGT ? sval : sval + 1; - s64 true_smax = opcode == BPF_JSGT ? sval - 1 : sval; - - if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) - break; - false_reg->smin_value = max(false_reg->smin_value, false_smin); - true_reg->smax_value = min(true_reg->smax_value, true_smax); - break; - } - case BPF_JLE: - case BPF_JLT: - { - set_upper_bound(false_reg, val, is_jmp32, opcode == BPF_JLE); - set_lower_bound(true_reg, val, is_jmp32, opcode == BPF_JLT); - break; - } - case BPF_JSLE: - case BPF_JSLT: - { - s64 false_smax = opcode == BPF_JSLT ? sval : sval - 1; - s64 true_smin = opcode == BPF_JSLT ? sval + 1 : sval; - - if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) - break; - false_reg->smax_value = min(false_reg->smax_value, false_smax); - true_reg->smin_value = max(true_reg->smin_value, true_smin); - break; - } - default: - break; - } - - __reg_deduce_bounds(false_reg); - __reg_deduce_bounds(true_reg); - /* We might have learned some bits from the bounds. */ - __reg_bound_offset(false_reg); - __reg_bound_offset(true_reg); - /* Intersecting with the old var_off might have improved our bounds - * slightly. e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), - * then new var_off is (0; 0x7f...fc) which improves our umax. + /* How can we transform "a b" into "b a"? */ + static const u8 opcode_flip[16] = { + /* these stay the same */ + [BPF_JEQ >> 4] = BPF_JEQ, + [BPF_JNE >> 4] = BPF_JNE, + [BPF_JSET >> 4] = BPF_JSET, + /* these swap "lesser" and "greater" (L and G in the opcodes) */ + [BPF_JGE >> 4] = BPF_JLE, + [BPF_JGT >> 4] = BPF_JLT, + [BPF_JLE >> 4] = BPF_JGE, + [BPF_JLT >> 4] = BPF_JGT, + [BPF_JSGE >> 4] = BPF_JSLE, + [BPF_JSGT >> 4] = BPF_JSLT, + [BPF_JSLE >> 4] = BPF_JSGE, + [BPF_JSLT >> 4] = BPF_JSGT + }; + opcode = opcode_flip[opcode >> 4]; + /* This uses zero as "not present in table"; luckily the zero opcode, + * BPF_JA, can't get here. */ - __update_reg_bounds(false_reg); - __update_reg_bounds(true_reg); + if (opcode) + reg_set_min_max(true_reg, false_reg, val, opcode, is_jmp32); } /* Regs are known to be equal, so intersect their min/max/var_off */ -- cgit v1.2.3-71-gd317 From f50b49a0bfcaf53e6394a873b588bc4cca2aab78 Mon Sep 17 00:00:00 2001 From: KP Singh Date: Mon, 30 Mar 2020 16:42:46 +0200 Subject: bpf: btf: Fix arg verification in btf_ctx_access() The bounds checking for the arguments accessed in the BPF program breaks when the expected_attach_type is not BPF_TRACE_FEXIT, BPF_LSM_MAC or BPF_MODIFY_RETURN resulting in no check being done for the default case (the programs which do not receive the return value of the attached function in its arguments) when the index of the argument being accessed is equal to the number of arguments (nr_args). This was a result of a misplaced "else if" block introduced by the Commit 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Reported-by: Jann Horn Signed-off-by: KP Singh Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200330144246.338-1-kpsingh@chromium.org --- kernel/bpf/btf.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index de335cd386f0..3b6dcfb6ea49 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -3709,9 +3709,16 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, nr_args--; } + if (arg > nr_args) { + bpf_log(log, "func '%s' doesn't have %d-th argument\n", + tname, arg + 1); + return false; + } + if (arg == nr_args) { - if (prog->expected_attach_type == BPF_TRACE_FEXIT || - prog->expected_attach_type == BPF_LSM_MAC) { + switch (prog->expected_attach_type) { + case BPF_LSM_MAC: + case BPF_TRACE_FEXIT: /* When LSM programs are attached to void LSM hooks * they use FEXIT trampolines and when attached to * int LSM hooks, they use MODIFY_RETURN trampolines. @@ -3728,7 +3735,8 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, if (!t) return true; t = btf_type_by_id(btf, t->type); - } else if (prog->expected_attach_type == BPF_MODIFY_RETURN) { + break; + case BPF_MODIFY_RETURN: /* For now the BPF_MODIFY_RETURN can only be attached to * functions that return an int. */ @@ -3742,17 +3750,19 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, btf_kind_str[BTF_INFO_KIND(t->info)]); return false; } + break; + default: + bpf_log(log, "func '%s' doesn't have %d-th argument\n", + tname, arg + 1); + return false; } - } else if (arg >= nr_args) { - bpf_log(log, "func '%s' doesn't have %d-th argument\n", - tname, arg + 1); - return false; } else { if (!t) /* Default prog with 5 args */ return true; t = btf_type_by_id(btf, args[arg].type); } + /* skip modifiers */ while (btf_type_is_modifier(t)) t = btf_type_by_id(btf, t->type); -- cgit v1.2.3-71-gd317 From 100605035e151c187360670c1776fb684808f145 Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Mon, 30 Mar 2020 14:36:19 -0700 Subject: bpf: Verifier, do_refine_retval_range may clamp umin to 0 incorrectly do_refine_retval_range() is called to refine return values from specified helpers, probe_read_str and get_stack at the moment, the reasoning is because both have a max value as part of their input arguments and because the helper ensure the return value will not be larger than this we can set smax values of the return register, r0. However, the return value is a signed integer so setting umax is incorrect It leads to further confusion when the do_refine_retval_range() then calls, __reg_deduce_bounds() which will see a umax value as meaning the value is unsigned and then assuming it is unsigned set the smin = umin which in this case results in 'smin = 0' and an 'smax = X' where X is the input argument from the helper call. Here are the comments from _reg_deduce_bounds() on why this would be safe to do. /* Learn sign from unsigned bounds. Signed bounds cross the sign * boundary, so we must be careful. */ if ((s64)reg->umax_value >= 0) { /* Positive. We can't learn anything from the smin, but smax * is positive, hence safe. */ reg->smin_value = reg->umin_value; reg->smax_value = reg->umax_value = min_t(u64, reg->smax_value, reg->umax_value); But now we incorrectly have a return value with type int with the signed bounds (0,X). Suppose the return value is negative, which is possible the we have the verifier and reality out of sync. Among other things this may result in any error handling code being falsely detected as dead-code and removed. For instance the example below shows using bpf_probe_read_str() causes the error path to be identified as dead code and removed. >From the 'llvm-object -S' dump, r2 = 100 call 45 if r0 s< 0 goto +4 r4 = *(u32 *)(r7 + 0) But from dump xlate (b7) r2 = 100 (85) call bpf_probe_read_compat_str#-96768 (61) r4 = *(u32 *)(r7 +0) <-- dropped if goto Due to verifier state after call being R0=inv(id=0,umax_value=100,var_off=(0x0; 0x7f)) To fix omit setting the umax value because its not safe. The only actual bounds we know is the smax. This results in the correct bounds (SMIN, X) where X is the max length from the helper. After this the new verifier state looks like the following after call 45. R0=inv(id=0,smax_value=100) Then xlated version no longer removed dead code giving the expected result, (b7) r2 = 100 (85) call bpf_probe_read_compat_str#-96768 (c5) if r0 s< 0x0 goto pc+4 (61) r4 = *(u32 *)(r7 +0) Note, bpf_probe_read_* calls are root only so we wont hit this case with non-root bpf users. v3: comment had some documentation about meta set to null case which is not relevant here and confusing to include in the comment. v2 note: In original version we set msize_smax_value from check_func_arg() and propagated this into smax of retval. The logic was smax is the bound on the retval we set and because the type in the helper is ARG_CONST_SIZE we know that the reg is a positive tnum_const() so umax=smax. Alexei pointed out though this is a bit odd to read because the register in check_func_arg() has a C type of u32 and the umax bound would be the normally relavent bound here. Pulling in extra knowledge about future checks makes reading the code a bit tricky. Further having a signed meta data that can only ever be positive is also a bit odd. So dropped the msize_smax_value metadata and made it a u64 msize_max_value to indicate its unsigned. And additionally save bound from umax value in check_arg_funcs which is the same as smax due to as noted above tnumx_cont and negative check but reads better. By my analysis nothing functionally changes in v2 but it does get easier to read so that is win. Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/158560417900.10843.14351995140624628941.stgit@john-Precision-5820-Tower --- kernel/bpf/verifier.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b55842033073..dda3b94d9661 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -229,8 +229,7 @@ struct bpf_call_arg_meta { bool pkt_access; int regno; int access_size; - s64 msize_smax_value; - u64 msize_umax_value; + u64 msize_max_value; int ref_obj_id; int func_id; u32 btf_id; @@ -3571,11 +3570,15 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, } else if (arg_type_is_mem_size(arg_type)) { bool zero_size_allowed = (arg_type == ARG_CONST_SIZE_OR_ZERO); - /* remember the mem_size which may be used later - * to refine return values. + /* This is used to refine r0 return value bounds for helpers + * that enforce this value as an upper bound on return values. + * See do_refine_retval_range() for helpers that can refine + * the return value. C type of helper is u32 so we pull register + * bound from umax_value however, if negative verifier errors + * out. Only upper bounds can be learned because retval is an + * int type and negative retvals are allowed. */ - meta->msize_smax_value = reg->smax_value; - meta->msize_umax_value = reg->umax_value; + meta->msize_max_value = reg->umax_value; /* The register is SCALAR_VALUE; the access check * happens using its boundaries. @@ -4118,10 +4121,10 @@ static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type, func_id != BPF_FUNC_probe_read_str)) return; - ret_reg->smax_value = meta->msize_smax_value; - ret_reg->umax_value = meta->msize_umax_value; + ret_reg->smax_value = meta->msize_max_value; __reg_deduce_bounds(ret_reg); __reg_bound_offset(ret_reg); + __update_reg_bounds(ret_reg); } static int -- cgit v1.2.3-71-gd317 From 3f50f132d8400e129fc9eb68b5020167ef80a244 Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Mon, 30 Mar 2020 14:36:39 -0700 Subject: bpf: Verifier, do explicit ALU32 bounds tracking It is not possible for the current verifier to track ALU32 and JMP ops correctly. This can result in the verifier aborting with errors even though the program should be verifiable. BPF codes that hit this can work around it by changin int variables to 64-bit types, marking variables volatile, etc. But this is all very ugly so it would be better to avoid these tricks. But, the main reason to address this now is do_refine_retval_range() was assuming return values could not be negative. Once we fixed this code that was previously working will no longer work. See do_refine_retval_range() patch for details. And we don't want to suddenly cause programs that used to work to fail. The simplest example code snippet that illustrates the problem is likely this, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 64-bit 2. MOV ALU64 - copy 64-bit -> 32-bit 3. op ALU32 - zext 32-bit -> 64-bit 4. op ALU64 - n/a 5. jmp ALU32 - 64-bit: var32_off | upper_32_bits(var64_off) 6. jmp ALU64 - 32-bit: (>> (<< var64_off)) Details for each case, For "MOV ALU32" BPF arch zero extends so we simply copy the bounds from 32-bit into 64-bit ensuring we truncate var_off and 64-bit bounds correctly. See zext_32_to_64. For "MOV ALU64" copy all bounds including 32-bit into new register. If the src register had 32-bit bounds the dst register will as well. For "op ALU32" zero extend 32-bit into 64-bit the same as move, see zext_32_to_64. For "op ALU64" calculate both 32-bit and 64-bit bounds no merging is done here. Except we have a special case. When RSH or ARSH is done we can't simply ignore shifting bits from 64-bit reg into the 32-bit subreg. So currently just push bounds from 64-bit into 32-bit. This will be correct in the sense that they will represent a valid state of the register. However we could lose some accuracy if an ARSH is following a jmp32 operation. We can handle this special case in a follow up series. For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We special case if 64-bit bounds has zero'd upper 32bits at which point we can simply copy 32-bit bounds into 64-bit register. This catches a common compiler trick where upper 32-bits are zeroed and then 32-bit ops are used followed by a 64-bit compare or 64-bit op on a pointer. See __reg_combine_64_into_32(). For "jmp ALU64" cast the bounds of the 64bit to their 32-bit counterpart. For example s32_min_value = (s32)reg->smin_value. For tnum use only the lower 32bits via, (>>(< Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower --- include/linux/bpf_verifier.h | 4 + include/linux/limits.h | 1 + include/linux/tnum.h | 12 + kernel/bpf/tnum.c | 15 + kernel/bpf/verifier.c | 1118 +++++++++++++++++++++++++++++++----------- 5 files changed, 869 insertions(+), 281 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h index 5406e6e96585..6abd5a778fcd 100644 --- a/include/linux/bpf_verifier.h +++ b/include/linux/bpf_verifier.h @@ -123,6 +123,10 @@ struct bpf_reg_state { s64 smax_value; /* maximum possible (s64)value */ u64 umin_value; /* minimum possible (u64)value */ u64 umax_value; /* maximum possible (u64)value */ + s32 s32_min_value; /* minimum possible (s32)value */ + s32 s32_max_value; /* maximum possible (s32)value */ + u32 u32_min_value; /* minimum possible (u32)value */ + u32 u32_max_value; /* maximum possible (u32)value */ /* parentage chain for liveness checking */ struct bpf_reg_state *parent; /* Inside the callee two registers can be both PTR_TO_STACK like diff --git a/include/linux/limits.h b/include/linux/limits.h index 76afcd24ff8c..0d3de82dd354 100644 --- a/include/linux/limits.h +++ b/include/linux/limits.h @@ -27,6 +27,7 @@ #define S16_MAX ((s16)(U16_MAX >> 1)) #define S16_MIN ((s16)(-S16_MAX - 1)) #define U32_MAX ((u32)~0U) +#define U32_MIN ((u32)0) #define S32_MAX ((s32)(U32_MAX >> 1)) #define S32_MIN ((s32)(-S32_MAX - 1)) #define U64_MAX ((u64)~0ULL) diff --git a/include/linux/tnum.h b/include/linux/tnum.h index ea627d1ab7e3..498dbcedb451 100644 --- a/include/linux/tnum.h +++ b/include/linux/tnum.h @@ -86,4 +86,16 @@ int tnum_strn(char *str, size_t size, struct tnum a); /* Format a tnum as tristate binary expansion */ int tnum_sbin(char *str, size_t size, struct tnum a); +/* Returns the 32-bit subreg */ +struct tnum tnum_subreg(struct tnum a); +/* Returns the tnum with the lower 32-bit subreg cleared */ +struct tnum tnum_clear_subreg(struct tnum a); +/* Returns the tnum with the lower 32-bit subreg set to value */ +struct tnum tnum_const_subreg(struct tnum a, u32 value); +/* Returns true if 32-bit subreg @a is a known constant*/ +static inline bool tnum_subreg_is_const(struct tnum a) +{ + return !(tnum_subreg(a)).mask; +} + #endif /* _LINUX_TNUM_H */ diff --git a/kernel/bpf/tnum.c b/kernel/bpf/tnum.c index d4f335a9a899..ceac5281bd31 100644 --- a/kernel/bpf/tnum.c +++ b/kernel/bpf/tnum.c @@ -194,3 +194,18 @@ int tnum_sbin(char *str, size_t size, struct tnum a) str[min(size - 1, (size_t)64)] = 0; return 64; } + +struct tnum tnum_subreg(struct tnum a) +{ + return tnum_cast(a, 4); +} + +struct tnum tnum_clear_subreg(struct tnum a) +{ + return tnum_lshift(tnum_rshift(a, 32), 32); +} + +struct tnum tnum_const_subreg(struct tnum a, u32 value) +{ + return tnum_or(tnum_clear_subreg(a), tnum_const(value)); +} diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index dda3b94d9661..1c60d001bb46 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -550,6 +550,22 @@ static void print_verifier_state(struct bpf_verifier_env *env, tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off); verbose(env, ",var_off=%s", tn_buf); } + if (reg->s32_min_value != reg->smin_value && + reg->s32_min_value != S32_MIN) + verbose(env, ",s32_min_value=%d", + (int)(reg->s32_min_value)); + if (reg->s32_max_value != reg->smax_value && + reg->s32_max_value != S32_MAX) + verbose(env, ",s32_max_value=%d", + (int)(reg->s32_max_value)); + if (reg->u32_min_value != reg->umin_value && + reg->u32_min_value != U32_MIN) + verbose(env, ",u32_min_value=%d", + (int)(reg->u32_min_value)); + if (reg->u32_max_value != reg->umax_value && + reg->u32_max_value != U32_MAX) + verbose(env, ",u32_max_value=%d", + (int)(reg->u32_max_value)); } verbose(env, ")"); } @@ -924,6 +940,20 @@ static void __mark_reg_known(struct bpf_reg_state *reg, u64 imm) reg->smax_value = (s64)imm; reg->umin_value = imm; reg->umax_value = imm; + + reg->s32_min_value = (s32)imm; + reg->s32_max_value = (s32)imm; + reg->u32_min_value = (u32)imm; + reg->u32_max_value = (u32)imm; +} + +static void __mark_reg32_known(struct bpf_reg_state *reg, u64 imm) +{ + reg->var_off = tnum_const_subreg(reg->var_off, imm); + reg->s32_min_value = (s32)imm; + reg->s32_max_value = (s32)imm; + reg->u32_min_value = (u32)imm; + reg->u32_max_value = (u32)imm; } /* Mark the 'variable offset' part of a register as zero. This should be @@ -978,8 +1008,52 @@ static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg, tnum_equals_const(reg->var_off, 0); } -/* Attempts to improve min/max values based on var_off information */ -static void __update_reg_bounds(struct bpf_reg_state *reg) +/* Reset the min/max bounds of a register */ +static void __mark_reg_unbounded(struct bpf_reg_state *reg) +{ + reg->smin_value = S64_MIN; + reg->smax_value = S64_MAX; + reg->umin_value = 0; + reg->umax_value = U64_MAX; + + reg->s32_min_value = S32_MIN; + reg->s32_max_value = S32_MAX; + reg->u32_min_value = 0; + reg->u32_max_value = U32_MAX; +} + +static void __mark_reg64_unbounded(struct bpf_reg_state *reg) +{ + reg->smin_value = S64_MIN; + reg->smax_value = S64_MAX; + reg->umin_value = 0; + reg->umax_value = U64_MAX; +} + +static void __mark_reg32_unbounded(struct bpf_reg_state *reg) +{ + reg->s32_min_value = S32_MIN; + reg->s32_max_value = S32_MAX; + reg->u32_min_value = 0; + reg->u32_max_value = U32_MAX; +} + +static void __update_reg32_bounds(struct bpf_reg_state *reg) +{ + struct tnum var32_off = tnum_subreg(reg->var_off); + + /* min signed is max(sign bit) | min(other bits) */ + reg->s32_min_value = max_t(s32, reg->s32_min_value, + var32_off.value | (var32_off.mask & S32_MIN)); + /* max signed is min(sign bit) | max(other bits) */ + reg->s32_max_value = min_t(s32, reg->s32_max_value, + var32_off.value | (var32_off.mask & S32_MAX)); + reg->u32_min_value = max_t(u32, reg->u32_min_value, (u32)var32_off.value); + reg->u32_max_value = min(reg->u32_max_value, + (u32)(var32_off.value | var32_off.mask)); +} + +static void __update_reg64_bounds(struct bpf_reg_state *reg) { /* min signed is max(sign bit) | min(other bits) */ reg->smin_value = max_t(s64, reg->smin_value, @@ -992,8 +1066,48 @@ static void __update_reg_bounds(struct bpf_reg_state *reg) reg->var_off.value | reg->var_off.mask); } +static void __update_reg_bounds(struct bpf_reg_state *reg) +{ + __update_reg32_bounds(reg); + __update_reg64_bounds(reg); +} + /* Uses signed min/max values to inform unsigned, and vice-versa */ -static void __reg_deduce_bounds(struct bpf_reg_state *reg) +static void __reg32_deduce_bounds(struct bpf_reg_state *reg) +{ + /* Learn sign from signed bounds. + * If we cannot cross the sign boundary, then signed and unsigned bounds + * are the same, so combine. This works even in the negative case, e.g. + * -3 s<= x s<= -1 implies 0xf...fd u<= x u<= 0xf...ff. + */ + if (reg->s32_min_value >= 0 || reg->s32_max_value < 0) { + reg->s32_min_value = reg->u32_min_value = + max_t(u32, reg->s32_min_value, reg->u32_min_value); + reg->s32_max_value = reg->u32_max_value = + min_t(u32, reg->s32_max_value, reg->u32_max_value); + return; + } + /* Learn sign from unsigned bounds. Signed bounds cross the sign + * boundary, so we must be careful. + */ + if ((s32)reg->u32_max_value >= 0) { + /* Positive. We can't learn anything from the smin, but smax + * is positive, hence safe. + */ + reg->s32_min_value = reg->u32_min_value; + reg->s32_max_value = reg->u32_max_value = + min_t(u32, reg->s32_max_value, reg->u32_max_value); + } else if ((s32)reg->u32_min_value < 0) { + /* Negative. We can't learn anything from the smax, but smin + * is negative, hence safe. + */ + reg->s32_min_value = reg->u32_min_value = + max_t(u32, reg->s32_min_value, reg->u32_min_value); + reg->s32_max_value = reg->u32_max_value; + } +} + +static void __reg64_deduce_bounds(struct bpf_reg_state *reg) { /* Learn sign from signed bounds. * If we cannot cross the sign boundary, then signed and unsigned bounds @@ -1027,21 +1141,106 @@ static void __reg_deduce_bounds(struct bpf_reg_state *reg) } } +static void __reg_deduce_bounds(struct bpf_reg_state *reg) +{ + __reg32_deduce_bounds(reg); + __reg64_deduce_bounds(reg); +} + /* Attempts to improve var_off based on unsigned min/max information */ static void __reg_bound_offset(struct bpf_reg_state *reg) { - reg->var_off = tnum_intersect(reg->var_off, - tnum_range(reg->umin_value, - reg->umax_value)); + struct tnum var64_off = tnum_intersect(reg->var_off, + tnum_range(reg->umin_value, + reg->umax_value)); + struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off), + tnum_range(reg->u32_min_value, + reg->u32_max_value)); + + reg->var_off = tnum_or(tnum_clear_subreg(var64_off), var32_off); } -/* Reset the min/max bounds of a register */ -static void __mark_reg_unbounded(struct bpf_reg_state *reg) +static void __reg_assign_32_into_64(struct bpf_reg_state *reg) { - reg->smin_value = S64_MIN; - reg->smax_value = S64_MAX; - reg->umin_value = 0; - reg->umax_value = U64_MAX; + reg->umin_value = reg->u32_min_value; + reg->umax_value = reg->u32_max_value; + /* Attempt to pull 32-bit signed bounds into 64-bit bounds + * but must be positive otherwise set to worse case bounds + * and refine later from tnum. + */ + if (reg->s32_min_value > 0) + reg->smin_value = reg->s32_min_value; + else + reg->smin_value = 0; + if (reg->s32_max_value > 0) + reg->smax_value = reg->s32_max_value; + else + reg->smax_value = U32_MAX; +} + +static void __reg_combine_32_into_64(struct bpf_reg_state *reg) +{ + /* special case when 64-bit register has upper 32-bit register + * zeroed. Typically happens after zext or <<32, >>32 sequence + * allowing us to use 32-bit bounds directly, + */ + if (tnum_equals_const(tnum_clear_subreg(reg->var_off), 0)) { + __reg_assign_32_into_64(reg); + } else { + /* Otherwise the best we can do is push lower 32bit known and + * unknown bits into register (var_off set from jmp logic) + * then learn as much as possible from the 64-bit tnum + * known and unknown bits. The previous smin/smax bounds are + * invalid here because of jmp32 compare so mark them unknown + * so they do not impact tnum bounds calculation. + */ + __mark_reg64_unbounded(reg); + __update_reg_bounds(reg); + } + + /* Intersecting with the old var_off might have improved our bounds + * slightly. e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), + * then new var_off is (0; 0x7f...fc) which improves our umax. + */ + __reg_deduce_bounds(reg); + __reg_bound_offset(reg); + __update_reg_bounds(reg); +} + +static bool __reg64_bound_s32(s64 a) +{ + if (a > S32_MIN && a < S32_MAX) + return true; + return false; +} + +static bool __reg64_bound_u32(u64 a) +{ + if (a > U32_MIN && a < U32_MAX) + return true; + return false; +} + +static void __reg_combine_64_into_32(struct bpf_reg_state *reg) +{ + __mark_reg32_unbounded(reg); + + if (__reg64_bound_s32(reg->smin_value)) + reg->s32_min_value = (s32)reg->smin_value; + if (__reg64_bound_s32(reg->smax_value)) + reg->s32_max_value = (s32)reg->smax_value; + if (__reg64_bound_u32(reg->umin_value)) + reg->u32_min_value = (u32)reg->umin_value; + if (__reg64_bound_u32(reg->umax_value)) + reg->u32_max_value = (u32)reg->umax_value; + + /* Intersecting with the old var_off might have improved our bounds + * slightly. e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), + * then new var_off is (0; 0x7f...fc) which improves our umax. + */ + __reg_deduce_bounds(reg); + __reg_bound_offset(reg); + __update_reg_bounds(reg); } /* Mark a register as having a completely unknown (scalar) value. */ @@ -2774,6 +2973,12 @@ static int check_tp_buffer_access(struct bpf_verifier_env *env, return 0; } +/* BPF architecture zero extends alu32 ops into 64-bit registesr */ +static void zext_32_to_64(struct bpf_reg_state *reg) +{ + reg->var_off = tnum_subreg(reg->var_off); + __reg_assign_32_into_64(reg); +} /* truncate register to smaller size (in bytes) * must be called with size < BPF_REG_SIZE @@ -2796,6 +3001,14 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size) } reg->smin_value = reg->umin_value; reg->smax_value = reg->umax_value; + + /* If size is smaller than 32bit register the 32bit register + * values are also truncated so we push 64-bit bounds into + * 32-bit bounds. Above were truncated < 32-bits already. + */ + if (size >= 4) + return; + __reg_combine_64_into_32(reg); } static bool bpf_map_is_rdonly(const struct bpf_map *map) @@ -4431,7 +4644,17 @@ static bool signed_add_overflows(s64 a, s64 b) return res < a; } -static bool signed_sub_overflows(s64 a, s64 b) +static bool signed_add32_overflows(s64 a, s64 b) +{ + /* Do the add in u32, where overflow is well-defined */ + s32 res = (s32)((u32)a + (u32)b); + + if (b < 0) + return res > a; + return res < a; +} + +static bool signed_sub_overflows(s32 a, s32 b) { /* Do the sub in u64, where overflow is well-defined */ s64 res = (s64)((u64)a - (u64)b); @@ -4441,6 +4664,16 @@ static bool signed_sub_overflows(s64 a, s64 b) return res > a; } +static bool signed_sub32_overflows(s32 a, s32 b) +{ + /* Do the sub in u64, where overflow is well-defined */ + s32 res = (s32)((u32)a - (u32)b); + + if (b < 0) + return res < a; + return res > a; +} + static bool check_reg_sane_offset(struct bpf_verifier_env *env, const struct bpf_reg_state *reg, enum bpf_reg_type type) @@ -4677,6 +4910,9 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, !check_reg_sane_offset(env, ptr_reg, ptr_reg->type)) return -EINVAL; + /* pointer types do not carry 32-bit bounds at the moment. */ + __mark_reg32_unbounded(dst_reg); + switch (opcode) { case BPF_ADD: ret = sanitize_ptr_alu(env, insn, ptr_reg, dst_reg, smin_val < 0); @@ -4840,6 +5076,32 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env, return 0; } +static void scalar32_min_max_add(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s32 smin_val = src_reg->s32_min_value; + s32 smax_val = src_reg->s32_max_value; + u32 umin_val = src_reg->u32_min_value; + u32 umax_val = src_reg->u32_max_value; + + if (signed_add32_overflows(dst_reg->s32_min_value, smin_val) || + signed_add32_overflows(dst_reg->s32_max_value, smax_val)) { + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + } else { + dst_reg->s32_min_value += smin_val; + dst_reg->s32_max_value += smax_val; + } + if (dst_reg->u32_min_value + umin_val < umin_val || + dst_reg->u32_max_value + umax_val < umax_val) { + dst_reg->u32_min_value = 0; + dst_reg->u32_max_value = U32_MAX; + } else { + dst_reg->u32_min_value += umin_val; + dst_reg->u32_max_value += umax_val; + } +} + static void scalar_min_max_add(struct bpf_reg_state *dst_reg, struct bpf_reg_state *src_reg) { @@ -4864,7 +5126,34 @@ static void scalar_min_max_add(struct bpf_reg_state *dst_reg, dst_reg->umin_value += umin_val; dst_reg->umax_value += umax_val; } - dst_reg->var_off = tnum_add(dst_reg->var_off, src_reg->var_off); +} + +static void scalar32_min_max_sub(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s32 smin_val = src_reg->s32_min_value; + s32 smax_val = src_reg->s32_max_value; + u32 umin_val = src_reg->u32_min_value; + u32 umax_val = src_reg->u32_max_value; + + if (signed_sub32_overflows(dst_reg->s32_min_value, smax_val) || + signed_sub32_overflows(dst_reg->s32_max_value, smin_val)) { + /* Overflow possible, we know nothing */ + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + } else { + dst_reg->s32_min_value -= smax_val; + dst_reg->s32_max_value -= smin_val; + } + if (dst_reg->u32_min_value < umax_val) { + /* Overflow possible, we know nothing */ + dst_reg->u32_min_value = 0; + dst_reg->u32_max_value = U32_MAX; + } else { + /* Cannot overflow (as long as bounds are consistent) */ + dst_reg->u32_min_value -= umax_val; + dst_reg->u32_max_value -= umin_val; + } } static void scalar_min_max_sub(struct bpf_reg_state *dst_reg, @@ -4893,7 +5182,38 @@ static void scalar_min_max_sub(struct bpf_reg_state *dst_reg, dst_reg->umin_value -= umax_val; dst_reg->umax_value -= umin_val; } - dst_reg->var_off = tnum_sub(dst_reg->var_off, src_reg->var_off); +} + +static void scalar32_min_max_mul(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + s32 smin_val = src_reg->s32_min_value; + u32 umin_val = src_reg->u32_min_value; + u32 umax_val = src_reg->u32_max_value; + + if (smin_val < 0 || dst_reg->s32_min_value < 0) { + /* Ain't nobody got time to multiply that sign */ + __mark_reg32_unbounded(dst_reg); + return; + } + /* Both values are positive, so we can work with unsigned and + * copy the result to signed (unless it exceeds S32_MAX). + */ + if (umax_val > U16_MAX || dst_reg->u32_max_value > U16_MAX) { + /* Potential overflow, we know nothing */ + __mark_reg32_unbounded(dst_reg); + return; + } + dst_reg->u32_min_value *= umin_val; + dst_reg->u32_max_value *= umax_val; + if (dst_reg->u32_max_value > S32_MAX) { + /* Overflow possible, we know nothing */ + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + } else { + dst_reg->s32_min_value = dst_reg->u32_min_value; + dst_reg->s32_max_value = dst_reg->u32_max_value; + } } static void scalar_min_max_mul(struct bpf_reg_state *dst_reg, @@ -4903,11 +5223,9 @@ static void scalar_min_max_mul(struct bpf_reg_state *dst_reg, u64 umin_val = src_reg->umin_value; u64 umax_val = src_reg->umax_value; - dst_reg->var_off = tnum_mul(dst_reg->var_off, src_reg->var_off); if (smin_val < 0 || dst_reg->smin_value < 0) { /* Ain't nobody got time to multiply that sign */ - __mark_reg_unbounded(dst_reg); - __update_reg_bounds(dst_reg); + __mark_reg64_unbounded(dst_reg); return; } /* Both values are positive, so we can work with unsigned and @@ -4915,9 +5233,7 @@ static void scalar_min_max_mul(struct bpf_reg_state *dst_reg, */ if (umax_val > U32_MAX || dst_reg->umax_value > U32_MAX) { /* Potential overflow, we know nothing */ - __mark_reg_unbounded(dst_reg); - /* (except what we can learn from the var_off) */ - __update_reg_bounds(dst_reg); + __mark_reg64_unbounded(dst_reg); return; } dst_reg->umin_value *= umin_val; @@ -4932,16 +5248,59 @@ static void scalar_min_max_mul(struct bpf_reg_state *dst_reg, } } +static void scalar32_min_max_and(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + bool src_known = tnum_subreg_is_const(src_reg->var_off); + bool dst_known = tnum_subreg_is_const(dst_reg->var_off); + struct tnum var32_off = tnum_subreg(dst_reg->var_off); + s32 smin_val = src_reg->s32_min_value; + u32 umax_val = src_reg->u32_max_value; + + /* Assuming scalar64_min_max_and will be called so its safe + * to skip updating register for known 32-bit case. + */ + if (src_known && dst_known) + return; + + /* We get our minimum from the var_off, since that's inherently + * bitwise. Our maximum is the minimum of the operands' maxima. + */ + dst_reg->u32_min_value = var32_off.value; + dst_reg->u32_max_value = min(dst_reg->u32_max_value, umax_val); + if (dst_reg->s32_min_value < 0 || smin_val < 0) { + /* Lose signed bounds when ANDing negative numbers, + * ain't nobody got time for that. + */ + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + } else { + /* ANDing two positives gives a positive, so safe to + * cast result into s64. + */ + dst_reg->s32_min_value = dst_reg->u32_min_value; + dst_reg->s32_max_value = dst_reg->u32_max_value; + } + +} + static void scalar_min_max_and(struct bpf_reg_state *dst_reg, struct bpf_reg_state *src_reg) { + bool src_known = tnum_is_const(src_reg->var_off); + bool dst_known = tnum_is_const(dst_reg->var_off); s64 smin_val = src_reg->smin_value; u64 umax_val = src_reg->umax_value; + if (src_known && dst_known) { + __mark_reg_known(dst_reg, dst_reg->var_off.value & + src_reg->var_off.value); + return; + } + /* We get our minimum from the var_off, since that's inherently * bitwise. Our maximum is the minimum of the operands' maxima. */ - dst_reg->var_off = tnum_and(dst_reg->var_off, src_reg->var_off); dst_reg->umin_value = dst_reg->var_off.value; dst_reg->umax_value = min(dst_reg->umax_value, umax_val); if (dst_reg->smin_value < 0 || smin_val < 0) { @@ -4961,16 +5320,58 @@ static void scalar_min_max_and(struct bpf_reg_state *dst_reg, __update_reg_bounds(dst_reg); } +static void scalar32_min_max_or(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + bool src_known = tnum_subreg_is_const(src_reg->var_off); + bool dst_known = tnum_subreg_is_const(dst_reg->var_off); + struct tnum var32_off = tnum_subreg(dst_reg->var_off); + s32 smin_val = src_reg->smin_value; + u32 umin_val = src_reg->umin_value; + + /* Assuming scalar64_min_max_or will be called so it is safe + * to skip updating register for known case. + */ + if (src_known && dst_known) + return; + + /* We get our maximum from the var_off, and our minimum is the + * maximum of the operands' minima + */ + dst_reg->u32_min_value = max(dst_reg->u32_min_value, umin_val); + dst_reg->u32_max_value = var32_off.value | var32_off.mask; + if (dst_reg->s32_min_value < 0 || smin_val < 0) { + /* Lose signed bounds when ORing negative numbers, + * ain't nobody got time for that. + */ + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + } else { + /* ORing two positives gives a positive, so safe to + * cast result into s64. + */ + dst_reg->s32_min_value = dst_reg->umin_value; + dst_reg->s32_max_value = dst_reg->umax_value; + } +} + static void scalar_min_max_or(struct bpf_reg_state *dst_reg, struct bpf_reg_state *src_reg) { + bool src_known = tnum_is_const(src_reg->var_off); + bool dst_known = tnum_is_const(dst_reg->var_off); s64 smin_val = src_reg->smin_value; u64 umin_val = src_reg->umin_value; + if (src_known && dst_known) { + __mark_reg_known(dst_reg, dst_reg->var_off.value | + src_reg->var_off.value); + return; + } + /* We get our maximum from the var_off, and our minimum is the * maximum of the operands' minima */ - dst_reg->var_off = tnum_or(dst_reg->var_off, src_reg->var_off); dst_reg->umin_value = max(dst_reg->umin_value, umin_val); dst_reg->umax_value = dst_reg->var_off.value | dst_reg->var_off.mask; if (dst_reg->smin_value < 0 || smin_val < 0) { @@ -4990,17 +5391,62 @@ static void scalar_min_max_or(struct bpf_reg_state *dst_reg, __update_reg_bounds(dst_reg); } -static void scalar_min_max_lsh(struct bpf_reg_state *dst_reg, - struct bpf_reg_state *src_reg) +static void __scalar32_min_max_lsh(struct bpf_reg_state *dst_reg, + u64 umin_val, u64 umax_val) { - u64 umax_val = src_reg->umax_value; - u64 umin_val = src_reg->umin_value; - /* We lose all sign bit information (except what we can pick * up from var_off) */ - dst_reg->smin_value = S64_MIN; - dst_reg->smax_value = S64_MAX; + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + /* If we might shift our top bit out, then we know nothing */ + if (umax_val > 31 || dst_reg->u32_max_value > 1ULL << (31 - umax_val)) { + dst_reg->u32_min_value = 0; + dst_reg->u32_max_value = U32_MAX; + } else { + dst_reg->u32_min_value <<= umin_val; + dst_reg->u32_max_value <<= umax_val; + } +} + +static void scalar32_min_max_lsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + u32 umax_val = src_reg->u32_max_value; + u32 umin_val = src_reg->u32_min_value; + /* u32 alu operation will zext upper bits */ + struct tnum subreg = tnum_subreg(dst_reg->var_off); + + __scalar32_min_max_lsh(dst_reg, umin_val, umax_val); + dst_reg->var_off = tnum_subreg(tnum_lshift(subreg, umin_val)); + /* Not required but being careful mark reg64 bounds as unknown so + * that we are forced to pick them up from tnum and zext later and + * if some path skips this step we are still safe. + */ + __mark_reg64_unbounded(dst_reg); + __update_reg32_bounds(dst_reg); +} + +static void __scalar64_min_max_lsh(struct bpf_reg_state *dst_reg, + u64 umin_val, u64 umax_val) +{ + /* Special case <<32 because it is a common compiler pattern to sign + * extend subreg by doing <<32 s>>32. In this case if 32bit bounds are + * positive we know this shift will also be positive so we can track + * bounds correctly. Otherwise we lose all sign bit information except + * what we can pick up from var_off. Perhaps we can generalize this + * later to shifts of any length. + */ + if (umin_val == 32 && umax_val == 32 && dst_reg->s32_max_value >= 0) + dst_reg->smax_value = (s64)dst_reg->s32_max_value << 32; + else + dst_reg->smax_value = S64_MAX; + + if (umin_val == 32 && umax_val == 32 && dst_reg->s32_min_value >= 0) + dst_reg->smin_value = (s64)dst_reg->s32_min_value << 32; + else + dst_reg->smin_value = S64_MIN; + /* If we might shift our top bit out, then we know nothing */ if (dst_reg->umax_value > 1ULL << (63 - umax_val)) { dst_reg->umin_value = 0; @@ -5009,11 +5455,55 @@ static void scalar_min_max_lsh(struct bpf_reg_state *dst_reg, dst_reg->umin_value <<= umin_val; dst_reg->umax_value <<= umax_val; } +} + +static void scalar_min_max_lsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + u64 umax_val = src_reg->umax_value; + u64 umin_val = src_reg->umin_value; + + /* scalar64 calc uses 32bit unshifted bounds so must be called first */ + __scalar64_min_max_lsh(dst_reg, umin_val, umax_val); + __scalar32_min_max_lsh(dst_reg, umin_val, umax_val); + dst_reg->var_off = tnum_lshift(dst_reg->var_off, umin_val); /* We may learn something more from the var_off */ __update_reg_bounds(dst_reg); } +static void scalar32_min_max_rsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + struct tnum subreg = tnum_subreg(dst_reg->var_off); + u32 umax_val = src_reg->u32_max_value; + u32 umin_val = src_reg->u32_min_value; + + /* BPF_RSH is an unsigned shift. If the value in dst_reg might + * be negative, then either: + * 1) src_reg might be zero, so the sign bit of the result is + * unknown, so we lose our signed bounds + * 2) it's known negative, thus the unsigned bounds capture the + * signed bounds + * 3) the signed bounds cross zero, so they tell us nothing + * about the result + * If the value in dst_reg is known nonnegative, then again the + * unsigned bounts capture the signed bounds. + * Thus, in all cases it suffices to blow away our signed bounds + * and rely on inferring new ones from the unsigned bounds and + * var_off of the result. + */ + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + + dst_reg->var_off = tnum_rshift(subreg, umin_val); + dst_reg->u32_min_value >>= umax_val; + dst_reg->u32_max_value >>= umin_val; + + __mark_reg64_unbounded(dst_reg); + __update_reg32_bounds(dst_reg); +} + static void scalar_min_max_rsh(struct bpf_reg_state *dst_reg, struct bpf_reg_state *src_reg) { @@ -5039,35 +5529,62 @@ static void scalar_min_max_rsh(struct bpf_reg_state *dst_reg, dst_reg->var_off = tnum_rshift(dst_reg->var_off, umin_val); dst_reg->umin_value >>= umax_val; dst_reg->umax_value >>= umin_val; - /* We may learn something more from the var_off */ + + /* Its not easy to operate on alu32 bounds here because it depends + * on bits being shifted in. Take easy way out and mark unbounded + * so we can recalculate later from tnum. + */ + __mark_reg32_unbounded(dst_reg); __update_reg_bounds(dst_reg); } -static void scalar_min_max_arsh(struct bpf_reg_state *dst_reg, - struct bpf_reg_state *src_reg, - u64 insn_bitness) +static void scalar32_min_max_arsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) { - u64 umin_val = src_reg->umin_value; + u64 umin_val = src_reg->u32_min_value; /* Upon reaching here, src_known is true and * umax_val is equal to umin_val. */ - if (insn_bitness == 32) { - dst_reg->smin_value = (u32)(((s32)dst_reg->smin_value) >> umin_val); - dst_reg->smax_value = (u32)(((s32)dst_reg->smax_value) >> umin_val); - } else { - dst_reg->smin_value >>= umin_val; - dst_reg->smax_value >>= umin_val; - } + dst_reg->s32_min_value = (u32)(((s32)dst_reg->s32_min_value) >> umin_val); + dst_reg->s32_max_value = (u32)(((s32)dst_reg->s32_max_value) >> umin_val); - dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val, - insn_bitness); + dst_reg->var_off = tnum_arshift(tnum_subreg(dst_reg->var_off), umin_val, 32); + + /* blow away the dst_reg umin_value/umax_value and rely on + * dst_reg var_off to refine the result. + */ + dst_reg->u32_min_value = 0; + dst_reg->u32_max_value = U32_MAX; + + __mark_reg64_unbounded(dst_reg); + __update_reg32_bounds(dst_reg); +} + +static void scalar_min_max_arsh(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + u64 umin_val = src_reg->umin_value; + + /* Upon reaching here, src_known is true and umax_val is equal + * to umin_val. + */ + dst_reg->smin_value >>= umin_val; + dst_reg->smax_value >>= umin_val; + + dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val, 64); /* blow away the dst_reg umin_value/umax_value and rely on * dst_reg var_off to refine the result. */ dst_reg->umin_value = 0; dst_reg->umax_value = U64_MAX; + + /* Its not easy to operate on alu32 bounds here because it depends + * on bits being shifted in from upper 32-bits. Take easy way out + * and mark unbounded so we can recalculate later from tnum. + */ + __mark_reg32_unbounded(dst_reg); __update_reg_bounds(dst_reg); } @@ -5085,33 +5602,47 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, bool src_known, dst_known; s64 smin_val, smax_val; u64 umin_val, umax_val; + s32 s32_min_val, s32_max_val; + u32 u32_min_val, u32_max_val; u64 insn_bitness = (BPF_CLASS(insn->code) == BPF_ALU64) ? 64 : 32; u32 dst = insn->dst_reg; int ret; - - if (insn_bitness == 32) { - /* Relevant for 32-bit RSH: Information can propagate towards - * LSB, so it isn't sufficient to only truncate the output to - * 32 bits. - */ - coerce_reg_to_size(dst_reg, 4); - coerce_reg_to_size(&src_reg, 4); - } + bool alu32 = (BPF_CLASS(insn->code) != BPF_ALU64); smin_val = src_reg.smin_value; smax_val = src_reg.smax_value; umin_val = src_reg.umin_value; umax_val = src_reg.umax_value; - src_known = tnum_is_const(src_reg.var_off); - dst_known = tnum_is_const(dst_reg->var_off); - if ((src_known && (smin_val != smax_val || umin_val != umax_val)) || - smin_val > smax_val || umin_val > umax_val) { - /* Taint dst register if offset had invalid bounds derived from - * e.g. dead branches. - */ - __mark_reg_unknown(env, dst_reg); - return 0; + s32_min_val = src_reg.s32_min_value; + s32_max_val = src_reg.s32_max_value; + u32_min_val = src_reg.u32_min_value; + u32_max_val = src_reg.u32_max_value; + + if (alu32) { + src_known = tnum_subreg_is_const(src_reg.var_off); + dst_known = tnum_subreg_is_const(dst_reg->var_off); + if ((src_known && + (s32_min_val != s32_max_val || u32_min_val != u32_max_val)) || + s32_min_val > s32_max_val || u32_min_val > u32_max_val) { + /* Taint dst register if offset had invalid bounds + * derived from e.g. dead branches. + */ + __mark_reg_unknown(env, dst_reg); + return 0; + } + } else { + src_known = tnum_is_const(src_reg.var_off); + dst_known = tnum_is_const(dst_reg->var_off); + if ((src_known && + (smin_val != smax_val || umin_val != umax_val)) || + smin_val > smax_val || umin_val > umax_val) { + /* Taint dst register if offset had invalid bounds + * derived from e.g. dead branches. + */ + __mark_reg_unknown(env, dst_reg); + return 0; + } } if (!src_known && @@ -5120,6 +5651,20 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, return 0; } + /* Calculate sign/unsigned bounds and tnum for alu32 and alu64 bit ops. + * There are two classes of instructions: The first class we track both + * alu32 and alu64 sign/unsigned bounds independently this provides the + * greatest amount of precision when alu operations are mixed with jmp32 + * operations. These operations are BPF_ADD, BPF_SUB, BPF_MUL, BPF_ADD, + * and BPF_OR. This is possible because these ops have fairly easy to + * understand and calculate behavior in both 32-bit and 64-bit alu ops. + * See alu32 verifier tests for examples. The second class of + * operations, BPF_LSH, BPF_RSH, and BPF_ARSH, however are not so easy + * with regards to tracking sign/unsigned bounds because the bits may + * cross subreg boundaries in the alu64 case. When this happens we mark + * the reg unbounded in the subreg bound space and use the resulting + * tnum to calculate an approximation of the sign/unsigned bounds. + */ switch (opcode) { case BPF_ADD: ret = sanitize_val_alu(env, insn); @@ -5127,7 +5672,9 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, verbose(env, "R%d tried to add from different pointers or scalars\n", dst); return ret; } + scalar32_min_max_add(dst_reg, &src_reg); scalar_min_max_add(dst_reg, &src_reg); + dst_reg->var_off = tnum_add(dst_reg->var_off, src_reg.var_off); break; case BPF_SUB: ret = sanitize_val_alu(env, insn); @@ -5135,25 +5682,23 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, verbose(env, "R%d tried to sub from different pointers or scalars\n", dst); return ret; } + scalar32_min_max_sub(dst_reg, &src_reg); scalar_min_max_sub(dst_reg, &src_reg); + dst_reg->var_off = tnum_sub(dst_reg->var_off, src_reg.var_off); break; case BPF_MUL: + dst_reg->var_off = tnum_mul(dst_reg->var_off, src_reg.var_off); + scalar32_min_max_mul(dst_reg, &src_reg); scalar_min_max_mul(dst_reg, &src_reg); break; case BPF_AND: - if (src_known && dst_known) { - __mark_reg_known(dst_reg, dst_reg->var_off.value & - src_reg.var_off.value); - break; - } + dst_reg->var_off = tnum_and(dst_reg->var_off, src_reg.var_off); + scalar32_min_max_and(dst_reg, &src_reg); scalar_min_max_and(dst_reg, &src_reg); break; case BPF_OR: - if (src_known && dst_known) { - __mark_reg_known(dst_reg, dst_reg->var_off.value | - src_reg.var_off.value); - break; - } + dst_reg->var_off = tnum_or(dst_reg->var_off, src_reg.var_off); + scalar32_min_max_or(dst_reg, &src_reg); scalar_min_max_or(dst_reg, &src_reg); break; case BPF_LSH: @@ -5164,7 +5709,10 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, mark_reg_unknown(env, regs, insn->dst_reg); break; } - scalar_min_max_lsh(dst_reg, &src_reg); + if (alu32) + scalar32_min_max_lsh(dst_reg, &src_reg); + else + scalar_min_max_lsh(dst_reg, &src_reg); break; case BPF_RSH: if (umax_val >= insn_bitness) { @@ -5174,7 +5722,10 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, mark_reg_unknown(env, regs, insn->dst_reg); break; } - scalar_min_max_rsh(dst_reg, &src_reg); + if (alu32) + scalar32_min_max_rsh(dst_reg, &src_reg); + else + scalar_min_max_rsh(dst_reg, &src_reg); break; case BPF_ARSH: if (umax_val >= insn_bitness) { @@ -5184,17 +5735,19 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, mark_reg_unknown(env, regs, insn->dst_reg); break; } - scalar_min_max_arsh(dst_reg, &src_reg, insn_bitness); + if (alu32) + scalar32_min_max_arsh(dst_reg, &src_reg); + else + scalar_min_max_arsh(dst_reg, &src_reg); break; default: mark_reg_unknown(env, regs, insn->dst_reg); break; } - if (BPF_CLASS(insn->code) != BPF_ALU64) { - /* 32-bit ALU ops are (32,32)->32 */ - coerce_reg_to_size(dst_reg, 4); - } + /* ALU32 ops are zero extended into 64bit register */ + if (alu32) + zext_32_to_64(dst_reg); __update_reg_bounds(dst_reg); __reg_deduce_bounds(dst_reg); @@ -5370,7 +5923,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn) mark_reg_unknown(env, regs, insn->dst_reg); } - coerce_reg_to_size(dst_reg, 4); + zext_32_to_64(dst_reg); } } else { /* case: R = imm @@ -5540,55 +6093,83 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate, new_range); } -/* compute branch direction of the expression "if (reg opcode val) goto target;" - * and return: - * 1 - branch will be taken and "goto target" will be executed - * 0 - branch will not be taken and fall-through to next insn - * -1 - unknown. Example: "if (reg < 5)" is unknown when register value range [0,10] - */ -static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode, - bool is_jmp32) +static int is_branch32_taken(struct bpf_reg_state *reg, u32 val, u8 opcode) { - struct bpf_reg_state reg_lo; - s64 sval; + struct tnum subreg = tnum_subreg(reg->var_off); + s32 sval = (s32)val; - if (__is_pointer_value(false, reg)) - return -1; + switch (opcode) { + case BPF_JEQ: + if (tnum_is_const(subreg)) + return !!tnum_equals_const(subreg, val); + break; + case BPF_JNE: + if (tnum_is_const(subreg)) + return !tnum_equals_const(subreg, val); + break; + case BPF_JSET: + if ((~subreg.mask & subreg.value) & val) + return 1; + if (!((subreg.mask | subreg.value) & val)) + return 0; + break; + case BPF_JGT: + if (reg->u32_min_value > val) + return 1; + else if (reg->u32_max_value <= val) + return 0; + break; + case BPF_JSGT: + if (reg->s32_min_value > sval) + return 1; + else if (reg->s32_max_value < sval) + return 0; + break; + case BPF_JLT: + if (reg->u32_max_value < val) + return 1; + else if (reg->u32_min_value >= val) + return 0; + break; + case BPF_JSLT: + if (reg->s32_max_value < sval) + return 1; + else if (reg->s32_min_value >= sval) + return 0; + break; + case BPF_JGE: + if (reg->u32_min_value >= val) + return 1; + else if (reg->u32_max_value < val) + return 0; + break; + case BPF_JSGE: + if (reg->s32_min_value >= sval) + return 1; + else if (reg->s32_max_value < sval) + return 0; + break; + case BPF_JLE: + if (reg->u32_max_value <= val) + return 1; + else if (reg->u32_min_value > val) + return 0; + break; + case BPF_JSLE: + if (reg->s32_max_value <= sval) + return 1; + else if (reg->s32_min_value > sval) + return 0; + break; + } - if (is_jmp32) { - reg_lo = *reg; - reg = ®_lo; - /* For JMP32, only low 32 bits are compared, coerce_reg_to_size - * could truncate high bits and update umin/umax according to - * information of low bits. - */ - coerce_reg_to_size(reg, 4); - /* smin/smax need special handling. For example, after coerce, - * if smin_value is 0x00000000ffffffffLL, the value is -1 when - * used as operand to JMP32. It is a negative number from s32's - * point of view, while it is a positive number when seen as - * s64. The smin/smax are kept as s64, therefore, when used with - * JMP32, they need to be transformed into s32, then sign - * extended back to s64. - * - * Also, smin/smax were copied from umin/umax. If umin/umax has - * different sign bit, then min/max relationship doesn't - * maintain after casting into s32, for this case, set smin/smax - * to safest range. - */ - if ((reg->umax_value ^ reg->umin_value) & - (1ULL << 31)) { - reg->smin_value = S32_MIN; - reg->smax_value = S32_MAX; - } - reg->smin_value = (s64)(s32)reg->smin_value; - reg->smax_value = (s64)(s32)reg->smax_value; + return -1; +} - val = (u32)val; - sval = (s64)(s32)val; - } else { - sval = (s64)val; - } + +static int is_branch64_taken(struct bpf_reg_state *reg, u64 val, u8 opcode) +{ + s64 sval = (s64)val; switch (opcode) { case BPF_JEQ: @@ -5658,91 +6239,22 @@ static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode, return -1; } -/* Generate min value of the high 32-bit from TNUM info. */ -static u64 gen_hi_min(struct tnum var) -{ - return var.value & ~0xffffffffULL; -} - -/* Generate max value of the high 32-bit from TNUM info. */ -static u64 gen_hi_max(struct tnum var) -{ - return (var.value | var.mask) & ~0xffffffffULL; -} - -/* Return true if VAL is compared with a s64 sign extended from s32, and they - * are with the same signedness. - */ -static bool cmp_val_with_extended_s64(s64 sval, struct bpf_reg_state *reg) -{ - return ((s32)sval >= 0 && - reg->smin_value >= 0 && reg->smax_value <= S32_MAX) || - ((s32)sval < 0 && - reg->smax_value <= 0 && reg->smin_value >= S32_MIN); -} - -/* Constrain the possible values of @reg with unsigned upper bound @bound. - * If @is_exclusive, @bound is an exclusive limit, otherwise it is inclusive. - * If @is_jmp32, @bound is a 32-bit value that only constrains the low 32 bits - * of @reg. - */ -static void set_upper_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, - bool is_exclusive) -{ - if (is_exclusive) { - /* There are no values for `reg` that make `reg<0` true. */ - if (bound == 0) - return; - bound--; - } - if (is_jmp32) { - /* Constrain the register's value in the tnum representation. - * For 64-bit comparisons this happens later in - * __reg_bound_offset(), but for 32-bit comparisons, we can be - * more precise than what can be derived from the updated - * numeric bounds. - */ - struct tnum t = tnum_range(0, bound); - - t.mask |= ~0xffffffffULL; /* upper half is unknown */ - reg->var_off = tnum_intersect(reg->var_off, t); - - /* Compute the 64-bit bound from the 32-bit bound. */ - bound += gen_hi_max(reg->var_off); - } - reg->umax_value = min(reg->umax_value, bound); -} - -/* Constrain the possible values of @reg with unsigned lower bound @bound. - * If @is_exclusive, @bound is an exclusive limit, otherwise it is inclusive. - * If @is_jmp32, @bound is a 32-bit value that only constrains the low 32 bits - * of @reg. +/* compute branch direction of the expression "if (reg opcode val) goto target;" + * and return: + * 1 - branch will be taken and "goto target" will be executed + * 0 - branch will not be taken and fall-through to next insn + * -1 - unknown. Example: "if (reg < 5)" is unknown when register value + * range [0,10] */ -static void set_lower_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, - bool is_exclusive) +static int is_branch_taken(struct bpf_reg_state *reg, u64 val, u8 opcode, + bool is_jmp32) { - if (is_exclusive) { - /* There are no values for `reg` that make `reg>MAX` true. */ - if (bound == (is_jmp32 ? U32_MAX : U64_MAX)) - return; - bound++; - } - if (is_jmp32) { - /* Constrain the register's value in the tnum representation. - * For 64-bit comparisons this happens later in - * __reg_bound_offset(), but for 32-bit comparisons, we can be - * more precise than what can be derived from the updated - * numeric bounds. - */ - struct tnum t = tnum_range(bound, U32_MAX); - - t.mask |= ~0xffffffffULL; /* upper half is unknown */ - reg->var_off = tnum_intersect(reg->var_off, t); + if (__is_pointer_value(false, reg)) + return -1; - /* Compute the 64-bit bound from the 32-bit bound. */ - bound += gen_hi_min(reg->var_off); - } - reg->umin_value = max(reg->umin_value, bound); + if (is_jmp32) + return is_branch32_taken(reg, val, opcode); + return is_branch64_taken(reg, val, opcode); } /* Adjusts the register min/max values in the case that the dst_reg is the @@ -5751,10 +6263,16 @@ static void set_lower_bound(struct bpf_reg_state *reg, u64 bound, bool is_jmp32, * In JEQ/JNE cases we also adjust the var_off values. */ static void reg_set_min_max(struct bpf_reg_state *true_reg, - struct bpf_reg_state *false_reg, u64 val, + struct bpf_reg_state *false_reg, + u64 val, u32 val32, u8 opcode, bool is_jmp32) { - s64 sval; + struct tnum false_32off = tnum_subreg(false_reg->var_off); + struct tnum false_64off = false_reg->var_off; + struct tnum true_32off = tnum_subreg(true_reg->var_off); + struct tnum true_64off = true_reg->var_off; + s64 sval = (s64)val; + s32 sval32 = (s32)val32; /* If the dst_reg is a pointer, we can't learn anything about its * variable offset from the compare (unless src_reg were a pointer into @@ -5765,9 +6283,6 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, if (__is_pointer_value(false, false_reg)) return; - val = is_jmp32 ? (u32)val : val; - sval = is_jmp32 ? (s64)(s32)val : (s64)val; - switch (opcode) { case BPF_JEQ: case BPF_JNE: @@ -5779,87 +6294,126 @@ static void reg_set_min_max(struct bpf_reg_state *true_reg, * if it is true we know the value for sure. Likewise for * BPF_JNE. */ - if (is_jmp32) { - u64 old_v = reg->var_off.value; - u64 hi_mask = ~0xffffffffULL; - - reg->var_off.value = (old_v & hi_mask) | val; - reg->var_off.mask &= hi_mask; - } else { + if (is_jmp32) + __mark_reg32_known(reg, val32); + else __mark_reg_known(reg, val); - } break; } case BPF_JSET: - false_reg->var_off = tnum_and(false_reg->var_off, - tnum_const(~val)); - if (is_power_of_2(val)) - true_reg->var_off = tnum_or(true_reg->var_off, - tnum_const(val)); + if (is_jmp32) { + false_32off = tnum_and(false_32off, tnum_const(~val32)); + if (is_power_of_2(val32)) + true_32off = tnum_or(true_32off, + tnum_const(val32)); + } else { + false_64off = tnum_and(false_64off, tnum_const(~val)); + if (is_power_of_2(val)) + true_64off = tnum_or(true_64off, + tnum_const(val)); + } break; case BPF_JGE: case BPF_JGT: { - set_upper_bound(false_reg, val, is_jmp32, opcode == BPF_JGE); - set_lower_bound(true_reg, val, is_jmp32, opcode == BPF_JGT); + if (is_jmp32) { + u32 false_umax = opcode == BPF_JGT ? val32 : val32 - 1; + u32 true_umin = opcode == BPF_JGT ? val32 + 1 : val32; + + false_reg->u32_max_value = min(false_reg->u32_max_value, + false_umax); + true_reg->u32_min_value = max(true_reg->u32_min_value, + true_umin); + } else { + u64 false_umax = opcode == BPF_JGT ? val : val - 1; + u64 true_umin = opcode == BPF_JGT ? val + 1 : val; + + false_reg->umax_value = min(false_reg->umax_value, false_umax); + true_reg->umin_value = max(true_reg->umin_value, true_umin); + } break; } case BPF_JSGE: case BPF_JSGT: { - s64 false_smax = opcode == BPF_JSGT ? sval : sval - 1; - s64 true_smin = opcode == BPF_JSGT ? sval + 1 : sval; + if (is_jmp32) { + s32 false_smax = opcode == BPF_JSGT ? sval32 : sval32 - 1; + s32 true_smin = opcode == BPF_JSGT ? sval32 + 1 : sval32; - /* If the full s64 was not sign-extended from s32 then don't - * deduct further info. - */ - if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) - break; - false_reg->smax_value = min(false_reg->smax_value, false_smax); - true_reg->smin_value = max(true_reg->smin_value, true_smin); + false_reg->s32_max_value = min(false_reg->s32_max_value, false_smax); + true_reg->s32_min_value = max(true_reg->s32_min_value, true_smin); + } else { + s64 false_smax = opcode == BPF_JSGT ? sval : sval - 1; + s64 true_smin = opcode == BPF_JSGT ? sval + 1 : sval; + + false_reg->smax_value = min(false_reg->smax_value, false_smax); + true_reg->smin_value = max(true_reg->smin_value, true_smin); + } break; } case BPF_JLE: case BPF_JLT: { - set_lower_bound(false_reg, val, is_jmp32, opcode == BPF_JLE); - set_upper_bound(true_reg, val, is_jmp32, opcode == BPF_JLT); + if (is_jmp32) { + u32 false_umin = opcode == BPF_JLT ? val32 : val32 + 1; + u32 true_umax = opcode == BPF_JLT ? val32 - 1 : val32; + + false_reg->u32_min_value = max(false_reg->u32_min_value, + false_umin); + true_reg->u32_max_value = min(true_reg->u32_max_value, + true_umax); + } else { + u64 false_umin = opcode == BPF_JLT ? val : val + 1; + u64 true_umax = opcode == BPF_JLT ? val - 1 : val; + + false_reg->umin_value = max(false_reg->umin_value, false_umin); + true_reg->umax_value = min(true_reg->umax_value, true_umax); + } break; } case BPF_JSLE: case BPF_JSLT: { - s64 false_smin = opcode == BPF_JSLT ? sval : sval + 1; - s64 true_smax = opcode == BPF_JSLT ? sval - 1 : sval; + if (is_jmp32) { + s32 false_smin = opcode == BPF_JSLT ? sval32 : sval32 + 1; + s32 true_smax = opcode == BPF_JSLT ? sval32 - 1 : sval32; - if (is_jmp32 && !cmp_val_with_extended_s64(sval, false_reg)) - break; - false_reg->smin_value = max(false_reg->smin_value, false_smin); - true_reg->smax_value = min(true_reg->smax_value, true_smax); + false_reg->s32_min_value = max(false_reg->s32_min_value, false_smin); + true_reg->s32_max_value = min(true_reg->s32_max_value, true_smax); + } else { + s64 false_smin = opcode == BPF_JSLT ? sval : sval + 1; + s64 true_smax = opcode == BPF_JSLT ? sval - 1 : sval; + + false_reg->smin_value = max(false_reg->smin_value, false_smin); + true_reg->smax_value = min(true_reg->smax_value, true_smax); + } break; } default: return; } - __reg_deduce_bounds(false_reg); - __reg_deduce_bounds(true_reg); - /* We might have learned some bits from the bounds. */ - __reg_bound_offset(false_reg); - __reg_bound_offset(true_reg); - /* Intersecting with the old var_off might have improved our bounds - * slightly. e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), - * then new var_off is (0; 0x7f...fc) which improves our umax. - */ - __update_reg_bounds(false_reg); - __update_reg_bounds(true_reg); + if (is_jmp32) { + false_reg->var_off = tnum_or(tnum_clear_subreg(false_64off), + tnum_subreg(false_32off)); + true_reg->var_off = tnum_or(tnum_clear_subreg(true_64off), + tnum_subreg(true_32off)); + __reg_combine_32_into_64(false_reg); + __reg_combine_32_into_64(true_reg); + } else { + false_reg->var_off = false_64off; + true_reg->var_off = true_64off; + __reg_combine_64_into_32(false_reg); + __reg_combine_64_into_32(true_reg); + } } /* Same as above, but for the case that dst_reg holds a constant and src_reg is * the variable reg. */ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, - struct bpf_reg_state *false_reg, u64 val, + struct bpf_reg_state *false_reg, + u64 val, u32 val32, u8 opcode, bool is_jmp32) { /* How can we transform "a b" into "b a"? */ @@ -5883,7 +6437,7 @@ static void reg_set_min_max_inv(struct bpf_reg_state *true_reg, * BPF_JA, can't get here. */ if (opcode) - reg_set_min_max(true_reg, false_reg, val, opcode, is_jmp32); + reg_set_min_max(true_reg, false_reg, val, val32, opcode, is_jmp32); } /* Regs are known to be equal, so intersect their min/max/var_off */ @@ -6172,13 +6726,22 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, dst_reg = ®s[insn->dst_reg]; is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32; - if (BPF_SRC(insn->code) == BPF_K) - pred = is_branch_taken(dst_reg, insn->imm, - opcode, is_jmp32); - else if (src_reg->type == SCALAR_VALUE && - tnum_is_const(src_reg->var_off)) - pred = is_branch_taken(dst_reg, src_reg->var_off.value, - opcode, is_jmp32); + if (BPF_SRC(insn->code) == BPF_K) { + pred = is_branch_taken(dst_reg, insn->imm, opcode, is_jmp32); + } else if (src_reg->type == SCALAR_VALUE && + is_jmp32 && tnum_is_const(tnum_subreg(src_reg->var_off))) { + pred = is_branch_taken(dst_reg, + tnum_subreg(src_reg->var_off).value, + opcode, + is_jmp32); + } else if (src_reg->type == SCALAR_VALUE && + !is_jmp32 && tnum_is_const(src_reg->var_off)) { + pred = is_branch_taken(dst_reg, + src_reg->var_off.value, + opcode, + is_jmp32); + } + if (pred >= 0) { err = mark_chain_precision(env, insn->dst_reg); if (BPF_SRC(insn->code) == BPF_X && !err) @@ -6212,32 +6775,24 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, */ if (BPF_SRC(insn->code) == BPF_X) { struct bpf_reg_state *src_reg = ®s[insn->src_reg]; - struct bpf_reg_state lo_reg0 = *dst_reg; - struct bpf_reg_state lo_reg1 = *src_reg; - struct bpf_reg_state *src_lo, *dst_lo; - - dst_lo = &lo_reg0; - src_lo = &lo_reg1; - coerce_reg_to_size(dst_lo, 4); - coerce_reg_to_size(src_lo, 4); if (dst_reg->type == SCALAR_VALUE && src_reg->type == SCALAR_VALUE) { if (tnum_is_const(src_reg->var_off) || - (is_jmp32 && tnum_is_const(src_lo->var_off))) + (is_jmp32 && + tnum_is_const(tnum_subreg(src_reg->var_off)))) reg_set_min_max(&other_branch_regs[insn->dst_reg], dst_reg, - is_jmp32 - ? src_lo->var_off.value - : src_reg->var_off.value, + src_reg->var_off.value, + tnum_subreg(src_reg->var_off).value, opcode, is_jmp32); else if (tnum_is_const(dst_reg->var_off) || - (is_jmp32 && tnum_is_const(dst_lo->var_off))) + (is_jmp32 && + tnum_is_const(tnum_subreg(dst_reg->var_off)))) reg_set_min_max_inv(&other_branch_regs[insn->src_reg], src_reg, - is_jmp32 - ? dst_lo->var_off.value - : dst_reg->var_off.value, + dst_reg->var_off.value, + tnum_subreg(dst_reg->var_off).value, opcode, is_jmp32); else if (!is_jmp32 && (opcode == BPF_JEQ || opcode == BPF_JNE)) @@ -6248,7 +6803,8 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env, } } else if (dst_reg->type == SCALAR_VALUE) { reg_set_min_max(&other_branch_regs[insn->dst_reg], - dst_reg, insn->imm, opcode, is_jmp32); + dst_reg, insn->imm, (u32)insn->imm, + opcode, is_jmp32); } /* detect if R == 0 where R is returned from bpf_map_lookup_elem(). -- cgit v1.2.3-71-gd317 From fa123ac022e425becce11f1a6c7ee4d283f75a90 Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Mon, 30 Mar 2020 14:36:59 -0700 Subject: bpf: Verifier, refine 32bit bound in do_refine_retval_range Further refine return values range in do_refine_retval_range by noting these are int return types (We will assume here that int is a 32-bit type). Two reasons to pull this out of original patch. First it makes the original fix impossible to backport. And second I've not seen this as being problematic in practice unlike the other case. Fixes: 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper") Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/158560421952.10843.12496354931526965046.stgit@john-Precision-5820-Tower --- kernel/bpf/verifier.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1c60d001bb46..04c6630cc18f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4335,6 +4335,7 @@ static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type, return; ret_reg->smax_value = meta->msize_max_value; + ret_reg->s32_max_value = meta->msize_max_value; __reg_deduce_bounds(ret_reg); __reg_bound_offset(ret_reg); __update_reg_bounds(ret_reg); -- cgit v1.2.3-71-gd317 From af6eea57437a830293eab56246b6025cc7d46ee7 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Sun, 29 Mar 2020 19:59:58 -0700 Subject: bpf: Implement bpf_link-based cgroup BPF program attachment Implement new sub-command to attach cgroup BPF programs and return FD-based bpf_link back on success. bpf_link, once attached to cgroup, cannot be replaced, except by owner having its FD. Cgroup bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed. To prevent bpf_cgroup_link from keeping cgroup alive past the point when no BPF program can be executed, implement auto-detachment of link. When cgroup_bpf_release() is called, all attached bpf_links are forced to release cgroup refcounts, but they leave bpf_link otherwise active and allocated, as well as still owning underlying bpf_prog. This is because user-space might still have FDs open and active, so bpf_link as a user-referenced object can't be freed yet. Once last active FD is closed, bpf_link will be freed and underlying bpf_prog refcount will be dropped. But cgroup refcount won't be touched, because cgroup is released already. The inherent race between bpf_cgroup_link release (from closing last FD) and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So the only additional check required is when bpf_cgroup_link attempts to detach itself from cgroup. At that time we need to check whether there is still cgroup associated with that link. And if not, exit with success, because bpf_cgroup_link was already successfully detached. Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Acked-by: Roman Gushchin Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com --- include/linux/bpf-cgroup.h | 29 +++- include/linux/bpf.h | 10 +- include/uapi/linux/bpf.h | 10 +- kernel/bpf/cgroup.c | 315 +++++++++++++++++++++++++++++++---------- kernel/bpf/syscall.c | 64 +++++++-- kernel/cgroup/cgroup.c | 14 +- tools/include/uapi/linux/bpf.h | 10 +- 7 files changed, 354 insertions(+), 98 deletions(-) (limited to 'kernel') diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index a7cd5c7a2509..d2d969669564 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -51,9 +51,18 @@ struct bpf_cgroup_storage { struct rcu_head rcu; }; +struct bpf_cgroup_link { + struct bpf_link link; + struct cgroup *cgroup; + enum bpf_attach_type type; +}; + +extern const struct bpf_link_ops bpf_cgroup_link_lops; + struct bpf_prog_list { struct list_head node; struct bpf_prog *prog; + struct bpf_cgroup_link *link; struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE]; }; @@ -84,20 +93,23 @@ struct cgroup_bpf { int cgroup_bpf_inherit(struct cgroup *cgrp); void cgroup_bpf_offline(struct cgroup *cgrp); -int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, - struct bpf_prog *replace_prog, +int __cgroup_bpf_attach(struct cgroup *cgrp, + struct bpf_prog *prog, struct bpf_prog *replace_prog, + struct bpf_cgroup_link *link, enum bpf_attach_type type, u32 flags); int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, + struct bpf_cgroup_link *link, enum bpf_attach_type type); int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr); /* Wrapper for __cgroup_bpf_*() protected by cgroup_mutex */ -int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, - struct bpf_prog *replace_prog, enum bpf_attach_type type, +int cgroup_bpf_attach(struct cgroup *cgrp, + struct bpf_prog *prog, struct bpf_prog *replace_prog, + struct bpf_cgroup_link *link, enum bpf_attach_type type, u32 flags); int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type, u32 flags); + enum bpf_attach_type type); int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr); @@ -332,6 +344,7 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr, enum bpf_prog_type ptype, struct bpf_prog *prog); int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype); +int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog); int cgroup_bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr); #else @@ -354,6 +367,12 @@ static inline int cgroup_bpf_prog_detach(const union bpf_attr *attr, return -EINVAL; } +static inline int cgroup_bpf_link_attach(const union bpf_attr *attr, + struct bpf_prog *prog) +{ + return -EINVAL; +} + static inline int cgroup_bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) { diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 3bde59a8453b..56254d880293 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1082,15 +1082,23 @@ extern int sysctl_unprivileged_bpf_disabled; int bpf_map_new_fd(struct bpf_map *map, int flags); int bpf_prog_new_fd(struct bpf_prog *prog); -struct bpf_link; +struct bpf_link { + atomic64_t refcnt; + const struct bpf_link_ops *ops; + struct bpf_prog *prog; + struct work_struct work; +}; struct bpf_link_ops { void (*release)(struct bpf_link *link); void (*dealloc)(struct bpf_link *link); + }; void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, struct bpf_prog *prog); +void bpf_link_cleanup(struct bpf_link *link, struct file *link_file, + int link_fd); void bpf_link_inc(struct bpf_link *link); void bpf_link_put(struct bpf_link *link); int bpf_link_new_fd(struct bpf_link *link); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9f786a5a44ac..37dffe5089a0 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -111,6 +111,7 @@ enum bpf_cmd { BPF_MAP_LOOKUP_AND_DELETE_BATCH, BPF_MAP_UPDATE_BATCH, BPF_MAP_DELETE_BATCH, + BPF_LINK_CREATE, }; enum bpf_map_type { @@ -541,7 +542,7 @@ union bpf_attr { __u32 prog_cnt; } query; - struct { + struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ __u64 name; __u32 prog_fd; } raw_tracepoint; @@ -569,6 +570,13 @@ union bpf_attr { __u64 probe_offset; /* output: probe_offset */ __u64 probe_addr; /* output: probe_addr */ } task_fd_query; + + struct { /* struct used by BPF_LINK_CREATE command */ + __u32 prog_fd; /* eBPF program to attach */ + __u32 target_fd; /* object to attach to */ + __u32 attach_type; /* attach type */ + __u32 flags; /* extra flags */ + } link_create; } __attribute__((aligned(8))); /* The description below is an attempt at providing documentation to eBPF diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index 9c8472823a7f..c24029937431 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -80,6 +80,17 @@ static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[]) bpf_cgroup_storage_unlink(storages[stype]); } +/* Called when bpf_cgroup_link is auto-detached from dying cgroup. + * It drops cgroup and bpf_prog refcounts, and marks bpf_link as defunct. It + * doesn't free link memory, which will eventually be done by bpf_link's + * release() callback, when its last FD is closed. + */ +static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link) +{ + cgroup_put(link->cgroup); + link->cgroup = NULL; +} + /** * cgroup_bpf_release() - put references of all bpf programs and * release all cgroup bpf data @@ -100,7 +111,10 @@ static void cgroup_bpf_release(struct work_struct *work) list_for_each_entry_safe(pl, tmp, progs, node) { list_del(&pl->node); - bpf_prog_put(pl->prog); + if (pl->prog) + bpf_prog_put(pl->prog); + if (pl->link) + bpf_cgroup_link_auto_detach(pl->link); bpf_cgroup_storages_unlink(pl->storage); bpf_cgroup_storages_free(pl->storage); kfree(pl); @@ -134,6 +148,18 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref) queue_work(system_wq, &cgrp->bpf.release_work); } +/* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through + * link or direct prog. + */ +static struct bpf_prog *prog_list_prog(struct bpf_prog_list *pl) +{ + if (pl->prog) + return pl->prog; + if (pl->link) + return pl->link->link.prog; + return NULL; +} + /* count number of elements in the list. * it's slow but the list cannot be long */ @@ -143,7 +169,7 @@ static u32 prog_list_length(struct list_head *head) u32 cnt = 0; list_for_each_entry(pl, head, node) { - if (!pl->prog) + if (!prog_list_prog(pl)) continue; cnt++; } @@ -212,11 +238,11 @@ static int compute_effective_progs(struct cgroup *cgrp, continue; list_for_each_entry(pl, &p->bpf.progs[type], node) { - if (!pl->prog) + if (!prog_list_prog(pl)) continue; item = &progs->items[cnt]; - item->prog = pl->prog; + item->prog = prog_list_prog(pl); bpf_cgroup_storages_assign(item->cgroup_storage, pl->storage); cnt++; @@ -333,19 +359,60 @@ cleanup: #define BPF_CGROUP_MAX_PROGS 64 +static struct bpf_prog_list *find_attach_entry(struct list_head *progs, + struct bpf_prog *prog, + struct bpf_cgroup_link *link, + struct bpf_prog *replace_prog, + bool allow_multi) +{ + struct bpf_prog_list *pl; + + /* single-attach case */ + if (!allow_multi) { + if (list_empty(progs)) + return NULL; + return list_first_entry(progs, typeof(*pl), node); + } + + list_for_each_entry(pl, progs, node) { + if (prog && pl->prog == prog) + /* disallow attaching the same prog twice */ + return ERR_PTR(-EINVAL); + if (link && pl->link == link) + /* disallow attaching the same link twice */ + return ERR_PTR(-EINVAL); + } + + /* direct prog multi-attach w/ replacement case */ + if (replace_prog) { + list_for_each_entry(pl, progs, node) { + if (pl->prog == replace_prog) + /* a match found */ + return pl; + } + /* prog to replace not found for cgroup */ + return ERR_PTR(-ENOENT); + } + + return NULL; +} + /** - * __cgroup_bpf_attach() - Attach the program to a cgroup, and + * __cgroup_bpf_attach() - Attach the program or the link to a cgroup, and * propagate the change to descendants * @cgrp: The cgroup which descendants to traverse * @prog: A program to attach + * @link: A link to attach * @replace_prog: Previously attached program to replace if BPF_F_REPLACE is set * @type: Type of attach operation * @flags: Option flags * + * Exactly one of @prog or @link can be non-null. * Must be called with cgroup_mutex held. */ -int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, - struct bpf_prog *replace_prog, +int __cgroup_bpf_attach(struct cgroup *cgrp, + struct bpf_prog *prog, struct bpf_prog *replace_prog, + struct bpf_cgroup_link *link, enum bpf_attach_type type, u32 flags) { u32 saved_flags = (flags & (BPF_F_ALLOW_OVERRIDE | BPF_F_ALLOW_MULTI)); @@ -353,13 +420,19 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, struct bpf_prog *old_prog = NULL; struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE], *old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {NULL}; - struct bpf_prog_list *pl, *replace_pl = NULL; + struct bpf_prog_list *pl; int err; if (((flags & BPF_F_ALLOW_OVERRIDE) && (flags & BPF_F_ALLOW_MULTI)) || ((flags & BPF_F_REPLACE) && !(flags & BPF_F_ALLOW_MULTI))) /* invalid combination */ return -EINVAL; + if (link && (prog || replace_prog)) + /* only either link or prog/replace_prog can be specified */ + return -EINVAL; + if (!!replace_prog != !!(flags & BPF_F_REPLACE)) + /* replace_prog implies BPF_F_REPLACE, and vice versa */ + return -EINVAL; if (!hierarchy_allows_attach(cgrp, type)) return -EPERM; @@ -374,26 +447,15 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, if (prog_list_length(progs) >= BPF_CGROUP_MAX_PROGS) return -E2BIG; - if (flags & BPF_F_ALLOW_MULTI) { - list_for_each_entry(pl, progs, node) { - if (pl->prog == prog) - /* disallow attaching the same prog twice */ - return -EINVAL; - if (pl->prog == replace_prog) - replace_pl = pl; - } - if ((flags & BPF_F_REPLACE) && !replace_pl) - /* prog to replace not found for cgroup */ - return -ENOENT; - } else if (!list_empty(progs)) { - replace_pl = list_first_entry(progs, typeof(*pl), node); - } + pl = find_attach_entry(progs, prog, link, replace_prog, + flags & BPF_F_ALLOW_MULTI); + if (IS_ERR(pl)) + return PTR_ERR(pl); - if (bpf_cgroup_storages_alloc(storage, prog)) + if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog)) return -ENOMEM; - if (replace_pl) { - pl = replace_pl; + if (pl) { old_prog = pl->prog; bpf_cgroup_storages_unlink(pl->storage); bpf_cgroup_storages_assign(old_storage, pl->storage); @@ -407,6 +469,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, } pl->prog = prog; + pl->link = link; bpf_cgroup_storages_assign(pl->storage, storage); cgrp->bpf.flags[type] = saved_flags; @@ -414,80 +477,93 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, if (err) goto cleanup; - static_branch_inc(&cgroup_bpf_enabled_key); bpf_cgroup_storages_free(old_storage); - if (old_prog) { + if (old_prog) bpf_prog_put(old_prog); - static_branch_dec(&cgroup_bpf_enabled_key); - } - bpf_cgroup_storages_link(storage, cgrp, type); + else + static_branch_inc(&cgroup_bpf_enabled_key); + bpf_cgroup_storages_link(pl->storage, cgrp, type); return 0; cleanup: - /* and cleanup the prog list */ - pl->prog = old_prog; + if (old_prog) { + pl->prog = old_prog; + pl->link = NULL; + } bpf_cgroup_storages_free(pl->storage); bpf_cgroup_storages_assign(pl->storage, old_storage); bpf_cgroup_storages_link(pl->storage, cgrp, type); - if (!replace_pl) { + if (!old_prog) { list_del(&pl->node); kfree(pl); } return err; } +static struct bpf_prog_list *find_detach_entry(struct list_head *progs, + struct bpf_prog *prog, + struct bpf_cgroup_link *link, + bool allow_multi) +{ + struct bpf_prog_list *pl; + + if (!allow_multi) { + if (list_empty(progs)) + /* report error when trying to detach and nothing is attached */ + return ERR_PTR(-ENOENT); + + /* to maintain backward compatibility NONE and OVERRIDE cgroups + * allow detaching with invalid FD (prog==NULL) in legacy mode + */ + return list_first_entry(progs, typeof(*pl), node); + } + + if (!prog && !link) + /* to detach MULTI prog the user has to specify valid FD + * of the program or link to be detached + */ + return ERR_PTR(-EINVAL); + + /* find the prog or link and detach it */ + list_for_each_entry(pl, progs, node) { + if (pl->prog == prog && pl->link == link) + return pl; + } + return ERR_PTR(-ENOENT); +} + /** - * __cgroup_bpf_detach() - Detach the program from a cgroup, and + * __cgroup_bpf_detach() - Detach the program or link from a cgroup, and * propagate the change to descendants * @cgrp: The cgroup which descendants to traverse * @prog: A program to detach or NULL + * @prog: A link to detach or NULL * @type: Type of detach operation * + * At most one of @prog or @link can be non-NULL. * Must be called with cgroup_mutex held. */ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type) + struct bpf_cgroup_link *link, enum bpf_attach_type type) { struct list_head *progs = &cgrp->bpf.progs[type]; u32 flags = cgrp->bpf.flags[type]; - struct bpf_prog *old_prog = NULL; struct bpf_prog_list *pl; + struct bpf_prog *old_prog; int err; - if (flags & BPF_F_ALLOW_MULTI) { - if (!prog) - /* to detach MULTI prog the user has to specify valid FD - * of the program to be detached - */ - return -EINVAL; - } else { - if (list_empty(progs)) - /* report error when trying to detach and nothing is attached */ - return -ENOENT; - } + if (prog && link) + /* only one of prog or link can be specified */ + return -EINVAL; - if (flags & BPF_F_ALLOW_MULTI) { - /* find the prog and detach it */ - list_for_each_entry(pl, progs, node) { - if (pl->prog != prog) - continue; - old_prog = prog; - /* mark it deleted, so it's ignored while - * recomputing effective - */ - pl->prog = NULL; - break; - } - if (!old_prog) - return -ENOENT; - } else { - /* to maintain backward compatibility NONE and OVERRIDE cgroups - * allow detaching with invalid FD (prog==NULL) - */ - pl = list_first_entry(progs, typeof(*pl), node); - old_prog = pl->prog; - pl->prog = NULL; - } + pl = find_detach_entry(progs, prog, link, flags & BPF_F_ALLOW_MULTI); + if (IS_ERR(pl)) + return PTR_ERR(pl); + + /* mark it deleted, so it's ignored while recomputing effective */ + old_prog = pl->prog; + pl->prog = NULL; + pl->link = NULL; err = update_effective_progs(cgrp, type); if (err) @@ -501,14 +577,15 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, if (list_empty(progs)) /* last program was detached, reset flags to zero */ cgrp->bpf.flags[type] = 0; - - bpf_prog_put(old_prog); + if (old_prog) + bpf_prog_put(old_prog); static_branch_dec(&cgroup_bpf_enabled_key); return 0; cleanup: - /* and restore back old_prog */ + /* restore back prog or link */ pl->prog = old_prog; + pl->link = link; return err; } @@ -521,6 +598,7 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, struct list_head *progs = &cgrp->bpf.progs[type]; u32 flags = cgrp->bpf.flags[type]; struct bpf_prog_array *effective; + struct bpf_prog *prog; int cnt, ret = 0, i; effective = rcu_dereference_protected(cgrp->bpf.effective[type], @@ -551,7 +629,8 @@ int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, i = 0; list_for_each_entry(pl, progs, node) { - id = pl->prog->aux->id; + prog = prog_list_prog(pl); + id = prog->aux->id; if (copy_to_user(prog_ids + i, &id, sizeof(id))) return -EFAULT; if (++i == cnt) @@ -581,8 +660,8 @@ int cgroup_bpf_prog_attach(const union bpf_attr *attr, } } - ret = cgroup_bpf_attach(cgrp, prog, replace_prog, attr->attach_type, - attr->attach_flags); + ret = cgroup_bpf_attach(cgrp, prog, replace_prog, NULL, + attr->attach_type, attr->attach_flags); if (replace_prog) bpf_prog_put(replace_prog); @@ -604,7 +683,7 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype) if (IS_ERR(prog)) prog = NULL; - ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type, 0); + ret = cgroup_bpf_detach(cgrp, prog, attr->attach_type); if (prog) bpf_prog_put(prog); @@ -612,6 +691,90 @@ int cgroup_bpf_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype) return ret; } +static void bpf_cgroup_link_release(struct bpf_link *link) +{ + struct bpf_cgroup_link *cg_link = + container_of(link, struct bpf_cgroup_link, link); + + /* link might have been auto-detached by dying cgroup already, + * in that case our work is done here + */ + if (!cg_link->cgroup) + return; + + mutex_lock(&cgroup_mutex); + + /* re-check cgroup under lock again */ + if (!cg_link->cgroup) { + mutex_unlock(&cgroup_mutex); + return; + } + + WARN_ON(__cgroup_bpf_detach(cg_link->cgroup, NULL, cg_link, + cg_link->type)); + + mutex_unlock(&cgroup_mutex); + cgroup_put(cg_link->cgroup); +} + +static void bpf_cgroup_link_dealloc(struct bpf_link *link) +{ + struct bpf_cgroup_link *cg_link = + container_of(link, struct bpf_cgroup_link, link); + + kfree(cg_link); +} + +const struct bpf_link_ops bpf_cgroup_link_lops = { + .release = bpf_cgroup_link_release, + .dealloc = bpf_cgroup_link_dealloc, +}; + +int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog) +{ + struct bpf_cgroup_link *link; + struct file *link_file; + struct cgroup *cgrp; + int err, link_fd; + + if (attr->link_create.flags) + return -EINVAL; + + cgrp = cgroup_get_from_fd(attr->link_create.target_fd); + if (IS_ERR(cgrp)) + return PTR_ERR(cgrp); + + link = kzalloc(sizeof(*link), GFP_USER); + if (!link) { + err = -ENOMEM; + goto out_put_cgroup; + } + bpf_link_init(&link->link, &bpf_cgroup_link_lops, prog); + link->cgroup = cgrp; + link->type = attr->link_create.attach_type; + + link_file = bpf_link_new_file(&link->link, &link_fd); + if (IS_ERR(link_file)) { + kfree(link); + err = PTR_ERR(link_file); + goto out_put_cgroup; + } + + err = cgroup_bpf_attach(cgrp, NULL, NULL, link, link->type, + BPF_F_ALLOW_MULTI); + if (err) { + bpf_link_cleanup(&link->link, link_file, link_fd); + goto out_put_cgroup; + } + + fd_install(link_fd, link_file); + return link_fd; + +out_put_cgroup: + cgroup_put(cgrp); + return err; +} + int cgroup_bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index a616b63f23b4..97d5c6fb63cd 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2175,13 +2175,6 @@ static int bpf_obj_get(const union bpf_attr *attr) attr->file_flags); } -struct bpf_link { - atomic64_t refcnt; - const struct bpf_link_ops *ops; - struct bpf_prog *prog; - struct work_struct work; -}; - void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, struct bpf_prog *prog) { @@ -2195,8 +2188,8 @@ void bpf_link_init(struct bpf_link *link, const struct bpf_link_ops *ops, * anon_inode's release() call. This helper manages marking bpf_link as * defunct, releases anon_inode file and puts reserved FD. */ -static void bpf_link_cleanup(struct bpf_link *link, struct file *link_file, - int link_fd) +void bpf_link_cleanup(struct bpf_link *link, struct file *link_file, + int link_fd) { link->prog = NULL; fput(link_file); @@ -2266,6 +2259,10 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp) link_type = "raw_tracepoint"; else if (link->ops == &bpf_tracing_link_lops) link_type = "tracing"; +#ifdef CONFIG_CGROUP_BPF + else if (link->ops == &bpf_cgroup_link_lops) + link_type = "cgroup"; +#endif else link_type = "unknown"; @@ -3553,6 +3550,52 @@ err_put: return err; } +#define BPF_LINK_CREATE_LAST_FIELD link_create.flags +static int link_create(union bpf_attr *attr) +{ + enum bpf_prog_type ptype; + struct bpf_prog *prog; + int ret; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (CHECK_ATTR(BPF_LINK_CREATE)) + return -EINVAL; + + ptype = attach_type_to_prog_type(attr->link_create.attach_type); + if (ptype == BPF_PROG_TYPE_UNSPEC) + return -EINVAL; + + prog = bpf_prog_get_type(attr->link_create.prog_fd, ptype); + if (IS_ERR(prog)) + return PTR_ERR(prog); + + ret = bpf_prog_attach_check_attach_type(prog, + attr->link_create.attach_type); + if (ret) + goto err_out; + + switch (ptype) { + case BPF_PROG_TYPE_CGROUP_SKB: + case BPF_PROG_TYPE_CGROUP_SOCK: + case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: + case BPF_PROG_TYPE_SOCK_OPS: + case BPF_PROG_TYPE_CGROUP_DEVICE: + case BPF_PROG_TYPE_CGROUP_SYSCTL: + case BPF_PROG_TYPE_CGROUP_SOCKOPT: + ret = cgroup_bpf_link_attach(attr, prog); + break; + default: + ret = -EINVAL; + } + +err_out: + if (ret < 0) + bpf_prog_put(prog); + return ret; +} + SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; @@ -3663,6 +3706,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_MAP_DELETE_BATCH: err = bpf_map_do_batch(&attr, uattr, BPF_MAP_DELETE_BATCH); break; + case BPF_LINK_CREATE: + err = link_create(&attr); + break; default: err = -EINVAL; break; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 3dead0416b91..219624fba9ba 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -6303,27 +6303,31 @@ void cgroup_sk_free(struct sock_cgroup_data *skcd) #endif /* CONFIG_SOCK_CGROUP_DATA */ #ifdef CONFIG_CGROUP_BPF -int cgroup_bpf_attach(struct cgroup *cgrp, struct bpf_prog *prog, - struct bpf_prog *replace_prog, enum bpf_attach_type type, +int cgroup_bpf_attach(struct cgroup *cgrp, + struct bpf_prog *prog, struct bpf_prog *replace_prog, + struct bpf_cgroup_link *link, + enum bpf_attach_type type, u32 flags) { int ret; mutex_lock(&cgroup_mutex); - ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, type, flags); + ret = __cgroup_bpf_attach(cgrp, prog, replace_prog, link, type, flags); mutex_unlock(&cgroup_mutex); return ret; } + int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, - enum bpf_attach_type type, u32 flags) + enum bpf_attach_type type) { int ret; mutex_lock(&cgroup_mutex); - ret = __cgroup_bpf_detach(cgrp, prog, type); + ret = __cgroup_bpf_detach(cgrp, prog, NULL, type); mutex_unlock(&cgroup_mutex); return ret; } + int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr) { diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9f786a5a44ac..37dffe5089a0 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -111,6 +111,7 @@ enum bpf_cmd { BPF_MAP_LOOKUP_AND_DELETE_BATCH, BPF_MAP_UPDATE_BATCH, BPF_MAP_DELETE_BATCH, + BPF_LINK_CREATE, }; enum bpf_map_type { @@ -541,7 +542,7 @@ union bpf_attr { __u32 prog_cnt; } query; - struct { + struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ __u64 name; __u32 prog_fd; } raw_tracepoint; @@ -569,6 +570,13 @@ union bpf_attr { __u64 probe_offset; /* output: probe_offset */ __u64 probe_addr; /* output: probe_addr */ } task_fd_query; + + struct { /* struct used by BPF_LINK_CREATE command */ + __u32 prog_fd; /* eBPF program to attach */ + __u32 target_fd; /* object to attach to */ + __u32 attach_type; /* attach type */ + __u32 flags; /* extra flags */ + } link_create; } __attribute__((aligned(8))); /* The description below is an attempt at providing documentation to eBPF -- cgit v1.2.3-71-gd317 From 0c991ebc8c69d29b7fc44db17075c5aa5253e2ab Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Sun, 29 Mar 2020 19:59:59 -0700 Subject: bpf: Implement bpf_prog replacement for an active bpf_cgroup_link Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from under given bpf_link. Currently this is only supported for bpf_cgroup_link, but will be extended to other kinds of bpf_links in follow-up patches. For bpf_cgroup_link, implemented functionality matches existing semantics for direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either unconditionally set new bpf_prog regardless of which bpf_prog is currently active under given bpf_link, or, optionally, can specify expected active bpf_prog. If active bpf_prog doesn't match expected one, no changes are performed, old bpf_link stays intact and attached, operation returns a failure. cgroup_bpf_replace() operation is resolving race between auto-detachment and bpf_prog update in the same fashion as it's done for bpf_link detachment, except in this case update has no way of succeeding because of target cgroup marked as dying. So in this case error is returned. Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com --- include/linux/bpf-cgroup.h | 12 +++++++ include/uapi/linux/bpf.h | 12 +++++++ kernel/bpf/cgroup.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/bpf/syscall.c | 55 +++++++++++++++++++++++++++++++ kernel/cgroup/cgroup.c | 27 ++++++++++++++++ 5 files changed, 186 insertions(+) (limited to 'kernel') diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index d2d969669564..c11b413d5b1a 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -100,6 +100,8 @@ int __cgroup_bpf_attach(struct cgroup *cgrp, int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, struct bpf_cgroup_link *link, enum bpf_attach_type type); +int __cgroup_bpf_replace(struct cgroup *cgrp, struct bpf_cgroup_link *link, + struct bpf_prog *new_prog); int __cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr); @@ -110,6 +112,8 @@ int cgroup_bpf_attach(struct cgroup *cgrp, u32 flags); int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, enum bpf_attach_type type); +int cgroup_bpf_replace(struct bpf_link *link, struct bpf_prog *old_prog, + struct bpf_prog *new_prog); int cgroup_bpf_query(struct cgroup *cgrp, const union bpf_attr *attr, union bpf_attr __user *uattr); @@ -350,6 +354,7 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr, #else struct bpf_prog; +struct bpf_link; struct cgroup_bpf {}; static inline int cgroup_bpf_inherit(struct cgroup *cgrp) { return 0; } static inline void cgroup_bpf_offline(struct cgroup *cgrp) {} @@ -373,6 +378,13 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr, return -EINVAL; } +static inline int cgroup_bpf_replace(struct bpf_link *link, + struct bpf_prog *old_prog, + struct bpf_prog *new_prog) +{ + return -EINVAL; +} + static inline int cgroup_bpf_prog_query(const union bpf_attr *attr, union bpf_attr __user *uattr) { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 37dffe5089a0..2e29a671d67e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -112,6 +112,7 @@ enum bpf_cmd { BPF_MAP_UPDATE_BATCH, BPF_MAP_DELETE_BATCH, BPF_LINK_CREATE, + BPF_LINK_UPDATE, }; enum bpf_map_type { @@ -577,6 +578,17 @@ union bpf_attr { __u32 attach_type; /* attach type */ __u32 flags; /* extra flags */ } link_create; + + struct { /* struct used by BPF_LINK_UPDATE command */ + __u32 link_fd; /* link fd */ + /* new program fd to update link with */ + __u32 new_prog_fd; + __u32 flags; /* extra flags */ + /* expected link's program fd; is specified only if + * BPF_F_REPLACE flag is set in flags */ + __u32 old_prog_fd; + } link_update; + } __attribute__((aligned(8))); /* The description below is an attempt at providing documentation to eBPF diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index c24029937431..80676fc00d81 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -500,6 +500,86 @@ cleanup: return err; } +/* Swap updated BPF program for given link in effective program arrays across + * all descendant cgroups. This function is guaranteed to succeed. + */ +static void replace_effective_prog(struct cgroup *cgrp, + enum bpf_attach_type type, + struct bpf_cgroup_link *link) +{ + struct bpf_prog_array_item *item; + struct cgroup_subsys_state *css; + struct bpf_prog_array *progs; + struct bpf_prog_list *pl; + struct list_head *head; + struct cgroup *cg; + int pos; + + css_for_each_descendant_pre(css, &cgrp->self) { + struct cgroup *desc = container_of(css, struct cgroup, self); + + if (percpu_ref_is_zero(&desc->bpf.refcnt)) + continue; + + /* find position of link in effective progs array */ + for (pos = 0, cg = desc; cg; cg = cgroup_parent(cg)) { + if (pos && !(cg->bpf.flags[type] & BPF_F_ALLOW_MULTI)) + continue; + + head = &cg->bpf.progs[type]; + list_for_each_entry(pl, head, node) { + if (!prog_list_prog(pl)) + continue; + if (pl->link == link) + goto found; + pos++; + } + } +found: + BUG_ON(!cg); + progs = rcu_dereference_protected( + desc->bpf.effective[type], + lockdep_is_held(&cgroup_mutex)); + item = &progs->items[pos]; + WRITE_ONCE(item->prog, link->link.prog); + } +} + +/** + * __cgroup_bpf_replace() - Replace link's program and propagate the change + * to descendants + * @cgrp: The cgroup which descendants to traverse + * @link: A link for which to replace BPF program + * @type: Type of attach operation + * + * Must be called with cgroup_mutex held. + */ +int __cgroup_bpf_replace(struct cgroup *cgrp, struct bpf_cgroup_link *link, + struct bpf_prog *new_prog) +{ + struct list_head *progs = &cgrp->bpf.progs[link->type]; + struct bpf_prog *old_prog; + struct bpf_prog_list *pl; + bool found = false; + + if (link->link.prog->type != new_prog->type) + return -EINVAL; + + list_for_each_entry(pl, progs, node) { + if (pl->link == link) { + found = true; + break; + } + } + if (!found) + return -ENOENT; + + old_prog = xchg(&link->link.prog, new_prog); + replace_effective_prog(cgrp, link->type, link); + bpf_prog_put(old_prog); + return 0; +} + static struct bpf_prog_list *find_detach_entry(struct list_head *progs, struct bpf_prog *prog, struct bpf_cgroup_link *link, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 97d5c6fb63cd..e0a3b34d7039 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -3596,6 +3596,58 @@ err_out: return ret; } +#define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd + +static int link_update(union bpf_attr *attr) +{ + struct bpf_prog *old_prog = NULL, *new_prog; + struct bpf_link *link; + u32 flags; + int ret; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (CHECK_ATTR(BPF_LINK_UPDATE)) + return -EINVAL; + + flags = attr->link_update.flags; + if (flags & ~BPF_F_REPLACE) + return -EINVAL; + + link = bpf_link_get_from_fd(attr->link_update.link_fd); + if (IS_ERR(link)) + return PTR_ERR(link); + + new_prog = bpf_prog_get(attr->link_update.new_prog_fd); + if (IS_ERR(new_prog)) + return PTR_ERR(new_prog); + + if (flags & BPF_F_REPLACE) { + old_prog = bpf_prog_get(attr->link_update.old_prog_fd); + if (IS_ERR(old_prog)) { + ret = PTR_ERR(old_prog); + old_prog = NULL; + goto out_put_progs; + } + } + +#ifdef CONFIG_CGROUP_BPF + if (link->ops == &bpf_cgroup_link_lops) { + ret = cgroup_bpf_replace(link, old_prog, new_prog); + goto out_put_progs; + } +#endif + ret = -EINVAL; + +out_put_progs: + if (old_prog) + bpf_prog_put(old_prog); + if (ret) + bpf_prog_put(new_prog); + return ret; +} + SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; @@ -3709,6 +3761,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_LINK_CREATE: err = link_create(&attr); break; + case BPF_LINK_UPDATE: + err = link_update(&attr); + break; default: err = -EINVAL; break; diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 219624fba9ba..915dda3f7f19 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -6317,6 +6317,33 @@ int cgroup_bpf_attach(struct cgroup *cgrp, return ret; } +int cgroup_bpf_replace(struct bpf_link *link, struct bpf_prog *old_prog, + struct bpf_prog *new_prog) +{ + struct bpf_cgroup_link *cg_link; + int ret; + + if (link->ops != &bpf_cgroup_link_lops) + return -EINVAL; + + cg_link = container_of(link, struct bpf_cgroup_link, link); + + mutex_lock(&cgroup_mutex); + /* link might have been auto-released by dying cgroup, so fail */ + if (!cg_link->cgroup) { + ret = -EINVAL; + goto out_unlock; + } + if (old_prog && link->prog != old_prog) { + ret = -EPERM; + goto out_unlock; + } + ret = __cgroup_bpf_replace(cg_link->cgroup, cg_link, new_prog); +out_unlock: + mutex_unlock(&cgroup_mutex); + return ret; +} + int cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog, enum bpf_attach_type type) { -- cgit v1.2.3-71-gd317 From 3704a6a445790e6621c19be25d85dfadbeb16a69 Mon Sep 17 00:00:00 2001 From: Dexuan Cui Date: Tue, 31 Mar 2020 08:55:25 -0700 Subject: PM: hibernate: Propagate the return value of hibernation_restore() If hibernation_restore() fails, the 'error' should not be zero. Signed-off-by: Dexuan Cui Signed-off-by: Rafael J. Wysocki --- kernel/power/hibernate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 6dbeedb7354c..86aba8706b16 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -678,7 +678,7 @@ static int load_image_and_restore(void) error = swsusp_read(&flags); swsusp_close(FMODE_READ); if (!error) - hibernation_restore(flags & SF_PLATFORM_MODE); + error = hibernation_restore(flags & SF_PLATFORM_MODE); pr_err("Failed to load image, recovering.\n"); swsusp_free(); -- cgit v1.2.3-71-gd317 From 73d20564e0dcae003e0d79977f044d5e57496304 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Tue, 31 Mar 2020 22:18:49 +0200 Subject: hrtimer: Don't dereference the hrtimer pointer after the callback A hrtimer can be released in its callback, but lockdep_hrtimer_exit() dereferences the pointer after the callback returns, i.e. a potential use after free. Retrieve the context in which the hrtimer expires before the callback is invoked and use it in lockdep_hrtimer_exit(). Fixes: 40db173965c0 ("lockdep: Add hrtimer context tracing bits") Reported-by: syzbot+62c155c276e580cfb606@syzkaller.appspotmail.com Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200331201849.fkp2siy3vcdqvqlz@linutronix.de --- include/linux/irqflags.h | 27 ++++++++++++++++----------- kernel/time/hrtimer.c | 5 +++-- 2 files changed, 19 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index ceca42de4438..61a9ced3aa50 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -58,16 +58,21 @@ do { \ } while (0) # define lockdep_hrtimer_enter(__hrtimer) \ - do { \ - if (!__hrtimer->is_hard) \ - current->irq_config = 1; \ - } while (0) - -# define lockdep_hrtimer_exit(__hrtimer) \ - do { \ - if (!__hrtimer->is_hard) \ +({ \ + bool __expires_hardirq = true; \ + \ + if (!__hrtimer->is_hard) { \ + current->irq_config = 1; \ + __expires_hardirq = false; \ + } \ + __expires_hardirq; \ +}) + +# define lockdep_hrtimer_exit(__expires_hardirq) \ + do { \ + if (!__expires_hardirq) \ current->irq_config = 0; \ - } while (0) + } while (0) # define lockdep_posixtimer_enter() \ do { \ @@ -102,8 +107,8 @@ do { \ # define lockdep_hardirq_exit() do { } while (0) # define lockdep_softirq_enter() do { } while (0) # define lockdep_softirq_exit() do { } while (0) -# define lockdep_hrtimer_enter(__hrtimer) do { } while (0) -# define lockdep_hrtimer_exit(__hrtimer) do { } while (0) +# define lockdep_hrtimer_enter(__hrtimer) false +# define lockdep_hrtimer_exit(__context) do { } while (0) # define lockdep_posixtimer_enter() do { } while (0) # define lockdep_posixtimer_exit() do { } while (0) # define lockdep_irq_work_enter(__work) do { } while (0) diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index d0a5ba37aff4..d89da1c7e005 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1480,6 +1480,7 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, unsigned long flags) __must_hold(&cpu_base->lock) { enum hrtimer_restart (*fn)(struct hrtimer *); + bool expires_in_hardirq; int restart; lockdep_assert_held(&cpu_base->lock); @@ -1514,11 +1515,11 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base, */ raw_spin_unlock_irqrestore(&cpu_base->lock, flags); trace_hrtimer_expire_entry(timer, now); - lockdep_hrtimer_enter(timer); + expires_in_hardirq = lockdep_hrtimer_enter(timer); restart = fn(timer); - lockdep_hrtimer_exit(timer); + lockdep_hrtimer_exit(expires_in_hardirq); trace_hrtimer_expire_exit(timer); raw_spin_lock_irq(&cpu_base->lock); -- cgit v1.2.3-71-gd317 From d228bee8201a7ea77c414f1298b2f572f42c6113 Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Thu, 13 Feb 2020 09:57:24 +0000 Subject: kdb: Eliminate strncpy() warnings by replacing with strscpy() Currently the code to manage the kdb history buffer uses strncpy() to copy strings to/and from the history and exhibits the classic "but nobody ever told me that strncpy() doesn't always terminate strings" bug. Modern gcc compilers recognise this bug and issue a warning. In reality these calls will only abridge the copied string if kdb_read() has *already* overflowed the command buffer. Thus the use of counted copies here is only used to reduce the secondary effects of a bug elsewhere in the code. Therefore transitioning these calls into strscpy() (without checking the return code) is appropriate. Signed-off-by: Daniel Thompson Reviewed-by: Douglas Anderson --- kernel/debug/kdb/kdb_main.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index ba12e9f4661e..a4641be4123c 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -1102,12 +1102,12 @@ static int handle_ctrl_cmd(char *cmd) case CTRL_P: if (cmdptr != cmd_tail) cmdptr = (cmdptr-1) % KDB_CMD_HISTORY_COUNT; - strncpy(cmd_cur, cmd_hist[cmdptr], CMD_BUFLEN); + strscpy(cmd_cur, cmd_hist[cmdptr], CMD_BUFLEN); return 1; case CTRL_N: if (cmdptr != cmd_head) cmdptr = (cmdptr+1) % KDB_CMD_HISTORY_COUNT; - strncpy(cmd_cur, cmd_hist[cmdptr], CMD_BUFLEN); + strscpy(cmd_cur, cmd_hist[cmdptr], CMD_BUFLEN); return 1; } return 0; @@ -1314,7 +1314,7 @@ do_full_getstr: if (*cmdbuf != '\n') { if (*cmdbuf < 32) { if (cmdptr == cmd_head) { - strncpy(cmd_hist[cmd_head], cmd_cur, + strscpy(cmd_hist[cmd_head], cmd_cur, CMD_BUFLEN); *(cmd_hist[cmd_head] + strlen(cmd_hist[cmd_head])-1) = '\0'; @@ -1324,7 +1324,7 @@ do_full_getstr: cmdbuf = cmd_cur; goto do_full_getstr; } else { - strncpy(cmd_hist[cmd_head], cmd_cur, + strscpy(cmd_hist[cmd_head], cmd_cur, CMD_BUFLEN); } -- cgit v1.2.3-71-gd317 From ad99b5105c0823ff02126497f4366e6a8009453e Mon Sep 17 00:00:00 2001 From: Daniel Thompson Date: Thu, 13 Feb 2020 15:16:40 +0000 Subject: kdb: Censor attempts to set PROMPT without ENABLE_MEM_READ Currently the PROMPT variable could be abused to provoke the printf() machinery to read outside the current stack frame. Normally this doesn't matter becaues md is already a much better tool for reading from memory. However the md command can be disabled by not setting KDB_ENABLE_MEM_READ. Let's also prevent PROMPT from being modified in these circumstances. Whilst adding a comment to help future code reviewers we also remove the #ifdef where PROMPT in consumed. There is no problem passing an unused (0) to snprintf when !CONFIG_SMP. argument Reported-by: Wang Xiayang Signed-off-by: Daniel Thompson Reviewed-by: Douglas Anderson --- kernel/debug/kdb/kdb_main.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index a4641be4123c..515379cbf209 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -398,6 +398,13 @@ int kdb_set(int argc, const char **argv) if (argc != 2) return KDB_ARGCOUNT; + /* + * Censor sensitive variables + */ + if (strcmp(argv[1], "PROMPT") == 0 && + !kdb_check_flags(KDB_ENABLE_MEM_READ, kdb_cmd_enabled, false)) + return KDB_NOPERM; + /* * Check for internal variables */ @@ -1298,12 +1305,9 @@ static int kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs, *(cmd_hist[cmd_head]) = '\0'; do_full_getstr: -#if defined(CONFIG_SMP) + /* PROMPT can only be set if we have MEM_READ permission. */ snprintf(kdb_prompt_str, CMD_BUFLEN, kdbgetenv("PROMPT"), raw_smp_processor_id()); -#else - snprintf(kdb_prompt_str, CMD_BUFLEN, kdbgetenv("PROMPT")); -#endif if (defcmd_in_progress) strncat(kdb_prompt_str, "[defcmd]", CMD_BUFLEN); -- cgit v1.2.3-71-gd317 From d1e7fd6462ca9fc76650fbe6ca800e35b24267da Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 30 Mar 2020 19:01:04 -0500 Subject: signal: Extend exec_id to 64bits Replace the 32bit exec_id with a 64bit exec_id to make it impossible to wrap the exec_id counter. With care an attacker can cause exec_id wrap and send arbitrary signals to a newly exec'd parent. This bypasses the signal sending checks if the parent changes their credentials during exec. The severity of this problem can been seen that in my limited testing of a 32bit exec_id it can take as little as 19s to exec 65536 times. Which means that it can take as little as 14 days to wrap a 32bit exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7 days. Even my slower timing is in the uptime of a typical server. Which means self_exec_id is simply a speed bump today, and if exec gets noticably faster self_exec_id won't even be a speed bump. Extending self_exec_id to 64bits introduces a problem on 32bit architectures where reading self_exec_id is no longer atomic and can take two read instructions. Which means that is is possible to hit a window where the read value of exec_id does not match the written value. So with very lucky timing after this change this still remains expoiltable. I have updated the update of exec_id on exec to use WRITE_ONCE and the read of exec_id in do_notify_parent to use READ_ONCE to make it clear that there is no locking between these two locations. Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl Fixes: 2.3.23pre2 Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" --- fs/exec.c | 2 +- include/linux/sched.h | 4 ++-- kernel/signal.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/fs/exec.c b/fs/exec.c index 0e46ec57fe0a..d55710a36056 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm) /* An exec changes our domain. We are no longer part of the thread group */ - current->self_exec_id++; + WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1); flush_signal_handlers(current, 0); } EXPORT_SYMBOL(setup_new_exec); diff --git a/include/linux/sched.h b/include/linux/sched.h index 04278493bf15..0323e4f0982a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -939,8 +939,8 @@ struct task_struct { struct seccomp seccomp; /* Thread group tracking: */ - u32 parent_exec_id; - u32 self_exec_id; + u64 parent_exec_id; + u64 self_exec_id; /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */ spinlock_t alloc_lock; diff --git a/kernel/signal.c b/kernel/signal.c index 9ad8dea93dbb..5383b562df85 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig) * This is only possible if parent == real_parent. * Check if it has changed security domain. */ - if (tsk->parent_exec_id != tsk->parent->self_exec_id) + if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id)) sig = SIGCHLD; } -- cgit v1.2.3-71-gd317 From db96a75946d3a2a41092c83992e65f727cd35529 Mon Sep 17 00:00:00 2001 From: Chen Yu Date: Thu, 2 Apr 2020 15:56:52 +0800 Subject: PM: sleep: Add pm_debug_messages kernel command line option Debug messages from the system suspend/hibernation infrastructure are disabled by default, and can only be enabled after the system has boot up via /sys/power/pm_debug_messages. This makes the hibernation resume hard to track as it involves system boot up across hibernation. There's no chance for software_resume() to track the resume process, for example. Add a kernel command line option to set pm_debug_messages during boot up. Signed-off-by: Chen Yu [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki --- Documentation/admin-guide/kernel-parameters.txt | 3 +++ kernel/power/main.c | 7 +++++++ 2 files changed, 10 insertions(+) (limited to 'kernel') diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4e3885e38179..df2c2b97f417 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3719,6 +3719,9 @@ Override pmtimer IOPort with a hex value. e.g. pmtmr=0x508 + pm_debug_messages [SUSPEND,KNL] + Enable suspend/resume debug messages during boot up. + pnp.debug=1 [PNP] Enable PNP debug messages (depends on the CONFIG_PNP_DEBUG_MESSAGES option). Change at run-time diff --git a/kernel/power/main.c b/kernel/power/main.c index 69b7a8aeca3b..40f86ec4ab30 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -535,6 +535,13 @@ static ssize_t pm_debug_messages_store(struct kobject *kobj, power_attr(pm_debug_messages); +static int __init pm_debug_messages_setup(char *str) +{ + pm_debug_messages_on = true; + return 1; +} +__setup("pm_debug_messages", pm_debug_messages_setup); + /** * __pm_pr_dbg - Print a suspend debug message to the kernel log. * @defer: Whether or not to use printk_deferred() to print the message. -- cgit v1.2.3-71-gd317 From f4b00eab5004e823f28a268580ae4ed16df9fabf Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Wed, 1 Apr 2020 21:06:46 -0700 Subject: mm: kmem: rename memcg_kmem_(un)charge() into memcg_kmem_(un)charge_page() Rename (__)memcg_kmem_(un)charge() into (__)memcg_kmem_(un)charge_page() to better reflect what they are actually doing: 1) call __memcg_kmem_(un)charge_memcg() to actually charge or uncharge the current memcg 2) set or clear the PageKmemcg flag Signed-off-by: Roman Gushchin Signed-off-by: Andrew Morton Reviewed-by: Shakeel Butt Acked-by: Johannes Weiner Cc: Michal Hocko Cc: Vladimir Davydov Link: http://lkml.kernel.org/r/20200109202659.752357-4-guro@fb.com Signed-off-by: Linus Torvalds --- fs/pipe.c | 2 +- include/linux/memcontrol.h | 23 +++++++++++++---------- kernel/fork.c | 9 +++++---- mm/memcontrol.c | 8 ++++---- mm/page_alloc.c | 4 ++-- 5 files changed, 25 insertions(+), 21 deletions(-) (limited to 'kernel') diff --git a/fs/pipe.c b/fs/pipe.c index 2144507447c5..16fb72e9abf7 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -146,7 +146,7 @@ static int anon_pipe_buf_steal(struct pipe_inode_info *pipe, struct page *page = buf->page; if (page_count(page) == 1) { - memcg_kmem_uncharge(page, 0); + memcg_kmem_uncharge_page(page, 0); __SetPageLocked(page); return 0; } diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 13b5d70b8b0e..4bc97ae50f3b 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -1367,8 +1367,8 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep); void memcg_kmem_put_cache(struct kmem_cache *cachep); #ifdef CONFIG_MEMCG_KMEM -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order); -void __memcg_kmem_uncharge(struct page *page, int order); +int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order); +void __memcg_kmem_uncharge_page(struct page *page, int order); int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order); void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg, unsigned int nr_pages); @@ -1393,17 +1393,18 @@ static inline bool memcg_kmem_enabled(void) return static_branch_unlikely(&memcg_kmem_enabled_key); } -static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) +static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp, + int order) { if (memcg_kmem_enabled()) - return __memcg_kmem_charge(page, gfp, order); + return __memcg_kmem_charge_page(page, gfp, order); return 0; } -static inline void memcg_kmem_uncharge(struct page *page, int order) +static inline void memcg_kmem_uncharge_page(struct page *page, int order) { if (memcg_kmem_enabled()) - __memcg_kmem_uncharge(page, order); + __memcg_kmem_uncharge_page(page, order); } static inline int memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, @@ -1435,21 +1436,23 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p); #else -static inline int memcg_kmem_charge(struct page *page, gfp_t gfp, int order) +static inline int memcg_kmem_charge_page(struct page *page, gfp_t gfp, + int order) { return 0; } -static inline void memcg_kmem_uncharge(struct page *page, int order) +static inline void memcg_kmem_uncharge_page(struct page *page, int order) { } -static inline int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order) +static inline int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, + int order) { return 0; } -static inline void __memcg_kmem_uncharge(struct page *page, int order) +static inline void __memcg_kmem_uncharge_page(struct page *page, int order) { } diff --git a/kernel/fork.c b/kernel/fork.c index d90af13431c7..e37e12c203d1 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -281,7 +281,7 @@ static inline void free_thread_stack(struct task_struct *tsk) MEMCG_KERNEL_STACK_KB, -(int)(PAGE_SIZE / 1024)); - memcg_kmem_uncharge(vm->pages[i], 0); + memcg_kmem_uncharge_page(vm->pages[i], 0); } for (i = 0; i < NR_CACHED_STACKS; i++) { @@ -413,12 +413,13 @@ static int memcg_charge_kernel_stack(struct task_struct *tsk) for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++) { /* - * If memcg_kmem_charge() fails, page->mem_cgroup - * pointer is NULL, and both memcg_kmem_uncharge() + * If memcg_kmem_charge_page() fails, page->mem_cgroup + * pointer is NULL, and both memcg_kmem_uncharge_page() * and mod_memcg_page_state() in free_thread_stack() * will ignore this page. So it's safe. */ - ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0); + ret = memcg_kmem_charge_page(vm->pages[i], GFP_KERNEL, + 0); if (ret) return ret; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 896b6ebef6a2..4dd3d2883382 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2917,14 +2917,14 @@ int __memcg_kmem_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp, int order) } /** - * __memcg_kmem_charge: charge a kmem page to the current memory cgroup + * __memcg_kmem_charge_page: charge a kmem page to the current memory cgroup * @page: page to charge * @gfp: reclaim mode * @order: allocation order * * Returns 0 on success, an error code on failure. */ -int __memcg_kmem_charge(struct page *page, gfp_t gfp, int order) +int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order) { struct mem_cgroup *memcg; int ret = 0; @@ -2960,11 +2960,11 @@ void __memcg_kmem_uncharge_memcg(struct mem_cgroup *memcg, page_counter_uncharge(&memcg->memsw, nr_pages); } /** - * __memcg_kmem_uncharge: uncharge a kmem page + * __memcg_kmem_uncharge_page: uncharge a kmem page * @page: page to uncharge * @order: allocation order */ -void __memcg_kmem_uncharge(struct page *page, int order) +void __memcg_kmem_uncharge_page(struct page *page, int order) { struct mem_cgroup *memcg = page->mem_cgroup; unsigned int nr_pages = 1 << order; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8f3a3bf2c347..7c30f90c3ef4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1153,7 +1153,7 @@ static __always_inline bool free_pages_prepare(struct page *page, if (PageMappingFlags(page)) page->mapping = NULL; if (memcg_kmem_enabled() && PageKmemcg(page)) - __memcg_kmem_uncharge(page, order); + __memcg_kmem_uncharge_page(page, order); if (check_free) bad += free_pages_check(page); if (bad) @@ -4753,7 +4753,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, out: if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page && - unlikely(__memcg_kmem_charge(page, gfp_mask, order) != 0)) { + unlikely(__memcg_kmem_charge_page(page, gfp_mask, order) != 0)) { __free_pages(page, order); page = NULL; } -- cgit v1.2.3-71-gd317 From 8a931f801340c2be10552c7b5622d5f4852f3a36 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 1 Apr 2020 21:07:07 -0700 Subject: mm: memcontrol: recursive memory.low protection MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workload wrt each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution. Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host-level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective. To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules. That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees. E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings. This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings. The floating protection composes naturally with fixed protection. Consider the following example tree: A A: low = 2G / \ A1: low = 1G A1 A2 A2: low = 0G As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G. There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'. Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Acked-by: Tejun Heo Acked-by: Roman Gushchin Acked-by: Chris Down Cc: Michal Hocko Cc: Michal Koutný Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 11 +++++++ include/linux/cgroup-defs.h | 5 ++++ kernel/cgroup/cgroup.c | 17 ++++++++++- mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++--- 4 files changed, 79 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index fbb111616705..bcc80269bb6a 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -188,6 +188,17 @@ cgroup v2 currently supports the following mount options. modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts. + memory_recursiveprot + + Recursively apply memory.min and memory.low protection to + entire subtrees, without requiring explicit downward + propagation into leaf cgroups. This allows protecting entire + subtrees from one another, while retaining free competition + within those subtrees. This should have been the default + behavior but is a mount-option to avoid regressing setups + relying on the original semantics (e.g. specifying bogusly + high 'bypass' protection values at higher tree levels). + Organizing Processes and Threads -------------------------------- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 63097cb243cb..e1fafed22db1 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -94,6 +94,11 @@ enum { * Enable legacy local memory.events. */ CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5), + + /* + * Enable recursive subtree protection + */ + CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6), }; /* cftype->flags */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 915dda3f7f19..755c07d845ce 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1813,12 +1813,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, enum cgroup2_param { Opt_nsdelegate, Opt_memory_localevents, + Opt_memory_recursiveprot, nr__cgroup2_params }; static const struct fs_parameter_spec cgroup2_fs_parameters[] = { fsparam_flag("nsdelegate", Opt_nsdelegate), fsparam_flag("memory_localevents", Opt_memory_localevents), + fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), {} }; @@ -1839,6 +1841,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_memory_localevents: ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; return 0; + case Opt_memory_recursiveprot: + ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; + return 0; } return -EINVAL; } @@ -1855,6 +1860,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; else cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS; + + if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; } } @@ -1864,6 +1874,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root seq_puts(seq, ",nsdelegate"); if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) seq_puts(seq, ",memory_localevents"); + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) + seq_puts(seq, ",memory_recursiveprot"); return 0; } @@ -6412,7 +6424,10 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate); static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n"); + return snprintf(buf, PAGE_SIZE, + "nsdelegate\n" + "memory_localevents\n" + "memory_recursiveprot\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0e3a8c11fb3b..07032c088608 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6264,13 +6264,27 @@ struct cgroup_subsys memory_cgrp_subsys = { * budget is NOT proportional. A cgroup's protection from a sibling * is capped to its own memory.min/low setting. * + * 5. However, to allow protecting recursive subtrees from each other + * without having to declare each individual cgroup's fixed share + * of the ancestor's claim to protection, any unutilized - + * "floating" - protection from up the tree is distributed in + * proportion to each cgroup's *usage*. This makes the protection + * neutral wrt sibling cgroups and lets them compete freely over + * the shared parental protection budget, but it protects the + * subtree as a whole from neighboring subtrees. + * + * Note that 4. and 5. are not in conflict: 4. is about protecting + * against immediate siblings whereas 5. is about protecting against + * neighboring subtrees. */ static unsigned long effective_protection(unsigned long usage, + unsigned long parent_usage, unsigned long setting, unsigned long parent_effective, unsigned long siblings_protected) { unsigned long protected; + unsigned long ep; protected = min(usage, setting); /* @@ -6301,7 +6315,34 @@ static unsigned long effective_protection(unsigned long usage, * protection is always dependent on how memory is actually * consumed among the siblings anyway. */ - return protected; + ep = protected; + + /* + * If the children aren't claiming (all of) the protection + * afforded to them by the parent, distribute the remainder in + * proportion to the (unprotected) memory of each cgroup. That + * way, cgroups that aren't explicitly prioritized wrt each + * other compete freely over the allowance, but they are + * collectively protected from neighboring trees. + * + * We're using unprotected memory for the weight so that if + * some cgroups DO claim explicit protection, we don't protect + * the same bytes twice. + */ + if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)) + return ep; + + if (parent_effective > siblings_protected && usage > protected) { + unsigned long unclaimed; + + unclaimed = parent_effective - siblings_protected; + unclaimed *= usage - protected; + unclaimed /= parent_usage - siblings_protected; + + ep += unclaimed; + } + + return ep; } /** @@ -6321,8 +6362,8 @@ static unsigned long effective_protection(unsigned long usage, enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg) { + unsigned long usage, parent_usage; struct mem_cgroup *parent; - unsigned long usage; if (mem_cgroup_disabled()) return MEMCG_PROT_NONE; @@ -6347,11 +6388,13 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, goto out; } - memcg->memory.emin = effective_protection(usage, + parent_usage = page_counter_read(&parent->memory); + + memcg->memory.emin = effective_protection(usage, parent_usage, memcg->memory.min, READ_ONCE(parent->memory.emin), atomic_long_read(&parent->memory.children_min_usage)); - memcg->memory.elow = effective_protection(usage, + memcg->memory.elow = effective_protection(usage, parent_usage, memcg->memory.low, READ_ONCE(parent->memory.elow), atomic_long_read(&parent->memory.children_low_usage)); -- cgit v1.2.3-71-gd317 From 964b692daf3018ae4b45285c3bcbb2d897cb4b40 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 1 Apr 2020 21:10:38 -0700 Subject: mm/compaction: really limit compact_unevictable_allowed to 0 and 1 The proc file `compact_unevictable_allowed' should allow 0 and 1 only, the `extra*' attribues have been set properly but without proc_dointvec_minmax() as the `proc_handler' the limit will not be enforced. Use proc_dointvec_minmax() as the `proc_handler' to enfoce the valid specified range. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Thomas Gleixner Cc: Luis Chamberlain Cc: Kees Cook Cc: Iurii Zaikin Cc: Mel Gorman Link: http://lkml.kernel.org/r/20200303202054.gsosv7fsx2ma3cic@linutronix.de Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 94638f695e60..cb650bb9da68 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1467,7 +1467,7 @@ static struct ctl_table vm_table[] = { .data = &sysctl_compact_unevictable_allowed, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, -- cgit v1.2.3-71-gd317 From 6923aa0d8c629a7853822626877dcb11f4f1d354 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 1 Apr 2020 21:10:42 -0700 Subject: mm/compaction: Disable compact_unevictable_allowed on RT Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages and compact them by default. On -RT even minor pagefaults are problematic because it may take a few 100us to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 default and issue a warning on RT if it is changed. [bigeasy@linutronix.de: v5] Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Acked-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Thomas Gleixner Cc: Luis Chamberlain Cc: Kees Cook Cc: Iurii Zaikin Cc: Vlastimil Babka Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de Signed-off-by: Linus Torvalds --- Documentation/admin-guide/sysctl/vm.rst | 3 +++ kernel/sysctl.c | 29 ++++++++++++++++++++++++++++- mm/compaction.c | 4 ++++ 3 files changed, 35 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 64aeee1009ca..0329a4d3fa9e 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -128,6 +128,9 @@ allowed to examine the unevictable lru (mlocked pages) for pages to compact. This should be used on systems where stalls for minor page faults are an acceptable trade for large contiguous free memory. Set to 0 to prevent compaction from moving pages that are unevictable. Default value is 1. +On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due +to compaction, which would block the task from becomming active until the fault +is resolved. dirty_background_bytes diff --git a/kernel/sysctl.c b/kernel/sysctl.c index cb650bb9da68..8a176d8727a3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -212,6 +212,11 @@ static int proc_do_cad_pid(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); static int proc_taint(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#ifdef CONFIG_COMPACTION +static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table, + int write, void __user *buffer, + size_t *lenp, loff_t *ppos); +#endif #endif #ifdef CONFIG_PRINTK @@ -1467,7 +1472,7 @@ static struct ctl_table vm_table[] = { .data = &sysctl_compact_unevictable_allowed, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec_minmax, + .proc_handler = proc_dointvec_minmax_warn_RT_change, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, @@ -2555,6 +2560,28 @@ int proc_dointvec(struct ctl_table *table, int write, return do_proc_dointvec(table, write, buffer, lenp, ppos, NULL, NULL); } +#ifdef CONFIG_COMPACTION +static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table, + int write, void __user *buffer, + size_t *lenp, loff_t *ppos) +{ + int ret, old; + + if (!IS_ENABLED(CONFIG_PREEMPT_RT) || !write) + return proc_dointvec_minmax(table, write, buffer, lenp, ppos); + + old = *(int *)table->data; + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret) + return ret; + if (old != *(int *)table->data) + pr_warn_once("sysctl attribute %s changed by %s[%d]\n", + table->procname, current->comm, + task_pid_nr(current)); + return ret; +} +#endif + /** * proc_douintvec - read a vector of unsigned integers * @table: the sysctl table diff --git a/mm/compaction.c b/mm/compaction.c index 07947387244a..c589ead54fb3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1594,7 +1594,11 @@ typedef enum { * Allow userspace to control policy on scanning the unevictable LRU for * compactable pages. */ +#ifdef CONFIG_PREEMPT_RT +int sysctl_compact_unevictable_allowed __read_mostly = 0; +#else int sysctl_compact_unevictable_allowed __read_mostly = 1; +#endif static inline void update_fast_start_pfn(struct compact_control *cc, unsigned long pfn) -- cgit v1.2.3-71-gd317 From 8e99cf91b99bb30e16727f10ad6828741c0e992f Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 1 Apr 2020 22:44:46 -0400 Subject: tracing: Do not allocate buffer in trace_find_next_entry() in atomic When dumping out the trace data in latency format, a check is made to peek at the next event to compare its timestamp to the current one, and if the delta is of a greater size, it will add a marker showing so. But to do this, it needs to save the current event otherwise peeking at the next event will remove the current event. To save the event, a temp buffer is used, and if the event is bigger than the temp buffer, the temp buffer is freed and a bigger buffer is allocated. This allocation is a problem when called in atomic context. The only way this gets called via atomic context is via ftrace_dump(). Thus, use a static buffer of 128 bytes (which covers most events), and if the event is bigger than that, simply return NULL. The callers of trace_find_next_entry() need to handle a NULL case, as that's what would happen if the allocation failed. Link: https://lore.kernel.org/r/20200326091256.GR11705@shao2-debian Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry") Reported-by: kernel test robot Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 6519b7afc499..8d2b98812625 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3472,6 +3472,9 @@ __find_next_entry(struct trace_iterator *iter, int *ent_cpu, return next; } +#define STATIC_TEMP_BUF_SIZE 128 +static char static_temp_buf[STATIC_TEMP_BUF_SIZE]; + /* Find the next real entry, without updating the iterator itself */ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter, int *ent_cpu, u64 *ent_ts) @@ -3480,13 +3483,26 @@ struct trace_entry *trace_find_next_entry(struct trace_iterator *iter, int ent_size = iter->ent_size; struct trace_entry *entry; + /* + * If called from ftrace_dump(), then the iter->temp buffer + * will be the static_temp_buf and not created from kmalloc. + * If the entry size is greater than the buffer, we can + * not save it. Just return NULL in that case. This is only + * used to add markers when two consecutive events' time + * stamps have a large delta. See trace_print_lat_context() + */ + if (iter->temp == static_temp_buf && + STATIC_TEMP_BUF_SIZE < ent_size) + return NULL; + /* * The __find_next_entry() may call peek_next_entry(), which may * call ring_buffer_peek() that may make the contents of iter->ent * undefined. Need to copy iter->ent now. */ if (iter->ent && iter->ent != iter->temp) { - if (!iter->temp || iter->temp_size < iter->ent_size) { + if ((!iter->temp || iter->temp_size < iter->ent_size) && + !WARN_ON_ONCE(iter->temp == static_temp_buf)) { kfree(iter->temp); iter->temp = kmalloc(iter->ent_size, GFP_KERNEL); if (!iter->temp) @@ -9203,6 +9219,9 @@ void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) /* Simulate the iterator */ trace_init_global_iter(&iter); + /* Can not use kmalloc for iter.temp */ + iter.temp = static_temp_buf; + iter.temp_size = STATIC_TEMP_BUF_SIZE; for_each_tracing_cpu(cpu) { atomic_inc(&per_cpu_ptr(iter.array_buffer->data, cpu)->disabled); -- cgit v1.2.3-71-gd317 From 2b729fe7f3e9478a21a336231daf35768e7cf37b Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Fri, 3 Apr 2020 11:32:13 -0400 Subject: Revert "cpuset: Make cpuset hotplug synchronous" This reverts commit a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") as it may deadlock with cpu hotplug path. Link: http://lkml.kernel.org/r/F0388D99-84D7-453B-9B6B-EEFF0E7BE4CC@lca.pw Signed-off-by: Tejun Heo Reported-by: Qian Cai Cc: Prateek Sood --- include/linux/cpuset.h | 3 +++ kernel/cgroup/cpuset.c | 31 ++++++++++++------------------- kernel/power/process.c | 2 ++ 3 files changed, 17 insertions(+), 19 deletions(-) (limited to 'kernel') diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h index cede4cb98b78..04c20de66afc 100644 --- a/include/linux/cpuset.h +++ b/include/linux/cpuset.h @@ -54,6 +54,7 @@ extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); extern void cpuset_update_active_cpus(void); +extern void cpuset_wait_for_hotplug(void); extern void cpuset_read_lock(void); extern void cpuset_read_unlock(void); extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask); @@ -175,6 +176,8 @@ static inline void cpuset_update_active_cpus(void) partition_sched_domains(1, NULL, NULL); } +static inline void cpuset_wait_for_hotplug(void) { } + static inline void cpuset_read_lock(void) { } static inline void cpuset_read_unlock(void) { } diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index cafd4d2ff882..58f5073acff7 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -3101,7 +3101,7 @@ update_tasks: } /** - * cpuset_hotplug - handle CPU/memory hotunplug for a cpuset + * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset * * This function is called after either CPU or memory configuration has * changed and updates cpuset accordingly. The top_cpuset is always @@ -3116,7 +3116,7 @@ update_tasks: * Note that CPU offlining during suspend is ignored. We don't modify * cpusets across suspend/resume cycles at all. */ -static void cpuset_hotplug(bool use_cpu_hp_lock) +static void cpuset_hotplug_workfn(struct work_struct *work) { static cpumask_t new_cpus; static nodemask_t new_mems; @@ -3201,32 +3201,25 @@ static void cpuset_hotplug(bool use_cpu_hp_lock) /* rebuild sched domains if cpus_allowed has changed */ if (cpus_updated || force_rebuild) { force_rebuild = false; - if (use_cpu_hp_lock) - rebuild_sched_domains(); - else { - /* Acquiring cpu_hotplug_lock is not required. - * When cpuset_hotplug() is called in hotplug path, - * cpu_hotplug_lock is held by the hotplug context - * which is waiting for cpuhp_thread_fun to indicate - * completion of callback. - */ - percpu_down_write(&cpuset_rwsem); - rebuild_sched_domains_locked(); - percpu_up_write(&cpuset_rwsem); - } + rebuild_sched_domains(); } free_cpumasks(NULL, ptmp); } -static void cpuset_hotplug_workfn(struct work_struct *work) +void cpuset_update_active_cpus(void) { - cpuset_hotplug(true); + /* + * We're inside cpu hotplug critical region which usually nests + * inside cgroup synchronization. Bounce actual hotplug processing + * to a work item to avoid reverse locking order. + */ + schedule_work(&cpuset_hotplug_work); } -void cpuset_update_active_cpus(void) +void cpuset_wait_for_hotplug(void) { - cpuset_hotplug(false); + flush_work(&cpuset_hotplug_work); } /* diff --git a/kernel/power/process.c b/kernel/power/process.c index 08f7019357ee..4b6a54da7e65 100644 --- a/kernel/power/process.c +++ b/kernel/power/process.c @@ -204,6 +204,8 @@ void thaw_processes(void) __usermodehelper_set_disable_depth(UMH_FREEZING); thaw_workqueues(); + cpuset_wait_for_hotplug(); + read_lock(&tasklist_lock); for_each_process_thread(g, p) { /* No other threads should have PF_SUSPEND_TASK set */ -- cgit v1.2.3-71-gd317 From 0c05b9bdbfe52ad9b391a28dd26f047715627e0c Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Mon, 30 Mar 2020 10:06:15 -0400 Subject: docs: cgroup-v1: Document the cpuset_v2_mode mount option The cpuset in cgroup v1 accepts a special "cpuset_v2_mode" mount option that make cpuset.cpus and cpuset.mems behave more like those in cgroup v2. Document it to make other people more aware of this feature that can be useful in some circumstances. Signed-off-by: Waiman Long Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v1/cpusets.rst | 11 +++++++++++ kernel/cgroup/cpuset.c | 8 ++++++-- 2 files changed, 17 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index 86a6ae995d54..7ade3abd342a 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -223,6 +223,17 @@ cpu_online_mask using a CPU hotplug notifier, and the mems file automatically tracks the value of node_states[N_MEMORY]--i.e., nodes with memory--using the cpuset_track_online_nodes() hook. +The cpuset.effective_cpus and cpuset.effective_mems files are +normally read-only copies of cpuset.cpus and cpuset.mems files +respectively. If the cpuset cgroup filesystem is mounted with the +special "cpuset_v2_mode" option, the behavior of these files will become +similar to the corresponding files in cpuset v2. In other words, hotplug +events will not change cpuset.cpus and cpuset.mems. Those events will +only affect cpuset.effective_cpus and cpuset.effective_mems which show +the actual cpus and memory nodes that are currently used by this cpuset. +See Documentation/admin-guide/cgroup-v2.rst for more information about +cpuset v2 behavior. + 1.4 What are exclusive cpusets ? -------------------------------- diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 58f5073acff7..729d3a5c772e 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -358,8 +358,12 @@ static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn); static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq); /* - * Cgroup v2 behavior is used when on default hierarchy or the - * cgroup_v2_mode flag is set. + * Cgroup v2 behavior is used on the "cpus" and "mems" control files when + * on default hierarchy or when the cpuset_v2_mode flag is set by mounting + * the v1 cpuset cgroup filesystem with the "cpuset_v2_mode" mount option. + * With v2 behavior, "cpus" and "mems" are always what the users have + * requested and won't be changed by hotplug events. Only the effective + * cpus or mems will be affected. */ static inline bool is_in_v2_mode(void) { -- cgit v1.2.3-71-gd317 From bf37da98c51825c90432d340e135cced37a7460d Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Thu, 12 Mar 2020 16:55:07 -0700 Subject: rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common() The rcu_nmi_enter_common() function can be invoked both in interrupt and NMI handlers. If it is invoked from process context (as opposed to userspace or idle context) on a nohz_full CPU, it might acquire the CPU's leaf rcu_node structure's ->lock. Because this lock is held only with interrupts disabled, this is safe from an interrupt handler, but doing so from an NMI handler can result in self-deadlock. This commit therefore adds "irq" to the "if" condition so as to only acquire the ->lock from irq handlers or process context, never from an NMI handler. Fixes: 5b14557b073c ("rcu: Avoid tick_dep_set_cpu() misordering") Reported-by: Thomas Gleixner Signed-off-by: Paul E. McKenney Reviewed-by: Joel Fernandes (Google) Cc: # 5.5.x --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 550193a9ce76..2c17859233db 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -825,7 +825,7 @@ static __always_inline void rcu_nmi_enter_common(bool irq) rcu_cleanup_after_idle(); incby = 1; - } else if (tick_nohz_full_cpu(rdp->cpu) && + } else if (irq && tick_nohz_full_cpu(rdp->cpu) && rdp->dynticks_nmi_nesting == DYNTICK_IRQ_NONIDLE && READ_ONCE(rdp->rcu_urgent_qs) && !READ_ONCE(rdp->rcu_forced_tick)) { -- cgit v1.2.3-71-gd317 From 88a77559cc06f4a80e956dbf2da2a33c5f18c0af Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 6 Apr 2020 13:58:34 +0200 Subject: PM / sleep: move SNAPSHOT_SET_SWAP_AREA handling into a helper Move the handling of the SNAPSHOT_SET_SWAP_AREA ioctl from the main ioctl helper into a helper function. Signed-off-by: Christoph Hellwig Signed-off-by: Rafael J. Wysocki --- kernel/power/user.c | 57 +++++++++++++++++++++++++++-------------------------- 1 file changed, 29 insertions(+), 28 deletions(-) (limited to 'kernel') diff --git a/kernel/power/user.c b/kernel/power/user.c index ef90eb1fb86e..0cb555f526e4 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -196,6 +196,34 @@ unlock: return res; } +static int snapshot_set_swap_area(struct snapshot_data *data, + void __user *argp) +{ + struct resume_swap_area swap_area; + sector_t offset; + dev_t swdev; + + if (swsusp_swap_in_use()) + return -EPERM; + if (copy_from_user(&swap_area, argp, sizeof(swap_area))) + return -EFAULT; + + /* + * User space encodes device types as two-byte values, + * so we need to recode them + */ + swdev = new_decode_dev(swap_area.dev); + if (!swdev) { + data->swap = -1; + return -EINVAL; + } + offset = swap_area.offset; + data->swap = swap_type_of(swdev, offset, NULL); + if (data->swap < 0) + return -ENODEV; + return 0; +} + static long snapshot_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) { @@ -351,34 +379,7 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd, break; case SNAPSHOT_SET_SWAP_AREA: - if (swsusp_swap_in_use()) { - error = -EPERM; - } else { - struct resume_swap_area swap_area; - dev_t swdev; - - error = copy_from_user(&swap_area, (void __user *)arg, - sizeof(struct resume_swap_area)); - if (error) { - error = -EFAULT; - break; - } - - /* - * User space encodes device types as two-byte values, - * so we need to recode them - */ - swdev = new_decode_dev(swap_area.dev); - if (swdev) { - offset = swap_area.offset; - data->swap = swap_type_of(swdev, offset, NULL); - if (data->swap < 0) - error = -ENODEV; - } else { - data->swap = -1; - error = -EINVAL; - } - } + error = snapshot_set_swap_area(data, (void __user *)arg); break; default: -- cgit v1.2.3-71-gd317 From 0f5c4c6e0e9874952e2950465a8859782437b465 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig Date: Mon, 6 Apr 2020 13:58:35 +0200 Subject: PM / sleep: handle the compat case in snapshot_set_swap_area() Use in_compat_syscall to copy directly from the 32-bit ABI structure. Signed-off-by: Christoph Hellwig Signed-off-by: Rafael J. Wysocki --- kernel/power/user.c | 54 ++++++++++++++++++++++------------------------------- 1 file changed, 22 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/kernel/power/user.c b/kernel/power/user.c index 0cb555f526e4..7959449765d9 100644 --- a/kernel/power/user.c +++ b/kernel/power/user.c @@ -196,28 +196,44 @@ unlock: return res; } +struct compat_resume_swap_area { + compat_loff_t offset; + u32 dev; +} __packed; + static int snapshot_set_swap_area(struct snapshot_data *data, void __user *argp) { - struct resume_swap_area swap_area; sector_t offset; dev_t swdev; if (swsusp_swap_in_use()) return -EPERM; - if (copy_from_user(&swap_area, argp, sizeof(swap_area))) - return -EFAULT; + + if (in_compat_syscall()) { + struct compat_resume_swap_area swap_area; + + if (copy_from_user(&swap_area, argp, sizeof(swap_area))) + return -EFAULT; + swdev = new_decode_dev(swap_area.dev); + offset = swap_area.offset; + } else { + struct resume_swap_area swap_area; + + if (copy_from_user(&swap_area, argp, sizeof(swap_area))) + return -EFAULT; + swdev = new_decode_dev(swap_area.dev); + offset = swap_area.offset; + } /* * User space encodes device types as two-byte values, * so we need to recode them */ - swdev = new_decode_dev(swap_area.dev); if (!swdev) { data->swap = -1; return -EINVAL; } - offset = swap_area.offset; data->swap = swap_type_of(swdev, offset, NULL); if (data->swap < 0) return -ENODEV; @@ -394,12 +410,6 @@ static long snapshot_ioctl(struct file *filp, unsigned int cmd, } #ifdef CONFIG_COMPAT - -struct compat_resume_swap_area { - compat_loff_t offset; - u32 dev; -} __packed; - static long snapshot_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -410,33 +420,13 @@ snapshot_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) case SNAPSHOT_AVAIL_SWAP_SIZE: case SNAPSHOT_ALLOC_SWAP_PAGE: case SNAPSHOT_CREATE_IMAGE: + case SNAPSHOT_SET_SWAP_AREA: return snapshot_ioctl(file, cmd, (unsigned long) compat_ptr(arg)); - - case SNAPSHOT_SET_SWAP_AREA: { - struct compat_resume_swap_area __user *u_swap_area = - compat_ptr(arg); - struct resume_swap_area swap_area; - mm_segment_t old_fs; - int err; - - err = get_user(swap_area.offset, &u_swap_area->offset); - err |= get_user(swap_area.dev, &u_swap_area->dev); - if (err) - return -EFAULT; - old_fs = get_fs(); - set_fs(KERNEL_DS); - err = snapshot_ioctl(file, SNAPSHOT_SET_SWAP_AREA, - (unsigned long) &swap_area); - set_fs(old_fs); - return err; - } - default: return snapshot_ioctl(file, cmd, arg); } } - #endif /* CONFIG_COMPAT */ static const struct file_operations snapshot_fops = { -- cgit v1.2.3-71-gd317 From 0ac16296ffc638f5163f9aeeeb1fe447268e449f Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Fri, 3 Apr 2020 16:07:34 +0800 Subject: bpf: Fix a typo "inacitve" -> "inactive" There is a typo in struct bpf_lru_list's next_inactive_rotation description, thus fix s/inacitve/inactive/. Signed-off-by: Qiujun Huang Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/1585901254-30377-1-git-send-email-hqjagain@gmail.com --- kernel/bpf/bpf_lru_list.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/bpf_lru_list.h b/kernel/bpf/bpf_lru_list.h index f02504640e18..6b12f06ee18c 100644 --- a/kernel/bpf/bpf_lru_list.h +++ b/kernel/bpf/bpf_lru_list.h @@ -30,7 +30,7 @@ struct bpf_lru_node { struct bpf_lru_list { struct list_head lists[NR_BPF_LRU_LIST_T]; unsigned int counts[NR_BPF_LRU_LIST_COUNT]; - /* The next inacitve list rotation starts from here */ + /* The next inactive list rotation starts from here */ struct list_head *next_inactive_rotation; raw_spinlock_t lock ____cacheline_aligned_in_smp; -- cgit v1.2.3-71-gd317 From b801f1e22c23c259d6a2c955efddd20370de19a6 Mon Sep 17 00:00:00 2001 From: "Michael Kerrisk (man-pages)" Date: Fri, 3 Apr 2020 14:11:39 +0200 Subject: time/namespace: Fix time_for_children symlink Looking at the contents of the /proc/PID/ns/time_for_children symlink shows an anomaly: $ ls -l /proc/self/ns/* |awk '{print $9, $10, $11}' ... /proc/self/ns/pid -> pid:[4026531836] /proc/self/ns/pid_for_children -> pid:[4026531836] /proc/self/ns/time -> time:[4026531834] /proc/self/ns/time_for_children -> time_for_children:[4026531834] /proc/self/ns/user -> user:[4026531837] ... The reference for 'time_for_children' should be a 'time' namespace, just as the reference for 'pid_for_children' is a 'pid' namespace. In other words, the above time_for_children link should read: /proc/self/ns/time_for_children -> time:[4026531834] Fixes: 769071ac9f20 ("ns: Introduce Time Namespace") Signed-off-by: Michael Kerrisk Signed-off-by: Thomas Gleixner Reviewed-by: Dmitry Safonov Acked-by: Christian Brauner Acked-by: Andrei Vagin Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/a2418c48-ed80-3afe-116e-6611cb799557@gmail.com --- kernel/time/namespace.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c index e6ba064ce773..3b30288793fe 100644 --- a/kernel/time/namespace.c +++ b/kernel/time/namespace.c @@ -447,6 +447,7 @@ const struct proc_ns_operations timens_operations = { const struct proc_ns_operations timens_for_children_operations = { .name = "time_for_children", + .real_ns_name = "time", .type = CLONE_NEWTIME, .get = timens_for_children_get, .put = timens_put, -- cgit v1.2.3-71-gd317 From eeec26d5da8248ea4e240b8795bb4364213d3247 Mon Sep 17 00:00:00 2001 From: Dmitry Safonov Date: Mon, 6 Apr 2020 18:13:42 +0100 Subject: time/namespace: Add max_time_namespaces ucount Michael noticed that userns limit for number of time namespaces is missing. Furthermore, time namespace introduced UCOUNT_TIME_NAMESPACES, but didn't introduce an array member in user_table[]. It would make array's initialisation OOB write, but by luck the user_table array has an excessive empty member (all accesses to the array are limited with UCOUNT_COUNTS - so it silently reuses the last free member. Fixes user-visible regression: max_inotify_instances by reason of the missing UCOUNT_ENTRY() has limited max number of namespaces instead of the number of inotify instances. Fixes: 769071ac9f20 ("ns: Introduce Time Namespace") Reported-by: Michael Kerrisk (man-pages) Signed-off-by: Dmitry Safonov Signed-off-by: Thomas Gleixner Acked-by: Andrei Vagin Acked-by: Vincenzo Frascino Cc: stable@kernel.org Link: https://lkml.kernel.org/r/20200406171342.128733-1-dima@arista.com --- Documentation/admin-guide/sysctl/user.rst | 6 ++++++ kernel/ucount.c | 1 + 2 files changed, 7 insertions(+) (limited to 'kernel') diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst index 650eaa03f15e..c45824589339 100644 --- a/Documentation/admin-guide/sysctl/user.rst +++ b/Documentation/admin-guide/sysctl/user.rst @@ -65,6 +65,12 @@ max_pid_namespaces The maximum number of pid namespaces that any user in the current user namespace may create. +max_time_namespaces +=================== + + The maximum number of time namespaces that any user in the current + user namespace may create. + max_user_namespaces =================== diff --git a/kernel/ucount.c b/kernel/ucount.c index a53cc2b4179c..29c60eb4ec9b 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -69,6 +69,7 @@ static struct ctl_table user_table[] = { UCOUNT_ENTRY("max_net_namespaces"), UCOUNT_ENTRY("max_mnt_namespaces"), UCOUNT_ENTRY("max_cgroup_namespaces"), + UCOUNT_ENTRY("max_time_namespaces"), #ifdef CONFIG_INOTIFY_USER UCOUNT_ENTRY("max_inotify_instances"), UCOUNT_ENTRY("max_inotify_watches"), -- cgit v1.2.3-71-gd317 From 93949bb21b52aae924fd26f9775c8655dd9e885e Mon Sep 17 00:00:00 2001 From: Li Xinhai Date: Mon, 6 Apr 2020 20:03:33 -0700 Subject: mm: don't prepare anon_vma if vma has VM_WIPEONFORK Patch series "mm: Fix misuse of parent anon_vma in dup_mmap path". This patchset fixes the misuse of parenet anon_vma, which mainly caused by child vma's vm_next and vm_prev are left same as its parent after duplicate vma. Finally, code reached parent vma's neighbor by referring pointer of child vma and executed wrong logic. The first two patches fix relevant issues, and the third patch sets vm_next and vm_prev to NULL when duplicate vma to prevent potential misuse in future. Effects of the first bug is that causes rmap code to check both parent and child's page table, although a page couldn't be mapped by both parent and child, because child vma has WIPEONFORK so all pages mapped by child are 'new' and not relevant to parent. Effects of the second bug is that the relationship of anon_vma of parent and child are totallyconvoluted. It would cause 'son', 'grandson', ..., etc, to share 'parent' anon_vma, which disobey the design rule of reusing anon_vma (the rule to be followed is that reusing should among vma of same process, and vma should not gone through fork). So, both issues should cause unnecessary rmap walking and have unexpected complexity. These two issues would not be directly visible, I used debugging code to check the anon_vma pointers of parent and child when inspecting the suspicious implementation of issue #2, then find the problem. This patch (of 3): In dup_mmap(), anon_vma_prepare() is called for vma has VM_WIPEONFORK, and parameter 'tmp' (i.e., the new vma of child) has same ->vm_next and ->vm_prev as its parent vma. That allows anon_vma used by parent been mistakenly shared by child (find_mergeable_anon_vma() will do this reuse work). Besides this issue, call anon_vma_prepare() should be avoided because we don't copy page for this vma. Preparing anon_vma will be handled during fault. Fixes: d2cd9ede6e19 ("mm,fork: introduce MADV_WIPEONFORK") Signed-off-by: Li Xinhai Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: Rik van Riel Cc: Kirill A. Shutemov Cc: Johannes Weiner Link: http://lkml.kernel.org/r/1581150928-3214-2-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds --- kernel/fork.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index d2a967bf85d5..d3a5915d3cfe 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -553,10 +553,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, if (retval) goto fail_nomem_anon_vma_fork; if (tmp->vm_flags & VM_WIPEONFORK) { - /* VM_WIPEONFORK gets a clean slate in the child. */ + /* + * VM_WIPEONFORK gets a clean slate in the child. + * Don't prepare anon_vma until fault since we don't + * copy page for current vma. + */ tmp->anon_vma = NULL; - if (anon_vma_prepare(tmp)) - goto fail_nomem_anon_vma_fork; } else if (anon_vma_fork(tmp, mpnt)) goto fail_nomem_anon_vma_fork; tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT); -- cgit v1.2.3-71-gd317 From e39a4b332df69f6850efda5ce6d932ef14a39042 Mon Sep 17 00:00:00 2001 From: Li Xinhai Date: Mon, 6 Apr 2020 20:03:39 -0700 Subject: mm: set vm_next and vm_prev to NULL in vm_area_dup() Set ->vm_next and ->vm_prev to NULL to prevent potential misuse from the new duplicated vma. Currently, only in fork path there are misuse for handling anon_vma. No other bugs been revealed with this patch applied. Signed-off-by: Li Xinhai Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Johannes Weiner Cc: Rik van Riel Link: http://lkml.kernel.org/r/1581150928-3214-4-git-send-email-lixinhai.lxh@gmail.com Signed-off-by: Linus Torvalds --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index d3a5915d3cfe..4385f3d639f2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -361,6 +361,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) if (new) { *new = *orig; INIT_LIST_HEAD(&new->anon_vma_chain); + new->vm_next = new->vm_prev = NULL; } return new; } @@ -562,7 +563,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, } else if (anon_vma_fork(tmp, mpnt)) goto fail_nomem_anon_vma_fork; tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT); - tmp->vm_next = tmp->vm_prev = NULL; file = tmp->vm_file; if (file) { struct inode *inode = file_inode(file); -- cgit v1.2.3-71-gd317 From 3122e80efc0faf4a2accba7a46c7ed795edbfded Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:47 -0700 Subject: mm/vma: make vma_is_accessible() available for general use Lets move vma_is_accessible() helper to include/linux/mm.h which makes it available for general use. While here, this replaces all remaining open encodings for VMA access check with vma_is_accessible(). Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Geert Uytterhoeven Acked-by: Guo Ren Acked-by: Vlastimil Babka Cc: Guo Ren Cc: Geert Uytterhoeven Cc: Ralf Baechle Cc: Paul Burton Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Michael Ellerman Cc: Yoshinori Sato Cc: Rich Felker Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Steven Rostedt Cc: Mel Gorman Cc: Alexander Viro Cc: "Aneesh Kumar K.V" Cc: Arnaldo Carvalho de Melo Cc: Arnd Bergmann Cc: Nick Piggin Cc: Paul Mackerras Cc: Will Deacon Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/csky/mm/fault.c | 2 +- arch/m68k/mm/fault.c | 2 +- arch/mips/mm/fault.c | 2 +- arch/powerpc/mm/fault.c | 2 +- arch/sh/mm/fault.c | 2 +- arch/x86/mm/fault.c | 2 +- include/linux/mm.h | 6 ++++++ kernel/sched/fair.c | 2 +- mm/gup.c | 2 +- mm/memory.c | 5 ----- mm/mempolicy.c | 3 +-- mm/mmap.c | 5 ++--- 12 files changed, 17 insertions(+), 18 deletions(-) (limited to 'kernel') diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index d3c61b83e195..a6e8230b6fbf 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -141,7 +141,7 @@ good_area: if (!(vma->vm_flags & VM_WRITE)) goto bad_area; } else { - if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) + if (!vma_is_accessible(vma)) goto bad_area; } diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index f7afb9897966..0c4a21a685d5 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -125,7 +125,7 @@ good_area: case 1: /* read, present */ goto acc_err; case 0: /* read, not present */ - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) + if (!vma_is_accessible(vma)) goto acc_err; } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 4a0eafe3d932..fb048ba2b91d 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -142,7 +142,7 @@ good_area: goto bad_area; } } else { - if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))) + if (!vma_is_accessible(vma)) goto bad_area; } } diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index d15f0f0ee806..84af6c8eecf7 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -314,7 +314,7 @@ static bool access_error(bool is_write, bool is_exec, return false; } - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return true; /* * We should ideally do the vma pkey access check here. But in the diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index 13ee4d20e622..5f23d7907597 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -355,7 +355,7 @@ static inline int access_error(int error_code, struct vm_area_struct *vma) return 1; /* read, not present: */ - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return 1; return 0; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 859519f5b342..a51df516b87b 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1222,7 +1222,7 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) return 1; /* read, not present: */ - if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + if (unlikely(!vma_is_accessible(vma))) return 1; return 0; diff --git a/include/linux/mm.h b/include/linux/mm.h index 7dd5c4ccbf85..be49e371e4b5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -629,6 +629,12 @@ static inline bool vma_is_foreign(struct vm_area_struct *vma) return false; } + +static inline bool vma_is_accessible(struct vm_area_struct *vma) +{ + return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC); +} + #ifdef CONFIG_SHMEM /* * The vma_is_shmem is not inline because it is used only by slow diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d7fb20adabeb..1ea3dddafe69 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2799,7 +2799,7 @@ static void task_numa_work(struct callback_head *work) * Skip inaccessible VMAs to avoid any confusion between * PROT_NONE and NUMA hinting ptes */ - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) + if (!vma_is_accessible(vma)) continue; do { diff --git a/mm/gup.c b/mm/gup.c index da3e03185144..4d505c994623 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1416,7 +1416,7 @@ long populate_vma_page_range(struct vm_area_struct *vma, * We want mlock to succeed for regions that have any permissions * other than PROT_NONE. */ - if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) + if (vma_is_accessible(vma)) gup_flags |= FOLL_FORCE; /* diff --git a/mm/memory.c b/mm/memory.c index 586271f3efc6..d2a353c345ad 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3964,11 +3964,6 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) return VM_FAULT_FALLBACK; } -static inline bool vma_is_accessible(struct vm_area_struct *vma) -{ - return vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE); -} - static vm_fault_t create_huge_pud(struct vm_fault *vmf) { #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 5fb427aed612..b36926ba02e2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -678,8 +678,7 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end, if (flags & MPOL_MF_LAZY) { /* Similar to task_numa_work, skip inaccessible VMAs */ - if (!is_vm_hugetlb_page(vma) && - (vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)) && + if (!is_vm_hugetlb_page(vma) && vma_is_accessible(vma) && !(vma->vm_flags & VM_MIXEDMAP)) change_prot_numa(vma, start, endvma); return 1; diff --git a/mm/mmap.c b/mm/mmap.c index 94ae18398c59..aa09429d5888 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2358,8 +2358,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) gap_addr = TASK_SIZE; next = vma->vm_next; - if (next && next->vm_start < gap_addr && - (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) { + if (next && next->vm_start < gap_addr && vma_is_accessible(next)) { if (!(next->vm_flags & VM_GROWSUP)) return -ENOMEM; /* Check that both stack segments have the same anon_vma? */ @@ -2440,7 +2439,7 @@ int expand_downwards(struct vm_area_struct *vma, prev = vma->vm_prev; /* Check that both stack segments have the same anon_vma? */ if (prev && !(prev->vm_flags & VM_GROWSDOWN) && - (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) { + vma_is_accessible(prev)) { if (address - prev->vm_end < stack_guard_gap) return -ENOMEM; } -- cgit v1.2.3-71-gd317 From 03911132aafd6727e59408e497c049402a5a11fa Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Mon, 6 Apr 2020 20:03:51 -0700 Subject: mm/vma: replace all remaining open encodings with is_vm_hugetlb_page() This replaces all remaining open encodings with is_vm_hugetlb_page(). Signed-off-by: Anshuman Khandual Signed-off-by: Andrew Morton Acked-by: Vlastimil Babka Cc: Paul Mackerras Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Alexander Viro Cc: Will Deacon Cc: "Aneesh Kumar K.V" Cc: Nick Piggin Cc: Peter Zijlstra Cc: Arnd Bergmann Cc: Ingo Molnar Cc: Arnaldo Carvalho de Melo Cc: Andy Lutomirski Cc: Dave Hansen Cc: Geert Uytterhoeven Cc: Guo Ren Cc: Mel Gorman Cc: Paul Burton Cc: Paul Mackerras Cc: Ralf Baechle Cc: Rich Felker Cc: Steven Rostedt Cc: Thomas Gleixner Cc: Yoshinori Sato Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds --- arch/powerpc/kvm/e500_mmu_host.c | 2 +- fs/binfmt_elf.c | 3 ++- include/asm-generic/tlb.h | 3 ++- kernel/events/core.c | 3 ++- 4 files changed, 7 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 425d13806645..df9989cf7ba3 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -422,7 +422,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, break; } } else if (vma && hva >= vma->vm_start && - (vma->vm_flags & VM_HUGETLB)) { + is_vm_hugetlb_page(vma)) { unsigned long psize = vma_kernel_pagesize(vma); tsize = (gtlbe->mas1 & MAS1_TSIZE_MASK) >> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index f4713ea76e82..1eb63867e266 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include @@ -1317,7 +1318,7 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma, } /* Hugetlb memory check */ - if (vma->vm_flags & VM_HUGETLB) { + if (is_vm_hugetlb_page(vma)) { if ((vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_SHARED)) goto whole; if (!(vma->vm_flags & VM_SHARED) && FILTER(HUGETLB_PRIVATE)) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index f391f6b500b4..3f1649a8cf55 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -13,6 +13,7 @@ #include #include +#include #include #include #include @@ -398,7 +399,7 @@ tlb_update_vma_flags(struct mmu_gather *tlb, struct vm_area_struct *vma) * We rely on tlb_end_vma() to issue a flush, such that when we reset * these values the batch is empty. */ - tlb->vma_huge = !!(vma->vm_flags & VM_HUGETLB); + tlb->vma_huge = is_vm_hugetlb_page(vma); tlb->vma_exec = !!(vma->vm_flags & VM_EXEC); } diff --git a/kernel/events/core.c b/kernel/events/core.c index 81e6d80cb219..55e44417f66d 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -7973,7 +7974,7 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event) flags |= MAP_EXECUTABLE; if (vma->vm_flags & VM_LOCKED) flags |= MAP_LOCKED; - if (vma->vm_flags & VM_HUGETLB) + if (is_vm_hugetlb_page(vma)) flags |= MAP_HUGETLB; if (file) { -- cgit v1.2.3-71-gd317 From d919b33dafb3e222d23671b2bb06d119aede625f Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Mon, 6 Apr 2020 20:09:01 -0700 Subject: proc: faster open/read/close with "permanent" files Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include #include #include #include #include #include #include #include const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by: kbuild test robot Reported-by: Dan Carpenter Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Cc: Al Viro Cc: Joe Perches Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by: Linus Torvalds --- fs/proc/cpuinfo.c | 1 + fs/proc/generic.c | 31 +++++++- fs/proc/inode.c | 187 +++++++++++++++++++++++++++++++++++------------- fs/proc/internal.h | 6 ++ fs/proc/kmsg.c | 1 + fs/proc/stat.c | 1 + include/linux/proc_fs.h | 17 ++++- ipc/util.c | 1 + kernel/module.c | 1 + mm/slab_common.c | 1 + mm/swapfile.c | 1 + 11 files changed, 194 insertions(+), 54 deletions(-) (limited to 'kernel') diff --git a/fs/proc/cpuinfo.c b/fs/proc/cpuinfo.c index c1dea9b8222e..d0989a443c77 100644 --- a/fs/proc/cpuinfo.c +++ b/fs/proc/cpuinfo.c @@ -17,6 +17,7 @@ static int cpuinfo_open(struct inode *inode, struct file *file) } static const struct proc_ops cpuinfo_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = cpuinfo_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 3faed94e4b65..4ed6dabdf6ff 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -531,6 +531,12 @@ struct proc_dir_entry *proc_create_reg(const char *name, umode_t mode, return p; } +static inline void pde_set_flags(struct proc_dir_entry *pde) +{ + if (pde->proc_ops->proc_flags & PROC_ENTRY_PERMANENT) + pde->flags |= PROC_ENTRY_PERMANENT; +} + struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, struct proc_dir_entry *parent, const struct proc_ops *proc_ops, void *data) @@ -541,6 +547,7 @@ struct proc_dir_entry *proc_create_data(const char *name, umode_t mode, if (!p) return NULL; p->proc_ops = proc_ops; + pde_set_flags(p); return proc_register(parent, p); } EXPORT_SYMBOL(proc_create_data); @@ -572,6 +579,7 @@ static int proc_seq_release(struct inode *inode, struct file *file) } static const struct proc_ops proc_seq_ops = { + /* not permanent -- can call into arbitrary seq_operations */ .proc_open = proc_seq_open, .proc_read = seq_read, .proc_lseek = seq_lseek, @@ -602,6 +610,7 @@ static int proc_single_open(struct inode *inode, struct file *file) } static const struct proc_ops proc_single_ops = { + /* not permanent -- can call into arbitrary ->single_show */ .proc_open = proc_single_open, .proc_read = seq_read, .proc_lseek = seq_lseek, @@ -662,9 +671,13 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent) de = pde_subdir_find(parent, fn, len); if (de) { - rb_erase(&de->subdir_node, &parent->subdir); - if (S_ISDIR(de->mode)) { - parent->nlink--; + if (unlikely(pde_is_permanent(de))) { + WARN(1, "removing permanent /proc entry '%s'", de->name); + de = NULL; + } else { + rb_erase(&de->subdir_node, &parent->subdir); + if (S_ISDIR(de->mode)) + parent->nlink--; } } write_unlock(&proc_subdir_lock); @@ -700,12 +713,24 @@ int remove_proc_subtree(const char *name, struct proc_dir_entry *parent) write_unlock(&proc_subdir_lock); return -ENOENT; } + if (unlikely(pde_is_permanent(root))) { + write_unlock(&proc_subdir_lock); + WARN(1, "removing permanent /proc entry '%s/%s'", + root->parent->name, root->name); + return -EINVAL; + } rb_erase(&root->subdir_node, &parent->subdir); de = root; while (1) { next = pde_subdir_first(de); if (next) { + if (unlikely(pde_is_permanent(root))) { + write_unlock(&proc_subdir_lock); + WARN(1, "removing permanent /proc entry '%s/%s'", + next->parent->name, next->name); + return -EINVAL; + } rb_erase(&next->subdir_node, &de->subdir); de = next; continue; diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 05d31c464bee..fb4cace9ea41 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -259,135 +259,204 @@ void proc_entry_rundown(struct proc_dir_entry *de) spin_unlock(&de->pde_unload_lock); } +static loff_t pde_lseek(struct proc_dir_entry *pde, struct file *file, loff_t offset, int whence) +{ + typeof_member(struct proc_ops, proc_lseek) lseek; + + lseek = pde->proc_ops->proc_lseek; + if (!lseek) + lseek = default_llseek; + return lseek(file, offset, whence); +} + static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence) { struct proc_dir_entry *pde = PDE(file_inode(file)); loff_t rv = -EINVAL; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_lseek) lseek; - lseek = pde->proc_ops->proc_lseek; - if (!lseek) - lseek = default_llseek; - rv = lseek(file, offset, whence); + if (pde_is_permanent(pde)) { + return pde_lseek(pde, file, offset, whence); + } else if (use_pde(pde)) { + rv = pde_lseek(pde, file, offset, whence); unuse_pde(pde); } return rv; } +static ssize_t pde_read(struct proc_dir_entry *pde, struct file *file, char __user *buf, size_t count, loff_t *ppos) +{ + typeof_member(struct proc_ops, proc_read) read; + + read = pde->proc_ops->proc_read; + if (read) + return read(file, buf, count, ppos); + return -EIO; +} + static ssize_t proc_reg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { struct proc_dir_entry *pde = PDE(file_inode(file)); ssize_t rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_read) read; - read = pde->proc_ops->proc_read; - if (read) - rv = read(file, buf, count, ppos); + if (pde_is_permanent(pde)) { + return pde_read(pde, file, buf, count, ppos); + } else if (use_pde(pde)) { + rv = pde_read(pde, file, buf, count, ppos); unuse_pde(pde); } return rv; } +static ssize_t pde_write(struct proc_dir_entry *pde, struct file *file, const char __user *buf, size_t count, loff_t *ppos) +{ + typeof_member(struct proc_ops, proc_write) write; + + write = pde->proc_ops->proc_write; + if (write) + return write(file, buf, count, ppos); + return -EIO; +} + static ssize_t proc_reg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { struct proc_dir_entry *pde = PDE(file_inode(file)); ssize_t rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_write) write; - write = pde->proc_ops->proc_write; - if (write) - rv = write(file, buf, count, ppos); + if (pde_is_permanent(pde)) { + return pde_write(pde, file, buf, count, ppos); + } else if (use_pde(pde)) { + rv = pde_write(pde, file, buf, count, ppos); unuse_pde(pde); } return rv; } +static __poll_t pde_poll(struct proc_dir_entry *pde, struct file *file, struct poll_table_struct *pts) +{ + typeof_member(struct proc_ops, proc_poll) poll; + + poll = pde->proc_ops->proc_poll; + if (poll) + return poll(file, pts); + return DEFAULT_POLLMASK; +} + static __poll_t proc_reg_poll(struct file *file, struct poll_table_struct *pts) { struct proc_dir_entry *pde = PDE(file_inode(file)); __poll_t rv = DEFAULT_POLLMASK; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_poll) poll; - poll = pde->proc_ops->proc_poll; - if (poll) - rv = poll(file, pts); + if (pde_is_permanent(pde)) { + return pde_poll(pde, file, pts); + } else if (use_pde(pde)) { + rv = pde_poll(pde, file, pts); unuse_pde(pde); } return rv; } +static long pde_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg) +{ + typeof_member(struct proc_ops, proc_ioctl) ioctl; + + ioctl = pde->proc_ops->proc_ioctl; + if (ioctl) + return ioctl(file, cmd, arg); + return -ENOTTY; +} + static long proc_reg_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct proc_dir_entry *pde = PDE(file_inode(file)); long rv = -ENOTTY; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_ioctl) ioctl; - ioctl = pde->proc_ops->proc_ioctl; - if (ioctl) - rv = ioctl(file, cmd, arg); + if (pde_is_permanent(pde)) { + return pde_ioctl(pde, file, cmd, arg); + } else if (use_pde(pde)) { + rv = pde_ioctl(pde, file, cmd, arg); unuse_pde(pde); } return rv; } #ifdef CONFIG_COMPAT +static long pde_compat_ioctl(struct proc_dir_entry *pde, struct file *file, unsigned int cmd, unsigned long arg) +{ + typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl; + + compat_ioctl = pde->proc_ops->proc_compat_ioctl; + if (compat_ioctl) + return compat_ioctl(file, cmd, arg); + return -ENOTTY; +} + static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct proc_dir_entry *pde = PDE(file_inode(file)); long rv = -ENOTTY; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_compat_ioctl) compat_ioctl; - - compat_ioctl = pde->proc_ops->proc_compat_ioctl; - if (compat_ioctl) - rv = compat_ioctl(file, cmd, arg); + if (pde_is_permanent(pde)) { + return pde_compat_ioctl(pde, file, cmd, arg); + } else if (use_pde(pde)) { + rv = pde_compat_ioctl(pde, file, cmd, arg); unuse_pde(pde); } return rv; } #endif +static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma) +{ + typeof_member(struct proc_ops, proc_mmap) mmap; + + mmap = pde->proc_ops->proc_mmap; + if (mmap) + return mmap(file, vma); + return -EIO; +} + static int proc_reg_mmap(struct file *file, struct vm_area_struct *vma) { struct proc_dir_entry *pde = PDE(file_inode(file)); int rv = -EIO; - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_mmap) mmap; - mmap = pde->proc_ops->proc_mmap; - if (mmap) - rv = mmap(file, vma); + if (pde_is_permanent(pde)) { + return pde_mmap(pde, file, vma); + } else if (use_pde(pde)) { + rv = pde_mmap(pde, file, vma); unuse_pde(pde); } return rv; } static unsigned long -proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr, +pde_get_unmapped_area(struct proc_dir_entry *pde, struct file *file, unsigned long orig_addr, unsigned long len, unsigned long pgoff, unsigned long flags) { - struct proc_dir_entry *pde = PDE(file_inode(file)); - unsigned long rv = -EIO; - - if (use_pde(pde)) { - typeof_member(struct proc_ops, proc_get_unmapped_area) get_area; + typeof_member(struct proc_ops, proc_get_unmapped_area) get_area; - get_area = pde->proc_ops->proc_get_unmapped_area; + get_area = pde->proc_ops->proc_get_unmapped_area; #ifdef CONFIG_MMU - if (!get_area) - get_area = current->mm->get_unmapped_area; + if (!get_area) + get_area = current->mm->get_unmapped_area; #endif + if (get_area) + return get_area(file, orig_addr, len, pgoff, flags); + return orig_addr; +} + +static unsigned long +proc_reg_get_unmapped_area(struct file *file, unsigned long orig_addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + struct proc_dir_entry *pde = PDE(file_inode(file)); + unsigned long rv = -EIO; - if (get_area) - rv = get_area(file, orig_addr, len, pgoff, flags); - else - rv = orig_addr; + if (pde_is_permanent(pde)) { + return pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags); + } else if (use_pde(pde)) { + rv = pde_get_unmapped_area(pde, file, orig_addr, len, pgoff, flags); unuse_pde(pde); } return rv; @@ -401,6 +470,13 @@ static int proc_reg_open(struct inode *inode, struct file *file) typeof_member(struct proc_ops, proc_release) release; struct pde_opener *pdeo; + if (pde_is_permanent(pde)) { + open = pde->proc_ops->proc_open; + if (open) + rv = open(inode, file); + return rv; + } + /* * Ensure that * 1) PDE's ->release hook will be called no matter what @@ -450,6 +526,17 @@ static int proc_reg_release(struct inode *inode, struct file *file) { struct proc_dir_entry *pde = PDE(inode); struct pde_opener *pdeo; + + if (pde_is_permanent(pde)) { + typeof_member(struct proc_ops, proc_release) release; + + release = pde->proc_ops->proc_release; + if (release) { + return release(inode, file); + } + return 0; + } + spin_lock(&pde->pde_unload_lock); list_for_each_entry(pdeo, &pde->pde_openers, lh) { if (pdeo->file == file) { diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 9e294f0290e5..917cc85e3466 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -61,6 +61,7 @@ struct proc_dir_entry { struct rb_node subdir_node; char *name; umode_t mode; + u8 flags; u8 namelen; char inline_name[]; } __randomize_layout; @@ -73,6 +74,11 @@ struct proc_dir_entry { 0) #define SIZEOF_PDE_INLINE_NAME (SIZEOF_PDE - sizeof(struct proc_dir_entry)) +static inline bool pde_is_permanent(const struct proc_dir_entry *pde) +{ + return pde->flags & PROC_ENTRY_PERMANENT; +} + extern struct kmem_cache *proc_dir_entry_cache; void pde_free(struct proc_dir_entry *pde); diff --git a/fs/proc/kmsg.c b/fs/proc/kmsg.c index ec1b7d2fb773..b38ad552887f 100644 --- a/fs/proc/kmsg.c +++ b/fs/proc/kmsg.c @@ -50,6 +50,7 @@ static __poll_t kmsg_poll(struct file *file, poll_table *wait) static const struct proc_ops kmsg_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_read = kmsg_read, .proc_poll = kmsg_poll, .proc_open = kmsg_open, diff --git a/fs/proc/stat.c b/fs/proc/stat.c index 0449edf460f5..46b3293015fe 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -224,6 +224,7 @@ static int stat_open(struct inode *inode, struct file *file) } static const struct proc_ops stat_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = stat_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index 40a7982b7285..45c05fd9c99d 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -5,6 +5,7 @@ #ifndef _LINUX_PROC_FS_H #define _LINUX_PROC_FS_H +#include #include #include @@ -12,7 +13,21 @@ struct proc_dir_entry; struct seq_file; struct seq_operations; +enum { + /* + * All /proc entries using this ->proc_ops instance are never removed. + * + * If in doubt, ignore this flag. + */ +#ifdef MODULE + PROC_ENTRY_PERMANENT = 0U, +#else + PROC_ENTRY_PERMANENT = 1U << 0, +#endif +}; + struct proc_ops { + unsigned int proc_flags; int (*proc_open)(struct inode *, struct file *); ssize_t (*proc_read)(struct file *, char __user *, size_t, loff_t *); ssize_t (*proc_write)(struct file *, const char __user *, size_t, loff_t *); @@ -25,7 +40,7 @@ struct proc_ops { #endif int (*proc_mmap)(struct file *, struct vm_area_struct *); unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); -}; +} __randomize_layout; #ifdef CONFIG_PROC_FS diff --git a/ipc/util.c b/ipc/util.c index fe61df53775a..97638eb2d7cb 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -885,6 +885,7 @@ static int sysvipc_proc_release(struct inode *inode, struct file *file) } static const struct proc_ops sysvipc_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = sysvipc_proc_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/kernel/module.c b/kernel/module.c index 33569a01d6e1..3447f3b74870 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -4355,6 +4355,7 @@ static int modules_open(struct inode *inode, struct file *file) } static const struct proc_ops modules_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = modules_open, .proc_read = seq_read, .proc_lseek = seq_lseek, diff --git a/mm/slab_common.c b/mm/slab_common.c index 5282f881d2f5..93ec4a574d8d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1581,6 +1581,7 @@ static int slabinfo_open(struct inode *inode, struct file *file) } static const struct proc_ops slabinfo_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = slabinfo_open, .proc_read = seq_read, .proc_write = slabinfo_write, diff --git a/mm/swapfile.c b/mm/swapfile.c index 273a923c275c..5871a2aa86a5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2797,6 +2797,7 @@ static int swaps_open(struct inode *inode, struct file *file) } static const struct proc_ops swaps_proc_ops = { + .proc_flags = PROC_ENTRY_PERMANENT, .proc_open = swaps_open, .proc_read = seq_read, .proc_lseek = seq_lseek, -- cgit v1.2.3-71-gd317 From 63174f61dfaef58dc0e813eaf6602636794f8942 Mon Sep 17 00:00:00 2001 From: Nathan Chancellor Date: Mon, 6 Apr 2020 20:09:27 -0700 Subject: kernel/extable.c: use address-of operator on section symbols Clang warns: ../kernel/extable.c:37:52: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) { ^ 1 warning generated. These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by: Nick Desaulniers Signed-off-by: Nathan Chancellor Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Link: https://github.com/ClangBuiltLinux/linux/issues/892 Link: http://lkml.kernel.org/r/20200219202036.45702-1-natechancellor@gmail.com Signed-off-by: Linus Torvalds --- kernel/extable.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/extable.c b/kernel/extable.c index 7681f87e89dd..b0ea5eb0c3b4 100644 --- a/kernel/extable.c +++ b/kernel/extable.c @@ -34,7 +34,8 @@ u32 __initdata __visible main_extable_sort_needed = 1; /* Sort the kernel's built-in exception table */ void __init sort_main_extable(void) { - if (main_extable_sort_needed && __stop___ex_table > __start___ex_table) { + if (main_extable_sort_needed && + &__stop___ex_table > &__start___ex_table) { pr_notice("Sorting __ex_table...\n"); sort_extable(__start___ex_table, __stop___ex_table); } -- cgit v1.2.3-71-gd317 From 889b3c1245de48ed0cacf7aebb25c489d3e4a3e9 Mon Sep 17 00:00:00 2001 From: Masahiro Yamada Date: Mon, 6 Apr 2020 20:09:33 -0700 Subject: compiler: remove CONFIG_OPTIMIZE_INLINING entirely Commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") made this always-on option. We released v5.4 and v5.5 including that commit. Remove the CONFIG option and clean up the code now. Signed-off-by: Masahiro Yamada Signed-off-by: Andrew Morton Reviewed-by: Miguel Ojeda Reviewed-by: Nathan Chancellor Cc: Arnd Bergmann Cc: Borislav Petkov Cc: David Miller Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org Signed-off-by: Linus Torvalds --- arch/x86/configs/i386_defconfig | 1 - arch/x86/configs/x86_64_defconfig | 1 - include/linux/compiler_types.h | 11 +---------- kernel/configs/tiny.config | 1 - lib/Kconfig.debug | 12 ------------ 5 files changed, 1 insertion(+), 25 deletions(-) (limited to 'kernel') diff --git a/arch/x86/configs/i386_defconfig b/arch/x86/configs/i386_defconfig index ab8b30cb978e..550904591e94 100644 --- a/arch/x86/configs/i386_defconfig +++ b/arch/x86/configs/i386_defconfig @@ -285,7 +285,6 @@ CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_BOOT_PARAMS=y -CONFIG_OPTIMIZE_INLINING=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y CONFIG_SECURITY_SELINUX=y diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig index 2d196cb49084..614961009075 100644 --- a/arch/x86/configs/x86_64_defconfig +++ b/arch/x86/configs/x86_64_defconfig @@ -282,7 +282,6 @@ CONFIG_EARLY_PRINTK_DBGP=y CONFIG_DEBUG_STACKOVERFLOW=y # CONFIG_DEBUG_RODATA_TEST is not set CONFIG_DEBUG_BOOT_PARAMS=y -CONFIG_OPTIMIZE_INLINING=y CONFIG_UNWINDER_ORC=y CONFIG_SECURITY=y CONFIG_SECURITY_NETWORK=y diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index 72393a8c1a6c..e970f97a7fcb 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -129,22 +129,13 @@ struct ftrace_likely_data { #define __compiler_offsetof(a, b) __builtin_offsetof(a, b) /* - * Force always-inline if the user requests it so via the .config. * Prefer gnu_inline, so that extern inline functions do not emit an * externally visible function. This makes extern inline behave as per gnu89 * semantics rather than c99. This prevents multiple symbol definition errors * of extern inline functions at link time. * A lot of inline functions can cause havoc with function tracing. - * Do not use __always_inline here, since currently it expands to inline again - * (which would break users of __always_inline). */ -#if !defined(CONFIG_OPTIMIZE_INLINING) -#define inline inline __attribute__((__always_inline__)) __gnu_inline \ - __inline_maybe_unused notrace -#else -#define inline inline __gnu_inline \ - __inline_maybe_unused notrace -#endif +#define inline inline __gnu_inline __inline_maybe_unused notrace /* * gcc provides both __inline__ and __inline as alternate spellings of diff --git a/kernel/configs/tiny.config b/kernel/configs/tiny.config index 7fa0c4ae6394..8a44b93da0f3 100644 --- a/kernel/configs/tiny.config +++ b/kernel/configs/tiny.config @@ -6,7 +6,6 @@ CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set -CONFIG_OPTIMIZE_INLINING=y # CONFIG_SLAB is not set # CONFIG_SLUB is not set CONFIG_SLOB=y diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d1398cef3b18..7f9a89847b65 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -305,18 +305,6 @@ config HEADERS_INSTALL user-space program samples. It is also needed by some features such as uapi header sanity checks. -config OPTIMIZE_INLINING - def_bool y - help - This option determines if the kernel forces gcc to inline the functions - developers have marked 'inline'. Doing so takes away freedom from gcc to - do what it thinks is best, which is desirable for the gcc 3.x series of - compilers. The gcc 4.x series have a rewritten inlining algorithm and - enabling this option will generate a smaller kernel there. Hopefully - this algorithm is so good that allowing gcc 4.x and above to make the - decision will become the default in the future. Until then this option - is there to test gcc for this. - config DEBUG_SECTION_MISMATCH bool "Enable full Section mismatch analysis" help -- cgit v1.2.3-71-gd317 From 0bd476e6c67190b5eb7b6e105c8db8ff61103281 Mon Sep 17 00:00:00 2001 From: Will Deacon Date: Mon, 6 Apr 2020 20:11:43 -0700 Subject: kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol() kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to modules despite having no in-tree users and being wide open to abuse by out-of-tree modules that can use them as a method to invoke arbitrary non-exported kernel functions. Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol(). Signed-off-by: Will Deacon Signed-off-by: Andrew Morton Reviewed-by: Greg Kroah-Hartman Reviewed-by: Christoph Hellwig Reviewed-by: Masami Hiramatsu Reviewed-by: Quentin Perret Acked-by: Alexei Starovoitov Cc: Thomas Gleixner Cc: Frederic Weisbecker Cc: K.Prasad Cc: Miroslav Benes Cc: Petr Mladek Cc: Joe Lawrence Cc: Mathieu Desnoyers Link: http://lkml.kernel.org/r/20200221114404.14641-4-will@kernel.org Signed-off-by: Linus Torvalds --- kernel/kallsyms.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c index a9b3f660dee7..16c8c605f4b0 100644 --- a/kernel/kallsyms.c +++ b/kernel/kallsyms.c @@ -175,7 +175,6 @@ unsigned long kallsyms_lookup_name(const char *name) } return module_kallsyms_lookup_name(name); } -EXPORT_SYMBOL_GPL(kallsyms_lookup_name); int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *, unsigned long), @@ -194,7 +193,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *, } return module_kallsyms_on_each_symbol(fn, data); } -EXPORT_SYMBOL_GPL(kallsyms_on_each_symbol); static unsigned long get_symbol_pos(unsigned long addr, unsigned long *symbolsize, -- cgit v1.2.3-71-gd317 From 06d4f8152a01034b36591885321da54eb77398c1 Mon Sep 17 00:00:00 2001 From: Qiujun Huang Date: Mon, 6 Apr 2020 20:11:49 -0700 Subject: kernel/kmod.c: fix a typo "assuems" -> "assumes" There is a typo in comment. Fix it. s/assuems/assumes/ Signed-off-by: Qiujun Huang Signed-off-by: Andrew Morton Acked-by: Luis Chamberlain Link: http://lkml.kernel.org/r/1585891029-6450-1-git-send-email-hqjagain@gmail.com Signed-off-by: Linus Torvalds --- kernel/kmod.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/kmod.c b/kernel/kmod.c index bc6addd9152b..8b2b311afa95 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -35,7 +35,7 @@ * (u64) THREAD_SIZE * 8UL); * * If you need less than 50 threads would mean we're dealing with systems - * smaller than 3200 pages. This assuems you are capable of having ~13M memory, + * smaller than 3200 pages. This assumes you are capable of having ~13M memory, * and this would only be an be an upper limit, after which the OOM killer * would take effect. Systems like these are very unlikely if modules are * enabled. -- cgit v1.2.3-71-gd317 From fba4168edecdd2781bcd83cb131977ec1157f87c Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:11:52 -0700 Subject: gcov: gcc_4_7: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Acked-by: Peter Oberparleiter Link: http://lkml.kernel.org/r/20200213152241.GA877@embeddedor Signed-off-by: Linus Torvalds --- kernel/gcov/gcc_4_7.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/gcov/gcc_4_7.c b/kernel/gcov/gcc_4_7.c index ec37563674d6..908fdf5098c3 100644 --- a/kernel/gcov/gcc_4_7.c +++ b/kernel/gcov/gcc_4_7.c @@ -68,7 +68,7 @@ struct gcov_fn_info { unsigned int ident; unsigned int lineno_checksum; unsigned int cfg_checksum; - struct gcov_ctr_info ctrs[0]; + struct gcov_ctr_info ctrs[]; }; /** -- cgit v1.2.3-71-gd317 From 7ff87182d156e90b59455aac8503da512a1c5dba Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:11:55 -0700 Subject: gcov: gcc_3_4: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Cc: Peter Oberparleiter Link: http://lkml.kernel.org/r/20200302224501.GA14175@embeddedor Signed-off-by: Linus Torvalds --- kernel/gcov/gcc_3_4.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/gcov/gcc_3_4.c b/kernel/gcov/gcc_3_4.c index 801ee4b0b969..acb83558e5df 100644 --- a/kernel/gcov/gcc_3_4.c +++ b/kernel/gcov/gcc_3_4.c @@ -38,7 +38,7 @@ static struct gcov_info *gcov_info_head; struct gcov_fn_info { unsigned int ident; unsigned int checksum; - unsigned int n_ctrs[0]; + unsigned int n_ctrs[]; }; /** @@ -78,7 +78,7 @@ struct gcov_info { unsigned int n_functions; const struct gcov_fn_info *functions; unsigned int ctr_mask; - struct gcov_ctr_info counts[0]; + struct gcov_ctr_info counts[]; }; /** @@ -352,7 +352,7 @@ struct gcov_iterator { unsigned int count; int num_types; - struct type_info type_info[0]; + struct type_info type_info[]; }; static struct gcov_fn_info *get_func(struct gcov_iterator *iter) -- cgit v1.2.3-71-gd317 From 6524d79413f4ffd28480f541ef68f9ea199b2e0c Mon Sep 17 00:00:00 2001 From: "Gustavo A. R. Silva" Date: Mon, 6 Apr 2020 20:11:58 -0700 Subject: kernel/gcov/fs.c: replace zero-length array with flexible-array member The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva Signed-off-by: Andrew Morton Cc: Peter Oberparleiter Link: http://lkml.kernel.org/r/20200302224851.GA26467@embeddedor Signed-off-by: Linus Torvalds --- kernel/gcov/fs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/gcov/fs.c b/kernel/gcov/fs.c index e5eb5ea7ea59..5e891c3c2d93 100644 --- a/kernel/gcov/fs.c +++ b/kernel/gcov/fs.c @@ -58,7 +58,7 @@ struct gcov_node { struct dentry *dentry; struct dentry **links; int num_loaded; - char name[0]; + char name[]; }; static const char objtree[] = OBJTREE; -- cgit v1.2.3-71-gd317 From 0f538e3e712a517bd351607de50cd298102c7c08 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Tue, 7 Apr 2020 17:46:43 +0200 Subject: ucount: Make sure ucounts in /proc/sys/user don't regress again Commit 769071ac9f20 "ns: Introduce Time Namespace" broke reporting of inotify ucounts (max_inotify_instances, max_inotify_watches) in /proc/sys/user because it has added UCOUNT_TIME_NAMESPACES into enum ucount_type but didn't properly update reporting in kernel/ucount.c:setup_userns_sysctls(). This problem got fixed in commit eeec26d5da82 "time/namespace: Add max_time_namespaces ucount". Add BUILD_BUG_ON to catch a similar problem in the future. Signed-off-by: Jan Kara Signed-off-by: Thomas Gleixner Acked-by: Andrei Vagin Link: https://lkml.kernel.org/r/20200407154643.10102-1-jack@suse.cz --- kernel/ucount.c | 2 ++ 1 file changed, 2 insertions(+) (limited to 'kernel') diff --git a/kernel/ucount.c b/kernel/ucount.c index 29c60eb4ec9b..11b1596e2542 100644 --- a/kernel/ucount.c +++ b/kernel/ucount.c @@ -82,6 +82,8 @@ bool setup_userns_sysctls(struct user_namespace *ns) { #ifdef CONFIG_SYSCTL struct ctl_table *tbl; + + BUILD_BUG_ON(ARRAY_SIZE(user_table) != UCOUNT_COUNTS + 1); setup_sysctl_set(&ns->set, &set_root, set_is_seen); tbl = kmemdup(user_table, sizeof(user_table), GFP_KERNEL); if (tbl) { -- cgit v1.2.3-71-gd317 From 33238c50451596be86db1505ab65fee5172844d0 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Wed, 18 Mar 2020 20:33:37 +0100 Subject: perf/core: Fix event cgroup tracking Song reports that installing cgroup events is broken since: db0503e4f675 ("perf/core: Optimize perf_install_in_event()") The problem being that cgroup events try to track cpuctx->cgrp even for disabled events, which is pointless and actively harmful since the above commit. Rework the code to have explicit enable/disable hooks for cgroup events, such that we can limit cgroup tracking to active events. More specifically, since the above commit disabled events are no longer added to their context from the 'right' CPU, and we can't access things like the current cgroup for a remote CPU. Cc: # v5.5+ Fixes: db0503e4f675 ("perf/core: Optimize perf_install_in_event()") Reported-by: Song Liu Tested-by: Song Liu Reviewed-by: Song Liu Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200318193337.GB20760@hirez.programming.kicks-ass.net --- kernel/events/core.c | 70 ++++++++++++++++++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 27 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 55e44417f66d..7afd0b503406 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -983,16 +983,10 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now) event->shadow_ctx_time = now - t->timestamp; } -/* - * Update cpuctx->cgrp so that it is set when first cgroup event is added and - * cleared when last cgroup event is removed. - */ static inline void -list_update_cgroup_event(struct perf_event *event, - struct perf_event_context *ctx, bool add) +perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ctx) { struct perf_cpu_context *cpuctx; - struct list_head *cpuctx_entry; if (!is_cgroup_event(event)) return; @@ -1009,28 +1003,41 @@ list_update_cgroup_event(struct perf_event *event, * because if the first would mismatch, the second would not try again * and we would leave cpuctx->cgrp unset. */ - if (add && !cpuctx->cgrp) { + if (ctx->is_active && !cpuctx->cgrp) { struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx); if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) cpuctx->cgrp = cgrp; } - if (add && ctx->nr_cgroups++) + if (ctx->nr_cgroups++) return; - else if (!add && --ctx->nr_cgroups) + + list_add(&cpuctx->cgrp_cpuctx_entry, + per_cpu_ptr(&cgrp_cpuctx_list, event->cpu)); +} + +static inline void +perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx) +{ + struct perf_cpu_context *cpuctx; + + if (!is_cgroup_event(event)) return; - /* no cgroup running */ - if (!add) + /* + * Because cgroup events are always per-cpu events, + * @ctx == &cpuctx->ctx. + */ + cpuctx = container_of(ctx, struct perf_cpu_context, ctx); + + if (--ctx->nr_cgroups) + return; + + if (ctx->is_active && cpuctx->cgrp) cpuctx->cgrp = NULL; - cpuctx_entry = &cpuctx->cgrp_cpuctx_entry; - if (add) - list_add(cpuctx_entry, - per_cpu_ptr(&cgrp_cpuctx_list, event->cpu)); - else - list_del(cpuctx_entry); + list_del(&cpuctx->cgrp_cpuctx_entry); } #else /* !CONFIG_CGROUP_PERF */ @@ -1096,11 +1103,14 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event) } static inline void -list_update_cgroup_event(struct perf_event *event, - struct perf_event_context *ctx, bool add) +perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ctx) { } +static inline void +perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx) +{ +} #endif /* @@ -1791,13 +1801,14 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx) add_event_to_groups(event, ctx); } - list_update_cgroup_event(event, ctx, true); - list_add_rcu(&event->event_entry, &ctx->event_list); ctx->nr_events++; if (event->attr.inherit_stat) ctx->nr_stat++; + if (event->state > PERF_EVENT_STATE_OFF) + perf_cgroup_event_enable(event, ctx); + ctx->generation++; } @@ -1976,8 +1987,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) event->attach_state &= ~PERF_ATTACH_CONTEXT; - list_update_cgroup_event(event, ctx, false); - ctx->nr_events--; if (event->attr.inherit_stat) ctx->nr_stat--; @@ -1994,8 +2003,10 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) * of error state is by explicit re-enabling * of the event */ - if (event->state > PERF_EVENT_STATE_OFF) + if (event->state > PERF_EVENT_STATE_OFF) { + perf_cgroup_event_disable(event, ctx); perf_event_set_state(event, PERF_EVENT_STATE_OFF); + } ctx->generation++; } @@ -2226,6 +2237,7 @@ event_sched_out(struct perf_event *event, if (READ_ONCE(event->pending_disable) >= 0) { WRITE_ONCE(event->pending_disable, -1); + perf_cgroup_event_disable(event, ctx); state = PERF_EVENT_STATE_OFF; } perf_event_set_state(event, state); @@ -2363,6 +2375,7 @@ static void __perf_event_disable(struct perf_event *event, event_sched_out(event, cpuctx, ctx); perf_event_set_state(event, PERF_EVENT_STATE_OFF); + perf_cgroup_event_disable(event, ctx); } /* @@ -2746,7 +2759,7 @@ static int __perf_install_in_context(void *info) } #ifdef CONFIG_CGROUP_PERF - if (is_cgroup_event(event)) { + if (event->state > PERF_EVENT_STATE_OFF && is_cgroup_event(event)) { /* * If the current cgroup doesn't match the event's * cgroup, we should not try to schedule it. @@ -2906,6 +2919,7 @@ static void __perf_event_enable(struct perf_event *event, ctx_sched_out(ctx, cpuctx, EVENT_TIME); perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); + perf_cgroup_event_enable(event, ctx); if (!ctx->is_active) return; @@ -3616,8 +3630,10 @@ static int merge_sched_in(struct perf_event *event, void *data) } if (event->state == PERF_EVENT_STATE_INACTIVE) { - if (event->attr.pinned) + if (event->attr.pinned) { + perf_cgroup_event_disable(event, ctx); perf_event_set_state(event, PERF_EVENT_STATE_ERROR); + } *can_add_hw = 0; ctx->rotate_necessary = 1; -- cgit v1.2.3-71-gd317 From 24fb6b8e7c2280000966e3f2c9c8069a538518eb Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Sat, 21 Mar 2020 09:43:31 -0700 Subject: perf/cgroup: Correct indirection in perf_less_group_idx() The void* in perf_less_group_idx() is to a member in the array which points at a perf_event*, as such it is a perf_event**. Reported-By: John Sperbeck Fixes: 6eef8a7116de ("perf/core: Use min_heap in visit_groups_merge()") Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200321164331.107337-1-irogers@google.com --- kernel/events/core.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 7afd0b503406..26de0a5ee887 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -3522,7 +3522,8 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx, static bool perf_less_group_idx(const void *l, const void *r) { - const struct perf_event *le = l, *re = r; + const struct perf_event *le = *(const struct perf_event **)l; + const struct perf_event *re = *(const struct perf_event **)r; return le->group_index < re->group_index; } -- cgit v1.2.3-71-gd317 From d3296fb372bf7497b0e5d0478c4e7a677ec6f6e9 Mon Sep 17 00:00:00 2001 From: Jiri Olsa Date: Tue, 7 Apr 2020 16:14:27 +0200 Subject: perf/core: Disable page faults when getting phys address We hit following warning when running tests on kernel compiled with CONFIG_DEBUG_ATOMIC_SLEEP=y: WARNING: CPU: 19 PID: 4472 at mm/gup.c:2381 __get_user_pages_fast+0x1a4/0x200 CPU: 19 PID: 4472 Comm: dummy Not tainted 5.6.0-rc6+ #3 RIP: 0010:__get_user_pages_fast+0x1a4/0x200 ... Call Trace: perf_prepare_sample+0xff1/0x1d90 perf_event_output_forward+0xe8/0x210 __perf_event_overflow+0x11a/0x310 __intel_pmu_pebs_event+0x657/0x850 intel_pmu_drain_pebs_nhm+0x7de/0x11d0 handle_pmi_common+0x1b2/0x650 intel_pmu_handle_irq+0x17b/0x370 perf_event_nmi_handler+0x40/0x60 nmi_handle+0x192/0x590 default_do_nmi+0x6d/0x150 do_nmi+0x2f9/0x3c0 nmi+0x8e/0xd7 While __get_user_pages_fast() is IRQ-safe, it calls access_ok(), which warns on: WARN_ON_ONCE(!in_task() && !pagefault_disabled()) Peter suggested disabling page faults around __get_user_pages_fast(), which gets rid of the warning in access_ok() call. Suggested-by: Peter Zijlstra Signed-off-by: Jiri Olsa Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200407141427.3184722-1-jolsa@kernel.org --- kernel/events/core.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index 26de0a5ee887..bc9b98a9af9a 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6934,9 +6934,12 @@ static u64 perf_virt_to_phys(u64 virt) * Try IRQ-safe __get_user_pages_fast first. * If failed, leave phys_addr as 0. */ - if ((current->mm != NULL) && - (__get_user_pages_fast(virt, 1, 0, &p) == 1)) - phys_addr = page_to_phys(p) + virt % PAGE_SIZE; + if (current->mm != NULL) { + pagefault_disable(); + if (__get_user_pages_fast(virt, 1, 0, &p) == 1) + phys_addr = page_to_phys(p) + virt % PAGE_SIZE; + pagefault_enable(); + } if (p) put_page(p); -- cgit v1.2.3-71-gd317 From d76343c6b2b79f5e89c392bc9ce9dabc4c9e90cb Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Mon, 30 Mar 2020 10:01:27 +0100 Subject: sched/fair: Align rq->avg_idle and rq->avg_scan_cost sched/core.c uses update_avg() for rq->avg_idle and sched/fair.c uses an open-coded version (with the exact same decay factor) for rq->avg_scan_cost. On top of that, select_idle_cpu() expects to be able to compare these two fields. The only difference between the two is that rq->avg_scan_cost is computed using a pure division rather than a shift. Turns out it actually matters, first of all because the shifted value can be negative, and the standard has this to say about it: """ The result of E1 >> E2 is E1 right-shifted E2 bit positions. [...] If E1 has a signed type and a negative value, the resulting value is implementation-defined. """ Not only this, but (arithmetic) right shifting a negative value (using 2's complement) is *not* equivalent to dividing it by the corresponding power of 2. Let's look at a few examples: -4 -> 0xF..FC -4 >> 3 -> 0xF..FF == -1 != -4 / 8 -8 -> 0xF..F8 -8 >> 3 -> 0xF..FF == -1 == -8 / 8 -9 -> 0xF..F7 -9 >> 3 -> 0xF..FE == -2 != -9 / 8 Make update_avg() use a division, and export it to the private scheduler header to reuse it where relevant. Note that this still lets compilers use a shift here, but should prevent any unwanted surprise. The disassembly of select_idle_cpu() remains unchanged on arm64, and ttwu_do_wakeup() gains 2 instructions; the diff sort of looks like this: - sub x1, x1, x0 + subs x1, x1, x0 // set condition codes + add x0, x1, #0x7 + csel x0, x0, x1, mi // x0 = x1 < 0 ? x0 : x1 add x0, x3, x0, asr #3 which does the right thing (i.e. gives us the expected result while still using an arithmetic shift) Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200330090127.16294-1-valentin.schneider@arm.com --- kernel/sched/core.c | 6 ------ kernel/sched/fair.c | 7 ++----- kernel/sched/sched.h | 6 ++++++ 3 files changed, 8 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a2694ba82874..f6b329bca0c6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2119,12 +2119,6 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags) return cpu; } -static void update_avg(u64 *avg, u64 sample) -{ - s64 diff = sample - *avg; - *avg += diff >> 3; -} - void sched_set_stop_task(int cpu, struct task_struct *stop) { struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 }; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1ea3dddafe69..fb025e946f83 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6080,8 +6080,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); struct sched_domain *this_sd; u64 avg_cost, avg_idle; - u64 time, cost; - s64 delta; + u64 time; int this = smp_processor_id(); int cpu, nr = INT_MAX; @@ -6119,9 +6118,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t } time = cpu_clock(this) - time; - cost = this_sd->avg_scan_cost; - delta = (s64)(time - cost) / 8; - this_sd->avg_scan_cost += delta; + update_avg(&this_sd->avg_scan_cost, time); return cpu; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0f616bf7bce3..cd008147eccb 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -195,6 +195,12 @@ static inline int task_has_dl_policy(struct task_struct *p) #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) +static inline void update_avg(u64 *avg, u64 sample) +{ + s64 diff = sample - *avg; + *avg += diff / 8; +} + /* * !! For sched_setattr_nocheck() (kernel) only !! * -- cgit v1.2.3-71-gd317 From 26a8b12747c975b33b4a82d62e4a307e1c07f31b Mon Sep 17 00:00:00 2001 From: Huaixin Chang Date: Fri, 27 Mar 2020 11:26:25 +0800 Subject: sched/fair: Fix race between runtime distribution and assignment Currently, there is a potential race between distribute_cfs_runtime() and assign_cfs_rq_runtime(). Race happens when cfs_b->runtime is read, distributes without holding lock and finds out there is not enough runtime to charge against after distribution. Because assign_cfs_rq_runtime() might be called during distribution, and use cfs_b->runtime at the same time. Fibtest is the tool to test this race. Assume all gcfs_rq is throttled and cfs period timer runs, slow threads might run and sleep, returning unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local pool. If all this happens sufficiently quickly, cfs_b->runtime will drop a lot. If runtime distributed is large too, over-use of runtime happens. A runtime over-using by about 70 percent of quota is seen when we test fibtest on a 96-core machine. We run fibtest with 1 fast thread and 95 slow threads in test group, configure 10ms quota for this group and see the CPU usage of fibtest is 17.0%, which is far more than the expected 10%. On a smaller machine with 32 cores, we also run fibtest with 96 threads. CPU usage is more than 12%, which is also more than expected 10%. This shows that on similar workloads, this race do affect CPU bandwidth control. Solve this by holding lock inside distribute_cfs_runtime(). Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Reviewed-by: Ben Segall Signed-off-by: Huaixin Chang Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/ --- kernel/sched/fair.c | 31 +++++++++++-------------------- 1 file changed, 11 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fb025e946f83..95cbd9e7958d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4836,11 +4836,10 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) resched_curr(rq); } -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) +static void distribute_cfs_runtime(struct cfs_bandwidth *cfs_b) { struct cfs_rq *cfs_rq; - u64 runtime; - u64 starting_runtime = remaining; + u64 runtime, remaining = 1; rcu_read_lock(); list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq, @@ -4855,10 +4854,13 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining) /* By the above check, this should never be true */ SCHED_WARN_ON(cfs_rq->runtime_remaining > 0); + raw_spin_lock(&cfs_b->lock); runtime = -cfs_rq->runtime_remaining + 1; - if (runtime > remaining) - runtime = remaining; - remaining -= runtime; + if (runtime > cfs_b->runtime) + runtime = cfs_b->runtime; + cfs_b->runtime -= runtime; + remaining = cfs_b->runtime; + raw_spin_unlock(&cfs_b->lock); cfs_rq->runtime_remaining += runtime; @@ -4873,8 +4875,6 @@ next: break; } rcu_read_unlock(); - - return starting_runtime - remaining; } /* @@ -4885,7 +4885,6 @@ next: */ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags) { - u64 runtime; int throttled; /* no need to continue the timer with no bandwidth constraint */ @@ -4914,24 +4913,17 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u cfs_b->nr_throttled += overrun; /* - * This check is repeated as we are holding onto the new bandwidth while - * we unthrottle. This can potentially race with an unthrottled group - * trying to acquire new bandwidth from the global pool. This can result - * in us over-using our runtime if it is all used during this loop, but - * only by limited amounts in that extreme case. + * This check is repeated as we release cfs_b->lock while we unthrottle. */ while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) { - runtime = cfs_b->runtime; cfs_b->distribute_running = 1; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); /* we can't nest cfs_b->lock while distributing bandwidth */ - runtime = distribute_cfs_runtime(cfs_b, runtime); + distribute_cfs_runtime(cfs_b); raw_spin_lock_irqsave(&cfs_b->lock, flags); cfs_b->distribute_running = 0; throttled = !list_empty(&cfs_b->throttled_cfs_rq); - - lsub_positive(&cfs_b->runtime, runtime); } /* @@ -5065,10 +5057,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) if (!runtime) return; - runtime = distribute_cfs_runtime(cfs_b, runtime); + distribute_cfs_runtime(cfs_b); raw_spin_lock_irqsave(&cfs_b->lock, flags); - lsub_positive(&cfs_b->runtime, runtime); cfs_b->distribute_running = 0; raw_spin_unlock_irqrestore(&cfs_b->lock, flags); } -- cgit v1.2.3-71-gd317 From 111688ca1c4a43a7e482f5401f82c46326b8ed49 Mon Sep 17 00:00:00 2001 From: Aubrey Li Date: Thu, 26 Mar 2020 13:42:29 +0800 Subject: sched/fair: Fix negative imbalance in imbalance calculation A negative imbalance value was observed after imbalance calculation, this happens when the local sched group type is group_fully_busy, and the average load of local group is greater than the selected busiest group. Fix this problem by comparing the average load of the local and busiest group before imbalance calculation formula. Suggested-by: Vincent Guittot Reviewed-by: Phil Auld Reviewed-by: Vincent Guittot Acked-by: Mel Gorman Signed-off-by: Aubrey Li Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/1585201349-70192-1-git-send-email-aubrey.li@intel.com --- kernel/sched/fair.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 95cbd9e7958d..02f323b85b6d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9036,6 +9036,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) / sds->total_capacity; + /* + * If the local group is more loaded than the selected + * busiest group don't try to pull any tasks. + */ + if (local->avg_load >= busiest->avg_load) { + env->imbalance = 0; + return; + } } /* -- cgit v1.2.3-71-gd317 From 62849a9612924a655c67cf6962920544aa5c20db Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Sat, 28 Mar 2020 00:29:59 +0100 Subject: workqueue: Remove the warning in wq_worker_sleeping() The kernel test robot triggered a warning with the following race: task-ctx A interrupt-ctx B worker -> process_one_work() -> work_item() -> schedule(); -> sched_submit_work() -> wq_worker_sleeping() -> ->sleeping = 1 atomic_dec_and_test(nr_running) __schedule(); *interrupt* async_page_fault() -> local_irq_enable(); -> schedule(); -> sched_submit_work() -> wq_worker_sleeping() -> if (WARN_ON(->sleeping)) return -> __schedule() -> sched_update_worker() -> wq_worker_running() -> atomic_inc(nr_running); -> ->sleeping = 0; -> sched_update_worker() -> wq_worker_running() if (!->sleeping) return In this context the warning is pointless everything is fine. An interrupt before wq_worker_sleeping() will perform the ->sleeping assignment (0 -> 1 > 0) twice. An interrupt after wq_worker_sleeping() will trigger the warning and nr_running will be decremented (by A) and incremented once (only by B, A will skip it). This is the case until the ->sleeping is zeroed again in wq_worker_running(). Remove the WARN statement because this condition may happen. Document that preemption around wq_worker_sleeping() needs to be disabled to protect ->sleeping and not just as an optimisation. Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock") Reported-by: kernel test robot Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Cc: Tejun Heo Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian --- kernel/sched/core.c | 3 ++- kernel/workqueue.c | 6 ++++-- 2 files changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f6b329bca0c6..c3d12e3762d4 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4120,7 +4120,8 @@ static inline void sched_submit_work(struct task_struct *tsk) * it wants to wake up a task to maintain concurrency. * As this function is called inside the schedule() context, * we disable preemption to avoid it calling schedule() again - * in the possible wakeup of a kworker. + * in the possible wakeup of a kworker and because wq_worker_sleeping() + * requires it. */ if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { preempt_disable(); diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 3816a18c251e..891ccad5f271 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -858,7 +858,8 @@ void wq_worker_running(struct task_struct *task) * @task: task going to sleep * * This function is called from schedule() when a busy worker is - * going to sleep. + * going to sleep. Preemption needs to be disabled to protect ->sleeping + * assignment. */ void wq_worker_sleeping(struct task_struct *task) { @@ -875,7 +876,8 @@ void wq_worker_sleeping(struct task_struct *task) pool = worker->pool; - if (WARN_ON_ONCE(worker->sleeping)) + /* Return if preempted before wq_worker_running() was reached */ + if (worker->sleeping) return; worker->sleeping = 1; -- cgit v1.2.3-71-gd317 From 275b2f6723ab9173484e1055ae138d4c2dd9d7c5 Mon Sep 17 00:00:00 2001 From: Vincent Donnefort Date: Fri, 20 Mar 2020 13:21:35 +0000 Subject: sched/core: Remove unused rq::last_load_update_tick The following commit: 5e83eafbfd3b ("sched/fair: Remove the rq->cpu_load[] update code") eliminated the last use case for rq->last_load_update_tick, so remove the field as well. Reviewed-by: Dietmar Eggemann Reviewed-by: Vincent Guittot Signed-off-by: Vincent Donnefort Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/1584710495-308969-1-git-send-email-vincent.donnefort@arm.com --- kernel/sched/core.c | 1 - kernel/sched/sched.h | 1 - 2 files changed, 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c3d12e3762d4..3a61a3b8eaa9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6694,7 +6694,6 @@ void __init sched_init(void) rq_attach_root(rq, &def_root_domain); #ifdef CONFIG_NO_HZ_COMMON - rq->last_load_update_tick = jiffies; rq->last_blocked_load_update_tick = jiffies; atomic_set(&rq->nohz_flags, 0); #endif diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index cd008147eccb..db3a57675ccf 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -888,7 +888,6 @@ struct rq { #endif #ifdef CONFIG_NO_HZ_COMMON #ifdef CONFIG_SMP - unsigned long last_load_update_tick; unsigned long last_blocked_load_update_tick; unsigned int has_blocked_load; #endif /* CONFIG_SMP */ -- cgit v1.2.3-71-gd317 From c745a6212c9923eb2253f4229e5d7277ca3d9d8e Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 26 Feb 2020 12:45:41 +0000 Subject: sched/debug: Remove redundant macro define Most printing macros for procfs are defined globally in debug.c, and they are re-defined (to the exact same thing) within proc_sched_show_task(). Get rid of the duplicate defines. Reviewed-by: Qais Yousef Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200226124543.31986-2-valentin.schneider@arm.com --- kernel/sched/debug.c | 12 ------------ 1 file changed, 12 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 8331bc04aea2..4670151eb131 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -868,16 +868,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, SEQ_printf(m, "---------------------------------------------------------" "----------\n"); -#define __P(F) \ - SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)F) -#define P(F) \ - SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)p->F) #define P_SCHEDSTAT(F) \ SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)schedstat_val(p->F)) -#define __PN(F) \ - SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)F)) -#define PN(F) \ - SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)p->F)) #define PN_SCHEDSTAT(F) \ SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(p->F))) @@ -963,11 +955,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P(dl.deadline); } #undef PN_SCHEDSTAT -#undef PN -#undef __PN #undef P_SCHEDSTAT -#undef P -#undef __P { unsigned int this_cpu = raw_smp_processor_id(); -- cgit v1.2.3-71-gd317 From 9e3bf9469c29f7e4e49c5c0d8fecaf8ac57d1fe4 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 26 Feb 2020 12:45:42 +0000 Subject: sched/debug: Factor out printing formats into common macros The printing macros in debug.c keep redefining the same output format. Collect each output format in a single definition, and reuse that definition in the other macros. While at it, add a layer of parentheses and replace printf's with the newly introduced macros. Reviewed-by: Qais Yousef Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com --- kernel/sched/debug.c | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 4670151eb131..315ef6de3cc4 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -816,10 +816,12 @@ static int __init init_sched_debug_procfs(void) __initcall(init_sched_debug_procfs); -#define __P(F) SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)F) -#define P(F) SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)p->F) -#define __PN(F) SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)F)) -#define PN(F) SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)p->F)) +#define __PS(S, F) SEQ_printf(m, "%-45s:%21Ld\n", S, (long long)(F)) +#define __P(F) __PS(#F, F) +#define P(F) __PS(#F, p->F) +#define __PSN(S, F) SEQ_printf(m, "%-45s:%14Ld.%06ld\n", S, SPLIT_NS((long long)(F))) +#define __PN(F) __PSN(#F, F) +#define PN(F) __PSN(#F, p->F) #ifdef CONFIG_NUMA_BALANCING @@ -868,10 +870,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, SEQ_printf(m, "---------------------------------------------------------" "----------\n"); -#define P_SCHEDSTAT(F) \ - SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)schedstat_val(p->F)) -#define PN_SCHEDSTAT(F) \ - SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)schedstat_val(p->F))) + +#define P_SCHEDSTAT(F) __PS(#F, schedstat_val(p->F)) +#define PN_SCHEDSTAT(F) __PSN(#F, schedstat_val(p->F)) PN(se.exec_start); PN(se.vruntime); @@ -931,10 +932,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, } __P(nr_switches); - SEQ_printf(m, "%-45s:%21Ld\n", - "nr_voluntary_switches", (long long)p->nvcsw); - SEQ_printf(m, "%-45s:%21Ld\n", - "nr_involuntary_switches", (long long)p->nivcsw); + __PS("nr_voluntary_switches", p->nvcsw); + __PS("nr_involuntary_switches", p->nivcsw); P(se.load.weight); #ifdef CONFIG_SMP @@ -963,8 +962,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, t0 = cpu_clock(this_cpu); t1 = cpu_clock(this_cpu); - SEQ_printf(m, "%-45s:%21Ld\n", - "clock-delta", (long long)(t1-t0)); + __PS("clock-delta", t1-t0); } sched_show_numa(p, m); -- cgit v1.2.3-71-gd317 From 96e74ebf8d594496f3dda5f8e26af6b4e161e4e9 Mon Sep 17 00:00:00 2001 From: Valentin Schneider Date: Wed, 26 Feb 2020 12:45:43 +0000 Subject: sched/debug: Add task uclamp values to SCHED_DEBUG procfs Requested and effective uclamp values can be a bit tricky to decipher when playing with cgroup hierarchies. Add them to a task's procfs when SCHED_DEBUG is enabled. Reviewed-by: Qais Yousef Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com --- kernel/sched/debug.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 315ef6de3cc4..a562df57a86e 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -946,6 +946,12 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P(se.avg.last_update_time); P(se.avg.util_est.ewma); P(se.avg.util_est.enqueued); +#endif +#ifdef CONFIG_UCLAMP_TASK + __PS("uclamp.min", p->uclamp[UCLAMP_MIN].value); + __PS("uclamp.max", p->uclamp[UCLAMP_MAX].value); + __PS("effective uclamp.min", uclamp_eff_value(p, UCLAMP_MIN)); + __PS("effective uclamp.max", uclamp_eff_value(p, UCLAMP_MAX)); #endif P(policy); P(prio); -- cgit v1.2.3-71-gd317 From d22cc7f67d55ebf2d5be865453971c783e9fb21a Mon Sep 17 00:00:00 2001 From: Qian Cai Date: Mon, 30 Mar 2020 17:30:02 -0400 Subject: locking/percpu-rwsem: Fix a task_struct refcount The following commit: 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem") introduced task_struct memory leaks due to messing up the task_struct refcount. At the beginning of percpu_rwsem_wake_function(), it calls get_task_struct(), but if the trylock failed, it will remain in the waitqueue. However, it will run percpu_rwsem_wake_function() again with get_task_struct() to increase the refcount but then only call put_task_struct() once the trylock succeeded. Fix it by adjusting percpu_rwsem_wake_function() a bit to guard against when percpu_rwsem_wait() observing !private, terminating the wait and doing a quick exit() while percpu_rwsem_wake_function() then doing wake_up_process(p) as a use-after-free. Fixes: 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem") Suggested-by: Peter Zijlstra Signed-off-by: Qian Cai Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar Link: https://lkml.kernel.org/r/20200330213002.2374-1-cai@lca.pw --- kernel/locking/percpu-rwsem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c index a008a1ba21a7..8bbafe3e5203 100644 --- a/kernel/locking/percpu-rwsem.c +++ b/kernel/locking/percpu-rwsem.c @@ -118,14 +118,15 @@ static int percpu_rwsem_wake_function(struct wait_queue_entry *wq_entry, unsigned int mode, int wake_flags, void *key) { - struct task_struct *p = get_task_struct(wq_entry->private); bool reader = wq_entry->flags & WQ_FLAG_CUSTOM; struct percpu_rw_semaphore *sem = key; + struct task_struct *p; /* concurrent against percpu_down_write(), can get stolen */ if (!__percpu_rwsem_trylock(sem, reader)) return 1; + p = get_task_struct(wq_entry->private); list_del_init(&wq_entry->entry); smp_store_release(&wq_entry->private, NULL); -- cgit v1.2.3-71-gd317 From 9a019db0b6bebc84d6b64636faf73ed6d64cd4bb Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 31 Mar 2020 20:38:12 +0200 Subject: locking/lockdep: Improve 'invalid wait context' splat The 'invalid wait context' splat doesn't print all the information required to reconstruct / validate the error, specifically the irq-context state is missing. Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 51 +++++++++++++++++++++++++++++------------------- 1 file changed, 31 insertions(+), 20 deletions(-) (limited to 'kernel') diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 1511690e4de7..ac10db66cc63 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3952,10 +3952,36 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this, return ret; } +static inline short task_wait_context(struct task_struct *curr) +{ + /* + * Set appropriate wait type for the context; for IRQs we have to take + * into account force_irqthread as that is implied by PREEMPT_RT. + */ + if (curr->hardirq_context) { + /* + * Check if force_irqthreads will run us threaded. + */ + if (curr->hardirq_threaded || curr->irq_config) + return LD_WAIT_CONFIG; + + return LD_WAIT_SPIN; + } else if (curr->softirq_context) { + /* + * Softirqs are always threaded. + */ + return LD_WAIT_CONFIG; + } + + return LD_WAIT_MAX; +} + static int print_lock_invalid_wait_context(struct task_struct *curr, struct held_lock *hlock) { + short curr_inner; + if (!debug_locks_off()) return 0; if (debug_locks_silent) @@ -3971,6 +3997,10 @@ print_lock_invalid_wait_context(struct task_struct *curr, print_lock(hlock); pr_warn("other info that might help us debug this:\n"); + + curr_inner = task_wait_context(curr); + pr_warn("context-{%d:%d}\n", curr_inner, curr_inner); + lockdep_print_held_locks(curr); pr_warn("stack backtrace:\n"); @@ -4017,26 +4047,7 @@ static int check_wait_context(struct task_struct *curr, struct held_lock *next) } depth++; - /* - * Set appropriate wait type for the context; for IRQs we have to take - * into account force_irqthread as that is implied by PREEMPT_RT. - */ - if (curr->hardirq_context) { - /* - * Check if force_irqthreads will run us threaded. - */ - if (curr->hardirq_threaded || curr->irq_config) - curr_inner = LD_WAIT_CONFIG; - else - curr_inner = LD_WAIT_SPIN; - } else if (curr->softirq_context) { - /* - * Softirqs are always threaded. - */ - curr_inner = LD_WAIT_CONFIG; - } else { - curr_inner = LD_WAIT_MAX; - } + curr_inner = task_wait_context(curr); for (; depth < curr->lockdep_depth; depth++) { struct held_lock *prev = curr->held_locks + depth; -- cgit v1.2.3-71-gd317 From cdcda0d1f8f4ab84efe7cd9921c98364398aefd7 Mon Sep 17 00:00:00 2001 From: Kishon Vijay Abraham I Date: Mon, 6 Apr 2020 10:58:36 +0530 Subject: dma-direct: fix data truncation in dma_direct_get_required_mask() The upper 32-bit physical address gets truncated inadvertently when dma_direct_get_required_mask() invokes phys_to_dma_direct(). This results in dma_addressing_limited() return incorrect value when used in platforms with LPAE enabled. Fix it here by explicitly type casting 'max_pfn' to phys_addr_t in order to prevent overflow of intermediate value while evaluating '(max_pfn - 1) << PAGE_SHIFT'. Signed-off-by: Kishon Vijay Abraham I Signed-off-by: Christoph Hellwig --- kernel/dma/direct.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index a8560052a915..8f4bbdaf965e 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -39,7 +39,8 @@ static inline struct page *dma_direct_to_page(struct device *dev, u64 dma_direct_get_required_mask(struct device *dev) { - u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT); + phys_addr_t phys = (phys_addr_t)(max_pfn - 1) << PAGE_SHIFT; + u64 max_dma = phys_to_dma_direct(dev, phys); return (1ULL << (fls64(max_dma) - 1)) * 2 - 1; } -- cgit v1.2.3-71-gd317 From 9bb50ed7470944238ec8e30a94ef096caf9056ee Mon Sep 17 00:00:00 2001 From: Grygorii Strashko Date: Wed, 8 Apr 2020 22:43:00 +0300 Subject: dma-debug: fix displaying of dma allocation type The commit 2e05ea5cdc1a ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs") removed "dma_debug_page" enum, but missed to update type2name string table. This causes incorrect displaying of dma allocation type. Fix it by removing "page" string from type2name string table and switch to use named initializers. Before (dma_alloc_coherent()): k3-ringacc 4b800000.ringacc: scather-gather idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable k3-ringacc 4b800000.ringacc: scather-gather idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable After: k3-ringacc 4b800000.ringacc: coherent idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable k3-ringacc 4b800000.ringacc: coherent idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable Fixes: 2e05ea5cdc1a ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs") Signed-off-by: Grygorii Strashko Signed-off-by: Christoph Hellwig --- kernel/dma/debug.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index 2031ed1ad7fa..9e1777c81f55 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -137,9 +137,12 @@ static const char *const maperr2str[] = { [MAP_ERR_CHECKED] = "dma map error checked", }; -static const char *type2name[5] = { "single", "page", - "scather-gather", "coherent", - "resource" }; +static const char *type2name[] = { + [dma_debug_single] = "single", + [dma_debug_sg] = "scather-gather", + [dma_debug_coherent] = "coherent", + [dma_debug_resource] = "resource", +}; static const char *dir2name[4] = { "DMA_BIDIRECTIONAL", "DMA_TO_DEVICE", "DMA_FROM_DEVICE", "DMA_NONE" }; -- cgit v1.2.3-71-gd317 From 63f818f46af9f8b3f17b9695501e8d08959feb60 Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Tue, 7 Apr 2020 09:43:04 -0500 Subject: proc: Use a dedicated lock in struct pid syzbot wrote: > ======================================================== > WARNING: possible irq lock inversion dependency detected > 5.6.0-syzkaller #0 Not tainted > -------------------------------------------------------- > swapper/1/0 just changed the state of lock: > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840 > but this lock took another, SOFTIRQ-unsafe lock in the past: > (&pid->wait_pidfd){+.+.}-{2:2} > > > and interrupts could create inverse lock ordering between them. > > > other info that might help us debug this: > Possible interrupt unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&pid->wait_pidfd); > local_irq_disable(); > lock(tasklist_lock); > lock(&pid->wait_pidfd); > > lock(tasklist_lock); > > *** DEADLOCK *** > > 4 locks held by swapper/1/0: The problem is that because wait_pidfd.lock is taken under the tasklist lock. It must always be taken with irqs disabled as tasklist_lock can be taken from interrupt context and if wait_pidfd.lock was already taken this would create a lock order inversion. Oleg suggested just disabling irqs where I have added extra calls to wait_pidfd.lock. That should be safe and I think the code will eventually do that. It was rightly pointed out by Christian that sharing the wait_pidfd.lock was a premature optimization. It is also true that my pre-merge window testing was insufficient. So remove the premature optimization and give struct pid a dedicated lock of it's own for struct pid things. I have verified that lockdep sees all 3 paths where we take the new pid->lock and lockdep does not complain. It is my current day dream that one day pid->lock can be used to guard the task lists as well and then the tasklist_lock won't need to be held to deliver signals. That will require taking pid->lock with irqs disabled. Acked-by: Christian Brauner Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/ Cc: Oleg Nesterov Cc: Christian Brauner Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") Signed-off-by: "Eric W. Biederman" --- fs/proc/base.c | 10 +++++----- include/linux/pid.h | 1 + kernel/pid.c | 1 + 3 files changed, 7 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/fs/proc/base.c b/fs/proc/base.c index 74f948a6b621..6042b646ab27 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1839,9 +1839,9 @@ void proc_pid_evict_inode(struct proc_inode *ei) struct pid *pid = ei->pid; if (S_ISDIR(ei->vfs_inode.i_mode)) { - spin_lock(&pid->wait_pidfd.lock); + spin_lock(&pid->lock); hlist_del_init_rcu(&ei->sibling_inodes); - spin_unlock(&pid->wait_pidfd.lock); + spin_unlock(&pid->lock); } put_pid(pid); @@ -1877,9 +1877,9 @@ struct inode *proc_pid_make_inode(struct super_block * sb, /* Let the pid remember us for quick removal */ ei->pid = pid; if (S_ISDIR(mode)) { - spin_lock(&pid->wait_pidfd.lock); + spin_lock(&pid->lock); hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes); - spin_unlock(&pid->wait_pidfd.lock); + spin_unlock(&pid->lock); } task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid); @@ -3273,7 +3273,7 @@ static const struct inode_operations proc_tgid_base_inode_operations = { void proc_flush_pid(struct pid *pid) { - proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock); + proc_invalidate_siblings_dcache(&pid->inodes, &pid->lock); put_pid(pid); } diff --git a/include/linux/pid.h b/include/linux/pid.h index 01a0d4e28506..cc896f0fc4e3 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -60,6 +60,7 @@ struct pid { refcount_t count; unsigned int level; + spinlock_t lock; /* lists of tasks that use this pid */ struct hlist_head tasks[PIDTYPE_MAX]; struct hlist_head inodes; diff --git a/kernel/pid.c b/kernel/pid.c index efd34874b3d1..517d0855d4cf 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -246,6 +246,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, get_pid_ns(ns); refcount_set(&pid->count, 1); + spin_lock_init(&pid->lock); for (type = 0; type < PIDTYPE_MAX; ++type) INIT_HLIST_HEAD(&pid->tasks[type]); -- cgit v1.2.3-71-gd317 From d8ef4b38cb69d907f9b0e889c44d05fc0f890977 Mon Sep 17 00:00:00 2001 From: Tejun Heo Date: Thu, 9 Apr 2020 14:55:35 -0400 Subject: Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window" This reverts commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"). The commit was added in anticipation of memcg rstat conversion which needed synchronous accounting for the event counters (e.g. oom kill count). However, the conversion didn't get merged due to percpu memory overhead concern which couldn't be addressed at the time. Unfortunately, the patch's addition of smp_mb() to cgroup_rstat_updated() meant that every scheduling event now had to go through an additional full barrier and Mel Gorman noticed it as 1% regression in netperf UDP_STREAM test. There's no need to have this barrier in tree now and even if we need synchronous accounting in the future, the right thing to do is separating that out to a separate function so that hot paths which don't care about synchronous behavior don't have to pay the overhead of the full barrier. Let's revert. Signed-off-by: Tejun Heo Reported-by: Mel Gorman Link: http://lkml.kernel.org/r/20200409154413.GK3818@techsingularity.net Cc: v4.18+ --- kernel/cgroup/rstat.c | 16 +++------------- 1 file changed, 3 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c index 6f87352f8219..41ca996568df 100644 --- a/kernel/cgroup/rstat.c +++ b/kernel/cgroup/rstat.c @@ -33,12 +33,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) return; /* - * Paired with the one in cgroup_rstat_cpu_pop_updated(). Either we - * see NULL updated_next or they see our updated stat. - */ - smp_mb(); - - /* + * Speculative already-on-list test. This may race leading to + * temporary inaccuracies, which is fine. + * * Because @parent's updated_children is terminated with @parent * instead of NULL, we can tell whether @cgrp is on the list by * testing the next pointer for NULL. @@ -134,13 +131,6 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos, *nextp = rstatc->updated_next; rstatc->updated_next = NULL; - /* - * Paired with the one in cgroup_rstat_cpu_updated(). - * Either they see NULL updated_next or we see their - * updated stat. - */ - smp_mb(); - return pos; } -- cgit v1.2.3-71-gd317 From ab6f762f0f53162d41497708b33c9a3236d3609e Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Tue, 3 Mar 2020 20:30:02 +0900 Subject: printk: queue wake_up_klogd irq_work only if per-CPU areas are ready printk_deferred(), similarly to printk_safe/printk_nmi, does not immediately attempt to print a new message on the consoles, avoiding calls into non-reentrant kernel paths, e.g. scheduler or timekeeping, which potentially can deadlock the system. Those printk() flavors, instead, rely on per-CPU flush irq_work to print messages from safer contexts. For same reasons (recursive scheduler or timekeeping calls) printk() uses per-CPU irq_work in order to wake up user space syslog/kmsg readers. However, only printk_safe/printk_nmi do make sure that per-CPU areas have been initialised and that it's safe to modify per-CPU irq_work. This means that, for instance, should printk_deferred() be invoked "too early", that is before per-CPU areas are initialised, printk_deferred() will perform illegal per-CPU access. Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random: silence a lockdep splat with printk()") user-space syslog/kmsg readers are not able to read new kernel messages. The reason is printk_deferred() being called too early (as was pointed out by Petr and John). Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU areas are initialized. Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/ Reported-by: Lech Perczak Signed-off-by: Sergey Senozhatsky Tested-by: Jann Horn Reviewed-by: Petr Mladek Cc: Greg Kroah-Hartman Cc: Theodore Ts'o Cc: John Ogness Signed-off-by: Linus Torvalds --- include/linux/printk.h | 5 ----- init/main.c | 1 - kernel/printk/internal.h | 5 +++++ kernel/printk/printk.c | 34 ++++++++++++++++++++++++++++++++++ kernel/printk/printk_safe.c | 11 +---------- 5 files changed, 40 insertions(+), 16 deletions(-) (limited to 'kernel') diff --git a/include/linux/printk.h b/include/linux/printk.h index 1e6108b8d15f..e061635e0409 100644 --- a/include/linux/printk.h +++ b/include/linux/printk.h @@ -202,7 +202,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); void dump_stack_print_info(const char *log_lvl); void show_regs_print_info(const char *log_lvl); extern asmlinkage void dump_stack(void) __cold; -extern void printk_safe_init(void); extern void printk_safe_flush(void); extern void printk_safe_flush_on_panic(void); #else @@ -269,10 +268,6 @@ static inline void dump_stack(void) { } -static inline void printk_safe_init(void) -{ -} - static inline void printk_safe_flush(void) { } diff --git a/init/main.c b/init/main.c index e488213857e2..a48617f2e5e5 100644 --- a/init/main.c +++ b/init/main.c @@ -913,7 +913,6 @@ asmlinkage __visible void __init start_kernel(void) boot_init_stack_canary(); time_init(); - printk_safe_init(); perf_event_init(); profile_init(); call_function_init(); diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h index c8e6ab689d42..b2b0f526f249 100644 --- a/kernel/printk/internal.h +++ b/kernel/printk/internal.h @@ -23,6 +23,9 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args); void __printk_safe_enter(void); void __printk_safe_exit(void); +void printk_safe_init(void); +bool printk_percpu_data_ready(void); + #define printk_safe_enter_irqsave(flags) \ do { \ local_irq_save(flags); \ @@ -64,4 +67,6 @@ __printf(1, 0) int vprintk_func(const char *fmt, va_list args) { return 0; } #define printk_safe_enter_irq() local_irq_disable() #define printk_safe_exit_irq() local_irq_enable() +static inline void printk_safe_init(void) { } +static inline bool printk_percpu_data_ready(void) { return false; } #endif /* CONFIG_PRINTK */ diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index 633f41a11d75..9a9b6156270b 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -460,6 +460,18 @@ static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN); static char *log_buf = __log_buf; static u32 log_buf_len = __LOG_BUF_LEN; +/* + * We cannot access per-CPU data (e.g. per-CPU flush irq_work) before + * per_cpu_areas are initialised. This variable is set to true when + * it's safe to access per-CPU data. + */ +static bool __printk_percpu_data_ready __read_mostly; + +bool printk_percpu_data_ready(void) +{ + return __printk_percpu_data_ready; +} + /* Return log buffer address */ char *log_buf_addr_get(void) { @@ -1146,12 +1158,28 @@ static void __init log_buf_add_cpu(void) static inline void log_buf_add_cpu(void) {} #endif /* CONFIG_SMP */ +static void __init set_percpu_data_ready(void) +{ + printk_safe_init(); + /* Make sure we set this flag only after printk_safe() init is done */ + barrier(); + __printk_percpu_data_ready = true; +} + void __init setup_log_buf(int early) { unsigned long flags; char *new_log_buf; unsigned int free; + /* + * Some archs call setup_log_buf() multiple times - first is very + * early, e.g. from setup_arch(), and second - when percpu_areas + * are initialised. + */ + if (!early) + set_percpu_data_ready(); + if (log_buf != __log_buf) return; @@ -2975,6 +3003,9 @@ static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = { void wake_up_klogd(void) { + if (!printk_percpu_data_ready()) + return; + preempt_disable(); if (waitqueue_active(&log_wait)) { this_cpu_or(printk_pending, PRINTK_PENDING_WAKEUP); @@ -2985,6 +3016,9 @@ void wake_up_klogd(void) void defer_console_output(void) { + if (!printk_percpu_data_ready()) + return; + preempt_disable(); __this_cpu_or(printk_pending, PRINTK_PENDING_OUTPUT); irq_work_queue(this_cpu_ptr(&wake_up_klogd_work)); diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c index b4045e782743..d9a659a686f3 100644 --- a/kernel/printk/printk_safe.c +++ b/kernel/printk/printk_safe.c @@ -27,7 +27,6 @@ * There are situations when we want to make sure that all buffers * were handled or when IRQs are blocked. */ -static int printk_safe_irq_ready __read_mostly; #define SAFE_LOG_BUF_LEN ((1 << CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT) - \ sizeof(atomic_t) - \ @@ -51,7 +50,7 @@ static DEFINE_PER_CPU(struct printk_safe_seq_buf, nmi_print_seq); /* Get flushed in a more safe context. */ static void queue_flush_work(struct printk_safe_seq_buf *s) { - if (printk_safe_irq_ready) + if (printk_percpu_data_ready()) irq_work_queue(&s->work); } @@ -402,14 +401,6 @@ void __init printk_safe_init(void) #endif } - /* - * In the highly unlikely event that a NMI were to trigger at - * this moment. Make sure IRQ work is set up before this - * variable is set. - */ - barrier(); - printk_safe_irq_ready = 1; - /* Flush pending messages that did not have scheduled IRQ works. */ printk_safe_flush(); } -- cgit v1.2.3-71-gd317 From d7d27cfc5cf0766a26a8f56868c5ad5434735126 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Fri, 10 Apr 2020 14:33:43 -0700 Subject: kmod: make request_module() return an error when autoloading is disabled Patch series "module autoloading fixes and cleanups", v5. This series fixes a bug where request_module() was reporting success to kernel code when module autoloading had been completely disabled via 'echo > /proc/sys/kernel/modprobe'. It also addresses the issues raised on the original thread (https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u) bydocumenting the modprobe sysctl, adding a self-test for the empty path case, and downgrading a user-reachable WARN_ONCE(). This patch (of 4): It's long been possible to disable kernel module autoloading completely (while still allowing manual module insertion) by setting /proc/sys/kernel/modprobe to the empty string. This can be preferable to setting it to a nonexistent file since it avoids the overhead of an attempted execve(), avoids potential deadlocks, and avoids the call to security_kernel_module_request() and thus on SELinux-based systems eliminates the need to write SELinux rules to dontaudit module_request. However, when module autoloading is disabled in this way, request_module() returns 0. This is broken because callers expect 0 to mean that the module was successfully loaded. Apparently this was never noticed because this method of disabling module autoloading isn't used much, and also most callers don't use the return value of request_module() since it's always necessary to check whether the module registered its functionality or not anyway. But improperly returning 0 can indeed confuse a few callers, for example get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit: if (!fs && (request_module("fs-%.*s", len, name) == 0)) { fs = __get_fs_type(name, len); WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name); } This is easily reproduced with: echo > /proc/sys/kernel/modprobe mount -t NONEXISTENT none / It causes: request_module fs-NONEXISTENT succeeded, but still no fs? WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0 [...] This should actually use pr_warn_once() rather than WARN_ONCE(), since it's also user-reachable if userspace immediately unloads the module. Regardless, request_module() should correctly return an error when it fails. So let's make it return -ENOENT, which matches the error when the modprobe binary doesn't exist. I've also sent patches to document and test this case. Signed-off-by: Eric Biggers Signed-off-by: Andrew Morton Reviewed-by: Kees Cook Reviewed-by: Jessica Yu Acked-by: Luis Chamberlain Cc: Alexei Starovoitov Cc: Greg Kroah-Hartman Cc: Jeff Vander Stoep Cc: Ben Hutchings Cc: Josh Triplett Cc: Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org Signed-off-by: Linus Torvalds --- kernel/kmod.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kmod.c b/kernel/kmod.c index 8b2b311afa95..37c3c4b97b8e 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -120,7 +120,7 @@ out: * invoke it. * * If module auto-loading support is disabled then this function - * becomes a no-operation. + * simply returns -ENOENT. */ int __request_module(bool wait, const char *fmt, ...) { @@ -137,7 +137,7 @@ int __request_module(bool wait, const char *fmt, ...) WARN_ON_ONCE(wait && current_is_async()); if (!modprobe_path[0]) - return 0; + return -ENOENT; va_start(args, fmt); ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args); -- cgit v1.2.3-71-gd317 From f4d74ef6220c1eda0875da30457bef5c7111ab06 Mon Sep 17 00:00:00 2001 From: Vasily Averin Date: Fri, 10 Apr 2020 14:34:10 -0700 Subject: kernel/gcov/fs.c: gcov_seq_next() should increase position index If seq_file .next function does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin Signed-off-by: Andrew Morton Acked-by: Peter Oberparleiter Cc: Al Viro Cc: Davidlohr Bueso Cc: Ingo Molnar Cc: Manfred Spraul Cc: NeilBrown Cc: Steven Rostedt Cc: Waiman Long Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com Signed-off-by: Linus Torvalds --- kernel/gcov/fs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/gcov/fs.c b/kernel/gcov/fs.c index 5e891c3c2d93..82babf5aa077 100644 --- a/kernel/gcov/fs.c +++ b/kernel/gcov/fs.c @@ -108,9 +108,9 @@ static void *gcov_seq_next(struct seq_file *seq, void *data, loff_t *pos) { struct gcov_iterator *iter = data; + (*pos)++; if (gcov_iter_next(iter)) return NULL; - (*pos)++; return iter; } -- cgit v1.2.3-71-gd317 From eaec2b0bd30690575c581eebffae64bfb7f684ac Mon Sep 17 00:00:00 2001 From: Zhiqiang Liu Date: Mon, 30 Mar 2020 10:18:33 +0800 Subject: signal: check sig before setting info in kill_pid_usb_asyncio In kill_pid_usb_asyncio, if signal is not valid, we do not need to set info struct. Signed-off-by: Zhiqiang Liu Acked-by: Christian Brauner Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com Signed-off-by: Christian Brauner --- kernel/signal.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index e58a6c619824..3f94894d1253 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1510,15 +1510,15 @@ int kill_pid_usb_asyncio(int sig, int errno, sigval_t addr, unsigned long flags; int ret = -EINVAL; + if (!valid_signal(sig)) + return ret; + clear_siginfo(&info); info.si_signo = sig; info.si_errno = errno; info.si_code = SI_ASYNCIO; *((sigval_t *)&info.si_pid) = addr; - if (!valid_signal(sig)) - return ret; - rcu_read_lock(); p = pid_task(pid, PIDTYPE_PID); if (!p) { -- cgit v1.2.3-71-gd317 From 3075afdf15b89a063f8d31c0db08a50472bb7faf Mon Sep 17 00:00:00 2001 From: Zhiqiang Liu Date: Mon, 30 Mar 2020 10:44:43 +0800 Subject: signal: use kill_proc_info instead of kill_pid_info in kill_something_info signal.c provides kill_proc_info, we can use it instead of kill_pid_info in kill_something_info func gracefully. Signed-off-by: Zhiqiang Liu Acked-by: Oleg Nesterov Acked-by: Christian Brauner Link: https://lore.kernel.org/r/80236965-f0b5-c888-95ff-855bdec75bb3@huawei.com Signed-off-by: Christian Brauner --- kernel/signal.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index 3f94894d1253..713104884414 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1557,12 +1557,8 @@ static int kill_something_info(int sig, struct kernel_siginfo *info, pid_t pid) { int ret; - if (pid > 0) { - rcu_read_lock(); - ret = kill_pid_info(sig, info, find_vpid(pid)); - rcu_read_unlock(); - return ret; - } + if (pid > 0) + return kill_proc_info(sig, info, pid); /* -INT_MIN is undefined. Exclude this case to avoid a UBSAN warning */ if (pid == INT_MIN) -- cgit v1.2.3-71-gd317 From 07d8350ede4c4c29634b26c163a1eecdf39dfcfb Mon Sep 17 00:00:00 2001 From: afzal mohammed Date: Fri, 27 Mar 2020 21:41:16 +0530 Subject: genirq: Remove setup_irq() and remove_irq() Now that all the users of setup_irq() & remove_irq() have been replaced by request_irq() & free_irq() respectively, delete them. Signed-off-by: afzal mohammed Signed-off-by: Thomas Gleixner Reviewed-by: Linus Walleij Link: https://lkml.kernel.org/r/0aa8771ada1ac8e1312f6882980c9c08bd023148.1585320721.git.afzal.mohd.ma@gmail.com --- include/linux/irq.h | 2 -- kernel/irq/manage.c | 44 -------------------------------------------- 2 files changed, 46 deletions(-) (limited to 'kernel') diff --git a/include/linux/irq.h b/include/linux/irq.h index 9315fbb87db3..c63c2aa915ff 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -573,8 +573,6 @@ enum { #define IRQ_DEFAULT_INIT_FLAGS ARCH_IRQ_INIT_FLAGS struct irqaction; -extern int setup_irq(unsigned int irq, struct irqaction *new); -extern void remove_irq(unsigned int irq, struct irqaction *act); extern int setup_percpu_irq(unsigned int irq, struct irqaction *new); extern void remove_percpu_irq(unsigned int irq, struct irqaction *act); diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index fe40c658f86f..453a8a0f4804 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1690,34 +1690,6 @@ out_mput: return ret; } -/** - * setup_irq - setup an interrupt - * @irq: Interrupt line to setup - * @act: irqaction for the interrupt - * - * Used to statically setup interrupts in the early boot process. - */ -int setup_irq(unsigned int irq, struct irqaction *act) -{ - int retval; - struct irq_desc *desc = irq_to_desc(irq); - - if (!desc || WARN_ON(irq_settings_is_per_cpu_devid(desc))) - return -EINVAL; - - retval = irq_chip_pm_get(&desc->irq_data); - if (retval < 0) - return retval; - - retval = __setup_irq(irq, desc, act); - - if (retval) - irq_chip_pm_put(&desc->irq_data); - - return retval; -} -EXPORT_SYMBOL_GPL(setup_irq); - /* * Internal function to unregister an irqaction - used to free * regular and special interrupts that are part of the architecture. @@ -1858,22 +1830,6 @@ static struct irqaction *__free_irq(struct irq_desc *desc, void *dev_id) return action; } -/** - * remove_irq - free an interrupt - * @irq: Interrupt line to free - * @act: irqaction for the interrupt - * - * Used to remove interrupts statically setup by the early boot process. - */ -void remove_irq(unsigned int irq, struct irqaction *act) -{ - struct irq_desc *desc = irq_to_desc(irq); - - if (desc && !WARN_ON(irq_settings_is_per_cpu_devid(desc))) - __free_irq(desc, act->dev_id); -} -EXPORT_SYMBOL_GPL(remove_irq); - /** * free_irq - free an interrupt allocated with request_irq * @irq: Interrupt line to free -- cgit v1.2.3-71-gd317 From 1f6cb19be2e231fe092f40decb71f066eba090d7 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Fri, 10 Apr 2020 13:26:12 -0700 Subject: bpf: Prevent re-mmap()'ing BPF map as writable for initially r/o mapping VM_MAYWRITE flag during initial memory mapping determines if already mmap()'ed pages can be later remapped as writable ones through mprotect() call. To prevent user application to rewrite contents of memory-mapped as read-only and subsequently frozen BPF map, remove VM_MAYWRITE flag completely on initially read-only mapping. Alternatively, we could treat any memory-mapping on unfrozen map as writable and bump writecnt instead. But there is little legitimate reason to map BPF map as read-only and then re-mmap() it as writable through mprotect(), instead of just mmap()'ing it as read/write from the very beginning. Also, at the suggestion of Jann Horn, drop unnecessary refcounting in mmap operations. We can just rely on VMA holding reference to BPF map's file properly. Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Reported-by: Jann Horn Signed-off-by: Andrii Nakryiko Signed-off-by: Daniel Borkmann Reviewed-by: Jann Horn Link: https://lore.kernel.org/bpf/20200410202613.3679837-1-andriin@fb.com --- kernel/bpf/syscall.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 64783da34202..d85f37239540 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -586,9 +586,7 @@ static void bpf_map_mmap_open(struct vm_area_struct *vma) { struct bpf_map *map = vma->vm_file->private_data; - bpf_map_inc_with_uref(map); - - if (vma->vm_flags & VM_WRITE) { + if (vma->vm_flags & VM_MAYWRITE) { mutex_lock(&map->freeze_mutex); map->writecnt++; mutex_unlock(&map->freeze_mutex); @@ -600,13 +598,11 @@ static void bpf_map_mmap_close(struct vm_area_struct *vma) { struct bpf_map *map = vma->vm_file->private_data; - if (vma->vm_flags & VM_WRITE) { + if (vma->vm_flags & VM_MAYWRITE) { mutex_lock(&map->freeze_mutex); map->writecnt--; mutex_unlock(&map->freeze_mutex); } - - bpf_map_put_with_uref(map); } static const struct vm_operations_struct bpf_map_default_vmops = { @@ -635,14 +631,16 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) /* set default open/close callbacks */ vma->vm_ops = &bpf_map_default_vmops; vma->vm_private_data = map; + vma->vm_flags &= ~VM_MAYEXEC; + if (!(vma->vm_flags & VM_WRITE)) + /* disallow re-mapping with PROT_WRITE */ + vma->vm_flags &= ~VM_MAYWRITE; err = map->ops->map_mmap(map, vma); if (err) goto out; - bpf_map_inc_with_uref(map); - - if (vma->vm_flags & VM_WRITE) + if (vma->vm_flags & VM_MAYWRITE) map->writecnt++; out: mutex_unlock(&map->freeze_mutex); -- cgit v1.2.3-71-gd317 From 89f33dcadb349eb926a92633e2c5f61466afc596 Mon Sep 17 00:00:00 2001 From: Zou Wei Date: Mon, 13 Apr 2020 19:57:56 +0800 Subject: bpf: remove unneeded conversion to bool in __mark_reg_unknown This issue was detected by using the Coccinelle software: kernel/bpf/verifier.c:1259:16-21: WARNING: conversion to bool not needed here The conversion to bool is unneeded, remove it. Reported-by: Hulk Robot Signed-off-by: Zou Wei Signed-off-by: Daniel Borkmann Acked-by: Song Liu Link: https://lore.kernel.org/bpf/1586779076-101346-1-git-send-email-zou_wei@huawei.com --- kernel/bpf/verifier.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 04c6630cc18f..38cfcf701eeb 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1255,8 +1255,7 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env, reg->type = SCALAR_VALUE; reg->var_off = tnum_unknown; reg->frameno = 0; - reg->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks ? - true : false; + reg->precise = env->subprog_cnt > 1 || !env->allow_ptr_leaks; __mark_reg_unbounded(reg); } -- cgit v1.2.3-71-gd317 From 0bbe7f719985efd9adb3454679ecef0984cb6800 Mon Sep 17 00:00:00 2001 From: Xiao Yang Date: Tue, 14 Apr 2020 09:51:45 +0800 Subject: tracing: Fix the race between registering 'snapshot' event trigger and triggering 'snapshot' operation Traced event can trigger 'snapshot' operation(i.e. calls snapshot_trigger() or snapshot_count_trigger()) when register_snapshot_trigger() has completed registration but doesn't allocate buffer for 'snapshot' event trigger. In the rare case, 'snapshot' operation always detects the lack of allocated buffer so make register_snapshot_trigger() allocate buffer first. trigger-snapshot.tc in kselftest reproduces the issue on slow vm: ----------------------------------------------------------- cat trace ... ftracetest-3028 [002] .... 236.784290: sched_process_fork: comm=ftracetest pid=3028 child_comm=ftracetest child_pid=3036 <...>-2875 [003] .... 240.460335: tracing_snapshot_instance_cond: *** SNAPSHOT NOT ALLOCATED *** <...>-2875 [003] .... 240.460338: tracing_snapshot_instance_cond: *** stopping trace here! *** ----------------------------------------------------------- Link: http://lkml.kernel.org/r/20200414015145.66236-1-yangx.jy@cn.fujitsu.com Cc: stable@vger.kernel.org Fixes: 93e31ffbf417a ("tracing: Add 'snapshot' event trigger command") Signed-off-by: Xiao Yang Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_trigger.c | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_events_trigger.c b/kernel/trace/trace_events_trigger.c index dd34a1b46a86..3a74736da363 100644 --- a/kernel/trace/trace_events_trigger.c +++ b/kernel/trace/trace_events_trigger.c @@ -1088,14 +1088,10 @@ register_snapshot_trigger(char *glob, struct event_trigger_ops *ops, struct event_trigger_data *data, struct trace_event_file *file) { - int ret = register_trigger(glob, ops, data, file); - - if (ret > 0 && tracing_alloc_snapshot_instance(file->tr) != 0) { - unregister_trigger(glob, ops, data, file); - ret = 0; - } + if (tracing_alloc_snapshot_instance(file->tr) != 0) + return 0; - return ret; + return register_trigger(glob, ops, data, file); } static int -- cgit v1.2.3-71-gd317 From e82a118f57b89bbb437ce70780fc2678d5c281e5 Mon Sep 17 00:00:00 2001 From: Eugene Syromiatnikov Date: Sun, 12 Apr 2020 22:25:33 +0200 Subject: clone3: fix cgroup argument sanity check Checking that cgroup field value of struct clone_args is less than 0 is useless, as it is defined as unsigned 64-bit integer. Moreover, it doesn't catch the situations where its higher bits are lost during the assignment to the cgroup field of the cgroup field of the internal struct kernel_clone_args (where it is declared as signed 32-bit integer), so it is still possible to pass garbage there. A check against INT_MAX solves both these issues. Fixes: ef2c41cf38a7559b ("clone3: allow spawning processes into cgroups") Signed-off-by: Eugene Syromiatnikov Acked-by: Christian Brauner Link: https://lore.kernel.org/r/20200412202533.GA29554@asgard.redhat.com Signed-off-by: Christian Brauner --- kernel/fork.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 4385f3d639f2..b4f7775623c8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2631,7 +2631,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, !valid_signal(args.exit_signal))) return -EINVAL; - if ((args.flags & CLONE_INTO_CGROUP) && args.cgroup < 0) + if ((args.flags & CLONE_INTO_CGROUP) && args.cgroup > INT_MAX) return -EINVAL; *kargs = (struct kernel_clone_args){ -- cgit v1.2.3-71-gd317 From 62173872ca65767c586217dec0a32485da8a2f07 Mon Sep 17 00:00:00 2001 From: Eugene Syromiatnikov Date: Sun, 12 Apr 2020 22:31:23 +0200 Subject: clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set Passing CLONE_INTO_CGROUP with an under-sized structure (that doesn't properly contain cgroup field) seems like garbage input, especially considering the fact that fd 0 is a valid descriptor. Signed-off-by: Eugene Syromiatnikov Acked-by: Christian Brauner Link: https://lore.kernel.org/r/20200412203123.GA5869@asgard.redhat.com Signed-off-by: Christian Brauner --- kernel/fork.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index b4f7775623c8..3ab7cf88e455 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2631,7 +2631,8 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, !valid_signal(args.exit_signal))) return -EINVAL; - if ((args.flags & CLONE_INTO_CGROUP) && args.cgroup > INT_MAX) + if ((args.flags & CLONE_INTO_CGROUP) && + (args.cgroup > INT_MAX || usize < CLONE_ARGS_SIZE_VER2)) return -EINVAL; *kargs = (struct kernel_clone_args){ -- cgit v1.2.3-71-gd317 From a966dcfe153ab0a3d8d79cd971a079411a489be7 Mon Sep 17 00:00:00 2001 From: Eugene Syromiatnikov Date: Sun, 12 Apr 2020 22:26:58 +0200 Subject: clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks CLONE_ARGS_SIZE_VER* macros are defined explicitly and not via the offsets of the relevant struct clone_args fields, which makes it rather error-prone, so it probably makes sense to add some compile-time checks for them (including the one that breaks on struct clone_args extension as a reminder to add a relevant size macro and a similar check). Function copy_clone_args_from_user seems to be a good place for such checks. Signed-off-by: Eugene Syromiatnikov Acked-by: Christian Brauner Link: https://lore.kernel.org/r/20200412202658.GA31499@asgard.redhat.com Signed-off-by: Christian Brauner --- kernel/fork.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 3ab7cf88e455..8c700f881d92 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2605,6 +2605,14 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs, struct clone_args args; pid_t *kset_tid = kargs->set_tid; + BUILD_BUG_ON(offsetofend(struct clone_args, tls) != + CLONE_ARGS_SIZE_VER0); + BUILD_BUG_ON(offsetofend(struct clone_args, set_tid_size) != + CLONE_ARGS_SIZE_VER1); + BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) != + CLONE_ARGS_SIZE_VER2); + BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2); + if (unlikely(usize > PAGE_SIZE)) return -E2BIG; if (unlikely(usize < CLONE_ARGS_SIZE_VER0)) -- cgit v1.2.3-71-gd317 From 3662daf023500dc084fa3b96f68a6f46179ddc73 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Fri, 3 Apr 2020 18:35:17 -0400 Subject: sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters The "isolcpus=" parameter allows sub-parameters before the cpulist is specified, and if the parser detects an unknown sub-parameters the whole parameter will be ignored. This design is incompatible with itself when new sub-parameters are added. An older kernel will not recognize the new sub-parameter and will invalidate the whole parameter so the CPU isolation will not take effect. It emits a warning: isolcpus: Error, unknown flag The better and compatible way is to allow "isolcpus=" to skip unknown sub-parameters, so that even if new sub-parameters are added an older kernel will still be able to behave as usual even if with the new sub-parameter specified on the command line. Ideally this should have been there when the first sub-parameter for "isolcpus=" was introduced. Suggested-by: Thomas Gleixner Signed-off-by: Peter Xu Signed-off-by: Thomas Gleixner Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com --- kernel/sched/isolation.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c index 008d6ac2342b..808244f3ddd9 100644 --- a/kernel/sched/isolation.c +++ b/kernel/sched/isolation.c @@ -149,6 +149,9 @@ __setup("nohz_full=", housekeeping_nohz_full_setup); static int __init housekeeping_isolcpus_setup(char *str) { unsigned int flags = 0; + bool illegal = false; + char *par; + int len; while (isalpha(*str)) { if (!strncmp(str, "nohz,", 5)) { @@ -169,8 +172,22 @@ static int __init housekeeping_isolcpus_setup(char *str) continue; } - pr_warn("isolcpus: Error, unknown flag\n"); - return 0; + /* + * Skip unknown sub-parameter and validate that it is not + * containing an invalid character. + */ + for (par = str, len = 0; *str && *str != ','; str++, len++) { + if (!isalpha(*str) && *str != '_') + illegal = true; + } + + if (illegal) { + pr_warn("isolcpus: Invalid flag %.*s\n", len, par); + return 0; + } + + pr_info("isolcpus: Skipped unknown flag %.*s\n", len, par); + str++; } /* Default behaviour for isolcpus without flags */ -- cgit v1.2.3-71-gd317 From e0d648f9d883ec1efab261af158d73aa30e9dd12 Mon Sep 17 00:00:00 2001 From: Borislav Petkov Date: Fri, 27 Mar 2020 22:43:34 +0100 Subject: sched/vtime: Work around an unitialized variable warning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Work around this warning: kernel/sched/cputime.c: In function ‘kcpustat_field’: kernel/sched/cputime.c:1007:6: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized] because GCC can't see that val is used only when err is 0. Acked-by: Peter Zijlstra Signed-off-by: Borislav Petkov Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20200327214334.GF8015@zn.tnic --- kernel/sched/cputime.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index dac9104d126f..ff9435dee1df 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -1003,12 +1003,12 @@ u64 kcpustat_field(struct kernel_cpustat *kcpustat, enum cpu_usage_stat usage, int cpu) { u64 *cpustat = kcpustat->cpustat; + u64 val = cpustat[usage]; struct rq *rq; - u64 val; int err; if (!vtime_accounting_enabled_cpu(cpu)) - return cpustat[usage]; + return val; rq = cpu_rq(cpu); -- cgit v1.2.3-71-gd317 From 94d440d618467806009c8edc70b094d64e12ee5a Mon Sep 17 00:00:00 2001 From: Andrei Vagin Date: Sat, 11 Apr 2020 08:40:31 -0700 Subject: proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets Michael Kerrisk suggested to replace numeric clock IDs with symbolic names. Now the content of these files looks like this: $ cat /proc/774/timens_offsets monotonic 864000 0 boottime 1728000 0 For setting offsets, both representations of clocks (numeric and symbolic) can be used. As for compatibility, it is acceptable to change things as long as userspace doesn't care. The format of timens_offsets files is very new and there are no userspace tools yet which rely on this format. But three projects crun, util-linux and criu rely on the interface of setting time offsets and this is why it's required to continue supporting the numeric clock IDs on write. Fixes: 04a8682a71be ("fs/proc: Introduce /proc/pid/timens_offsets") Suggested-by: Michael Kerrisk Signed-off-by: Andrei Vagin Signed-off-by: Thomas Gleixner Tested-by: Michael Kerrisk Acked-by: Michael Kerrisk Cc: Andrew Morton Cc: Eric W. Biederman Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com --- fs/proc/base.c | 14 +++++++++++++- kernel/time/namespace.c | 15 ++++++++++++++- 2 files changed, 27 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/fs/proc/base.c b/fs/proc/base.c index 6042b646ab27..572898dd16a0 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1573,6 +1573,7 @@ static ssize_t timens_offsets_write(struct file *file, const char __user *buf, noffsets = 0; for (pos = kbuf; pos; pos = next_line) { struct proc_timens_offset *off = &offsets[noffsets]; + char clock[10]; int err; /* Find the end of line and ensure we don't look past it */ @@ -1584,10 +1585,21 @@ static ssize_t timens_offsets_write(struct file *file, const char __user *buf, next_line = NULL; } - err = sscanf(pos, "%u %lld %lu", &off->clockid, + err = sscanf(pos, "%9s %lld %lu", clock, &off->val.tv_sec, &off->val.tv_nsec); if (err != 3 || off->val.tv_nsec >= NSEC_PER_SEC) goto out; + + clock[sizeof(clock) - 1] = 0; + if (strcmp(clock, "monotonic") == 0 || + strcmp(clock, __stringify(CLOCK_MONOTONIC)) == 0) + off->clockid = CLOCK_MONOTONIC; + else if (strcmp(clock, "boottime") == 0 || + strcmp(clock, __stringify(CLOCK_BOOTTIME)) == 0) + off->clockid = CLOCK_BOOTTIME; + else + goto out; + noffsets++; if (noffsets == ARRAY_SIZE(offsets)) { if (next_line) diff --git a/kernel/time/namespace.c b/kernel/time/namespace.c index 3b30288793fe..53bce347cd50 100644 --- a/kernel/time/namespace.c +++ b/kernel/time/namespace.c @@ -338,7 +338,20 @@ static struct user_namespace *timens_owner(struct ns_common *ns) static void show_offset(struct seq_file *m, int clockid, struct timespec64 *ts) { - seq_printf(m, "%d %lld %ld\n", clockid, ts->tv_sec, ts->tv_nsec); + char *clock; + + switch (clockid) { + case CLOCK_BOOTTIME: + clock = "boottime"; + break; + case CLOCK_MONOTONIC: + clock = "monotonic"; + break; + default: + clock = "unknown"; + break; + } + seq_printf(m, "%-10s %10lld %9ld\n", clock, ts->tv_sec, ts->tv_nsec); } void proc_timens_show_offsets(struct task_struct *p, struct seq_file *m) -- cgit v1.2.3-71-gd317 From 763dafc520add02a1f4639b500c509acc0ea8e5b Mon Sep 17 00:00:00 2001 From: Paul Moore Date: Mon, 20 Apr 2020 16:24:34 -0400 Subject: audit: check the length of userspace generated audit records Commit 756125289285 ("audit: always check the netlink payload length in audit_receive_msg()") fixed a number of missing message length checks, but forgot to check the length of userspace generated audit records. The good news is that you need CAP_AUDIT_WRITE to submit userspace audit records, which is generally only given to trusted processes, so the impact should be limited. Cc: stable@vger.kernel.org Fixes: 756125289285 ("audit: always check the netlink payload length in audit_receive_msg()") Reported-by: syzbot+49e69b4d71a420ceda3e@syzkaller.appspotmail.com Signed-off-by: Paul Moore --- kernel/audit.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'kernel') diff --git a/kernel/audit.c b/kernel/audit.c index b69c8b460341..87f31bf1f0a0 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -1326,6 +1326,9 @@ static int audit_receive_msg(struct sk_buff *skb, struct nlmsghdr *nlh) case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2: if (!audit_enabled && msg_type != AUDIT_USER_AVC) return 0; + /* exit early if there isn't at least one character to print */ + if (data_len < 2) + return -EINVAL; err = audit_filter(msg_type, AUDIT_FILTER_USER); if (err == 1) { /* match or error */ -- cgit v1.2.3-71-gd317 From bc23d0e3f717ced21fbfacab3ab887d55e5ba367 Mon Sep 17 00:00:00 2001 From: Toke Høiland-Jørgensen Date: Thu, 16 Apr 2020 10:31:20 +0200 Subject: cpumap: Avoid warning when CONFIG_DEBUG_PER_CPU_MAPS is enabled MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When the kernel is built with CONFIG_DEBUG_PER_CPU_MAPS, the cpumap code can trigger a spurious warning if CONFIG_CPUMASK_OFFSTACK is also set. This happens because in this configuration, NR_CPUS can be larger than nr_cpumask_bits, so the initial check in cpu_map_alloc() is not sufficient to guard against hitting the warning in cpumask_check(). Fix this by explicitly checking the supplied key against the nr_cpumask_bits variable before calling cpu_possible(). Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reported-by: Xiumei Mu Signed-off-by: Toke Høiland-Jørgensen Signed-off-by: Alexei Starovoitov Tested-by: Xiumei Mu Acked-by: Jesper Dangaard Brouer Acked-by: Song Liu Link: https://lore.kernel.org/bpf/20200416083120.453718-1-toke@redhat.com --- kernel/bpf/cpumap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index 70f71b154fa5..3fe0b006d2d2 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -469,7 +469,7 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value, return -EOVERFLOW; /* Make sure CPU is a valid possible cpu */ - if (!cpu_possible(key_cpu)) + if (key_cpu >= nr_cpumask_bits || !cpu_possible(key_cpu)) return -ENODEV; if (qsize == 0) { -- cgit v1.2.3-71-gd317 From 6e7e63cbb023976d828cdb22422606bf77baa8a9 Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Fri, 17 Apr 2020 02:00:06 +0200 Subject: bpf: Forbid XADD on spilled pointers for unprivileged users When check_xadd() verifies an XADD operation on a pointer to a stack slot containing a spilled pointer, check_stack_read() verifies that the read, which is part of XADD, is valid. However, since the placeholder value -1 is passed as `value_regno`, check_stack_read() can only return a binary decision and can't return the type of the value that was read. The intent here is to verify whether the value read from the stack slot may be used as a SCALAR_VALUE; but since check_stack_read() doesn't check the type, and the type information is lost when check_stack_read() returns, this is not enforced, and a malicious user can abuse XADD to leak spilled kernel pointers. Fix it by letting check_stack_read() verify that the value is usable as a SCALAR_VALUE if no type information is passed to the caller. To be able to use __is_pointer_value() in check_stack_read(), move it up. Fix up the expected unprivileged error message for a BPF selftest that, until now, assumed that unprivileged users can use XADD on stack-spilled pointers. This also gives us a test for the behavior introduced in this patch for free. In theory, this could also be fixed by forbidding XADD on stack spills entirely, since XADD is a locked operation (for operations on memory with concurrency) and there can't be any concurrency on the BPF stack; but Alexei has said that he wants to keep XADD on stack slots working to avoid changes to the test suite [1]. The following BPF program demonstrates how to leak a BPF map pointer as an unprivileged user using this bug: // r7 = map_pointer BPF_LD_MAP_FD(BPF_REG_7, small_map), // r8 = launder(map_pointer) BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_7, -8), BPF_MOV64_IMM(BPF_REG_1, 0), ((struct bpf_insn) { .code = BPF_STX | BPF_DW | BPF_XADD, .dst_reg = BPF_REG_FP, .src_reg = BPF_REG_1, .off = -8 }), BPF_LDX_MEM(BPF_DW, BPF_REG_8, BPF_REG_FP, -8), // store r8 into map BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_7), BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP), BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4), BPF_ST_MEM(BPF_W, BPF_REG_ARG2, 0, 0), BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), BPF_EXIT_INSN(), BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_8, 0), BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN() [1] https://lore.kernel.org/bpf/20200416211116.qxqcza5vo2ddnkdq@ast-mbp.dhcp.thefacebook.com/ Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)") Signed-off-by: Jann Horn Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200417000007.10734-1-jannh@google.com --- kernel/bpf/verifier.c | 28 +++++++++++++++------- .../selftests/bpf/verifier/value_illegal_alu.c | 1 + 2 files changed, 20 insertions(+), 9 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 38cfcf701eeb..9e92d3d5ffd1 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -2118,6 +2118,15 @@ static bool register_is_const(struct bpf_reg_state *reg) return reg->type == SCALAR_VALUE && tnum_is_const(reg->var_off); } +static bool __is_pointer_value(bool allow_ptr_leaks, + const struct bpf_reg_state *reg) +{ + if (allow_ptr_leaks) + return false; + + return reg->type != SCALAR_VALUE; +} + static void save_register_state(struct bpf_func_state *state, int spi, struct bpf_reg_state *reg) { @@ -2308,6 +2317,16 @@ static int check_stack_read(struct bpf_verifier_env *env, * which resets stack/reg liveness for state transitions */ state->regs[value_regno].live |= REG_LIVE_WRITTEN; + } else if (__is_pointer_value(env->allow_ptr_leaks, reg)) { + /* If value_regno==-1, the caller is asking us whether + * it is acceptable to use this value as a SCALAR_VALUE + * (e.g. for XADD). + * We must not allow unprivileged callers to do that + * with spilled pointers. + */ + verbose(env, "leaking pointer from stack off %d\n", + off); + return -EACCES; } mark_reg_read(env, reg, reg->parent, REG_LIVE_READ64); } else { @@ -2673,15 +2692,6 @@ static int check_sock_access(struct bpf_verifier_env *env, int insn_idx, return -EACCES; } -static bool __is_pointer_value(bool allow_ptr_leaks, - const struct bpf_reg_state *reg) -{ - if (allow_ptr_leaks) - return false; - - return reg->type != SCALAR_VALUE; -} - static struct bpf_reg_state *reg_state(struct bpf_verifier_env *env, int regno) { return cur_regs(env) + regno; diff --git a/tools/testing/selftests/bpf/verifier/value_illegal_alu.c b/tools/testing/selftests/bpf/verifier/value_illegal_alu.c index 7f6c232cd842..ed1c2cea1dea 100644 --- a/tools/testing/selftests/bpf/verifier/value_illegal_alu.c +++ b/tools/testing/selftests/bpf/verifier/value_illegal_alu.c @@ -88,6 +88,7 @@ BPF_EXIT_INSN(), }, .fixup_map_hash_48b = { 3 }, + .errstr_unpriv = "leaking pointer from stack off -8", .errstr = "R0 invalid mem access 'inv'", .result = REJECT, .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, -- cgit v1.2.3-71-gd317 From 8ff3571f7e1bf3f293cc5e3dc14f2943f4fa7fcf Mon Sep 17 00:00:00 2001 From: Jann Horn Date: Fri, 17 Apr 2020 02:00:07 +0200 Subject: bpf: Fix handling of XADD on BTF memory check_xadd() can cause check_ptr_to_btf_access() to be executed with atype==BPF_READ and value_regno==-1 (meaning "just check whether the access is okay, don't tell me what type it will result in"). Handle that case properly and skip writing type information, instead of indexing into the registers at index -1 and writing into out-of-bounds memory. Note that at least at the moment, you can't actually write through a BTF pointer, so check_xadd() will reject the program after calling check_ptr_to_btf_access with atype==BPF_WRITE; but that's after the verifier has already corrupted memory. This patch assumes that BTF pointers are not available in unprivileged programs. Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF") Signed-off-by: Jann Horn Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200417000007.10734-2-jannh@google.com --- kernel/bpf/verifier.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9e92d3d5ffd1..9382609147f5 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -3099,7 +3099,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env, if (ret < 0) return ret; - if (atype == BPF_READ) { + if (atype == BPF_READ && value_regno >= 0) { if (ret == SCALAR_VALUE) { mark_reg_unknown(env, regs, value_regno); return 0; -- cgit v1.2.3-71-gd317 From 61e713bdca3678e84815f2427f7a063fc353a1fc Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Mon, 20 Apr 2020 11:41:50 -0500 Subject: signal: Avoid corrupting si_pid and si_uid in do_notify_parent Christof Meerwald writes: > Hi, > > this is probably related to commit > 7a0cf094944e2540758b7f957eb6846d5126f535 (signal: Correct namespace > fixups of si_pid and si_uid). > > With a 5.6.5 kernel I am seeing SIGCHLD signals that don't include a > properly set si_pid field - this seems to happen for multi-threaded > child processes. > > A simple test program (based on the sample from the signalfd man page): > > #include > #include > #include > #include > #include > #include > > #define handle_error(msg) \ > do { perror(msg); exit(EXIT_FAILURE); } while (0) > > int main(int argc, char *argv[]) > { > sigset_t mask; > int sfd; > struct signalfd_siginfo fdsi; > ssize_t s; > > sigemptyset(&mask); > sigaddset(&mask, SIGCHLD); > > if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) > handle_error("sigprocmask"); > > pid_t chldpid; > char *chldargv[] = { "./sfdclient", NULL }; > posix_spawn(&chldpid, "./sfdclient", NULL, NULL, chldargv, NULL); > > sfd = signalfd(-1, &mask, 0); > if (sfd == -1) > handle_error("signalfd"); > > for (;;) { > s = read(sfd, &fdsi, sizeof(struct signalfd_siginfo)); > if (s != sizeof(struct signalfd_siginfo)) > handle_error("read"); > > if (fdsi.ssi_signo == SIGCHLD) { > printf("Got SIGCHLD %d %d %d %d\n", > fdsi.ssi_status, fdsi.ssi_code, > fdsi.ssi_uid, fdsi.ssi_pid); > return 0; > } else { > printf("Read unexpected signal\n"); > } > } > } > > > and a multi-threaded client to test with: > > #include > #include > > void *f(void *arg) > { > sleep(100); > } > > int main() > { > pthread_t t[8]; > > for (int i = 0; i != 8; ++i) > { > pthread_create(&t[i], NULL, f, NULL); > } > } > > I tried to do a bit of debugging and what seems to be happening is > that > > /* From an ancestor pid namespace? */ > if (!task_pid_nr_ns(current, task_active_pid_ns(t))) { > > fails inside task_pid_nr_ns because the check for "pid_alive" fails. > > This code seems to be called from do_notify_parent and there we > actually have "tsk != current" (I am assuming both are threads of the > current process?) I instrumented the code with a warning and received the following backtrace: > WARNING: CPU: 0 PID: 777 at kernel/pid.c:501 __task_pid_nr_ns.cold.6+0xc/0x15 > Modules linked in: > CPU: 0 PID: 777 Comm: sfdclient Not tainted 5.7.0-rc1userns+ #2924 > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > RIP: 0010:__task_pid_nr_ns.cold.6+0xc/0x15 > Code: ff 66 90 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 9a b6 44 00 48 83 c4 08 c3 48 c7 c7 59 9f ac 82 e8 c2 c4 04 00 <0f> 0b e9 3fd > RSP: 0018:ffffc9000042fbf8 EFLAGS: 00010046 > RAX: 000000000000000c RBX: 0000000000000000 RCX: ffffc9000042faf4 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81193d29 > RBP: ffffc9000042fc18 R08: 0000000000000000 R09: 0000000000000001 > R10: 000000100f938416 R11: 0000000000000309 R12: ffff8880b941c140 > R13: 0000000000000000 R14: 0000000000000000 R15: ffff8880b941c140 > FS: 0000000000000000(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f2e8c0a32e0 CR3: 0000000002e10000 CR4: 00000000000006f0 > Call Trace: > send_signal+0x1c8/0x310 > do_notify_parent+0x50f/0x550 > release_task.part.21+0x4fd/0x620 > do_exit+0x6f6/0xaf0 > do_group_exit+0x42/0xb0 > get_signal+0x13b/0xbb0 > do_signal+0x2b/0x670 > ? __audit_syscall_exit+0x24d/0x2b0 > ? rcu_read_lock_sched_held+0x4d/0x60 > ? kfree+0x24c/0x2b0 > do_syscall_64+0x176/0x640 > ? trace_hardirqs_off_thunk+0x1a/0x1c > entry_SYSCALL_64_after_hwframe+0x49/0xb3 The immediate problem is as Christof noticed that "pid_alive(current) == false". This happens because do_notify_parent is called from the last thread to exit in a process after that thread has been reaped. The bigger issue is that do_notify_parent can be called from any process that manages to wait on a thread of a multi-threaded process from wait_task_zombie. So any logic based upon current for do_notify_parent is just nonsense, as current can be pretty much anything. So change do_notify_parent to call __send_signal directly. Inspecting the code it appears this problem has existed since the pid namespace support started handling this case in 2.6.30. This fix only backports to 7a0cf094944e ("signal: Correct namespace fixups of si_pid and si_uid") where the problem logic was moved out of __send_signal and into send_signal. Cc: stable@vger.kernel.org Fixes: 6588c1e3ff01 ("signals: SI_USER: Masquerade si_pid when crossing pid ns boundary") Ref: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals") Link: https://lore.kernel.org/lkml/20200419201336.GI22017@edge.cmeerw.net/ Reported-by: Christof Meerwald Acked-by: Oleg Nesterov Acked-by: Christian Brauner Signed-off-by: "Eric W. Biederman" --- kernel/signal.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/signal.c b/kernel/signal.c index e58a6c619824..7938c60e11dd 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1993,8 +1993,12 @@ bool do_notify_parent(struct task_struct *tsk, int sig) if (psig->action[SIGCHLD-1].sa.sa_handler == SIG_IGN) sig = 0; } + /* + * Send with __send_signal as si_pid and si_uid are in the + * parent's namespaces. + */ if (valid_signal(sig) && sig) - __group_send_sig_info(sig, &info, tsk->parent); + __send_signal(sig, &info, tsk->parent, PIDTYPE_TGID, false); __wake_up_parent(tsk, tsk->parent); spin_unlock_irqrestore(&psig->siglock, flags); -- cgit v1.2.3-71-gd317 From eaf5a92ebde5bca3bb2565616115bd6d579486cd Mon Sep 17 00:00:00 2001 From: Quentin Perret Date: Thu, 16 Apr 2020 09:59:56 +0100 Subject: sched/core: Fix reset-on-fork from RT with uclamp uclamp_fork() resets the uclamp values to their default when the reset-on-fork flag is set. It also checks whether the task has a RT policy, and sets its uclamp.min to 1024 accordingly. However, during reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after, hence leading to an erroneous uclamp.min setting for the new task if it was forked from RT. Fix this by removing the unnecessary check on rt_task() in uclamp_fork() as this doesn't make sense if the reset-on-fork flag is set. Fixes: 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks") Reported-by: Chitti Babu Theegala Signed-off-by: Quentin Perret Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Patrick Bellasi Reviewed-by: Dietmar Eggemann Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com --- kernel/sched/core.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3a61a3b8eaa9..9a2fbf98fd6f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -1232,13 +1232,8 @@ static void uclamp_fork(struct task_struct *p) return; for_each_clamp_id(clamp_id) { - unsigned int clamp_value = uclamp_none(clamp_id); - - /* By default, RT tasks always get 100% boost */ - if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN)) - clamp_value = uclamp_none(UCLAMP_MAX); - - uclamp_se_set(&p->uclamp_req[clamp_id], clamp_value, false); + uclamp_se_set(&p->uclamp_req[clamp_id], + uclamp_none(clamp_id), false); } } -- cgit v1.2.3-71-gd317 From f3bed55e850926614b9898fe982f66d2541a36a5 Mon Sep 17 00:00:00 2001 From: Ian Rogers Date: Fri, 17 Apr 2020 11:28:42 -0700 Subject: perf/core: fix parent pid/tid in task exit events Current logic yields the child task as the parent. Before: $ perf record bash -c "perf list > /dev/null" $ perf script -D |grep 'FORK\|EXIT' 4387036190981094 0x5a70 [0x30]: PERF_RECORD_FORK(10472:10472):(10470:10470) 4387036606207580 0xf050 [0x30]: PERF_RECORD_EXIT(10472:10472):(10472:10472) 4387036607103839 0x17150 [0x30]: PERF_RECORD_EXIT(10470:10470):(10470:10470) ^ Note the repeated values here -------------------/ After: 383281514043 0x9d8 [0x30]: PERF_RECORD_FORK(2268:2268):(2266:2266) 383442003996 0x2180 [0x30]: PERF_RECORD_EXIT(2268:2268):(2266:2266) 383451297778 0xb70 [0x30]: PERF_RECORD_EXIT(2266:2266):(2265:2265) Fixes: 94d5d1b2d891 ("perf_counter: Report the cloning task as parent on perf_counter_fork()") Reported-by: KP Singh Signed-off-by: Ian Rogers Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200417182842.12522-1-irogers@google.com --- kernel/events/core.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/events/core.c b/kernel/events/core.c index bc9b98a9af9a..633b4ae72ed5 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7491,10 +7491,17 @@ static void perf_event_task_output(struct perf_event *event, goto out; task_event->event_id.pid = perf_event_pid(event, task); - task_event->event_id.ppid = perf_event_pid(event, current); - task_event->event_id.tid = perf_event_tid(event, task); - task_event->event_id.ptid = perf_event_tid(event, current); + + if (task_event->event_id.header.type == PERF_RECORD_EXIT) { + task_event->event_id.ppid = perf_event_pid(event, + task->real_parent); + task_event->event_id.ptid = perf_event_pid(event, + task->real_parent); + } else { /* PERF_RECORD_FORK */ + task_event->event_id.ppid = perf_event_pid(event, current); + task_event->event_id.ptid = perf_event_tid(event, current); + } task_event->event_id.time = perf_event_clock(event); -- cgit v1.2.3-71-gd317 From 9da73974eb9c965dd9989befb593b8c8da9e4bdc Mon Sep 17 00:00:00 2001 From: Vamshi K Sthambamkadi Date: Wed, 22 Apr 2020 11:45:06 +0530 Subject: tracing: Fix memory leaks in trace_events_hist.c kmemleak report 1: [<9092c50b>] kmem_cache_alloc_trace+0x138/0x270 [<05a2c9ed>] create_field_var+0xcf/0x180 [<528a2d68>] action_create+0xe2/0xc80 [<63f50b61>] event_hist_trigger_func+0x15b5/0x1920 [<28ea5d3d>] trigger_process_regex+0x7b/0xc0 [<3138e86f>] event_trigger_write+0x4d/0xb0 [] __vfs_write+0x30/0x200 [<4f424a0d>] vfs_write+0x96/0x1b0 [] ksys_write+0x53/0xc0 [<3717101a>] __ia32_sys_write+0x15/0x20 [] do_fast_syscall_32+0x70/0x250 [<46e2629c>] entry_SYSENTER_32+0xaf/0x102 This is because save_vars[] of struct hist_trigger_data are not destroyed kmemleak report 2: [<9092c50b>] kmem_cache_alloc_trace+0x138/0x270 [<6e5e97c5>] create_var+0x3c/0x110 [] create_field_var+0xaf/0x180 [<528a2d68>] action_create+0xe2/0xc80 [<63f50b61>] event_hist_trigger_func+0x15b5/0x1920 [<28ea5d3d>] trigger_process_regex+0x7b/0xc0 [<3138e86f>] event_trigger_write+0x4d/0xb0 [] __vfs_write+0x30/0x200 [<4f424a0d>] vfs_write+0x96/0x1b0 [] ksys_write+0x53/0xc0 [<3717101a>] __ia32_sys_write+0x15/0x20 [] do_fast_syscall_32+0x70/0x250 [<46e2629c>] entry_SYSENTER_32+0xaf/0x102 struct hist_field allocated through create_var() do not initialize "ref" field to 1. The code in __destroy_hist_field() does not destroy object if "ref" is initialized to zero, the condition if (--hist_field->ref > 1) always passes since unsigned int wraps. kmemleak report 3: [] __kmalloc_track_caller+0x139/0x2b0 [] kstrdup+0x27/0x50 [<39d70006>] init_var_ref+0x58/0xd0 [<8ca76370>] create_var_ref+0x89/0xe0 [] action_create+0x38f/0xc80 [<7c146821>] event_hist_trigger_func+0x15b5/0x1920 [<07de3f61>] trigger_process_regex+0x7b/0xc0 [] event_trigger_write+0x4d/0xb0 [<19bf1512>] __vfs_write+0x30/0x200 [<64ce4d27>] vfs_write+0x96/0x1b0 [] ksys_write+0x53/0xc0 [<7d4230cd>] __ia32_sys_write+0x15/0x20 [<8eadca00>] do_fast_syscall_32+0x70/0x250 [<235cf985>] entry_SYSENTER_32+0xaf/0x102 hist_fields (system & event_name) are not freed Link: http://lkml.kernel.org/r/20200422061503.GA5151@cosmos Signed-off-by: Vamshi K Sthambamkadi Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_events_hist.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_events_hist.c b/kernel/trace/trace_events_hist.c index 5f6834a2bf41..fcab11cc6833 100644 --- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -3320,6 +3320,9 @@ static void __destroy_hist_field(struct hist_field *hist_field) kfree(hist_field->name); kfree(hist_field->type); + kfree(hist_field->system); + kfree(hist_field->event_name); + kfree(hist_field); } @@ -4382,6 +4385,7 @@ static struct hist_field *create_var(struct hist_trigger_data *hist_data, goto out; } + var->ref = 1; var->flags = HIST_FIELD_FL_VAR; var->var.idx = idx; var->var.hist_data = var->hist_data = hist_data; @@ -5011,6 +5015,9 @@ static void destroy_field_vars(struct hist_trigger_data *hist_data) for (i = 0; i < hist_data->n_field_vars; i++) destroy_field_var(hist_data->field_vars[i]); + + for (i = 0; i < hist_data->n_save_vars; i++) + destroy_field_var(hist_data->save_vars[i]); } static void save_field_var(struct hist_trigger_data *hist_data, -- cgit v1.2.3-71-gd317 From 353da87921a5ec654e7e9024e083f099f1b33c97 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 22 Apr 2020 21:38:45 -0400 Subject: ftrace: Fix memory leak caused by not freeing entry in unregister_ftrace_direct() kmemleak reported the following: unreferenced object 0xffff90d47127a920 (size 32): comm "modprobe", pid 1766, jiffies 4294792031 (age 162.568s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 22 01 00 00 00 00 ad de ........"....... 00 78 12 a7 ff ff ff ff 00 00 b6 c0 ff ff ff ff .x.............. backtrace: [<00000000bb79e72e>] register_ftrace_direct+0xcb/0x3a0 [<00000000295e4f79>] do_one_initcall+0x72/0x340 [<00000000873ead18>] do_init_module+0x5a/0x220 [<00000000974d9de5>] load_module+0x2235/0x2550 [<0000000059c3d6ce>] __do_sys_finit_module+0xc0/0x120 [<000000005a8611b4>] do_syscall_64+0x60/0x230 [<00000000a0cdc49e>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 The entry used to save the direct descriptor needs to be freed when unregistering. Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ftrace.c | 1 + 1 file changed, 1 insertion(+) (limited to 'kernel') diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c index 041694a1eb74..bd030b1b9514 100644 --- a/kernel/trace/ftrace.c +++ b/kernel/trace/ftrace.c @@ -5165,6 +5165,7 @@ int unregister_ftrace_direct(unsigned long ip, unsigned long addr) list_del_rcu(&direct->next); synchronize_rcu_tasks(); kfree(direct); + kfree(entry); ftrace_direct_func_count--; } } -- cgit v1.2.3-71-gd317 From d013496f99c5608e0f80afd67acb1ba93c4144ea Mon Sep 17 00:00:00 2001 From: Jason Yan Date: Fri, 10 Apr 2020 15:33:12 +0800 Subject: tracing: Convert local functions in tracing_map.c to static Fix the following sparse warning: kernel/trace/tracing_map.c:286:6: warning: symbol 'tracing_map_array_clear' was not declared. Should it be static? kernel/trace/tracing_map.c:297:6: warning: symbol 'tracing_map_array_free' was not declared. Should it be static? kernel/trace/tracing_map.c:319:26: warning: symbol 'tracing_map_array_alloc' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20200410073312.38855-1-yanaijie@huawei.com Reported-by: Hulk Robot Signed-off-by: Jason Yan Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/tracing_map.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c index 9e31bfc818ff..74738c9856f1 100644 --- a/kernel/trace/tracing_map.c +++ b/kernel/trace/tracing_map.c @@ -283,7 +283,7 @@ int tracing_map_add_key_field(struct tracing_map *map, return idx; } -void tracing_map_array_clear(struct tracing_map_array *a) +static void tracing_map_array_clear(struct tracing_map_array *a) { unsigned int i; @@ -294,7 +294,7 @@ void tracing_map_array_clear(struct tracing_map_array *a) memset(a->pages[i], 0, PAGE_SIZE); } -void tracing_map_array_free(struct tracing_map_array *a) +static void tracing_map_array_free(struct tracing_map_array *a) { unsigned int i; @@ -316,7 +316,7 @@ void tracing_map_array_free(struct tracing_map_array *a) kfree(a); } -struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts, +static struct tracing_map_array *tracing_map_array_alloc(unsigned int n_elts, unsigned int entry_size) { struct tracing_map_array *a; -- cgit v1.2.3-71-gd317 From 6ade99ec6175ab2b54c227521e181e1c3c2bfc8a Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Fri, 24 Apr 2020 15:41:20 -0500 Subject: proc: Put thread_pid in release_task not proc_flush_pid Oleg pointed out that in the unlikely event the kernel is compiled with CONFIG_PROC_FS unset that release_task will now leak the pid. Move the put_pid out of proc_flush_pid into release_task to fix this and to guarantee I don't make that mistake again. When possible it makes sense to keep get and put in the same function so it can easily been seen how they pair up. Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc") Reported-by: Oleg Nesterov Signed-off-by: "Eric W. Biederman" --- fs/proc/base.c | 1 - kernel/exit.c | 1 + 2 files changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/fs/proc/base.c b/fs/proc/base.c index 6042b646ab27..42f43c7b9669 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3274,7 +3274,6 @@ static const struct inode_operations proc_tgid_base_inode_operations = { void proc_flush_pid(struct pid *pid) { proc_invalidate_siblings_dcache(&pid->inodes, &pid->lock); - put_pid(pid); } static struct dentry *proc_pid_instantiate(struct dentry * dentry, diff --git a/kernel/exit.c b/kernel/exit.c index 389a88cb3081..ce2a75bc0ade 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -219,6 +219,7 @@ repeat: write_unlock_irq(&tasklist_lock); proc_flush_pid(thread_pid); + put_pid(thread_pid); release_thread(p); put_task_struct_rcu_user(p); -- cgit v1.2.3-71-gd317 From 4adb7a4a151c65ac7e9c3a1aa462c84190d48385 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Thu, 23 Apr 2020 22:20:44 -0700 Subject: bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd Fix bug of not putting bpf_link in LINK_UPDATE command. Also enforce zeroed old_prog_fd if no BPF_F_REPLACE flag is specified. Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/20200424052045.4002963-1-andriin@fb.com --- kernel/bpf/syscall.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index d85f37239540..bca58c235ac0 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -3628,8 +3628,10 @@ static int link_update(union bpf_attr *attr) return PTR_ERR(link); new_prog = bpf_prog_get(attr->link_update.new_prog_fd); - if (IS_ERR(new_prog)) - return PTR_ERR(new_prog); + if (IS_ERR(new_prog)) { + ret = PTR_ERR(new_prog); + goto out_put_link; + } if (flags & BPF_F_REPLACE) { old_prog = bpf_prog_get(attr->link_update.old_prog_fd); @@ -3638,6 +3640,9 @@ static int link_update(union bpf_attr *attr) old_prog = NULL; goto out_put_progs; } + } else if (attr->link_update.old_prog_fd) { + ret = -EINVAL; + goto out_put_progs; } #ifdef CONFIG_CGROUP_BPF @@ -3653,6 +3658,8 @@ out_put_progs: bpf_prog_put(old_prog); if (ret) bpf_prog_put(new_prog); +out_put_link: + bpf_link_put(link); return ret; } -- cgit v1.2.3-71-gd317 From 03f87c0b45b177ba5f6b4a9bbe9f95e4aba31026 Mon Sep 17 00:00:00 2001 From: Toke Høiland-Jørgensen Date: Fri, 24 Apr 2020 15:34:27 +0200 Subject: bpf: Propagate expected_attach_type when verifying freplace programs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit For some program types, the verifier relies on the expected_attach_type of the program being verified in the verification process. However, for freplace programs, the attach type was not propagated along with the verifier ops, so the expected_attach_type would always be zero for freplace programs. This in turn caused the verifier to sometimes make the wrong call for freplace programs. For all existing uses of expected_attach_type for this purpose, the result of this was only false negatives (i.e., freplace functions would be rejected by the verifier even though they were valid programs for the target they were replacing). However, should a false positive be introduced, this can lead to out-of-bounds accesses and/or crashes. The fix introduced in this patch is to propagate the expected_attach_type to the freplace program during verification, and reset it after that is done. Fixes: be8704ff07d2 ("bpf: Introduce dynamic program extensions") Signed-off-by: Toke Høiland-Jørgensen Signed-off-by: Alexei Starovoitov Link: https://lore.kernel.org/bpf/158773526726.293902.13257293296560360508.stgit@toke.dk --- kernel/bpf/verifier.c | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 9382609147f5..fa1d8245b925 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -10497,6 +10497,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) return -EINVAL; } env->ops = bpf_verifier_ops[tgt_prog->type]; + prog->expected_attach_type = tgt_prog->expected_attach_type; } if (!tgt_prog->jited) { verbose(env, "Can attach to only JITed progs\n"); @@ -10841,6 +10842,13 @@ err_release_maps: * them now. Otherwise free_used_maps() will release them. */ release_maps(env); + + /* extension progs temporarily inherit the attach_type of their targets + for verification purposes, so set it back to zero before returning + */ + if (env->prog->type == BPF_PROG_TYPE_EXT) + env->prog->expected_attach_type = 0; + *prog = env->prog; err_unlock: if (!is_priv) -- cgit v1.2.3-71-gd317 From 6f302bfb221470e712ce3e5911fb83cdca174387 Mon Sep 17 00:00:00 2001 From: Zou Wei Date: Thu, 23 Apr 2020 10:32:40 +0800 Subject: bpf: Make bpf_link_fops static Fix the following sparse warning: kernel/bpf/syscall.c:2289:30: warning: symbol 'bpf_link_fops' was not declared. Should it be static? Reported-by: Hulk Robot Signed-off-by: Zou Wei Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/1587609160-117806-1-git-send-email-zou_wei@huawei.com --- kernel/bpf/syscall.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index bca58c235ac0..7626b8024471 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2283,7 +2283,7 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp) } #endif -const struct file_operations bpf_link_fops = { +static const struct file_operations bpf_link_fops = { #ifdef CONFIG_PROC_FS .show_fdinfo = bpf_link_show_fdinfo, #endif -- cgit v1.2.3-71-gd317 From 2351f8d295ed63393190e39c2f7c1fee1a80578f Mon Sep 17 00:00:00 2001 From: Dexuan Cui Date: Thu, 23 Apr 2020 20:40:16 -0700 Subject: PM: hibernate: Freeze kernel threads in software_resume() Currently the kernel threads are not frozen in software_resume(), so between dpm_suspend_start(PMSG_QUIESCE) and resume_target_kernel(), system_freezable_power_efficient_wq can still try to submit SCSI commands and this can cause a panic since the low level SCSI driver (e.g. hv_storvsc) has quiesced the SCSI adapter and can not accept any SCSI commands: https://lkml.org/lkml/2020/4/10/47 At first I posted a fix (https://lkml.org/lkml/2020/4/21/1318) trying to resolve the issue from hv_storvsc, but with the help of Bart Van Assche, I realized it's better to fix software_resume(), since this looks like a generic issue, not only pertaining to SCSI. Cc: All applicable Signed-off-by: Dexuan Cui Signed-off-by: Rafael J. Wysocki --- kernel/power/hibernate.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 86aba8706b16..30bd28d1d418 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -898,6 +898,13 @@ static int software_resume(void) error = freeze_processes(); if (error) goto Close_Finish; + + error = freeze_kernel_threads(); + if (error) { + thaw_processes(); + goto Close_Finish; + } + error = load_image_and_restore(); thaw_processes(); Finish: -- cgit v1.2.3-71-gd317 From 3740d93e37902b31159a82da2d5c8812ed825404 Mon Sep 17 00:00:00 2001 From: Luis Chamberlain Date: Thu, 16 Apr 2020 16:28:59 +0000 Subject: coredump: fix crash when umh is disabled Commit 64e90a8acb859 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") added the optiont to disable all call_usermodehelper() calls by setting STATIC_USERMODEHELPER_PATH to an empty string. When this is done, and crashdump is triggered, it will crash on null pointer dereference, since we make assumptions over what call_usermodehelper_exec() did. This has been reported by Sergey when one triggers a a coredump with the following configuration: ``` CONFIG_STATIC_USERMODEHELPER=y CONFIG_STATIC_USERMODEHELPER_PATH="" kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e ``` The way disabling the umh was designed was that call_usermodehelper_exec() would just return early, without an error. But coredump assumes certain variables are set up for us when this happens, and calls ile_start_write(cprm.file) with a NULL file. [ 2.819676] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 2.819859] #PF: supervisor read access in kernel mode [ 2.820035] #PF: error_code(0x0000) - not-present page [ 2.820188] PGD 0 P4D 0 [ 2.820305] Oops: 0000 [#1] SMP PTI [ 2.820436] CPU: 2 PID: 89 Comm: a Not tainted 5.7.0-rc1+ #7 [ 2.820680] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014 [ 2.821150] RIP: 0010:do_coredump+0xd80/0x1060 [ 2.821385] Code: e8 95 11 ed ff 48 c7 c6 cc a7 b4 81 48 8d bd 28 ff ff ff 89 c2 e8 70 f1 ff ff 41 89 c2 85 c0 0f 84 72 f7 ff ff e9 b4 fe ff ff <48> 8b 57 20 0f b7 02 66 25 00 f0 66 3d 00 8 0 0f 84 9c 01 00 00 44 [ 2.822014] RSP: 0000:ffffc9000029bcb8 EFLAGS: 00010246 [ 2.822339] RAX: 0000000000000000 RBX: ffff88803f860000 RCX: 000000000000000a [ 2.822746] RDX: 0000000000000009 RSI: 0000000000000282 RDI: 0000000000000000 [ 2.823141] RBP: ffffc9000029bde8 R08: 0000000000000000 R09: ffffc9000029bc00 [ 2.823508] R10: 0000000000000001 R11: ffff88803dec90be R12: ffffffff81c39da0 [ 2.823902] R13: ffff88803de84400 R14: 0000000000000000 R15: 0000000000000000 [ 2.824285] FS: 00007fee08183540(0000) GS:ffff88803e480000(0000) knlGS:0000000000000000 [ 2.824767] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.825111] CR2: 0000000000000020 CR3: 000000003f856005 CR4: 0000000000060ea0 [ 2.825479] Call Trace: [ 2.825790] get_signal+0x11e/0x720 [ 2.826087] do_signal+0x1d/0x670 [ 2.826361] ? force_sig_info_to_task+0xc1/0xf0 [ 2.826691] ? force_sig_fault+0x3c/0x40 [ 2.826996] ? do_trap+0xc9/0x100 [ 2.827179] exit_to_usermode_loop+0x49/0x90 [ 2.827359] prepare_exit_to_usermode+0x77/0xb0 [ 2.827559] ? invalid_op+0xa/0x30 [ 2.827747] ret_from_intr+0x20/0x20 [ 2.827921] RIP: 0033:0x55e2c76d2129 [ 2.828107] Code: 2d ff ff ff e8 68 ff ff ff 5d c6 05 18 2f 00 00 01 c3 0f 1f 80 00 00 00 00 c3 0f 1f 80 00 00 00 00 e9 7b ff ff ff 55 48 89 e5 <0f> 0b b8 00 00 00 00 5d c3 66 2e 0f 1f 84 0 0 00 00 00 00 0f 1f 40 [ 2.828603] RSP: 002b:00007fffeba5e080 EFLAGS: 00010246 [ 2.828801] RAX: 000055e2c76d2125 RBX: 0000000000000000 RCX: 00007fee0817c718 [ 2.829034] RDX: 00007fffeba5e188 RSI: 00007fffeba5e178 RDI: 0000000000000001 [ 2.829257] RBP: 00007fffeba5e080 R08: 0000000000000000 R09: 00007fee08193c00 [ 2.829482] R10: 0000000000000009 R11: 0000000000000000 R12: 000055e2c76d2040 [ 2.829727] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 2.829964] CR2: 0000000000000020 [ 2.830149] ---[ end trace ceed83d8c68a1bf1 ]--- ``` Cc: # v4.11+ Fixes: 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199795 Reported-by: Tony Vroon Reported-by: Sergey Kvachonok Tested-by: Sergei Trofimovich Signed-off-by: Luis Chamberlain Link: https://lore.kernel.org/r/20200416162859.26518-1-mcgrof@kernel.org Signed-off-by: Greg Kroah-Hartman --- fs/coredump.c | 8 ++++++++ kernel/umh.c | 5 +++++ 2 files changed, 13 insertions(+) (limited to 'kernel') diff --git a/fs/coredump.c b/fs/coredump.c index 408418e6aa13..478a0d810136 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -788,6 +788,14 @@ void do_coredump(const kernel_siginfo_t *siginfo) if (displaced) put_files_struct(displaced); if (!dump_interrupted()) { + /* + * umh disabled with CONFIG_STATIC_USERMODEHELPER_PATH="" would + * have this set to NULL. + */ + if (!cprm.file) { + pr_info("Core dump to |%s disabled\n", cn.corename); + goto close_fail; + } file_start_write(cprm.file); core_dumped = binfmt->core_dump(&cprm); file_end_write(cprm.file); diff --git a/kernel/umh.c b/kernel/umh.c index 7f255b5a8845..11bf5eea474c 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -544,6 +544,11 @@ EXPORT_SYMBOL_GPL(fork_usermode_blob); * Runs a user-space application. The application is started * asynchronously if wait is not set, and runs as a child of system workqueues. * (ie. it runs with full root capabilities and optimized affinity). + * + * Note: successful return value does not guarantee the helper was called at + * all. You can't rely on sub_info->{init,cleanup} being called even for + * UMH_WAIT_* wait modes as STATIC_USERMODEHELPER_PATH="" turns all helpers + * into a successful no-op. */ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) { -- cgit v1.2.3-71-gd317 From 7f645462ca01d01abb94d75e6768c8b3ed3a188b Mon Sep 17 00:00:00 2001 From: Wei Yongjun Date: Thu, 30 Apr 2020 08:18:51 +0000 Subject: bpf: Fix error return code in map_lookup_and_delete_elem() Fix to return negative error code -EFAULT from the copy_to_user() error handling case instead of 0, as done elsewhere in this function. Fixes: bd513cd08f10 ("bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall") Signed-off-by: Wei Yongjun Signed-off-by: Daniel Borkmann Link: https://lore.kernel.org/bpf/20200430081851.166996-1-weiyongjun1@huawei.com --- kernel/bpf/syscall.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7626b8024471..2843bbba9ca1 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1485,8 +1485,10 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr) if (err) goto free_value; - if (copy_to_user(uvalue, value, value_size) != 0) + if (copy_to_user(uvalue, value, value_size) != 0) { + err = -EFAULT; goto free_value; + } err = 0; -- cgit v1.2.3-71-gd317 From dcbd21c9fca5e954fd4e3d91884907eb6d47187e Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Sat, 25 Apr 2020 14:49:09 +0900 Subject: tracing/kprobes: Fix a double initialization typo Fix a typo that resulted in an unnecessary double initialization to addr. Link: http://lkml.kernel.org/r/158779374968.6082.2337484008464939919.stgit@devnote2 Cc: Tom Zanussi Cc: Ingo Molnar Cc: stable@vger.kernel.org Fixes: c7411a1a126f ("tracing/kprobe: Check whether the non-suffixed symbol is notrace") Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d0568af4a0ef..0d9300c3b084 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -453,7 +453,7 @@ static bool __within_notrace_func(unsigned long addr) static bool within_notrace_func(struct trace_kprobe *tk) { - unsigned long addr = addr = trace_kprobe_address(tk); + unsigned long addr = trace_kprobe_address(tk); char symname[KSYM_NAME_LEN], *p; if (!__within_notrace_func(addr)) -- cgit v1.2.3-71-gd317 From da0f1f4167e3af69e1d8b32d6d65195ddd2bfb64 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Sat, 25 Apr 2020 14:49:17 +0900 Subject: tracing/boottime: Fix kprobe event API usage Fix boottime kprobe events to use API correctly for multiple events. For example, when we set a multiprobe kprobe events in bootconfig like below, ftrace.event.kprobes.myevent { probes = "vfs_read $arg1 $arg2", "vfs_write $arg1 $arg2" } This cause an error; trace_boot: Failed to add probe: p:kprobes/myevent (null) vfs_read $arg1 $arg2 vfs_write $arg1 $arg2 This shows the 1st argument becomes NULL and multiprobes are merged to 1 probe. Link: http://lkml.kernel.org/r/158779375766.6082.201939936008972838.stgit@devnote2 Cc: Ingo Molnar Cc: stable@vger.kernel.org Fixes: 29a154810546 ("tracing: Change trace_boot to use kprobe_event interface") Reviewed-by: Tom Zanussi Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_boot.c | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/trace_boot.c b/kernel/trace/trace_boot.c index 06d7feb5255f..9de29bb45a27 100644 --- a/kernel/trace/trace_boot.c +++ b/kernel/trace/trace_boot.c @@ -95,24 +95,20 @@ trace_boot_add_kprobe_event(struct xbc_node *node, const char *event) struct xbc_node *anode; char buf[MAX_BUF_LEN]; const char *val; - int ret; + int ret = 0; - kprobe_event_cmd_init(&cmd, buf, MAX_BUF_LEN); + xbc_node_for_each_array_value(node, "probes", anode, val) { + kprobe_event_cmd_init(&cmd, buf, MAX_BUF_LEN); - ret = kprobe_event_gen_cmd_start(&cmd, event, NULL); - if (ret) - return ret; + ret = kprobe_event_gen_cmd_start(&cmd, event, val); + if (ret) + break; - xbc_node_for_each_array_value(node, "probes", anode, val) { - ret = kprobe_event_add_field(&cmd, val); + ret = kprobe_event_gen_cmd_end(&cmd); if (ret) - return ret; + pr_err("Failed to add probe: %s\n", buf); } - ret = kprobe_event_gen_cmd_end(&cmd); - if (ret) - pr_err("Failed to add probe: %s\n", buf); - return ret; } #else -- cgit v1.2.3-71-gd317 From 5b4dcd2d201a395ad4054067bfae4a07554fbd65 Mon Sep 17 00:00:00 2001 From: Masami Hiramatsu Date: Sat, 25 Apr 2020 14:49:26 +0900 Subject: tracing/kprobes: Reject new event if loc is NULL Reject the new event which has NULL location for kprobes. For kprobes, user must specify at least the location. Link: http://lkml.kernel.org/r/158779376597.6082.1411212055469099461.stgit@devnote2 Cc: Tom Zanussi Cc: Ingo Molnar Cc: stable@vger.kernel.org Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Signed-off-by: Masami Hiramatsu Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace_kprobe.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 0d9300c3b084..35989383ae11 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -940,6 +940,9 @@ EXPORT_SYMBOL_GPL(kprobe_event_cmd_init); * complete command or only the first part of it; in the latter case, * kprobe_event_add_fields() can be used to add more fields following this. * + * Unlikely the synth_event_gen_cmd_start(), @loc must be specified. This + * returns -EINVAL if @loc == NULL. + * * Return: 0 if successful, error otherwise. */ int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe, @@ -953,6 +956,9 @@ int __kprobe_event_gen_cmd_start(struct dynevent_cmd *cmd, bool kretprobe, if (cmd->type != DYNEVENT_TYPE_KPROBE) return -EINVAL; + if (!loc) + return -EINVAL; + if (kretprobe) snprintf(buf, MAX_EVENT_NAME_LEN, "r:kprobes/%s", name); else -- cgit v1.2.3-71-gd317 From d16a8c31077e75ecb9427fbfea59b74eed00f698 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 6 May 2020 10:20:10 -0400 Subject: tracing: Wait for preempt irq delay thread to finish Running on a slower machine, it is possible that the preempt delay kernel thread may still be executing if the module was immediately removed after added, and this can cause the kernel to crash as the kernel thread might be executing after its code has been removed. There's no reason that the caller of the code shouldn't just wait for the delay thread to finish, as the thread can also be created by a trigger in the sysfs code, which also has the same issues. Link: http://lore.kernel.org/r/5EA2B0C8.2080706@cn.fujitsu.com Cc: stable@vger.kernel.org Fixes: 793937236d1ee ("lib: Add module for testing preemptoff/irqsoff latency tracers") Reported-by: Xiao Yang Reviewed-by: Xiao Yang Reviewed-by: Joel Fernandes Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/preemptirq_delay_test.c | 30 ++++++++++++++++++++++++------ 1 file changed, 24 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c index 31c0fad4cb9e..c4c86de63cf9 100644 --- a/kernel/trace/preemptirq_delay_test.c +++ b/kernel/trace/preemptirq_delay_test.c @@ -113,22 +113,42 @@ static int preemptirq_delay_run(void *data) for (i = 0; i < s; i++) (testfuncs[i])(i); + + set_current_state(TASK_INTERRUPTIBLE); + while (!kthread_should_stop()) { + schedule(); + set_current_state(TASK_INTERRUPTIBLE); + } + + __set_current_state(TASK_RUNNING); + return 0; } -static struct task_struct *preemptirq_start_test(void) +static int preemptirq_run_test(void) { + struct task_struct *task; + char task_name[50]; snprintf(task_name, sizeof(task_name), "%s_test", test_mode); - return kthread_run(preemptirq_delay_run, NULL, task_name); + task = kthread_run(preemptirq_delay_run, NULL, task_name); + if (IS_ERR(task)) + return PTR_ERR(task); + if (task) + kthread_stop(task); + return 0; } static ssize_t trigger_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { - preemptirq_start_test(); + ssize_t ret; + + ret = preemptirq_run_test(); + if (ret) + return ret; return count; } @@ -148,11 +168,9 @@ static struct kobject *preemptirq_delay_kobj; static int __init preemptirq_delay_init(void) { - struct task_struct *test_task; int retval; - test_task = preemptirq_start_test(); - retval = PTR_ERR_OR_ZERO(test_task); + retval = preemptirq_run_test(); if (retval != 0) return retval; -- cgit v1.2.3-71-gd317 From 11f5efc3ab66284f7aaacc926e9351d658e2577b Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 6 May 2020 10:36:18 -0400 Subject: tracing: Add a vmalloc_sync_mappings() for safe measure x86_64 lazily maps in the vmalloc pages, and the way this works with per_cpu areas can be complex, to say the least. Mappings may happen at boot up, and if nothing synchronizes the page tables, those page mappings may not be synced till they are used. This causes issues for anything that might touch one of those mappings in the path of the page fault handler. When one of those unmapped mappings is touched in the page fault handler, it will cause another page fault, which in turn will cause a page fault, and leave us in a loop of page faults. Commit 763802b53a42 ("x86/mm: split vmalloc_sync_all()") split vmalloc_sync_all() into vmalloc_sync_unmappings() and vmalloc_sync_mappings(), as on system exit, it did not need to do a full sync on x86_64 (although it still needed to be done on x86_32). By chance, the vmalloc_sync_all() would synchronize the page mappings done at boot up and prevent the per cpu area from being a problem for tracing in the page fault handler. But when that synchronization in the exit of a task became a nop, it caused the problem to appear. Link: https://lore.kernel.org/r/20200429054857.66e8e333@oasis.local.home Cc: stable@vger.kernel.org Fixes: 737223fbca3b1 ("tracing: Consolidate buffer allocation code") Reported-by: "Tzvetomir Stoyanov (VMware)" Suggested-by: Joerg Roedel Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8d2b98812625..9ed6d92768af 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -8525,6 +8525,19 @@ static int allocate_trace_buffers(struct trace_array *tr, int size) */ allocate_snapshot = false; #endif + + /* + * Because of some magic with the way alloc_percpu() works on + * x86_64, we need to synchronize the pgd of all the tables, + * otherwise the trace events that happen in x86_64 page fault + * handlers can't cope with accessing the chance that a + * alloc_percpu()'d memory might be touched in the page fault trace + * event. Oh, and we need to audit all other alloc_percpu() and vmalloc() + * calls in tracing, because something might get triggered within a + * page fault trace event! + */ + vmalloc_sync_mappings(); + return 0; } -- cgit v1.2.3-71-gd317 From 192b7993b3ff92b62b687e940e5e88fa0218d764 Mon Sep 17 00:00:00 2001 From: Zou Wei Date: Thu, 23 Apr 2020 12:08:25 +0800 Subject: tracing: Make tracing_snapshot_instance_cond() static Fix the following sparse warning: kernel/trace/trace.c:950:6: warning: symbol 'tracing_snapshot_instance_cond' was not declared. Should it be static? Link: http://lkml.kernel.org/r/1587614905-48692-1-git-send-email-zou_wei@huawei.com Reported-by: Hulk Robot Signed-off-by: Zou Wei Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/trace.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 9ed6d92768af..29615f15a820 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -947,7 +947,8 @@ int __trace_bputs(unsigned long ip, const char *str) EXPORT_SYMBOL_GPL(__trace_bputs); #ifdef CONFIG_TRACER_SNAPSHOT -void tracing_snapshot_instance_cond(struct trace_array *tr, void *cond_data) +static void tracing_snapshot_instance_cond(struct trace_array *tr, + void *cond_data) { struct tracer *tracer = tr->current_trace; unsigned long flags; -- cgit v1.2.3-71-gd317 From 324cfb19567c80ed71d7a02f1d5ff4621902f4c3 Mon Sep 17 00:00:00 2001 From: Maciej Grochowski Date: Thu, 7 May 2020 18:35:49 -0700 Subject: kernel/kcov.c: fix typos in kcov_remote_start documentation Signed-off-by: Maciej Grochowski Signed-off-by: Andrew Morton Reviewed-by: Andrey Konovalov Link: http://lkml.kernel.org/r/20200420030259.31674-1-maciek.grochowski@gmail.com Signed-off-by: Linus Torvalds --- kernel/kcov.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/kcov.c b/kernel/kcov.c index f50354202dbe..8accc9722a81 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -740,8 +740,8 @@ static const struct file_operations kcov_fops = { * kcov_remote_handle() with KCOV_SUBSYSTEM_COMMON as the subsystem id and an * arbitrary 4-byte non-zero number as the instance id). This common handle * then gets saved into the task_struct of the process that issued the - * KCOV_REMOTE_ENABLE ioctl. When this proccess issues system calls that spawn - * kernel threads, the common handle must be retrived via kcov_common_handle() + * KCOV_REMOTE_ENABLE ioctl. When this process issues system calls that spawn + * kernel threads, the common handle must be retrieved via kcov_common_handle() * and passed to the spawned threads via custom annotations. Those kernel * threads must in turn be annotated with kcov_remote_start(common_handle) and * kcov_remote_stop(). All of the threads that are spawned by the same process -- cgit v1.2.3-71-gd317 From 3f2c788a13143620c5471ac96ac4f033fc9ac3f3 Mon Sep 17 00:00:00 2001 From: Christian Brauner Date: Thu, 7 May 2020 12:32:14 +0200 Subject: fork: prevent accidental access to clone3 features Jan reported an issue where an interaction between sign-extending clone's flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes clone() to consistently fail with EBADF. The whole story is a little longer. The legacy clone() syscall is odd in a bunch of ways and here two things interact. First, legacy clone's flag argument is word-size dependent, i.e. it's an unsigned long whereas most system calls with flag arguments use int or unsigned int. Second, legacy clone() ignores unknown and deprecated flags. The two of them taken together means that users on 64bit systems can pass garbage for the upper 32bit of the clone() syscall since forever and things would just work fine. Just try this on a 64bit kernel prior to v5.7-rc1 where this will succeed and on v5.7-rc1 where this will fail with EBADF: int main(int argc, char *argv[]) { pid_t pid; /* Note that legacy clone() has different argument ordering on * different architectures so this won't work everywhere. * * Only set the upper 32 bits. */ pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD, NULL, NULL, NULL, NULL); if (pid < 0) exit(EXIT_FAILURE); if (pid == 0) exit(EXIT_SUCCESS); if (wait(NULL) != pid) exit(EXIT_FAILURE); exit(EXIT_SUCCESS); } Since legacy clone() couldn't be extended this was not a problem so far and nobody really noticed or cared since nothing in the kernel ever bothered to look at the upper 32 bits. But once we introduced clone3() and expanded the flag argument in struct clone_args to 64 bit we opened this can of worms. With the first flag-based extension to clone3() making use of the upper 32 bits of the flag argument we've effectively made it possible for the legacy clone() syscall to reach clone3() only flags. The sign extension scenario is just the odd corner-case that we needed to figure this out. The reason we just realized this now and not already when we introduced CLONE_CLEAR_SIGHAND was that CLONE_INTO_CGROUP assumes that a valid cgroup file descriptor has been given. So the sign extension (or the user accidently passing garbage for the upper 32 bits) caused the CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it didn't find a valid cgroup file descriptor. Let's fix this by always capping the upper 32 bits for all codepaths that are not aware of clone3() features. This ensures that we can't reach clone3() only features by accident via legacy clone as with the sign extension case and also that legacy clone() works exactly like before, i.e. ignoring any unknown flags. This solution risks no regressions and is also pretty clean. Fixes: 7f192e3cd316 ("fork: add clone3") Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Reported-by: Jan Stancek Signed-off-by: Christian Brauner Cc: Arnd Bergmann Cc: Dmitry V. Levin Cc: Andreas Schwab Cc: Florian Weimer Cc: libc-alpha@sourceware.org Cc: stable@vger.kernel.org # 5.3+ Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com --- kernel/fork.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/fork.c b/kernel/fork.c index 8c700f881d92..48ed22774efa 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2486,11 +2486,11 @@ long do_fork(unsigned long clone_flags, int __user *child_tidptr) { struct kernel_clone_args args = { - .flags = (clone_flags & ~CSIGNAL), + .flags = (lower_32_bits(clone_flags) & ~CSIGNAL), .pidfd = parent_tidptr, .child_tid = child_tidptr, .parent_tid = parent_tidptr, - .exit_signal = (clone_flags & CSIGNAL), + .exit_signal = (lower_32_bits(clone_flags) & CSIGNAL), .stack = stack_start, .stack_size = stack_size, }; @@ -2508,8 +2508,9 @@ long do_fork(unsigned long clone_flags, pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags) { struct kernel_clone_args args = { - .flags = ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL), - .exit_signal = (flags & CSIGNAL), + .flags = ((lower_32_bits(flags) | CLONE_VM | + CLONE_UNTRACED) & ~CSIGNAL), + .exit_signal = (lower_32_bits(flags) & CSIGNAL), .stack = (unsigned long)fn, .stack_size = (unsigned long)arg, }; @@ -2570,11 +2571,11 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp, #endif { struct kernel_clone_args args = { - .flags = (clone_flags & ~CSIGNAL), + .flags = (lower_32_bits(clone_flags) & ~CSIGNAL), .pidfd = parent_tidptr, .child_tid = child_tidptr, .parent_tid = parent_tidptr, - .exit_signal = (clone_flags & CSIGNAL), + .exit_signal = (lower_32_bits(clone_flags) & CSIGNAL), .stack = newsp, .tls = tls, }; -- cgit v1.2.3-71-gd317 From db803036ada7d61d096783726f9771b3fc540370 Mon Sep 17 00:00:00 2001 From: Vincent Minet Date: Fri, 8 May 2020 00:14:22 +0200 Subject: umh: fix memory leak on execve failure If a UMH process created by fork_usermode_blob() fails to execute, a pair of struct file allocated by umh_pipe_setup() will leak. Under normal conditions, the caller (like bpfilter) needs to manage the lifetime of the UMH and its two pipes. But when fork_usermode_blob() fails, the caller doesn't really have a way to know what needs to be done. It seems better to do the cleanup ourselves in this case. Fixes: 449325b52b7a ("umh: introduce fork_usermode_blob() helper") Signed-off-by: Vincent Minet Signed-off-by: Jakub Kicinski --- kernel/umh.c | 6 ++++++ 1 file changed, 6 insertions(+) (limited to 'kernel') diff --git a/kernel/umh.c b/kernel/umh.c index 7f255b5a8845..20d51e0957e0 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -475,6 +475,12 @@ static void umh_clean_and_save_pid(struct subprocess_info *info) { struct umh_info *umh_info = info->data; + /* cleanup if umh_pipe_setup() was successful but exec failed */ + if (info->pid && info->retval) { + fput(umh_info->pipe_to_umh); + fput(umh_info->pipe_from_umh); + } + argv_free(info->argv); umh_info->pid = info->pid; } -- cgit v1.2.3-71-gd317 From 78a5255ffb6a1af189a83e493d916ba1c54d8c75 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 9 May 2020 13:57:10 -0700 Subject: Stop the ad-hoc games with -Wno-maybe-initialized We have some rather random rules about when we accept the "maybe-initialized" warnings, and when we don't. For example, we consider it unreliable for gcc versions < 4.9, but also if -O3 is enabled, or if optimizing for size. And then various kernel config options disabled it, because they know that they trigger that warning by confusing gcc sufficiently (ie PROFILE_ALL_BRANCHES). And now gcc-10 seems to be introducing a lot of those warnings too, so it falls under the same heading as 4.9 did. At the same time, we have a very straightforward way to _enable_ that warning when wanted: use "W=2" to enable more warnings. So stop playing these ad-hoc games, and just disable that warning by default, with the known and straight-forward "if you want to work on the extra compiler warnings, use W=123". Would it be great to have code that is always so obvious that it never confuses the compiler whether a variable is used initialized or not? Yes, it would. In a perfect world, the compilers would be smarter, and our source code would be simpler. That's currently not the world we live in, though. Signed-off-by: Linus Torvalds --- Makefile | 7 +++---- init/Kconfig | 18 ------------------ kernel/trace/Kconfig | 1 - 3 files changed, 3 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/Makefile b/Makefile index 3512f7b243dc..1e94f4f15d92 100644 --- a/Makefile +++ b/Makefile @@ -729,10 +729,6 @@ else ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE KBUILD_CFLAGS += -Os endif -ifdef CONFIG_CC_DISABLE_WARN_MAYBE_UNINITIALIZED -KBUILD_CFLAGS += -Wno-maybe-uninitialized -endif - # Tell gcc to never replace conditional load with a non-conditional one KBUILD_CFLAGS += $(call cc-option,--param=allow-store-data-races=0) KBUILD_CFLAGS += $(call cc-option,-fno-allow-store-data-races) @@ -881,6 +877,9 @@ KBUILD_CFLAGS += -Wno-pointer-sign # disable stringop warnings in gcc 8+ KBUILD_CFLAGS += $(call cc-disable-warning, stringop-truncation) +# Enabled with W=2, disabled by default as noisy +KBUILD_CFLAGS += $(call cc-disable-warning, maybe-uninitialized) + # disable invalid "can't wrap" optimizations for signed / pointers KBUILD_CFLAGS += $(call cc-option,-fno-strict-overflow) diff --git a/init/Kconfig b/init/Kconfig index 9e22ee8fbd75..9278a603d399 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -39,22 +39,6 @@ config TOOLS_SUPPORT_RELR config CC_HAS_ASM_INLINE def_bool $(success,echo 'void foo(void) { asm inline (""); }' | $(CC) -x c - -c -o /dev/null) -config CC_HAS_WARN_MAYBE_UNINITIALIZED - def_bool $(cc-option,-Wmaybe-uninitialized) - help - GCC >= 4.7 supports this option. - -config CC_DISABLE_WARN_MAYBE_UNINITIALIZED - bool - depends on CC_HAS_WARN_MAYBE_UNINITIALIZED - default CC_IS_GCC && GCC_VERSION < 40900 # unreliable for GCC < 4.9 - help - GCC's -Wmaybe-uninitialized is not reliable by definition. - Lots of false positive warnings are produced in some cases. - - If this option is enabled, -Wno-maybe-uninitialzed is passed - to the compiler to suppress maybe-uninitialized warnings. - config CONSTRUCTORS bool depends on !UML @@ -1257,14 +1241,12 @@ config CC_OPTIMIZE_FOR_PERFORMANCE config CC_OPTIMIZE_FOR_PERFORMANCE_O3 bool "Optimize more for performance (-O3)" depends on ARC - imply CC_DISABLE_WARN_MAYBE_UNINITIALIZED # avoid false positives help Choosing this option will pass "-O3" to your compiler to optimize the kernel yet more for performance. config CC_OPTIMIZE_FOR_SIZE bool "Optimize for size (-Os)" - imply CC_DISABLE_WARN_MAYBE_UNINITIALIZED # avoid false positives help Choosing this option will pass "-Os" to your compiler resulting in a smaller kernel. diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 402eef84c859..743647005f64 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -466,7 +466,6 @@ config PROFILE_ANNOTATED_BRANCHES config PROFILE_ALL_BRANCHES bool "Profile all if conditionals" if !FORTIFY_SOURCE select TRACE_BRANCH_PROFILING - imply CC_DISABLE_WARN_MAYBE_UNINITIALIZED # avoid false positives help This tracer profiles all branch conditions. Every if () taken in the kernel is recorded whether it hit or miss. -- cgit v1.2.3-71-gd317 From 8b1fac2e73e84ef0d6391051880a8e1d7044c847 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Sun, 10 May 2020 11:35:10 -0400 Subject: tracing: Wait for preempt irq delay thread to execute A bug report was posted that running the preempt irq delay module on a slow machine, and removing it quickly could lead to the thread created by the modlue to execute after the module is removed, and this could cause the kernel to crash. The fix for this was to call kthread_stop() after creating the thread to make sure it finishes before allowing the module to be removed. Now this caused the opposite problem on fast machines. What now happens is the kthread_stop() can cause the kthread never to execute and the test never to run. To fix this, add a completion and wait for the kthread to execute, then wait for it to end. This issue caused the ftracetest selftests to fail on the preemptirq tests. Link: https://lore.kernel.org/r/20200510114210.15d9e4af@oasis.local.home Cc: stable@vger.kernel.org Fixes: d16a8c31077e ("tracing: Wait for preempt irq delay thread to finish") Reviewed-by: Joel Fernandes (Google) Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/preemptirq_delay_test.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/preemptirq_delay_test.c b/kernel/trace/preemptirq_delay_test.c index c4c86de63cf9..312d1a0ca3b6 100644 --- a/kernel/trace/preemptirq_delay_test.c +++ b/kernel/trace/preemptirq_delay_test.c @@ -16,6 +16,7 @@ #include #include #include +#include static ulong delay = 100; static char test_mode[12] = "irq"; @@ -28,6 +29,8 @@ MODULE_PARM_DESC(delay, "Period in microseconds (100 us default)"); MODULE_PARM_DESC(test_mode, "Mode of the test such as preempt, irq, or alternate (default irq)"); MODULE_PARM_DESC(burst_size, "The size of a burst (default 1)"); +static struct completion done; + #define MIN(x, y) ((x) < (y) ? (x) : (y)) static void busy_wait(ulong time) @@ -114,6 +117,8 @@ static int preemptirq_delay_run(void *data) for (i = 0; i < s; i++) (testfuncs[i])(i); + complete(&done); + set_current_state(TASK_INTERRUPTIBLE); while (!kthread_should_stop()) { schedule(); @@ -128,15 +133,18 @@ static int preemptirq_delay_run(void *data) static int preemptirq_run_test(void) { struct task_struct *task; - char task_name[50]; + init_completion(&done); + snprintf(task_name, sizeof(task_name), "%s_test", test_mode); task = kthread_run(preemptirq_delay_run, NULL, task_name); if (IS_ERR(task)) return PTR_ERR(task); - if (task) + if (task) { + wait_for_completion(&done); kthread_stop(task); + } return 0; } -- cgit v1.2.3-71-gd317 From 59566b0b622e3e6ea928c0b8cac8a5601b00b383 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Thu, 30 Apr 2020 20:21:47 -0400 Subject: x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up Booting one of my machines, it triggered the following crash: Kernel/User page tables isolation: enabled ftrace: allocating 36577 entries in 143 pages Starting tracer 'function' BUG: unable to handle page fault for address: ffffffffa000005c #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061 Oops: 0003 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 RIP: 0010:text_poke_early+0x4a/0x58 Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff 0 41 57 49 RSP: 0000:ffffffff82003d38 EFLAGS: 00010046 RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005 RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0 R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0 R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0 FS: 0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0 Call Trace: text_poke_bp+0x27/0x64 ? mutex_lock+0x36/0x5d arch_ftrace_update_trampoline+0x287/0x2d5 ? ftrace_replace_code+0x14b/0x160 ? ftrace_update_ftrace_func+0x65/0x6c __register_ftrace_function+0x6d/0x81 ftrace_startup+0x23/0xc1 register_ftrace_function+0x20/0x37 func_set_flag+0x59/0x77 __set_tracer_option.isra.19+0x20/0x3e trace_set_options+0xd6/0x13e apply_trace_boot_options+0x44/0x6d register_tracer+0x19e/0x1ac early_trace_init+0x21b/0x2c9 start_kernel+0x241/0x518 ? load_ucode_intel_bsp+0x21/0x52 secondary_startup_64+0xa4/0xb0 I was able to trigger it on other machines, when I added to the kernel command line of both "ftrace=function" and "trace_options=func_stack_trace". The cause is the "ftrace=function" would register the function tracer and create a trampoline, and it will set it as executable and read-only. Then the "trace_options=func_stack_trace" would then update the same trampoline to include the stack tracer version of the function tracer. But since the trampoline already exists, it updates it with text_poke_bp(). The problem is that text_poke_bp() called while system_state == SYSTEM_BOOTING, it will simply do a memcpy() and not the page mapping, as it would think that the text is still read-write. But in this case it is not, and we take a fault and crash. Instead, lets keep the ftrace trampolines read-write during boot up, and then when the kernel executable text is set to read-only, the ftrace trampolines get set to read-only as well. Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home Cc: Ingo Molnar Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Josh Poimboeuf Cc: "H. Peter Anvin" Cc: stable@vger.kernel.org Fixes: 768ae4406a5c ("x86/ftrace: Use text_poke()") Acked-by: Peter Zijlstra Signed-off-by: Steven Rostedt (VMware) --- arch/x86/include/asm/ftrace.h | 6 ++++++ arch/x86/kernel/ftrace.c | 29 ++++++++++++++++++++++++++++- arch/x86/mm/init_64.c | 3 +++ include/linux/ftrace.h | 23 +++++++++++++++++++++++ kernel/trace/ftrace_internal.h | 22 ---------------------- 5 files changed, 60 insertions(+), 23 deletions(-) (limited to 'kernel') diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h index 85be2f506272..89af0d2c62aa 100644 --- a/arch/x86/include/asm/ftrace.h +++ b/arch/x86/include/asm/ftrace.h @@ -56,6 +56,12 @@ struct dyn_arch_ftrace { #ifndef __ASSEMBLY__ +#if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_DYNAMIC_FTRACE) +extern void set_ftrace_ops_ro(void); +#else +static inline void set_ftrace_ops_ro(void) { } +#endif + #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME static inline bool arch_syscall_match_sym_name(const char *sym, const char *name) { diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index 37a0aeaf89e7..b0e641793be4 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -407,7 +407,8 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size) set_vm_flush_reset_perms(trampoline); - set_memory_ro((unsigned long)trampoline, npages); + if (likely(system_state != SYSTEM_BOOTING)) + set_memory_ro((unsigned long)trampoline, npages); set_memory_x((unsigned long)trampoline, npages); return (unsigned long)trampoline; fail: @@ -415,6 +416,32 @@ fail: return 0; } +void set_ftrace_ops_ro(void) +{ + struct ftrace_ops *ops; + unsigned long start_offset; + unsigned long end_offset; + unsigned long npages; + unsigned long size; + + do_for_each_ftrace_op(ops, ftrace_ops_list) { + if (!(ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP)) + continue; + + if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) { + start_offset = (unsigned long)ftrace_regs_caller; + end_offset = (unsigned long)ftrace_regs_caller_end; + } else { + start_offset = (unsigned long)ftrace_caller; + end_offset = (unsigned long)ftrace_epilogue; + } + size = end_offset - start_offset; + size = size + RET_SIZE + sizeof(void *); + npages = DIV_ROUND_UP(size, PAGE_SIZE); + set_memory_ro((unsigned long)ops->trampoline, npages); + } while_for_each_ftrace_op(ops); +} + static unsigned long calc_trampoline_call_offset(bool save_regs) { unsigned long start_offset; diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 3b289c2f75cd..8b5f73f5e207 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -54,6 +54,7 @@ #include #include #include +#include #include "mm_internal.h" @@ -1291,6 +1292,8 @@ void mark_rodata_ro(void) all_end = roundup((unsigned long)_brk_end, PMD_SIZE); set_memory_nx(text_end, (all_end - text_end) >> PAGE_SHIFT); + set_ftrace_ops_ro(); + #ifdef CONFIG_CPA_DEBUG printk(KERN_INFO "Testing CPA: undo %lx-%lx\n", start, end); set_memory_rw(start, (end-start) >> PAGE_SHIFT); diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h index db95244a62d4..ab4bd15cbcdb 100644 --- a/include/linux/ftrace.h +++ b/include/linux/ftrace.h @@ -210,6 +210,29 @@ struct ftrace_ops { #endif }; +extern struct ftrace_ops __rcu *ftrace_ops_list; +extern struct ftrace_ops ftrace_list_end; + +/* + * Traverse the ftrace_global_list, invoking all entries. The reason that we + * can use rcu_dereference_raw_check() is that elements removed from this list + * are simply leaked, so there is no need to interact with a grace-period + * mechanism. The rcu_dereference_raw_check() calls are needed to handle + * concurrent insertions into the ftrace_global_list. + * + * Silly Alpha and silly pointer-speculation compiler optimizations! + */ +#define do_for_each_ftrace_op(op, list) \ + op = rcu_dereference_raw_check(list); \ + do + +/* + * Optimized for just a single item in the list (as that is the normal case). + */ +#define while_for_each_ftrace_op(op) \ + while (likely(op = rcu_dereference_raw_check((op)->next)) && \ + unlikely((op) != &ftrace_list_end)) + /* * Type of the current tracing. */ diff --git a/kernel/trace/ftrace_internal.h b/kernel/trace/ftrace_internal.h index 0456e0a3dab1..382775edf690 100644 --- a/kernel/trace/ftrace_internal.h +++ b/kernel/trace/ftrace_internal.h @@ -4,28 +4,6 @@ #ifdef CONFIG_FUNCTION_TRACER -/* - * Traverse the ftrace_global_list, invoking all entries. The reason that we - * can use rcu_dereference_raw_check() is that elements removed from this list - * are simply leaked, so there is no need to interact with a grace-period - * mechanism. The rcu_dereference_raw_check() calls are needed to handle - * concurrent insertions into the ftrace_global_list. - * - * Silly Alpha and silly pointer-speculation compiler optimizations! - */ -#define do_for_each_ftrace_op(op, list) \ - op = rcu_dereference_raw_check(list); \ - do - -/* - * Optimized for just a single item in the list (as that is the normal case). - */ -#define while_for_each_ftrace_op(op) \ - while (likely(op = rcu_dereference_raw_check((op)->next)) && \ - unlikely((op) != &ftrace_list_end)) - -extern struct ftrace_ops __rcu *ftrace_ops_list; -extern struct ftrace_ops ftrace_list_end; extern struct mutex ftrace_lock; extern struct ftrace_ops global_ops; -- cgit v1.2.3-71-gd317 From 3d2353de81061cab4b9d68b3e1dc69cbec1451ea Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 13 May 2020 15:18:01 -0400 Subject: ring-buffer: Don't deactivate the ring buffer on failed iterator reads If the function tracer is running and the trace file is read (which uses the ring buffer iterator), the iterator can get in sync with the writes, and caues it to fail to find a page with content it can read three times. This causes a warning and deactivation of the ring buffer code. Looking at the other cases of failure to get an event, it appears that there's a chance that the writer could cause them too. Since the iterator is a "best effort" to read the ring buffer if there's an active writer (the consumer reader is made for this case "see trace_pipe"), if it fails to get an event after three tries, simply give up and return NULL. Don't warn, nor disable the ring buffer on this failure. Link: https://lore.kernel.org/r/20200429090508.GG5770@shao2-debian Reported-by: kernel test robot Fixes: ff84c50cfb4b ("ring-buffer: Do not die if rb_iter_peek() fails more than thrice") Tested-by: Sven Schnelle Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 22 +++++++--------------- 1 file changed, 7 insertions(+), 15 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6f0b42ceeb00..448d5f528764 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -4034,7 +4034,6 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) struct ring_buffer_per_cpu *cpu_buffer; struct ring_buffer_event *event; int nr_loops = 0; - bool failed = false; if (ts) *ts = 0; @@ -4056,19 +4055,14 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) return NULL; /* - * We repeat when a time extend is encountered or we hit - * the end of the page. Since the time extend is always attached - * to a data event, we should never loop more than three times. - * Once for going to next page, once on time extend, and - * finally once to get the event. - * We should never hit the following condition more than thrice, - * unless the buffer is very small, and there's a writer - * that is causing the reader to fail getting an event. + * As the writer can mess with what the iterator is trying + * to read, just give up if we fail to get an event after + * three tries. The iterator is not as reliable when reading + * the ring buffer with an active write as the consumer is. + * Do not warn if the three failures is reached. */ - if (++nr_loops > 3) { - RB_WARN_ON(cpu_buffer, !failed); + if (++nr_loops > 3) return NULL; - } if (rb_per_cpu_empty(cpu_buffer)) return NULL; @@ -4079,10 +4073,8 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) } event = rb_iter_head_event(iter); - if (!event) { - failed = true; + if (!event) goto again; - } switch (event->type_len) { case RINGBUF_TYPE_PADDING: -- cgit v1.2.3-71-gd317 From da4d401a6b8fda7414033f81982f64ade02c0e27 Mon Sep 17 00:00:00 2001 From: "Steven Rostedt (VMware)" Date: Wed, 13 May 2020 15:36:22 -0400 Subject: ring-buffer: Remove all BUG() calls There's a lot of checks to make sure the ring buffer is working, and if an anomaly is detected, it safely shuts itself down. But there's a few cases that it will call BUG(), which defeats the point of being safe (it crashes the kernel when an anomaly is found!). There's no reason for them. Switch them all to either WARN_ON_ONCE() (when no ring buffer descriptor is present), or to RB_WARN_ON() (when a ring buffer descriptor is present). Signed-off-by: Steven Rostedt (VMware) --- kernel/trace/ring_buffer.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 448d5f528764..b8e1ca48be50 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -193,7 +193,7 @@ rb_event_length(struct ring_buffer_event *event) case RINGBUF_TYPE_DATA: return rb_event_data_length(event); default: - BUG(); + WARN_ON_ONCE(1); } /* not hit */ return 0; @@ -249,7 +249,7 @@ rb_event_data(struct ring_buffer_event *event) { if (extended_time(event)) event = skip_time_extend(event); - BUG_ON(event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX); + WARN_ON_ONCE(event->type_len > RINGBUF_TYPE_DATA_TYPE_LEN_MAX); /* If length is in len field, then array[0] has the data */ if (event->type_len) return (void *)&event->array[0]; @@ -3727,7 +3727,7 @@ rb_update_read_stamp(struct ring_buffer_per_cpu *cpu_buffer, return; default: - BUG(); + RB_WARN_ON(cpu_buffer, 1); } return; } @@ -3757,7 +3757,7 @@ rb_update_iter_read_stamp(struct ring_buffer_iter *iter, return; default: - BUG(); + RB_WARN_ON(iter->cpu_buffer, 1); } return; } @@ -4020,7 +4020,7 @@ rb_buffer_peek(struct ring_buffer_per_cpu *cpu_buffer, u64 *ts, return event; default: - BUG(); + RB_WARN_ON(cpu_buffer, 1); } return NULL; @@ -4109,7 +4109,7 @@ rb_iter_peek(struct ring_buffer_iter *iter, u64 *ts) return event; default: - BUG(); + RB_WARN_ON(cpu_buffer, 1); } return NULL; -- cgit v1.2.3-71-gd317 From 333291ce5055f2039afc907badaf5b66bc1adfdc Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Tue, 12 May 2020 16:59:25 -0700 Subject: bpf: Fix bug in mmap() implementation for BPF array map mmap() subsystem allows user-space application to memory-map region with initial page offset. This wasn't taken into account in initial implementation of BPF array memory-mapping. This would result in wrong pages, not taking into account requested page shift, being memory-mmaped into user-space. This patch fixes this gap and adds a test for such scenario. Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20200512235925.3817805-1-andriin@fb.com --- kernel/bpf/arraymap.c | 7 ++++++- tools/testing/selftests/bpf/prog_tests/mmap.c | 8 ++++++++ 2 files changed, 14 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 95d77770353c..1d6120fd5ba6 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -486,7 +486,12 @@ static int array_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) if (!(map->map_flags & BPF_F_MMAPABLE)) return -EINVAL; - return remap_vmalloc_range(vma, array_map_vmalloc_addr(array), pgoff); + if (vma->vm_pgoff * PAGE_SIZE + (vma->vm_end - vma->vm_start) > + PAGE_ALIGN((u64)array->map.max_entries * array->elem_size)) + return -EINVAL; + + return remap_vmalloc_range(vma, array_map_vmalloc_addr(array), + vma->vm_pgoff + pgoff); } const struct bpf_map_ops array_map_ops = { diff --git a/tools/testing/selftests/bpf/prog_tests/mmap.c b/tools/testing/selftests/bpf/prog_tests/mmap.c index 56d80adcf4bd..6b9dce431d41 100644 --- a/tools/testing/selftests/bpf/prog_tests/mmap.c +++ b/tools/testing/selftests/bpf/prog_tests/mmap.c @@ -217,6 +217,14 @@ void test_mmap(void) munmap(tmp2, 4 * page_size); + /* map all 4 pages, but with pg_off=1 page, should fail */ + tmp1 = mmap(NULL, 4 * page_size, PROT_READ, MAP_SHARED | MAP_FIXED, + data_map_fd, page_size /* initial page shift */); + if (CHECK(tmp1 != MAP_FAILED, "adv_mmap7", "unexpected success")) { + munmap(tmp1, 4 * page_size); + goto cleanup; + } + tmp1 = mmap(NULL, map_sz, PROT_READ, MAP_SHARED, data_map_fd, 0); if (CHECK(tmp1 == MAP_FAILED, "last_mmap", "failed %d\n", errno)) goto cleanup; -- cgit v1.2.3-71-gd317 From e92888c72fbdc6f9d07b3b0604c012e81d7c0da7 Mon Sep 17 00:00:00 2001 From: Yonghong Song Date: Wed, 13 May 2020 22:32:05 -0700 Subject: bpf: Enforce returning 0 for fentry/fexit progs Currently, tracing/fentry and tracing/fexit prog return values are not enforced. In trampoline codes, the fentry/fexit prog return values are ignored. Let us enforce it to be 0 to avoid confusion and allows potential future extension. This patch also explicitly added return value checking for tracing/raw_tp, tracing/fmod_ret, and freplace programs such that these program return values can be anything. The purpose are two folds: 1. to make it explicit about return value expectations for these programs in verifier. 2. for tracing prog_type, if a future attach type is added, the default is -ENOTSUPP which will enforce to specify return value ranges explicitly. Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline") Signed-off-by: Yonghong Song Signed-off-by: Alexei Starovoitov Acked-by: Andrii Nakryiko Link: https://lore.kernel.org/bpf/20200514053206.1298415-1-yhs@fb.com --- kernel/bpf/verifier.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index fa1d8245b925..a44ba6672688 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -7059,6 +7059,23 @@ static int check_return_code(struct bpf_verifier_env *env) return 0; range = tnum_const(0); break; + case BPF_PROG_TYPE_TRACING: + switch (env->prog->expected_attach_type) { + case BPF_TRACE_FENTRY: + case BPF_TRACE_FEXIT: + range = tnum_const(0); + break; + case BPF_TRACE_RAW_TP: + case BPF_MODIFY_RETURN: + return 0; + default: + return -ENOTSUPP; + } + break; + case BPF_PROG_TYPE_EXT: + /* freplace program can return anything as its return value + * depends on the to-be-replaced kernel func or bpf program. + */ default: return 0; } -- cgit v1.2.3-71-gd317 From 0ebeea8ca8a4d1d453ad299aef0507dab04f6e8d Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 15 May 2020 12:11:16 +0200 Subject: bpf: Restrict bpf_probe_read{, str}() only to archs where they work Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs with overlapping address ranges, we should really take the next step to disable them from BPF use there. To generally fix the situation, we've recently added new helper variants bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str(). For details on them, see 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,kernel}_str helpers"). Given bpf_probe_read{,str}() have been around for ~5 years by now, there are plenty of users at least on x86 still relying on them today, so we cannot remove them entirely w/o breaking the BPF tracing ecosystem. However, their use should be restricted to archs with non-overlapping address ranges where they are working in their current form. Therefore, move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and have x86, arm64, arm select it (other archs supporting it can follow-up on it as well). For the remaining archs, they can workaround easily by relying on the feature probe from bpftool which spills out defines that can be used out of BPF C code to implement the drop-in replacement for old/new kernels via: bpftool feature probe macro Suggested-by: Linus Torvalds Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Reviewed-by: Masami Hiramatsu Acked-by: Linus Torvalds Cc: Brendan Gregg Cc: Christoph Hellwig Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net --- arch/arm/Kconfig | 1 + arch/arm64/Kconfig | 1 + arch/x86/Kconfig | 1 + init/Kconfig | 3 +++ kernel/trace/bpf_trace.c | 6 ++++-- 5 files changed, 10 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 66a04f6f4775..c77c93c485a0 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -12,6 +12,7 @@ config ARM select ARCH_HAS_KEEPINITRD select ARCH_HAS_KCOV select ARCH_HAS_MEMBARRIER_SYNC_CORE + select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PTE_SPECIAL if ARM_LPAE select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_SETUP_DMA_OPS diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 40fb05d96c60..5d513f461957 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -20,6 +20,7 @@ config ARM64 select ARCH_HAS_KCOV select ARCH_HAS_KEEPINITRD select ARCH_HAS_MEMBARRIER_SYNC_CORE + select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PTE_DEVMAP select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_SETUP_DMA_OPS diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1197b5596d5a..2d3f963fd6f1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -68,6 +68,7 @@ config X86 select ARCH_HAS_KCOV if X86_64 select ARCH_HAS_MEM_ENCRYPT select ARCH_HAS_MEMBARRIER_SYNC_CORE + select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 select ARCH_HAS_PTE_SPECIAL diff --git a/init/Kconfig b/init/Kconfig index 9e22ee8fbd75..6fd13a051342 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -2279,6 +2279,9 @@ config ASN1 source "kernel/Kconfig.locks" +config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE + bool + config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE bool diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ca1796747a77..b83bdaa31c7b 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -825,14 +825,16 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_probe_read_user_proto; case BPF_FUNC_probe_read_kernel: return &bpf_probe_read_kernel_proto; - case BPF_FUNC_probe_read: - return &bpf_probe_read_compat_proto; case BPF_FUNC_probe_read_user_str: return &bpf_probe_read_user_str_proto; case BPF_FUNC_probe_read_kernel_str: return &bpf_probe_read_kernel_str_proto; +#ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE + case BPF_FUNC_probe_read: + return &bpf_probe_read_compat_proto; case BPF_FUNC_probe_read_str: return &bpf_probe_read_compat_str_proto; +#endif #ifdef CONFIG_CGROUPS case BPF_FUNC_get_current_cgroup_id: return &bpf_get_current_cgroup_id_proto; -- cgit v1.2.3-71-gd317 From 47cc0ed574abcbbde0cf143ddb21a0baed1aa2df Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 15 May 2020 12:11:17 +0200 Subject: bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range Given bpf_probe_read{,str}() BPF helpers are now only available under CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range() as well to avoid hitting the same issue as in 849fa50662fbc ("bpf/verifier: refine retval R0 state for bpf_get_stack helper"). Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Acked-by: John Fastabend Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net --- kernel/bpf/verifier.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index a44ba6672688..8d7ee40e2748 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -4340,7 +4340,9 @@ static void do_refine_retval_range(struct bpf_reg_state *regs, int ret_type, if (ret_type != RET_INTEGER || (func_id != BPF_FUNC_get_stack && - func_id != BPF_FUNC_probe_read_str)) + func_id != BPF_FUNC_probe_read_str && + func_id != BPF_FUNC_probe_read_kernel_str && + func_id != BPF_FUNC_probe_read_user_str)) return; ret_reg->smax_value = meta->msize_max_value; -- cgit v1.2.3-71-gd317 From b2a5212fb634561bb734c6356904e37f6665b955 Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Fri, 15 May 2020 12:11:18 +0200 Subject: bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on archs with overlapping address ranges. While the helpers have been addressed through work in 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need an option for bpf_trace_printk() as well to fix it. Similarly as with the helpers, force users to make an explicit choice by adding %pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS. The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended for other objects aside strings that are probed and printed under tracing, and reused out of other facilities like bpf_seq_printf() or BTF based type printing. Existing behavior of %s for current users is still kept working for archs where it is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE. For archs not having this property we fall-back to pick probing under KERNEL_DS as a sensible default. Fixes: 8d3b7dce8622 ("bpf: add support for %s specifier to bpf_trace_printk()") Reported-by: Linus Torvalds Reported-by: Christoph Hellwig Signed-off-by: Daniel Borkmann Signed-off-by: Alexei Starovoitov Cc: Masami Hiramatsu Cc: Brendan Gregg Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net --- Documentation/core-api/printk-formats.rst | 14 +++++ kernel/trace/bpf_trace.c | 94 ++++++++++++++++++++----------- lib/vsprintf.c | 12 ++++ 3 files changed, 88 insertions(+), 32 deletions(-) (limited to 'kernel') diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 8ebe46b1af39..5dfcc4592b23 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -112,6 +112,20 @@ used when printing stack backtraces. The specifier takes into consideration the effect of compiler optimisations which may occur when tail-calls are used and marked with the noreturn GCC attribute. +Probed Pointers from BPF / tracing +---------------------------------- + +:: + + %pks kernel string + %pus user string + +The ``k`` and ``u`` specifiers are used for printing prior probed memory from +either kernel memory (k) or user memory (u). The subsequent ``s`` specifier +results in printing a string. For direct use in regular vsnprintf() the (k) +and (u) annotation is ignored, however, when used out of BPF's bpf_trace_printk(), +for example, it reads the memory it is pointing to without faulting. + Kernel Pointers --------------- diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b83bdaa31c7b..a010edc37ee0 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -323,17 +323,15 @@ static const struct bpf_func_proto *bpf_get_probe_write_proto(void) /* * Only limited trace_printk() conversion specifiers allowed: - * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %s + * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %pks %pus %s */ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, u64, arg2, u64, arg3) { + int i, mod[3] = {}, fmt_cnt = 0; + char buf[64], fmt_ptype; + void *unsafe_ptr = NULL; bool str_seen = false; - int mod[3] = {}; - int fmt_cnt = 0; - u64 unsafe_addr; - char buf[64]; - int i; /* * bpf_check()->check_func_arg()->check_stack_boundary() @@ -359,40 +357,71 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, if (fmt[i] == 'l') { mod[fmt_cnt]++; i++; - } else if (fmt[i] == 'p' || fmt[i] == 's') { + } else if (fmt[i] == 'p') { mod[fmt_cnt]++; + if ((fmt[i + 1] == 'k' || + fmt[i + 1] == 'u') && + fmt[i + 2] == 's') { + fmt_ptype = fmt[i + 1]; + i += 2; + goto fmt_str; + } + /* disallow any further format extensions */ if (fmt[i + 1] != 0 && !isspace(fmt[i + 1]) && !ispunct(fmt[i + 1])) return -EINVAL; - fmt_cnt++; - if (fmt[i] == 's') { - if (str_seen) - /* allow only one '%s' per fmt string */ - return -EINVAL; - str_seen = true; - - switch (fmt_cnt) { - case 1: - unsafe_addr = arg1; - arg1 = (long) buf; - break; - case 2: - unsafe_addr = arg2; - arg2 = (long) buf; - break; - case 3: - unsafe_addr = arg3; - arg3 = (long) buf; - break; - } - buf[0] = 0; - strncpy_from_unsafe(buf, - (void *) (long) unsafe_addr, + + goto fmt_next; + } else if (fmt[i] == 's') { + mod[fmt_cnt]++; + fmt_ptype = fmt[i]; +fmt_str: + if (str_seen) + /* allow only one '%s' per fmt string */ + return -EINVAL; + str_seen = true; + + if (fmt[i + 1] != 0 && + !isspace(fmt[i + 1]) && + !ispunct(fmt[i + 1])) + return -EINVAL; + + switch (fmt_cnt) { + case 0: + unsafe_ptr = (void *)(long)arg1; + arg1 = (long)buf; + break; + case 1: + unsafe_ptr = (void *)(long)arg2; + arg2 = (long)buf; + break; + case 2: + unsafe_ptr = (void *)(long)arg3; + arg3 = (long)buf; + break; + } + + buf[0] = 0; + switch (fmt_ptype) { + case 's': +#ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE + strncpy_from_unsafe(buf, unsafe_ptr, sizeof(buf)); + break; +#endif + case 'k': + strncpy_from_unsafe_strict(buf, unsafe_ptr, + sizeof(buf)); + break; + case 'u': + strncpy_from_unsafe_user(buf, + (__force void __user *)unsafe_ptr, + sizeof(buf)); + break; } - continue; + goto fmt_next; } if (fmt[i] == 'l') { @@ -403,6 +432,7 @@ BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, if (fmt[i] != 'i' && fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x') return -EINVAL; +fmt_next: fmt_cnt++; } diff --git a/lib/vsprintf.c b/lib/vsprintf.c index 7c488a1ce318..532b6606a18a 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -2168,6 +2168,10 @@ char *fwnode_string(char *buf, char *end, struct fwnode_handle *fwnode, * f full name * P node name, including a possible unit address * - 'x' For printing the address. Equivalent to "%lx". + * - '[ku]s' For a BPF/tracing related format specifier, e.g. used out of + * bpf_trace_printk() where [ku] prefix specifies either kernel (k) + * or user (u) memory to probe, and: + * s a string, equivalent to "%s" on direct vsnprintf() use * * ** When making changes please also update: * Documentation/core-api/printk-formats.rst @@ -2251,6 +2255,14 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr, if (!IS_ERR(ptr)) break; return err_ptr(buf, end, ptr, spec); + case 'u': + case 'k': + switch (fmt[1]) { + case 's': + return string(buf, end, ptr, spec); + default: + return error_string(buf, end, "(einval)", spec); + } } /* default is to _not_ leak addresses, hash before printing */ -- cgit v1.2.3-71-gd317 From b34cb07dde7c2346dec73d053ce926aeaa087303 Mon Sep 17 00:00:00 2001 From: Phil Auld Date: Tue, 12 May 2020 09:52:22 -0400 Subject: sched/fair: Fix enqueue_task_fair() warning some more sched/fair: Fix enqueue_task_fair warning some more The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) did not fully resolve the issues with the rq->tmp_alone_branch != &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where the first for_each_sched_entity loop exits due to on_rq, having incompletely updated the list. In this case the second for_each_sched_entity loop can further modify se. The later code to fix up the list management fails to do what is needed because se does not point to the sched_entity which broke out of the first loop. The list is not fixed up because the throttled parent was already added back to the list by a task enqueue in a parallel child hierarchy. Address this by calling list_add_leaf_cfs_rq if there are throttled parents while doing the second for_each_sched_entity loop. Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning") Suggested-by: Vincent Guittot Signed-off-by: Phil Auld Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Reviewed-by: Vincent Guittot Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com --- kernel/sched/fair.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 02f323b85b6d..c6d57c334d51 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5479,6 +5479,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) goto enqueue_throttle; + + /* + * One parent has been throttled and cfs_rq removed from the + * list. Add it back to not break the leaf list. + */ + if (throttled_hierarchy(cfs_rq)) + list_add_leaf_cfs_rq(cfs_rq); } enqueue_throttle: -- cgit v1.2.3-71-gd317 From ad32bb41fca67936c0c1d6d0bdd6d3e2e9c5432f Mon Sep 17 00:00:00 2001 From: Pavankumar Kondeti Date: Sun, 10 May 2020 18:26:41 +0530 Subject: sched/debug: Fix requested task uclamp values shown in procfs The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs") was to print requested and effective task uclamp values. The requested values printed are read from p->uclamp, which holds the last effective values. Fix this by printing the values from p->uclamp_req. Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs") Signed-off-by: Pavankumar Kondeti Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Valentin Schneider Tested-by: Valentin Schneider Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org --- kernel/sched/debug.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index a562df57a86e..239970b991c0 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -948,8 +948,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns, P(se.avg.util_est.enqueued); #endif #ifdef CONFIG_UCLAMP_TASK - __PS("uclamp.min", p->uclamp[UCLAMP_MIN].value); - __PS("uclamp.max", p->uclamp[UCLAMP_MAX].value); + __PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value); + __PS("uclamp.max", p->uclamp_req[UCLAMP_MAX].value); __PS("effective uclamp.min", uclamp_eff_value(p, UCLAMP_MIN)); __PS("effective uclamp.max", uclamp_eff_value(p, UCLAMP_MAX)); #endif -- cgit v1.2.3-71-gd317 From 39f23ce07b9355d05a64ae303ce20d1c4b92b957 Mon Sep 17 00:00:00 2001 From: Vincent Guittot Date: Wed, 13 May 2020 15:55:28 +0200 Subject: sched/fair: Fix unthrottle_cfs_rq() for leaf_cfs_rq list Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair() are quite close and follow the same sequence for enqueuing an entity in the cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as enqueue_task_fair(). This fixes a problem already faced with the latter and add an optimization in the last for_each_sched_entity loop. Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning) Reported-by Tao Zhou Signed-off-by: Vincent Guittot Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Phil Auld Reviewed-by: Ben Segall Link: https://lkml.kernel.org/r/20200513135528.4742-1-vincent.guittot@linaro.org --- kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++------------ 1 file changed, 30 insertions(+), 12 deletions(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c6d57c334d51..538ba5d94e99 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4774,7 +4774,6 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) struct rq *rq = rq_of(cfs_rq); struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; - int enqueue = 1; long task_delta, idle_task_delta; se = cfs_rq->tg->se[cpu_of(rq)]; @@ -4798,26 +4797,44 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) idle_task_delta = cfs_rq->idle_h_nr_running; for_each_sched_entity(se) { if (se->on_rq) - enqueue = 0; + break; + cfs_rq = cfs_rq_of(se); + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP); + cfs_rq->h_nr_running += task_delta; + cfs_rq->idle_h_nr_running += idle_task_delta; + + /* end evaluation on encountering a throttled cfs_rq */ + if (cfs_rq_throttled(cfs_rq)) + goto unthrottle_throttle; + } + + for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - if (enqueue) { - enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP); - } else { - update_load_avg(cfs_rq, se, 0); - se_update_runnable(se); - } + + update_load_avg(cfs_rq, se, UPDATE_TG); + se_update_runnable(se); cfs_rq->h_nr_running += task_delta; cfs_rq->idle_h_nr_running += idle_task_delta; + + /* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - break; + goto unthrottle_throttle; + + /* + * One parent has been throttled and cfs_rq removed from the + * list. Add it back to not break the leaf list. + */ + if (throttled_hierarchy(cfs_rq)) + list_add_leaf_cfs_rq(cfs_rq); } - if (!se) - add_nr_running(rq, task_delta); + /* At this point se is NULL and we are at root level*/ + add_nr_running(rq, task_delta); +unthrottle_throttle: /* * The cfs_rq_throttled() breaks in the above iteration can result in * incomplete leaf list maintenance, resulting in triggering the @@ -4826,7 +4843,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq) for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - list_add_leaf_cfs_rq(cfs_rq); + if (list_add_leaf_cfs_rq(cfs_rq)) + break; } assert_list_leaf_cfs_rq(rq); -- cgit v1.2.3-71-gd317 From dfeb376dd4cb2c5004aeb625e2475f58a5ff2ea7 Mon Sep 17 00:00:00 2001 From: Andrii Nakryiko Date: Mon, 18 May 2020 22:38:24 -0700 Subject: bpf: Prevent mmap()'ing read-only maps as writable As discussed in [0], it's dangerous to allow mapping BPF map, that's meant to be frozen and is read-only on BPF program side, because that allows user-space to actually store a writable view to the page even after it is frozen. This is exacerbated by BPF verifier making a strong assumption that contents of such frozen map will remain unchanged. To prevent this, disallow mapping BPF_F_RDONLY_PROG mmap()'able BPF maps as writable, ever. [0] https://lore.kernel.org/bpf/CAEf4BzYGWYhXdp6BJ7_=9OQPJxQpgug080MMjdSB72i9R+5c6g@mail.gmail.com/ Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Suggested-by: Jann Horn Signed-off-by: Andrii Nakryiko Signed-off-by: Alexei Starovoitov Reviewed-by: Jann Horn Link: https://lore.kernel.org/bpf/20200519053824.1089415-1-andriin@fb.com --- kernel/bpf/syscall.c | 17 ++++++++++++++--- tools/testing/selftests/bpf/prog_tests/mmap.c | 13 ++++++++++++- tools/testing/selftests/bpf/progs/test_mmap.c | 8 ++++++++ 3 files changed, 34 insertions(+), 4 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 2843bbba9ca1..4e6dee19a668 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -623,9 +623,20 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma) mutex_lock(&map->freeze_mutex); - if ((vma->vm_flags & VM_WRITE) && map->frozen) { - err = -EPERM; - goto out; + if (vma->vm_flags & VM_WRITE) { + if (map->frozen) { + err = -EPERM; + goto out; + } + /* map is meant to be read-only, so do not allow mapping as + * writable, because it's possible to leak a writable page + * reference and allows user-space to still modify it after + * freezing, while verifier will assume contents do not change + */ + if (map->map_flags & BPF_F_RDONLY_PROG) { + err = -EACCES; + goto out; + } } /* set default open/close callbacks */ diff --git a/tools/testing/selftests/bpf/prog_tests/mmap.c b/tools/testing/selftests/bpf/prog_tests/mmap.c index 6b9dce431d41..43d0b5578f46 100644 --- a/tools/testing/selftests/bpf/prog_tests/mmap.c +++ b/tools/testing/selftests/bpf/prog_tests/mmap.c @@ -19,7 +19,7 @@ void test_mmap(void) const size_t map_sz = roundup_page(sizeof(struct map_data)); const int zero = 0, one = 1, two = 2, far = 1500; const long page_size = sysconf(_SC_PAGE_SIZE); - int err, duration = 0, i, data_map_fd, data_map_id, tmp_fd; + int err, duration = 0, i, data_map_fd, data_map_id, tmp_fd, rdmap_fd; struct bpf_map *data_map, *bss_map; void *bss_mmaped = NULL, *map_mmaped = NULL, *tmp1, *tmp2; struct test_mmap__bss *bss_data; @@ -37,6 +37,17 @@ void test_mmap(void) data_map = skel->maps.data_map; data_map_fd = bpf_map__fd(data_map); + rdmap_fd = bpf_map__fd(skel->maps.rdonly_map); + tmp1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, rdmap_fd, 0); + if (CHECK(tmp1 != MAP_FAILED, "rdonly_write_mmap", "unexpected success\n")) { + munmap(tmp1, 4096); + goto cleanup; + } + /* now double-check if it's mmap()'able at all */ + tmp1 = mmap(NULL, 4096, PROT_READ, MAP_SHARED, rdmap_fd, 0); + if (CHECK(tmp1 == MAP_FAILED, "rdonly_read_mmap", "failed: %d\n", errno)) + goto cleanup; + /* get map's ID */ memset(&map_info, 0, map_info_sz); err = bpf_obj_get_info_by_fd(data_map_fd, &map_info, &map_info_sz); diff --git a/tools/testing/selftests/bpf/progs/test_mmap.c b/tools/testing/selftests/bpf/progs/test_mmap.c index 6239596cd14e..4eb42cff5fe9 100644 --- a/tools/testing/selftests/bpf/progs/test_mmap.c +++ b/tools/testing/selftests/bpf/progs/test_mmap.c @@ -7,6 +7,14 @@ char _license[] SEC("license") = "GPL"; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 4096); + __uint(map_flags, BPF_F_MMAPABLE | BPF_F_RDONLY_PROG); + __type(key, __u32); + __type(value, char); +} rdonly_map SEC(".maps"); + struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 512 * 4); /* at least 4 pages of data */ -- cgit v1.2.3-71-gd317 From 18f855e574d9799a0e7489f8ae6fd8447d0dd74a Mon Sep 17 00:00:00 2001 From: Jens Axboe Date: Tue, 26 May 2020 09:38:31 -0600 Subject: sched/fair: Don't NUMA balance for kthreads Stefano reported a crash with using SQPOLL with io_uring: BUG: kernel NULL pointer dereference, address: 00000000000003b0 CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11 RIP: 0010:task_numa_work+0x4f/0x2c0 Call Trace: task_work_run+0x68/0xa0 io_sq_thread+0x252/0x3d0 kthread+0xf9/0x130 ret_from_fork+0x35/0x40 which is task_numa_work() oopsing on current->mm being NULL. The task work is queued by task_tick_numa(), which checks if current->mm is NULL at the time of the call. But this state isn't necessarily persistent, if the kthread is using use_mm() to temporarily adopt the mm of a task. Change the task_tick_numa() check to exclude kernel threads in general, as it doesn't make sense to attempt ot balance for kthreads anyway. Reported-by: Stefano Garzarella Signed-off-by: Jens Axboe Signed-off-by: Ingo Molnar Acked-by: Peter Zijlstra Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 538ba5d94e99..da3e5b54715b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2908,7 +2908,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr) /* * We don't care about NUMA placement if we don't have memory. */ - if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work) + if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work) return; /* -- cgit v1.2.3-71-gd317 From 18644cec714aabbab173729192c7594ee57a406b Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Thu, 28 May 2020 21:38:36 -0700 Subject: bpf: Fix use-after-free in fmod_ret check Fix the following issue: [ 436.749342] BUG: KASAN: use-after-free in bpf_trampoline_put+0x39/0x2a0 [ 436.749995] Write of size 4 at addr ffff8881ef38b8a0 by task kworker/3:5/2243 [ 436.750712] [ 436.752677] Workqueue: events bpf_prog_free_deferred [ 436.753183] Call Trace: [ 436.756483] bpf_trampoline_put+0x39/0x2a0 [ 436.756904] bpf_prog_free_deferred+0x16d/0x3d0 [ 436.757377] process_one_work+0x94a/0x15b0 [ 436.761969] [ 436.762130] Allocated by task 2529: [ 436.763323] bpf_trampoline_lookup+0x136/0x540 [ 436.763776] bpf_check+0x2872/0xa0a8 [ 436.764144] bpf_prog_load+0xb6f/0x1350 [ 436.764539] __do_sys_bpf+0x16d7/0x3720 [ 436.765825] [ 436.765988] Freed by task 2529: [ 436.767084] kfree+0xc6/0x280 [ 436.767397] bpf_trampoline_put+0x1fd/0x2a0 [ 436.767826] bpf_check+0x6832/0xa0a8 [ 436.768197] bpf_prog_load+0xb6f/0x1350 [ 436.768594] __do_sys_bpf+0x16d7/0x3720 prog->aux->trampoline = tr should be set only when prog is valid. Otherwise prog freeing will try to put trampoline via prog->aux->trampoline, but it may not point to a valid trampoline. Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN") Signed-off-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann Acked-by: KP Singh Link: https://lore.kernel.org/bpf/20200529043839.15824-2-alexei.starovoitov@gmail.com --- kernel/bpf/verifier.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 8d7ee40e2748..bfea3f9a972f 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -10428,22 +10428,13 @@ static int check_struct_ops_btf_id(struct bpf_verifier_env *env) } #define SECURITY_PREFIX "security_" -static int check_attach_modify_return(struct bpf_verifier_env *env) +static int check_attach_modify_return(struct bpf_prog *prog, unsigned long addr) { - struct bpf_prog *prog = env->prog; - unsigned long addr = (unsigned long) prog->aux->trampoline->func.addr; - - /* This is expected to be cleaned up in the future with the KRSI effort - * introducing the LSM_HOOK macro for cleaning up lsm_hooks.h. - */ if (within_error_injection_list(addr) || !strncmp(SECURITY_PREFIX, prog->aux->attach_func_name, sizeof(SECURITY_PREFIX) - 1)) return 0; - verbose(env, "fmod_ret attach_btf_id %u (%s) is not modifiable\n", - prog->aux->attach_btf_id, prog->aux->attach_func_name); - return -EINVAL; } @@ -10654,11 +10645,18 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) goto out; } } + + if (prog->expected_attach_type == BPF_MODIFY_RETURN) { + ret = check_attach_modify_return(prog, addr); + if (ret) + verbose(env, "%s() is not modifiable\n", + prog->aux->attach_func_name); + } + + if (ret) + goto out; tr->func.addr = (void *)addr; prog->aux->trampoline = tr; - - if (prog->expected_attach_type == BPF_MODIFY_RETURN) - ret = check_attach_modify_return(env); out: mutex_unlock(&tr->mutex); if (ret) -- cgit v1.2.3-71-gd317 From 3a71dc366d4aa51a22f385445f2862231d3fda3b Mon Sep 17 00:00:00 2001 From: John Fastabend Date: Fri, 29 May 2020 10:28:40 -0700 Subject: bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones With the latest trunk llvm (llvm 11), I hit a verifier issue for test_prog subtest test_verif_scale1. The following simplified example illustrate the issue: w9 = 0 /* R9_w=inv0 */ r8 = *(u32 *)(r1 + 80) /* __sk_buff->data_end */ r7 = *(u32 *)(r1 + 76) /* __sk_buff->data */ ...... w2 = w9 /* R2_w=inv0 */ r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */ r6 += r2 /* R6_w=inv(id=0) */ r3 = r6 /* R3_w=inv(id=0) */ r3 += 14 /* R3_w=inv(id=0) */ if r3 > r8 goto end r5 = *(u32 *)(r6 + 0) /* R6_w=inv(id=0) */ <== error here: R6 invalid mem access 'inv' ... end: In real test_verif_scale1 code, "w9 = 0" and "w2 = w9" are in different basic blocks. In the above, after "r6 += r2", r6 becomes a scalar, which eventually caused the memory access error. The correct register state should be a pkt pointer. The inprecise register state starts at "w2 = w9". The 32bit register w9 is 0, in __reg_assign_32_into_64(), the 64bit reg->smax_value is assigned to be U32_MAX. The 64bit reg->smin_value is 0 and the 64bit register itself remains constant based on reg->var_off. In adjust_ptr_min_max_vals(), the verifier checks for a known constant, smin_val must be equal to smax_val. Since they are not equal, the verifier decides r6 is a unknown scalar, which caused later failure. The llvm10 does not have this issue as it generates different code: w9 = 0 /* R9_w=inv0 */ r8 = *(u32 *)(r1 + 80) /* __sk_buff->data_end */ r7 = *(u32 *)(r1 + 76) /* __sk_buff->data */ ...... r6 = r7 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */ r6 += r9 /* R6_w=pkt(id=0,off=0,r=0,imm=0) */ r3 = r6 /* R3_w=pkt(id=0,off=0,r=0,imm=0) */ r3 += 14 /* R3_w=pkt(id=0,off=14,r=0,imm=0) */ if r3 > r8 goto end ... To fix the above issue, we can include zero in the test condition for assigning the s32_max_value and s32_min_value to their 64-bit equivalents smax_value and smin_value. Further, fix the condition to avoid doing zero extension bounds checks when s32_min_value <= 0. This could allow for the case where bounds 32-bit bounds (-1,1) get incorrectly translated to (0,1) 64-bit bounds. When in-fact the -1 min value needs to force U32_MAX bound. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov Acked-by: Yonghong Song Link: https://lore.kernel.org/bpf/159077331983.6014.5758956193749002737.stgit@john-Precision-5820-Tower --- kernel/bpf/verifier.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel') diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index bfea3f9a972f..efe14cf24bc6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -1168,14 +1168,14 @@ static void __reg_assign_32_into_64(struct bpf_reg_state *reg) * but must be positive otherwise set to worse case bounds * and refine later from tnum. */ - if (reg->s32_min_value > 0) - reg->smin_value = reg->s32_min_value; - else - reg->smin_value = 0; - if (reg->s32_max_value > 0) + if (reg->s32_min_value >= 0 && reg->s32_max_value >= 0) reg->smax_value = reg->s32_max_value; else reg->smax_value = U32_MAX; + if (reg->s32_min_value >= 0) + reg->smin_value = reg->s32_min_value; + else + reg->smin_value = 0; } static void __reg_combine_32_into_64(struct bpf_reg_state *reg) -- cgit v1.2.3-71-gd317