From c2bc11113c50449f23c40b724fe410fc2380a8e9 Mon Sep 17 00:00:00 2001 From: John Stultz Date: Thu, 27 Oct 2011 18:12:42 -0700 Subject: time: Improve documentation of timekeeeping_adjust() After getting a number of questions in private emails about the math around admittedly very complex timekeeping_adjust() and timekeeping_big_adjust(), I figure the code needs some better comments. Hopefully the explanations are clear enough and don't muddy the water any worse. Still needs documentation for ntp_error, but I couldn't recall exactly the full explanation behind the code that's there (although I do recall once working it out when Roman first proposed it). Given a bit more time I can probably work it out, but I don't want to hold back this documentation until then. Signed-off-by: John Stultz Cc: Chen Jie Cc: Steven Rostedt Link: http://lkml.kernel.org/r/1319764362-32367-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar --- kernel/time/timekeeping.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 2b021b0e8507..025e136f3881 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -802,14 +802,44 @@ static void timekeeping_adjust(s64 offset) s64 error, interval = timekeeper.cycle_interval; int adj; + /* + * The point of this is to check if the error is greater then half + * an interval. + * + * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs. + * + * Note we subtract one in the shift, so that error is really error*2. + * This "saves" dividing(shifting) intererval twice, but keeps the + * (error > interval) comparision as still measuring if error is + * larger then half an interval. + * + * Note: It does not "save" on aggrivation when reading the code. + */ error = timekeeper.ntp_error >> (timekeeper.ntp_error_shift - 1); if (error > interval) { + /* + * We now divide error by 4(via shift), which checks if + * the error is greater then twice the interval. + * If it is greater, we need a bigadjust, if its smaller, + * we can adjust by 1. + */ error >>= 2; + /* + * XXX - In update_wall_time, we round up to the next + * nanosecond, and store the amount rounded up into + * the error. This causes the likely below to be unlikely. + * + * The properfix is to avoid rounding up by using + * the high precision timekeeper.xtime_nsec instead of + * xtime.tv_nsec everywhere. Fixing this will take some + * time. + */ if (likely(error <= interval)) adj = 1; else adj = timekeeping_bigadjust(error, &interval, &offset); } else if (error < -interval) { + /* See comment above, this is just switched for the negative */ error >>= 2; if (likely(error >= -interval)) { adj = -1; @@ -817,9 +847,58 @@ static void timekeeping_adjust(s64 offset) offset = -offset; } else adj = timekeeping_bigadjust(error, &interval, &offset); - } else + } else /* No adjustment needed */ return; + /* + * So the following can be confusing. + * + * To keep things simple, lets assume adj == 1 for now. + * + * When adj != 1, remember that the interval and offset values + * have been appropriately scaled so the math is the same. + * + * The basic idea here is that we're increasing the multiplier + * by one, this causes the xtime_interval to be incremented by + * one cycle_interval. This is because: + * xtime_interval = cycle_interval * mult + * So if mult is being incremented by one: + * xtime_interval = cycle_interval * (mult + 1) + * Its the same as: + * xtime_interval = (cycle_interval * mult) + cycle_interval + * Which can be shortened to: + * xtime_interval += cycle_interval + * + * So offset stores the non-accumulated cycles. Thus the current + * time (in shifted nanoseconds) is: + * now = (offset * adj) + xtime_nsec + * Now, even though we're adjusting the clock frequency, we have + * to keep time consistent. In other words, we can't jump back + * in time, and we also want to avoid jumping forward in time. + * + * So given the same offset value, we need the time to be the same + * both before and after the freq adjustment. + * now = (offset * adj_1) + xtime_nsec_1 + * now = (offset * adj_2) + xtime_nsec_2 + * So: + * (offset * adj_1) + xtime_nsec_1 = + * (offset * adj_2) + xtime_nsec_2 + * And we know: + * adj_2 = adj_1 + 1 + * So: + * (offset * adj_1) + xtime_nsec_1 = + * (offset * (adj_1+1)) + xtime_nsec_2 + * (offset * adj_1) + xtime_nsec_1 = + * (offset * adj_1) + offset + xtime_nsec_2 + * Canceling the sides: + * xtime_nsec_1 = offset + xtime_nsec_2 + * Which gives us: + * xtime_nsec_2 = xtime_nsec_1 - offset + * Which simplfies to: + * xtime_nsec -= offset + * + * XXX - TODO: Doc ntp_error calculation. + */ timekeeper.mult += adj; timekeeper.xtime_interval += interval; timekeeper.xtime_nsec -= offset; -- cgit v1.2.3-71-gd317 From c75d720fca8a91ce99196d33adea383621027bf2 Mon Sep 17 00:00:00 2001 From: Edward Donovan Date: Tue, 1 Nov 2011 15:29:44 -0400 Subject: genirq: Fix irqfixup, irqpoll regression commit d05c65fff0 ("genirq: spurious: Run only one poller at a time") introduced a regression, leaving the boot options 'irqfixup' and 'irqpoll' non-functional. The patch placed tests in each function, to exit if the function is already running. The test in 'misrouted_irq' exited when it should have proceeded, effectively disabling 'misrouted_irq' and 'poll_spurious_irqs'. The check for an already running poller needs to be "!= 1" not "== 1" as "1" is the value when the first poller starts running. Signed-off-by: Edward Donovan Cc: maciej.rutecki@gmail.com Link: http://lkml.kernel.org/r/1320175784-6745-1-git-send-email-edward.donovan@numble.net Cc: stable@vger.kernel.org # >= 2.6.39 Signed-off-by: Thomas Gleixner --- kernel/irq/spurious.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c index aa57d5da18c1..b5f4742693c0 100644 --- a/kernel/irq/spurious.c +++ b/kernel/irq/spurious.c @@ -115,7 +115,7 @@ static int misrouted_irq(int irq) struct irq_desc *desc; int i, ok = 0; - if (atomic_inc_return(&irq_poll_active) == 1) + if (atomic_inc_return(&irq_poll_active) != 1) goto out; irq_poll_cpu = smp_processor_id(); -- cgit v1.2.3-71-gd317 From d65670a78cdbfae94f20a9e05ec705871d7cdf2b Mon Sep 17 00:00:00 2001 From: John Stultz Date: Mon, 31 Oct 2011 17:06:35 -0400 Subject: clocksource: Avoid selecting mult values that might overflow when adjusted For some frequencies, the clocks_calc_mult_shift() function will unfortunately select mult values very close to 0xffffffff. This has the potential to overflow when NTP adjusts the clock, adding to the mult value. This patch adds a clocksource.maxadj value, which provides an approximation of an 11% adjustment(NTP limits adjustments to 500ppm and the tick adjustment is limited to 10%), which could be made to the clocksource.mult value. This is then used to both check that the current mult value won't overflow/underflow, as well as warning us if the timekeeping_adjust() code pushes over that 11% boundary. v2: Fix max_adjustment calculation, and improve WARN_ONCE messages. v3: Don't warn before maxadj has actually been set CC: Yong Zhang CC: David Daney CC: Thomas Gleixner CC: Chen Jie CC: zhangfx CC: stable@kernel.org Reported-by: Chen Jie Reported-by: zhangfx Tested-by: Yong Zhang Signed-off-by: John Stultz --- include/linux/clocksource.h | 3 ++- kernel/time/clocksource.c | 58 +++++++++++++++++++++++++++++++++++++-------- kernel/time/timekeeping.c | 7 ++++++ 3 files changed, 57 insertions(+), 11 deletions(-) (limited to 'kernel') diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 139c4db55f17..c86c940d1de3 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -156,6 +156,7 @@ extern u64 timecounter_cyc2time(struct timecounter *tc, * @mult: cycle to nanosecond multiplier * @shift: cycle to nanosecond divisor (power of two) * @max_idle_ns: max idle time permitted by the clocksource (nsecs) + * @maxadj maximum adjustment value to mult (~11%) * @flags: flags describing special properties * @archdata: arch-specific data * @suspend: suspend function for the clocksource, if necessary @@ -172,7 +173,7 @@ struct clocksource { u32 mult; u32 shift; u64 max_idle_ns; - + u32 maxadj; #ifdef CONFIG_ARCH_CLOCKSOURCE_DATA struct arch_clocksource_data archdata; #endif diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cf52fda2e096..cfc65e1eb9fb 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -491,6 +491,22 @@ void clocksource_touch_watchdog(void) clocksource_resume_watchdog(); } +/** + * clocksource_max_adjustment- Returns max adjustment amount + * @cs: Pointer to clocksource + * + */ +static u32 clocksource_max_adjustment(struct clocksource *cs) +{ + u64 ret; + /* + * We won't try to correct for more then 11% adjustments (110,000 ppm), + */ + ret = (u64)cs->mult * 11; + do_div(ret,100); + return (u32)ret; +} + /** * clocksource_max_deferment - Returns max time the clocksource can be deferred * @cs: Pointer to clocksource @@ -503,25 +519,28 @@ static u64 clocksource_max_deferment(struct clocksource *cs) /* * Calculate the maximum number of cycles that we can pass to the * cyc2ns function without overflowing a 64-bit signed result. The - * maximum number of cycles is equal to ULLONG_MAX/cs->mult which - * is equivalent to the below. - * max_cycles < (2^63)/cs->mult - * max_cycles < 2^(log2((2^63)/cs->mult)) - * max_cycles < 2^(log2(2^63) - log2(cs->mult)) - * max_cycles < 2^(63 - log2(cs->mult)) - * max_cycles < 1 << (63 - log2(cs->mult)) + * maximum number of cycles is equal to ULLONG_MAX/(cs->mult+cs->maxadj) + * which is equivalent to the below. + * max_cycles < (2^63)/(cs->mult + cs->maxadj) + * max_cycles < 2^(log2((2^63)/(cs->mult + cs->maxadj))) + * max_cycles < 2^(log2(2^63) - log2(cs->mult + cs->maxadj)) + * max_cycles < 2^(63 - log2(cs->mult + cs->maxadj)) + * max_cycles < 1 << (63 - log2(cs->mult + cs->maxadj)) * Please note that we add 1 to the result of the log2 to account for * any rounding errors, ensure the above inequality is satisfied and * no overflow will occur. */ - max_cycles = 1ULL << (63 - (ilog2(cs->mult) + 1)); + max_cycles = 1ULL << (63 - (ilog2(cs->mult + cs->maxadj) + 1)); /* * The actual maximum number of cycles we can defer the clocksource is * determined by the minimum of max_cycles and cs->mask. + * Note: Here we subtract the maxadj to make sure we don't sleep for + * too long if there's a large negative adjustment. */ max_cycles = min_t(u64, max_cycles, (u64) cs->mask); - max_nsecs = clocksource_cyc2ns(max_cycles, cs->mult, cs->shift); + max_nsecs = clocksource_cyc2ns(max_cycles, cs->mult - cs->maxadj, + cs->shift); /* * To ensure that the clocksource does not wrap whilst we are idle, @@ -640,7 +659,6 @@ static void clocksource_enqueue(struct clocksource *cs) void __clocksource_updatefreq_scale(struct clocksource *cs, u32 scale, u32 freq) { u64 sec; - /* * Calc the maximum number of seconds which we can run before * wrapping around. For clocksources which have a mask > 32bit @@ -661,6 +679,20 @@ void __clocksource_updatefreq_scale(struct clocksource *cs, u32 scale, u32 freq) clocks_calc_mult_shift(&cs->mult, &cs->shift, freq, NSEC_PER_SEC / scale, sec * scale); + + /* + * for clocksources that have large mults, to avoid overflow. + * Since mult may be adjusted by ntp, add an safety extra margin + * + */ + cs->maxadj = clocksource_max_adjustment(cs); + while ((cs->mult + cs->maxadj < cs->mult) + || (cs->mult - cs->maxadj > cs->mult)) { + cs->mult >>= 1; + cs->shift--; + cs->maxadj = clocksource_max_adjustment(cs); + } + cs->max_idle_ns = clocksource_max_deferment(cs); } EXPORT_SYMBOL_GPL(__clocksource_updatefreq_scale); @@ -701,6 +733,12 @@ EXPORT_SYMBOL_GPL(__clocksource_register_scale); */ int clocksource_register(struct clocksource *cs) { + /* calculate max adjustment for given mult/shift */ + cs->maxadj = clocksource_max_adjustment(cs); + WARN_ONCE(cs->mult + cs->maxadj < cs->mult, + "Clocksource %s might overflow on 11%% adjustment\n", + cs->name); + /* calculate max idle time permitted for this clocksource */ cs->max_idle_ns = clocksource_max_deferment(cs); diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 2b021b0e8507..e65ff3171102 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -820,6 +820,13 @@ static void timekeeping_adjust(s64 offset) } else return; + WARN_ONCE(timekeeper.clock->maxadj && + (timekeeper.mult + adj > timekeeper.clock->mult + + timekeeper.clock->maxadj), + "Adjusting %s more then 11%% (%ld vs %ld)\n", + timekeeper.clock->name, (long)timekeeper.mult + adj, + (long)timekeeper.clock->mult + + timekeeper.clock->maxadj); timekeeper.mult += adj; timekeeper.xtime_interval += interval; timekeeper.xtime_nsec -= offset; -- cgit v1.2.3-71-gd317 From 468e6a20afaccb67e2a7d7f60d301f90e1c6f301 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 7 Sep 2011 10:41:32 -0600 Subject: writeback: remove vm_dirties and task->dirties They are not used any more. Signed-off-by: Wu Fengguang --- include/linux/init_task.h | 1 - include/linux/sched.h | 1 - kernel/fork.c | 5 ----- mm/page-writeback.c | 9 --------- 4 files changed, 16 deletions(-) (limited to 'kernel') diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 08ffab01e76c..94b1e356c02a 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -184,7 +184,6 @@ extern struct cred init_cred; [PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \ }, \ .thread_group = LIST_HEAD_INIT(tsk.thread_group), \ - .dirties = INIT_PROP_LOCAL_SINGLE(dirties), \ INIT_IDS \ INIT_PERF_EVENTS(tsk) \ INIT_TRACE_IRQFLAGS \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 68daf4f27e2c..1c4f3e9b9bc5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1521,7 +1521,6 @@ struct task_struct { #ifdef CONFIG_FAULT_INJECTION int make_it_fail; #endif - struct prop_local_single dirties; /* * when (nr_dirtied >= nr_dirtied_pause), it's time to call * balance_dirty_pages() for some dirty throttling pause diff --git a/kernel/fork.c b/kernel/fork.c index ba0d17261329..da4a6a10d088 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -162,7 +162,6 @@ static void account_kernel_stack(struct thread_info *ti, int account) void free_task(struct task_struct *tsk) { - prop_local_destroy_single(&tsk->dirties); account_kernel_stack(tsk->stack, -1); free_thread_info(tsk->stack); rt_mutex_debug_task_free(tsk); @@ -274,10 +273,6 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) tsk->stack = ti; - err = prop_local_init_single(&tsk->dirties); - if (err) - goto out; - setup_thread_stack(tsk, orig); clear_user_return_notifier(tsk); clear_tsk_need_resched(tsk); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index e7cb5ff6e53d..71252486bc6f 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -128,7 +128,6 @@ unsigned long global_dirty_limit; * */ static struct prop_descriptor vm_completions; -static struct prop_descriptor vm_dirties; /* * couple the period to the dirty_ratio: @@ -154,7 +153,6 @@ static void update_completion_period(void) { int shift = calc_period_shift(); prop_change_shift(&vm_completions, shift); - prop_change_shift(&vm_dirties, shift); writeback_set_ratelimit(); } @@ -235,11 +233,6 @@ void bdi_writeout_inc(struct backing_dev_info *bdi) } EXPORT_SYMBOL_GPL(bdi_writeout_inc); -void task_dirty_inc(struct task_struct *tsk) -{ - prop_inc_single(&vm_dirties, &tsk->dirties); -} - /* * Obtain an accurate fraction of the BDI's portion. */ @@ -1395,7 +1388,6 @@ void __init page_writeback_init(void) shift = calc_period_shift(); prop_descriptor_init(&vm_completions, shift); - prop_descriptor_init(&vm_dirties, shift); } /** @@ -1724,7 +1716,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) __inc_zone_page_state(page, NR_DIRTIED); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED); - task_dirty_inc(current); task_io_account_write(PAGE_CACHE_SIZE); } } -- cgit v1.2.3-71-gd317 From 2ed0e645f358c26f4f4a7aed56a9488db0020ad1 Mon Sep 17 00:00:00 2001 From: Marc Zyngier Date: Wed, 16 Nov 2011 12:27:39 +0000 Subject: genirq: Don't allow per cpu interrupts to be suspended The power management functions related to interrupts do not know (yet) about per-cpu interrupts and end up calling the wrong low-level methods to enable/disable interrupts. This leads to all kind of interesting issues (action taken on one CPU only, updating a refcount which is not used otherwise...). The workaround for the time being is simply to flag these interrupts with IRQF_NO_SUSPEND. At least on ARM, these interrupts are actually dealt with at the architecture level. Reported-by: Santosh Shilimkar Tested-by: Santosh Shilimkar Signed-off-by: Marc Zyngier Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1321446459-31409-1-git-send-email-marc.zyngier@arm.com Signed-off-by: Thomas Gleixner --- kernel/irq/manage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 67ce837ae52c..0e2b179bc7b3 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1596,7 +1596,7 @@ int request_percpu_irq(unsigned int irq, irq_handler_t handler, return -ENOMEM; action->handler = handler; - action->flags = IRQF_PERCPU; + action->flags = IRQF_PERCPU | IRQF_NO_SUSPEND; action->name = devname; action->percpu_dev_id = dev_id; -- cgit v1.2.3-71-gd317 From d004e024058a0eaca097513ce62cbcf978913e0a Mon Sep 17 00:00:00 2001 From: Hector Palacios Date: Mon, 14 Nov 2011 11:15:25 +0100 Subject: timekeeping: add arch_offset hook to ktime_get functions ktime_get and ktime_get_ts were calling timekeeping_get_ns() but later they were not calling arch_gettimeoffset() so architectures using this mechanism returned 0 ns when calling these functions. This happened for example when running Busybox's ping which calls syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts) which eventually calls ktime_get. As a result the returned ping travel time was zero. CC: stable@kernel.org Signed-off-by: Hector Palacios Signed-off-by: John Stultz --- kernel/time/timekeeping.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'kernel') diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index e9f60d311436..237841378c03 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -249,6 +249,8 @@ ktime_t ktime_get(void) secs = xtime.tv_sec + wall_to_monotonic.tv_sec; nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec; nsecs += timekeeping_get_ns(); + /* If arch requires, add in gettimeoffset() */ + nsecs += arch_gettimeoffset(); } while (read_seqretry(&xtime_lock, seq)); /* @@ -280,6 +282,8 @@ void ktime_get_ts(struct timespec *ts) *ts = xtime; tomono = wall_to_monotonic; nsecs = timekeeping_get_ns(); + /* If arch requires, add in gettimeoffset() */ + nsecs += arch_gettimeoffset(); } while (read_seqretry(&xtime_lock, seq)); -- cgit v1.2.3-71-gd317 From aa9a7b11821e883a7b93ecce190881e0ea48648b Mon Sep 17 00:00:00 2001 From: "Srivatsa S. Bhat" Date: Fri, 18 Nov 2011 23:02:42 +0100 Subject: PM / Hibernate: Fix the early termination of test modes Commit 2aede851ddf08666f68ffc17be446420e9d2a056 (PM / Hibernate: Freeze kernel threads after preallocating memory) postponed the freezing of kernel threads to after preallocating memory for hibernation. But while doing that, the hibernation test TEST_FREEZER and the test mode HIBERNATION_TESTPROC were not moved accordingly. As a result, when using these test modes, it only goes upto the freezing of userspace and exits, when in fact it should go till the complete end of task freezing stage, namely the freezing of kernel threads as well. So, move these points of exit to appropriate places so that freezing of kernel threads is also tested while using these test harnesses. Signed-off-by: Srivatsa S. Bhat Signed-off-by: Rafael J. Wysocki --- kernel/power/hibernate.c | 23 +++++++++++++++++------ 1 file changed, 17 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index b4511b6d3ef9..196c01268ebd 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -55,6 +55,8 @@ enum { static int hibernation_mode = HIBERNATION_SHUTDOWN; +static bool freezer_test_done; + static const struct platform_hibernation_ops *hibernation_ops; /** @@ -347,6 +349,17 @@ int hibernation_snapshot(int platform_mode) if (error) goto Close; + if (hibernation_test(TEST_FREEZER) || + hibernation_testmode(HIBERNATION_TESTPROC)) { + + /* + * Indicate to the caller that we are returning due to a + * successful freezer test. + */ + freezer_test_done = true; + goto Close; + } + error = dpm_prepare(PMSG_FREEZE); if (error) goto Complete_devices; @@ -641,15 +654,13 @@ int hibernate(void) if (error) goto Finish; - if (hibernation_test(TEST_FREEZER)) - goto Thaw; - - if (hibernation_testmode(HIBERNATION_TESTPROC)) - goto Thaw; - error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM); if (error) goto Thaw; + if (freezer_test_done) { + freezer_test_done = false; + goto Thaw; + } if (in_suspend) { unsigned int flags = 0; -- cgit v1.2.3-71-gd317 From 27c9cd7e601632b3794e1c3344d37b86917ffb43 Mon Sep 17 00:00:00 2001 From: Jeff Ohlstein Date: Fri, 18 Nov 2011 15:47:10 -0800 Subject: hrtimer: Fix extra wakeups from __remove_hrtimer() __remove_hrtimer() attempts to reprogram the clockevent device when the timer being removed is the next to expire. However, __remove_hrtimer() reprograms the clockevent *before* removing the timer from the timerqueue and thus when hrtimer_force_reprogram() finds the next timer to expire it finds the timer we're trying to remove. This is especially noticeable when the system switches to NOHz mode and the system tick is removed. The timer tick is removed from the system but the clockevent is programmed to wakeup in another HZ anyway. Silence the extra wakeup by removing the timer from the timerqueue before calling hrtimer_force_reprogram() so that we actually program the clockevent for the next timer to expire. This was broken by 998adc3 "hrtimers: Convert hrtimers to use timerlist infrastructure". Signed-off-by: Jeff Ohlstein Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1321660030-8520-1-git-send-email-johlstei@codeaurora.org Signed-off-by: Thomas Gleixner --- kernel/hrtimer.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index a9205e32a059..2043c08d36c8 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -885,10 +885,13 @@ static void __remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, unsigned long newstate, int reprogram) { + struct timerqueue_node *next_timer; if (!(timer->state & HRTIMER_STATE_ENQUEUED)) goto out; - if (&timer->node == timerqueue_getnext(&base->active)) { + next_timer = timerqueue_getnext(&base->active); + timerqueue_del(&base->active, &timer->node); + if (&timer->node == next_timer) { #ifdef CONFIG_HIGH_RES_TIMERS /* Reprogram the clock event device. if enabled */ if (reprogram && hrtimer_hres_active()) { @@ -901,7 +904,6 @@ static void __remove_hrtimer(struct hrtimer *timer, } #endif } - timerqueue_del(&base->active, &timer->node); if (!timerqueue_getnext(&base->active)) base->cpu_base->active_bases &= ~(1 << base->index); out: -- cgit v1.2.3-71-gd317 From 501a708f18ef911328ffd39f39738b8a7862aa8e Mon Sep 17 00:00:00 2001 From: "Srivatsa S. Bhat" Date: Sat, 19 Nov 2011 14:37:57 +0100 Subject: PM / Suspend: Fix bug in suspend statistics update After commit 2a77c46de1e3dace73745015635ebbc648eca69c (PM / Suspend: Add statistics debugfs file for suspend to RAM) a missing pair of braces inside the state_store() function causes even invalid arguments to suspend to be wrongly treated as failed suspend attempts. Fix this. [rjw: Put the hash/subject of the buggy commit into the changelog.] Signed-off-by: Srivatsa S. Bhat Signed-off-by: Rafael J. Wysocki --- kernel/power/main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/power/main.c b/kernel/power/main.c index 71f49fe4377e..36e0f0903c32 100644 --- a/kernel/power/main.c +++ b/kernel/power/main.c @@ -290,13 +290,14 @@ static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr, if (*s && len == strlen(*s) && !strncmp(buf, *s, len)) break; } - if (state < PM_SUSPEND_MAX && *s) + if (state < PM_SUSPEND_MAX && *s) { error = enter_state(state); if (error) { suspend_stats.fail++; dpm_save_failed_errno(error); } else suspend_stats.success++; + } #endif Exit: -- cgit v1.2.3-71-gd317 From bb58dd5d1ffad6c2d21c69698ba766dad4ae54e6 Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Tue, 22 Nov 2011 23:08:10 +0100 Subject: PM / Hibernate: Do not leak memory in error/test code paths The hibernation core code forgets to release memory preallocated for hibernation if there's an error in its early stages or if test modes causing hibernation_snapshot() to return early are used. This causes the system to be hardly usable, because the amount of preallocated memory is usually huge. Fix this problem. Reported-by: Srivatsa S. Bhat Signed-off-by: Rafael J. Wysocki Acked-by: Srivatsa S. Bhat --- kernel/power/hibernate.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) (limited to 'kernel') diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 196c01268ebd..a6b0503574ee 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -347,7 +347,7 @@ int hibernation_snapshot(int platform_mode) error = freeze_kernel_threads(); if (error) - goto Close; + goto Cleanup; if (hibernation_test(TEST_FREEZER) || hibernation_testmode(HIBERNATION_TESTPROC)) { @@ -357,12 +357,14 @@ int hibernation_snapshot(int platform_mode) * successful freezer test. */ freezer_test_done = true; - goto Close; + goto Cleanup; } error = dpm_prepare(PMSG_FREEZE); - if (error) - goto Complete_devices; + if (error) { + dpm_complete(msg); + goto Cleanup; + } suspend_console(); pm_restrict_gfp_mask(); @@ -391,8 +393,6 @@ int hibernation_snapshot(int platform_mode) pm_restore_gfp_mask(); resume_console(); - - Complete_devices: dpm_complete(msg); Close: @@ -402,6 +402,10 @@ int hibernation_snapshot(int platform_mode) Recover_platform: platform_recover(platform_mode); goto Resume_devices; + + Cleanup: + swsusp_free(); + goto Close; } /** -- cgit v1.2.3-71-gd317 From 884a45d964dd395eda945842afff5e16bcaedf56 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 22 Nov 2011 07:44:47 -0800 Subject: cgroup_freezer: fix freezing groups with stopped tasks 2d3cbf8b (cgroup_freezer: update_freezer_state() does incorrect state transitions) removed is_task_frozen_enough and replaced it with a simple frozen call. This, however, breaks freezing for a group with stopped tasks because those cannot be frozen and so the group remains in CGROUP_FREEZING state (update_if_frozen doesn't count stopped tasks) and never reaches CGROUP_FROZEN. Let's add is_task_frozen_enough back and use it at the original locations (update_if_frozen and try_to_freeze_cgroup). Semantically we consider stopped tasks as frozen enough so we should consider both cases when testing frozen tasks. Testcase: mkdir /dev/freezer mount -t cgroup -o freezer none /dev/freezer mkdir /dev/freezer/foo sleep 1h & pid=$! kill -STOP $pid echo $pid > /dev/freezer/foo/tasks echo FROZEN > /dev/freezer/foo/freezer.state while true do cat /dev/freezer/foo/freezer.state [ "`cat /dev/freezer/foo/freezer.state`" = "FROZEN" ] && break sleep 1 done echo OK Signed-off-by: Michal Hocko Acked-by: Li Zefan Cc: Tomasz Buchert Cc: Paul Menage Cc: Andrew Morton Cc: stable@kernel.org Signed-off-by: Tejun Heo --- kernel/cgroup_freezer.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'kernel') diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c index 5e828a2ca8e6..213c0351dad8 100644 --- a/kernel/cgroup_freezer.c +++ b/kernel/cgroup_freezer.c @@ -153,6 +153,13 @@ static void freezer_destroy(struct cgroup_subsys *ss, kfree(cgroup_freezer(cgroup)); } +/* task is frozen or will freeze immediately when next it gets woken */ +static bool is_task_frozen_enough(struct task_struct *task) +{ + return frozen(task) || + (task_is_stopped_or_traced(task) && freezing(task)); +} + /* * The call to cgroup_lock() in the freezer.state write method prevents * a write to that file racing against an attach, and hence the @@ -231,7 +238,7 @@ static void update_if_frozen(struct cgroup *cgroup, cgroup_iter_start(cgroup, &it); while ((task = cgroup_iter_next(cgroup, &it))) { ntotal++; - if (frozen(task)) + if (is_task_frozen_enough(task)) nfrozen++; } @@ -284,7 +291,7 @@ static int try_to_freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer) while ((task = cgroup_iter_next(cgroup, &it))) { if (!freeze_task(task, true)) continue; - if (frozen(task)) + if (is_task_frozen_enough(task)) continue; if (!freezing(task) && !freezer_should_skip(task)) num_cant_freeze_now++; -- cgit v1.2.3-71-gd317 From 52553ddffad76ccf192d4dd9ce88d5818f57f62a Mon Sep 17 00:00:00 2001 From: Edward Donovan Date: Sun, 27 Nov 2011 23:07:34 -0500 Subject: genirq: fix regression in irqfixup, irqpoll MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Commit fa27271bc8d2("genirq: Fixup poll handling") introduced a regression that broke irqfixup/irqpoll for some hardware configurations. Amidst reorganizing 'try_one_irq', that patch removed a test that checked for 'action->handler' returning IRQ_HANDLED, before acting on the interrupt. Restoring this test back returns the functionality lost since 2.6.39. In the current set of tests, after 'action' is set, it must precede '!action->next' to take effect. With this and my previous patch to irq/spurious.c, c75d720fca8a, all IRQ regressions that I have encountered are fixed. Signed-off-by: Edward Donovan Reported-and-tested-by: Rogério Brito Cc: Thomas Gleixner Cc: stable@kernel.org (2.6.39+) Signed-off-by: Linus Torvalds --- kernel/irq/spurious.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'kernel') diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c index b5f4742693c0..dc813a948be2 100644 --- a/kernel/irq/spurious.c +++ b/kernel/irq/spurious.c @@ -84,7 +84,9 @@ static int try_one_irq(int irq, struct irq_desc *desc, bool force) */ action = desc->action; if (!action || !(action->flags & IRQF_SHARED) || - (action->flags & __IRQF_TIMER) || !action->next) + (action->flags & __IRQF_TIMER) || + (action->handler(irq, action->dev_id) == IRQ_HANDLED) || + !action->next) goto out; /* Already running on another processor */ -- cgit v1.2.3-71-gd317