From 8960432eb8d999fba34cf98ee7159de9159cbf30 Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <tglx@linutronix.de>
Date: Mon, 1 Jul 2013 11:02:42 +0200
Subject: [PATCH 172/366] workqueue: Prevent workqueue versus ata-piix livelock

An Intel i7 system regularly detected rcu_preempt stalls after the kernel
was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no
longer possible until the system was restarted.

The kernel message was:
INFO: rcu_preempt self-detected stall on CPU { 6}
[..]
NMI backtrace for cpu 6
CPU 6
Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58
RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30
RSP: 0018:ffff880333303cb0 EFLAGS: 00000002
RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034
RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001
RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc
R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f
R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002
FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000)
Stack:
ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000
0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8
ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28
Call Trace:
<IRQ>
[<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3
[<ffffffff8124caa9>] __delay+0xf/0x11
[<ffffffff8124cad2>] __const_udelay+0x27/0x29
[<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45
[<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58
[<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d
[<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19
[<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79
[<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568
[<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35
[<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38
[<ffffffff8104f653>] update_process_times+0x44/0x55
[<ffffffff81086866>] tick_sched_handle+0x4a/0x59
[<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b
[<ffffffff81062845>] __run_hrtimer+0x9b/0x158
[<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa
[<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89
[<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80
<EOI>
[<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a
[<ffffffff81059336>] try_to_grab_pending+0x42/0x17e
[<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88
[<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e
[<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39
[<ffffffff81230985>] flush_end_io+0xf1/0x107
[<ffffffff8122e0da>] blk_finish_request+0x21e/0x264
[<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60
[<ffffffff8122e1ba>] blk_end_request+0x10/0x12
[<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492
[<ffffffff81335cec>] ? sd_done+0x298/0x2ef
[<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2
[<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f
[<ffffffff812333d3>] blk_done_softirq+0x77/0x87
[<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1
[<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a
[<ffffffff81048466>] local_bh_enable+0x43/0x72
[<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52
[<ffffffff810ab089>] irq_thread+0x8c/0x17c
[<ffffffff810ab179>] ? irq_thread+0x17c/0x17c
[<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44
[<ffffffff8105eb18>] kthread+0x8d/0x95
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65
[<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65

The state of ksoftirqd on this CPU at the time of the stall was:
ksoftirqd/6 R running task 0 53 2 0x00000000
ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710
ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500
ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000
Call Trace:
[<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c
[<ffffffff814d178c>] preempt_schedule+0x61/0x76
[<ffffffff8106cccf>] migrate_enable+0xe5/0x1df
[<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c
[<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6
[<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1
[<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45
[<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308
[<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc
[<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc
[<ffffffff8105eb18>] kthread+0x8d/0x95
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65
[<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0
[<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65

Apparently, the softirq daemon and the ata_piix IRQ handler were waiting
for each other to finish, ending up in a livelock. After the patch below
was applied, the system no longer stalls.

Reported-by: Carsten Emde <C.Emde@osadl.org>
Proposed-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
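The traces suggest that the IRQ thread retries try_to_grab_pending() from
mod_delayed_work_on() while ksoftirqd, which was preempted in the timer
softirq, cannot make progress. What follows is a minimal userspace sketch
of that retry pattern, assuming that on PREEMPT_RT cpu_chill() sleeps
briefly whereas cpu_relax() only spins. struct fake_work, try_grab(),
grab_retry() and chill() are hypothetical names for illustration, not
kernel APIs, and the 1 ms sleep is an assumed value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical stand-in for a work item whose pending state is
 * transiently owned by another, preempted thread. */
struct fake_work {
	int busy_rounds;	/* attempts that will still fail */
	int canceling;		/* non-zero: work is being canceled */
};

static int chill_calls;	/* how often the retry path slept */

/* Userspace analogue (assumption) of cpu_chill(): sleep instead of
 * spinning, so a lower-priority owner gets a chance to run. */
static void chill(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

	nanosleep(&ts, NULL);
	chill_calls++;
}

/* Mirrors the shape of the fail path in the hunk below: give up with
 * -ENOENT when canceling, otherwise back off and ask for a retry. */
static int try_grab(struct fake_work *w)
{
	assert(w != NULL);
	if (w->busy_rounds == 0)
		return 0;
	w->busy_rounds--;
	if (w->canceling)
		return -ENOENT;
	chill();
	return -EAGAIN;
}

/* Caller side: retry for as long as the grab reports -EAGAIN. */
static int grab_retry(struct fake_work *w)
{
	int ret;

	do {
		ret = try_grab(w);
	} while (ret == -EAGAIN);
	return ret;
}
```

Calling grab_retry() on a fake_work with busy_rounds set to 3 sleeps three
times before it succeeds. With a pure spin in place of chill(), a
higher-priority caller on the same CPU would never yield to the owner,
which appears to be the situation shown in the traces.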
 kernel/workqueue.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index efc8cbe..492968a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -49,6 +49,7 @@
 #include <linux/moduleparam.h>
 #include <linux/uaccess.h>
 #include <linux/locallock.h>
+#include <linux/delay.h>

 #include "workqueue_internal.h"

@@ -1285,7 +1286,7 @@ fail:
 	local_unlock_irqrestore(pendingb_lock, *flags);
 	if (work_is_canceling(work))
 		return -ENOENT;
-	cpu_relax();
+	cpu_chill();
 	return -EAGAIN;
 }

--
1.9.1
