soft lockup happened in tasklist_read_lock()

Post by **t19467** » 2020/08/24 11:46:05

I have two questions and would appreciate any help.
One is about a soft lockup; the other is about how tasklist_lock is used differently in the CentOS kernel and the upstream Linux kernel.
1,
We hit a soft lockup on CentOS 7.6. Has the CentOS community seen this issue before?
There was no special test scenario; we were running our normal tests. We have hit the issue only once, so its probability is very low.
All CPUs were in the soft lockup state. The call traces are below; the soft lockup is caused by waiting in the function tasklist_read_lock():
void tasklist_read_lock(void)
{
	if (WARN_ON_ONCE(in_interrupt()))
		goto no_wait;
#ifdef CONFIG_LOCKDEP
	if (WARN_ON_ONCE(lockdep_is_held(&tasklist_lock)))
		goto no_wait;
#endif
	while (atomic_read(&tasklist_waiters))	/* <== soft lockup here */
		cpu_relax();
no_wait:
	qread_lock(&tasklist_lock);
}

Jul 21 17:46:00 matrix01 kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [redis-sentinel:273673]
Jul 21 17:46:00 matrix01 kernel: CPU: 5 PID: 273673 Comm: redis-sentinel Kdump: loaded Tainted: G L ------------ T 3.10.0-957.27.2.el7.x86_64 #1
Jul 21 17:46:00 matrix01 kernel: Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9S, BIOS 2.00.37 01/16/2020
Jul 21 17:46:00 matrix01 kernel: task: ffff98413de830c0 ti: ffff982bc4dd8000 task.ti: ffff982bc4dd8000
Jul 21 17:46:00 matrix01 kernel: RIP: 0010:[<ffffffffb5094e28>] [<ffffffffb5094e28>] tasklist_read_lock+0x28/0x60
Jul 21 17:46:00 matrix01 kernel: RSP: 0018:ffff982bc4ddbe50 EFLAGS: 00000202
Jul 21 17:46:00 matrix01 kernel: RAX: 000000000000000e RBX: ffff98413de83618 RCX: ffff9869a6610480
Jul 21 17:46:00 matrix01 kernel: RDX: ffff9869a66104a8 RSI: 0000000000000246 RDI: 0000000000000246
Jul 21 17:46:00 matrix01 kernel: RBP: ffff982bc4ddbe50 R08: ffff9869a66104a8 R09: 0000000000000001
Jul 21 17:46:00 matrix01 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000002a060898f29
Jul 21 17:46:00 matrix01 kernel: R13: 000002cd43092cf2 R14: 00000001800044ca R15: 0000000000001001
Jul 21 17:46:00 matrix01 kernel: FS: 00007fc5eb6c5f80(0000) GS:ffff984abcb40000(0000) knlGS:0000000000000000
Jul 21 17:46:00 matrix01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 21 17:46:00 matrix01 kernel: CR2: 00007ff2544b2100 CR3: 0000002e482de000 CR4: 00000000007607e0
Jul 21 17:46:00 matrix01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 21 17:46:00 matrix01 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 21 17:46:00 matrix01 kernel: PKRU: 55555554
Jul 21 17:46:00 matrix01 kernel: Call Trace:
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb509e62b>] do_wait+0xbb/0x260
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb509f960>] SyS_wait4+0x80/0x110
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb509d3c0>] ? task_stopped_code+0x60/0x60
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb5776ddb>] system_call_fastpath+0x22/0x27

Jul 21 17:46:00 matrix01 kernel: NMI watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [sh:349426]
Jul 21 17:46:00 matrix01 kernel: CPU: 6 PID: 349426 Comm: sh Kdump: loaded Tainted: G L ------------ T 3.10.0-957.27.2.el7.x86_64 #1
Jul 21 17:46:00 matrix01 kernel: Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9S, BIOS 2.00.37 01/16/2020
Jul 21 17:46:00 matrix01 kernel: task: ffff982c56ffb0c0 ti: ffff982beb10c000 task.ti: ffff982beb10c000
Jul 21 17:46:00 matrix01 kernel: RIP: 0010:[<ffffffffb5094e22>] [<ffffffffb5094e22>] tasklist_read_lock+0x22/0x60
Jul 21 17:46:00 matrix01 kernel: RSP: 0018:ffff982beb10fed0 EFLAGS: 00000202
Jul 21 17:46:00 matrix01 kernel: RAX: 000000000000000e RBX: ffff982cc3cb8018 RCX: ffff982beb10ff28
Jul 21 17:46:00 matrix01 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff982c56ffb0c0
Jul 21 17:46:00 matrix01 kernel: RBP: ffff982beb10fed0 R08: 00007ffc4c432a80 R09: 00007ffc4c432930
Jul 21 17:46:00 matrix01 kernel: R10: 00007ffc4c4328a0 R11: 0000000000000202 R12: 0000000000000000
Jul 21 17:46:00 matrix01 kernel: R13: 0000000000000000 R14: 00000000006e0860 R15: 00000000000000e0
Jul 21 17:46:00 matrix01 kernel: FS: 00007f931a2db740(0000) GS:ffff984abcb80000(0000) knlGS:0000000000000000
Jul 21 17:46:00 matrix01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 21 17:46:00 matrix01 kernel: CR2: 00000000006e0860 CR3: 00000013d162e000 CR4: 00000000007607e0
Jul 21 17:46:00 matrix01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 21 17:46:00 matrix01 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 21 17:46:00 matrix01 kernel: PKRU: 55555554
Jul 21 17:46:00 matrix01 kernel: Call Trace:
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb50b623c>] do_prlimit+0x4c/0x1f0
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb50b6419>] SyS_getrlimit+0x39/0x90
Jul 21 17:46:00 matrix01 kernel: [<ffffffffb5139f34>] ? __audit_syscall_entry+0xb4/0x110

2,
We found a difference in how tasklist_lock is taken for reading in the CentOS kernel and the upstream Linux kernel.
In do_wait(), the CentOS kernel calls tasklist_read_lock(), while the upstream kernel calls read_lock() directly.
tasklist_read_lock() contains a while loop that spins until tasklist_waiters drops to zero, that is, until no writers are waiting on tasklist_lock, before it takes the read lock.
What was the reasoning behind using tasklist_read_lock() in the CentOS kernel? Thanks in advance.

void tasklist_read_lock(void)
{
	if (WARN_ON_ONCE(in_interrupt()))
		goto no_wait;
#ifdef CONFIG_LOCKDEP
	if (WARN_ON_ONCE(lockdep_is_held(&tasklist_lock)))
		goto no_wait;
#endif
	while (atomic_read(&tasklist_waiters))	/* <== soft lockup here */
		cpu_relax();
no_wait:
	qread_lock(&tasklist_lock);
}
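
To make the second question concrete, here is a minimal user-space sketch (C11, not kernel code) of the gate that sits in front of qread_lock(). Only the reader-side check is taken from the function quoted above. The writer side, modeled as writer_announce() and writer_acquired(), is an assumption about how tasklist_waiters is raised and lowered; those names, waiters_model, and reader_may_take_lock() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's tasklist_waiters counter (user-space model). */
static atomic_int waiters_model;

/* ASSUMPTION: a writer raises the counter before it starts waiting
 * for the write lock ... */
static void writer_announce(void)
{
	atomic_fetch_add(&waiters_model, 1);
}

/* ... and lowers it again once it holds the lock (also an assumption). */
static void writer_acquired(void)
{
	int before = atomic_fetch_sub(&waiters_model, 1);

	assert(before > 0);	/* every acquire must pair with an announce */
}

/* Reader-side check from tasklist_read_lock(): go on to qread_lock()
 * only when no writer is announced. The kernel loops on this check
 * with cpu_relax() instead of returning. */
static bool reader_may_take_lock(void)
{
	return atomic_load(&waiters_model) == 0;
}
```

In this model a plain read_lock() caller has no such check and competes for the lock even while writers are queued, whereas the gated reader stays out until the count returns to zero. If an announce were ever left unmatched, the gated reader would never get past the check.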

static long do_wait(struct wait_opts *wo)
{
	struct task_struct *tsk;
	int retval;

	trace_sched_process_wait(wo->wo_pid);

	init_waitqueue_func_entry(&wo->child_wait, child_wait_callback);
	wo->child_wait.private = current;
	add_wait_queue(&current->signal->wait_chldexit, &wo->child_wait);
repeat:
	/*
	 * If there is nothing that can match our criteria just get out.
	 * We will clear ->notask_error to zero if we see any child that
	 * might later match our criteria, even if we are not able to reap
	 * it yet.
	 */
	wo->notask_error = -ECHILD;
	if ((wo->wo_type < PIDTYPE_MAX) &&
	    (!wo->wo_pid || hlist_empty(&wo->wo_pid->tasks[wo->wo_type])))
		goto notask;

	set_current_state(TASK_INTERRUPTIBLE);
	tasklist_read_lock();		/* <<<< CentOS kernel code */
	read_lock(&tasklist_lock);	/* <<<< upstream Linux kernel code (in place of the line above) */
	tsk = current;
	do {
		retval = do_wait_thread(wo, tsk);
		if (retval)
			goto end;

		retval = ptrace_do_wait(wo, tsk);
		if (retval)
			goto end;

		if (wo->wo_flags & __WNOTHREAD)
			break;
	} while_each_thread(current, tsk);
	qread_unlock(&tasklist_lock);	/* CentOS; upstream uses read_unlock(&tasklist_lock) */

notask:

Post by **tunk** » 2020/08/24 12:28:19

I don't know this problem, but I think the general advice is to run yum update to get the latest kernel etc.

Post by **TrevorH** » 2020/08/24 12:31:03

7.6 is not supported and is 2 years old. The current version is 7.8 and uses a 3.10.0-1127 series of kernels. There are over 30,000 lines in the kernel rpm changelog since your version.

yum update

Try again.

Post by **t19467** » 2020/08/25 01:56:21

Thanks a lot for your replies.
I have already checked the changelog of the CentOS 7.8 kernel, and there is no fix related to this issue. Because the issue occurs so rarely, it would be hard to confirm whether an upgrade resolves it.

I found the same issue reported to Red Hat, but it was closed as NOTABUG:
https://bugzilla.redhat.com/show_bug.cg ... id=1435334

Post by **lightman47** » 2020/08/25 20:29:29

Perhaps it's not the kernel and getting up to date may help. Maybe not, but that would be MY first move - eliminate 'unknowns'.

"my 2 cents".

Post by **t19467** » 2020/08/26 02:27:02

Thanks for your reply.
It seems there is nothing we can do except upgrade.
Since no vmcore was collected right after the issue happened, I can't analyze it further.
I will go through the code that uses tasklist_lock and check it from the code side.
If anyone has new findings about this issue, please contact me.
Thanks in advance.

Post by **lightman47** » 2020/08/26 10:13:54

Don't upgrade - just update what you have

yum update

Post by **TrevorH** » 2020/08/26 12:08:58

If this is a bug and you want it fixed then you will need to report it to Red Hat on bugzilla.redhat.com but if you aim to do that then you also need to be running the latest kernel first. They will not look at a ticket if it's on an old kernel - their first response will be the same as we have given you here: update to $latest and retry. So that is your first step...

Post by **t19467** » 2020/08/27 07:37:36

I see. Thanks, TrevorH and lightman47, for your kind replies.

Post by **t19467** » 2020/10/29 01:44:34

Recently we found the same issue reported to Red Hat. Red Hat is working on it, but it is not fixed yet:

https://access.redhat.com/solutions/5363931
void tasklist_read_lock(void)
{
	if (WARN_ON_ONCE(in_interrupt()))
		goto no_wait;
#ifdef CONFIG_LOCKDEP
	if (WARN_ON_ONCE(lockdep_is_held(&tasklist_lock)))
		goto no_wait;
#endif
	while (atomic_read(&tasklist_waiters))	/* <--- [1] */
		cpu_relax();
no_wait:
	qread_lock(&tasklist_lock);
}
Here, we are in the tight loop [1] while tasklist_waiters is greater than 0.
crash> pd tasklist_waiters
tasklist_waiters = $7 = {
counter = 110
}
