NFS induced server crash

Post by **Dylan.Brochet** » 2021/03/10 08:35:33

Hi,

I need your help: I'm facing sudden, very high load averages on an NFS server.
The server runs CentOS 7.6.1810, kernel version 3.10.0-957.el7.x86_64.

At random times the server's load average climbs very quickly, in less than a minute, so it becomes impossible to connect to the server to check what is happening.
I have to restart the machine to restore service.

[Attachment: Graph load average.PNG (13.68 KiB)]

The data, transferred from another remote server, is stored on a RAID 6 BTRFS volume.

We are currently transferring large amounts of data (approximately 10 to 15 TB per day) from multiple remote servers.

We found these error messages in /var/log/messages before the crash:

Mar 9 19:32:44 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [nfsd:22366]
Mar 9 19:32:44 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:22365]
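
A minimal sketch, not part of the original post, of how such soft lockup messages could be tallied per process name, which helps confirm that only nfsd threads are affected. The function name `count_soft_lockups` is hypothetical, and the default log path assumes the stock CentOS 7 syslog location.

```shell
#!/bin/sh
# Count kernel "soft lockup" messages per stuck process name.
# count_soft_lockups is a hypothetical helper; the default path is an
# assumption (CentOS 7 default) and can be overridden by the first argument.
count_soft_lockups() {
    awk '/soft lockup/ {
        # The last field looks like [nfsd:22366]; keep only the name.
        name = $NF
        sub(/^\[/, "", name)
        sub(/:[0-9]+\]$/, "", name)
        count[name]++
    }
    END { for (n in count) print count[n], n }' "${1:-/var/log/messages}"
}
```

Calling `count_soft_lockups /var/log/messages` prints one line per process name with the number of lockup messages in front. For a file containing only the two lines quoted above it prints `2 nfsd`.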

The data is transferred over an NFS mount point between multiple remote servers.
The server shares its entire BTRFS volume with the remote servers using these options:

cat /etc/exports
/btrfsvolume distantIP(rw,no_root_squash,no_subtree_check)

It seems to be this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1095436 but I want to be sure before upgrading or changing the OS.

Thanks for your help

Post by **TrevorH** » 2021/03/10 08:46:34

> The server is Centos 7.6.1810 kernel version 3.10.0-957.el7.x86_64

7.6 is 3 years out of date. Run yum update to get current. There are numerous fixes in the latest kernels that you do not have - about 12,000 of them according to rpm -q --changelog kernel-3.10.0-1160.15.2.el7.x86_64 | less and several of those are for NFS.

[root@centos7 ~]# rpm -q --changelog kernel-3.10.0-1160.15.2.el7.x86_64 | head -12436 | grep -c nfs
63

Post by **Dylan.Brochet** » 2021/03/11 11:03:23

Hi,

Thanks for your reply.
We just updated the kernel. We will see in the next few days if the problem is still present.

Post by **BShT** » 2021/03/11 20:25:49

Look at your disks, your storage, and your storage connection.

Post by **Whoever** » 2021/03/12 03:02:39

Do you have sysstat installed?

Post by **Dylan.Brochet** » 2021/03/16 10:46:45

Hi,

The kernel update fixed the crash. We are now running kernel 5.11.5-1.el7.elrepo.x86_64.
However, we still get stalls lasting 3-4 minutes. This is what we see in /var/log/messages:

Mar 16 00:07:54 rpc.mountd[20792]: authenticated mount request from 10.98.155.20:675 for /backup3 (/backup3)
Mar 16 00:34:10 rpc.mountd[20792]: authenticated mount request from 10.98.155.20:826 for /backup3 (/backup3)
Mar 16 00:34:27 rpc.mountd[20792]: authenticated mount request from 10.98.155.11:790 for /backup3 (/backup3)
Mar 16 00:34:29 rpc.mountd[20792]: authenticated mount request from 10.98.155.17:896 for /backup3 (/backup3)
Mar 16 00:36:33 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:889 for /backup3 (/backup3)
Mar 16 00:36:46 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:703 for /backup3 (/backup3)
Mar 16 00:54:05 rpc.mountd[20792]: authenticated mount request from 10.98.155.16:949 for /backup3 (/backup3)
Mar 16 00:59:13 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:973 for /backup3 (/backup3)
Mar 16 00:59:17 rpc.mountd[20792]: authenticated mount request from 10.98.155.11:677 for /backup3 (/backup3)
Mar 16 00:59:17 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:825 for /backup3 (/backup3)
Mar 16 00:59:30 rpc.mountd[20792]: authenticated mount request from 10.98.155.22:730 for /backup3 (/backup3)
Mar 16 01:04:16 rpc.mountd[20792]: authenticated mount request from 10.98.155.22:847 for /backup3 (/backup3)
Mar 16 01:17:01 rpc.mountd[20792]: authenticated mount request from 10.98.155.15:877 for /backup3 (/backup3)
Mar 16 01:35:28 rpc.mountd[20792]: authenticated mount request from 10.98.155.20:753 for /backup3 (/backup3)

Yes, we are using sysstat. This is what we see in iotop:

[Attachment: iotop forum centos.PNG (26.49 KiB)]

ps aux shows that all the NFS daemons are stuck:

[Attachment: ps aux forum centos.PNG (11.56 KiB)]
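
Since the screenshot is not searchable, here is a rough sketch of how the same information could be captured as text. It assumes that "stuck" means the nfsd threads sit in uninterruptible sleep (state D), which the thread does not confirm. The function name `list_blocked` is hypothetical; it filters output in the format produced by `ps -eo state,pid,comm`.

```shell
#!/bin/sh
# list_blocked (hypothetical helper) reads `ps -eo state,pid,comm` output on
# stdin and prints the PID and command of every process whose state starts
# with D (uninterruptible sleep, usually waiting on I/O).
# Assumption: the "stuck" daemons in the post are in state D.
list_blocked() {
    # NR > 1 skips the header line printed by ps.
    awk 'NR > 1 && $1 ~ /^D/ { print $2, $3 }'
}
```

It can be run during a stall as `ps -eo state,pid,comm | list_blocked`. If the nfsd threads show up in the output, that points toward the storage layer rather than the network.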
