NFS induced server crash

Dylan.Brochet
Posts: 11
Joined: 2020/01/20 13:36:26

Post by Dylan.Brochet » 2021/03/10 08:35:33

Hi,

I need your help: I'm seeing sudden, very high load averages on an NFS server.
The server runs CentOS 7.6.1810 with kernel version 3.10.0-957.el7.x86_64.

At random times the load average of the server climbs very quickly, in less than a minute, so it is impossible to connect to the server to check what is happening.
I have to restart the machine to restore the service.
[Attachment: Graph load average.PNG]
The data is stored on a RAID 6 Btrfs volume and is transferred from other remote servers.

We are currently transferring large amounts of data (approximately 10 to 15 TB per day) from multiple remote servers.

We found the following error messages in /var/log/messages before the crash:
Mar 9 19:32:44 kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [nfsd:22366]
Mar 9 19:32:44 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:22365]
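To see at a glance which processes the watchdog is complaining about, the messages can be tallied per process name. This is only a rough sketch: the function name count_soft_lockups is made up for illustration, and it assumes the log lines end with [name:pid] as in the two lines above.

```shell
#!/bin/sh
# Tally "soft lockup" watchdog messages by the process named in [name:pid].
# Reads log text on stdin and prints "<process> <count>" per process.
count_soft_lockups() {
    grep 'BUG: soft lockup' |
        # keep only the process name from the trailing [name:pid] tag
        sed -n 's/.*\[\(.*\):[0-9][0-9]*\].*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage on the server (needs read access to the log):
# count_soft_lockups < /var/log/messages
```

If every hit is nfsd, as in the excerpt above, that at least narrows the problem to the NFS server threads rather than some unrelated process.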
The data is transferred over an NFS mount point between multiple remote servers.
The server shares its entire Btrfs volume with the remote servers using these options:
cat /etc/exports
/btrfsvolume distantIP(rw,no_root_squash,no_subtree_check)
This looks like the bug at https://bugzilla.redhat.com/show_bug.cgi?id=1095436, but I want to be sure before upgrading or changing the OS.

Thanks for your help

TrevorH
Site Admin
Posts: 33202
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: NFS induced server crash

Post by TrevorH » 2021/03/10 08:46:34

"The server is Centos 7.6.1810 kernel version 3.10.0-957.el7.x86_64"

7.6 is 3 years out of date. Run yum update to get current. The latest kernels contain numerous fixes that you do not have: about 12,000 of them according to rpm -q --changelog kernel-3.10.0-1160.15.2.el7.x86_64 | less, and several of those are for NFS.

[root@centos7 ~]# rpm -q --changelog kernel-3.10.0-1160.15.2.el7.x86_64 | head -12436 | grep -c nfs
63
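Note that grep -c nfs is case-sensitive and counts matching lines, so changelog entries written as "NFS" or "NFSv4" are not included in that figure. A case-insensitive variant could look like the sketch below; the function name count_changelog_matches is only illustrative.

```shell
#!/bin/sh
# Count changelog lines that mention a keyword, ignoring case.
# Reads the changelog on stdin; the keyword is the first argument.
count_changelog_matches() {
    grep -ci -- "$1"
}

# Usage on the server:
# rpm -q --changelog kernel-3.10.0-1160.15.2.el7.x86_64 | count_changelog_matches nfs
#
# To compare the running kernel with the installed ones:
# uname -r
# rpm -q --last kernel
```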

Dylan.Brochet
Posts: 11
Joined: 2020/01/20 13:36:26

Re: NFS induced server crash

Post by Dylan.Brochet » 2021/03/11 11:03:23

Hi,

Thanks for your reply.
We just updated the kernel. We will see in the next few days if the problem is still present.

BShT
Posts: 584
Joined: 2019/10/09 12:31:40

Re: NFS induced server crash

Post by BShT » 2021/03/11 20:25:49

Look at your disks, your storage, and your storage connection.

Whoever
Posts: 1357
Joined: 2013/09/06 03:12:10

Re: NFS induced server crash

Post by Whoever » 2021/03/12 03:02:39

Do you have sysstat installed?
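If it is installed, sar can replay the run queue and load average history from just before a crash, which helps when the box is unreachable at the time. A minimal sketch, assuming the CentOS 7 default data directory /var/log/sa has not been changed; the function names are made up for this example.

```shell
#!/bin/sh
# Build the path of the sysstat daily data file for a day of the month.
# /var/log/sa/saDD is assumed to be the data location (CentOS 7 default).
sa_file_for_day() {
    # strip one leading zero so "09" is not parsed as an octal number
    printf '/var/log/sa/sa%02d\n' "${1#0}"
}

# Print run queue and load average history for a time window.
# Arguments: day of month, start time, end time (HH:MM:SS).
load_history() {
    if ! command -v sar >/dev/null 2>&1; then
        echo "sar not found; install the sysstat package first" >&2
        return 1
    fi
    sar -q -f "$(sa_file_for_day "$1")" -s "$2" -e "$3"
}

# Example: the window around the soft lockups logged on 9 March
# load_history 9 19:00:00 19:40:00
```

Swapping -q for -d (block devices) or -u (CPU, including iowait) in the same call gives the disk and CPU view of the same window.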

Dylan.Brochet
Posts: 11
Joined: 2020/01/20 13:36:26

Re: NFS induced server crash

Post by Dylan.Brochet » 2021/03/16 10:46:45

Hi,

We have fixed the crash with a kernel update. We are now on kernel 5.11.5-1.el7.elrepo.x86_64.
But we still get some interruptions (3 to 4 minutes each). This is what we see in /var/log/messages:
Mar 16 00:07:54 rpc.mountd[20792]: authenticated mount request from 10.98.155.20:675 for /backup3 (/backup3)
Mar 16 00:34:10 rpc.mountd[20792]: authenticated mount request from 10.98.155.20:826 for /backup3 (/backup3)
Mar 16 00:34:27 rpc.mountd[20792]: authenticated mount request from 10.98.155.11:790 for /backup3 (/backup3)
Mar 16 00:34:29 rpc.mountd[20792]: authenticated mount request from 10.98.155.17:896 for /backup3 (/backup3)
Mar 16 00:36:33 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:889 for /backup3 (/backup3)
Mar 16 00:36:46 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:703 for /backup3 (/backup3)
Mar 16 00:54:05 rpc.mountd[20792]: authenticated mount request from 10.98.155.16:949 for /backup3 (/backup3)
Mar 16 00:59:13 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:973 for /backup3 (/backup3)
Mar 16 00:59:17 rpc.mountd[20792]: authenticated mount request from 10.98.155.11:677 for /backup3 (/backup3)
Mar 16 00:59:17 rpc.mountd[20792]: authenticated mount request from 10.98.155.18:825 for /backup3 (/backup3)
Mar 16 00:59:30 rpc.mountd[20792]: authenticated mount request from 10.98.155.22:730 for /backup3 (/backup3)
Mar 16 01:04:16 rpc.mountd[20792]: authenticated mount request from 10.98.155.22:847 for /backup3 (/backup3)
Mar 16 01:17:01 rpc.mountd[20792]: authenticated mount request from 10.98.155.15:877 for /backup3 (/backup3)
Mar 16 01:35:28 rpc.mountd[20792]: authenticated mount request from 10.98.155.20:753 for /backup3 (/backup3)
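Repeated mount requests like these can be counted per client to see which machines remount most often. The following is a sketch only: mounts_per_client is a made-up name, and it assumes the rpc.mountd line format shown above, with the client given as ip:port after the word "from".

```shell
#!/bin/sh
# Count rpc.mountd "authenticated mount request" lines per client IP.
# Reads log text on stdin and prints "<count> <ip>", busiest client first.
mounts_per_client() {
    awk '/authenticated mount request from/ {
             for (i = 1; i < NF; i++)
                 if ($i == "from") {
                     split($(i + 1), addr, ":")   # drop the source port
                     count[addr[1]]++
                 }
         }
         END { for (ip in count) print count[ip], ip }' |
        sort -k1,1nr -k2,2
}

# Usage on the server:
# mounts_per_client < /var/log/messages
```

In the 14 lines above, 10.98.155.18 accounts for 4 requests and 10.98.155.20 for 3, so those two clients would be the first to check for remounts that line up with the interruptions.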
Yes, we are using sysstat. This is what we see in iotop:
[Attachment: iotop forum centos.PNG]
In ps aux we see that all the NFS daemons are stuck:
[Attachment: ps aux forum centos.PNG]
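Since the screenshot is not searchable, a text version may be easier to share. Assuming "stuck" here means the nfsd threads sit in uninterruptible sleep (state D), a sketch like this lists them together with the kernel function they are waiting in; blocked_nfsd is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only nfsd threads in uninterruptible sleep (state D) from ps output.
# Expects the column layout produced by: ps -eo pid,stat,wchan:32,comm
blocked_nfsd() {
    awk '$2 ~ /^D/ && $NF == "nfsd"'
}

# Usage on the server during an interruption:
# ps -eo pid,stat,wchan:32,comm | blocked_nfsd
```

The wchan column is the interesting part: it hints at whether the threads are waiting on the filesystem, the block layer, or something else.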
