nfsd sockets dying - kernel: rpc-srv/tcp: nfsd: got error -104 shutting down socket

wuffy68
Posts: 2
Joined: 2008/04/12 03:15:35

nfsd sockets dying - kernel: rpc-srv/tcp: nfsd: got error -104 shutting down socket

Post by wuffy68 » 2008/04/12 03:46:28

Hi there CentOS community ...

I've got CentOS 4.3
2.6.12-5smp kernel from SourceForge
4 CPUs, i386
two e-ports.

Occasionally (about every other day) we see this message:

rpc-srv/tcp: nfsd: got error -104 when sending 132 bytes - shutting down socket

Sometimes it is accompanied by a kernel Oops (shown below). The system seems to recover, but with each subsequent "shutting down socket" message I seem to lose an nfsd thread. If I start with 16 nfsd threads and one is shut down due to error -104, I can never bring all 16 up again until I reboot. I continue to run with 15, 14, 13, 12, 11 nfsd threads, and so on; when I reach 0, the system sometimes hangs or my NFS share (for obvious reasons) becomes unresponsive to its hosts. I have found very little on this, but we seem to be hitting it with AIX 5.3 hosts running Oracle DB backups to the share.
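To track how often this happens per server, the message can be pulled out of syslog and counted. The following is only a minimal sketch: the function name `count_socket_shutdowns` and the regular expression are illustrative assumptions, and the errno table is filled in by hand with the Linux value (104 is ECONNRESET, i.e. the peer reset the connection) rather than read from the running system.

```python
import re
from collections import Counter

# Linux errno value for the code seen in the log; listed by hand so the
# sketch does not depend on the platform it runs on.
LINUX_ERRNO_NAMES = {104: "ECONNRESET"}

# Matches the sunrpc message quoted above as it appears in syslog.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"rpc-srv/tcp: nfsd: got error (?P<err>-?\d+) "
    r"when sending (?P<size>\d+) bytes - shutting down socket"
)

def count_socket_shutdowns(lines):
    """Count 'shutting down socket' messages per (host, errno name)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.match(line)
        if match:
            code = abs(int(match.group("err")))  # kernel logs the negated errno
            name = LINUX_ERRNO_NAMES.get(code, f"errno {code}")
            counts[(match.group("host"), name)] += 1
    return counts

sample = [
    "Apr 8 04:10:30 Colorado001 kernel: rpc-srv/tcp: nfsd: got error -104 "
    "when sending 132 bytes - shutting down socket",
    "Apr 8 04:12:48 Colorado001 kernel: Oops: 0000 [#1]",
]
print(count_socket_shutdowns(sample))
```

Feeding it the lines of the server's kernel log instead of `sample` gives a per-host tally that can be compared against the number of nfsd threads still running.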

CentOS server: export options are: /Q/shares/xx30_NAS002 *(sync,rw,fsid=10001)

AIX host (up to 14 total host systems): mount options are: 10.2xx.2xx.23 /Q/shares/xx30_NAS002 /mb_prod nfs3 Apr 09 09:14 rw,soft,intr

Does anyone know if this issue has been addressed in more recent 2.6.xx kernels? BTW, it always seems to Oops on CPU 0. We have another independent system which also experiences this, so a bad CPU is unlikely.

Kernel Oops:

Apr 8 04:10:30 Colorado001 kernel: rpc-srv/tcp: nfsd: got error -104 when sending 132 bytes - shutting down socket
Apr 8 04:12:48 Colorado001 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000028
Apr 8 04:12:48 Colorado001 kernel: printing eip:
Apr 8 04:12:48 Colorado001 kernel: f09631ce
Apr 8 04:12:48 Colorado001 kernel: *pde = 6fbc5001
Apr 8 04:12:48 Colorado001 kernel: *pte = 00000000
Apr 8 04:12:48 Colorado001 kernel: Oops: 0000 [#1]
Apr 8 04:12:48 Colorado001 kernel: SMP
Apr 8 04:12:48 Colorado001 kernel: Modules linked in: unh_iscsi_target unh_scsi_target crmiscsi target vdisk trace qbfs nfsd lockd bmlite fcca vtl_mem_alloc cvfs_appliance sunrpc e1000
Apr 8 04:12:48 Colorado001 kernel: CPU: 0
Apr 8 04:12:48 Colorado001 kernel: EIP: 0060:[] Tainted: P VLI
Apr 8 04:12:48 Colorado001 kernel: EFLAGS: 00010287 (2.6.12-5smp)
Apr 8 04:12:48 Colorado001 kernel: EIP is at fh_verify+0x210/0x5c7 [nfsd]
Apr 8 04:12:48 Colorado001 kernel: eax: 00000000 ebx: e922b700 ecx: 00000007 edx: a09faa80
Apr 8 04:12:49 Colorado001 kernel: esi: e72a099c edi: 11270000 ebp: eed10804 esp: ed0b3ee8
Apr 8 04:12:49 Colorado001 kernel: ds: 007b es: 007b ss: 0068
Apr 8 04:12:49 Colorado001 kernel: Process nfsd (pid: 8573, threadinfo=ed0b2000 task=ed143520)
Apr 8 04:12:49 Colorado001 kernel: Stack: eef95a00 eed10810 00000004 00000002 f0962ed0 e922b700 00000004 eed10810
Apr 8 04:12:49 Colorado001 kernel: f08bb4e0 efb4e800 e960bcc0 ed0b3f45 edcc9000 eed10804 eed10800 e6794018
Apr 8 04:12:49 Colorado001 kernel: f096c2a1 efb4e800 eed10804 00000000 00000000 efb4e800 f0975984 e679401c
Apr 8 04:12:49 Colorado001 kernel: Call Trace:
Apr 8 04:12:49 Colorado001 kernel: [] nfsd_acceptable+0x0/0xee [nfsd]
Apr 8 04:12:49 Colorado001 kernel: [] nfsd3_proc_getattr+0x7e/0xa4 [nfsd]
Apr 8 04:12:49 Colorado001 kernel: [] nfsd_dispatch+0x99/0x222 [nfsd]
Apr 8 04:12:49 Colorado001 kernel: [] svc_process+0x5af/0x66c [sunrpc]
Apr 8 04:12:49 Colorado001 kernel: [] nfsd+0x1b0/0x335 [nfsd]
Apr 8 04:12:49 Colorado001 kernel: [] nfsd+0x0/0x335 [nfsd]
Apr 8 04:12:49 Colorado001 kernel: [] kernel_thread_helper+0x5/0xb
Apr 8 04:12:49 Colorado001 kernel: Code: 8b 43 1c 0f 44 15 68 68 3b 80 8b 40 14 89 04 24 ff d2 89 c6 85 f6 0f 84 cc 00 00 00 81 fe 18 fc ff ff 0f 87 9a 03 00 00 8b 46 0c b7 40 28 25 00 f0 00 00 3d 00 40 00 00 0f 84 8f 02 00 00 89

I know it's obscure, but thanks for any input you may have.

wuffy68

toracat
Forum Moderator
Posts: 7453
Joined: 2006/09/03 16:37:24
Location: California, US

nfsd sockets dying - kernel: rpc-srv/tcp: nfsd: got error -1

Post by toracat » 2008/04/12 11:58:40

You are not running a CentOS kernel. Do you get the same issue if you install the 2.6.9 kernel CentOS ships?

wuffy68
Posts: 2
Joined: 2008/04/12 03:15:35

Re: nfsd sockets dying - kernel: rpc-srv/tcp: nfsd: got error -104 shutting down socket

Post by wuffy68 » 2008/04/12 16:04:11

The kernel swap was not my choice. I know it was originally needed for some of its features (actually, one of the reasons was enhanced NFS stability :-).

Putting the 2.6.9 kernel on the remote sites where it's happening is an impossibility at this stage...

Unfortunately I can't reproduce the problem "exactly" on a local, identically configured machine (probably because I'm still scrounging for an AIX system to try it from). I can cause the sockets to shut down, but the Oops occurs close by, in a different place (see the local Oops below):
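For anyone without an AIX client who wants to see what error -104 means at the socket level: 104 is ECONNRESET on Linux, which the receiving side gets when its peer aborts a TCP connection with a reset instead of closing it normally. The sketch below shows one generic way to provoke that on loopback with plain sockets. It is not necessarily how the socket shutdowns were triggered here, it does not talk NFS or RPC, and the function name `provoke_connection_reset` is made up for illustration.

```python
import errno
import socket
import struct

def provoke_connection_reset():
    """Return the errno the accepting side sees after its peer aborts the connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    conn, _ = listener.accept()
    conn.settimeout(5)  # do not hang forever if no reset arrives
    try:
        # SO_LINGER enabled with a zero timeout makes close() abort the
        # connection (TCP RST) instead of doing the normal FIN handshake.
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.pack("ii", 1, 0))
        client.close()
        try:
            conn.recv(1)
        except OSError as exc:
            return exc.errno
        return None  # peer closed cleanly; no reset was observed
    finally:
        conn.close()
        listener.close()

if __name__ == "__main__":
    code = provoke_connection_reset()
    print(code, errno.errorcode.get(code))
```

The kernel's nfsd logs the same condition as a negative value (-104) because in-kernel functions return negated errno codes.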

If I change kernels, I'll probably have to try going forward before I go backward, given the scheduling pressure.

Thanks,

wuffy68

Local Oops:

Apr 9 20:49:51 javelin kernel: ------------[ cut here ]------------
Apr 9 20:49:51 javelin kernel: kernel BUG at :41705!
Apr 9 20:49:51 javelin kernel: invalid operand: 0000 [#1]
Apr 9 20:49:51 javelin kernel: SMP
Apr 9 20:49:51 javelin kernel: Modules linked in: unh_iscsi_target unh_scsi_target crmiscsi target vdisk trace qbfs nfsd lockd bmlite fcca vtl_mem_alloc cvfs_appliance sunrpc e1000
Apr 9 20:49:51 javelin kernel: CPU: 0
Apr 9 20:49:51 javelin kernel: EIP: 0060:[] Tainted: P VLI
Apr 9 20:49:51 javelin kernel: EFLAGS: 00010287 (2.6.12-5smp)
Apr 9 20:49:51 javelin kernel: EIP is at vfs_unlink+0x181/0x188
Apr 9 20:49:51 javelin kernel: eax: ddf33280 ebx: dde05398 ecx: ddf33280 edx: fffffffe
Apr 9 20:49:51 javelin kernel: esi: dde05398 edi: e2ebc8f0 ebp: ffffc000 esp: ee6c1e9c
Apr 9 20:49:51 javelin kernel: ds: 007b es: 007b ss: 0068
Apr 9 20:49:51 javelin kernel: Process nfsd (pid: 5047, threadinfo=ee6c0000 task=ee624520)
Apr 9 20:49:51 javelin kernel: Stack: e2ebc8f0 00000000 dde05398 dde05398 dde05168 e2ebc8f0 f0876ca3 e2ebc8f0
Apr 9 20:49:51 javelin kernel: dde05398 00000000 dde05168 ddf33280 fffffff0 dde05168 e09093f0 ffffc000
Apr 9 20:49:51 javelin kernel: 8016584b e09093f0 dde05168 00000000 dde05168 e3b5199c ee625c04 f0967543
Apr 9 20:49:51 javelin kernel: Call Trace:
Apr 9 20:49:51 javelin kernel: [] qbfs_unlink+0xb6/0x18f [qbfs]
Apr 9 20:49:51 javelin kernel: [] vfs_unlink+0x151/0x188
Apr 9 20:49:51 javelin kernel: [] nfsd_unlink+0x1a8/0x282 [nfsd]
Apr 9 20:49:51 javelin kernel: [] nfsd3_proc_remove+0x8a/0xc4 [nfsd]
Apr 9 20:49:51 javelin kernel: [] nfsd_dispatch+0x99/0x222 [nfsd]
Apr 9 20:49:51 javelin kernel: [] svc_process+0x5af/0x66c [sunrpc]
Apr 9 20:49:51 javelin kernel: [] nfsd+0x1b0/0x335 [nfsd]
Apr 9 20:49:51 javelin kernel: [] nfsd+0x0/0x335 [nfsd]
Apr 9 20:49:51 javelin kernel: [] kernel_thread_helper+0x5/0xb
Apr 9 20:49:51 javelin kernel: Code: ff ff ff ff eb 89 89 34 24 e8 9d 7b 00 00 f6 87 1c 01 00 00 08 74 cb c7 44 24 04 08 00 00 00 89 3c 24 e8 f8 c6 01 00 89 d8 eb b9 0b e9 a2 fe ff ff 55 57 31 ff 56 53 83 ec 48 8b 44 24 5c 89
