smartpqi resets the SCSI and disk go offline on CentOS 7.6

Post by **fabrigrion** » 2020/06/25 18:57:08

Hello,

We have an HP DL380 Gen10 server running CentOS 7.6. After about 30 minutes of uptime we lost our logical drive (with all that implies).
Searching for known bugs, we found a similar problem on CentOS 7.4 (https://bugs.centos.org/view.php?id=15801), which was solved by updating the kernel. There is also a similar report for RHEL 7.6 (https://bugzilla.redhat.com/show_bug.cgi?id=1666912), as well as bugs related to the smartpqi driver (https://support.hpe.com/hpesc/public/do ... 71158en_us). Finally, it could be a physical problem with the disks. (FYI, we have several other servers under the same conditions that are working well.)

Running the dmesg command, we see:

[   19.593914] warning: `BackgrProcPool' uses 32-bit capabilities (legacy support in use)
[  363.182434] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[  363.187484] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[  409.445866] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[  409.449850] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[  569.358969] usb 1-1: USB disconnect, device number 2
[  780.619211] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[  780.623117] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[ 1618.332963] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[ 1651.417533] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[ 1651.417633] sd 1:1:0:1: [sdc] Medium access timeout failure. Offlining disk!
[ 1651.417700] sd 1:1:0:1: Device offlined - not ready after error recovery
[ 1651.417715] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[ 1651.417722] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 06 47 6e 0e 00 00 00 01 d8 00 00
[ 1651.417727] blk_update_request: I/O error, dev sdc, sector 26968198656
[ 1651.417792] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417808] sd 1:1:0:1: [sdc] killing request
[ 1651.417833] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417859] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417879] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417900] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417916] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 1651.417923] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 04 a3 a7 7b f0 00 00 00 08 00 00
[ 1651.417930] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417935] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417949] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.417950] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417951] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 03 8d 22 6e 68 00 00 01 50 00 00
[ 1651.417952] blk_update_request: I/O error, dev sdc, sector 15252745832
[ 1651.417959] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.417984] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418002] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418003] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 07 1b ae 6e 00 00 00 00 38 00 00
[ 1651.418004] blk_update_request: I/O error, dev sdc, sector 30529187328
[ 1651.418016] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418034] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418035] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 04 ea 12 af c8 00 00 00 08 00 00
[ 1651.418036] blk_update_request: I/O error, dev sdc, sector 21106962376
[ 1651.418053] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418069] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418072] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 05 75 c6 2e 00 00 00 02 00 00 00
[ 1651.418077] blk_update_request: I/O error, dev sdc, sector 23450758656
[ 1651.418096] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418097] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 00 45 ef 6f f8 00 00 00 08 00 00
[ 1651.418098] blk_update_request: I/O error, dev sdc, sector 1173319672
[ 1651.418104] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418123] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418127] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 03 46 77 0f a0 00 00 00 08 00 00
[ 1651.418128] blk_update_request: I/O error, dev sdc, sector 14067109792
[ 1651.418133] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418153] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418155] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 05 2f 8f 0d d0 00 00 00 68 00 00
[ 1651.418156] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418156] blk_update_request: I/O error, dev sdc, sector 22272740816
[ 1651.418169] sd 1:1:0:1: [sdc] FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
[ 1651.418170] sd 1:1:0:1: [sdc] CDB: Read(16) 88 00 00 00 00 04 e9 b8 cf 68 00 00 00 08 00 00
[ 1651.418170] blk_update_request: I/O error, dev sdc, sector 21101072232
[ 1651.418176] blk_update_request: I/O error, dev sdc, sector 21121355264
[ 1651.418190] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418204] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418243] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418266] XFS (dm-2): metadata I/O error: block 0x45ddb4400 ("xlog_iodone") error 5 numblks 512
[ 1651.418268] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418271] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1221 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffc0962c30
[ 1651.418286] sd 1:1:0:1: rejecting I/O to offline device
[ 1651.418585] XFS (dm-2): Log I/O Error Detected.  Shutting down filesystem
[ 1651.418586] XFS (dm-2): Please umount the filesystem and rectify the problem(s)
[ 1651.418588] XFS (dm-2): metadata I/O error: block 0x45ddb4600 ("xlog_iodone") error 5 numblks 512
[ 1651.418589] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1221 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffc0962c30
[ 1651.418619] XFS (dm-2): metadata I/O error: block 0x6d288c160 ("xfs_trans_read_buf_map") error 5 numblks 8
[ 1651.418645] XFS (dm-2): xfs_do_force_shutdown(0x1) called from line 236 of file fs/xfs/libxfs/xfs_defer.c.  Return address = 0xffffffffc092182b
[ 1651.418664] XFS (dm-2): metadata I/O error: block 0x41c20 ("xfs_trans_read_buf_map") error 5 numblks 32
[ 1651.418687] XFS (dm-2): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[ 1651.418689] XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 3416 of file fs/xfs/xfs_inode.c.  Return address = 0xffffffffc0956ea6
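
One thing worth measuring in a log like the one above is how long each reset took. The following is a minimal sketch that pairs every "resetting scsi" line with the next "reset of scsi" line for the same device and prints the elapsed time. The regular expressions and the `reset_durations` helper are my own illustration, written against the message format shown in this thread; other driver versions may word these messages differently.

```python
import re

# Patterns for the two smartpqi reset messages quoted in this thread.
START = re.compile(r"\[\s*(\d+\.\d+)\] smartpqi \S+ resetting scsi (\S+)")
DONE = re.compile(r"\[\s*(\d+\.\d+)\] smartpqi \S+ reset of scsi (\S+): (\w+)")

def reset_durations(log):
    """Return (device, start_time, duration_seconds, status) for each reset."""
    pending = {}   # device address -> timestamp of the reset still in progress
    results = []
    for line in log.splitlines():
        started = START.search(line)
        finished = DONE.search(line)
        if started:
            pending[started.group(2)] = float(started.group(1))
        elif finished and finished.group(2) in pending:
            begin = pending.pop(finished.group(2))
            end = float(finished.group(1))
            results.append((finished.group(2), begin, end - begin, finished.group(3)))
    return results

# The reset lines copied from the dmesg output above.
sample = """\
[  363.182434] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[  363.187484] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[  409.445866] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[  409.449850] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[  780.619211] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[  780.623117] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
[ 1618.332963] smartpqi 0000:5c:00.0: resetting scsi 1:1:0:1
[ 1651.417533] smartpqi 0000:5c:00.0: reset of scsi 1:1:0:1: SUCCESS
"""

for device, begin, seconds, status in reset_durations(sample):
    print(f"{device} reset at {begin:.1f}s took {seconds:.3f}s ({status})")
```

On the lines above, the first three resets complete within a few milliseconds, while the last one takes about 33 seconds and is immediately followed by the "Medium access timeout failure. Offlining disk!" message.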

Here are our kernel and smartpqi versions:

-bash-4.2$ hostnamectl
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-957.5.1.el7.x86_64
      Architecture: x86-64


-bash-4.2$ modinfo smartpqi
filename:       /lib/modules/3.10.0-957.5.1.el7.x86_64/kernel/drivers/scsi/smartpqi/smartpqi.ko.xz
license:        GPL
version:        1.1.4-115
description:    Driver for Microsemi Smart Family Controller version 1.1.4-115
author:         Microsemi
retpoline:      Y
rhelversion:    7.6

We need help isolating the cause: the kernel version, the smartpqi driver version, a hardware failure, or something else. Can anyone give us some advice?

Thanks in advance
Fabrizio

Post by **TrevorH** » 2020/06/25 19:24:03

CentOS 7.6 is unsupported and has been since the release of 7.7 in August last year. The current and only supported version is 7.8 which ships a 3.10.0-1127 series of kernels. Please upgrade and test the most recent kernel-3.10.0-1127.13.1.el7.x86_64 and see if it has the same problems.
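
When checking many servers for an older kernel, note that release strings do not sort correctly as plain text, because "957" compares greater than "1127" character by character. Below is a minimal sketch of a numeric comparison. The `release_key` helper is hypothetical and deliberately simple: it only handles strings shaped like the two in this thread and is not rpm's own version comparison algorithm.

```python
import re

def release_key(kernel):
    """Turn '3.10.0-957.5.1.el7.x86_64' into (3, 10, 0, 957, 5, 1)."""
    # Keep only the part before the distribution tag (".el7...").
    numeric_part = kernel.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))

running = "3.10.0-957.5.1.el7.x86_64"      # from the hostnamectl output above
suggested = "3.10.0-1127.13.1.el7.x86_64"  # the kernel suggested in this reply

# Numeric comparison: the running kernel is older than the suggested one.
print(release_key(running) < release_key(suggested))

# Plain string comparison gets this wrong, because '9' sorts after '1'.
print(running < suggested)
```

For anything beyond a quick check, the package manager's own comparison is the safer reference.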

From rpm -q --changelog kernel-3.10.0-1127.13.1.el7.x86_64 I see:

- [scsi] scsi: smartpqi: change TMF timeout from 60 to 30 seconds (Don Brace) [1709620]
- [scsi] scsi: smartpqi: fix LUN reset when fw bkgnd thread is hung (Don Brace) [1709620]
- [scsi] scsi: smartpqi: add inquiry timeouts (Don Brace) [1709620]
- [scsi] scsi: smartpqi: increase LUN reset timeout (Don Brace) [1709620]
- [scsi] scsi: smartpqi_init: fix boolean expression in pqi_device_remove_start (Don Brace) [1678479]
- [scsi] smartpqi: correct nr_hw_queues (Don Brace) [1641112]
- [scsi] smartpqi: call pqi_free_interrupts() in pqi_shutdown() (Don Brace) [1641112]
- [scsi] smartpqi: fix build warnings (Don Brace) [1641112]
- [scsi] smartpqi: update driver version (Don Brace) [1641112]
- [scsi] smartpqi: add ofa support (Don Brace) [1641112]
- [scsi] smartpqi: increase fw status register read timeout (Don Brace) [1641112]
- [scsi] smartpqi: bump driver version (Don Brace) [1641112]
- [scsi] smartpqi: add smp_utils support (Don Brace) [1641112]
- [scsi] smartpqi: correct lun reset issues (Don Brace) [1641112]
- [scsi] smartpqi: correct volume status (Don Brace) [1641112]
- [scsi] smartpqi: do not offline disks for transient did no connect conditions (Don Brace) [1641112]
- [scsi] smartpqi: allow for larger raid maps (Don Brace) [1641112]
- [scsi] smartpqi: check for null device pointers (Don Brace) [1641112]
- [scsi] smartpqi: add support for huawei controllers (Don Brace) [1641112]
- [scsi] smartpqi: enhance numa node detection (Don Brace) [1641112]
- [scsi] smartpqi: wake up drives after os resumes from suspend (Don Brace) [1641112]
- [scsi] smartpqi: fix disk name mount point (Don Brace) [1641112]
- [scsi] smartpqi: add h3c ssid (Don Brace) [1641112]
- [scsi] smartpqi: add sysfs attributes (Don Brace) [1641112]
- [scsi] smartpqi: refactor sending controller raid requests (Don Brace) [1641112]
- [scsi] smartpqi: turn off lun data caching for ptraid (Don Brace) [1641112]
- [scsi] smartpqi: correct host serial num for ssa (Don Brace) [1641112]
- [scsi] smartpqi: add no_write_same for logical volumes (Don Brace) [1641112]
- [scsi] smartpqi: Add retries for device reset (Don Brace) [1641112]
- [scsi] smartpqi: add support for PQI Config Table handshake (Don Brace) [1641112]
- [scsi] smartpqi: fully convert to the generic DMA API (Don Brace) [1641112]
- [scsi] smartpqi: bump driver version to 1.1.4-130 (Don Brace) [1641112]
- [scsi] smartpqi: add inspur advantech ids (Don Brace) [1641112]
- [scsi] smartpqi: improve error checking for sync requests (Don Brace) [1641112]
- [scsi] smartpqi: improve handling for sync requests (Don Brace) [1641112]
- [scsi] smartpqi: cleanup interrupt management (Don Brace) [1641112]
- [scsi] smartpqi: switch to pci_alloc_irq_vectors (Don Brace) [1641112]

All those fixes are in kernels later than the one you are running.
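
To narrow a long changelog down to the entries that look related to the symptoms in this thread (resets, timeouts, disks going offline), a small keyword filter can help. This is only a sketch: the keyword list and the `relevant_entries` helper are my own assumptions, and a keyword match does not prove that an entry fixes this particular problem.

```python
# Hypothetical keyword list, chosen from the symptoms in the dmesg output.
KEYWORDS = ("reset", "timeout", "offline", "did no connect")

def relevant_entries(changelog, keywords=KEYWORDS):
    """Return smartpqi changelog lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in changelog.splitlines():
        lowered = line.lower()
        if "smartpqi" in lowered and any(k in lowered for k in keywords):
            hits.append(line.strip())
    return hits

# A few entries copied from the changelog quoted above.
sample = """\
- [scsi] scsi: smartpqi: increase LUN reset timeout (Don Brace) [1709620]
- [scsi] smartpqi: add ofa support (Don Brace) [1641112]
- [scsi] smartpqi: do not offline disks for transient did no connect conditions (Don Brace) [1641112]
- [scsi] smartpqi: add h3c ssid (Don Brace) [1641112]
- [scsi] smartpqi: Add retries for device reset (Don Brace) [1641112]
"""

for entry in relevant_entries(sample):
    print(entry)
```

The same function can be fed the full output of the rpm changelog query instead of the short sample.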

Post by **fabrigrion** » 2020/07/03 06:17:24

Thanks for your quick reply, TrevorH.

We are going to look into upgrading every server, but I want to let you know that, for now, the problem was solved by replacing a physically damaged disk that had caused a failure in the logical drive. The odd thing is that we are using RAID 5, so the failure of a single disk should have been tolerated.
Maybe there were two problems, one physical and one caused by out-of-date firmware, and the first exposed the second.

Thanks again
Greetings
Fabrizio

Post by **z900collector** » 2021/02/16 06:00:52

I've tested this on CentOS 7.9 and it still occurs on a brand-new install, on brand-new kit, with updates applied:

[ 9.237928] smartpqi 0000:03:00.0: added 1:0:-:- 500c0ff052365f3e Enclosure HPE D8000 AIO-
[ 9.255756] smartpqi 0000:03:00.0: added 1:0:-:- 500c0ff05235f63e Enclosure HPE D8000 AIO-
[ 9.272723] smartpqi 0000:03:00.0: added 1:0:-:- 51402ec01381ba78 Enclosure HPE Smart Adapter AIO-
[ 9.284758] smartpqi 0000:03:00.0: added 1:1:0:0 4000000000000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 9.301730] smartpqi 0000:03:00.0: added 1:1:0:1 4000000100000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 9.314701] smartpqi 0000:03:00.0: added 1:1:0:2 4000000200000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 9.326709] smartpqi 0000:03:00.0: added 1:1:0:3 4000000300000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 9.338698] smartpqi 0000:03:00.0: added 1:1:0:4 4000000400000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 9.350694] smartpqi 0000:03:00.0: added 1:2:0:0 0000000000000000 RAID HPE P408e-p SR Gen10
[ 3844.407462] smartpqi 0000:03:00.0: removed 1:1:0:0 4000000000000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 3844.430181] smartpqi 0000:03:00.0: removed 1:1:0:1 4000000100000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 3844.443173] smartpqi 0000:03:00.0: removed 1:1:0:2 4000000200000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 3844.461173] smartpqi 0000:03:00.0: removed 1:1:0:3 4000000300000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5
[ 3844.482174] smartpqi 0000:03:00.0: removed 1:1:0:4 4000000400000000 Direct-Access HPE LOGICAL VOLUME SSDSmartPathCap- En- RAID-5

Sid

Post by **TrevorH** » 2021/02/16 11:41:54

Those are just info messages about what's attached to the controller. Those are not errors.
