Hardware is failing - need advice

Issues related to hardware problems
may24
Posts: 30
Joined: 2014/10/13 15:35:36

Hardware is failing - need advice

Post by may24 » 2014/10/13 15:54:43

Hi all,

I built a custom storage system, basically a NAS.
I put a SATA-3 controller with a Marvell 88SE9230 chip on my E450 board and attached 4x 4TB Seagate drives.
Setup with mdadm went fine: I created a software RAID-5 across the four disks, no LVM, with a single large ext4 partition.
At first everything was fine. But then some strange error messages appeared in /var/log/messages ... I checked a few forums, and a few people pointed to possible firmware/driver problems, as I was (still) running CentOS 5.6 (with the 2.6 kernel).
So I upgraded to CentOS 7.0.
Again everything ran OK ... no errors at all for three weeks ... but then this appeared:

Oct  5 03:29:03 data-server kernel: ata7.00: exception Emask 0x0 SAct 0xc000000 SErr 0x0 action 0x6
Oct  5 03:29:03 data-server kernel: ata7.00: irq_stat 0x40000008
Oct  5 03:29:03 data-server kernel: ata7.00: failed command: READ FPDMA QUEUED
Oct  5 03:29:03 data-server kernel: ata7.00: cmd 60/00:d8:90:f4:2e/04:00:4d:00:00/40 tag 27 ncq 524288 in
         res 41/84:00:10:8e:21/00:04:4d:00:00/00 Emask 0x410 (ATA bus error) <F>
Oct  5 03:29:03 data-server kernel: ata7.00: status: { DRDY ERR }
Oct  5 03:29:03 data-server kernel: ata7.00: error: { ICRC ABRT }
Oct  5 03:29:03 data-server kernel: ata7: hard resetting link
Oct  5 03:29:03 data-server kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 03:29:03 data-server kernel: ata7.00: configured for UDMA/133
Oct  5 03:29:03 data-server kernel: ata7: EH complete
... and after a while:

Oct  5 08:43:03 data-server kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct  5 08:43:03 data-server kernel: sd 6:0:0:0: [sdf] CDB: 
Oct  5 08:43:03 data-server kernel: Read(16): 88 00 00 00 00 00 f3 0b a2 00 00 00 00 08 00 00
Oct  5 08:43:03 data-server kernel: md/raid:md1: Too many read errors, failing device sdf1.
Oct  5 08:43:03 data-server kernel: md/raid:md1: Disk failure on sdf1, disabling device.
md/raid:md1: Operation continuing on 3 devices.
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622784 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622792 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622800 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622808 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622816 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622824 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622832 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622840 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622848 on sdf1).
Oct  5 08:43:03 data-server kernel: md/raid:md1: read error not correctable (sector 4077622856 on sdf1).
Oct  5 08:43:03 data-server kernel: sd 6:0:0:0: [sdf] Unhandled error code
Oct  5 08:43:03 data-server kernel: sd 6:0:0:0: [sdf]  
Oct  5 08:43:03 data-server kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct  5 08:43:03 data-server kernel: sd 6:0:0:0: [sdf] CDB: 
Oct  5 08:43:03 data-server kernel: Write(16): 8a 00 00 00 00 00 f3 0b a1 f8 00 00 04 00 00 00
Oct  5 08:43:34 data-server kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  5 08:43:34 data-server kernel: ata10.00: failed command: FLUSH CACHE EXT
Oct  5 08:43:34 data-server kernel: ata10.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 26
         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Oct  5 08:43:34 data-server kernel: ata10.00: status: { DRDY }
Oct  5 08:43:34 data-server kernel: ata10: hard resetting link
Oct  5 08:43:34 data-server kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  5 08:43:34 data-server kernel: ata9.00: failed command: FLUSH CACHE EXT
Oct  5 08:43:34 data-server kernel: ata9.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 16
         res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)
Oct  5 08:43:34 data-server kernel: ata9.00: status: { DRDY }
Oct  5 08:43:34 data-server kernel: ata9: hard resetting link
Oct  5 08:43:34 data-server kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct  5 08:43:34 data-server kernel: ata8.00: failed command: FLUSH CACHE EXT
Oct  5 08:43:34 data-server kernel: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 28
         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Oct  5 08:43:34 data-server kernel: ata8.00: status: { DRDY }
Oct  5 08:43:34 data-server kernel: ata8: hard resetting link
Oct  5 08:43:35 data-server kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 08:43:35 data-server kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 08:43:35 data-server kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 08:43:40 data-server kernel: ata10.00: qc timeout (cmd 0xec)
Oct  5 08:43:40 data-server kernel: ata9.00: qc timeout (cmd 0xec)
Oct  5 08:43:40 data-server kernel: ata8.00: qc timeout (cmd 0xec)
Oct  5 08:43:40 data-server kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:43:40 data-server kernel: ata10.00: revalidation failed (errno=-5)
Oct  5 08:43:40 data-server kernel: ata10: hard resetting link
Oct  5 08:43:40 data-server kernel: ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:43:40 data-server kernel: ata9.00: revalidation failed (errno=-5)
Oct  5 08:43:40 data-server kernel: ata9: hard resetting link
Oct  5 08:43:40 data-server kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:43:40 data-server kernel: ata8.00: revalidation failed (errno=-5)
Oct  5 08:43:40 data-server kernel: ata8: hard resetting link
Oct  5 08:43:41 data-server kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 08:43:41 data-server kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 08:43:41 data-server kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Oct  5 08:43:51 data-server kernel: ata10.00: qc timeout (cmd 0xec)
Oct  5 08:43:51 data-server kernel: ata9.00: qc timeout (cmd 0xec)
Oct  5 08:43:51 data-server kernel: ata8.00: qc timeout (cmd 0xec)
Oct  5 08:43:51 data-server kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:43:51 data-server kernel: ata10.00: revalidation failed (errno=-5)
Oct  5 08:43:51 data-server kernel: ata10: limiting SATA link speed to 3.0 Gbps
Oct  5 08:43:51 data-server kernel: ata10: hard resetting link
Oct  5 08:43:51 data-server kernel: ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:43:51 data-server kernel: ata9.00: revalidation failed (errno=-5)
Oct  5 08:43:51 data-server kernel: ata9: limiting SATA link speed to 3.0 Gbps
Oct  5 08:43:51 data-server kernel: ata9: hard resetting link
Oct  5 08:43:51 data-server kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:43:51 data-server kernel: ata8.00: revalidation failed (errno=-5)
Oct  5 08:43:51 data-server kernel: ata8: limiting SATA link speed to 3.0 Gbps
Oct  5 08:43:51 data-server kernel: ata8: hard resetting link
Oct  5 08:43:52 data-server kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct  5 08:43:52 data-server kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct  5 08:43:52 data-server kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct  5 08:44:22 data-server kernel: ata10.00: qc timeout (cmd 0xec)
Oct  5 08:44:22 data-server kernel: ata9.00: qc timeout (cmd 0xec)
Oct  5 08:44:22 data-server kernel: ata8.00: qc timeout (cmd 0xec)
Oct  5 08:44:23 data-server kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:44:23 data-server kernel: ata10.00: revalidation failed (errno=-5)
Oct  5 08:44:23 data-server kernel: ata10.00: disabled
Oct  5 08:44:23 data-server kernel: ata10.00: device reported invalid CHS sector 0
Oct  5 08:44:23 data-server kernel: ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:44:23 data-server kernel: ata9.00: revalidation failed (errno=-5)
Oct  5 08:44:23 data-server kernel: ata9.00: disabled
Oct  5 08:44:23 data-server kernel: ata9.00: device reported invalid CHS sector 0
Oct  5 08:44:23 data-server kernel: ata8.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Oct  5 08:44:23 data-server kernel: ata8.00: revalidation failed (errno=-5)
Oct  5 08:44:23 data-server kernel: ata8.00: disabled
Oct  5 08:44:23 data-server kernel: ata8.00: device reported invalid CHS sector 0
Oct  5 08:44:23 data-server kernel: ata10: hard resetting link
Oct  5 08:44:23 data-server kernel: ata9: hard resetting link
Oct  5 08:44:23 data-server kernel: ata8: hard resetting link
Oct  5 08:44:24 data-server kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct  5 08:44:24 data-server kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct  5 08:44:24 data-server kernel: ata8: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
Oct  5 08:44:25 data-server kernel: ata10: EH complete
Oct  5 08:44:25 data-server kernel: sd 9:0:0:0: [sdi] Unhandled error code
Oct  5 08:44:25 data-server kernel: sd 9:0:0:0: [sdi]  
Oct  5 08:44:25 data-server kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct  5 08:44:25 data-server kernel: sd 9:0:0:0: [sdi] CDB: 
Oct  5 08:44:25 data-server kernel: Write(16): 8a 00 00 00 00 00 00 00 08 08 00 00 00 01 00 00
Oct  5 08:44:25 data-server kernel: blk_update_request: 249 callbacks suppressed
Oct  5 08:44:25 data-server kernel: end_request: I/O error, dev sdi, sector 2056
Oct  5 08:44:25 data-server kernel: end_request: I/O error, dev sdi, sector 2056
Oct  5 08:44:25 data-server kernel: md: super_written gets error=-5, uptodate=0
Oct  5 08:44:25 data-server kernel: md/raid:md1: Disk failure on sdi1, disabling device.
md/raid:md1: Operation continuing on 2 devices.
Oct  5 08:44:25 data-server kernel: ata9: EH complete
Oct  5 08:44:25 data-server kernel: sd 8:0:0:0: [sdh] Unhandled error code
Oct  5 08:44:25 data-server kernel: sd 8:0:0:0: [sdh]  
Oct  5 08:44:25 data-server kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct  5 08:44:25 data-server kernel: sd 8:0:0:0: [sdh] CDB: 
Oct  5 08:44:25 data-server kernel: Write(16): 8a 00 00 00 00 00 00 00 08 08 00 00 00 01 00 00
Oct  5 08:44:25 data-server kernel: end_request: I/O error, dev sdh, sector 2056
Oct  5 08:44:25 data-server kernel: end_request: I/O error, dev sdh, sector 2056
Oct  5 08:44:25 data-server kernel: md: super_written gets error=-5, uptodate=0
Oct  5 08:44:25 data-server kernel: md/raid:md1: Disk failure on sdh1, disabling device.
md/raid:md1: Operation continuing on 1 devices.
Oct  5 08:44:25 data-server kernel: ata8: EH complete
Oct  5 08:44:25 data-server kernel: sd 7:0:0:0: [sdg] Unhandled error code
Oct  5 08:44:25 data-server kernel: sd 7:0:0:0: [sdg]  
Oct  5 08:44:25 data-server kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct  5 08:44:25 data-server kernel: sd 7:0:0:0: [sdg] CDB: 
Oct  5 08:44:25 data-server kernel: Write(16): 8a 00 00 00 00 00 00 00 08 08 00 00 00 01 00 00
Oct  5 08:44:25 data-server kernel: end_request: I/O error, dev sdg, sector 2056
Oct  5 08:44:25 data-server kernel: end_request: I/O error, dev sdg, sector 2056
Oct  5 08:44:25 data-server kernel: md: super_written gets error=-5, uptodate=0
Oct  5 08:44:25 data-server kernel: md/raid:md1: Disk failure on sdg1, disabling device.
md/raid:md1: Operation continuing on 0 devices.
Since all 4 disks dying simultaneously (without excessive force ;) ) is pretty much impossible, I suspect the StarTech RAID controller is the broken part ...
I checked their website, but there are no Linux drivers or firmware updates ...
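
One way to check the "all at once" suspicion against the log above is to tally hard link resets per ATA port and see which ports reset in the same second. This is only a minimal Python sketch: it assumes the /var/log/messages line format shown above, the function names are made up for illustration, and the sample lines are copied from the log.

```python
import re
from collections import defaultdict

# Matches kernel lines such as
# "Oct  5 08:43:34 data-server kernel: ata10: hard resetting link"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"ata(?P<port>\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)


def resets_by_port(lines):
    """Map each ATA port number to the timestamps of its hard link resets."""
    resets = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m and "hard resetting link" in m.group("msg"):
            resets[int(m.group("port"))].append(m.group("ts"))
    return dict(resets)


def simultaneous_resets(resets):
    """Return the timestamps at which more than one port was reset."""
    ports_at = defaultdict(set)
    for port, stamps in resets.items():
        for ts in stamps:
            ports_at[ts].add(port)
    return {ts: sorted(p) for ts, p in ports_at.items() if len(p) > 1}


if __name__ == "__main__":
    # Sample lines copied from the log above.
    sample = [
        "Oct  5 03:29:03 data-server kernel: ata7: hard resetting link",
        "Oct  5 08:43:34 data-server kernel: ata10: hard resetting link",
        "Oct  5 08:43:34 data-server kernel: ata9: hard resetting link",
        "Oct  5 08:43:34 data-server kernel: ata8: hard resetting link",
    ]
    print(simultaneous_resets(resets_by_port(sample)))
```

Several ports resetting in the same second would point toward something they share (controller, cabling, power) rather than independent disk failures, although it does not say which of those it is.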

Has anyone else come across such an issue? If so, how did you solve it?

I'm thinking of replacing the controller ... but that does not seem to be such an easy task.
Which hardware SATA-3 controller (4 internal ports!) with good Linux support would you suggest?

As this is a home NAS system, I don't need (nor can afford) high-end Raid Controllers ...

Any help would be highly appreciated :)

avij
Retired Moderator
Posts: 3046
Joined: 2010/12/01 19:25:52
Location: Helsinki, Finland

Re: Hardware is failing - need advice

Post by avij » 2014/10/13 22:11:46

Just a shot in the dark, but does your power supply deliver enough power for all four hard disks?

may24
Posts: 30
Joined: 2014/10/13 15:35:36

Re: Hardware is failing - need advice

Post by may24 » 2014/10/14 09:51:55

Hm, good question. Well, I have a 350 Watt power supply.
According to the disks' technical specs, each consumes 5 W in operating mode.
The E450's Thermal Design Power is 18 Watt.

This sounds like way more than enough ...
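
The arithmetic behind that estimate can be written out as a small Python sketch. It uses only the figures quoted in this post (4 disks at 5 W, 18 W TDP, 350 W supply); the function name is made up for illustration, and the sum deliberately ignores spin-up peaks, the controller card, the rest of the board, and per-rail limits, none of which are given here.

```python
def steady_state_watts(disk_count, watts_per_disk, cpu_tdp_watts):
    """Sum the steady-state draw of the disks and the CPU only."""
    return disk_count * watts_per_disk + cpu_tdp_watts


if __name__ == "__main__":
    psu_watts = 350.0  # rated supply power from the post
    load = steady_state_watts(4, 5.0, 18.0)
    print(f"estimated steady-state load: {load} W")  # 38.0 W
    print(f"nominal headroom: {psu_watts - load} W")  # 312.0 W
```

On paper the headroom is large, so if power is involved at all it is more likely a question of peaks or a weak rail than of total wattage.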

BTW: If I were running low on power, wouldn't there be an entry in /var/log/messages?

aks
Posts: 3073
Joined: 2014/09/20 11:22:14

Re: Hardware is failing - need advice

Post by aks » 2014/10/14 15:59:33

It *could* be a firmware issue - check with the manufacturer. What does SMART say?
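
For reference, SMART data can be read with `smartctl -a` from smartmontools. The following is a minimal Python sketch that runs it for each disk; it assumes smartmontools is installed and the script runs as root, the device names are the ones from the log in the first post, and the helper name is made up for illustration.

```python
import shutil
import subprocess


def smartctl_commands(devices):
    """Build one `smartctl -a` argument list per device."""
    return [["smartctl", "-a", dev] for dev in devices]


if __name__ == "__main__":
    # Device names taken from the log in the first post.
    devices = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]
    if shutil.which("smartctl") is None:
        print("smartctl not found; install smartmontools first")
    else:
        for argv in smartctl_commands(devices):
            # smartctl reports status bits in its exit code,
            # so a non-zero exit is not treated as a failure here.
            result = subprocess.run(argv, capture_output=True, text=True)
            print(result.stdout)
```

Given the ICRC errors in the log, the UDMA_CRC_Error_Count attribute (ID 199) in that output may be worth a look, since it counts errors on the link rather than on the platters.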

Regards
