Intermittent hang/high iowait on software Raid 5
I'm running into a very odd hanging issue on CentOS 5.3. It appears to be software RAID related, but I can't find much troubleshooting guidance out there on how to debug the problem and determine the root cause.
I apologize for the long post and the hard to read formatting, but I wanted to give as much information as possible.
A warning as well, I'm by no means a Linux expert.
CentOS 5.3 x86_64, 2.6.18-128
[root@drawer init.d]# uname -a
Linux drawer 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
The system has three 1 TB SATA drives, configured with a RAID 1 for the boot partition and a RAID 5 (with LVM on top) for file storage. The system is an Intel dual-core 2.2 GHz with 4 GB of RAM.
At random the system will become completely unresponsive. I've started running iostat continuously, and when the hang occurs I see output like this:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.50 99.50 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 9.00 38.40 0.60 1.40 38.40 160.00 198.40 0.07 35.50 8.50 1.70
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 9.00 38.40 0.60 1.40 38.40 160.00 198.40 0.07 35.50 8.50 1.70
sdb 0.80 46.60 0.60 1.40 5.60 192.80 198.40 0.07 34.00 6.70 1.34
sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb2 0.80 46.60 0.60 1.40 5.60 192.80 198.40 0.07 34.00 6.70 1.34
sdc 0.00 47.60 0.20 1.20 0.80 188.00 269.71 9.97 7394.57 714.29 100.00
sdc1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc2 0.00 47.60 0.20 1.20 0.80 188.00 269.71 9.97 7394.57 714.29 100.00
md1 0.00 0.00 0.00 85.20 0.00 340.80 8.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VG0-LV0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 0.00 0.00 100.00
VG0-LV4 0.00 0.00 0.00 85.20 0.00 340.80 8.00 443.59 9654.17 11.74 100.00
VG0-LV2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.96 0.00 0.00 100.00
VG0-LV1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.00
VG0-LV3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13.00 0.00 0.00 100.00
VG0-LVSWAP 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Looking at this output, it appears that one drive is holding up the array, sdc (specifically the RAID 5 portion of this device, sdc2), causing the iowait. This lasts for a few minutes, and then the system returns to a normal state.
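To catch more detail the next time it happens, I'm planning to leave a small logging loop running, roughly like this (just a sketch; the log path is arbitrary and it assumes the same sysstat iostat I'm already using):
[code]
#!/bin/bash
# Log extended iostat plus any processes stuck in uninterruptible sleep (D state),
# with timestamps, so the next hang leaves a trail to look at afterwards.
while true; do
    date >> /var/log/hang-trace.log
    iostat -x 1 1 >> /var/log/hang-trace.log
    ps -eo state,pid,wchan:30,cmd | awk '$1 == "D"' >> /var/log/hang-trace.log
    sleep 5
done
[/code]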
The output below is from the system when everything is OK.
[root@drawer init.d]# hdparm -T -t /dev/sda
/dev/sda:
Timing cached reads: 4756 MB in 2.00 seconds = 2378.40 MB/sec
Timing buffered disk reads: 266 MB in 3.09 seconds = 86.07 MB/sec
[root@drawer init.d]# hdparm -T -t /dev/sdb
/dev/sdb:
Timing cached reads: 4760 MB in 2.00 seconds = 2379.91 MB/sec
Timing buffered disk reads: 286 MB in 3.02 seconds = 94.75 MB/sec
[root@drawer init.d]# hdparm -T -t /dev/sdc
/dev/sdc:
Timing cached reads: 4752 MB in 2.00 seconds = 2376.43 MB/sec
Timing buffered disk reads: 274 MB in 3.00 seconds = 91.25 MB/sec
If this were a drive error, I would expect the SMART output to show it. Here is the smartctl output from /dev/sdc:
[root@drawer init.d]# smartctl --all /dev/sdc
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: WDC WD10EADS-00M2B0
Serial Number: WD-WCAV51035383
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Oct 5 23:05:31 2009 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x85) Offline data collection activity
was aborted by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21180) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 244) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 109 109 021 Pre-fail Always - 7516
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 31
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 125
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 29
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 14
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 1840
194 Temperature_Celsius 0x0022 111 102 000 Old_age Always - 36
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 124 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I have only run one self-test on this drive, and it's possible that the flags I used didn't test the drive fully, so treat this output accordingly.
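It's also possible to run a full extended self-test and check the log afterwards; I may try something like this on the suspect drive (a sketch, using whatever device name the suspect disk has at the time):
[code]
# Start a long (extended) offline self-test; it runs in the background on the drive itself
smartctl -t long /dev/sdc
# The output above says it needs roughly 244 minutes; check the result afterwards with:
smartctl -l selftest /dev/sdc
[/code]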
Here is a write test of the array, and the corresponding iostat output taken during the test.
[root@drawer init.d]# time dd if=/dev/zero of=/files/test1 bs=8192k count=450
450+0 records in
450+0 records out
3774873600 bytes (3.8 GB) copied, 41.6163 seconds, 90.7 MB/s
real 0m41.903s
user 0m0.000s
sys 0m11.036s
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 34.00 29.75 0.00 36.25
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 399.00 8358.50 20.00 187.00 1672.00 34326.00 347.81 3.09 14.80 4.54 94.05
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 399.00 8358.50 20.00 187.00 1672.00 34326.00 347.81 3.09 14.80 4.54 94.05
sdb 139.00 8364.50 12.00 186.00 710.00 34218.00 352.81 3.04 15.43 4.72 93.40
sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb2 139.00 8364.50 12.00 186.00 710.00 34218.00 352.81 3.04 15.43 4.72 93.40
sdc 288.50 8099.00 19.00 365.00 1230.00 33858.00 182.75 1.52 3.95 1.45 55.50
sdc1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc2 288.50 8099.00 19.00 365.00 1230.00 33858.00 182.75 1.52 3.95 1.44 55.45
md1 0.00 0.00 0.00 16988.50 0.00 67810.00 7.98 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VG0-LV0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VG0-LV4 0.00 0.00 0.00 16940.50 0.00 67762.00 8.00 421.88 24.88 0.06 100.05
VG0-LV2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VG0-LV1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
VG0-LV3 0.00 0.00 0.00 48.00 0.00 48.00 2.00 5.28 110.06 10.96 52.60
VG0-LVSWAP 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
hde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
While I believe it could be an intermittent hardware issue, no evidence in any log I can find points to one of the drives actually throwing errors. The iostat output shows the iowait concentrated on that one device, but I can't seem to find any more information on exactly what the system is doing during the problem.
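One thing I may try during the next hang is dumping the kernel's view of all task states via magic SysRq, something along these lines (a sketch; it needs SysRq enabled, and the 't' dump can be large):
[code]
# Allow the magic SysRq trigger, then dump the state and stack trace of every task;
# anything stuck in D state should show where in the kernel it is waiting
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
# The traces end up in the kernel log
dmesg | tail -n 200
[/code]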
Thanks in advance for any assistance.
Re: Intermittent hang/high iowait on software Raid 5
Have you looked at the dmesg output? The system becoming unresponsive can be a symptom for a bus reset caused by faulty cabling, etc.
If that's the case you should see messages in the dmesg log.
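A rough way to check for that kind of thing (just a sketch, nothing fancy):
[code]
# Look for ATA link resets, timeouts or "frozen" ports in the kernel ring buffer and syslog
dmesg | grep -iE 'ata[0-9]|reset|timeout|frozen'
grep -iE 'ata[0-9]|exception|frozen' /var/log/messages | tail -n 50
[/code]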
Re: Intermittent hang/high iowait on software Raid 5
These are 5400 RPM "green" drives w/ 16 MB cache... There are only 3 drives in the RAID 5 (effectively 2 drives' worth of possible performance)... Search the forum for "dstat" (per-process i/o accounting)... Make sure that there is not some RAID (re)build currently happening with "cat /proc/mdstat". Also run the hdparm command on the LV.
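A quick way to check for a background rebuild (rough sketch, assuming the RAID 5 array is md1):
[code]
# Any resync/recovery/check in progress shows up here with a progress bar
cat /proc/mdstat
# Current background action for the array, if the kernel exposes it ("idle" means nothing running)
cat /sys/block/md1/md/sync_action
[/code]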
Re: Intermittent hang/high iowait on software Raid 5
[quote]Have you looked at the dmesg output? The system becoming unresponsive can be a symptom for a bus reset caused by faulty cabling, etc.
If that's the case you should see messages in the dmesg log.[/quote]
No dmesg output at all when the event happens. I'll try replacing all of the SATA cables tonight and see if that helps.
[quote]
pjwelsh wrote:
These are 5400 RPM "green" drives w/ 16MB cache... There is only 3 drives in RAID 5 (effective 2 drive possible performance)... Search the forum for "dstat"(per process i/o accounting)... Make sure that there is not some RAID (re)build currently happening by "cat /proc/mdstat". Run the hparm command on the LV.[/quote]
So I booted up today, and now the "hanging" I/O is on sdb instead of sdc. You are correct about the drives; this is just a home file/VM server with a minimal budget. I know the performance isn't going to be spectacular, but when the system doesn't hang it works great.
No rebuild happening. I did have one a few days back, and while the system was slow during that time, it didn't seem to hang for long periods of time.
I just installed dstat and with a few minutes of playing around I couldn't get it to produce output which might give me the right information. I'll search for more info on it.
Thanks for all the help.
mdadm --detail and /proc/mdstat output:
[root@drawer log]# mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.03
Creation Time : Sun Sep 20 15:20:14 2009
Raid Level : raid5
Array Size : 1953005568 (1862.53 GiB 1999.88 GB)
Used Dev Size : 976502784 (931.27 GiB 999.94 GB)
Raid Devices : 3
Total Devices : 3
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Tue Oct 6 09:59:46 2009
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 256K
UUID : e4adf874:1945b305:eb28daa6:5a3a941b
Events : 0.5167
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 34 1 active sync /dev/sdc2
2 8 18 2 active sync /dev/sdb2
[root@drawer log]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sda1[0] sdb1[2] sdc1[1]
256896 blocks [3/3] [UUU]
md1 : active raid5 sdc2[1] sdb2[2] sda2[0]
1953005568 blocks level 5, 256k chunk, algorithm 2 [3/3] [UUU]
unused devices:
Re: Intermittent hang/high iowait on software Raid 5
Here is some hdparm output. The first two runs produce disk reads in what looks like the appropriate range. During the last two tests the issue was showing in the iostat output, with sdb hanging, and the disk read rate is very low.
/dev/mapper/VG0-LV0:
Timing cached reads: 4792 MB in 2.00 seconds = 2396.17 MB/sec
Timing buffered disk reads: 466 MB in 3.00 seconds = 155.31 MB/sec
[root@drawer log]# hdparm -T -t /dev/mapper/VG0-LV4
/dev/mapper/VG0-LV4:
Timing cached reads: 4768 MB in 2.00 seconds = 2384.47 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate ioctl for device
Timing buffered disk reads: 494 MB in 3.01 seconds = 164.34 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate ioctl for device
[root@drawer log]# hdparm -T -t /dev/mapper/VG0-LV1
/dev/mapper/VG0-LV1:
Timing cached reads: 4780 MB in 2.00 seconds = 2390.12 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate ioctl for device
Timing buffered disk reads: 6 MB in 4.31 seconds = 1.39 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate ioctl for device
[root@drawer log]# hdparm -T -t /dev/mapper/VG0-LV0
/dev/mapper/VG0-LV0:
Timing cached reads: 4756 MB in 2.00 seconds = 2378.24 MB/sec
Timing buffered disk reads: 6 MB in 4.26 seconds = 1.41 MB/sec
Re: Intermittent hang/high iowait on software Raid 5
The dstat command for i/o looks like [code]dstat -M topio -d -M topbio[/code]. If you only get 2 columns of data, then you need to update your kernel.
Also, there have been reports of misc issues with certain SATA chipsets and controller cards. You may want to Google either of those for possible resolutions.
I do recall some past R5 issue that was helped with some echo something > /proc/something... I will try to get the details on that "something" soon.
Re: Intermittent hang/high iowait on software Raid 5
[quote]
pjwelsh wrote:
The dstat command for i/o looks like[code]dstat -M topio -d -M topbio[/code] if you only get 2 columns of data, then you need to update your kernel.
Also, there have been reports of misc issue with certain SATA chipsets and controller cards. You may want to google either of those for issue resolution.
I do recall some past R5 issue that was helped with some echo something > /proc/something... I will try to get the details on that "something" soon.[/quote]
The dstat command you specify only gave two columns of output. Do I just need to upgrade to a newer kernel?
Re: Intermittent hang/high iowait on software Raid 5
At least as of kernel 2.6.18-164.el5 (and I think the one before it), per-process i/o accounting is enabled.
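A quick way to check what you are running and pull the newer one in (assuming a kernel update is acceptable on that box):
[code]
uname -r            # currently running kernel version
yum update kernel   # install the latest available kernel package
# then reboot into the new kernel
[/code]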
I can't yet find the stuff I thought I remembered, but I do have notes on better md tuning using different schedulers, like [code]echo "deadline" > /sys/block/sde/queue/scheduler[/code] for each drive used.
Edit:
Found some incomplete notes on additional performance tuning testing on the md stuff and original el5. I have items like:
echo "64" > /sys/block/sda/queue/max_sectors_kb #for each drive
blockdev --setra 16384 /dev/sda #for each drive
echo "512" > /sys/block/sda/queue/nr_requests #for each drive
echo "deadline" > /sys/block/sda/queue/scheduler #for each drive
echo "20" > /proc/sys/vm/dirty_background_ratio
echo "60" > /proc/sys/vm/dirty_ratio
The "/proc/sys/vm" items were the ones related to part of a R5 hang IIRC. The deadline scheduler seemed to be generally best for *MY* needs. Part of this was testing md -vs- sudo hw raid5 on some older 3ware SATA cards for a small NASish box.
Re: Intermittent hang/high iowait on software Raid 5
I modified the scheduler and the ratio settings, with no effect.
I have noticed a pattern: the device that is "hanging" is always the last device listed in the mdadm and /proc/mdstat output. Prior to today, sdc2 was listed as the last device; now sdb2 is shown in the last position and it's the one at 100% utilization.
Re: Intermittent hang/high iowait on software Raid 5
Well, it turns out that what used to be "sdc" is now detected as "sdb". I ran smartctl on the "sdb" device and it shows the same output that I included for "sdc" yesterday, namely that there is one completed self-test listed. The current "sdc" device shows no self-tests in its history. Edit: And the serial number matches, too. :)
Odd, I guess, since I didn't make any cable changes, but it at least points to the same physical disk causing the issue.
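For reference, here is roughly how I matched the drive letters to physical disks by serial number (a simple sketch):
[code]
# Print the serial number reported by each drive so letters can be matched to physical disks
for d in /dev/sd[abc]; do
    echo -n "$d: "
    smartctl -i $d | grep -i 'serial number'
done
[/code]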