[solved] RAID - a cautionary tale

Issues related to hardware problems

Post by MartinR » 2021/01/15 09:58:27

After the last couple of days I thought I should post this here, as there are a couple of "gotchas" which others might be interested in.

Background

My system is a general-purpose home machine used for development, as a workstation and as a server to other machines on the home network. I have about a dozen VMs that can be run up when required; they are used for general research and for exploring alternatives to CentOS Stream ( :cry: ). There are three spindles: one handles the main system, a small one (ex-laptop) is a buffer for backups, and the third is a relatively new 2 TB WD disk. The WD disk has two partitions: one is mounted on /virt for the qcow files, and the other is part of LVM and holds the family photo archive.

First problem

The big disk took itself offline and refused to come back. Eventually I hoovered out the inside of the tower case, reseated all the disk's connections and managed to get it back. Needless to say I immediately made a copy (zipped tarball) of the photos and the most important VMs. I then ran a full backup - phew! Obviously the drive is now suspect, so I started to consider alternatives. To cut a long story short, I purchased an ORICO DS500U3 external box, which connects to the system over USB3 and provides five SATA slots for 3.5" disks. In passing, I did look at a dedicated RAID box, but it was more expensive and had a longer delivery time. I also purchased three 1 TB Seagate Barracudas and installed them. After benchmarking the drives, I bound them into a RAIDset:

# mdadm -C /dev/md0 -l5 -n3 /dev/sd{i,j,k}

as I recall. In the words of the mdadm manual, "there is no need for the initial resync to finish", so I immediately ran a benchmark on /dev/md0. A quick pvcreate and vgextend later and I had a nice large volume group.
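
For anyone repeating this, a check along the following lines before the pvcreate would show whether the new array is still degraded or rebuilding. It is only a minimal sketch and not something that was run at the time: the function name md_ready is made up for illustration, and it assumes the usual /proc/mdstat layout, where each array's stanza carries a member-status field such as [UUU] (all up) or [UU_] (one missing).

```shell
#!/usr/bin/env bash
# md_ready ARRAY [MDSTAT_FILE]   (hypothetical helper, not part of mdadm)
# Succeeds only if ARRAY is listed with every member up (no "_" in the
# [UU_]-style field) and no resync, recovery or reshape is in progress.
# MDSTAT_FILE defaults to /proc/mdstat; it is a parameter so that the
# check can also be run against a saved copy.
md_ready() {
    local array=${1#/dev/} mdstat=${2:-/proc/mdstat}
    awk -v md="$array" '
        $1 == md && $2 == ":"  { in_md = 1; found = 1; next }  # stanza header
        in_md && NF == 0       { in_md = 0 }                   # blank line ends it
        in_md && /^[^ \t]/     { in_md = 0 }                   # so does the next stanza
        in_md && match($0, /\[[U_]+\]/) {
            # An underscore in the status field marks a missing member
            if (index(substr($0, RSTART, RLENGTH), "_")) bad = 1
        }
        in_md && /resync|recovery|reshape/ { bad = 1 }         # still syncing
        END { exit !(found && !bad) }
    ' "$mdstat"
}

# Example:
#   md_ready /dev/md0 && echo "md0 is complete and in sync"
```

Only when this succeeds is the array both complete and in sync, which is the state to be in before handing it to LVM.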

Major problem

You'll recall that the WD disk had a 1 TB PV on it, which basically held the filesystem where the photos were. I then set off a pvmove to shift the data off the (supposedly) unreliable disk onto the (supposedly) reliable RAID system. The system froze at once with a kernel panic. It then refused to boot and dropped me at the emergency prompt. LVM refused to assemble, and since /var was elsewhere in the LVM I had an unusable system. Further investigation narrowed the problem down to the RAIDset, which was now dirty and degraded. Ultimately the LVM could only be recovered by dumping the configuration out (vgcfgbackup), editing the config file to remove all reference to the RAIDset and the photos filesystem, then restoring the configuration (vgcfgrestore).

Analysis

The RAIDset looked odd. Two spindles were active and one was a spare, hence the RAID5 was degraded. Some data transfer had started, hence the RAIDset was "dirty". LVM refuses to accept RAIDsets which are both dirty and degraded. Furthermore it "locks" the PV and won't allow you to remove it. Getting back to the RAIDset, there is a paragraph buried away on page 22 of the manual that mentions a "feature" of RAID5 creation: it automatically creates an n-1 degraded array with a spare, since this is faster! :evil: I cleared out the old array, recreated it with --force and now have a functional RAIDset. One final word of warning: the initial sync can take several hours and the system will refuse to shut down during that period.
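
The safer ordering described above can be sketched as a small script. This is an illustration under stated assumptions, not a record of the exact commands used: the helper names run and create_raid5_then_extend and the volume group name vg_home are made up, and by default the script only prints the commands (DRY_RUN=1) rather than touching any disks.

```shell
#!/usr/bin/env bash
# Print one command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# create_raid5_then_extend VG MD DEVICE...   (hypothetical helper)
create_raid5_then_extend() {
    local vg=$1 md=$2
    shift 2
    # --force stops mdadm from building the array degraded with a spare;
    # all members are active from the start and parity is resynced.
    run mdadm --create "$md" --level=5 --raid-devices="$#" --force "$@" || return
    # Block until the initial resync is over. The exit status is ignored
    # on purpose: mdadm --wait reports failure when there was nothing to
    # wait for, so confirm the final state in /proc/mdstat instead.
    run mdadm --wait "$md"
    # Only now hand the array over to LVM.
    run pvcreate "$md" || return
    run vgextend "$vg" "$md"
}

# Example (prints the four commands, changes nothing):
#   create_raid5_then_extend vg_home /dev/md0 /dev/sd{i,j,k}
```

The point of the ordering is that pvcreate and anything like a pvmove only happen after the array has all its members active and the sync has finished, at the cost of waiting out the several hours mentioned above.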

Hope this is of some interest and saves someone else from a day or so's wasted time.

