Drive 1 of 4 failed in the middle of D4 Repair

Setup:
Model: DS920+
DSM: 7.2
RAID: SHR
Drives: started with (4) 4TB IronWolf drives in 1 volume
Current: (4) 14TB IronWolf drives inserted
NVMe: (2) 1TB NVMe drives configured as a second volume, both Healthy

  • D1 - Critical
  • D2 - Healthy
  • D3 - Healthy
  • D4 - Repairing (39% done, currently incrementing at about 0.08% per 24-hour period; a quick way to watch this over SSH is sketched below)

—–
A week ago it had (4) IronWolf 4TB drives.
I bought 4 new IronWolf 14TB drives.
No backup.
I replaced Drives 1-3 and the volume repaired just fine each time.
During the Drive 4 replacement, Drive 1 went CRITICAL with UNC errors.
The system slowed to barely usable. Simple actions like logging into the web interface took 5-10 minutes and often failed outright.
SSH also took 5-10 minutes to log in, and then another 5 minutes when I did a sudo -i.

  • D1 utilization is at 100%
  • All other drives in the volume are typically at 0% but occasionally show some usage
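
Side note on the repair speed: since SHR sits on top of Linux md, the rebuild progress and speed can also be read over SSH from /proc/mdstat. A rough sketch of how to watch it; treat md2 as a placeholder, since the md device numbering varies between units:

```
# Show all md arrays, including rebuild/resync progress and current speed
cat /proc/mdstat

# Re-check once a minute to see how fast the percentage actually moves
# (md2 is a placeholder; pick the right array from the output above)
while true; do date; grep -A 3 '^md2' /proc/mdstat; sleep 60; done
```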

I have done a number of things to try and work through this.

  • Backing up in its current state is possible, but it is so slow that I think it would take months at the current rate.
  • I removed the UNC errors from the database and restarted the machine to try to force a Healthy status in case it was a false error. Within 10 minutes of the system coming back online, new UNC errors showed up, so the drive really is going bad. (A sketch of checking the raw SMART counters over SSH follows.)
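
For anyone following along, one way to sanity-check whether those UNC errors match what the drive itself reports is smartctl, which as far as I know is already present on DSM. A minimal sketch; /dev/sata1 is a placeholder, and older DSM versions expose drives as /dev/sda-style names instead:

```
# Full SMART report for the suspect drive (run as root)
# /dev/sata1 is a placeholder - check `ls /dev/sata* /dev/sd*` first
smartctl -a /dev/sata1

# The counters that matter most for a dying disk:
#   5   Reallocated_Sector_Ct
#   197 Current_Pending_Sector
#   198 Offline_Uncorrectable
smartctl -A /dev/sata1 | grep -E 'Reallocated|Pending|Uncorrect'
```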

I don’t know what to do at this point. I want to back up in the fastest way possible, but I can’t seem to speed that process up because D4 can’t be used in the volume yet. Anyone have ideas on how to proceed?
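
To be clear about what I mean by backing up in its current state: the idea is to pull the most important shares off first rather than everything at once. A minimal sketch of that kind of copy, assuming rsync over SSH to another Linux box; the user, hostname, and paths are placeholders:

```
# Pull the critical share first; --partial makes the copy resumable if the NAS stalls
# (backupuser, nas.local, and the paths are placeholders)
rsync -avh --partial --progress \
    backupuser@nas.local:/volume1/docker/ \
    /mnt/backup/docker/

# Then the bulkier, less critical shares
rsync -avh --partial --progress \
    backupuser@nas.local:/volume1/media/ \
    /mnt/backup/media/
```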

Difficult. What you did not mention (and it is the most relevant piece of information in the entire story) is the RAID type of your storage pool. I assume it is SHR or RAID 5 with 1-drive fault tolerance.

The fact that you do not have a backup is something you probably regret very much. But if I may ask, what was your reasoning for not making one regularly? I am trying to understand the logic people follow.

Hi Paul,
You are correct, it is SHR. I’ll edit my original post to include that important bit of info.

Looking back, I can see what happened. The NAS was something I had wanted for my home for some time. I wanted to move my HTPC (Plex) and the surrounding services onto a NAS running everything in Docker. I picked up the Synology because of the price; I got lucky on eBay and got it for $400 including the four 4TB drives. I then began the evolving journey of building a sweet home setup, and I think I forgot along the way how hard it would be to rebuild if I lost everything. Nothing magic, it just evolved into a more complex system over time. At the beginning I was playing around and felt like I would simply rebuild if I screwed something up. Now, with this issue showing up, I am making a list of what it would take to rebuild, and I sincerely wish I had a full backup.

Side note: I set up an S3 Glacier bucket for this purpose about 2 weeks ago and was pushing small amounts of data up to it to test and see what the costs looked like. Again, I kinda wish I had pushed the whole thing now.
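
For what it's worth, the test uploads don't need anything fancy. With the plain AWS CLI, pushing a folder straight into a Glacier storage class looks roughly like this; the bucket name and paths are placeholders, and this is just a sketch, not a claim that it's the best route from DSM:

```
# One-time setup: store access key, secret, and region for the CLI
aws configure

# Push a test folder; objects are written with the Glacier storage class
# (my-nas-archive and /volume1/photos are placeholders)
aws s3 sync /volume1/photos s3://my-nas-archive/photos --storage-class GLACIER

# Rough cost sanity check: total size uploaded so far
aws s3 ls s3://my-nas-archive --recursive --summarize --human-readable
```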

Hi Boyd,
Thanks for sharing. Makes sense. Rebuilding a RAID onto a new HDD often goes well, but not always, because the rebuild pushes all of the remaining disks to their limits.

With clients, I always make a backup first or demand it be made before proceeding with a drive swap.
