Are Western Digital 6TB drives WD60EFRX drives ok to use?

Stephan_Angele · June 26, 2023, 2:02pm

So we had 6 NAS systems with a total of 24 WD60-EFAX 6TB SMR drives running BTRFS storate pools. Due to the horrible performance of the SMR drives we opened a ticket with WD and got all of the drives replaced with WD60-EFRX drives.

Allthough we had BTRFS snapshots replicated to our backup layer we still wanted to retain the local BTRFS snapshots. So we simply replaced the EFAX drives with EFRX drives one-by-one mainting the integrity of the storage pool. We ran the SMART check on each new disk and each disk got approved without errors. This was a process spanning several weeks as not to create down-times to the customers. We were lucky that the effected boxes were exclusively deployed for AB4B so not customer suffered from the poor performance.

This morning I had the 3rd unreponsive DS1522+ in 3 weeks were one of the replacement EFRX crashed. The NAS became so unresponsive that 2FA blocked the login and the system became bricked. Not even SMB was working.

I regained access:

remove all 4 EFRX drives,
insert a new 5th drive as tempory drive.
Install DSM (to chekc if NAS or disk are faulty)
Turn off System
Reinsert the 4 EFRX drives (the complete storage pool, including the faulty disk)
Start Synology Assistant → search for NAS → Migrteable NAS found
reinstall DSM with Mode2 reset (maintains data but not settings)
remove the 5th disk
Now the NAS is back online with the storage pool of the old NAS in critical condition because the crashed HDD was not mountet to the storage pool. But it was resposibe again and all data seem to be there.

Checking the health information of the crashed disk I get absolutely no error indication.
I have exactly the same behaviour on 3 different NAS systems (DS1520+; DS1522+, DS1522+) with brand new WD60-EFRX disk coming from different production batches of WD

What am I missing here? After rebuilding the storge pools on the other NAS I got some corruption warnungs but data scrubbing says it has repaired the issues.

I know the precise moment the NAS crashed and I have replicated snapshots (now flagged “blocked” againts deletion through space reclamation) in the local network. So if the current pool should turnes out to be corrupt I can always do a fail-over and merge with a current snapshot. So the data corruption is not my biggest issue.

I am more concerend as to why these systems crashed, if they will contirnue to crash, if there might be bad luggage from the SMR storage pool,…
On the long run I will likely replace all WD disk with disks from other vendors, most likely with EXOS 16TB (over 50pcs installed with no problems so far)
But none of the 3 failing discs has more than 1000hrs

I will not open an ticket with Synology as in the past this turned out to be a waist of time.

I appreciate your opinions

Will · June 28, 2023, 1:19am

So likely what happened here is one of the drives they sent you was a lemon.

Sometimes when a drive is completely borked Synology does not properly offline it from the RAID pool. It continues to retry the dead drive bit by bit, waiting for the timeout every single time. This makes the NAS completely lock up.

What likely was happening is one of your drives had failed, then it was not offlined properly. Pulling it from the Synology would bring it back on.

Stephan_Angele · June 28, 2023, 7:56am

Yes, I confirm that when pulling the drive out of the stations they all became responsive again.

But what really annoys me is that out of 5 shipments totaling 24 disks we received 6 replacement shipments of which 3 contained a disk that failed <1000 / max. 10 power-ups. Packing was OK and since it was always a single disk that failed in different NAS models I do not think that transport or Synology is the root cause of failure.

So far the WD data center disk (ULTRASTAR) are OK but my track record with WD prosumer line disk (WD RED, WD RED PLUS,…) is an absolute desaster.

Will · June 28, 2023, 10:39am

So what I am betting happened is that they gave you drives that had been repaired by them. Those drives will have a much higher failure rate because they already were broken at one time.

I would not judge the entire line by this shipment of drives. Though you can judge the company for sending them to you