Best way to copy files with checksum verification

Hi,
Thanks for the great content here. I have a few questions in order to get started with a new RackStation.

My Scenario:
I bought a new RS1221+ (DSM 7.2 U2, encrypted Btrfs volume) and a new M2 MacBook, but I still have my important data (A priority) on my Intel MacBook. The data is messy and deeply nested (20-30 folder levels), with large files (up to 50 GB) and many small files (< 1 kB), including source code. In total 600 GB with 523k files and 58k folders. Other data with B priority is on an external USB drive and a DS212; it should be copied to the RS at a later stage.

My goal:
The RS should be the single source of truth, from which I run backups and sync to my MacBooks. First, I want to move all files to the RS, then clean up the data and use snapshots along the way.

My questions:

  1. What’s the best way to copy data from macOS to the RS? SMB, Synology Drive or rsync?
  2. I already used Synology Drive to transfer data, but I want to double-check the copied data. What’s the best way to run a checksum comparison on file level?
  3. Are there any limitations of Synology DSM, Btrfs, SMB, Synology Drive or File Station? Anything else besides the 255-byte file name and 4096-byte path length limits?
  4. What’s the best way to move data from the DS212 and the external disk?
  5. Am I missing anything?
1 Like

I can only answer some of your questions.

You basically have to hash all the files you want to copy to your NAS/server first, then copy the files, then check whether the copy was successful, i.e. whether the hashes match. You can always do that yourself with a shell script (I suggest Blake3 via b3sum, which is one of the fastest hashing algorithms around), but it’s tricky because you have to store the file hashes somehow; a minimal sketch of that DIY route follows after the two options below. There are also third-party solutions. I know of two:

(1) the *intch suite by Howard Oakley of The Eclectic Light Company: Fintch, Dintch and the CLI cintch: Spundle, Cormorant, Stibium, Dintch, Fintch and cintch – The Eclectic Light Company … macOS-native apps & freeware

(2) IntegrityChecker, part of diglloydTools: diglloydTools IntegrityChecker … Java-based (i.e. cross-platform), but quite expensive

*intch saves the hashes as extended attributes (metadata), which should persist on network storage, but not with amateur software like NextCloud. IntegrityChecker, on the other hand, saves a list of all file hashes in an .icjh file in the enclosing directory, so it also works on storage (or with software like NextCloud) that doesn’t respect/support macOS/*BSD extended attributes (XAs).
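As for the DIY shell-script route mentioned above, a minimal sketch could look like this. It assumes b3sum is installed (e.g. via Homebrew), the NAS share is already mounted, and both paths are placeholders: write a hash manifest of the originals, copy everything with cp, then re-check the copies against the manifest.

SRC="$HOME/Data"                     # local source folder (placeholder)
DST="/Volumes/NASShare/Data"         # mounted SMB share (placeholder, must already exist)

cd "$SRC"
find . -type f -print0 | xargs -0 b3sum > /tmp/manifest.b3    # hash all originals into a manifest
cp -Rp "$SRC/" "$DST/"                                        # copy; macOS cp keeps xattrs by default
cd "$DST"
b3sum --check /tmp/manifest.b3                                # re-hash the copies and compare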

Personally, I seriously suggest ditching macOS Finder and using a proper file manager. I use Nimble Commander (note: don’t buy it on the Mac App Store!), which you can set to always verify copied files. Once your files have been copied, your server’s file system will hopefully take care of the rest, as ZFS does on TrueNAS or unRAID.

As for rsync, afaik it runs an integrity check after copying a file, so if rsync says a copy is good, the copy should have the same hash as the original. (I have never used rsync directly, but I use Carbon Copy Cloner on macOS, which utilizes rsync under the hood, and file hashes are always compared.) Obviously, this doesn’t apply to plain SMB file copies, which is how most users would approach copying/backing up from client to server. (But there are solutions; see above.)
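If you do end up using rsync directly, one way to double-check a finished copy is a second, checksum-based dry run over the mounted share. The paths below are placeholders, and keep in mind the rsync shipped with macOS is old, so a Homebrew-installed rsync may handle macOS metadata better:

rsync -a /path/to/source/ /Volumes/NASShare/dest/      # initial copy
rsync -acni /path/to/source/ /Volumes/NASShare/dest/   # dry run with checksums; lists anything whose content differs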

Others have to answer regarding Synology Drive and file integrity checks after copy.

As for Synology in general, I can’t say if the file system (btrfs) has implemented regular file integrity checks like ZFS. If it has, you should be good even with an SMB file copy, as long as you use a file manager or a third-party-application-based setup that ensures the file integrity of copied files.

PS (edit): there’s also a freeware GUI frontend for rsync btw: GitHub - rsyncOSX/RsyncOSX: A macOS GUI for rsync. Compiled for macOS Big Sur and later

1 Like

From the Synology website:

Btrfs file self-healing

Traditional storage systems might experience errors that go completely unnoticed which result in corrupt data being provided to applications with no warning or error messages. In order to avoid these types of errors, Btrfs provides checksums for data and metadata, generates two copies of metadata, and then verifies the checksums during each read process. Once discovering a mismatch (silent data corruption), the Btrfs file system is able to auto-detect corrupted files with mirrored metadata, and recover broken data using the supported RAID volumes, including RAID 1, RAID 5, RAID 6, RAID 10, F1, and SHR.

So it’s very similar to ZFS: once the data is on the Synology, provided it’s using btrfs and a suitable RAID, it should be protected against corruption. The only thing we’d then need to ensure is that the copy on the Synology is identical to the original on the source, i.e. on the macOS client machine.

Using the IntegrityChecker Java app would be a no-brainer, because it creates an additional dot file (the .icjh file) that simply gets synced along with the actual data files.

According to the Synology user forums, macOS metadata is preserved too, although there are some glitches regarding which extended attributes macOS Finder will display in a file’s Info window when that file resides on the Synology. The underlying XAs apparently aren’t destroyed by DSM/btrfs, though, so the above solution with the EclecticLight tools (Dintch et al.) should also work, probably even with Synology Drive.
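If you want to confirm that on your own setup before relying on it, a quick round-trip test on the mounted SMB share is enough (user.test is a made-up attribute name and the share path is a placeholder; re-run the -p line after unmounting and remounting to be thorough):

touch "/Volumes/NASShare/xattr-test.txt"
xattr -w user.test hello "/Volumes/NASShare/xattr-test.txt"
xattr -p user.test "/Volumes/NASShare/xattr-test.txt"    # should print: hello
rm "/Volumes/NASShare/xattr-test.txt"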

For scenarios where you are not using rsync but just accessing your server share over SMB from your file manager, and that file manager doesn’t auto-verify copied files, you could build the hashing into a custom file-copy workflow on the client side.

Option #1

(1) Create a keyboard shortcut, e.g. CTRL-CMD-C, and use it to trigger either an AppleScript-based workflow/Quick Action (if you use Finder) or a shell script, if your file manager supports direct shell script execution (e.g. Nimble Commander).

(2) The script receives the full paths to the local files that you have selected in the file manager on your macOS client.

(3) The script then opens a macOS file selection window pointing to your SMB share or Synology Drive shared folder etc., where you can select the destination directory for the file copy.

(4) Before starting any copy process, the script then uses the cintch CLI to hash all the files you have selected for copying… or all the enclosed files, if you have selected one or more directories for copying.

(5) Only when the hashing is complete does the script copy the files over to the NAS/server share you selected before, using a simple cp command, which on macOS preserves extended attributes (metadata) unless you pass special arguments to strip them.

(6) After the copy is complete, the script runs the cintch CLI again, this time on the copies on your SMB share, to compare the hashes stored in the copies’ metadata with the newly calculated hashes.

(7) Play a success sound if there are no errors, or an error sound if cintch has found a hash mismatch, and save a log file next to the source that lists the path(s) of the file(s) that produced the error. (A rough shell sketch of steps 4 to 7 follows this list.)
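Here is that sketch of steps 4 to 7. Since I don’t want to misquote cintch’s actual command-line options, it uses b3sum plus xattr as a stand-in for the hash-and-tag part; the destination, the attribute name and the sounds are placeholders, and it only handles plain files, not directories:

#!/bin/zsh
# Sketch of steps 4-7; the selected source paths arrive as arguments (step 2).
DEST="/Volumes/NASShare/target"            # destination chosen in step 3 (placeholder)
ATTR="user.b3hash"                         # made-up attribute name
for src in "$@"; do
  hash=$(b3sum --no-names "$src")          # step 4: hash the original
  xattr -w "$ATTR" "$hash" "$src"          # store the hash as an extended attribute
  cp -p "$src" "$DEST/"                    # step 5: macOS cp carries the xattr along
  copy="$DEST/${src:t}"                    # zsh :t = basename
  newhash=$(b3sum --no-names "$copy")      # step 6: re-hash the copy on the share
  if [[ "$newhash" == "$(xattr -p "$ATTR" "$copy")" ]]; then
    afplay /System/Library/Sounds/Glass.aiff    # step 7: success sound
  else
    afplay /System/Library/Sounds/Basso.aiff    # error sound, plus a log next to the source
    echo "$copy" >> "${src:h}/copy-errors.log"
  fi
done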

With such a workflow, you would have to train yourself to refrain from using drag-and-drop operations. What could help in that regard is to auto-mount the SMB shares hidden. (I think that’s possible on macOS.)

However, you could probably also make this work for drag-and-drop, but you’d need a more complex workflow, which would be a fairly hacky workaround to “reroute” copy operations.

Option #2

(1) Ensure that the main server share is SMB-auto-mounted when you log in as your main local macOS user, whether hidden or visible.

(2) For the workaround you’d need a local sparsebundle that should also be automounted at login, and definitely mounted as visible. It’s prudent to recreate the sparsebundle anew at every login, after deleting the old one; otherwise it would eventually grow too big. The mounted volume of this sparsebundle is your dummy share.
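For reference, creating and attaching such a sparsebundle could look roughly like this (size, file system, volume name and path are placeholders):

hdiutil create -size 20g -type SPARSEBUNDLE -fs APFS -volname DummyShare ~/DummyShare.sparsebundle
hdiutil attach ~/DummyShare.sparsebundle    # mounts at /Volumes/DummyShare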

(3) Create a regular LaunchAgent, running at load and then e.g. once every hour, that on the dummy share will recreate/update the complete directory structure as present on the main server share:

rsync -a -f"+ */" -f"- *" /path/to/mainServerShare/ /path/to/localDummyShare/

This will sync only the directories and exclude everything else (the trailing slash on the source makes the directory tree land directly inside the dummy share). After mounting the dummy share (sparsebundle) at login, load that LaunchAgent with launchctl; it will run immediately if you have enabled the RunAtLoad key, and afterwards at the regular interval you specify in the agent’s plist. (A minimal plist sketch follows.)
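A minimal LaunchAgent plist for this step could look like the sketch below (label, interval and paths are just examples, and you may want to point it at a Homebrew-installed rsync rather than the stock /usr/bin/rsync). Save it in ~/Library/LaunchAgents and load it with launchctl load ~/Library/LaunchAgents/local.dummyshare.dirsync.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>local.dummyshare.dirsync</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/rsync</string>
        <string>-a</string>
        <string>-f+ */</string>
        <string>-f- *</string>
        <string>/path/to/mainServerShare/</string>
        <string>/path/to/localDummyShare/</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>StartInterval</key>
    <integer>3600</integer>
</dict>
</plist>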

(4) Create a second LaunchAgent running a watcher script that will trigger auxscript.sh when there are file creations on the dummy share volume. You can’t use the WatchPaths key that macOS launchd offers, because sadly it will ignore file changes happening in subdirectories, so you need to use fswatch, which you can install with Homebrew:

fswatch -0 -r --event Created /path/to/localDummyShare | xargs -0 -n 1 -I {} /path/to/auxscript.sh {}

(5) Now, whenever you want to copy a file or folder to your Synology, don’t copy it onto the proper SMB share, but copy it onto your local dummy share into the relevant subfolder.

(6) fswatch will trigger auxscript.sh, which basically does all of the operations stated above under option #1: hash the new file on the dummy share with cintch and store the hash as metadata; copy the file, including metadata, with cp from the dummy share to your regular SMB share on the Synology; then run cintch on the destination file to hash it again and compare the new hash with the hash previously stored as metadata. If the hashes match, delete (unlink/rm -f) the file from the dummy share. If there’s a hash mismatch, don’t unlink the file, but store an error log next to it.

PS (edit): auxscript.sh should of course ignore .DS_Store files, which are created by Finder and certain macOS processes, and not copy those over to your server. And if the path received by auxscript.sh is a directory, no hashing or copying is necessary; auxscript.sh should just run mkdir on the main SMB share. (A sketch of such a script follows.)
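A rough sketch of auxscript.sh, again with b3sum plus xattr standing in for cintch, and with placeholder paths for the dummy and main shares:

#!/bin/zsh
# auxscript.sh sketch: fswatch passes in one newly created path per invocation.
DUMMY="/Volumes/DummyShare"              # mounted sparsebundle (placeholder)
MAIN="/Volumes/NASShare"                 # real SMB share on the Synology (placeholder)
ATTR="user.b3hash"                       # made-up attribute name
src="$1"
rel="${src#$DUMMY/}"                     # path relative to the dummy share
[[ "${src:t}" == ".DS_Store" ]] && exit 0     # ignore Finder droppings
if [[ -d "$src" ]]; then
  mkdir -p "$MAIN/$rel"                  # directories: just recreate them, no hashing
  exit 0
fi
hash=$(b3sum --no-names "$src")          # hash the new file on the dummy share
xattr -w "$ATTR" "$hash" "$src"          # store the hash as metadata
mkdir -p "$MAIN/${rel:h}"                # make sure the target directory exists
cp -p "$src" "$MAIN/$rel"                # copy, incl. xattrs, to the real share
if [[ "$(b3sum --no-names "$MAIN/$rel")" == "$(xattr -p "$ATTR" "$MAIN/$rel")" ]]; then
  rm -f "$src"                           # hashes match (and the xattr survived): remove the file from the dummy share
else
  echo "hash mismatch: $rel" >> "${src:h}/copy-errors.log"   # keep the file and log the error
fi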

PPS (edit): of course, if your file manager includes auto-verification of file copies, and your server is using btrfs or ZFS, you don’t really need any of the above solutions, but even in that case it would still be nice to have the file hash stored as metadata on your server, just in case something happens on your server’s file system that even ZFS or btrfs are unable to detect.

So for stuff like this, the biggest risk is that not all of the data actually gets copied over to the NAS.

For that I normally use Carbon Copy Cloner! It has checksum verification after copying (takes a long time), but then you know all the data is there, exactly.

Then after it gets to the NAS you know BTRFS will take good care of your data!

2 Likes

Thanks a lot for all the great input. Currently, I’m looking into the different options and doing some testing.

It looks like it will be a combination of Dintch and Carbon Copy Cloner (or Nimble Commander).

1 Like