Synology HyperBackup Versioning: nitty-gritty details

Hello! New to the forums!
(I did try searching on this, but no luck)

I’m trying to figure out precisely how HyperBackup works. It’d be great if Synology issued a white paper on it, but whatever.

Tell me this, all wise folk who are reading this:

Let’s say I have a Syno DS920+ with two 10TB drives mirroring each other. I use BackBlaze B2 for backups. I break down my backups: one set for Plex Media, one set for Family Photos/videos, one set for data backups from my laptop/phone, etc.

Here’s the question: Let’s say that I have some music files, Song1, Song2, and Song3. Let’s also say that I’ve got multiple copies of each scattered around (some in my old laptop backups, some in my Plex Media folder on the NAS, etc.).

Now, when my different HyperBackup backup sets fire off, is the DS920+ making multiple copies in the BackBlaze B2 bucket? (I have one bucket, and multiple HBK backup sets on BB.) Note: I keep BackBlaze “dumb”: I don’t let it control versioning or delete older files, etc.; I have Synology HyperBackup control all of that.

Will I end up with BackupSet1.hbk, BackupSet2.hbk, and BackupSet3.hbk on BB, each containing Song1, Song2, and Song3? Or will there be only one copy of each song on BB, plus some sort of “virtual shortcut” used to save space on the backup?

I know this post is long as hell, and I’ll appreciate any feedback. Thank you all for your time.

Well, here’s a white paper on Synology Drive: not exactly what I’m looking for but here it is: https://global.download.synology.com/download/Document/Software/WhitePaper/Package/SynologyDrive/All/enu/Synology_Drive_WP.pdf

HA! Well, there is some info on backups and versioning on pp. 10-12 of that white paper. Hmm…

OK, dang, this stuff is way above my head. Guess I shoulda been a computer guy. :thinking:

So what you are looking for is backup deduplication. HyperBackup does not do any kind of deduplication, only compression. If you were to back up to a system like ZFS, which has deduplication (or to a cloud service with block-level deduplication), you could save the space. But HyperBackup does not do it, so you would end up with 3x the space used if you have the same file in 3 locations on your NAS.

I can’t speak to Synology’s implementation but perhaps a glimpse down the rabbit hole of the Linux file system will give some insight into backups, deduplication, and incremental synchronization techniques.

DISCLAIMER ALERT
What follows is not intended to be complete and accurate. It’s a cartoon sketch of decades of technological advancement.

A file is a collection of data and a label that allows you to identify it. A directory is just another file; what makes it special is that it is a list of other files, and some of the files in that list can themselves be directories.

At its core, a file is a collection of blocks of data, along with some metadata (size, ownership, timestamps, and so on). This structure is called an inode. Every file is an inode, and since every directory is a file, every directory is an inode too. The filename is just a label we use to identify an inode.

We can give the same inode multiple labels. An inode can simultaneously be /home/user/myfile.txt and /var/log/something_else.lst. That’s called a hard link. The inode keeps track of the number of links with a usage counter: every time a hard link is created, the counter increments; every time a link is removed, it decrements. When the counter reaches zero, the inode and all of its data blocks become available for some other use.
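If you want to see the counter in action, here’s a quick Python sketch (the filenames are made up for illustration, and it only works on a filesystem that supports hard links):

```python
import os

# Make a throwaway file.
with open("myfile.txt", "w") as f:
    f.write("some data\n")

print(os.stat("myfile.txt").st_nlink)   # 1: one name points at this inode

# Give the same inode a second name (a hard link).
os.link("myfile.txt", "same_inode_other_name.txt")
print(os.stat("myfile.txt").st_nlink)   # 2: two names, still one copy of the data

# Removing one name only decrements the counter; the data sticks around
# until the last name is gone and the count hits zero.
os.remove("myfile.txt")
print(os.stat("same_inode_other_name.txt").st_nlink)   # back to 1
```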

So let’s talk about snapshot backups.

A snapshot is a copy of all the inodes on a file system at a specific point in time. The operating system can make this copy very quickly because it’s just the metadata, not a copy of the actual file data. The process amounts to making a directory structure with hard links to the existing inodes. The magic happens the next time any file is changed: instead of altering the existing inode, the changed file is stored as a new inode. The old inode is still around, and will remain around as long as its usage count is above zero, which will be true as long as the snapshot directory exists. The good news is that this is really efficient. The bad news is that deleting a file may not actually recover any disk space; there might be some other link, like the one in the snapshot, that keeps the usage counter from reaching zero.
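To make that concrete, here’s a toy hard-link snapshot in Python, in the spirit of `cp -al` or rsync’s `--link-dest` trick. It is not how Synology’s snapshots are actually implemented (those work at the block level); it’s just the cartoon version, with made-up paths:

```python
import os

def snapshot(src_dir, snap_dir):
    """Toy snapshot: rebuild the directory tree under snap_dir and hard-link
    every file instead of copying its data blocks."""
    for root, _dirs, files in os.walk(src_dir):
        rel = os.path.relpath(root, src_dir)
        os.makedirs(os.path.join(snap_dir, rel), exist_ok=True)
        for name in files:
            # Each link bumps the inode's usage counter; no data is duplicated.
            os.link(os.path.join(root, name),
                    os.path.join(snap_dir, rel, name))

# e.g. snapshot("/volume1/music", "/volume1/snapshots/music-2024-01-01")
```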

Now we can finally get into deduplication and replication at the file level. This actually happens at the block level for greater efficiency, but the concept holds.

Every time a file is saved, it has to be compared against every other file on the system. The usual trick for making that cheap is to calculate a checksum of the file’s contents and use that value as one of the hard-link filenames. If a link with that checksum already exists, the new file is simply linked to the existing inode. Otherwise, a new inode is created.
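Here’s a cartoon of that checksum trick in Python. The store path is made up, and real systems do this per block and worry about hash collisions; this just shows the shape of it:

```python
import hashlib
import os

STORE = "/volume1/.dedup_store"   # hypothetical content-addressed store

def save_deduplicated(path):
    """Collapse identical file contents onto a single inode, keyed by checksum."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    os.makedirs(STORE, exist_ok=True)
    stored = os.path.join(STORE, digest)

    if os.path.exists(stored):
        # Seen this content before: swap the new copy for a hard link
        # to the existing inode.
        os.remove(path)
        os.link(stored, path)
    else:
        # First time we see this content: make it the canonical copy.
        os.link(path, stored)

# Three copies of Song1.mp3 in three folders end up as three names for one inode.
```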

The last backup was done from a snapshot. Every file in that snapshot has a usage count of at least 2: 1 for the original file and 1 for the snapshot link.

This backup starts with a new snapshot. Every unchanged file from the old snapshot now has a usage count of 3, while the newly altered files have a usage count of 2. An incremental backup searches for all files with a usage count of 2 and copies their inodes and data blocks to the remote location.

Some files may have had their usage count drop to 1: they are still in the old snapshot but were deleted (or replaced) before the latest snapshot was taken. When the old snapshot is deleted, they will finally be freed as well.
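And the incremental pass, continuing the same cartoon (the link-count threshold of 2 only holds in this toy model, not on a real backup system):

```python
import os

def changed_since_last_backup(snap_dir):
    """Walk the newest snapshot and yield files that, in the cartoon model,
    are linked only from the live tree and this snapshot (count 2), i.e.
    files created or rewritten since the previous snapshot was taken."""
    for root, _dirs, files in os.walk(snap_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_nlink == 2:
                yield path   # candidate for copying to the remote destination

# for f in changed_since_last_backup("/volume1/snapshots/music-latest"):
#     print("needs upload:", f)
```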
