There are probably thousands of backup programs. So, why another one? The reason arose from my work as a consultant. I was on the road the entire week and had no way to secure my data at home during the week. All I had was a 250 MB ZIP drive on my parallel port. The backup on the ZIP drive did not give me a lot of storage space, and I had to live with low bandwidth (about 200 KB/s) and high latency. In contrast to that, I wanted fast, simple access to my data. I did not like the usual options of full, differential and incremental backups (e.g. with tar or dump): on the one hand it is usually too cumbersome to retrieve one of the versions, on the other hand it is not possible to delete an old backup at will; this has to be planned carefully when the backups are generated.

My goal was to be able to back up quickly during my work and to find my files again quickly and without hassle.

So, at the end of 1999 the first version of storeBackup was created. It was, however, not suitable for large environments: it did not perform well enough, did not scale sufficiently and was not able to deal with nasty file names (e.g. a newline character in a name).

Based on the experience with the first version I wrote a new one, which was published a little less than a year later under the GPL. In the meantime the number of users has grown: it is used by home users, for securing (mail) directories at ISPs, in hospitals and universities, and for general archiving.

What Would Be an Ideal Backup Tool?

The most important aspect of a backup is that you are not only able to restore your data, but that you can do so easily.

The following refers to backups of files, not databases.

The ideal backup tool would create a complete copy of the entire file system (including the applicable access rights) on another file system every day, with minimal effort for the administrator and maximal comfort for the user. The computers and hard disk systems that make this possible should of course be located in a distant, secure building. With the help of a file system browser, users could search the backed-up data, access it and copy files directly back. The backup would be directly usable, and restoring would be possible without problems or special training. Dealing with backups would become something normal, since going through the administrator would generally be unnecessary.

The process described here has a “small” disadvantage: it needs a lot of hard drive space, and it is quite slow because the total amount of data has to be copied every time.

Reducing Disk Space

The first measure to decrease the necessary hard drive storage space is the compression of data, where that makes sense. storeBackup allows the use of any compression algorithm as an external program. The default is bzip2.
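The principle of delegating compression to an external program, and of skipping files whose content is already compressed, can be sketched in a few lines of Python. This is only an illustration of the approach under assumed names (store_compressed, the suffix list); it is not storeBackup's actual code.

import subprocess
from pathlib import Path

# Suffixes of files that are (almost) incompressible; compressing them again would only waste CPU time.
ALREADY_COMPRESSED = {".bz2", ".gz", ".zip", ".png", ".jpg", ".mp3"}

def store_compressed(src: Path, dst: Path, compressor=("bzip2", "-c")) -> Path:
    """Store src in the backup, piping it through an external compressor where sensible."""
    if src.suffix.lower() in ALREADY_COMPRESSED:
        dst.write_bytes(src.read_bytes())                  # store as-is
        return dst
    out = dst.with_name(dst.name + ".bz2")
    with src.open("rb") as fin, out.open("wb") as fout:
        subprocess.run(compressor, stdin=fin, stdout=fout, check=True)
    return out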

Looking at the stored data more closely, it is apparent that relatively few files change from backup to backup, which is the reason for incremental backups. We also find that many files with identical content may exist within one backup, because users copy files or because a version control system (like cvs) is active. In addition, files or directory structures are renamed by users; in incremental backups they are backed up again (unnecessarily). The solution to this is to check the backup for files with the same (possibly compressed) content and to refer to those. Within storeBackup, a hard link is used for referencing. With this trick of hard linking to files that already exist in previous backups, each file is present in every backup although it physically exists only once on the hard drive. Copying and renaming files or directories takes only the storage space of the hard links: nearly nothing.
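The hard link trick can be sketched as follows. This is a simplified illustration of the principle, with the content table held in a plain dictionary; backup_file and md5sum are assumed helper names, not storeBackup's real implementation.

import hashlib
import os
from pathlib import Path

def md5sum(path: Path) -> str:
    """Compute the md5 sum of a file's content."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_file(src: Path, dst: Path, known: dict) -> None:
    """Store src at dst; files with identical content become hard links to one physical copy."""
    digest = md5sum(src)
    existing = known.get(digest)
    if existing is not None:
        os.link(existing, dst)                    # duplicate content: costs only a directory entry
    else:
        dst.write_bytes(src.read_bytes())         # first occurrence: store it physically
        known[digest] = dst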

Most likely not just one computer needs to be backed up but several. They often have a high proportion of identical files, especially in directories like /etc, /usr or /home. Obviously, only one copy of identical files should be stored on the backup drive. The simplest solution would be to mount all directories to be backed up on the backup server and to back up all computers in one sweep. That way duplicate files are detected and hard linked. However, this procedure has the disadvantage that all machines to be backed up have to be available at backup time. In many cases that is not feasible, for example if notebooks are to be backed up with storeBackup. Especially with notebooks we find a high overlap rate of files, since users create local copies. In such cases, or if servers are backed up independently from one another and the available hard drive space shall be used optimally through hard links, storeBackup is able to hard link files across independent backups (meaning backups made independently of each other, possibly from different machines).

storeBackup offers a set of options for deleting old backups. It is a great advantage for deletion that each backup is a full backup: any of them may be deleted at will. Unlike with traditional backups, there is no need to consider whether an incremental backup depends on previous backups. The options permit deleting or keeping backups from specific weekdays, or the first or last existing backup of a week, month or year. It can be ensured that a minimum number of backups remains; this is especially useful if backups are not generated on a regular basis. For example, it is possible to keep the last backups of a laptop until the end of a four week vacation even though the period to keep them is set to three weeks. Furthermore, it is possible to define a maximum number of backups. There are additional options to resolve conflicts between contradictory rules (using common sense).
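Because every backup is a full backup, deciding what to delete is a pure selection over the set of existing backups. The following sketch combines three of the rules mentioned above: a keep period, a minimum and a maximum number of backups. The function and parameter names are illustrative assumptions, not storeBackup's actual options.

from datetime import date

def backups_to_keep(backup_dates: list[date], keep_days: int,
                    min_keep: int, max_keep: int, today: date) -> set[date]:
    """Select which full backups to keep: everything younger than keep_days,
    but never fewer than min_keep and never more than max_keep backups."""
    ordered = sorted(backup_dates, reverse=True)                  # newest first
    keep = [d for d in ordered if (today - d).days <= keep_days]
    if len(keep) < min_keep:                                      # e.g. a laptop unused during vacation
        keep = ordered[:min_keep]
    return set(keep[:max_keep])                                   # never exceed the maximum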

Performance

The procedure described above requires that existing backups are checked for identical files before a file is written to a new backup. This applies to files in the previous backup as well as to files already placed in the new one. Of course it does not make much sense to directly compare every file to be backed up with every file in the previous backup. Instead, the md5 sums of the previous backup are compared with the md5 sum of the file to be backed up, using a hash table.

Computing md5 sums is fast, but with a large amount of data it is still not fast enough. For this reason storeBackup first checks whether the file has changed since the last backup (path + file name, ctime, mtime and size are compared). If the file is unchanged, the md5 sum from the last backup is adopted and the hard link is set. Only if this check shows a difference is the md5 sum computed, and then a lookup takes place to see whether another file with the same md5 sum exists. (The comparison across a number of backup series uses an expanded but similarly efficient process; a minimal sketch of the basic check follows the list below.) With this approach only a few md5 sums need to be calculated per backup. If you want to tune storeBackup, especially if you save via NFS, there are two things you can do:

  • tune NFS (see configuring nfs)
  • use the lateLinks option of storeBackup, and possibly delete your old backups independently of the backup process.
    Using storeBackup with lateLinks is like using an asynchronous client / server application or, more precisely, like running multiple batch jobs on (normally) multiple machines:

    • Check the source directory to find out what has changed and has to be compressed, and save the relevant data to a safe (separate) place (on the backup server).
    • Take this information and generate a “normal”, fully linked backup from it.
    • Delete old backups according to the deletion rules.
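Here is the minimal sketch of the change check announced above. The layout of the metadata from the previous backup is an assumption for illustration (storeBackup's real implementation, with its dbm files and support for several backup series, is more elaborate); compute_md5 could be the md5sum helper from the earlier sketch.

import os

def md5_for_file(path, previous, compute_md5):
    """Return (md5, unchanged) for path.

    previous maps path -> (ctime, mtime, size, md5) from the last backup.
    Unchanged files adopt the old md5 sum and are simply hard linked;
    only changed or new files get a freshly computed md5 sum, which is then
    looked up in the table of already known contents.
    """
    st = os.stat(path)
    old = previous.get(path)
    if old is not None and old[:3] == (st.st_ctime, st.st_mtime, st.st_size):
        return old[3], True                  # path, ctime, mtime and size are the same
    return compute_md5(path), False          # only now is an md5 sum actually calculated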

The following performance measurements show only the direct backup time (without calling storeBackupUpdateBackup.pl, if necessary). They were made with a beta version of storeBackup 2.0.
Some background information on the following numbers: the backup was run on an Athlon X2, 2.3 GHz, 4 GB RAM. The NFS server was an Athlon XP, 1.84 GHz, 1.5 GB RAM. The network was running at 100 MBit/s, and storeBackup was used with standard parameters. The measurements are given in hours:minutes:seconds or minutes:seconds. The size of sourceDir was 12 GB; the size of the backup done with storeBackup was 9.2 GB. The backups comprised 4769 directories and 38499 files. storeBackup.pl linked 5038 files internally, which means these were duplicates. The source data were my files and the “Desktop” of my Windows XP laptop, so “real” data.
The first table shows the time for copying the data to the nfs server with standard programs. The nfs server is mounted with option async, which is a performance optimization and not the standard configuration.

command | duration | size of backup
cp -a   | 28:46    | 12 GB
tar jcf | 01:58:20 | 9.4 GB
tar cf  | 21:06    | 12 GB

Everything is as expected: tar with compression is much slower than the others, and cp is slower than tar because it has to create lots of files. There is one astonishing number: the size of the backup file of tar jcf is 9.4 GB, while the resulting size of the backup with storeBackup.pl is only 9.2 GB. The reason is the 5038 internally linked files: the duplicates are stored only once with storeBackup.

The effect of comparing file contents is not visible in this benchmark, but it makes a big difference in performance and especially in the disk space used. If the time stamp of a file is changed, traditional backup software will store the file again in an incremental backup; storeBackup will only create a hard link.

Now let’s run storeBackup.pl on the same contents. The nfs server is still mounted with option async. There were no changes in the source directory between the first, second and third backups.

storeBackup | 1.19, Standard | 2.0, Standard | 2.0, lateLinks | mount with async
1. backup   | 49:51 (100%)   | 49:20 (99%)   | 31:14 (63%)    |
2. backup   | 02:45 (100%)   | 02:25 (88%)   | 00:42 (25%)    | file system read cache empty
3. backup   | 01:51 (100%)   | 01:54 (100%)  | 00:26 (23%)    | file system read cache filled

We can see the following:

  • The first run of storeBackup.pl is faster than tar jcf (tar with compression). It’s easy to understand why: storeBackup.pl uses both cores of the machine, while the compression with tar uses only one. But if you look a little deeper into the numbers, you see that storeBackup.pl needs less than half the time (42%) of tar with compression, even though it additionally calculates all md5 sums and has to bear the overhead of creating thousands of files (look at the difference between cp and tar cf above). The reduction of more than 50% in copying time comes from two effects: storeBackup.pl does not compress all files (depending on their suffix, e.g. .bz2 files are not compressed again), and it recognizes files with the same content and just sets a hard link (also the reason for 9.2 instead of 9.4 GB).
  • The second backup was done with a fresh mount of the source directory, so its read cache was not filled. You can see some improvement between version 1.19 and 2.0 because of better parallelization when reading the data in storeBackup itself.
    There is no difference in the third run between version 1.19 and 2.0, because the source directory entries are now in the file system cache, which means the limiting factor is the speed of the nfs server, and that is the same in both runs.
  • With option lateLinks, you can see an improvement by a factor of 4. The time you see depends massively on the time needed for reading the source directory (plus reading the information from the previous backup, which is always the same).

Now let’s do the same with an nfs mount without “tricks” like configuring async:

command | duration | size of backup
cp -a   | 37:51    | 12 GB
tar jcf | 02:02:01 | 9.4 GB
tar cf  | 25:05    | 12 GB

 

storeBackup | 1.19, Standard | 2.0, Standard | 2.0, lateLinks | mount with sync
1. backup   | 53:35 (100%)   | 49:20 (100%)  | 38:53 (63%)    |
2. backup   | 05:36 (100%)   | 05:24 (96%)   | 00:43 (13%)    | file system read cache empty
3. backup   | 05:10 (100%)   | 04:54 (95%)   | 00:27 (9%)     | file system read cache filled

We can see the following:

  • Everything is more or less slower because of the higher latency of synchronous communication with the nfs server. If only one file is written (like with tar), the difference to the backups with async is smaller; if many files are written, it is bigger.
  • The difference between sync and async when using lateLinks is very small, and the reason is simple: only a few files are written over nfs, so the latency has only a small impact on the overall backup time. As a result, the backup with lateLinks and a very fast source directory (cache) is now 10 times faster.
  • Because the latency is not important for making a backup, I mounted this file server over a VPN over the Internet. This means very high latency and a bandwidth of about 20 KByte/s from the nfs server and 50 KByte/s to the nfs server (as seen on a network monitoring tool). With the same boundary conditions as before (mounted with async, source directory file system in cache, no changes) I got a speed-up with lateLinks (compared with a non-lateLinks backup) by a factor of 70.
    So if your changed or new files are not too big compared with the available bandwidth, you can also use storeBackup (with lateLinks) for making a backup over a VPN on high latency lines. Naturally, you should not choose the option lateCompress in such a case. Another advantage of lateLinks in such cases is that parallelization works much better, because reading unchanged data in the source directory requires almost no action on the NFS mount.

Conclusion: if you back up over an NFS mount, you can make it really fast by using the option lateLinks.

Example of a Run

Here you can see the statistical output of a big backup that I ran on my laptop and saved to an NFS server. (I run this backup, including the OS, once or twice a week and a smaller one every day, similar to the description of example 3, section 5.4.) I had to back up more than 500,000 entries:

STATISTIC 2008.09.08 23:40:17  3961  [sec] |      user|    system
STATISTIC 2008.09.08 23:40:17  3961 -------+----------+----------
STATISTIC 2008.09.08 23:40:17  3961 process|    386.30|    166.27
STATISTIC 2008.09.08 23:40:17  3961 childs |    209.02|    116.96
STATISTIC 2008.09.08 23:40:17  3961 -------+----------+----------
STATISTIC 2008.09.08 23:40:17  3961 sum    |    595.32|    283.23 => 878.55 (14m39s)
STATISTIC 2008.09.08 23:40:17  3961                    directories = 43498
STATISTIC 2008.09.08 23:40:17  3961                          files = 482516
STATISTIC 2008.09.08 23:40:17  3961                 symbolic links = 12024
STATISTIC 2008.09.08 23:40:17  3961                     late links = 462267
STATISTIC 2008.09.08 23:40:17  3961                    named pipes = 3
STATISTIC 2008.09.08 23:40:17  3961                        sockets = 48
STATISTIC 2008.09.08 23:40:17  3961                  block devices = 0
STATISTIC 2008.09.08 23:40:17  3961              character devices = 0
STATISTIC 2008.09.08 23:40:17  3961      new internal linked files = 178
STATISTIC 2008.09.08 23:40:17  3961               old linked files = 462089
STATISTIC 2008.09.08 23:40:17  3961                unchanged files = 0
STATISTIC 2008.09.08 23:40:17  3961                   copied files = 2896
STATISTIC 2008.09.08 23:40:17  3961               compressed files = 5204
STATISTIC 2008.09.08 23:40:17  3961    excluded files because rule = 78
STATISTIC 2008.09.08 23:40:17  3961    included files because rule = 0
STATISTIC 2008.09.08 23:40:17  3961         max size of copy queue = 22
STATISTIC 2008.09.08 23:40:17  3961  max size of compression queue = 361
STATISTIC 2008.09.08 23:40:17  3961                calced md5 sums = 50606
STATISTIC 2008.09.08 23:40:17  3961                    forks total = 9176
STATISTIC 2008.09.08 23:40:17  3961                      forks md5 = 3957
STATISTIC 2008.09.08 23:40:17  3961                     forks copy = 12
STATISTIC 2008.09.08 23:40:17  3961                    forks bzip2 = 5204
STATISTIC 2008.09.08 23:40:17  3961                  sum of source =  10G (10965625851)
STATISTIC 2008.09.08 23:40:17  3961              sum of target all = 10.0G (10731903808)
STATISTIC 2008.09.08 23:40:17  3961              sum of target all = 97.87%
STATISTIC 2008.09.08 23:40:17  3961              sum of target new = 109M (114598007)
STATISTIC 2008.09.08 23:40:17  3961              sum of target new = 1.05%
STATISTIC 2008.09.08 23:40:17  3961             sum of md5ed files = 744M (779727492)
STATISTIC 2008.09.08 23:40:17  3961             sum of md5ed files = 7.11%
STATISTIC 2008.09.08 23:40:17  3961     sum internal linked (copy) =  32k (32472)
STATISTIC 2008.09.08 23:40:17  3961    sum internal linked (compr) = 6.2M (6543998)
STATISTIC 2008.09.08 23:40:17  3961          sum old linked (copy) = 3.3G (3515951642)
STATISTIC 2008.09.08 23:40:17  3961         sum old linked (compr) = 6.6G (7094777689)
STATISTIC 2008.09.08 23:40:17  3961           sum unchanged (copy) = 0.0  (0)
STATISTIC 2008.09.08 23:40:17  3961          sum unchanged (compr) = 0.0  (0)
STATISTIC 2008.09.08 23:40:17  3961                 sum new (copy) =  11M (11090534)
STATISTIC 2008.09.08 23:40:17  3961                sum new (compr) =  99M (103507473)
STATISTIC 2008.09.08 23:40:17  3961     sum new (compr), orig size = 321M (336637589)
STATISTIC 2008.09.08 23:40:17  3961                 sum new / orig = 32.96%
STATISTIC 2008.09.08 23:40:17  3961       size of md5CheckSum file =  16M (16271962)
STATISTIC 2008.09.08 23:40:17  3961     size of temporary db files = 0.0  (0)
STATISTIC 2008.09.08 23:40:17  3961            precommand duration = 1s
STATISTIC 2008.09.08 23:40:17  3961            deleted old backups = 0
STATISTIC 2008.09.08 23:40:17  3961            deleted directories = 0
STATISTIC 2008.09.08 23:40:17  3961                  deleted files = 0
STATISTIC 2008.09.08 23:40:17  3961           (only) removed links = 0
STATISTIC 2008.09.08 23:40:17  3961 freed space in old directories = 0.0  (0)
STATISTIC 2008.09.08 23:40:17  3961       add. used space in files = 125M (130869969)
STATISTIC 2008.09.08 23:40:17  3961                backup duration = 27m3s
STATISTIC 2008.09.08 23:40:17  3961 over all files/sec (real time) = 297.30
STATISTIC 2008.09.08 23:40:17  3961  over all files/sec (CPU time) = 549.22
STATISTIC 2008.09.08 23:40:17  3961                      CPU usage = 54.13%

It took about 27 minutes to run the backup.

But look at the number of calced md5 sums: 50,606. This is the number of files a “normal” backup (which does not examine the contents) would have saved, because a time stamp had changed or the files had moved (I did not move files around; the changes came mainly from OS updates). storeBackup calculates the md5 sums and recognises that only 8,100 files (2,896 copied + 5,204 compressed) had actually changed.
So only about 16% of the files which would normally have been saved had to be stored. Over time, this makes a big difference in the space you need for your backups. And naturally, the files in the backup are compressed (where reasonable).

Because the backup ran with option lateLinks, I later had to run (via cron) storeBackupUpdateBackup.pl to set all the links etc.:

INFO      2008.09.09 02:17:52 13323 updating </disk1/store-backup/fschjc-gentoo-all/2008.09.08_23.13.14>
INFO      2008.09.09 02:17:52 13323 phase 1: mkdir, symlink and compressing files
STATISTIC 2008.09.09 02:18:18 13323 created 43498 directories
STATISTIC 2008.09.09 02:18:18 13323 created 12024 symbolic links
STATISTIC 2008.09.09 02:18:18 13323 compressed 0 files
STATISTIC 2008.09.09 02:18:18 13323 used 0.0  instead of 0.0  (0 <- 0)
INFO      2008.09.09 02:18:18 13323 phase 2: setting hard links
STATISTIC 2008.09.09 02:27:55 13323 linked 462267 files
INFO      2008.09.09 02:27:55 13323 phase 3: setting file permissions
STATISTIC 2008.09.09 02:31:05 13323 set permissions for 482442 files
INFO      2008.09.09 02:31:05 13323 phase 4: setting directory permissions
STATISTIC 2008.09.09 02:31:47 13323 set permissions for 43498 directories

It took about 14 minutes to “complete” the backup for 500,000 entries.