There are probably thousands of backup programs. So why another one? The reason arose from my work as a consultant. I was on the road all week and had no way to back up my data at home during the week. All I had was a 250 MB ZIP drive on my parallel port. The ZIP drive did not give me much storage space, and I had to live with low bandwidth (about 200 KB/s) and high latency. In contrast, I wanted fast, simple access to my data. I did not like the usual options of full, differential and incremental backups (e.g. with tar or dump): on the one hand it is usually too cumbersome to retrieve a particular version; on the other hand it is not possible to delete an old backup at will, as this has to be planned carefully when the backups are generated.
It was my goal to be able to back up quickly during my work and find my files quickly and without hassle.
So, at the end of 1999, the first version of storeBackup was created. It was, however, not suitable for large environments: it did not perform well enough, did not scale sufficiently, and could not deal with nasty file names (e.g. a newline character in a name).
Based on the experience with the first version, I wrote a new one, which was published a little less than a year later under the GPL. In the meantime the user base has grown: it ranges from home users, to ISPs, hospitals and universities securing (mail) directories, to general archiving.
What would be an ideal Backup Tool?
The most important aspect of a backup is not just that you are able to restore, but that you can do so easily.
The following reflects backups of files, not databases.
The ideal backup tool would create, every day, a complete copy of the entire file system (including the applicable access rights) on another file system, with minimal effort for the administrator and maximum comfort for the user. The computers and hard disk systems making this possible should, of course, be located in a distant, secure building. With the help of a file system browser, users could search, access and copy their data back directly. The backup would be directly usable, and restoring would be possible without problems or special training. Dealing with backups would become something normal, since going through the administrator would generally be unnecessary.
The process described here has one “small” disadvantage: it needs a lot of hard drive space, and it is quite slow because the total amount of data has to be copied every time.
Reducing Disk Space
The first measure to reduce the necessary hard drive space is the compression of data, where that makes sense. storeBackup allows the use of any compression algorithm as an external program. The default is bzip2.
Looking at the stored data closely, it is apparent that relatively few files change from backup to backup, which is the rationale behind incremental backups. We also find that a backup may contain many files with the same content, because users copy files or because a version control system (like cvs) is active. In addition, users rename files or directory structures, and incremental backups save them again (unnecessarily). The solution is to check the backup for files with the same (possibly compressed) content and to refer to those. storeBackup uses a hard link for this referencing. Thanks to hard links to files already present in existing backups, each file appears in every backup although it physically exists only once on the hard drive. Copying and renaming files or directories costs only the storage space of the hard links, i.e. nearly nothing.
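The space behavior of hard links is easy to demonstrate with standard tools (this is a sketch with plain coreutils, not storeBackup itself): a hard-linked "copy" adds a directory entry but no data blocks.

```shell
# Demonstration that a hard-linked copy occupies no additional data blocks.
dir=$(mktemp -d)
cd "$dir"

dd if=/dev/zero of=original bs=1024 count=1024 2>/dev/null   # 1 MiB file
ln original copy                 # hard link: same inode, no new data

stat -c '%h' original            # link count is now 2
du -sk .                         # ~1024 KiB total, not 2048 KiB
```

Both names refer to the same inode; deleting one of them frees nothing until the last link is gone, which is exactly why every storeBackup backup can be deleted independently.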
Most likely, not just one computer needs to be backed up but several. These often have a high proportion of identical files, especially in directories like /etc, /usr or /home. Obviously, only one copy of identical files should be stored on the backup drive. The simplest solution would be to mount all directories on the backup server and back up all computers in one sweep; duplicate files would then be detected and hard linked. However, this procedure has the disadvantage that all machines to be backed up must be available at backup time. In many cases this is not feasible, for example if notebooks are to be backed up with storeBackup. Notebooks in particular show a high overlap of files, since users create local copies. In such cases, or if servers are backed up independently of one another and the available hard drive space should be used optimally through hard links, storeBackup can hard link files across independent backups (that is, backups independent of each other, possibly from different machines).
For the deletion of old backups, storeBackup offers a set of options. It is a great advantage that each backup is a full backup: any backup may be deleted indiscriminately. Unlike with traditional backups, there is no need to consider whether an incremental backup depends on previous ones. The options permit deleting or keeping backups of specific weekdays, or the first or last backup of the week, month or year. It can be ensured that a minimum number of backups always remains; this is especially useful if backups are not generated on a regular basis. For instance, the last backups of a laptop can be kept until the end of a four-week vacation even though the retention period is set to three weeks. Furthermore, a maximum number of backups can be defined. Additional options resolve conflicts between contradictory rules (by using common sense).
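As an illustration, such a retention policy might be expressed with keep options along the following lines. The option names are my assumption based on storeBackup's documented keep* keywords; verify the exact spelling and value syntax against `storeBackup.pl --help` or your configuration file template before use.

```
# Hypothetical retention sketch -- option names and value syntax
# must be checked against your storeBackup version:
#
#   keepAll 30d           keep every backup for 30 days
#   keepWeekday Sat:90d   keep Saturday backups for 90 days
#   keepFirstOfMonth 1y   keep the first backup of each month for a year
#   keepMinNumber 10      always keep at least 10 backups
#   keepMaxNumber 100     keep at most 100 backups
```

With rules like these, the vacation example above works automatically: keepMinNumber prevents the last backups from being deleted even after the keepAll period has expired.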
The procedure described above assumes that existing backups are checked for identical files before a file is written to a new backup. This applies to files in the previous backup as well as to files already written to the new one. Of course, it does not make much sense to compare every file to be backed up directly with the previous backup. Instead, the md5 sums of the previous backup are compared with the md5 sum of the file to be backed up, using a hash table.
Computing md5 sums is fast, but with a large amount of data it is still not fast enough. For this reason, storeBackup first checks whether the file was altered since the last backup (path + file name, ctime, mtime and size are unchanged). If so, the md5 sum from the last backup is adopted and the hard link is set. If this initial check shows a difference, the md5 sum is computed and a check takes place to see whether another file with the same md5 sum exists. (The comparison across several backup series uses an extended but similarly efficient process.) With this approach, only a few md5 sums have to be calculated per backup. If you want to tune storeBackup, especially when saving via NFS, there are two things you can do:
- tune NFS (see the section on configuring NFS)
- use the lateLinks option of storeBackup, and possibly delete your old backups independently of the backup process.
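The "reuse the md5 sum if the file is unchanged" check can be sketched in shell. This is a simplified illustration of the idea, not storeBackup's actual code; the one-line "database" format is invented for the example.

```shell
# Sketch: reuse the recorded md5 sum when mtime and size are unchanged,
# recompute it otherwise.
db=$(mktemp)                     # record from the "previous backup"
file=$(mktemp)
printf 'hello\n' > "$file"

# Backup run 1: compute and record md5, mtime and size.
md5=$(md5sum "$file" | cut -d' ' -f1)
printf '%s %s %s %s\n' "$md5" "$(stat -c %Y "$file")" "$(stat -c %s "$file")" "$file" > "$db"

# Backup run 2: recompute the md5 sum only if mtime or size differ.
read -r old_md5 old_mtime old_size _ < "$db"
if [ "$(stat -c %Y "$file")" = "$old_mtime" ] && [ "$(stat -c %s "$file")" = "$old_size" ]; then
    md5=$old_md5                               # unchanged: reuse sum, just set the hard link
else
    md5=$(md5sum "$file" | cut -d' ' -f1)      # changed: recompute
fi
echo "$md5"
```

In the unchanged case no file content is read at all, which is why only a few md5 sums have to be calculated per backup run.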
Using storeBackup with lateLinks is like using an asynchronous client/server application, or more precisely like running multiple batch jobs on (normally) multiple machines:
- Check the source directory to find out what has changed and needs to be compressed, and save the relevant data to a safe (separate) place (on the backup server).
- Take this information and build a “normal”, fully linked backup from it.
- Delete old backups according to the deletion rules.
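These steps map onto the storeBackup scripts roughly as follows. The script names appear in this document; the exact option syntax is my assumption, so consult the man pages of your version.

```
# Phase 1 (fast, on the client): write the backup, defer the hard links.
storeBackup.pl --sourceDir /home/user --backupDir /net/backup --lateLinks

# Phase 2 (later, e.g. via cron on the server): set the deferred links.
storeBackupUpdateBackup.pl --backupDir /net/backup

# Phase 3: delete old backups according to the keep rules,
# independently of the backup run itself.
```

The client only has to ship changed data and a description of the links; the link-heavy work runs later on the server, which is why lateLinks helps so much on high-latency mounts.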
The following performance measurements only show the direct backup time (without calling storeBackupUpdateBackup.pl where necessary). They were done with a beta version of storeBackup 2.0.
Some background on the following numbers: the backup was run on an Athlon X2, 2.3 GHz, 4 GB RAM. The NFS server was an Athlon XP, 1.84 GHz, 1.5 GB RAM. The network ran at 100 MBit/s, and storeBackup was used with standard parameters. The measurements are given in hours:minutes:seconds or minutes:seconds. The size of sourceDir was 12 GB; the size of the backup created by storeBackup was 9.2 GB. The backups comprised 4769 directories and 38499 files. storeBackup.pl linked 5038 files internally, which means these were duplicates. The source data were my files and the “Desktop” of my Windows XP laptop, i.e. “real” data.
The first table shows the time for copying the data to the NFS server with standard programs. The NFS server is mounted with the option async, which is a performance optimization and not the standard configuration.
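For reference, async is typically configured where the file system is exported on the NFS server side. A hypothetical entry in /etc/exports (host and path invented for illustration) might look like this:

```
# /etc/exports on the backup server: 'async' lets the server acknowledge
# writes before they reach the disk -- faster, but data written shortly
# before a server crash may be lost.
/backup  client.example.com(rw,async,no_subtree_check)
```

This is the trade-off behind the two measurement series below: async hides write latency, sync does not.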
| command | duration | size of backup |
|---------|----------|----------------|
| cp -a   | 28:46    | 12 GB          |
| tar jcf | 01:58:20 | 9.4 GB         |
| tar cf  | 21:06    | 12 GB          |
Everything is as expected: tar with compression is much slower than the others, and cp is slower than tar because it has to create lots of files. There is one astonishing number: the size of the backup file of tar jcf is 9.4 GB, while the resulting size of the backup with storeBackup.pl is only 9.2 GB. The reason lies in the 5038 internally linked files: with storeBackup, the duplicates are stored only once.
This benchmark does not show the effect of comparing file contents again, but it makes a big difference in performance and especially in used disk space. If only the time stamp of a file changes, traditional backup software will store the file again in an incremental backup, while storeBackup will only create a hard link.
Now let’s run storeBackup.pl on the same contents. The NFS server is still mounted with the option async. There are no changes in the source directory between the first, second and third backups.
| storeBackup | 1.19, Standard | 2.0, Standard | 2.0, lateLinks | mount with async |
|-------------|----------------|---------------|----------------|------------------|
| 2nd backup  | 02:45 (100%)   | 02:25 (88%)   | 00:42 (25%)    | file system read cache empty  |
| 3rd backup  | 01:51 (100%)   | 01:54 (100%)  | 00:26 (23%)    | file system read cache filled |
We can see the following:
- The first run of storeBackup.pl is faster than tar jcf (tar with compression). It is easy to understand why: storeBackup.pl uses both cores of the machine, while compression with tar uses only one. But if you look a little deeper into the numbers, you see that storeBackup.pl needs less than half the time (42%) of tar with compression, even though it additionally calculates all md5 sums and has to carry the overhead of creating thousands of files (compare the difference between cp and tar cf above). The reduction of the copying time by more than 50% comes from two effects: storeBackup.pl does not compress all files (depending on their suffix; e.g. .bz2 files are not compressed again), and it recognizes files with the same content and just sets a hard link (also the reason for 9.2 instead of 9.4 GB).
- The second backup was done with a fresh mount of the source directory, so the read cache for it was not filled. You can see some improvement between versions 1.19 and 2.0 because of better parallelization when reading the data in storeBackup itself. There is no difference in the third run between versions 1.19 and 2.0, because reading the source directory entries is now served from the file system cache, which means the limiting factor is the speed of the NFS server, and that is the same in both runs.
- With the option lateLinks, you can see an improvement by a factor of 4. The time you see depends heavily on the time needed for reading the source directory (plus reading the information from the previous backup, which is always the same).
Now let’s do the same with an NFS mount without “tricks” like configuring async:
| command | duration | size of backup |
|---------|----------|----------------|
| cp -a   | 37:51    | 12 GB          |
| tar jcf | 02:02:01 | 9.4 GB         |
| tar cf  | 25:05    | 12 GB          |
| storeBackup | 1.19, Standard | 2.0, Standard | 2.0, lateLinks | mount with sync |
|-------------|----------------|---------------|----------------|-----------------|
| 2nd backup  | 05:36 (100%)   | 05:24 (96%)   | 00:43 (13%)    | file system read cache empty  |
| 3rd backup  | 05:10 (100%)   | 04:54 (95%)   | 00:27 (9%)     | file system read cache filled |
We can see the following:
- Everything is more or less slower because of the higher latency of the synchronous communication with the NFS server. If only one file is written (as with tar), the difference to the backups with async is small; if many files are written, it is big.
- The difference between sync and async when using lateLinks is very small, and the reason is simple: only a few files are written over NFS, so the latency has only a small impact on the overall backup time. As a result, the backup with lateLinks and a very fast source directory (in the cache) is now 10 times faster.
- Because latency is not important for making a backup this way, I mounted the file server over a VPN across the Internet. This means very high latency and a bandwidth of about 20 KByte/s from the NFS server and 50 KByte/s to the NFS server (as seen in a network monitoring tool). With the same boundary conditions as before (mounted with async, source directory file system in the cache, no changes), lateLinks gave a speedup over a non-lateLinks backup by a factor of 70.
So if your changed or new files are not too big compared with the available bandwidth, you can also use storeBackup (with lateLinks) for making a backup over a VPN on high-latency lines. Naturally, you should not choose the option lateCompress in such a case. Another advantage of lateLinks here is that parallelization works much better, because reading unchanged data in the source directory needs nearly no action on the NFS mount.
Conclusion: If you mount with nfs, you can make it really fast using option lateLinks.
Example of a Run
Here you can see the statistical output of a big backup I ran on my laptop and saved to an NFS server. (I run this backup, including the OS, once or twice a week, and a smaller one every day, similar to the description of example 3, section 5.4.) It had to back up more than 500,000 entries:
STATISTIC 2008.09.08 23:40:17 3961           [sec] |      user |    system
STATISTIC 2008.09.08 23:40:17 3961         -------+-----------+----------
STATISTIC 2008.09.08 23:40:17 3961         process |    386.30 |    166.27
STATISTIC 2008.09.08 23:40:17 3961         childs  |    209.02 |    116.96
STATISTIC 2008.09.08 23:40:17 3961         -------+-----------+----------
STATISTIC 2008.09.08 23:40:17 3961         sum     |    595.32 |    283.23 => 878.55 (14m39s)
STATISTIC 2008.09.08 23:40:17 3961 directories = 43498
STATISTIC 2008.09.08 23:40:17 3961 files = 482516
STATISTIC 2008.09.08 23:40:17 3961 symbolic links = 12024
STATISTIC 2008.09.08 23:40:17 3961 late links = 462267
STATISTIC 2008.09.08 23:40:17 3961 named pipes = 3
STATISTIC 2008.09.08 23:40:17 3961 sockets = 48
STATISTIC 2008.09.08 23:40:17 3961 block devices = 0
STATISTIC 2008.09.08 23:40:17 3961 character devices = 0
STATISTIC 2008.09.08 23:40:17 3961 new internal linked files = 178
STATISTIC 2008.09.08 23:40:17 3961 old linked files = 462089
STATISTIC 2008.09.08 23:40:17 3961 unchanged files = 0
STATISTIC 2008.09.08 23:40:17 3961 copied files = 2896
STATISTIC 2008.09.08 23:40:17 3961 compressed files = 5204
STATISTIC 2008.09.08 23:40:17 3961 excluded files because rule = 78
STATISTIC 2008.09.08 23:40:17 3961 included files because rule = 0
STATISTIC 2008.09.08 23:40:17 3961 max size of copy queue = 22
STATISTIC 2008.09.08 23:40:17 3961 max size of compression queue = 361
STATISTIC 2008.09.08 23:40:17 3961 calced md5 sums = 50606
STATISTIC 2008.09.08 23:40:17 3961 forks total = 9176
STATISTIC 2008.09.08 23:40:17 3961 forks md5 = 3957
STATISTIC 2008.09.08 23:40:17 3961 forks copy = 12
STATISTIC 2008.09.08 23:40:17 3961 forks bzip2 = 5204
STATISTIC 2008.09.08 23:40:17 3961 sum of source = 10G (10965625851)
STATISTIC 2008.09.08 23:40:17 3961 sum of target all = 10.0G (10731903808)
STATISTIC 2008.09.08 23:40:17 3961 sum of target all = 97.87%
STATISTIC 2008.09.08 23:40:17 3961 sum of target new = 109M (114598007)
STATISTIC 2008.09.08 23:40:17 3961 sum of target new = 1.05%
STATISTIC 2008.09.08 23:40:17 3961 sum of md5ed files = 744M (779727492)
STATISTIC 2008.09.08 23:40:17 3961 sum of md5ed files = 7.11%
STATISTIC 2008.09.08 23:40:17 3961 sum internal linked (copy) = 32k (32472)
STATISTIC 2008.09.08 23:40:17 3961 sum internal linked (compr) = 6.2M (6543998)
STATISTIC 2008.09.08 23:40:17 3961 sum old linked (copy) = 3.3G (3515951642)
STATISTIC 2008.09.08 23:40:17 3961 sum old linked (compr) = 6.6G (7094777689)
STATISTIC 2008.09.08 23:40:17 3961 sum unchanged (copy) = 0.0 (0)
STATISTIC 2008.09.08 23:40:17 3961 sum unchanged (compr) = 0.0 (0)
STATISTIC 2008.09.08 23:40:17 3961 sum new (copy) = 11M (11090534)
STATISTIC 2008.09.08 23:40:17 3961 sum new (compr) = 99M (103507473)
STATISTIC 2008.09.08 23:40:17 3961 sum new (compr), orig size = 321M (336637589)
STATISTIC 2008.09.08 23:40:17 3961 sum new / orig = 32.96%
STATISTIC 2008.09.08 23:40:17 3961 size of md5CheckSum file = 16M (16271962)
STATISTIC 2008.09.08 23:40:17 3961 size of temporary db files = 0.0 (0)
STATISTIC 2008.09.08 23:40:17 3961 precommand duration = 1s
STATISTIC 2008.09.08 23:40:17 3961 deleted old backups = 0
STATISTIC 2008.09.08 23:40:17 3961 deleted directories = 0
STATISTIC 2008.09.08 23:40:17 3961 deleted files = 0
STATISTIC 2008.09.08 23:40:17 3961 (only) removed links = 0
STATISTIC 2008.09.08 23:40:17 3961 freed space in old directories = 0.0 (0)
STATISTIC 2008.09.08 23:40:17 3961 add. used space in files = 125M (130869969)
STATISTIC 2008.09.08 23:40:17 3961 backup duration = 27m3s
STATISTIC 2008.09.08 23:40:17 3961 over all files/sec (real time) = 297.30
STATISTIC 2008.09.08 23:40:17 3961 over all files/sec (CPU time) = 549.22
STATISTIC 2008.09.08 23:40:17 3961 CPU usage = 54.13%
It took about 27 minutes to run the backup.
But look at the number of calculated md5 sums: 50,606. This is the number of files a “normal” backup (which does not examine the contents) would have saved, because a time stamp changed or the files were moved (I did not move files around; the changes came mainly from OS updates). storeBackup calculates the md5 sums and recognizes that only 8,100 files (copied + compressed files) have actually changed.
So only 16% of the files that would normally have been saved had to be stored. Over time, this makes a big difference in the space you need for your backups. And naturally, the files in the backup are compressed (where reasonable).
Because the backup ran with the option lateLinks, storeBackupUpdateBackup.pl had to run later (via cron) to set all the links etc.:
INFO 2008.09.09 02:17:52 13323 updating </disk1/store-backup/fschjc-gentoo-all/2008.09.08_23.13.14>
INFO 2008.09.09 02:17:52 13323 phase 1: mkdir, symlink and compressing files
STATISTIC 2008.09.09 02:18:18 13323 created 43498 directories
STATISTIC 2008.09.09 02:18:18 13323 created 12024 symbolic links
STATISTIC 2008.09.09 02:18:18 13323 compressed 0 files
STATISTIC 2008.09.09 02:18:18 13323 used 0.0 instead of 0.0 (0 <- 0)
INFO 2008.09.09 02:18:18 13323 phase 2: setting hard links
STATISTIC 2008.09.09 02:27:55 13323 linked 462267 files
INFO 2008.09.09 02:27:55 13323 phase 3: setting file permissions
STATISTIC 2008.09.09 02:31:05 13323 set permissions for 482442 files
INFO 2008.09.09 02:31:05 13323 phase 4: setting directory permissions
STATISTIC 2008.09.09 02:31:47 13323 set permissions for 43498 directories
It took about 14 minutes to “complete” the backup for 500,000 entries.