File Deduplication using Base Jobs

A Base job is similar to a Full save, except that its FileSet should contain only files that are unlikely to change in the future (for example, a snapshot of most of your system taken right after installing it). Once the Base job has been run, you specify one or more Base jobs to be used whenever you do a Full save. All files that were backed up in the Base job or jobs and have not been modified since are then excluded from the Full backup. During a restore, the Base jobs are automatically pulled in where necessary.

This can be a very useful optimization for your backups. Imagine that you have 100 nearly identical Windows or Linux machines, each containing the OS and user files. The OS part is backed up once by a Base job, so rather than making 100 copies of the OS, there will be only one. If one or more of the systems have some files updated, that is no problem: those files will be automatically saved and restored.
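The exclusion and restore behavior described above can be pictured with a small Python model. This is only a sketch of the idea, not Bacula's implementation: the `FileMeta` class, the `plan_full_backup` and `restore` functions, and the sample paths are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical stand-in for the attributes recorded per file."""
    size: int
    mtime: int
    checksum: str


def plan_full_backup(base_catalog, current_files):
    """Split the client's files into those the Full must store itself
    and those it can reference from the Base job."""
    stored = {}
    from_base = []
    for path, meta in sorted(current_files.items()):
        if base_catalog.get(path) == meta:
            from_base.append(path)   # unchanged since the Base job
        else:
            stored[path] = meta      # new or modified: saved in the Full
    return stored, from_base


def restore(base_catalog, stored, from_base):
    """Rebuild the client's file list, pulling Base entries in where needed."""
    restored = {path: base_catalog[path] for path in from_base}
    restored.update(stored)
    return restored


base = {
    "/bin/ls": FileMeta(100, 1, "aa"),
    "/etc/hosts": FileMeta(10, 1, "bb"),
}
client = {
    "/bin/ls": FileMeta(100, 1, "aa"),        # identical to the Base copy
    "/etc/hosts": FileMeta(12, 5, "cc"),      # modified on this machine
    "/home/zog/notes": FileMeta(3, 7, "dd"),  # user file, not in the Base
}

stored, from_base = plan_full_backup(base, client)
print(sorted(stored))  # ['/etc/hosts', '/home/zog/notes']
print(from_base)       # ['/bin/ls']
```

With 100 similar clients, each one would store only its own `stored` part, while the entries in `from_base` exist once, in the Base job.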

A new Job directive, Base=Jobx, Joby..., permits you to specify the list of jobs that will be used as the base during a Full backup.

Job {
   Name = BackupLinux
   Level = Base
   ...
}

Job {
   Name = BackupZog4
   Base = BackupZog4, BackupLinux
   Accurate = yes
   ...
}

In this example, the job BackupZog4 will use the most recent version of all files contained in the BackupZog4 and BackupLinux jobs. Base jobs must have been run with Level=Base in order to be used.

By default, Bacula compares the permission bits, user and group fields, modification time, size, and checksum of the file to choose between the current backup and the BaseJob file list. You can change this behavior with the BaseJob FileSet option. This option works like the verify= option, which is described in the FileSet Resource chapter.

FileSet {
  Name = Full
  Include = {
    Options {
       BaseJob  = pmugcs5
       Accurate = mcs
       Verify   = pin5
    }
    File = /
  }
}
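To make the role of the option letters concrete, here is a minimal Python sketch of a comparison driven by an option string. It is not Bacula's code: the `FIELD_FOR_LETTER` table and the `differs` function are hypothetical, and the letter meanings are assumed from the verify= option (only the letters used in this section are listed).

```python
# Assumed mapping from option letters to file attributes, following the
# letters of the verify= option. The field names are hypothetical.
FIELD_FOR_LETTER = {
    "p": "mode",   # permission bits
    "u": "uid",    # user field
    "g": "gid",    # group field
    "m": "mtime",  # modification time
    "c": "ctime",  # change time
    "s": "size",
    "5": "md5",    # MD5 checksum
}


def differs(options, catalog_entry, current_entry):
    """Return True if any attribute selected by `options` differs."""
    for letter in options:
        field = FIELD_FOR_LETTER[letter]
        if catalog_entry[field] != current_entry[field]:
            return True
    return False


# Same content and ownership, but the timestamps differ between the
# Base job's catalog entry and the file found on the client.
in_base = {"mode": 0o755, "uid": 0, "gid": 0, "mtime": 1000,
           "ctime": 1000, "size": 4096, "md5": "9e107d9d"}
on_disk = dict(in_base, mtime=1200, ctime=1200)

print(differs("pmugcs5", in_base, on_disk))  # True: m and c are checked
print(differs("pugs5", in_base, on_disk))    # False: timestamps ignored
```

Under these assumptions, dropping the time letters from the option string lets a file that differs only in its timestamps still be taken from the Base job.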

Important note: The current implementation does not permit scanning a Volume with bscan; the result could not be used to restore files properly. It is therefore recommended not to prune File or Job records when using Base jobs.

The new "M" option letter for the Accurate directive in the FileSet Options block allows comparing the modification time and/or change time against the last backup timestamp. This is in contrast to the existing option letters "m" and/or "c" (mtime and ctime), which are checked against the stored catalog values. Those values can vary across different machines when using the BaseJob feature.

The advantage of the new "M" option letter for Jobs that refer to BaseJobs is that it backs up files based on the last backup time. This is more useful because the mtime/ctime timestamps may differ between Clients, which would otherwise cause files to be backed up unnecessarily.

  Job {
    Name = USR
    Level = Base
    FileSet = BaseFS
    ...
  }

  Job {
    Name = Full
    FileSet = FullFS
    Base = USR
    ...
  }

  FileSet {
    Name = BaseFS
    Include {
      Options {
        Signature = MD5
      }
      File = /usr
    }
  }

  FileSet {
    Name = FullFS
    Include {
      Options {
        Accurate = Ms      # check for mtime/ctime and Size
        Signature = MD5
      }
      File = /home
      File = /usr
    }
  }
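The difference between the "m" and "M" checks can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Bacula's implementation: both function names are hypothetical, and treating "M" as "newer than the last backup time" is an assumption about how the comparison is done.

```python
def changed_per_catalog(catalog_mtime, file_mtime):
    """'m'-style check: compare against the mtime stored in the catalog."""
    return file_mtime != catalog_mtime


def changed_since_last_backup(last_backup_time, file_mtime, file_ctime):
    """'M'-style check (assumed semantics): the file counts as changed
    only if its mtime or ctime is newer than the last backup."""
    return file_mtime > last_backup_time or file_ctime > last_backup_time


# The Base job recorded the file with mtime 1000 on a reference machine.
# On this Client the same file was installed later, so its timestamps are
# 1200, but it has not been touched since the last backup at time 2000.
catalog_mtime = 1000
file_mtime = file_ctime = 1200
last_backup_time = 2000

print(changed_per_catalog(catalog_mtime, file_mtime))                     # True
print(changed_since_last_backup(last_backup_time, file_mtime, file_ctime))  # False
```

In this scenario the catalog-based check would save the file again even though its content is the same, while the check against the last backup time would leave it to the Base job.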

Kern Sibbald 2018-02-03