Saving Image Files / raw Devices / Blocked Files
The scope of blocked files
Saving a big image file completely with every backup, although only
parts of it have changed, is inefficient: it costs a lot of time and
space. Some examples:
- Some mailers use the traditional mbox (mailbox) format to save
email. This is convenient, because it is a well supported format. But
you will get one big file of perhaps several gigabytes with all your
mails in it. Backing up such a file means backing up everything,
although only a very small part of it has changed.
The .pst files from Outlook fall into the same category. If you
have to save this kind of file (and if the files are big), you should
think about using ``blocked files''.
- If you use an image file with an encrypted file system in it,
as TrueCrypt does, you should back up the encrypted
data, not the files in it. (If you back up the files in it, you need
another encrypted container, which means the backup program has to
know all passwords to run automatically. That is a serious security
hole.)
For that reason you should back up the binary image data as it is. If
you make a simple copy, it will take the size of the image each time
(you also cannot compress this data). This is a perfect situation to
use the storeBackup blocked files feature (without compression):
you can keep lots of historic versions of the image without
needing too much space and without a security hole (storeBackup
does not need to know and does not know anything about the content it
saves).
- Images for hypervisors like Xen, KVM or VMware are another
example which you can save as ``blocked files'' very successfully.
- Do not use blocked files for files which are compressed as a whole,
like jpegs or other types of compressed or encrypted files (.gz,
.bz2, .gpg, etc.). In most cases, changing
something in such a file results in a complete change of all
blocks.
- The feature of blocked files is also not suitable for database
dump files, because storeBackup (up to now) works with fixed
blocks. If you add one byte in the beginning of a file, all blocks
will be different.
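The effect of fixed blocks can be illustrated with a small Python sketch. It is not storeBackup code: the helper names, the block size of 1024 bytes and the random sample data are assumptions chosen for illustration. It compares a change made in place with a single byte added at the beginning of the data.

```python
import hashlib
import random

def block_digests(data, block_size):
    """Return the md5 digest of every fixed-size block of `data`."""
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def count_new_blocks(old, new, block_size):
    """Count blocks of `new` whose content does not occur in `old`."""
    known = set(block_digests(old, block_size))
    return sum(1 for d in block_digests(new, block_size) if d not in known)

BLOCK = 1024                                  # hypothetical block size
rng = random.Random(42)                       # fixed seed, reproducible data
original = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))

overwritten = bytearray(original)
overwritten[5000] ^= 0xFF                     # change one byte in place
inserted = b"X" + original                    # add one byte at the beginning

in_place = count_new_blocks(original, bytes(overwritten), BLOCK)
shifted = count_new_blocks(original, inserted, BLOCK)
print(in_place, "of 16 blocks are new after the in-place change")
print(shifted, "of 17 blocks are new after the insertion")
```

With the in-place change only the block containing the modified byte has to be stored again. After the insertion every block boundary has moved, so no block of the old version can be reused.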
How it works
If you specify a file to be saved as a blocked file (see below how to do
this), then storeBackup.pl will do the following:
- Create a directory in the backup with the same combination of path
and file name as the original image file in the source directory.
- Split the source file into blocks and check if any of these
blocks exist anywhere in a backup (see option otherBackupSeries
of storeBackup.pl). If a block already exists, a hard link is
generated; if it does not exist, the block is copied or stored
compressed.
- The md5 sums of all these block files will be stored in a special file
called .md5BlockCheckSums.bz2 in that directory.
- storeBackup.pl will also calculate the md5 sum of the whole file
and store it in .md5CheckSum.
Because references to existing files are realized via hard links,
every backup is a full backup.
If you use the option lateLinks, the links will be set later. If you
also use the option lateCompress, the compression will also be done later.
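The following Python sketch mimics the steps described above in a very reduced form. It is an illustration under assumptions, not the implementation of storeBackup.pl: the function name, the numbered block file names and the in-memory index (which stands in for looking up blocks in earlier backups) are hypothetical, and the checksum files and compression are left out.

```python
import hashlib
import os
import tempfile

def store_blocked(data, block_size, target_dir, index):
    """Store `data` as block files in `target_dir`.

    `index` maps an md5 digest to the path of an already stored block
    (hypothetical stand-in for searching existing backups).
    Returns the number of blocks that had to be written anew.
    """
    os.makedirs(target_dir)
    written = 0
    for n, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        digest = hashlib.md5(block).hexdigest()
        path = os.path.join(target_dir, "%06d" % n)  # assumed naming scheme
        if digest in index:
            os.link(index[digest], path)     # known content: hard link only
        else:
            with open(path, "wb") as fh:     # new content: copy the block
                fh.write(block)
            index[digest] = path
            written += 1
    return written

with tempfile.TemporaryDirectory() as root:
    index = {}
    day1 = bytes(1000) + b"A" * 1000 + b"B" * 1000
    day2 = bytes(1000) + b"A" * 1000 + b"C" * 1000   # last block changed
    first = store_blocked(day1, 1000, os.path.join(root, "backup1", "image.iso"), index)
    second = store_blocked(day2, 1000, os.path.join(root, "backup2", "image.iso"), index)
    print("blocks written:", first, second)
```

Both backup directories contain all three blocks, so each one can be read as a full backup, but the second run only writes the one block whose content is new.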
How to save image files
There are two ways to configure which files storeBackup.pl should
treat as blocked files:
- The easiest way is using the following options:
- checkBlocksSuffix
- The configuration is similar to
exceptSuffix: a list of suffixes which are checked for a
match, e.g., .vmdk for VMware images. A file
matches if the last part of its name matches
what you define here.
The next options described here are only used if
checkBlocksSuffix is set.
- checkBlocksMinSize
- Only files with at least this size
will be treated as blocked files. You can use the same shortcuts
as described in defining rules, e.g., 50M means 50 megabytes. The default
value is 100M.
- checkBlocksBS
- Defines the block size into which
matching files are split by storeBackup.pl. The format
is the same as for checkBlocksMinSize. The default value is
1M. The minimal value is 10k.
- checkBlocksCompr
- Defines if the blocks are
compressed. Possible values are yes, no or
check. On the command line, set --checkBlocksCompr.
This flag only affects files selected with
checkBlocksSuffix.
Example:
You want to back up all your VMware images and you also have to
back up some Outlook .pst files. storeBackup will use the blocked
file feature for files with a minimum size of 50
megabytes ending with .vmdk or .pst. The block size
chosen is 500k and the resulting blocks in the backup will be
compressed:
checkBlocksSuffix = '\.vmdk' '\.pst'
checkBlocksMinSize = 50M
checkBlocksBS = 500k
checkBlocksCompr = yes
- The more flexible way to specify the handling of blocked files
is to use rules as described in defining rules. The following options are
available five times (i = 0, ..., 4), so there is a checkBlocksRule0,
checkBlocksRule1, checkBlocksRule2,
checkBlocksRule3 and checkBlocksRule4:
- checkBlocksRulei
- The ith rule
specifying files to treat as blocked files in the backup.
- checkBlocksBSi
- The corresponding block size
for the blocks in the backup. The default value is 1 megabyte. The
minimal value is 10k.
- checkBlocksCompri
- If set to yes, the
blocks will be compressed. If set to no, they will not be
compressed. If set to check, storeBackup will decide itself
if they will be compressed. This may result in a mix of compressed
and copied blocks.
- checkBlocksReadi
- Defines a filter for reading
the specified file, e.g., gunzip or gzip -d. This
option may be useful if you have to save an already compressed image
file. (Using the ``blocked file'' feature of storeBackup with
already compressed files compressed as a whole does not make
sense.)
Example:
Let's assume you have a TrueCrypt image on your disk and want to
have a backup of it each time you start storeBackup.pl. You chose the
unremarkable name myPics.iso; the block size is 1M, without
compression. So you define rule 0:
checkBlocksRule0= '$file =~ m#/myPics\.iso$#'
#checkBlocksBS0=
#checkBlocksCompr0=
checkBlocksRule1= '$size > &::SIZE("50M")' and
( '$file =~ m#\.pst$#' or '$file =~ m#windows_D/Outlook/#' )
checkBlocksBS1=200k
checkBlocksCompr1=check
Rule 1 is defined as well. It matches all files bigger than 50
megabytes which end with .pst or are located in the
relative path windows_D/Outlook/ in the backup. (I'm
using this to back up the data of my dual boot laptop.) If you are
not familiar with rules in storeBackup, you should read
section 7.4.
You can use checkBlocksSuffix and checkBlocksRulei at the same
time in one configuration file. storeBackup
evaluates checkBlocksRulei (in ascending order) first and
then checkBlocksSuffix.
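The selection order can be sketched in Python using the two configuration examples from this section. This is only an illustration: the function names are hypothetical, and it assumes that the first matching rule decides, that suffixes are regular expressions anchored at the end of the file name, and that k, M and G are multiples of 1024. None of these details is confirmed by the text above.

```python
import re

# Assumption: size shortcuts are binary multiples.
UNITS = {"": 1, "k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(text):
    """Turn a shortcut such as '50M' or '500k' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([kMG]?)", text)
    if match is None:
        raise ValueError("unsupported size: %r" % text)
    number, unit = match.groups()
    return int(number) * UNITS[unit]

# Stand-ins for checkBlocksRule0 and checkBlocksRule1 of the example.
RULES = [
    (lambda path, size: re.search(r"/myPics\.iso$", path) is not None,
     {"bs": "1M", "compr": "no"}),
    (lambda path, size: size > parse_size("50M")
        and (re.search(r"\.pst$", path) is not None
             or "windows_D/Outlook/" in path),
     {"bs": "200k", "compr": "check"}),
]

# Stand-ins for the checkBlocksSuffix example.
SUFFIXES = [r"\.vmdk", r"\.pst"]
SUFFIX_MIN_SIZE = parse_size("50M")
SUFFIX_SETTINGS = {"bs": "500k", "compr": "yes"}

def select(path, size):
    """Return (option name, settings) for a blocked file, else None."""
    for number, (matches, settings) in enumerate(RULES):  # ascending order
        if matches(path, size):
            return "checkBlocksRule%d" % number, settings
    if size >= SUFFIX_MIN_SIZE and any(
            re.search(suffix + "$", path) for suffix in SUFFIXES):
        return "checkBlocksSuffix", SUFFIX_SETTINGS
    return None

MB = 1024 ** 2
print(select("/home/me/myPics.iso", 700 * MB))
print(select("windows_D/Outlook/archive.pst", 60 * MB))
print(select("vm/disk.vmdk", 60 * MB))
print(select("vm/disk.vmdk", 10 * MB))
```

In this sketch a large .pst file below windows_D/Outlook/ is handled by rule 1 with 200k blocks, although it would also match the suffix list, because the rules are evaluated first.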
How to save mass storage devices
Backing up a mass storage device (like /dev/sdc or
/dev/sdc1) works in the same way as saving an image file with
storeBackup. You choose the device(s) with checkDevicesi, the
block size in the backup with checkDevicesBSi and switch
compression on or off with
checkDevicesCompri. Additionally, you have to specify
the relative path with checkDevicesDiri in the backup
where the contents of the devices will be stored.
The blocks in the backup resulting from image files or devices are
hard linked if storeBackup finds the same contents.
The options are in detail:
- checkDevicesi
- List of devices (e.g.,
/dev/sdd2 /dev/sde1) to backup.
- checkDevicesDiri
- Directory where the devices
are stored in the backup (relative path). The image file
will also be restored in that directory if you restore the backup
with storeBackupRecover.pl (if you use default parameters). In
this directory storeBackup will create a subdirectory whose name
is generated from the parameters of option checkDevices, e.g.,
/dev/sdc will result in dev_sdc.
- checkDevicesBSi
- Defines the block size in which the
devices specified have to be split by storeBackup.pl. The
format is equal to checkBlocksMinSize. The default value
is 1M. The minimal value is 10k.
- checkDevicesCompri
- Defines if the blocks are
compressed. Possible values are yes, no or
check; the default value is no.
This option only affects devices selected with
checkDevicesi. If you set this option to check,
storeBackup decides for every block whether to compress it or not.
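As a small illustration of where a device ends up in the backup, the following Python sketch combines the directory from checkDevicesDiri with the generated subdirectory name. Only the mapping of /dev/sdc to dev_sdc is taken from the description above. The function name, the directory name devices and the treatment of other device paths are assumptions.

```python
import posixpath

def device_backup_dir(check_devices_dir, device):
    """Relative path in the backup for the blocks of `device`."""
    # '/dev/sdc' -> 'dev_sdc' is documented; handling every '/' the same
    # way for other device names is an assumption of this sketch.
    subdir = device.strip("/").replace("/", "_")
    return posixpath.join(check_devices_dir, subdir)

# 'devices' is a hypothetical value for checkDevicesDir0.
print(device_backup_dir("devices", "/dev/sdc"))
```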
Choosing the block size
There is no fixed rule about the ``best'' block size. I made some
measurements about the block size and the used space. The second
backup was done with lateLinks (see section 7.6), so
I could use df again to see how much space was really
needed. The used file system was reiserfs with tail packing. If you
use a file system without tail packing (like ext2, ext3 or ext4), the
overhead will be bigger and small block sizes are less attractive
(same if you use compression). The results also depend on the
application writing to your source image
file.
All the examples were done without compression (for performance
reasons), but with real data. Naturally, I'm using
compression in my real backups. The 2nd backup line shows
the space needed for the changed data. The percentage line below it
shows the size of the second backup relative to the first one. The sum
line shows the sum of the first and second backup; the next line (1x)
shows that sum relative to the sum in the last column (5 megabyte
blocks). The last line (10x) shows the same ratio for
the size of the first backup plus 10 times the second one
(extrapolating to 10 backups). So this should be the most interesting
value.
The first example shows the results when storing a big Outlook .pst
file of 1.2GB with the changes I had from one day to the
next:
BlockSize      |     50k |    100k |    200k |      1M |      5M
1. backup [kB] | 1219253 | 1172263 | 1172863 | 1173801 | 1173724
2. backup [kB] |    7692 |   13445 |   22720 |   73826 |  240885
percentage     |   0.63% |   1.15% |   1.94% |   6.29% |  20.52%
sum [kB]       | 1226945 | 1185708 | 1195583 | 1247627 | 1414609
1x             |  86.73% |  83.82% |  84.52% |  88.20% | 100.00%
10x            |  36.18% |  36.47% |  39.08% |  53.37% | 100.00%
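The derived lines of this table can be recomputed from the two measured lines. The short Python sketch below does this for the first example; the variable and function names are chosen for illustration only, the numbers are the measured values from the table.

```python
sizes = ["50k", "100k", "200k", "1M", "5M"]
first = [1219253, 1172263, 1172863, 1173801, 1173724]   # 1. backup [kB]
second = [7692, 13445, 22720, 73826, 240885]            # 2. backup [kB]

def pct(part, whole):
    """Ratio in percent, rounded to two decimals."""
    return round(100.0 * part / whole, 2)

# percentage line: second backup relative to the first one
percentage = [pct(s, f) for f, s in zip(first, second)]
# sum line: first plus second backup
total = [f + s for f, s in zip(first, second)]
# 1x line: each sum relative to the sum of the 5M column
one_x = [pct(t, total[-1]) for t in total]
# 10x line: first backup plus ten times the second, relative to 5M
ten = [f + 10 * s for f, s in zip(first, second)]
ten_x = [pct(t, ten[-1]) for t in ten]

for row in (sizes, percentage, total, one_x, ten_x):
    print(row)
```

The same calculation applies to the other two examples when their measured values are filled in.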
The second example was done with a smaller Outlook file of 117
megabytes. This is the one for the input folder. The numbers show a
different behavior than in the first example.
BlockSize      |     50k |    100k |    200k |      1M |      5M
1. backup [kB] |  122487 |  118221 |  118891 |  119184 |  119181
2. backup [kB] |   33400 |   51240 |   74424 |  107632 |  119181
percentage     |  27.27% |  43.34% |  62.60% |  90.31% | 100.00%
sum [kB]       |  155887 |  169461 |  193315 |  226816 |  238362
1x             |  65.40% |  71.09% |  81.10% |  95.16% | 100.00%
10x            |  34.82% |  48.10% |  65.84% |  91.19% | 100.00%
The third example shows the results when storing a VMware
image of 2.1 GB. Between the first and the second backup the VM was
booted, a program for updating my navigation system was updated, and
I also connected the navigation system for an update.
BlockSize      |     50k |    100k |    200k |      1M |      5M
1. backup [kB] | 2162595 | 2106781 | 2112547 | 2117178 | 2117094
2. backup [kB] |   53656 |   80609 |  131701 |  438241 | 1112652
percentage     |   2.48% |   3.83% |   6.23% |  20.70% |  52.56%
sum [kB]       | 2216251 | 2187390 | 2244248 | 2555419 | 3229746
1x             |  68.62% |  67.73% |  69.49% |  79.12% | 100.00%
10x            |  20.38% |  21.99% |  25.90% |  49.08% | 100.00%
In all these examples you can see in the last line that at some point
smaller block sizes no longer reduce the space needed. The optimum
seems to be between 50k and 200k (when using tail packing).
There is one additional important aspect about the block size: If you
choose a small block size, the performance will also go down. To be
able to achieve acceptable performance, the following optimizations
are implemented:
- If you do not compress the blocks within storeBackup.pl (no
compression at all or later compression via option lateCompress), no
parallelization is used.
- If you compress the blocks within storeBackup.pl and configure a
block size of 1 megabyte or more, parallelizing is used.
- If you compress the blocks within storeBackup.pl with bzip2 and
configure a block size of less than 1 megabyte, storeBackup.pl tries
to use the perl module IO::Compress::Bzip2. If it is
installed on your system, it will be used.
It is best to make your own tests to get a feeling for useful block
sizes in your use cases.
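One possible starting point for such tests is the following Python sketch. It is not part of storeBackup, and its function names are hypothetical. It takes two versions of a file and estimates, for a given block size, how many bytes of the newer version lie in blocks that do not occur in the older one. File system overhead, compression and blocks from other backups are ignored, so the numbers are only a rough guide. The demonstration uses a randomly generated 1 MiB file with three changed bytes instead of a real image.

```python
import hashlib
import os
import tempfile

def block_hashes(path, block_size):
    """Yield (md5 digest, length) for each fixed-size block of a file."""
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:
                break
            yield hashlib.md5(block).digest(), len(block)

def estimate_second_backup(old_path, new_path, block_size):
    """Bytes of the new file in blocks that the old file does not contain."""
    known = {digest for digest, _ in block_hashes(old_path, block_size)}
    return sum(length
               for digest, length in block_hashes(new_path, block_size)
               if digest not in known)

with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "old.img")
    new_path = os.path.join(tmp, "new.img")
    image = bytearray(os.urandom(1024 * 1024))      # 1 MiB of sample data
    with open(old_path, "wb") as fh:
        fh.write(image)
    for offset in (10, 300000, 900000):             # three in-place changes
        image[offset] ^= 0xFF
    with open(new_path, "wb") as fh:
        fh.write(image)
    estimates = {bs: estimate_second_backup(old_path, new_path, bs)
                 for bs in (50 * 1024, 200 * 1024, 1024 * 1024)}

for block_size, new_bytes in estimates.items():
    print(block_size, new_bytes)
```

For real measurements, replace the generated files with two consecutive versions of your own image file and compare the estimates with the time a backup needs at each block size.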
Heinz-Josef Claes
2014-04-20