Talk for the GNU Hackers Meeting 2019

The lzip format

Summary

This talk overviews the lzip compressed format, its features, implementations and uses, including its combination with the tar format to implement an appendable, robust and parallelizable compressed archive format. It also introduces the recently improved combined data recovery capabilities of lziprecover and GNU ddrescue.

This talk is available at http://www.nongnu.org/lzip/lzip_talk_ghm_2019.html and http://www.nongnu.org/lzip/lzip_talk_ghm_2019_es.html (slides)

Introduction

There are a lot of compression algorithms, and they continue inventing them. Most of those algorithms are just variations of some other algorithm in order to optimize some feature, generally in detriment of the rest. Some compress very well at the cost of being unbearably slow, while others compress fast at the cost of a reduced compression ratio. Some even try to speed up decompression by removing integrity checking. Data compression is like everything else; you can do it well or you can do it fast, but not both.

General-purpose data compression is a well studied field. The basic ideas of compression algorithms were discovered early in the history of computer science. Deflate and LZMA are based on ideas discovered in the 1970s. It is not expected that an algorithm much better than LZMA will appear in the foreseeable future.

Moreover, the formats existing when lzip was designed in 2008 (gzip and bzip2) have limitations that aren't easily fixable.

Therefore, it seemed adequate to pack a good algorithm like LZMA into a well designed format so that it could be used for all tasks requiring reliability or interoperability. For example, long-term archiving or distribution of compressed data for a wide range of machines, from embedded devices to mainframes. Lzip is an attempt at developing such a format.

Why a new format and tool?

You may be wondering why I developed lzip instead of adding LZMA compression to gzip, for example. This would have avoided the need for a new file name extension, and people could get support for the new format simply by virtue of updating their gzip. After all, a new format and tool is an annoyance, no?

It seems a good idea, and people have been proposing it since long ago. For example, H. Peter Anvin tried it in 2006: http://lkml.org/lkml/2006/9/21/236

"I have been holding out on implementing LZMA on kernel.org, because just as zip (deflate) didn't become common in the Unix world until an encapsulation format that handles things expected in the Unix world, e.g. streaming, was created (gzip), I don't think LZMA is going to be widely used until there is an "lzip" which does the same thing. I actually started the work of adding LZMA support to gzip, but then realized it would be better if a new encapsulation format with proper 64-bit support everywhere was created."

Note that in this context, "64-bit support" means "64-bit file sizes" not "64-bit processors".

The reason why this didn't work is that the gzip format was designed long ago, has limitations, and can't be easily extended without imposing these limitations to the new algorithm. See in the simplified diagram below how gzip lacks an index and stores only the lower 32 bits of the uncompressed size (ISIZE). (We will see later why these limitations are important).

+=============+=====================+-+-+-+-+-+-+-+-+
| gzip header |  compressed blocks  | CRC32 | ISIZE |        <-- no index
+=============+=====================+-+-+-+-+-+-+-+-+

So I developed a new encapsulation format with the required features and called it lzip. If it is ever needed, I expect that the lzip format will be able to seamlessly and reliably incorporate a new compression algorithm.

Lzip has been designed, written and tested with great care to replace gzip and bzip2 as the standard general-purpose compressed format for unix-like systems, specially for the GNU operating system. In this talk we'll see some of the lessons learned from these previous formats, and their application to the design of lzip.

Features of the LZMA algorithm

Lzip uses an independent implementation of the LZMA "algorithm" invented/discovered by Igor Pavlov. In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a concrete algorithm; it is more like "any algorithm using the LZMA coding scheme". For example, the option '-0' of lzip uses the scheme in almost the simplest way possible; issuing the longest match it can find, or a literal byte if it can't find a match. Using different (but compatible) variations of LZMA a wide range of compression ratios and speeds can be achieved, from about the speed of gzip to beyond the compression ratio of bzip2. Single-threaded decompression speed is intermediate between gzip and bzip2.

LZMA has two main parameters: dictionary size and match length limit. LZMA is much like deflate, the algorithm of gzip, with two main differences that account for its higher compression ratio. First, LZMA can use a dictionary size thousands of times larger than deflate. Second, LZMA uses a range encoder as its last stage, instead of the less efficient (but faster) Huffman coding used by deflate.

Lzip currently implements two variants of the LZMA algorithm; fast (used by option '-0') and normal (used by all other compression levels).

Features of the lzip format

A lzip file is formed by one or more "members". Each member has the following structure:

+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID string | VN | DS | LZMA stream | CRC32 |   Data size   |  Member size  |
+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Lzip allows a fast multi-threaded decompression thanks to its distributed index (the 'Member size' field in the diagram above).

Lzip limits the dictionary size to 1/8 of the maximum allowed by LZMA, which has two advantages. First, it allows efficient decompression on 32 bit machines. Second, it transforms the unused sizes into embedded error detection, which is the basis of some of lziprecover's recovery capabilities.

Lzip provides a robust and very safe 3 factor integrity checking. The estimated probability of an undetected error (false negative) is of about 6e-17. This high level of safety is another of the bases of lziprecover's recovery capabilities. Preliminary results even suggest that the lzip format is safe enough to be used in critical safety avionics systems.

The lzip format is as simple as possible (but not simpler). It offers all the features required for efficient and safe data sharing and long-term archiving, but no more features. Because for interoperability and long-term archiving, simple is robust.

The lzip format does not store any metadata of the uncompressed file, except its size. Therefore, lzip files are reproducible; lzip produces identical compressed output from identical input.

Lzip is prepared for the future. It offers 64-bit uncompressed and compressed sizes, and the reference implementation automatically creates a multimember file when the size is too large for a single member, allowing for an unlimited file size.

Distributed index

The lzip format provides a distributed index (the 'Member size' field in the diagram below). An index is essential for efficient multi-threaded decompression, improves the verification of stream integrity, helps in the recovery of damaged files by lziprecover, and allows the listing of correct file sizes even for multimember files.

Neither the gzip format nor the bzip2 format do provide an index. Gzip and bzip2 were designed before the multi-processor era, when an index was not so important.

(The diagrams for gzip and bzip2 formats are simplified. The bzip2 trailer is not byte-aligned, but this is not important here).

gzip                                         32-bit size
+=============+=====================+-+-+-+-+-+-+-+-+
| gzip header |  compressed blocks  | CRC32 | ISIZE |        <-- no index
+=============+=====================+-+-+-+-+-+-+-+-+

bzip2
+==============+====================+-+-+-+-+
| bzip2 header | compressed blocks  | CRC32 |       <-- no size, no index
+==============+====================+-+-+-+-+

lzip                                         64-bit sizes      index entry
+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID string | VN | DS | LZMA stream | CRC32 |   Data size   |  Member size  |
+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Note that gzip only stores the least significant 32 bits of the uncompressed size (ISIZE). This is probably the most frequently reported shortcoming of the gzip format because the size of any file larger than 4 GiB is stored truncated. Bzip2 does not even store the uncompressed size of the file.

Parallel compression and decompression

Parallel compression is easy; one just breaks the input into chunks, compresses the chunks in parallel and writes the compressed chunks in order to the output. On the other hand, efficient parallel decompression requires a format designed for the task.

An index like the one provided by the lzip format is essential for efficient multi-threaded decompression because it tells the threads where to start decompressing. In these diagrams of multimember files, one can see how the header of each lzip member can be located by starting from the end of the file and subtracting each 'member size' (MS) until the beginning of the file is reached. In the case of gzip, there is no way to tell where the members start without decompressing them.

gzip
+=============+=============+=============+=============+=============+
|   member    |   member    |   member    |   member    |   member    |
+=============+=============+=============+=============+=============+

 __________    __________    __________    __________    __________
|          |  |          |  |          |  |          |  |          |
V          |  V          |  V          |  V          |  V          |
+=============+=============+=============+=============+=============+
| member | MS | member | MS | member | MS | member | MS | member | MS |
+=============+=============+=============+=============+=============+
lzip

Thanks to the index, the multi-threaded implementation of lzip (plzip) is one of the fastest decompressors available. Without an index, the file needs to be parsed sequentially to find the decompression entry points. In the case of gzip, not even the sequential parsing is possible in practice because of the small size of the identification string in the gzip header (16 bits). Therefore, gzip decompression can't be parallelized. As a result plzip is able to decompress regular files faster than pigz (the multi-threaded implementation of gzip) on machines with more than 4 processors.

Plzip can also decompress in parallel data from non-seekable streams because the 'Member size' field validates the identification string in the lzip header, reliably finding the decompression starting points.

Implementations

It is very important for a format with aspirations to become universal that the tools processing it can be ported to all kind of architectures and machines. Specially to the oldest and less capable, that is where problems normally arise.

These are the lzip tools that I maintain. Each one of them offers some unique feature:

Lzip          The reference implementation (C++).
Clzip         A C implementation of lzip for systems lacking a C++ compiler.
Plzip         A multi-threaded compressor using the lzip file format.
Lzlib         A compression library for the lzip file format, written in C.
Lziprecover   A data recovery tool and decompressor for the lzip format.
Lunzip        A decompressor for lzip files, written in C.
Pdlzip        A limited, "public domain" C implementation intended for those
              who can't distribute or use GPL licensed Free Software.
Lzd           An educational decompressor for the lzip format.
Tarlz         An archiver with multi-threaded lzip compression.

We will talk about lziprecover and tarlz later.

Other people also maintain tools and libraries with support for the lzip format.

Quality of implementation

Just like the lzip format provides 3 factor protection against undetected data corruption, the development methodology of the lzip family of compressors provides 3 factor protection against undetected programming errors.

Three related but independent compressor implementations, lzip, clzip and minilzip/lzlib, are developed concurrently. Every stable release of any of them is subjected to a hundred hours of intensive testing to verify that it produces identical output to the other two. This guarantees that all three implement the same algorithm, and makes it unlikely that any of them may contain serious undiscovered errors. In fact, no errors (beyond portability issues) have been discovered in lzip since 2009.

Additionally, the three implementations have been extensively tested with unzcrash, valgrind and 'american fuzzy lop' without finding a single vulnerability or false negative.

The lzip format is adequate for both new and old hardware. Either 64-bit or 32-bit. From supercomputers to embedded devices. This allows users of old hardware to continue using it, avoiding the contamination caused by its replacement.

For example, clzip, the C language version of lzip, can be compiled on machines where gzip can't even be configured. And lunzip provides a 'low memory' mode able to decompress any file using as little memory as 50 kB, irrespective of the dictionary size used to compress the file. Both have been tested on a laptop with a 486 processor and 4 MiB of RAM.

Lziprecover

Lziprecover provides unique data recovery capabilities because the lzip format has been designed from the beginning to be simple and safe. It also helps that the LZMA data stream as used by lzip is extraordinarily safe because it provides embedded error detection. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decompressor to detect the approximate position of errors, and leaving very little work for the check sequence (CRC and data sizes) in the detection of errors. Lzip is usually able to detect all possible bit flips in the compressed data without resorting to the check sequence. It would be difficult to write an automatic recovery tool like lziprecover for the gzip format. And, as far as I know, it has never been written.

The lzip format does not include redundant data blocks nor error correcting codes (ECC). It is simply that the integrity checking of the lzip format is safe enough as to allow lziprecover to repair small errors or find and combine the good parts of several damaged copies with almost no danger of doing it incorrectly.

In addition to decompressing and testing the integrity of lzip files, lziprecover is able to:

Repair slightly damaged files.
Merge the good parts of two or more damaged copies.
Reproduce a missing (zeroed) sector using a reference file.
Extract the recoverable data from damaged files.
Remove the damaged members from multimember files.
Provide random access to the data in multimember files.
Manage metadata stored as trailing data in lzip files.

The development version of lziprecover has introduced recently a new recovery method (reproduce mode) that may be useful to recover from hardware errors in series of backup copies or source tarballs. Reproduce mode is based on the fact that usually only a small fraction of the data changes from one version of a backup or source tarball to the next. Thus it is sometimes possible to fix a corrupted tarball, which ddrescue was unable to read without errors, by using reference data from another tarball corresponding to a near version. The reference data is used by lziprecover to fill the missing sector, fixing the file.

Let's see how it works.

Imagine that your hard drive fails. You try to recover the data with ddrescue, but ddrescue is unable to read the data corresponding to the sector 9 ('s9' in the diagram below). This leaves a "hole" in the tar.lz archive containing 's9'. In this case, the compressed data originally in this hole were the result of compressing part of 'file 2', which is itself part of a tar archive. Now, if another copy of 'file 2' is found somewhere, either alone or inside a different tar.lz archive, lziprecover can use lzip to compress again the part of 'file 2' that is missing, and fill 's9' with it. It does not matter if the reference copy of 'file 2' used has been modified, as long as the modification does not affect the part that needs to be compressed to reproduce the data missing from 's9'.

Note: in the diagram below, file sizes look the same, but the uncompressed tar archives shown are usually much larger than the corresponding compressed tar.lz archives.

disc sectors as read by ddrescue (data from 's9' are missing)
+====+====+====+====+====+====+====+====+====+=====+=====+=====+=====+
| s1 | s2 | s3 | s4 | s5 | s6 | s7 | s8 | s9 | s10 | s11 | s12 | s13 |
+====+====+====+====+====+====+====+====+====+=====+=====+=====+=====+

tar.lz (stored in s1 to s13)
+====================================================================+
|                            lzip member                             |
+====================================================================+

tar archive corresponding to the tar.lz above
+================+================================+============+=====+
|     file 1     |             file 2             |   file 3   | EOF |
+================+================================+============+=====+

Some other tar also containing a copy of 'file 2'
's9' is reproduced using data from this copy of 'file 2'
+================================+================+============+=====+
|             file 2             |     file 4     |   file 5   | EOF |
+================================+================+============+=====+

What happens if the sector missing is 's4', which contains data from 'file 1', 'file 2' and the tar header for 'file 2'? In this case we need to use as reference a tar archive containing 'file 1' and 'file 2' in the same order as the tar archive to be reproduced. Also the metadata of 'file 2' needs to be identical between both archives.

By the way, testing lziprecover takes a lot of time because lziprecover processes corrupt files, and there is an almost infinite number of ways in which a file may become corrupted.

tar + lzip

Following the Unix concept of "doing one thing and doing it well", archive creation and archive compression have been kept until now as two separate tasks, usually performed by tar and gzip respectively. The whole tar archive was usually compressed into one gzip member (solid compression), which has the advantage of achieving a high compression ratio. On the other hand, a solidly compressed archive can't be decompressed nor decoded in parallel. It is not possible to append files to the end of the archive. The whole archive needs to be decompressed even if one wants to extract just one file. Finally, in case of data corruption, half the archive (on average) is lost along with the integrity checking of the recoverable part.

tar
+============+============+============+============+=============+========+
| tar member | tar member | tar member | tar member | tar member  |   EOF  |
+============+============+============+============+=============+========+

tar.lz (solid compression)
+==========================================================================+
|                               lzip member                                |
+==========================================================================+

tar + plzip

The first attempt at parallelization of tar decompression was done using a parallel compressor, like lbzip2 or plzip. (As stated earlier, gzip decompression can't be parallelized). The compression ratio is slightly lower than with solid compression. Decompression is faster because the archive can be decompressed in parallel, but the tar layer works the same as before. Tar must process the whole archive sequentially. The lack of alignment between tar members and compressed members prevents the archive from being decoded in parallel. It is also not possible to append files to the end of the archive. To extract just one file, the lzip member containing the file and all the members preceding it need to be decompressed.

"Alignment" here means that each compressed member contains an integer number of tar members, which can't be achieved compressing a tar archive with a parallel compressor. Alignment is important for efficient multi-threaded extraction because if a tar member is split between two compressed members, it can't be extracted by any of the two threads involved.

Finally, in case of data corruption, half the lzip member (on average) where the corruption happened is lost, along with the integrity checking of the recoverable part. A special decompressor (lziprecover), able to ignore the error and continue decompressing the remaining lzip members, is required to recover the part of the archive following the corruption.

tar
+============+============+============+============+=============+========+
| tar member | tar member | tar member | tar member | tar member  |   EOF  |
+============+============+============+============+=============+========+

tar.lz (plzip unaligned compression)
+======================+========================+==========================+
|      lzip member     |       lzip member      |       lzip member        |
+======================+========================+==========================+

tarlz --no-solid

With the generalization of multi-processor computers, the definition of what counts as "one thing" (in "doing one thing and doing it well") has changed. It happens that efficient multi-threaded creation, extraction and listing of compressed tarballs is one task, not two, because the communication required between archiver and compressor needs to go beyond normal communication through a pipe, as in 'tar -c | plzip'. A single tool controlling both archiving and compression is required to guarantee the alignment between tar members and compressed members. It is also convenient that the compressed format provides an index, which makes possible decoding the tar archive efficiently in parallel.

The compression ratio of aligned compression is lower than that of solid compression (depending on lzip member size). Extraction is fast because the archive can be decompressed and decoded in parallel. New files can be appended to the end of the archive. To extract one file, only the tar headers need to be decompressed to find the file. The lzip member containing the file is the only one that needs to be fully decompressed.

In case of data corruption, only half tar member (on average) is lost, along with the integrity checking of the recoverable part. Tarlz can skip over the damaged member and continue decoding the remaining members, just like the standard (uncompressed) tar. The member alignment minimizes the number of tar members affected, which usually also minimizes the amount of data lost.

The increase in speed may be spectacular in some cases. For example, multi-threaded listing of a regular (seekable) tar.lz archive containing large files can be hundreds of times faster than sequential listing because, in addition to using several processors, it only needs to decompress a small part of each lzip member.

tar
+==============+==============+==============+==============+==============+
|  tar member  |  tar member  |  tar member  |  tar member  |     EOF      |
+==============+==============+==============+==============+==============+

tar.lz (aligned compression)
+==============+==============+==============+==============+==============+
| lzip member  | lzip member  | lzip member  | lzip member  | lzip member  |
+==============+==============+==============+==============+==============+

tarlz --bsolid

If the files in the archive are small, compressing them independently may result in a low compression ratio. In order to increase the compression ratio, several tar members can be compressed together in each compressed member, as long as the alignment is maintained.

The compression ratio of grouped aligned compression is about as good as that of a parallel compressor like plzip (depending on lzip member size). Extraction is fast because the archive can be decompressed and decoded in parallel. New files can be appended to the end of the archive. To extract one file, only the beginning of the lzip members up to the last tar header in each member need to be decompressed to find the file. The lzip member containing the file is the only one that needs to be fully decompressed (to verify data integrity).

In case of data corruption, only half lzip member (on average) is lost, along with the integrity checking of the recoverable part. Tarlz can skip over the damaged member and continue decoding the remaining members, just like the standard (uncompressed) tar.

tar
+==============+==============+==============+==============+==============+
|  tar member  |  tar member  |  tar member  |  tar member  |     EOF      |
+==============+==============+==============+==============+==============+

tar.lz (grouped aligned compression)
+=============================+=============================+==============+
|          lzip member        |          lzip member        | lzip member  |
+=============================+=============================+==============+

Tarlz

In short, what is tarlz?

Tarlz is a massively parallel (multi-threaded) combined implementation of the tar archiver and the lzip compressor.

Why is a combined implementation needed?

Because for multi-threaded tools, archive creation and archive compression are one task, not two. Therefore, a single tool controlling both archiving and compression is required to guarantee the alignment between tar members and compressed members.

Some tarlz features:

Tarlz brings the compressed tar format to the multi-processor era.
Tarlz implements multi-threaded creation and listing.*
Tarlz allows operations only available before with uncompressed tar:
- Appending of new files to the archive.
- Concatenation of archives.
- Efficient extraction or listing of one file.

The multimember tar.lz archives created by tarlz are fully backward compatible with standard tar tools like GNU tar, which treat them like any other tar.lz archive.

* Multi-threaded extraction is not yet implemented in tarlz, but I plan to start implementing it as soon as this GHM finishes.

Who is using the lzip format?

I think that all the GNU packages where it makes sense are already supporting lzip. For example:

GNU Automake creates distribution tarballs compressed with lzip, among other formats.
GNU Tar produces and decodes tarballs compressed with lzip, among other formats.
GNU Guix offers a lzlib module for the compression/decompression of lzip streams.

Several GNU and nongnu projects already distribute their source tarballs in lzip format. Guix distributes its substitutes (packages) in lzip format. Dragora GNU/Linux distributes its packages in tar.lz format.

About the distribution of source tarballs, one thing that developers usually fail to consider is that the format used for distribution of source code should be chosen carefully because source tarballs are meant to be archived forever as a way to compile and run a given version of some software in the future, for example to decode old data.

It is OK to distribute binary packages in a format that decompresses fast, as long as the source is distributed in a format suitable for long-term archiving.

Outside of the GNU project, several notable projects are already using lzip to distribute their data. For example:

The IANA timezone database, maintained by our dear fellow GNU hacker Paul Eggert, distributes the source tarballs of its full distribution in lzip format.
The European Parliament distributes its data (information on dossiers, representatives, vote results and committee agendas) in JSON format compressed with lzip.

Thank you for your attention!
Any questions?

You are free to copy and distribute this article without limitation, but you are not allowed to modify it.

First published: 2019-09-04