Libchop, Tools & Library for Data Backup and Distributed Storage

Libchop
1 Introduction
- 1.1 Overview
- 1.2 Use Cases
2 Utilities
3 API Reference
4 Guile Bindings
5 References
Appendix A GNU Free Documentation License
Concept Index
Function Index

Libchop

This document describes libchop version 0.5.2.

Next: Utilities, Previous: Top, Up: Top

1 Introduction

Libchop is a set of utilities and library for data backup and distributed storage. Its main application is an encrypted backup program with support for versioning, selective sharing, and adaptive compression (see Invoking chop-backup).

The library itself was initially designed as the basis for a peer-to-peer, cooperative backup system. In such a system, data has to be sent by pieces, incrementally, and it may be scattered across several participating nodes. In addition, participating nodes may be untrusted, which puts data confidentiality, integrity, and availability at risk.

In a nutshell, libchop addresses these issues by providing mechanisms to:

chop an input file into blocks, optionally compressed and/or encrypted, and to store them into a key-value store. It returns an abstract index containing all the information necessary to retrieve the file.
restore (rebuild) a file based on an abstract index, reading blocks from one (or several) stores. Each block is decoded, deciphered, decompressed, and its integrity is checked according to the methods specified by the abstract index.

Libchop decomposes these tasks with a fine grain, providing generic interfaces for each of the sub-tasks involved. It also comes with several implementations of each of these interfaces, allowing various storage strategies to be quickly experimented and compared. Implementations of these interfaces provide storage techniques such as content-based addressing and single-instance storage, content-hash keys, lossless compression, and similarity detection.

The following section gives an overview of the various interfaces and how they fit together. The next one gives an insight on typical use cases of the techniques and algorithms implemented by libchop.

Next: Use Cases, Up: Introduction

1.1 Overview

The libchop library defines several fine-grain interfaces for file chopping and indexing. Objects implementing those interfaces may be composed together to form a data storage pipeline: the input of this pipeline are the raw contents of a file, and its output is a set of data blocks and an index handle¹ that suffices to restore the file contents from those blocks.

The interfaces that allow components to interact altogether within this framework are the following:

the stream interface, which delivers input data (e.g., file contents) as a stream;
the chopper interface, given an input stream, splits this input and returns block(s);
the block indexer (resp. the block fetcher) interface deals with the naming and storage a single data block (resp. the retrieval of a single data block given its name);
the stream indexer is responsible for the storage of a series of data blocks along with meta-data describing how these blocks relate to each other; when retrieving a byte stream, the stream indexers reads the meta-data and assembles data blocks accordingly;
the block store is an interface for the storage of key/data pairs, allowing the contents of data blocks to be associated with a key.

The data flow in a stream indexing pipeline will look like this:

                       ,--  block indexer --.
                       |         ^          |
                       v         |          |
stream ----> chopper --+-> stream indexer --+-> block store

It is worth noting that the block indexer has a central position because it both stores data block directly to the block store, and produces meta-data (individual block indices) which is fed back to the stream indexer in order to be eventually stored and indexed as if it were “normal” data.

When doing the opposite operation, i.e., restoring a data stream, the data flow looks like this:

                        ,-------------------.
                        |                   |
                        v                   |
block store ----> block fetcher ----> stream indexer ----> byte stream

Here, the stream “indexer” (actually a stream fetcher²) is controlled by the user (who provided the index handle of the stream to be retrieved). The stream indexer, in turns, controls the block fetcher in order to retrieve individual blocks from the underlying block store.

For each of these interfaces, libchop provides several implementations. Therefore, the user can conveniently choose the pipeline configuration that fits them best.

Additionally, it is sometimes useful to filter input streams or input/output blocks, pretty much in the same way one uses “pipes” under Unix. Libchop provides a number of filters that may be used for this purpose, such as zlib/bz2 unzip/zip filters, or public-key ciphering filters. The user can then create a filtered stream (via chop_filtered_stream_open ()) or a filtered block store (via chop_filtered_store_open ()) in order to make use of them (see Filters).

The storage pipeline can also be built and used from the command-line using the chop-archiver tool (see Invoking chop-archiver).

Previous: Overview, Up: Introduction

1.2 Use Cases

File chopping and indexing techniques are used in a wide range of applications, including distributed storage, peer-to-peer file sharing, archival, backup, and revision control. Each of these classes of application has specific requirements, and each particular design in these classes has its own particularities.

Libchop, as a library or using its stand-alone utilities, can be used to build such applications, as demonstrated by chop-backup (see Invoking chop-backup). For instance, incremental encrypted data backup to an untrusted site can be achieved this way:

     # Mount the target FTP directory (using a GNU/Hurd translator; on
     # GNU/Linux, `curlftpfs' and `sshfs' can be used similarly.)
     $ settrans -ca backup-site /hurd/ftpfs ftp.example.com:backup
     
     $ chop-backup backup-site /pix/holidays
     (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...agfq2a/4a" ())

The long line displayed by chop-backup is an index handle or tuple that can eventually be passed to chop-backup --restore to restore the directory tree. Although the directory and its contents are stored as a set of encrypted blocks, it can be shared: knowing the index handle is necessary and sufficient to retrieve the file. For more information, see Invoking chop-backup, and Invoking chop-archiver.

Anonymous peer-to-peer file sharing systems like GNUnet and FreeNet both use the same particular configurations of the storage pipeline (see GNUnet). Roughly, they use a fixed-size stream chopper (e.g., in GNUnet 0.7, input files are chopped into 32 kB blocks) and some form of a CHK³ block indexer. GNUnet uses a tree stream indexer. In GNUnet, this whole encoding configuration is called ECRS. The Tahoe-LAFS distributed file system uses similar techniques, with the addition of erasure coding as the “chopper”, an option currently not available in libchop (see Tahoe-LAFS). Obviously, these projects use a completely distributed block store underneath.

The write-once read many (WORM) archival system Venti (used in Plan 9) uses a tree indexer and a content-based block indexer. Venti chops files into fixed-size blocks (see Venti).

Recent distributed version control systems (such as Monotone, Git, Mercurial, Bazaar, etc.) typically do not cut file into pieces. Therefore, they do not require a stream indexer since each input file yields one block. However, they index files using a content-based block indexer, which guarantees single-instance storage and integrity verification.

Rsync uses content-based indexing as well in order to reduce the amount of data to be transferred when synchronizing copies of a set of files across different machines over the network.

Libchop is modular, which makes it possible to combine various storage techniques such as those described above. For a detailed discussion of the trade-offs involved, Documents related to libchop.

Next: API Reference, Previous: Introduction, Up: Top

2 Utilities

Libchop comes with a number of utilities that exercise all its API.

The first one, chop-backup, is an encrypted backup application, with some bells and whistles.

The remaining tools offer a lower-level interface. Using them is actually a good way to become familiar with libchop's interfaces and their implementations.

The chop-archiver tool exposes all of libchop's storage pipeline, which makes it the main entry point to libchop from the command line. The chop-block-server program implements a dumb block server, which listens to remote procedure calls (RPCs) to read and write data blocks.

Next: Invoking chop-archiver, Up: Utilities

2.1 Invoking `chop-backup`

chop-backup is an encrypted backup application⁴. It is designed to liberate users from the need to entrust their storage providers with data confidentiality, integrity, and availability. We believe this should make users less dependent on any storage providers, while making it easier to share storage capacity among distrustful people or organizations.

The tool has several salient features towards that goal:

Encrypted.: The backup data can safely be stored at an untrusted site without compromising its confidentiality.
Tamper-proof.: The backup's integrity is checked upon recovery.
Distributable.: Backup data can be written to more than one store.
Shareable.: Each directory/file in a snapshot is identified by a “tuple”, which is necessary and sufficient to retrieve it. A tuple can be shared with others, which gives them access to the corresponding file/directory and only it—in accordance with the principle of least authority⁵.
Versioned.: The history of directory snapshots is recorded, at little cost.
Compressed.: Similar data among files or versions are coalesced (this is sometimes referred to as deduplication or single-instance storage). For each file type an appropriate compression method is chosen.
Evolutive.: The application is not bound to any storage, hash, encryption, or compression method. In fact, all these parameters can vary from file to file within a snapshot.
Fast.: Only files and directories changed since the last run are read; only pieces of data missing in the target snapshot are written.

To create a snapshot, type:

     $ chop-backup --backup ~/bak ~/doc/important
     (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...mdfagfq2a/4a" ())
     
     $ chop-backup --backup /net/backup-machine ~/doc/important
     (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...mdfagfq2a/4a" ())

The first command above creates a snapshot of the ~/doc/important directory and stores it under ~/bak, which we thereafter refer to as a store or block store. Upon completion it displays a tuple, which is the key that is necessary and sufficient to access the snapshot. The second command writes a new (possibly identical) snapshot to /net/backup-machine, which could be (say) an SSHFS mount point to some backup host. In both cases, only data not already available in the target store is actually written to it.

Since ~/doc/important was unchanged between the two runs, chop-backup returned the exact same tuple⁶. If the contents of that directory, or file meta-data such as permissions or modification time are changed, chop-backup returns a different tuple:

     $ chop-backup --backup /mnt/usb-key ~/doc/important
     (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...ssqijsady/4a" ())

Because the returned tuples are so important (and hard to type), chop-backup caches a copy of them locally. They can be accessed with the --recent option:

     $ chop-backup --recent
     /home/ludo/doc/important        (() "tree_indexer...ssqijsady/4a" ())
     /home/ludo/doc/important        (() "tree_indexer...mdfagfq2a/4a" ())

However, the above command does not tell you in which store these backups are available. To check which of these recent backups a given store holds, try the --probe option:

     $ chop-backup --probe /net/backup-machine
     /home/ludo/doc/important        (() "tree_indexer...mdfagfq2a/4a" ())
     
     $ chop-backup --probe /mnt/usb-key
     /home/ludo/doc/important        (() "tree_indexer...ssqijsady/4a" ())

In this example, there are two stores, each of which holds one snapshot of the /home/ludo/doc/important directory.

To allow you to retrieve your data even if the host on which chop-backup is run crashes, it is recommended to backup the resulting tuple by other means—e.g., by sending it by mail.

The contents of a snapshot can be listed:

     $ chop-backup --list ~/bak
     accessing most recent backup of `/home/ludo/doc/important' at `(() "tree_...gfq2a/4a" ())'
     d---------        0 Jan  1 01:00     //previous-version//
     drwxr-xr-x     4096 Oct  6 22:18     chop
     -rw-r--r--    32001 Oct  6 22:35     Makefile
     -rw-r--r--     1451 Dec  8 10:23     Makefile.am
     -rw-r--r--       59 Dec  8 10:23     .gitignore
     -rw-r--r--    38221 Oct  6 22:20     Makefile.in

Since the command did not specify a tuple, chop-backup assumed that the user may be interested in the latest backup that was made and is available in ~/bak.

The special entry //previous-version// above represents the previous snapshot of that directory. The code --list=verbose option would show its tuple, which could then be passed to chop-backup --list to view specifically that version.

Complete directory trees can be restored:

     $ chop-backup --restore ~/bak out

This command restores the latest snapshot and writes it to directory out, preserving file permissions and symbolic links.

For each action, chop-backup has a corresponding option, as seen above. The list of options is described below.

--backup=[opts]

-bopts

Create an encrypted backup of the given directories, store it in the specified block store, and return a tuple that designates the backup.

For this action, chop-backup must be passed at least two arguments: a file name for the backup store, and one or more directories to be backed up. This is the default action.

opts is a comma-separated list of long or one-letter options:

f
fast: When the root block of a file is available in the store, assume all the file's blocks are available. This allows chop-backup to avoid reading the file and passing it through the storage pipeline, thereby significantly saving I/O and CPU time.
This is the default mode of operation.
r
repair: This option is the converse of fast: attempt to write all the blocks of files to the store, regardless of whether their root blocks were already available. This is useful when chop-backup --check reported some blocks as missing. It is much more CPU- and I/O-intensive than fast.
q
quiet: Do not print anything except the resulting tuple. This is the default.
v
verbose: Print on the standard error output each file being stored, along with an indication as to whether it actually needed to be processed.

--recent

-R

List recent backups and their tuple. For each backup, a line is shown with the original directory name and the tuple of its latest backup.

--probe

-p

List recent backups and their tuple that are available in the given store. The output has the same format as that of --recent and it is a subset of the directory/tuple pairs shown by --recent.

--list=[opts]

-lopts

List the contents of the directory pointed to by the specified tuple in the specified store.

For this action, chop-backup must be passed at least the file name of the store, and optionally the tuple of a directory snapshot.

opts is a comma-separated list of long or one-letter options:

r
recursive: Show the contents of sub-directories, recursively.
f
flat: Do not recurse into sub-directories.
h
history: Show the contents of previous directory versions, recursively. Each directory version appears as if it were a sub-directory of the next version, and called //previous-version//.
l
latest: Show only the latest directory version.
v
verbose: Show each file or directory tuple.
c
concise: Do not show tuples.

For instance, chop-backup -lr,h ~/bak lists the contents of the most recent backup, recursively traversing sub-directories and previous directory versions.

--log=[opts]

-Lopts

Show the changes in the directory specified by the given tuple, at the specified store, compared to earlier versions. The output is similar in spirit to that of git whatchanged—i.e., one or several letters indicate how a file was affected:

A: the file was added;
M: the file was modified;
T: the file was touched—i.e., its modification time changed but its contents did not;
D: the file was deleted.

Renamed files are identified by an arrow indicating their old and new name.

opts can be a comma-separated list of these suboptions:

r
recursive: Show the contents of sub-directories, recursively. This operation takes time proportional to the depth and size of the modified sub-directories.
f
flat: Do not recurse into sub-directories (this is the default).

--show

-s

Show the contents pointed to by the given tuple in the given store.

For this action, chop-backup must be passed the store's file name, and optionally a tuple.

--check

-c

Check whether the contents of the directory pointed to by the given tuple are available from in the given store, recursively. Each file is marked as OK, CORRUPT, or MISSING; MISSING means that some of the data blocks that comprise the given file were not found in the store.

For this action, chop-backup must be passed the store's file name, and optional a tuple pointing to a directory snapshot.

As with --restore, all the file blocks are retrieved from the store and passed through the storage pipeline, which deciphers them, checks their integrity, and reassembles them.

--restore

-r

Retrieve from the given store and restore the complete file system hierarchy pointed to by the given tuple, and write it to the given directory.

For this action, chop-backup must be passed the store's file name, an optional tuple, and the output directory name.

--help

Print a summary of the command-line options and exit.

--version

Print the version number of libchop and exit.

Under the hood, chop-backup uses content-hash keys—i.e., the chk_block_indexer—to provide symmetric encryption and single-instance storage (see Block Indexers & Fetchers). The rest of the storage pipeline, notably choppers and compression filters, is chosen as a function of the file type (text file, already-compressed file, unknown file type).

Next: Invoking chop-block-server, Previous: Invoking chop-backup, Up: Utilities

2.2 Invoking `chop-archiver`

The chop-archiver tool is a storage client that exposes most of the libchop programming interface at the command line. It stores and retrieves individual files through the libchop storage pipeline (see Overview). Unlike chop-backup, it does not natively support directory traversal and versioning; instead, it offers a lower-level interface. Users can choose each component of the pipeline, thanks to libchop's introspection capabilities (see Object System).

The tool can be used in one of two modes:

In archival mode, with the --archive or --archive-fd option, chop-archiver stores the given file and returns the resulting index (see Block Indexers & Fetchers):

          $ chop-archiver --archive README
          tree_indexer:hash_block_fetcher:hash_index_handle:64:sha1:6jmb3wrbzyqpsngxybt4r4rxtctdo4o2/26
          
          $ chop-archiver --archive-fd=5 5< README
          tree_indexer:hash_block_fetcher:hash_index_handle:64:sha1:6jmb3wrbzyqpsngxybt4r4rxtctdo4o2/26

The above command stored the file README in a local block store under $HOME/.chop-archiver and printed an index that can later be used to restore the file.

In retrieval mode, with the --restore option, chop-archiver retrieves data pointed to by the given index:
```
          $ chop-archiver --restore \
              tree_indexer:hash_block_fetcher:hash_index_handle:64:sha1:6jmb3wrbzyqpsngxybt4r4rxtctdo4o2/26
```
This command retrieves and decodes the data pointed to by the given index and prints it on the standard output. The integrity of each data block is checked and an error is returned if one of the blocks has been tampered with.

The available options allow the customization of the storage process:

     chop-archiver --block-indexer sha256 -a README

The command above instructs chop-archiver to store the given file with content-based addressing, using sha256 hashes of individual data blocks as their key.

     tar cf - /secret/things | \
     chop-archiver -i chk_block_indexer -I aes256,cbc,sha256,sha1 -A

This command uses content-hash keys (CHKs) to encrypt and name individual data blocks. Data blocks are encrypted with the aes256 symmetric cipher algorithm in cipher block chaining mode; the encryption key is the sha256 hash of the cleartext data block. Finally, the key associated with the resulting (encrypted) data block is the sha1 hash of that block.

The following command specifies that the input data should be chopped into data blocks using an algorithm that chooses block boundaries so as to facilitate similarity detection, trying to obtain blocks of 512 bytes on average:

     chop-archiver --chopper anchor_based_chopper --block-size 512 -a README

Finally, chop-archiver can be told to use a remote block store rather than a local on-disk store. In that case requests to read and write data blocks are sent as remote procedure calls (RPCs) over the network:

     chop-archiver --remote `chop-store-discover -D1 | cut -f2` -a README

The above command sends data to whichever remote block store is found on the local network (assuming Avahi was available at compile time). An implementation of a remote block store server is the chop-block-server tool (see Invoking chop-block-server).

The long list of available options follows:

--archive=file
-a file: Archive file and print an index handle. This index can be used for later retrieval with the --restore option.
--archive-fd=[fd]
-A[fd]: Same as above, but reads data from fd, an open file descriptor, or the standard input if fd is omitted.
--restore=index
-r index: Restore a file's revision from handle, an index handle pointing to an archived data stream.
--block-size=size
-b size: Choose a typical size of size bytes for the blocks produced by the chopper.
--chopper=chopper
-C chopper: Use an instance of the class named chopper as the input stream chopper (see Stream Choppers).
--db-file=file
-f file: Write the block database to file. By default, chop-archiver uses a block store under $HOME/.chop-archiver.
--block-indexer-class=bi-class
-i bi-class: Use bi-class as the block-indexer class (see Block Indexers & Fetchers). This implies -I. By default chop-archiver uses an instance of the hash_block_indexer class, which provides content-based addressing without encryption.
--block-indexer=bi
-I bi: Deserialize bi as an instance of bi-class and use it.
--indexer-class=i-class
-k i-class: Use i-class as the indexer class (see Stream Indexers). This implies -K.
--indexer=i
-K i: Deserialize i as an instance of i-class and use it.
--openpgp-pubkey=pubkey-file
-o pubkey-file
--openpgp-privkey=privkey-file
-O privkey-file: Use pubkey-file and privkey-file as the OpenPGP key pair to be used during TLS mutual authentication. The files should contain base64-encoded OpenPGP keys. This option is only available when GnuTLS was detected at configure time.
--protocol=proto
-p proto: Use proto (one of tls/tcp, tcp, udp or unix) when communicating with the remote store specified with --remote. Note that tls/tcp is only available when GnuTLS was detected at configure time.
--remote=host
-R host: Use the remote block store located at host for both data and meta-data blocks; host may contain : followed by a port number.
--no-smart-store
-N: In archival mode, instruct chop-archiver to write data blocks regardless of whether they already exist in the block store.
By default, only blocks not already available in the block store are written⁷. It makes incremental storage very fast (e.g., when archiving a file that has already been stored, or when storing a file that contains data similar to what's already available in the store) and saves bandwidth when using a remote block store.
--show-stats
-s: Show statistics about the blocks that have been written (in archival mode).
--store=class
-S class: Use class as the underlying file-based block store (see Block Stores).
--zip[=zip-type]
-zzip-type: Pass data blocks through a zip filter to compress (resp. decompress) data when writing (resp. reading) to (resp. from) the archive (see Filters). zip-type should be one of zlib, bzip2, or lzo.
--zip-input[=zip-type]
-Zzip-type: Same as above, except that the zip filter is applied to the input data stream.
--debug
-d: Produce debugging output and use a dummy block store (i.e., a block store that does nothing but print messages).
--verbose
-v: Produce verbose output.
--help: Print a help message.
--usage: Print a short usage message.
--version
-V: Print the program and libchop version.

Next: Invoking chop-show-similarities, Previous: Invoking chop-archiver, Up: Utilities

2.3 Invoking `chop-block-server`

The chop-block-server tool is an ONC RPC server that serves requests from the network to read and write data blocks. It it similar in spirit to Plan 9's Venti (see Venti). By default, it is completely oblivious to the encoding and naming scheme of incoming data blocks: it doesn't know what block indexer was used to index them, doesn't require content-based indexing, and of course doesn't know how these data blocks relate to each other.

The server is invoked like this:

     chop-block-server [options] local-block-store

This command starts a block store server listening to block store remote procedure calls (RPCs) and applying them to the store from file local-block-store (the type of store in this file is determined by the --store option; see below.) options can contain any options from the following list:

--help: Print a summary of the command-line options and exit.
--version: Print the version number of libchop and exit.
--address=address
-a address: Bind to address and listen to connections there.
--port=port
-p port: Run the service on port port.
--protocol=proto
-P proto: Serve RPCs over PROTO, either tcp or udp.
--no-collision-check
-C: Turn off block/key collision checks. By default, when a request to write a block under a given key is received, chop-block-server returns an error if there's already a block associated with that key and that block differs from the new one.
--enforce-hash=algo
-H: Enforce content-hash algorithm algo. That is, return an error if the key under which a block is to be written is not the algo hash of this block.
--zip[=zip-type]
-z[zip-type]: Pass data blocks through a zip-type filter to compress (resp. decompress) data when writing (resp. reading) to (resp. from) the block store. zip-type may be one of zlib, bzip2, or lzo, for instance.
--service-name=name
-s name: Choose name as the service name to be published via Avahi. This option is only available when Avahi was detected at configure time.
--no-publication
-n: Turn off service publication on the LAN via Avahi. This option is only available when Avahi was detected at configure time.
--tls
-t: Use RPCs over TLS (recommended). This option is only available when GnuTLS was detected at configure time.
--openpgp-pubkey=pubkey-file
-o pubkey-file
--openpgp-privkey=privkey-file
-O privkey-file: Use the OpenPGP public key from pubkey-file and the corresponding private key from privkey-file for TLS mutual authentication. These files must contain a base64-encoded OpenPGP key. These options are available when GnuTLS was detected at configure time.
--store=class
-S class: Use class as the underlying file-based block store (see Block Stores).
--debug
-d: Produce debugging output and use a dummy block store (i.e., a block store that does nothing but print messages).
--verbose
-v: Produce verbose output.

Next: Invoking chop-show-anchors, Previous: Invoking chop-block-server, Up: Utilities

2.4 Invoking `chop-show-similarities`

The chop-show-similarities tool estimates the similarity between two files. It is an application of anchor-based choppers (see anchor-based choppers). It is invoked like this:

     chop-show-similarities [options] file1 file2

Note that ssdeep may be a more efficient tool for the job.

Previous: Invoking chop-show-similarities, Up: Utilities

2.5 Invoking `chop-show-anchors`

The chop-show-anchors shows the blocks yielded by the anchor-based chopper on a given input file (see anchor-based choppers). It is mostly meant as a demonstration or development tool to illustrate the behavior of the anchor-based chopper. It is run this way:

     chop-show-anchors [options] file

This command runs an anchor-based chopper on file and shows the contents of file, interspersed with --- (three hyphens on a line of their own) to show block boundaries.

Next: Guile Bindings, Previous: Utilities, Up: Top

3 API Reference

This chapter is currently largely incomplete. Please refer to the documentation in the libchop header files.

Next: Input Streams, Up: API Reference

3.1 Object System

Libchop comes with its own lightweight reflexive object system defined in the <chop/objects.h> header file. Given that it is a reflexive object system, type information is available at run-time. With little overhead, this object system greatly improves the flexibility of the library. The chop-archiver command line tool illustrates how type introspection can be leveraged to provide more flexibility (see Invoking chop-archiver).

Libchop's object system can be used to provide binary compatibility with little overhead: the size of a given class' instances can be known at run-time, which allows for caller allocation. Callers can allocate objects of a given class as they prefer (e.g., on the stack) while remaining binary-compatible should the class layout change.⁸

Let's have a closer look at it.

Next: Instantiating an Object, Up: Object System

3.1.1 Declaring and Defining a Class

At run-time, libchop classes are represented by chop_class_t objects. The canonical naming scheme for classes is the following. Let chbouib be a libchop class:

its C type is named chop_chbouib_t;
its run-time representation is a global chop_class_t object named chop_chbouib_class.

Libchop defines macros to declare and define classes while maintaining the naming scheme in a consistent way:

— Macro: CHOP_DECLARE_RT_CLASS (name, parent_name, fields)

Declare a class named name, that is declare the type chop_NAME_t and the global variable chop_NAME_class. fields should be a list of fields (as in a struct definition). Note that libchop classes must always inherit (directly or not) the object class.
Example:
          CHOP_DECLARE_RT_CLASS (chbouib, object,
                                 int x;
                                 float y;);

— Macro: CHOP_DEFINE_RT_CLASS (name, parent_name, ctor, dtor, copy, equalp, serial, dserial)

Once the class named name has been declared, this macro defines it as being a sub-class of the class named parent_name (this must be the same name as the one used in the declaration!), with constructor ctor and destructor dtor (both are optional), and with copy constructor copy and equality predicate equalp (both optional too). Finally, serial is a pointer to a serializer and dserial and pointer to a deserializer which may both be NULL as well.

See the <chop/objects.h> for details.

Next: Cloning an Object, Previous: Declaring and Defining a Class, Up: Object System

3.1.2 Instantiating an Object

Constructors, in libchop, do not allocate memory to store objects: they simple initialize objects. Therefore, memory must be allocated to store a given an object before its constructor is called. Run-time type information turns out to be helpful here: each chop_class_t object contains information about the amount of memory needed to store an instance of it.

— Function: size_t chop_class_instance_size (const chop_class_t *class)

Return the size, in bytes, of instances of class class.

— Macro: chop_class_alloca_instance (class)

Return a pointer to a memory area large enough to store an instance of class. This macro uses alloca to allocate storage on the stack (see see the GNU libc manual, for details).

— Function: chop_error_t chop_object_initialize (chop_object_t *object, const chop_class_t *class)

Initialize object as an instance of class class, thus invoking the constructor of class and the one of its parent classes.
As a user, you will not usually call this function directly. Instead, you will call a particular constructor for class that can take additional arguments (e.g., chop_file_stream_open ()) and which will in turn call this function behind the scenes.

— Function: void chop_object_destroy (chop_object_t *object)

Destroy object, that is deallocate resources associated to it. Note that it does not deallocate the memory where object is stored.

It is important to note that all libchop objects must be destroyed using chop_object_destroy (), regardless of where they are stored in memory.

Another way to instantiate an object is described in Cloning an Object.

Next: Testing Objects for Equality, Previous: Instantiating an Object, Up: Object System

3.1.3 Cloning an Object

Libchop's object system allows deep copies or cloning of objects. Classes, however, may choose to not implement this feature.

— Function: chop_error_t chop_object_copy (const chop_object_t *source, chop_object_t *dest)

Clone the object pointed to by source into dest. As usual, dest must point to an (uninitialized) memory area large enough to hold an instance of source's class. If source's class does not implement cloning, then a default “shallow” copy is performed.
Note that, in any case, the copy constructors of the whole class hierarchy of source are invoked, starting from the higher one, just like for a regular instantiation (see Instantiating an Object).

Next: Serializing and Deserializing an Object, Previous: Cloning an Object, Up: Object System

3.1.4 Testing Objects for Equality

Testing whether two objects are equal or not may be dependant on the implementation of their classes. Classes may implement an equality predicate that can be used two compare two instances⁹.

— Function: int chop_object_equal (const chop_object_t *o1, const chop_object_t *o2)

Return true (non-zero) if objects o1 and o2 are equal. If both objects are instances of the same class but that class does not define an equality predicate, they will only be considered equal if and only if o1 == o2, from a pointer arithmetic viewpoint.

Previous: Testing Objects for Equality, Up: Object System

3.1.5 Serializing and Deserializing an Object

Classes in libchop may implement a serialization and a deserialization methods—these methods may be specified at class definition time (see Declaring and Defining a Class). A serialization method basically converts an object into a portable representation of its state as a byte stream. The deserialization operation does the opposite.

The chop_serial_method_t type defines two standard serialization/deserialization methods:

CHOP_SERIAL_ASCII: An object is serialized using only ASCII characters. This yields human-readable representations.
CHOP_SERIAL_BINARY: An object is serialized using as a byte stream where each byte can have any value. This usually yields to significantly more compact representations.

— Function: chop_error_t chop_object_serialize (const chop_object_t *object, chop_serial_method_t method, chop_buffer_t *buffer)

Serialize object according to serialization method method into buffer. If no serializer exists for object's class, CHOP_ERR_NOT_IMPL is returned. On success, zero is returned.

— Function: chop_error_t chop_object_deserialize (chop_object_t *object, const chop_class_t *class, chop_serial_method_t method, const char *buffer, size_t size, size_t *bytes_read)

Initialize object, which is expected to be of type class, by deserializing buffer (of size bytes), according to method. On success, zero is returned and object is initialized. Otherwise, object is left in an undefined state. On success, bytes_read is set to the number of bytes that were read from buffer in order to deserialize object.

Next: Stream Choppers, Previous: Object System, Up: API Reference

3.2 Input Streams

The chop_stream_t type provides a simple interface to input byte streams. Data is read on-demand using the chop_stream_read () function:

— Type: chop_stream_t

The type of an input byte stream.

— Variable: chop_class_t chop_stream_class

The class object for instances of type chop_stream_t.

— Function: chop_error_t chop_stream_read (chop_stream_t *stream, char *buffer, size_t size, size_t *read)

Read at most size bytes from stream into buffer. On failure, return an error code (non-zero). Otherwise, return in read the number of bytes actually read.

— Function: size_t chop_stream_preferred_block_size (const chop_stream_t *stream)

Return the preferred block size for reading stream. This gives callers, such as choppers, a hint, which may improve read performance.

3.2.1 Input Stream Classes

Several classes implement chop_stream_t:

— Variable: chop_class_t chop_file_stream_class
— Variable: chop_class_t chop_mem_stream_class
— Variable: chop_class_t chop_filtered_stream_class

Classes that inherit from chop_stream_class.

These classes implement input streams backed by files, in-memory byte arrays, or by another stream passed through a filter (see Filters).

To create instances of these classes, use the constructors below.

— Function: chop_error_t chop_file_stream_open (const char *path, chop_stream_t *stream)

Open file located at path and initialize stream as a file stream representing this file. stream has to point to a large-enough memory area to hold an object whose class is chop_file_stream_class.

— Function: void chop_mem_stream_open (const char *base, size_t size, void (*free_func) (void *), chop_stream_t *stream)

Open a memory-backed stream, i.e., a stream whose input is read from base which is size byte-long. If free_func is not NULL, it is called upon closing stream.

— Function: chop_error_t chop_filtered_stream_open (chop_stream_t *backend, chop_proxy_semantics_t bps, chop_filter_t *filter, int owns_filter, chop_stream_t *stream)

Initialize stream as a filtered stream that reads input data from backend through filter. bps defines the semantics of stream as a proxy of backend (whether backend should eventually be destroyed, etc.). Similarly, if owns_filter is true, then closing stream will destroy filter.

Next: Block Stores, Previous: Input Streams, Up: API Reference

3.3 Stream Choppers

Choppers take a byte stream as input and return a series of blocks, which are subsets of the input stream. Their name comes from the fact that they “chop” the input stream into smaller blocks. They are defined in chop/choppers.h.

— Type: chop_chopper_t

The type of a chopper.

— Variable: chop_class_t chop_chopper_class

The class object for instances of type chop_chopper_t.

— Variable: chop_class_t chop_chopper_class

This meta-class—i.e., a class that inherits from chop_class_class—defines the chop_chopper_generic_open method to instantiate choppers (see below.)

— Function: chop_error_t chop_chopper_generic_open (const chop_chopper_class_t *class, chop_stream_t *input, size_t typical_block_size, chop_chopper_t *chopper)

Initialize chopper as an instance of the class chopper class with input stream input. Return zero on success.
The implementation of class will choose default parameters for the specific chopper implementation (class-specific initialization methods are described below for fine-tuning of parameters). However, it should make sure that the average block size will be typical_block_size bytes. If typical_block_size is zero, then the implementation of class is free to choose any block size.

— Function: chop_stream_t * chop_chopper_stream (const chop_chopper_t *chopper)

Return the input stream attached to chopper.

— Function: chop_error_t chop_chopper_read_block (chop_chopper_t *chopper, chop_buffer_t *block, size_t *size)

Read a block from chopper and store its contents into block. On success, block contains the exact contents of the block—i.e., the contents are "pushed"—, size contains the size of the block, and zero is returned. On end of stream, *size is set to zero and CHOP_STREAM_END is returned.

— Function: size_t chop_chopper_typical_block_size (const chop_chopper_t *chopper)

Return the “typical” size of the blocks produced by chopper. The meaning of “typical” actually depends on the chopper implementation. The value returned can be used as a hint for the initial size of block buffers.

— Function: void chop_chopper_close (chop_chopper_t *chopper)

Deallocate any resources associated with chopper. Once closed, chopper becomes unusable.

3.3.1 Stream Chopper Classes

The chop_chopper_t interface described above is implemented by several classes:

— Variable: chop_chopper_class_t chop_fixed_size_chopper_class
— Variable: chop_chopper_class_t chop_anchor_based_chopper_class
— Variable: chop_chopper_class_t chop_whole_stream_chopper_class

Classes that inherit from chop_chopper_class. All of these support the chop_chopper_generic_open method described above.

The first class implements the fixed-size chopper, a chopper that returns fixed-size blocks:

— Function: chop_error_t chop_fixed_size_chopper_init (chop_stream_t *input, size_t block_size, int pad_blocks, chop_chopper_t *chopper)

Initialize chopper as a fixed-size stream chopper. It will read data from input and cut it into blocks of block_size bytes. If pad_blocks is non-zero, the last block before CHOP_STREAM_END will be padded with zeros to be block_size long.

The second class implements anchor-based choppers. They return blocks of variable size whose boundaries are determined as a function of the input stream. The intent of this chopper is to help detect similarity among subsets of files; using single-instance storage, identical blocks are stored only once, which saves storage space.

— Function: chop_error_t chop_anchor_based_chopper_init (chop_stream_t *input, size_t window_size, unsigned long magic_fpr_mask, chop_chopper_t *chopper)

Initialize chopper as an anchor-based stream chopper. It will read data from input and produce variably sized blocks.
Anchor-based choppers implement the algorithm described by Udi Manber¹⁰, which itself relies on Rabin fingerprints.
Basically, this algorithm allows anchors to be defined deterministically within a file such that identical blocks among similar files may be discovered. window_size is the size of the sliding window used to compute fingerprints. The paper recommends 50. Lower values may yield smaller blocks and better similarity detection.
The probability of finding a block boundary, and therefore the average size of blocks, largely depends on the value of magic_fpr_mask, the mask that will be applied to each fingerprint to determine whether it should yield a block boundary. The more bits are set in magic_fpr_mask, the less likely a fingerprint will match, and the larger the average block size will be.

An application of anchor-based choppers is chop-show-similarities (see Invoking chop-show-similarities). To see how block boundaries are determined, use the chop-show-anchors command (see Invoking chop-show-anchors).

Finally, the last class implements whole-stream choppers, which do not actually chop the input stream:

— Function: chop_error_t chop_whole_stream_chopper_open (chop_stream_t *input, chop_chopper_t *chopper)

Initialize chopper as a whole-stream chopper which fetches data from input. chopper will simply return one single block containing the entire contents of input.
Consequently, this can be costly in terms of memory consumption. Also, chop_whole_stream_chopper_class does not honor at all the typical_block_size argument of chop_chopper_generic_open.

Next: Block Indexers & Fetchers, Previous: Stream Choppers, Up: API Reference

3.4 Block Stores

Next: Stream Indexers, Previous: Block Stores, Up: API Reference

3.5 Block Indexers & Fetchers

Block indexers take a chunk of data and a block store, add that block to the block store (possibly after processing it), and return an index handle that names this block. The dual of block indexers are block fetchers: given an index handle and a block store, they return the original chunk of data.

— Type: chop_block_indexer_t
— Type: chop_block_fetcher_t

The type of a block indexer or fetcher, respectively.

— Type: chop_index_handle_t

The type of an abstract index handle.

The classes associated with a block indexer can be accessed with the following functions.

— Function: const chop_class_t * chop_block_indexer_fetcher_class (const chop_block_indexer_t *indexer)

Return the block fetcher class associated with the class of indexer.

— Function: const chop_class_t * chop_block_indexer_index_handle_class (const chop_block_indexer_t *indexer)

Return the index handle class associated with the class of indexer.

— Function: chop_error_t chop_block_indexer_initialize_fetcher (const chop_block_indexer_t *indexer, chop_block_fetcher_t *fetcher)

Initialize fetcher as a block fetcher corresponding to block indexer indexer. fetcher must point to a memory area large enough to contain an instance of the fetcher class associated to indexer's class.

Storing and retrieving a chunk of data is achieved using the methods below.

— Function: chop_error_t chop_block_indexer_index (chop_block_indexer_t *indexer, chop_block_store_t *store, const char *buffer, size_t size, chop_index_handle_t *handle)

Using indexer, index the size-byte long data block pointed to by buffer to store. On success, return zero and initialize handle with an index handle sufficient to restore that block from store using the corresponding fetcher. Otherwise, return an error.

— Function: chop_error_t chop_block_fetcher_fetch (chop_block_fetcher_t *fetcher, const chop_index_handle_t *handle, chop_block_store_t *store, chop_buffer_t *buffer, size_t *size)

Using fetcher, fetch from store the data block whose index handle is handle. On success, fill in buffer with its contents and set size to its size in bytes.

— Function: chop_error_t chop_block_fetcher_exists (chop_block_fetcher_t *fetcher, const chop_index_handle_t *handle, chop_block_store_t *store, int *exists)

Using fetcher, check whether the block pointed to by handle is available. On success, set *exists to 1 if it's available in store, 0 if it's not available; otherwise return an error.

3.5.1 Block Indexer Classes

The block indexer, block fetcher, and index handle interfaces described above have several implementations.

— Variable: chop_class_t chop_hash_block_indexer_class
— Variable: chop_class_t chop_chk_block_indexer_class
— Variable: chop_class_t chop_uuid_block_indexer_class
— Variable: chop_class_t chop_integer_block_indexer_class

Classes that inherit from chop_block_indexer_class.

— Variable: chop_class_t chop_hash_block_fetcher_class
— Variable: chop_class_t chop_chk_block_fetcher_class
— Variable: chop_class_t chop_uuid_block_fetcher_class
— Variable: chop_class_t chop_integer_block_fetcher_class

Classes that inherit from chop_block_fetcher_class, and are the dual of the above.

The most important block indexer classes are hash and chk, described below. Both use cryptographic hashes to determine a the key under which a block is stored. Consequently, they provide the single-instance storage property: a given chunk of data is always stored only once in the block store.

— Function: chop_error_t chop_hash_block_indexer_open (chop_hash_method_t hash_method, chop_block_indexer_t *indexer)

Initialize indexer as a hash block indexer. indexer will use hash_method to compute the identifier of the given block and will use that identifier when storing the block. It leaves the block contents unchanged. This technique is known as content-based addressing, or compare-by-hash.

— Function: chop_error_t chop_chk_block_indexer_open (chop_cipher_handle_t cipher_handle, int owns_cipher_handle, chop_hash_method_t key_hash_method, chop_hash_method_t block_id_hash_method, chop_block_indexer_t *block_indexer)

Initialize block_indexer as a content-hash key (CHK) block indexer. block_indexer will cipher data blocks using cipher_handle and the block's hash yielded by key_hash_method (this is symmetric ciphering). Finally, block_indexer will compute the block's identifier using block_id_hash_method.
This technique is known as convergent encryption, and the index yielded is sometimes referred to as a content-hash key (in GNUnet/FreeNet terminology.) For more details, References.

— Function: chop_error_t chop_uuid_block_indexer_open (chop_block_indexer_t *block_indexer)

Initialize block_indexer as a UUID block indexer. In other words, block_indexer will then yield DCE-compatible Universally Unique Identifiers for each block, using libuuid (provided it was available at compilation time).
Therefore, block_indexer will not have the single-instance storage property, unlike the chk and hash block indexers.

— Function: chop_error_t chop_integer_block_indexer_open (unsigned long start, chop_block_indexer_t *indexer)

Initialize indexer as an indexer that will simply return consecutive 32-bit integers as block IDs, starting from start.
Such indexers are mostly useful for benchmarking. They lack the single-instance storage property, and can actually overwrite previous blocks given that the returned block keys are not universally unique.

Next: Filters, Previous: Block Indexers & Fetchers, Up: API Reference

3.6 Stream Indexers

Previous: Stream Indexers, Up: API Reference

3.7 Filters

Next: References, Previous: API Reference, Up: Top

4 Guile Bindings

Libchop can be used in Scheme programs written for GNU Guile 2.0 (see GNU Guile). Each C header file has a corresponding Guile module—e.g., the (chop indexers) module corresponds to <chop/indexers.h>. The (chop) module is a composite module that aggregates all the sub-modules.

The example below shows how one would write a procedure that, given a file name, stores and encrypts the given file and returns an opaque index handle (see index handle).

     (use-modules (chop))
     
     (define (chop/encrypt file store)
       (let* ((input (file-stream-open file))
              (i     (tree-indexer-open))
              (c     (fixed-size-chopper-open input 777))
              (bi    (chk-block-indexer-open (make-cipher cipher-algorithm/aes256
                                                          cipher-mode/cbc)
                                             hash-method/sha256
                                             hash-method/sha1)))
         (indexer-index-blocks i c bi store store)))

To store an encrypted copy of a file, this procedure can be invoked as follows:

     (let ((s (file-based-block-store-open (lookup-class "gdbm_block_store")
                                           "encrypted-data.gdbm"
                                           (logior O_RDWR O_CREAT)
                                           #o644)))
      (chop/encrypt "secret-file.txt" s))
     ⇒ #<chop chk_index_handle 346d1c0 (3475000)>

In this example, secret-file.txt is chopped into blocks of 777 bytes (see Stream Choppers). Those blocks are then encrypted and stored in encrypted-data.gdbm, a GNU dbm key/value store (see Intro).

The returned index handle can be serialized as a string and stashed, for later reuse:

     (serialize-object/ascii (chop/encrypt "secret-file.txt" s))
     ⇒ "ci4r7xyktacal3rqrf6wpk5kitripgrv3cxvllarepzzsus5mmuq====,6v4nej5hjmtkvcinlawcf24wnq4zplm7/64a"

Getting back an index handle from such a string is done this way:

     (let ((str "ci4r7xyktacal3rqrf6wpk5kitripgrv3cxvllarepzzsus5mmuq====,6v4nej5hjmtkvcinlawcf24wnq4zplm7/64a"))
       (deserialize-object/ascii (lookup-class "chk_block_indexer") str))
     ⇒ #<chop chk_index_handle 284ee40 (2854000)>

Finally, the procedure below returns a Scheme port from the given index handle and block store:

     (define (decrypt/assemble index store)
       (let* ((i  (tree-indexer-open))
              (bi (chk-block-indexer-open (make-cipher cipher-algorithm/aes256
                                                       cipher-mode/cbc)
                                          hash-method/sha256
                                          hash-method/sha1))
              (bf (block-indexer-fetcher bi)))
         (stream->port (indexer-fetch-stream i index bf store store))))

The data pointed to by index in store can be obtained by reading from the returned port.

And voilà!

The API reference is not written yet, but all the Scheme procedures come with a docstring, and can be easily browsed using Geiser (see Documentation helpers).

Next: GNU Free Documentation License, Previous: Guile Bindings, Up: Top

5 References

Free software related to libchop:

Plan 9's Venti archival storage server: http://plan9.bell-labs.com/magic/man2html/8/venti
GNUnet, GNU's decentralized anonymous and censorship-resistant peer-to-peer framework: http://gnunet.org/
Tahoe-LAFS, a secure, decentralized, data store: http://allmydata.org/trac/tahoe-lafs

Documents related to libchop:

Cooperative Data Backup for Mobile Devices: Ludovic Courtès (PhD thesis), 2007, http://tel.archives-ouvertes.fr/tel-00196822/en/. Chapter 4 discusses the design of libchop. It contains an evaluation of several storage strategies using libchop in terms of storage efficiency and computational cost, for various types of input files.
Storage Tradeoffs in a Collaborative Backup Service for Mobile Devices: Ludovic Courtès et al., 2006, http://hal.archives-ouvertes.fr/hal-00187069/en/. This paper provides a summary of the above.

Appendix A GNU Free Documentation License

Version 1.3, 3 November 2008

     Copyright © 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc.
     http://fsf.org/
     
     Everyone is permitted to copy and distribute verbatim copies
     of this license document, but changing it is not allowed.

PREAMBLE
The purpose of this License is to make a manual, textbook, or other functional and useful document free in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.
This License is a kind of “copyleft”, which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.
APPLICABILITY AND DEFINITIONS
This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The “Document”, below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as “you”. You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.
A “Modified Version” of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A “Secondary Section” is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The “Invariant Sections” are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.
The “Cover Texts” are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A “Transparent” copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not “Transparent” is called “Opaque”.
Examples of suitable formats for Transparent copies include plain ascii without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.
The “Title Page” means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, “Title Page” means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.
The “publisher” means any person or entity that distributes copies of the Document to the public.
A section “Entitled XYZ” means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the Title” of such a section when you modify the Document means that it remains a section “Entitled XYZ” according to this definition.
The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.
VERBATIM COPYING
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
COPYING IN QUANTITY
If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
MODIFICATIONS
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
1. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.
2. List on the Title Page, as authors, one or more persons or entities responsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement.
3. State on the Title page the name of the publisher of the Modified Version, as the publisher.
4. Preserve all the copyright notices of the Document.
5. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.
6. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.
7. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document's license notice.
8. Include an unaltered copy of this License.
9. Preserve the section Entitled “History”, Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled “History” in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence.
10. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the “History” section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.
11. For any section Entitled “Acknowledgements” or “Dedications”, Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.
12. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not considered part of the section titles.
13. Delete any section Entitled “Endorsements”. Such a section may not be included in the Modified Version.
14. Do not retitle any existing section to be Entitled “Endorsements” or to conflict in title with any Invariant Section.
15. Preserve any Warranty Disclaimers.
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section Entitled “Endorsements”, provided it contains nothing but endorsements of your Modified Version by various parties—for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.
COMBINING DOCUMENTS
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled “History” in the various original documents, forming one section Entitled “History”; likewise combine any sections Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all sections Entitled “Endorsements.”
COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
AGGREGATION WITH INDEPENDENT WORKS
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an “aggregate” if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.
TRANSLATION
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “History”, the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.
TERMINATION
You may not copy, modify, sublicense, or distribute the Document except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, or distribute it is void, and will automatically terminate your rights under this License.
However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.
Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, receipt of a copy of some or all of the same material does not give you any rights to use it.
FUTURE REVISIONS OF THIS LICENSE
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License “or any later version” applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation. If the Document specifies that a proxy can decide which future versions of this License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Document.
RELICENSING
“Massive Multiauthor Collaboration Site” (or “MMC Site”) means any World Wide Web server that publishes copyrightable works and also provides prominent facilities for anybody to edit those works. A public wiki that anybody can edit is an example of such a server. A “Massive Multiauthor Collaboration” (or “MMC”) contained in the site means any set of copyrightable works thus published on the MMC site.
“CC-BY-SA” means the Creative Commons Attribution-Share Alike 3.0 license published by Creative Commons Corporation, a not-for-profit corporation with a principal place of business in San Francisco, California, as well as future copyleft versions of that license published by that same organization.
“Incorporate” means to publish or republish a Document, in whole or in part, as part of another Document.
An MMC is “eligible for relicensing” if it is licensed under this License, and if all works that were first published under this License somewhere other than this MMC, and subsequently incorporated in whole or in part into the MMC, (1) had no cover texts or invariant sections, and (2) were thus incorporated prior to November 1, 2008.
The operator of an MMC Site may republish an MMC contained in the site under CC-BY-SA on the same site at any time before August 1, 2009, provided the MMC is eligible for relicensing.

ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:

       Copyright (C)  year  your name.
       Permission is granted to copy, distribute and/or modify this document
       under the terms of the GNU Free Documentation License, Version 1.3
       or any later version published by the Free Software Foundation;
       with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
       Texts.  A copy of the license is included in the section entitled ``GNU
       Free Documentation License''.

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this:

         with the Invariant Sections being list their titles, with
         the Front-Cover Texts being list, and with the Back-Cover Texts
         being list.

If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.

If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.

Next: Function Index, Previous: GNU Free Documentation License, Up: Top

Concept Index

abstract index: Overview
anchor-based choppers: Stream Choppers
backup: Invoking chop-backup
binary compatibility: Object System
block indexer: Overview
block store: Overview
capabilities: Invoking chop-backup
change log: Invoking chop-backup
CHK: Block Indexers & Fetchers
CHK: Invoking chop-archiver
chop-archiver: Invoking chop-archiver
chop-backup: Invoking chop-backup
chop-block-server: Invoking chop-block-server
chopper: Stream Choppers
chopper: Overview
class: Declaring and Defining a Class
clone: Cloning an Object
compare-by-hash: Block Indexers & Fetchers
content-based addressing: Block Indexers & Fetchers
content-hash key: Block Indexers & Fetchers
content-hash keys: Invoking chop-archiver
convergent encryption: Block Indexers & Fetchers
data blocks: Stream Choppers
deduplication: Invoking chop-backup
deep copy: Cloning an Object
deserialization: Serializing and Deserializing an Object
encryption: Invoking chop-archiver
equality predicate: Testing Objects for Equality
filter: Overview
filtered block store: Overview
filtered stream: Overview
fixed-size chopper: Stream Choppers
incremental storage: Invoking chop-archiver
index handle: Invoking chop-archiver
index handle: Overview
indexer: Overview
object system: Object System
pipeline: Overview
principle of least authority (POLA): Invoking chop-backup
Rabin fingerprints: Stream Choppers
referential transparency (chop-backup): Invoking chop-backup
serialization: Serializing and Deserializing an Object
similarity detection: Stream Choppers
similarity detection: Invoking chop-archiver
single-instance storage: Block Indexers & Fetchers
single-instance storage: Invoking chop-backup
storage pipeline: Overview
store: Overview
stream: Overview
stream choppers: Stream Choppers
stream indexer: Overview
tuple: Invoking chop-backup
versioning: Invoking chop-backup
whole-stream choppers: Stream Choppers

Previous: Concept Index, Up: Top

Function Index

chop_anchor_based_chopper_init: Stream Choppers
chop_block_fetcher_exists: Block Indexers & Fetchers
chop_block_fetcher_fetch: Block Indexers & Fetchers
chop_block_indexer_fetcher_class: Block Indexers & Fetchers
chop_block_indexer_index: Block Indexers & Fetchers
chop_block_indexer_index_handle_class: Block Indexers & Fetchers
chop_block_indexer_initialize_fetcher: Block Indexers & Fetchers
chop_chk_block_indexer_open: Block Indexers & Fetchers
chop_chopper_close: Stream Choppers
chop_chopper_generic_open: Stream Choppers
chop_chopper_read_block: Stream Choppers
chop_chopper_stream: Stream Choppers
chop_chopper_typical_block_size: Stream Choppers
chop_class_alloca_instance: Instantiating an Object
chop_class_instance_size: Instantiating an Object
CHOP_DECLARE_RT_CLASS: Declaring and Defining a Class
CHOP_DEFINE_RT_CLASS: Declaring and Defining a Class
chop_file_stream_open: Input Streams
chop_filtered_store_open: Overview
chop_filtered_stream_open: Input Streams
chop_filtered_stream_open: Overview
chop_fixed_size_chopper_init: Stream Choppers
chop_hash_block_indexer_open: Block Indexers & Fetchers
chop_integer_block_indexer_open: Block Indexers & Fetchers
chop_mem_stream_open: Input Streams
chop_object_copy: Cloning an Object
chop_object_deserialize: Serializing and Deserializing an Object
chop_object_destroy: Instantiating an Object
chop_object_equal: Testing Objects for Equality
chop_object_initialize: Instantiating an Object
chop_object_serialize: Serializing and Deserializing an Object
chop_stream_preferred_block_size: Input Streams
chop_stream_read: Input Streams
chop_uuid_block_indexer_open: Block Indexers & Fetchers
chop_whole_stream_chopper_open: Stream Choppers

Footnotes

[1] In this document the terms abstract index and index handle are used interchangeably to describe indices are returned by a block indexer.

[2] For design reasons, there is actually one interface (class) for both the stream indexer and the stream fetcher, while the block indexer and block fetcher interfaces are represented by two separate classes. Conceptually, indexing and fetching are really two distinct functions in both cases. They are, however, dual functions, and may share code.

[3] CHK means “content-hash key”. See Stream Choppers, for details.

[4] This utility is installed only when GNU Guile 2.0 is available.

[5] Tuples can be thought of as capabilities—i.e., “unforgeable”, opaque bit strings that designate content and the right to access it. Tuples are “protected by sparsity”, whereas capabilities in operating systems or programming languages are typically “kernel-protected”.

[6] In other words, chop-backup --backup can be said to be referentially transparent, a notion functional programmers are familiar with. It also looks very scientific to present it this way, doesn't it?

[7] To determine whether a block is available, chop-archiver checks whether something is stored under the block's name without checking whether the contents are the same. It makes sense for content-based addressing, e.g., with hash_block_indexer and chk_block_indexer.

[8] Conversely, the Glib family of libraries achieves binary compatibility at the cost of removing the possibility for caller memory allocation: constructors do not only initialize an object, they also allocate its storage on the heap. The situation in C++ is often similar, but C++ has the advantage of having compiler support.

[9] This is similar to Scheme's equal? vs. eq? see R5RS Equivalence Predicates.

[10] Udi Manber, “Finding similar files in a large file system”, USENIX Winter Conference, 1994.