This document describes libchop version 0.5.2.
Libchop is a set of utilities and library for data backup and distributed storage. Its main application is an encrypted backup program with support for versioning, selective sharing, and adaptive compression (see Invoking chop-backup).
The library itself was initially designed as the basis for a peer-to-peer, cooperative backup system. In such a system, data has to be sent by pieces, incrementally, and it may be scattered across several participating nodes. In addition, participating nodes may be untrusted, which puts data confidentiality, integrity, and availability at risk.
In a nutshell, libchop addresses these issues by providing mechanisms to:
Libchop decomposes these tasks with a fine grain, providing generic interfaces for each of the sub-tasks involved. It also comes with several implementations of each of these interfaces, allowing various storage strategies to be quickly experimented and compared. Implementations of these interfaces provide storage techniques such as content-based addressing and single-instance storage, content-hash keys, lossless compression, and similarity detection.
The following section gives an overview of the various interfaces and how they fit together. The next one gives an insight on typical use cases of the techniques and algorithms implemented by libchop.
The libchop library defines several fine-grain interfaces for file chopping and indexing. Objects implementing those interfaces may be composed together to form a data storage pipeline: the input of this pipeline are the raw contents of a file, and its output is a set of data blocks and an index handle1 that suffices to restore the file contents from those blocks.
The interfaces that allow components to interact altogether within this framework are the following:
The data flow in a stream indexing pipeline will look like this:
,-- block indexer --. | ^ | v | | stream ----> chopper --+-> stream indexer --+-> block store
It is worth noting that the block indexer has a central position because it both stores data block directly to the block store, and produces meta-data (individual block indices) which is fed back to the stream indexer in order to be eventually stored and indexed as if it were “normal” data.
When doing the opposite operation, i.e., restoring a data stream, the data flow looks like this:
,-------------------. | | v | block store ----> block fetcher ----> stream indexer ----> byte stream
Here, the stream “indexer” (actually a stream fetcher2) is controlled by the user (who provided the index handle of the stream to be retrieved). The stream indexer, in turns, controls the block fetcher in order to retrieve individual blocks from the underlying block store.
For each of these interfaces, libchop provides several implementations. Therefore, the user can conveniently choose the pipeline configuration that fits them best.
Additionally, it is sometimes useful to filter input streams or
input/output blocks, pretty much in the same way one uses “pipes”
under Unix. Libchop provides a number of filters that may be used for
this purpose, such as zlib/bz2 unzip/zip filters, or public-key
ciphering filters. The user can then create a filtered stream (via
chop_filtered_stream_open ()
) or a filtered block store (via
chop_filtered_store_open ()
) in order to make use of them
(see Filters).
The storage pipeline can also be built and used from the command-line
using the chop-archiver
tool (see Invoking chop-archiver).
File chopping and indexing techniques are used in a wide range of applications, including distributed storage, peer-to-peer file sharing, archival, backup, and revision control. Each of these classes of application has specific requirements, and each particular design in these classes has its own particularities.
Libchop, as a library or using its stand-alone utilities, can be used to build such applications, as demonstrated by chop-backup (see Invoking chop-backup). For instance, incremental encrypted data backup to an untrusted site can be achieved this way:
# Mount the target FTP directory (using a GNU/Hurd translator; on # GNU/Linux, `curlftpfs' and `sshfs' can be used similarly.) $ settrans -ca backup-site /hurd/ftpfs ftp.example.com:backup $ chop-backup backup-site /pix/holidays (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...agfq2a/4a" ())
The long line displayed by chop-backup
is an index handle
or tuple that can eventually be passed to chop-backup
--restore
to restore the directory tree. Although the directory and
its contents are stored as a set of encrypted blocks, it can be
shared: knowing the index handle is necessary and sufficient to
retrieve the file. For more information, see Invoking chop-backup,
and Invoking chop-archiver.
Anonymous peer-to-peer file sharing systems like GNUnet and FreeNet both use the same particular configurations of the storage pipeline (see GNUnet). Roughly, they use a fixed-size stream chopper (e.g., in GNUnet 0.7, input files are chopped into 32 kB blocks) and some form of a CHK3 block indexer. GNUnet uses a tree stream indexer. In GNUnet, this whole encoding configuration is called ECRS. The Tahoe-LAFS distributed file system uses similar techniques, with the addition of erasure coding as the “chopper”, an option currently not available in libchop (see Tahoe-LAFS). Obviously, these projects use a completely distributed block store underneath.
The write-once read many (WORM) archival system Venti (used in Plan 9) uses a tree indexer and a content-based block indexer. Venti chops files into fixed-size blocks (see Venti).
Recent distributed version control systems (such as Monotone, Git, Mercurial, Bazaar, etc.) typically do not cut file into pieces. Therefore, they do not require a stream indexer since each input file yields one block. However, they index files using a content-based block indexer, which guarantees single-instance storage and integrity verification.
Rsync uses content-based indexing as well in order to reduce the amount of data to be transferred when synchronizing copies of a set of files across different machines over the network.
Libchop is modular, which makes it possible to combine various storage techniques such as those described above. For a detailed discussion of the trade-offs involved, Documents related to libchop.
Libchop comes with a number of utilities that exercise all its API.
The first one, chop-backup, is an encrypted backup application, with some bells and whistles.
The remaining tools offer a lower-level interface. Using them is actually a good way to become familiar with libchop's interfaces and their implementations.
The chop-archiver tool exposes all of libchop's storage
pipeline, which makes it the main entry point to libchop from the
command line. The chop-block-server
program implements a dumb
block server, which listens to remote procedure calls (RPCs) to
read and write data blocks.
chop-backup is an encrypted backup application4. It is designed to liberate users from the need to entrust their storage providers with data confidentiality, integrity, and availability. We believe this should make users less dependent on any storage providers, while making it easier to share storage capacity among distrustful people or organizations.
The tool has several salient features towards that goal:
To create a snapshot, type:
$ chop-backup --backup ~/bak ~/doc/important (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...mdfagfq2a/4a" ()) $ chop-backup --backup /net/backup-machine ~/doc/important (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...mdfagfq2a/4a" ())
The first command above creates a snapshot of the ~/doc/important directory and stores it under ~/bak, which we thereafter refer to as a store or block store. Upon completion it displays a tuple, which is the key that is necessary and sufficient to access the snapshot. The second command writes a new (possibly identical) snapshot to /net/backup-machine, which could be (say) an SSHFS mount point to some backup host. In both cases, only data not already available in the target store is actually written to it.
Since ~/doc/important was unchanged between the two runs, chop-backup returned the exact same tuple6. If the contents of that directory, or file meta-data such as permissions or modification time are changed, chop-backup returns a different tuple:
$ chop-backup --backup /mnt/usb-key ~/doc/important (() "tree_indexer:chk_block_fetcher:chk_index_handle:100:aes256,cbc,sha1...ssqijsady/4a" ())
Because the returned tuples are so important (and hard to type),
chop-backup caches a copy of them locally. They can be
accessed with the --recent
option:
$ chop-backup --recent /home/ludo/doc/important (() "tree_indexer...ssqijsady/4a" ()) /home/ludo/doc/important (() "tree_indexer...mdfagfq2a/4a" ())
However, the above command does not tell you in which store these backups are available. To check which of these recent backups a given store holds, try the --probe option:
$ chop-backup --probe /net/backup-machine /home/ludo/doc/important (() "tree_indexer...mdfagfq2a/4a" ()) $ chop-backup --probe /mnt/usb-key /home/ludo/doc/important (() "tree_indexer...ssqijsady/4a" ())
In this example, there are two stores, each of which holds one snapshot of the /home/ludo/doc/important directory.
To allow you to retrieve your data even if the host on which chop-backup is run crashes, it is recommended to backup the resulting tuple by other means—e.g., by sending it by mail.
The contents of a snapshot can be listed:
$ chop-backup --list ~/bak accessing most recent backup of `/home/ludo/doc/important' at `(() "tree_...gfq2a/4a" ())' d--------- 0 Jan 1 01:00 //previous-version// drwxr-xr-x 4096 Oct 6 22:18 chop -rw-r--r-- 32001 Oct 6 22:35 Makefile -rw-r--r-- 1451 Dec 8 10:23 Makefile.am -rw-r--r-- 59 Dec 8 10:23 .gitignore -rw-r--r-- 38221 Oct 6 22:20 Makefile.in
Since the command did not specify a tuple, chop-backup assumed that the user may be interested in the latest backup that was made and is available in ~/bak.
The special entry //previous-version//
above represents the
previous snapshot of that directory. The code --list=verbose
option would show its tuple, which could then be passed to
chop-backup --list to view specifically that version.
Complete directory trees can be restored:
$ chop-backup --restore ~/bak out
This command restores the latest snapshot and writes it to directory out, preserving file permissions and symbolic links.
For each action, chop-backup has a corresponding option, as seen above. The list of options is described below.
For this action, chop-backup must be passed at least two arguments: a file name for the backup store, and one or more directories to be backed up. This is the default action.
opts is a comma-separated list of long or one-letter options:
This is the default mode of operation.
chop-backup --check
reported some blocks as missing. It is much more CPU- and I/O-intensive
than fast.
For this action, chop-backup must be passed at least the file name of the store, and optionally the tuple of a directory snapshot.
opts is a comma-separated list of long or one-letter options:
For instance, chop-backup -lr,h ~/bak lists the contents of
the most recent backup, recursively traversing sub-directories and
previous directory versions.
git whatchanged
—i.e., one or several letters
indicate how a file was affected:
A
M
T
D
Renamed files are identified by an arrow indicating their old and new name.
opts can be a comma-separated list of these suboptions:
For this action, chop-backup must be passed the store's file
name, and optionally a tuple.
OK
, CORRUPT
, or MISSING
; MISSING
means that some of the data blocks that comprise the given file were not
found in the store.
For this action, chop-backup must be passed the store's file name, and optional a tuple pointing to a directory snapshot.
As with --restore
, all the file blocks are retrieved from the
store and passed through the storage pipeline, which deciphers them,
checks their integrity, and reassembles them.
For this action, chop-backup must be passed the store's file
name, an optional tuple, and the output directory name.
Under the hood, chop-backup uses content-hash keys—i.e., the
chk_block_indexer
—to provide symmetric encryption and
single-instance storage (see Block Indexers & Fetchers). The rest
of the storage pipeline, notably choppers and compression filters, is
chosen as a function of the file type (text file, already-compressed
file, unknown file type).
chop-archiver
The chop-archiver
tool is a storage client that exposes
most of the libchop programming interface at the command line. It
stores and retrieves individual files through the libchop storage
pipeline (see Overview). Unlike chop-backup, it does not
natively support directory traversal and versioning; instead, it offers
a lower-level interface. Users can choose each component of the
pipeline, thanks to libchop's introspection capabilities (see Object System).
The tool can be used in one of two modes:
--archive
or
--archive-fd
option, chop-archiver
stores the given file
and returns the resulting index (see Block Indexers & Fetchers):
$ chop-archiver --archive README tree_indexer:hash_block_fetcher:hash_index_handle:64:sha1:6jmb3wrbzyqpsngxybt4r4rxtctdo4o2/26 $ chop-archiver --archive-fd=5 5< README tree_indexer:hash_block_fetcher:hash_index_handle:64:sha1:6jmb3wrbzyqpsngxybt4r4rxtctdo4o2/26
The above command stored the file README in a local block store
under $HOME/.chop-archiver
and printed an index that can
later be used to restore the file.
--restore
option,
chop-archiver
retrieves data pointed to by the given index:
$ chop-archiver --restore \ tree_indexer:hash_block_fetcher:hash_index_handle:64:sha1:6jmb3wrbzyqpsngxybt4r4rxtctdo4o2/26
This command retrieves and decodes the data pointed to by the given index and prints it on the standard output. The integrity of each data block is checked and an error is returned if one of the blocks has been tampered with.
The available options allow the customization of the storage process:
chop-archiver --block-indexer sha256 -a README
The command above instructs chop-archiver
to store the given file
with content-based addressing, using sha256
hashes of individual
data blocks as their key.
tar cf - /secret/things | \ chop-archiver -i chk_block_indexer -I aes256,cbc,sha256,sha1 -A
This command uses content-hash keys (CHKs) to encrypt and name
individual data blocks. Data blocks are encrypted with the
aes256
symmetric cipher algorithm in cipher block chaining
mode; the encryption key is the sha256
hash of the cleartext data
block. Finally, the key associated with the resulting (encrypted) data
block is the sha1
hash of that block.
The following command specifies that the input data should be chopped into data blocks using an algorithm that chooses block boundaries so as to facilitate similarity detection, trying to obtain blocks of 512 bytes on average:
chop-archiver --chopper anchor_based_chopper --block-size 512 -a README
Finally, chop-archiver
can be told to use a remote block store
rather than a local on-disk store. In that case requests to read and
write data blocks are sent as remote procedure calls (RPCs) over
the network:
chop-archiver --remote `chop-store-discover -D1 | cut -f2` -a README
The above command sends data to whichever remote block store is found on
the local network (assuming Avahi was available at compile time). An
implementation of a remote block store server is the
chop-block-server
tool (see Invoking chop-block-server).
The long list of available options follows:
--restore
option.
chop-archiver
uses a block store under
$HOME/.chop-archiver
.
-I
. By default chop-archiver
uses an instance of the hash_block_indexer
class, which provides
content-based addressing without encryption.
-K
.
tls/tcp
, tcp
, udp
or
unix
) when communicating with the remote store specified with
--remote
. Note that tls/tcp
is only available when GnuTLS
was detected at configure time.
:
followed by a port
number.
By default, only blocks not already available in the block store are
written7. It makes
incremental storage very fast (e.g., when archiving a file that has
already been stored, or when storing a file that contains data similar
to what's already available in the store) and saves bandwidth when using
a remote block store.
zlib
,
bzip2
, or lzo
.
chop-block-server
The chop-block-server
tool is an ONC RPC server that serves
requests from the network to read and write data blocks. It it similar
in spirit to Plan 9's Venti (see Venti). By default, it
is completely oblivious to the encoding and naming scheme of incoming
data blocks: it doesn't know what block indexer was used to index them,
doesn't require content-based indexing, and of course doesn't know how
these data blocks relate to each other.
The server is invoked like this:
chop-block-server [options] local-block-store
This command starts a block store server listening to block store remote
procedure calls (RPCs) and applying them to the store from file
local-block-store (the type of store in this file is determined by
the --store
option; see below.) options can contain any
options from the following list:
tcp
or udp
.
chop-block-server
returns an error if there's already a block associated with that key and
that block differs from the new one.
zlib
, bzip2
, or
lzo
, for instance.
chop-show-similarities
The chop-show-similarities tool estimates the similarity between two files. It is an application of anchor-based choppers (see anchor-based choppers). It is invoked like this:
chop-show-similarities [options] file1 file2
Note that ssdeep may be a more efficient tool for the job.
chop-show-anchors
The chop-show-anchors shows the blocks yielded by the anchor-based chopper on a given input file (see anchor-based choppers). It is mostly meant as a demonstration or development tool to illustrate the behavior of the anchor-based chopper. It is run this way:
chop-show-anchors [options] file
This command runs an anchor-based chopper on file and shows the
contents of file, interspersed with ---
(three hyphens on a
line of their own) to show block boundaries.
This chapter is currently largely incomplete. Please refer to the documentation in the libchop header files.
Libchop comes with its own lightweight reflexive object system defined
in the <chop/objects.h>
header file. Given that it is a
reflexive object system, type information is available at run-time.
With little overhead, this object system greatly improves the
flexibility of the library. The chop-archiver
command line tool
illustrates how type introspection can be leveraged to provide more
flexibility (see Invoking chop-archiver).
Libchop's object system can be used to provide binary compatibility with little overhead: the size of a given class' instances can be known at run-time, which allows for caller allocation. Callers can allocate objects of a given class as they prefer (e.g., on the stack) while remaining binary-compatible should the class layout change.8
Let's have a closer look at it.
At run-time, libchop classes are represented by chop_class_t
objects. The canonical naming scheme for classes is the following.
Let chbouib
be a libchop class:
chop_chbouib_t
;
chop_class_t
object named chop_chbouib_class
.
Libchop defines macros to declare and define classes while maintaining the naming scheme in a consistent way:
Declare a class named name, that is declare the type
chop_NAME_t
and the global variablechop_NAME_class
. fields should be a list of fields (as in astruct
definition). Note that libchop classes must always inherit (directly or not) theobject
class.Example:
CHOP_DECLARE_RT_CLASS (chbouib, object, int x; float y;);
Once the class named name has been declared, this macro defines it as being a sub-class of the class named parent_name (this must be the same name as the one used in the declaration!), with constructor ctor and destructor dtor (both are optional), and with copy constructor copy and equality predicate equalp (both optional too). Finally, serial is a pointer to a serializer and dserial and pointer to a deserializer which may both be
NULL
as well.
See the <chop/objects.h>
for details.
Constructors, in libchop, do not allocate memory to store objects:
they simple initialize objects. Therefore, memory must be
allocated to store a given an object before its constructor is
called. Run-time type information turns out to be helpful here: each
chop_class_t
object contains information about the amount of
memory needed to store an instance of it.
Return the size, in bytes, of instances of class class.
Return a pointer to a memory area large enough to store an instance of class. This macro uses
alloca
to allocate storage on the stack (see see the GNU libc manual, for details).
Initialize object as an instance of class class, thus invoking the constructor of class and the one of its parent classes.
As a user, you will not usually call this function directly. Instead, you will call a particular constructor for class that can take additional arguments (e.g.,
chop_file_stream_open ()
) and which will in turn call this function behind the scenes.
Destroy object, that is deallocate resources associated to it. Note that it does not deallocate the memory where object is stored.
It is important to note that all libchop objects must be destroyed
using chop_object_destroy ()
, regardless of where they are
stored in memory.
Another way to instantiate an object is described in Cloning an Object.
Libchop's object system allows deep copies or cloning of objects. Classes, however, may choose to not implement this feature.
Clone the object pointed to by source into dest. As usual, dest must point to an (uninitialized) memory area large enough to hold an instance of source's class. If source's class does not implement cloning, then a default “shallow” copy is performed.
Note that, in any case, the copy constructors of the whole class hierarchy of source are invoked, starting from the higher one, just like for a regular instantiation (see Instantiating an Object).
Testing whether two objects are equal or not may be dependant on the implementation of their classes. Classes may implement an equality predicate that can be used two compare two instances9.
Return true (non-zero) if objects o1 and o2 are equal. If both objects are instances of the same class but that class does not define an equality predicate, they will only be considered equal if and only if o1
==
o2, from a pointer arithmetic viewpoint.
Classes in libchop may implement a serialization and a deserialization methods—these methods may be specified at class definition time (see Declaring and Defining a Class). A serialization method basically converts an object into a portable representation of its state as a byte stream. The deserialization operation does the opposite.
The chop_serial_method_t
type defines two standard
serialization/deserialization methods:
CHOP_SERIAL_ASCII
CHOP_SERIAL_BINARY
Serialize object according to serialization method method into buffer. If no serializer exists for object's class,
CHOP_ERR_NOT_IMPL
is returned. On success, zero is returned.
Initialize object, which is expected to be of type class, by deserializing buffer (of size bytes), according to method. On success, zero is returned and object is initialized. Otherwise, object is left in an undefined state. On success, bytes_read is set to the number of bytes that were read from buffer in order to deserialize object.
The chop_stream_t
type provides a simple interface to input byte
streams. Data is read on-demand using the chop_stream_read ()
function:
Read at most size bytes from stream into buffer. On failure, return an error code (non-zero). Otherwise, return in read the number of bytes actually read.
Return the preferred block size for reading stream. This gives callers, such as choppers, a hint, which may improve read performance.
Several classes implement chop_stream_t
:
Classes that inherit from chop_stream_class.
These classes implement input streams backed by files, in-memory byte arrays, or by another stream passed through a filter (see Filters).
To create instances of these classes, use the constructors below.
Open file located at path and initialize stream as a file stream representing this file. stream has to point to a large-enough memory area to hold an object whose class is
chop_file_stream_class
.
Open a memory-backed stream, i.e., a stream whose input is read from base which is size byte-long. If free_func is not
NULL
, it is called upon closing stream.
Initialize stream as a filtered stream that reads input data from backend through filter. bps defines the semantics of stream as a proxy of backend (whether backend should eventually be destroyed, etc.). Similarly, if owns_filter is true, then closing stream will destroy filter.
Choppers take a byte stream as input and return a series of blocks, which are subsets of the input stream. Their name comes from the fact that they “chop” the input stream into smaller blocks. They are defined in chop/choppers.h.
This meta-class—i.e., a class that inherits from chop_class_class—defines the
chop_chopper_generic_open
method to instantiate choppers (see below.)
Initialize chopper as an instance of the class chopper class with input stream input. Return zero on success.
The implementation of class will choose default parameters for the specific chopper implementation (class-specific initialization methods are described below for fine-tuning of parameters). However, it should make sure that the average block size will be typical_block_size bytes. If typical_block_size is zero, then the implementation of class is free to choose any block size.
Return the input stream attached to chopper.
Read a block from chopper and store its contents into block. On success, block contains the exact contents of the block—i.e., the contents are "pushed"—, size contains the size of the block, and zero is returned. On end of stream,
*
size is set to zero andCHOP_STREAM_END
is returned.
Return the “typical” size of the blocks produced by chopper. The meaning of “typical” actually depends on the chopper implementation. The value returned can be used as a hint for the initial size of block buffers.
Deallocate any resources associated with chopper. Once closed, chopper becomes unusable.
The chop_chopper_t
interface described above is implemented by
several classes:
Classes that inherit from chop_chopper_class. All of these support the chop_chopper_generic_open method described above.
The first class implements the fixed-size chopper, a chopper that returns fixed-size blocks:
Initialize chopper as a fixed-size stream chopper. It will read data from input and cut it into blocks of block_size bytes. If pad_blocks is non-zero, the last block before
CHOP_STREAM_END
will be padded with zeros to be block_size long.
The second class implements anchor-based choppers. They return blocks of variable size whose boundaries are determined as a function of the input stream. The intent of this chopper is to help detect similarity among subsets of files; using single-instance storage, identical blocks are stored only once, which saves storage space.
Initialize chopper as an anchor-based stream chopper. It will read data from input and produce variably sized blocks.
Anchor-based choppers implement the algorithm described by Udi Manber10, which itself relies on Rabin fingerprints.
Basically, this algorithm allows anchors to be defined deterministically within a file such that identical blocks among similar files may be discovered. window_size is the size of the sliding window used to compute fingerprints. The paper recommends 50. Lower values may yield smaller blocks and better similarity detection.
The probability of finding a block boundary, and therefore the average size of blocks, largely depends on the value of magic_fpr_mask, the mask that will be applied to each fingerprint to determine whether it should yield a block boundary. The more bits are set in magic_fpr_mask, the less likely a fingerprint will match, and the larger the average block size will be.
An application of anchor-based choppers is chop-show-similarities (see Invoking chop-show-similarities). To see how block boundaries are determined, use the chop-show-anchors command (see Invoking chop-show-anchors).
Finally, the last class implements whole-stream choppers, which do not actually chop the input stream:
Initialize chopper as a whole-stream chopper which fetches data from input. chopper will simply return one single block containing the entire contents of input.
Consequently, this can be costly in terms of memory consumption. Also, chop_whole_stream_chopper_class does not honor at all the typical_block_size argument of
chop_chopper_generic_open
.
Block indexers take a chunk of data and a block store, add that block to the block store (possibly after processing it), and return an index handle that names this block. The dual of block indexers are block fetchers: given an index handle and a block store, they return the original chunk of data.
The type of a block indexer or fetcher, respectively.
The classes associated with a block indexer can be accessed with the following functions.
Return the block fetcher class associated with the class of indexer.
Return the index handle class associated with the class of indexer.
Initialize fetcher as a block fetcher corresponding to block indexer indexer. fetcher must point to a memory area large enough to contain an instance of the fetcher class associated to indexer's class.
Storing and retrieving a chunk of data is achieved using the methods below.
Using indexer, index the size-byte long data block pointed to by buffer to store. On success, return zero and initialize handle with an index handle sufficient to restore that block from store using the corresponding fetcher. Otherwise, return an error.
Using fetcher, fetch from store the data block whose index handle is handle. On success, fill in buffer with its contents and set size to its size in bytes.
Using fetcher, check whether the block pointed to by handle is available. On success, set
*
exists to 1 if it's available in store, 0 if it's not available; otherwise return an error.
The block indexer, block fetcher, and index handle interfaces described above have several implementations.
Classes that inherit from chop_block_indexer_class.
Classes that inherit from chop_block_fetcher_class, and are the dual of the above.
The most important block indexer classes are hash
and chk
,
described below. Both use cryptographic hashes to determine a the key
under which a block is stored. Consequently, they provide the
single-instance storage property: a given chunk of data is always
stored only once in the block store.
Initialize indexer as a hash block indexer. indexer will use hash_method to compute the identifier of the given block and will use that identifier when storing the block. It leaves the block contents unchanged. This technique is known as content-based addressing, or compare-by-hash.
Initialize block_indexer as a content-hash key (CHK) block indexer. block_indexer will cipher data blocks using cipher_handle and the block's hash yielded by key_hash_method (this is symmetric ciphering). Finally, block_indexer will compute the block's identifier using block_id_hash_method.
This technique is known as convergent encryption, and the index yielded is sometimes referred to as a content-hash key (in GNUnet/FreeNet terminology.) For more details, References.
Initialize block_indexer as a UUID block indexer. In other words, block_indexer will then yield DCE-compatible Universally Unique Identifiers for each block, using
libuuid
(provided it was available at compilation time).Therefore, block_indexer will not have the single-instance storage property, unlike the
chk
andhash
block indexers.
Initialize indexer as an indexer that will simply return consecutive 32-bit integers as block IDs, starting from start.
Such indexers are mostly useful for benchmarking. They lack the single-instance storage property, and can actually overwrite previous blocks given that the returned block keys are not universally unique.
Libchop can be used in Scheme programs written for GNU Guile 2.0
(see GNU Guile). Each C
header file has a corresponding Guile module—e.g., the (chop
indexers)
module corresponds to <chop/indexers.h>
. The
(chop)
module is a composite module that aggregates all the
sub-modules.
The example below shows how one would write a procedure that, given a file name, stores and encrypts the given file and returns an opaque index handle (see index handle).
(use-modules (chop)) (define (chop/encrypt file store) (let* ((input (file-stream-open file)) (i (tree-indexer-open)) (c (fixed-size-chopper-open input 777)) (bi (chk-block-indexer-open (make-cipher cipher-algorithm/aes256 cipher-mode/cbc) hash-method/sha256 hash-method/sha1))) (indexer-index-blocks i c bi store store)))
To store an encrypted copy of a file, this procedure can be invoked as follows:
(let ((s (file-based-block-store-open (lookup-class "gdbm_block_store") "encrypted-data.gdbm" (logior O_RDWR O_CREAT) #o644))) (chop/encrypt "secret-file.txt" s)) ⇒ #<chop chk_index_handle 346d1c0 (3475000)>
In this example, secret-file.txt is chopped into blocks of 777 bytes (see Stream Choppers). Those blocks are then encrypted and stored in encrypted-data.gdbm, a GNU dbm key/value store (see Intro).
The returned index handle can be serialized as a string and stashed, for later reuse:
(serialize-object/ascii (chop/encrypt "secret-file.txt" s)) ⇒ "ci4r7xyktacal3rqrf6wpk5kitripgrv3cxvllarepzzsus5mmuq====,6v4nej5hjmtkvcinlawcf24wnq4zplm7/64a"
Getting back an index handle from such a string is done this way:
(let ((str "ci4r7xyktacal3rqrf6wpk5kitripgrv3cxvllarepzzsus5mmuq====,6v4nej5hjmtkvcinlawcf24wnq4zplm7/64a")) (deserialize-object/ascii (lookup-class "chk_block_indexer") str)) ⇒ #<chop chk_index_handle 284ee40 (2854000)>
Finally, the procedure below returns a Scheme port from the given index handle and block store:
(define (decrypt/assemble index store) (let* ((i (tree-indexer-open)) (bi (chk-block-indexer-open (make-cipher cipher-algorithm/aes256 cipher-mode/cbc) hash-method/sha256 hash-method/sha1)) (bf (block-indexer-fetcher bi))) (stream->port (indexer-fetch-stream i index bf store store))))
The data pointed to by index in store can be obtained by reading from the returned port.
And voilà!
The API reference is not written yet, but all the Scheme procedures come with a docstring, and can be easily browsed using Geiser (see Documentation helpers).
Free software related to libchop:
Documents related to libchop:
Other readings:
Copyright © 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc. http://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
The purpose of this License is to make a manual, textbook, or other functional and useful document free in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.
This License is a kind of “copyleft”, which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.
This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The “Document”, below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as “you”. You accept the license if you copy, modify or distribute the work in a way requiring permission under copyright law.
A “Modified Version” of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A “Secondary Section” is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The “Invariant Sections” are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none.
The “Cover Texts” are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words.
A “Transparent” copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not “Transparent” is called “Opaque”.
Examples of suitable formats for Transparent copies include plain ascii without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats include proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only.
The “Title Page” means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, “Title Page” means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.
The “publisher” means any person or entity that distributes copies of the Document to the public.
A section “Entitled XYZ” means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the Title” of such a section when you modify the Document means that it remains a section “Entitled XYZ” according to this definition.
The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer-network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section Entitled “Endorsements”, provided it contains nothing but endorsements of your Modified Version by various parties—for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled “History” in the various original documents, forming one section Entitled “History”; likewise combine any sections Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all sections Entitled “Endorsements.”
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an “aggregate” if the copyright resulting from the compilation is not used to limit the legal rights of the compilation's users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document's Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “History”, the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.
You may not copy, modify, sublicense, or distribute the Document except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, or distribute it is void, and will automatically terminate your rights under this License.
However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.
Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, receipt of a copy of some or all of the same material does not give you any rights to use it.
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License “or any later version” applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation. If the Document specifies that a proxy can decide which future versions of this License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Document.
“Massive Multiauthor Collaboration Site” (or “MMC Site”) means any World Wide Web server that publishes copyrightable works and also provides prominent facilities for anybody to edit those works. A public wiki that anybody can edit is an example of such a server. A “Massive Multiauthor Collaboration” (or “MMC”) contained in the site means any set of copyrightable works thus published on the MMC site.
“CC-BY-SA” means the Creative Commons Attribution-Share Alike 3.0 license published by Creative Commons Corporation, a not-for-profit corporation with a principal place of business in San Francisco, California, as well as future copyleft versions of that license published by that same organization.
“Incorporate” means to publish or republish a Document, in whole or in part, as part of another Document.
An MMC is “eligible for relicensing” if it is licensed under this License, and if all works that were first published under this License somewhere other than this MMC, and subsequently incorporated in whole or in part into the MMC, (1) had no cover texts or invariant sections, and (2) were thus incorporated prior to November 1, 2008.
The operator of an MMC Site may republish an MMC contained in the site under CC-BY-SA on the same site at any time before August 1, 2009, provided the MMC is eligible for relicensing.
To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:
Copyright (C) year your name. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled ``GNU Free Documentation License''.
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this:
with the Invariant Sections being list their titles, with the Front-Cover Texts being list, and with the Back-Cover Texts being list.
If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.
chop-archiver
: Invoking chop-archiverchop-block-server
: Invoking chop-block-serverchop_anchor_based_chopper_init
: Stream Chopperschop_block_fetcher_exists
: Block Indexers & Fetcherschop_block_fetcher_fetch
: Block Indexers & Fetcherschop_block_indexer_fetcher_class
: Block Indexers & Fetcherschop_block_indexer_index
: Block Indexers & Fetcherschop_block_indexer_index_handle_class
: Block Indexers & Fetcherschop_block_indexer_initialize_fetcher
: Block Indexers & Fetcherschop_chk_block_indexer_open
: Block Indexers & Fetcherschop_chopper_close
: Stream Chopperschop_chopper_generic_open
: Stream Chopperschop_chopper_read_block
: Stream Chopperschop_chopper_stream
: Stream Chopperschop_chopper_typical_block_size
: Stream Chopperschop_class_alloca_instance
: Instantiating an Objectchop_class_instance_size
: Instantiating an ObjectCHOP_DECLARE_RT_CLASS
: Declaring and Defining a ClassCHOP_DEFINE_RT_CLASS
: Declaring and Defining a Classchop_file_stream_open
: Input Streamschop_filtered_store_open
: Overviewchop_filtered_stream_open
: Input Streamschop_filtered_stream_open
: Overviewchop_fixed_size_chopper_init
: Stream Chopperschop_hash_block_indexer_open
: Block Indexers & Fetcherschop_integer_block_indexer_open
: Block Indexers & Fetcherschop_mem_stream_open
: Input Streamschop_object_copy
: Cloning an Objectchop_object_deserialize
: Serializing and Deserializing an Objectchop_object_destroy
: Instantiating an Objectchop_object_equal
: Testing Objects for Equalitychop_object_initialize
: Instantiating an Objectchop_object_serialize
: Serializing and Deserializing an Objectchop_stream_preferred_block_size
: Input Streamschop_stream_read
: Input Streamschop_uuid_block_indexer_open
: Block Indexers & Fetcherschop_whole_stream_chopper_open
: Stream Choppers[1] In this document the terms abstract index and index handle are used interchangeably to describe indices are returned by a block indexer.
[2] For design reasons, there is actually one interface (class) for both the stream indexer and the stream fetcher, while the block indexer and block fetcher interfaces are represented by two separate classes. Conceptually, indexing and fetching are really two distinct functions in both cases. They are, however, dual functions, and may share code.
[3] CHK means “content-hash key”. See Stream Choppers, for details.
[4] This utility is installed only when GNU Guile 2.0 is available.
[5] Tuples can be thought of as capabilities—i.e., “unforgeable”, opaque bit strings that designate content and the right to access it. Tuples are “protected by sparsity”, whereas capabilities in operating systems or programming languages are typically “kernel-protected”.
[6] In other words, chop-backup --backup can be said to be referentially transparent, a notion functional programmers are familiar with. It also looks very scientific to present it this way, doesn't it?
[7] To determine whether a block is available,
chop-archiver checks whether something is stored under
the block's name without checking whether the contents are the same. It
makes sense for content-based addressing, e.g., with
hash_block_indexer
and chk_block_indexer
.
[8] Conversely, the Glib family of libraries achieves binary compatibility at the cost of removing the possibility for caller memory allocation: constructors do not only initialize an object, they also allocate its storage on the heap. The situation in C++ is often similar, but C++ has the advantage of having compiler support.
[9] This is similar to Scheme's equal?
vs. eq?
see R5RS Equivalence Predicates.
[10] Udi Manber, “Finding similar files in a large file system”, USENIX Winter Conference, 1994.