view cpack/dml/scripts/compression/README @ 0:718306e29690 tip

commiting public release
author Daniel Wolff
date Tue, 09 Feb 2016 21:05:06 +0100
# Delta compression

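Scripts in dml-cliopatrial/cpack/dml/scripts/compression provide a common interface
to several delta compression programs. Each script reads its input on stdin, writes its
output to stdout, and takes the mode (encode or decode) and the name of a reference file
as arguments: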

	stdin --->  [ <script name>  (encode|decode) <name of reference file> ] ---> stdout

The following scripts work this way:

	zbs - uses bsdiff
	zxd - uses xdelta3
	zvcd - uses open-vcdiff
	zvcz - uses vczip
	zdiff - converts binary to text and uses diff to produce an ed script
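
As an illustration of this interface, here is a minimal sketch of what a wrapper around xdelta3 might look like. It is not the repository's zxd script: the function name `zxd_sketch` is invented here, and the choice of xdelta3 options (`-e`/`-d` to encode or decode, `-c` to write to stdout, `-s` to name the source file) is an assumption about how such a wrapper could be written.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the common interface; not the original zxd script.
# Usage: zxd_sketch (encode|decode) <name of reference file>
# Reads the data on stdin and writes the result to stdout.
zxd_sketch() {
    local mode=$1 ref=$2
    case $mode in
        # Produce a delta of stdin relative to the reference file.
        encode) xdelta3 -e -c -s "$ref" ;;
        # Rebuild the original data from a delta on stdin and the reference.
        decode) xdelta3 -d -c -s "$ref" ;;
        *)
            echo "usage: zxd_sketch (encode|decode) <reference file>" >&2
            return 2
            ;;
    esac
}
```

With this in place, `zxd_sketch encode ref.bin < new.bin | zxd_sketch decode ref.bin` should reproduce new.bin, where ref.bin and new.bin are placeholder file names.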

# bufs

The bufs script lets a command that expects a filename as its nth argument be given a
stream instead. Running

	$ bufs <n> <command> <arg1> ... <argn> ...

allows <argn> to be a bash process substitution, even if <command> reads that
file several times. bufs works by first copying the stream named by the nth argument to a
temporary file and then passing that file to <command> in its place.
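
The buffering idea can be sketched in bash as follows. This is a hypothetical re-implementation for illustration only (the name `bufs_sketch` is invented here), not the repository's bufs script, and it assumes `mktemp` is available.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the idea behind bufs; not the original script.
# Usage: bufs_sketch <n> <command> <arg1> ... <argn> ...
bufs_sketch() {
    local n=$1
    shift
    # cmd[0] is the command itself, so cmd[n] is its nth argument.
    local -a cmd=("$@")
    local tmp status
    tmp=$(mktemp) || return 1
    # Drain the stream once into a regular file that can be reopened freely.
    cat -- "${cmd[n]}" > "$tmp" || { rm -f "$tmp"; return 1; }
    cmd[n]=$tmp
    "${cmd[@]}"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For instance, `bufs_sketch 2 zxd encode <(findcat ~/lib/kern/lorraine)` would run zxd with a temporary file in place of the process substitution.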


# findcat

findcat dumps the contents of every file under a given directory to stdout.
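
A minimal sketch of the same behaviour, assuming a POSIX `find`, is shown below. The name `findcat_sketch` is invented here, and the order in which files are emitted is whatever `find` happens to produce, which may differ from the original script.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for findcat; not the original script.
# Usage: findcat_sketch <directory>
findcat_sketch() {
    # Concatenate every regular file below the directory onto stdout.
    find "$1" -type f -exec cat {} +
}
```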

# Examples

For example, to estimate the conditional Kolmogorov complexity of all the Humdrum files in
~/lib/kern/ireland given those in ~/lib/kern/lorraine, using xdelta3, run

	$ findcat ~/lib/kern/ireland | bufs 2 zxd encode <(findcat ~/lib/kern/lorraine) | length

The encode and decode scripts call bufs themselves, so an alternative is

	$ findcat ~/lib/kern/ireland | encode zxd <(findcat ~/lib/kern/lorraine) | length

A better estimate strips comments from both corpora first:

	$ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | length

Sometimes the output can be compressed still further:

	$ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | lzma | length

rid -G is a Humdrum command that removes global comments.
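
The pipelines above end in `length`, which this README does not describe. Assuming it simply reports the number of bytes arriving on stdin, a stand-in could look like this (the name `length_sketch` is invented here):

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the length command used in the examples,
# assuming it reports the size of stdin in bytes.
length_sketch() {
    # awk strips the padding that some wc implementations add.
    wc -c | awk '{ print $1 }'
}
```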


# zzcd and zzd

The scripts zzd and zzcd implement more complex schemes, in which the input and the reference
are concatenated and/or compressed before being delta compressed. For example,

	$ findcat ~/lib/kern/ireland | rid -G | bufs 3 zzcd lzma zxd  <(findcat ~/lib/kern/lorraine | rid -G) | length

computes (using a more functional notation)

	length( zxd( lzma(lorraine), lzma(lorraine+ireland)))

that is, the amount of information needed to transform one LZMA-compressed corpus
into the LZMA-compressed concatenation of the two corpora.
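
The functional notation above can be spelled out as a short bash sketch. This is an illustration of the formula rather than the repository's zzcd script: the name `zzcd_sketch` is invented here, the compressor is assumed to filter stdin to stdout, and the delta script is assumed to follow the (encode|decode) interface described earlier.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the repository's zzcd script.
# Usage: zzcd_sketch <compressor> <delta script> <reference file>
# The input corpus is read from stdin.
zzcd_sketch() {
    local compress=$1 delta=$2 ref=$3 tmp status
    tmp=$(mktemp) || return 1
    # compress(reference). The reference is read here and again below,
    # so it has to be a real file rather than a one-shot stream.
    "$compress" < "$ref" > "$tmp" || { rm -f "$tmp"; return 1; }
    # compress(reference + input), delta-encoded against compress(reference).
    cat -- "$ref" - | "$compress" | "$delta" encode "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Because the reference is opened twice, passing it as a process substitution only works when the call is wrapped in bufs, as in the example above.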

# dlzma

This is a program written in C using liblzma (part of the XZ Utils package) to estimate the conditional
complexity of an object. It relies on the LZMA_SYNC_FLUSH feature of liblzma. The compressed data is
discarded and only the number of bits used is written to stdout.
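
A rough shell approximation of a similar quantity, which is not dlzma itself and does not use sync flushing, is to compress the reference alone and the reference followed by the object with `xz`, then take the difference. The name `cond_bits_sketch` is invented here. The result is coarse, since it includes container overhead and padding, and for very small inputs it may even be zero or negative.

```shell
#!/usr/bin/env bash
# Hypothetical approximation; not the dlzma program.
# Usage: cond_bits_sketch <reference file> <object file>
# Prints an estimate, in bits, of the cost of the object given the reference.
cond_bits_sketch() {
    local ref=$1 obj=$2 ref_bytes both_bytes
    # Compressed size of the reference on its own.
    ref_bytes=$(xz -c < "$ref" | wc -c)
    # Compressed size of the reference followed by the object.
    both_bytes=$(cat -- "$ref" "$obj" | xz -c | wc -c)
    # The extra bytes are attributed to the object; convert to bits.
    echo $(( (both_bytes - ref_bytes) * 8 ))
}
```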