annotate cpack/dml/scripts/compression/README @ 0:718306e29690 tip

commiting public release
author Daniel Wolff
date Tue, 09 Feb 2016 21:05:06 +0100
# Delta compression

Scripts in dml-cliopatrial/cpack/dml/scripts/compression provide a common interface
to several delta compression programs. The interface is

    stdin ---> [ <script name> (encode|decode) <name of reference file> ] ---> stdout

The following scripts work this way:

    zbs   - uses bsdiff
    zxd   - uses xdelta3
    zvcd  - uses open-vcdiff
    zvcz  - uses vczip
    zdiff - converts binary to text and uses diff to produce an ed script
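
To illustrate the interface, here is a minimal sketch of what the encode half of a zdiff-like script might look like. It is not the repository's zdiff. The function names `hexlines` and `zdiff_encode_sketch` are hypothetical, and the one-hex-byte-per-line text conversion is an assumption, since the actual conversion is not described here.

```shell
# Hypothetical helper: turn a binary stream into text, one hex byte per line.
hexlines() {
    od -An -v -tx1 | tr -s ' ' '\n' | sed '/^$/d'
}

# Hypothetical encoder with the shape: stdin -> encode <reference file> -> stdout.
zdiff_encode_sketch() {
    zd_ref=$1
    zd_old=$(mktemp) || return 1
    zd_new=$(mktemp) || { rm -f "$zd_old"; return 1; }
    hexlines < "$zd_ref" > "$zd_old"
    hexlines > "$zd_new"
    # diff -e prints an ed script that turns the reference text into the input text.
    # diff exits with status 1 when the inputs differ, which is the normal case here.
    diff -e "$zd_old" "$zd_new"
    zd_status=$?
    rm -f "$zd_old" "$zd_new"
    [ "$zd_status" -le 1 ]
}
```

Piping the output through a byte counter then gives the size of the delta. A matching decoder would apply the ed script to the converted reference and convert the result back to binary.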

# bufs

The bufs script runs an arbitrary command that expects a filename as its nth argument,
so that

    $ bufs <n> <command> <arg1> ... <argn> ...

can be run with <argn> given as a bash process substitution, even if <command> reads that
source several times. bufs works by buffering the stream named by the nth argument in a
temporary file and passing that file to <command> instead.
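
The buffering idea can be sketched in a few lines of POSIX shell. This is an illustrative re-implementation, not the actual bufs script. The name `bufs_sketch`, the use of `mktemp`, and the cleanup behaviour are all assumptions.

```shell
# Hypothetical re-implementation of the idea behind bufs (not the actual script).
# Usage: bufs_sketch <n> <command> <arg1> ... <argn> ...
bufs_sketch() {
    bs_n=$1
    shift
    bs_tmp=$(mktemp) || return 1
    bs_i=0
    # Rebuild the argument list in place. "$@" is expanded once when the loop
    # starts, so resetting it inside the loop is safe. Index 0 is the command.
    for bs_arg in "$@"; do
        [ "$bs_i" -eq 0 ] && set --
        if [ "$bs_i" -eq "$bs_n" ]; then
            # Drain the stream (FIFO, /dev/fd/N, ...) into a regular file.
            cat "$bs_arg" > "$bs_tmp"
            set -- "$@" "$bs_tmp"
        else
            set -- "$@" "$bs_arg"
        fi
        bs_i=$((bs_i + 1))
    done
    "$@"
    bs_status=$?
    rm -f "$bs_tmp"
    return "$bs_status"
}
```

Because the command receives a regular file, it can open and read it as many times as it likes, which a pipe or process substitution does not allow.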

# findcat

findcat dumps the contents of every file under a given directory to stdout.
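
A rough stand-in for this behaviour, assuming that "under" means a recursive walk over regular files, could look like the following. The name `findcat_sketch` is hypothetical, and the real script may differ, for example in traversal order.

```shell
# Hypothetical stand-in for findcat (not the actual script).
# Usage: findcat_sketch <directory>
findcat_sketch() {
    # -type f keeps regular files only; -exec ... + batches them into few cat calls.
    find "$1" -type f -exec cat {} +
}
```

The order in which find visits files is not specified, so two runs over the same tree on different systems may concatenate the files in a different order.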

# Examples

For example, to estimate the conditional Kolmogorov complexity of all the Humdrum files in
~/lib/kern/ireland given those in ~/lib/kern/lorraine, using xdelta3, do

    $ findcat ~/lib/kern/ireland | bufs 2 zxd encode <(findcat ~/lib/kern/lorraine) | length

The encode and decode scripts include bufs, so an alternative is

    $ findcat ~/lib/kern/ireland | encode zxd <(findcat ~/lib/kern/lorraine) | length

A better estimate strips comments from both corpora first:

    $ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | length

Sometimes the output can be compressed still further:

    $ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | lzma | length

rid -G is a Humdrum command that removes global comments.

# zzcd and zzd

The scripts zzd and zzcd implement more complex schemes, in which the input and the reference
are concatenated and/or compressed before being delta compressed. For example,

    $ findcat ~/lib/kern/ireland | rid -G | bufs 3 zzcd lzma zxd <(findcat ~/lib/kern/lorraine | rid -G) | length

computes (using a more functional notation)

    length( zxd( lzma(lorraine), lzma(lorraine+ireland)))

that is, the amount of information needed to transform one LZMA-compressed corpus
into the LZMA-compressed concatenation of the two corpora.
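
The data flow in that formula can be sketched as a small shell function. This is an illustration only, not the actual zzcd script. The name `zzcd_sketch` is hypothetical, and it assumes the compressor filters stdin to stdout and that the delta script follows the `encode <reference file>` interface described earlier.

```shell
# Hypothetical sketch of the zzcd data flow (not the actual script).
# Usage: zzcd_sketch <compressor> <delta script> <reference file> < input
zzcd_sketch() {
    zz_comp=$1
    zz_delta=$2
    zz_ref=$3
    zz_cref=$(mktemp) || return 1
    # Compress the reference on its own: compressor(reference).
    "$zz_comp" < "$zz_ref" > "$zz_cref"
    # Compress the reference followed by stdin, then delta-encode the result
    # against the compressed reference.
    cat "$zz_ref" - | "$zz_comp" | "$zz_delta" encode "$zz_cref"
    zz_status=$?
    rm -f "$zz_cref"
    return "$zz_status"
}
```

In this sketch the reference is read twice, so it has to be a regular file rather than a pipe. That would be consistent with the example above wrapping zzcd in `bufs 3`.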

# dlzma

This is a program written in C using liblzma (part of the XZ Utils package) to estimate the
conditional complexity of an object. It works by using the SYNC_FLUSH feature of liblzma
(the LZMA_SYNC_FLUSH action). The compressed data is discarded and only the number of bits
used is written to stdout.
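
A cruder estimate of the same quantity can be had from the shell with a different technique: compress the reference alone, compress the reference followed by the object, and take the difference in size. The sketch below does not use the sync-flush mechanism and counts bytes rather than bits, so it only approximates what dlzma reports. The name `cond_size_sketch` is hypothetical, and the compressor is assumed to be any command (such as xz or gzip) that filters stdin to stdout.

```shell
# Hypothetical approximation of a conditional-complexity estimate (not dlzma).
# Usage: cond_size_sketch <compressor> <reference file> <object file>
cond_size_sketch() {
    cs_comp=$1
    cs_ref=$2
    cs_obj=$3
    # Compressed size of the reference alone.
    cs_ref_size=$("$cs_comp" < "$cs_ref" | wc -c)
    # Compressed size of the reference followed by the object.
    cs_both_size=$(cat "$cs_ref" "$cs_obj" | "$cs_comp" | wc -c)
    # The difference is the extra cost of the object given the reference.
    echo $(($cs_both_size - $cs_ref_size))
}
```

The difference also absorbs fixed container overhead and any limit on how far back the compressor can look, so results are best compared across runs with the same compressor and settings.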