# Delta compression

Scripts in dml-cliopatrial/cpack/dml/scripts/compression provide a common interface
to several delta compression programs. The interface is

stdin ---> [ <script name> (encode|decode) <name of reference file> ] ---> stdout

The following scripts work this way:

zbs - uses bsdiff
zxd - uses xdelta3
zvcd - uses open-vcdiff
zvcz - uses vczip
zdiff - converts binary to text and uses diff to produce an ed script

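As an illustration of this interface, a wrapper around xdelta3 could be sketched as the shell function below. This is not the actual zxd source: the name zxd_sketch is hypothetical, and the sketch assumes an xdelta3 binary on the PATH, relying on its -e and -d (encode and decode), -c (write to stdout) and -s (source file) options.

```shell
# Hypothetical sketch of a script following the interface
#   stdin ---> [ zxd_sketch (encode|decode) <reference file> ] ---> stdout
# Not the real zxd; assumes xdelta3 is installed.
zxd_sketch() {
    if [ "$#" -ne 2 ]; then
        echo "usage: zxd_sketch (encode|decode) <reference file>" >&2
        return 2
    fi
    case "$1" in
        encode) xdelta3 -e -c -s "$2" ;;   # stdin: target data, stdout: delta
        decode) xdelta3 -d -c -s "$2" ;;   # stdin: delta, stdout: target data
        *)
            echo "usage: zxd_sketch (encode|decode) <reference file>" >&2
            return 2
            ;;
    esac
}
```

With such a wrapper, piping data through `zxd_sketch encode ref` and then `zxd_sketch decode ref` should reproduce the original data, provided the same reference file is used in both directions.
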
# bufs

The bufs script runs an arbitrary command so that, if the command expects a
filename as its nth argument, then

$ bufs <n> <command> <arg1> ... <argn> ...

can be run with <argn> as a bash process substitution, even if <command> reads that
source several times. bufs works by buffering the stream given as the nth argument to a
temporary file.

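The buffering idea can be sketched in a few lines of shell. The function name bufs_sketch is hypothetical and this is not the real bufs script: it copies whatever the nth argument points at into a temporary file, substitutes that file's name for the argument, runs the command, and removes the file afterwards. It assumes mktemp is available.

```shell
# Hypothetical sketch of the buffering idea; not the real bufs script.
# Usage: bufs_sketch <n> <command> <arg1> ... <argn> ...
bufs_sketch() {
    n=$1
    cmd=$2
    shift 2
    tmp=$(mktemp) || return 1
    i=0
    # The argument list is expanded once before the loop starts, so the
    # positional parameters can be rebuilt safely inside it.
    for arg in "$@"; do
        if [ "$i" -eq 0 ]; then
            set --
        fi
        i=$((i + 1))
        if [ "$i" -eq "$n" ]; then
            cat "$arg" > "$tmp"      # drain the one-shot stream into a file
            set -- "$@" "$tmp"       # pass the file name in its place
        else
            set -- "$@" "$arg"
        fi
    done
    "$cmd" "$@"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

In bash, given a helper `twice() { cat "$1"; cat "$1"; }`, running `bufs_sketch 1 twice <(echo hi)` prints the line twice, whereas calling `twice <(echo hi)` directly prints it only once because the stream is exhausted after the first read.
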

# findcat

findcat dumps the contents of every file under a given directory to stdout.

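A minimal stand-in with the behaviour described above can be written with find. The name findcat_sketch is hypothetical, and the real findcat may differ, for instance in the order in which files are visited or in how it treats symbolic links.

```shell
# Hypothetical stand-in for findcat; the real script may behave differently.
findcat_sketch() {
    # -type f selects regular files only; "-exec cat {} +" passes many
    # file names to each cat invocation instead of one at a time.
    find "$1" -type f -exec cat {} +
}
```

The order in which find visits files is unspecified, so the concatenated stream (and therefore any compressed length derived from it) may vary between file systems.
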
# Examples

For example, to estimate the conditional Kolmogorov complexity (K.C.) of all the Humdrum files in
~/lib/kern/ireland given those in ~/lib/kern/lorraine, using xdelta3, run

$ findcat ~/lib/kern/ireland | bufs 2 zxd encode <(findcat ~/lib/kern/lorraine) | length

The encode and decode scripts include bufs, so an alternative is

$ findcat ~/lib/kern/ireland | encode zxd <(findcat ~/lib/kern/lorraine) | length

A better estimate, with comments stripped from both corpora, is

$ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | length

Sometimes the output can be compressed still further:

$ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | lzma | length

rid -G is a Humdrum command that removes comments.


# zzcd and zzd

The scripts zzd and zzcd implement more complex schemes, where the input and the reference
are concatenated and/or compressed before being delta compressed. For example,

$ findcat ~/lib/kern/ireland | rid -G | bufs 3 zzcd lzma zxd <(findcat ~/lib/kern/lorraine | rid -G) | length

computes (using a more functional notation)

length( zxd( lzma(lorraine), lzma(lorraine+ireland)))

that is, the amount of information needed to transform one LZMA-compressed corpus
into the LZMA-compressed concatenation of the two corpora.

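Read literally, the functional expression above suggests the following shape for the zzcd scheme. This is only a sketch built from that expression, not the actual zzcd source: the name zzcd_sketch is hypothetical, the compressor is assumed to read stdin and write stdout, and the delta script is assumed to follow the encode/decode interface described earlier. zzd is not covered.

```shell
# Hypothetical sketch of delta( comp(ref), comp(ref + input) );
# not the real zzcd script.
# Usage: <input on stdin> | zzcd_sketch <compressor> <delta script> <reference file>
zzcd_sketch() {
    comp=$1     # compressor: stdin -> stdout (e.g. lzma)
    delta=$2    # delta script with the (encode|decode) <reference> interface
    ref=$3      # reference file; read twice, so it must not be a one-shot stream
    cref=$(mktemp) || return 1
    "$comp" < "$ref" > "$cref"                        # comp(ref)
    # comp(ref + input), delta-encoded against comp(ref)
    cat "$ref" - | "$comp" | "$delta" encode "$cref"
    status=$?
    rm -f "$cref"
    return "$status"
}
```

Because the reference is read twice here, a process substitution passed as the reference has to be buffered first, which is consistent with the use of bufs 3 in the command above.
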
# dlzma

This is a program written in C using liblzma (part of the XZ Utils package) to estimate the conditional
complexity of an object. It works by using the SYNC_FLUSH feature of liblzma. The compressed data is
discarded and only the number of bits used is written to stdout.