comparison cpack/dml/scripts/compression/README @ 0:718306e29690 tip
commiting public release
| author | Daniel Wolff |
|---|---|
| date | Tue, 09 Feb 2016 21:05:06 +0100 |
| -1:000000000000 | 0:718306e29690 |
|---|---|
| 1 # Delta compression | |
| 2 | |
| 3 Scripts in dml-cliopatria/cpack/dml/scripts/compression provide a common interface | |
| 4 to several delta compression programs. The interface is | |
| 5 | |
| 6 stdin ---> [ <script name> (encode|decode) <name of reference file> ] ---> stdout | |
| 7 | |
| 8 The following scripts work this way: | |
| 9 | |
| 10 zbs - uses bsdiff | |
| 11 zxd - uses xdelta3 | |
| 12 zvcd - uses open-vcdiff | |
| 13 zvcz - uses vczip | |
| 14 zdiff - converts binary to text and uses diff to produce an ed script | |
| 15 | |
| 16 # bufs | |
| 17 | |
| 18 The bufs script allows an arbitrary command to be run such that if the command expects a | |
| 19 filename as its nth argument, then | |
| 20 | |
| 21 $ bufs <n> <command> <arg1> ... <argn> ... | |
| 22 | |
| 23 can be run with <argn> as a bash process substitution, even if <command> reads that | |
| 24 source several times. bufs works by buffering the stream on the nth argument to a temporary | |
| 25 file. | |
| 26 | |
| 27 | |
| 28 # findcat | |
| 29 | |
| 30 findcat dumps the contents of every file under a given directory to stdout. | |
| 31 | |
| 32 # Examples | |
| 33 | |
| 34 For example, to estimate the conditional Kolmogorov complexity of all the Humdrum files in | |
| 35 ~/lib/kern/ireland given those in ~/lib/kern/lorraine, using xdelta3, do | |
| 36 | |
| 37 $ findcat ~/lib/kern/ireland | bufs 2 zxd encode <(findcat ~/lib/kern/lorraine) | length | |
| 38 | |
| 39 The encode and decode scripts already wrap bufs, so an alternative is | |
| 40 | |
| 41 $ findcat ~/lib/kern/ireland | encode zxd <(findcat ~/lib/kern/lorraine) | length | |
| 42 | |
| 43 A better estimate strips comments from both corpora first: | |
| 44 | |
| 45 $ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | length | |
| 46 | |
| 47 Sometimes the output can be compressed still further: | |
| 48 | |
| 49 $ findcat ~/lib/kern/ireland | rid -G | encode zxd <(findcat ~/lib/kern/lorraine | rid -G) | lzma | length | |
| 50 | |
| 51 rid -G is a Humdrum command that removes global comments. | |
| 52 | |
| 53 | |
| 54 # zzcd and zzd | |
| 55 | |
| 56 The scripts zzd and zzcd implement more complex schemes, where the input and the reference | |
| 57 are concatenated and/or compressed before being delta compressed. For example, | |
| 58 | |
| 59 $ findcat ~/lib/kern/ireland | rid -G | bufs 3 zzcd lzma zxd <(findcat ~/lib/kern/lorraine | rid -G) | length | |
| 60 | |
| 61 computes (using a more functional notation) | |
| 62 | |
| 63 length( zxd( lzma(lorraine), lzma(lorraine+ireland))) | |
| 64 | |
| 65 that is, the amount of information needed to transform one LZMA-compressed corpus | |
| 66 into the LZMA-compressed concatenation of the two corpora. | |
| 67 | |
| 68 # dlzma | |
| 69 | |
| 70 This is a program written in C using liblzma (part of the XZ Utils package) to estimate the conditional | |
| 71 complexity of an object. It works by using the SYNC_FLUSH feature of liblzma. The compressed data is | |
| 72 discarded and only the number of bits used is written to stdout. |
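
The idea of a conditional complexity estimate can also be approximated with ordinary command-line compressors. The sketch below is not how dlzma works (dlzma uses liblzma directly); it swaps in a simpler technique: compress the reference alone, compress the reference followed by the input, and report the difference in size. The function name cond_size and the default compressor "xz -c" are assumptions made for illustration.

```shell
# cond_size REF INPUT [COMPRESSOR ARGS...]
#
# Hypothetical helper, not part of this repository.
# Prints size(C(REF + INPUT)) - size(C(REF)) in bytes, where C is any
# filter that reads stdin and writes stdout. "xz -c" is an assumed default.
# The body runs in a subshell, so its variables do not leak to the caller.
cond_size() (
    ref=$1
    input=$2
    shift 2

    # Fall back to the assumed default compressor when none is given.
    [ "$#" -gt 0 ] || set -- xz -c

    # Compressed size of the reference on its own.
    alone=$("$@" < "$ref" | wc -c)

    # Compressed size of the reference followed by the input.
    both=$(cat "$ref" "$input" | "$@" | wc -c)

    # The extra bytes needed for the input once the reference is known.
    echo $(( $both - $alone ))
)
```

With two hypothetical files holding the concatenated corpora, `cond_size lorraine.all ireland.all` prints the estimate in bytes; multiply by 8 to compare with a figure reported in bits. The result depends on the compressor's window size, so a large reference may be only partly visible when the input is compressed.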
