annotate src/bzip2-1.0.6/bzip2.1 @ 23:619f715526df sv_v2.1

Update Vamp plugin SDK to 2.5
author Chris Cannam
date Thu, 09 May 2013 10:52:46 +0100
parents e13257ea84a4
rev   line source
Chris@4 1 .PU
Chris@4 2 .TH bzip2 1
Chris@4 3 .SH NAME
Chris@4 4 bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
Chris@4 5 .br
Chris@4 6 bzcat \- decompresses files to stdout
Chris@4 7 .br
Chris@4 8 bzip2recover \- recovers data from damaged bzip2 files
Chris@4 9
Chris@4 10 .SH SYNOPSIS
Chris@4 11 .ll +8
Chris@4 12 .B bzip2
Chris@4 13 .RB [ " \-cdfkqstvzVL123456789 " ]
Chris@4 14 [
Chris@4 15 .I "filenames \&..."
Chris@4 16 ]
Chris@4 17 .ll -8
Chris@4 18 .br
Chris@4 19 .B bunzip2
Chris@4 20 .RB [ " \-fkvsVL " ]
Chris@4 21 [
Chris@4 22 .I "filenames \&..."
Chris@4 23 ]
Chris@4 24 .br
Chris@4 25 .B bzcat
Chris@4 26 .RB [ " \-s " ]
Chris@4 27 [
Chris@4 28 .I "filenames \&..."
Chris@4 29 ]
Chris@4 30 .br
Chris@4 31 .B bzip2recover
Chris@4 32 .I "filename"
Chris@4 33
Chris@4 34 .SH DESCRIPTION
Chris@4 35 .I bzip2
Chris@4 36 compresses files using the Burrows-Wheeler block sorting
Chris@4 37 text compression algorithm, and Huffman coding. Compression is
Chris@4 38 generally considerably better than that achieved by more conventional
Chris@4 39 LZ77/LZ78-based compressors, and approaches the performance of the PPM
Chris@4 40 family of statistical compressors.
Chris@4 41
Chris@4 42 The command-line options are deliberately very similar to
Chris@4 43 those of
Chris@4 44 .I GNU gzip,
Chris@4 45 but they are not identical.
Chris@4 46
Chris@4 47 .I bzip2
Chris@4 48 expects a list of file names to accompany the
Chris@4 49 command-line flags. Each file is replaced by a compressed version of
Chris@4 50 itself, with the name "original_name.bz2".
Chris@4 51 Each compressed file
Chris@4 52 has the same modification date, permissions, and, when possible,
Chris@4 53 ownership as the corresponding original, so that these properties can
Chris@4 54 be correctly restored at decompression time. File name handling is
Chris@4 55 naive in the sense that there is no mechanism for preserving original
Chris@4 56 file names, permissions, ownerships or dates in filesystems which lack
Chris@4 57 these concepts, or have serious file name length restrictions, such as
Chris@4 58 MS-DOS.
Chris@4 59
Chris@4 60 .I bzip2
Chris@4 61 and
Chris@4 62 .I bunzip2
Chris@4 63 will by default not overwrite existing
Chris@4 64 files. If you want this to happen, specify the \-f flag.
Chris@4 65
Chris@4 66 If no file names are specified,
Chris@4 67 .I bzip2
Chris@4 68 compresses from standard
Chris@4 69 input to standard output. In this case,
Chris@4 70 .I bzip2
Chris@4 71 will decline to
Chris@4 72 write compressed output to a terminal, as this would be entirely
Chris@4 73 incomprehensible and therefore pointless.
Chris@4 74
Chris@4 75 .I bunzip2
Chris@4 76 (or
Chris@4 77 .I bzip2 \-d)
Chris@4 78 decompresses all
Chris@4 79 specified files. Files which were not created by
Chris@4 80 .I bzip2
Chris@4 81 will be detected and ignored, and a warning issued.
Chris@4 82 .I bzip2
Chris@4 83 attempts to guess the filename for the decompressed file
Chris@4 84 from that of the compressed file as follows:
Chris@4 85
Chris@4 86        filename.bz2    becomes   filename
Chris@4 87        filename.bz     becomes   filename
Chris@4 88        filename.tbz2   becomes   filename.tar
Chris@4 89        filename.tbz    becomes   filename.tar
Chris@4 90        anyothername    becomes   anyothername.out
Chris@4 91
Chris@4 92 If the file does not end in one of the recognised endings,
Chris@4 93 .I .bz2,
Chris@4 94 .I .bz,
Chris@4 95 .I .tbz2
Chris@4 96 or
Chris@4 97 .I .tbz,
Chris@4 98 .I bzip2
Chris@4 99 complains that it cannot
Chris@4 100 guess the name of the original file, and uses the original name
Chris@4 101 with
Chris@4 102 .I .out
Chris@4 103 appended.
Chris@4 104
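The suffix rules above can be sketched as a small lookup. This is only an illustration of the table in this page, not the code bzip2 itself uses; the names SUFFIX_RULES and guess_output_name are hypothetical.

```python
# Suffix rules copied from the table in this manual page.
# SUFFIX_RULES and guess_output_name are hypothetical names for illustration.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name implied by the suffix rules."""
    for suffix, replacement in SUFFIX_RULES:
        if name.endswith(suffix):
            # Strip the recognised ending and add its replacement, if any.
            return name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the name and append ".out".
    return name + ".out"
```

For example, guess_output_name("backup.tbz2") yields "backup.tar", while a name with no recognised ending simply gains ".out".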
Chris@4 105 As with compression, supplying no
Chris@4 106 filenames causes decompression from
Chris@4 107 standard input to standard output.
Chris@4 108
Chris@4 109 .I bunzip2
Chris@4 110 will correctly decompress a file which is the
Chris@4 111 concatenation of two or more compressed files. The result is the
Chris@4 112 concatenation of the corresponding uncompressed files. Integrity
Chris@4 113 testing (\-t)
Chris@4 114 of concatenated
Chris@4 115 compressed files is also supported.
Chris@4 116
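The concatenation property can be checked with a short sketch. It assumes Python's standard bz2 module (a separate binding to the same library, whose decompress function accepts multi-stream input) rather than the bunzip2 program itself.

```python
import bz2

# Two independently compressed inputs, standing in for two .bz2 files.
first = b"alpha\n" * 100
second = b"beta\n" * 100

# Concatenating the compressed streams is like `cat a.bz2 b.bz2`.
combined = bz2.compress(first) + bz2.compress(second)

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(combined)
assert restored == first + second
```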
Chris@4 117 You can also compress or decompress files to the standard output by
Chris@4 118 giving the \-c flag. Multiple files may be compressed and
Chris@4 119 decompressed like this. The resulting outputs are fed sequentially to
Chris@4 120 stdout. Compression of multiple files
Chris@4 121 in this manner generates a stream
Chris@4 122 containing multiple compressed file representations. Such a stream
Chris@4 123 can be decompressed correctly only by
Chris@4 124 .I bzip2
Chris@4 125 version 0.9.0 or
Chris@4 126 later. Earlier versions of
Chris@4 127 .I bzip2
Chris@4 128 will stop after decompressing
Chris@4 129 the first file in the stream.
Chris@4 130
Chris@4 131 .I bzcat
Chris@4 132 (or
Chris@4 133 .I bzip2 -dc)
Chris@4 134 decompresses all specified files to
Chris@4 135 the standard output.
Chris@4 136
Chris@4 137 .I bzip2
Chris@4 138 will read arguments from the environment variables
Chris@4 139 .I BZIP2
Chris@4 140 and
Chris@4 141 .I BZIP,
Chris@4 142 in that order, and will process them
Chris@4 143 before any arguments read from the command line. This gives a
Chris@4 144 convenient way to supply default arguments.
Chris@4 145
Chris@4 146 Compression is always performed, even if the compressed
Chris@4 147 file is slightly
Chris@4 148 larger than the original. Files of less than about one hundred bytes
Chris@4 149 tend to get larger, since the compression mechanism has a constant
Chris@4 150 overhead in the region of 50 bytes. Random data (including the output
Chris@4 151 of most file compressors) is coded at about 8.05 bits per byte, giving
Chris@4 152 an expansion of around 0.5%.
Chris@4 153
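A rough way to observe this overhead, assuming Python's standard bz2 module as a stand-in for the command-line program (the helper name expansion is hypothetical):

```python
import bz2
import os


def expansion(data):
    """Return how many bytes larger the compressed form is than the input."""
    return len(bz2.compress(data)) - len(data)


# A very small input: the fixed overhead dominates, so the output grows.
tiny_growth = expansion(b"hello")

# Random data is essentially incompressible, so it also grows slightly.
random_growth = expansion(os.urandom(100_000))

print("tiny input grew by", tiny_growth, "bytes")
print("random input grew by", random_growth, "bytes")
```

The exact figures depend on the input and the library version, so none are quoted here.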
Chris@4 154 As a self-check for your protection,
Chris@4 155 .I
Chris@4 156 bzip2
Chris@4 157 uses 32-bit CRCs to
Chris@4 158 make sure that the decompressed version of a file is identical to the
Chris@4 159 original. This guards against corruption of the compressed data, and
Chris@4 160 against undetected bugs in
Chris@4 161 .I bzip2
Chris@4 162 (hopefully very unlikely). The
Chris@4 163 chance of data corruption going undetected is microscopic, about one
Chris@4 164 chance in four billion for each file processed. Be aware, though, that
Chris@4 165 the check occurs upon decompression, so it can only tell you that
Chris@4 166 something is wrong. It can't help you
Chris@4 167 recover the original uncompressed
Chris@4 168 data. You can use
Chris@4 169 .I bzip2recover
Chris@4 170 to try to recover data from
Chris@4 171 damaged files.
Chris@4 172
Chris@4 173 Return values: 0 for a normal exit, 1 for environmental problems (file
Chris@4 174 not found, invalid flags, I/O errors, &c), 2 to indicate a corrupt
Chris@4 175 compressed file, 3 for an internal consistency error (eg, bug) which
Chris@4 176 caused
Chris@4 177 .I bzip2
Chris@4 178 to panic.
Chris@4 179
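A script that wraps bzip2 might translate these return values as follows. This is a minimal sketch: the names EXIT_MEANINGS, describe_exit_status and check_archive are hypothetical, and check_archive assumes a bzip2 executable is on the PATH.

```python
import subprocess

# Return values as documented in this manual page.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error",
}


def describe_exit_status(code):
    """Map a bzip2 return value to the description given above."""
    return EXIT_MEANINGS.get(code, "undocumented status %d" % code)


def check_archive(path):
    """Run `bzip2 -t` on path and describe the outcome."""
    result = subprocess.run(["bzip2", "-t", path])
    return describe_exit_status(result.returncode)
```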
Chris@4 180 .SH OPTIONS
Chris@4 181 .TP
Chris@4 182 .B \-c --stdout
Chris@4 183 Compress or decompress to standard output.
Chris@4 184 .TP
Chris@4 185 .B \-d --decompress
Chris@4 186 Force decompression.
Chris@4 187 .I bzip2,
Chris@4 188 .I bunzip2
Chris@4 189 and
Chris@4 190 .I bzcat
Chris@4 191 are
Chris@4 192 really the same program, and the decision about what actions to take is
Chris@4 193 done on the basis of which name is used. This flag overrides that
Chris@4 194 mechanism, and forces
Chris@4 195 .I bzip2
Chris@4 196 to decompress.
Chris@4 197 .TP
Chris@4 198 .B \-z --compress
Chris@4 199 The complement to \-d: forces compression, regardless of the
Chris@4 200 invocation name.
Chris@4 201 .TP
Chris@4 202 .B \-t --test
Chris@4 203 Check integrity of the specified file(s), but don't decompress them.
Chris@4 204 This really performs a trial decompression and throws away the result.
Chris@4 205 .TP
Chris@4 206 .B \-f --force
Chris@4 207 Force overwrite of output files. Normally,
Chris@4 208 .I bzip2
Chris@4 209 will not overwrite
Chris@4 210 existing output files. Also forces
Chris@4 211 .I bzip2
Chris@4 212 to break hard links
Chris@4 213 to files, which it otherwise wouldn't do.
Chris@4 214
Chris@4 215 bzip2 normally declines to decompress files which don't have the
Chris@4 216 correct magic header bytes. If forced (-f), however, it will pass
Chris@4 217 such files through unmodified. This is how GNU gzip behaves.
Chris@4 218 .TP
Chris@4 219 .B \-k --keep
Chris@4 220 Keep (don't delete) input files during compression
Chris@4 221 or decompression.
Chris@4 222 .TP
Chris@4 223 .B \-s --small
Chris@4 224 Reduce memory usage, for compression, decompression and testing. Files
Chris@4 225 are decompressed and tested using a modified algorithm which only
Chris@4 226 requires 2.5 bytes per block byte. This means any file can be
Chris@4 227 decompressed in 2300k of memory, albeit at about half the normal speed.
Chris@4 228
Chris@4 229 During compression, \-s selects a block size of 200k, which limits
Chris@4 230 memory use to around the same figure, at the expense of your compression
Chris@4 231 ratio. In short, if your machine is low on memory (8 megabytes or
Chris@4 232 less), use \-s for everything. See MEMORY MANAGEMENT below.
Chris@4 233 .TP
Chris@4 234 .B \-q --quiet
Chris@4 235 Suppress non-essential warning messages. Messages pertaining to
Chris@4 236 I/O errors and other critical events will not be suppressed.
Chris@4 237 .TP
Chris@4 238 .B \-v --verbose
Chris@4 239 Verbose mode -- show the compression ratio for each file processed.
Chris@4 240 Further \-v's increase the verbosity level, spewing out lots of
Chris@4 241 information which is primarily of interest for diagnostic purposes.
Chris@4 242 .TP
Chris@4 243 .B \-L --license -V --version
Chris@4 244 Display the software version, license terms and conditions.
Chris@4 245 .TP
Chris@4 246 .B \-1 (or \-\-fast) to \-9 (or \-\-best)
Chris@4 247 Set the block size to 100 k, 200 k .. 900 k when compressing. Has no
Chris@4 248 effect when decompressing. See MEMORY MANAGEMENT below.
Chris@4 249 The \-\-fast and \-\-best aliases are primarily for GNU gzip
Chris@4 250 compatibility. In particular, \-\-fast doesn't make things
Chris@4 251 significantly faster.
Chris@4 252 And \-\-best merely selects the default behaviour.
Chris@4 253 .TP
Chris@4 254 .B \--
Chris@4 255 Treats all subsequent arguments as file names, even if they start
Chris@4 256 with a dash. This is so you can handle files with names beginning
Chris@4 257 with a dash, for example: bzip2 \-- \-myfilename.
Chris@4 258 .TP
Chris@4 259 .B \--repetitive-fast --repetitive-best
Chris@4 260 These flags are redundant in versions 0.9.5 and above. They provided
Chris@4 261 some coarse control over the behaviour of the sorting algorithm in
Chris@4 262 earlier versions, which was sometimes useful. 0.9.5 and above have an
Chris@4 263 improved algorithm which renders these flags irrelevant.
Chris@4 264
Chris@4 265 .SH MEMORY MANAGEMENT
Chris@4 266 .I bzip2
Chris@4 267 compresses large files in blocks. The block size affects
Chris@4 268 both the compression ratio achieved, and the amount of memory needed for
Chris@4 269 compression and decompression. The flags \-1 through \-9
Chris@4 270 specify the block size to be 100,000 bytes through 900,000 bytes (the
Chris@4 271 default) respectively. At decompression time, the block size used for
Chris@4 272 compression is read from the header of the compressed file, and
Chris@4 273 .I bunzip2
Chris@4 274 then allocates itself just enough memory to decompress
Chris@4 275 the file. Since block sizes are stored in compressed files, it follows
Chris@4 276 that the flags \-1 to \-9 are irrelevant to and so ignored
Chris@4 277 during decompression.
Chris@4 278
Chris@4 279 Compression and decompression requirements,
Chris@4 280 in bytes, can be estimated as:
Chris@4 281
Chris@4 282        Compression:   400k + ( 8 x block size )
Chris@4 283
Chris@4 284        Decompression: 100k + ( 4 x block size ), or
Chris@4 285                       100k + ( 2.5 x block size )
Chris@4 286
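The two estimates can be written out as a small calculator. This sketch takes "k" to mean 1000 bytes and the block size to be 100k times the flag digit, as stated above; the function names are hypothetical, and the results are estimates, not measurements.

```python
def compress_memory_k(level):
    """Estimated compression memory in k for flags -1 .. -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_k = 100 * level
    return 400 + 8 * block_k


def decompress_memory_k(level, small=False):
    """Estimated decompression memory in k; small=True models the -s flag."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block_k = 100 * level
    per_byte = 2.5 if small else 4
    return 100 + per_byte * block_k


for level in (1, 5, 9):
    print(level, compress_memory_k(level), decompress_memory_k(level),
          decompress_memory_k(level, small=True))
```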
Chris@4 287 Larger block sizes give rapidly diminishing marginal returns. Most of
Chris@4 288 the compression comes from the first two or three hundred k of block
Chris@4 289 size, a fact worth bearing in mind when using
Chris@4 290 .I bzip2
Chris@4 291 on small machines.
Chris@4 292 It is also important to appreciate that the decompression memory
Chris@4 293 requirement is set at compression time by the choice of block size.
Chris@4 294
Chris@4 295 For files compressed with the default 900k block size,
Chris@4 296 .I bunzip2
Chris@4 297 will require about 3700 kbytes to decompress. To support decompression
Chris@4 298 of any file on a 4 megabyte machine,
Chris@4 299 .I bunzip2
Chris@4 300 has an option to
Chris@4 301 decompress using approximately half this amount of memory, about 2300
Chris@4 302 kbytes. Decompression speed is also halved, so you should use this
Chris@4 303 option only where necessary. The relevant flag is -s.
Chris@4 304
Chris@4 305 In general, try to use the largest block size that memory constraints allow,
Chris@4 306 since that maximises the compression achieved. Compression and
Chris@4 307 decompression speed are virtually unaffected by block size.
Chris@4 308
Chris@4 309 Another significant point applies to files which fit in a single block
Chris@4 310 -- that means most files you'd encounter using a large block size. The
Chris@4 311 amount of real memory touched is proportional to the size of the file,
Chris@4 312 since the file is smaller than a block. For example, compressing a file
Chris@4 313 20,000 bytes long with the flag -9 will cause the compressor to
Chris@4 314 allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
Chris@4 315 kbytes of it. Similarly, the decompressor will allocate 3700k but only
Chris@4 316 touch 100k + 20000 * 4 = 180 kbytes.
Chris@4 317
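The arithmetic in the example above can be reproduced with a short sketch. It assumes that a file larger than one block touches memory as if it were exactly one block long; the name touched_memory_bytes is hypothetical.

```python
def touched_memory_bytes(file_size, block_size=900_000):
    """Estimate memory actually touched, in bytes, for a given file size.

    Assumption: a file larger than one block touches as much memory as a
    file of exactly one block.
    """
    effective = min(file_size, block_size)
    return {
        "compress": 400_000 + 8 * effective,
        "decompress": 100_000 + 4 * effective,
    }


# The 20,000-byte file from the text, compressed with -9.
usage = touched_memory_bytes(20_000)
print(usage["compress"] // 1000, "kbytes touched when compressing")
print(usage["decompress"] // 1000, "kbytes touched when decompressing")
```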
Chris@4 318 Here is a table which summarises the maximum memory usage for different
Chris@4 319 block sizes. Also recorded is the total compressed size for 14 files of
Chris@4 320 the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
Chris@4 321 column gives some feel for how compression varies with block size.
Chris@4 322 These figures tend to understate the advantage of larger block sizes for
Chris@4 323 larger files, since the Corpus is dominated by smaller files.
Chris@4 324
Chris@4 325            Compress   Decompress   Decompress   Corpus
Chris@4 326     Flag     usage      usage       -s usage     Size
Chris@4 327
Chris@4 328      -1      1200k       500k         350k      914704
Chris@4 329      -2      2000k       900k         600k      877703
Chris@4 330      -3      2800k      1300k         850k      860338
Chris@4 331      -4      3600k      1700k        1100k      846899
Chris@4 332      -5      4400k      2100k        1350k      845160
Chris@4 333      -6      5200k      2500k        1600k      838626
Chris@4 334      -7      6100k      2900k        1850k      834096
Chris@4 335      -8      6800k      3300k        2100k      828642
Chris@4 336      -9      7600k      3700k        2350k      828642
Chris@4 337
Chris@4 338 .SH RECOVERING DATA FROM DAMAGED FILES
Chris@4 339 .I bzip2
Chris@4 340 compresses files in blocks, usually 900kbytes long. Each
Chris@4 341 block is handled independently. If a media or transmission error causes
Chris@4 342 a multi-block .bz2
Chris@4 343 file to become damaged, it may be possible to
Chris@4 344 recover data from the undamaged blocks in the file.
Chris@4 345
Chris@4 346 The compressed representation of each block is delimited by a 48-bit
Chris@4 347 pattern, which makes it possible to find the block boundaries with
Chris@4 348 reasonable certainty. Each block also carries its own 32-bit CRC, so
Chris@4 349 damaged blocks can be distinguished from undamaged ones.
Chris@4 350
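The idea of locating blocks by their delimiter can be sketched as a bit-level scan. This is an illustration only, not the bzip2recover implementation. The pattern value 0x314159265359 comes from the bzip2 file format rather than from this page, and the name find_block_starts is hypothetical. Blocks are not byte-aligned, so the scan checks every bit offset.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit block delimiter from the file format
MAGIC_BITS = 48


def find_block_starts(data):
    """Return the bit offsets at which the block delimiter occurs."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << MAGIC_BITS) - 1
    starts = []
    for offset in range(total_bits - MAGIC_BITS + 1):
        # Shift so that the 48 bits beginning at `offset` are lowest.
        shift = total_bits - MAGIC_BITS - offset
        if (value >> shift) & mask == BLOCK_MAGIC:
            starts.append(offset)
    return starts


# A small input fits in one block, which follows the 4-byte stream header.
sample = bz2.compress(b"hello world\n" * 100)
block_starts = find_block_starts(sample)
print(block_starts)
```

The scan is quadratic in the worst case because it shifts one large integer, so it suits small demonstrations only.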
Chris@4 351 .I bzip2recover
Chris@4 352 is a simple program whose purpose is to search for
Chris@4 353 blocks in .bz2 files, and write each block out into its own .bz2
Chris@4 354 file. You can then use
Chris@4 355 .I bzip2
Chris@4 356 \-t
Chris@4 357 to test the
Chris@4 358 integrity of the resulting files, and decompress those which are
Chris@4 359 undamaged.
Chris@4 360
Chris@4 361 .I bzip2recover
Chris@4 362 takes a single argument, the name of the damaged file,
Chris@4 363 and writes a number of files "rec00001file.bz2",
Chris@4 364 "rec00002file.bz2", etc, containing the extracted blocks.
Chris@4 365 The output filenames are designed so that the use of
Chris@4 366 wildcards in subsequent processing -- for example,
Chris@4 367 "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
Chris@4 368 the correct order.
Chris@4 369
Chris@4 370 .I bzip2recover
Chris@4 371 should be of most use dealing with large .bz2
Chris@4 372 files, as these will contain many blocks. It is clearly
Chris@4 373 futile to use it on damaged single-block files, since a
Chris@4 374 damaged block cannot be recovered. If you wish to minimise
Chris@4 375 any potential data loss through media or transmission errors,
Chris@4 376 you might consider compressing with a smaller
Chris@4 377 block size.
Chris@4 378
Chris@4 379 .SH PERFORMANCE NOTES
Chris@4 380 The sorting phase of compression gathers together similar strings in the
Chris@4 381 file. Because of this, files containing very long runs of repeated
Chris@4 382 symbols, like "aabaabaabaab ..." (repeated several hundred times) may
Chris@4 383 compress more slowly than normal. Versions 0.9.5 and above fare much
Chris@4 384 better than previous versions in this respect. The ratio between
Chris@4 385 worst-case and average-case compression time is in the region of 10:1.
Chris@4 386 For previous versions, this figure was more like 100:1. You can use the
Chris@4 387 \-vvvv option to monitor progress in great detail, if you want.
Chris@4 388
Chris@4 389 Decompression speed is unaffected by these phenomena.
Chris@4 390
Chris@4 391 .I bzip2
Chris@4 392 usually allocates several megabytes of memory to operate
Chris@4 393 in, and then charges all over it in a fairly random fashion. This means
Chris@4 394 that performance, both for compressing and decompressing, is largely
Chris@4 395 determined by the speed at which your machine can service cache misses.
Chris@4 396 Because of this, small changes to the code to reduce the miss rate have
Chris@4 397 been observed to give disproportionately large performance improvements.
Chris@4 398 I imagine
Chris@4 399 .I bzip2
Chris@4 400 will perform best on machines with very large caches.
Chris@4 401
Chris@4 402 .SH CAVEATS
Chris@4 403 I/O error messages are not as helpful as they could be.
Chris@4 404 .I bzip2
Chris@4 405 tries hard to detect I/O errors and exit cleanly, but the details of
Chris@4 406 what the problem is sometimes seem rather misleading.
Chris@4 407
Chris@4 408 This manual page pertains to version 1.0.6 of
Chris@4 409 .I bzip2.
Chris@4 410 Compressed data created by this version is entirely forwards and
Chris@4 411 backwards compatible with the previous public releases, versions
Chris@4 412 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
Chris@4 413 exception: 0.9.0 and above can correctly decompress multiple
Chris@4 414 concatenated compressed files. 0.1pl2 cannot do this; it will stop
Chris@4 415 after decompressing just the first file in the stream.
Chris@4 416
Chris@4 417 .I bzip2recover
Chris@4 418 versions prior to 1.0.2 used 32-bit integers to represent
Chris@4 419 bit positions in compressed files, so they could not handle compressed
Chris@4 420 files more than 512 megabytes long. Versions 1.0.2 and above use
Chris@4 421 64-bit ints on some platforms which support them (GNU supported
Chris@4 422 targets, and Windows). To establish whether or not bzip2recover was
Chris@4 423 built with such a limitation, run it without arguments. In any event
Chris@4 424 you can build yourself an unlimited version if you can recompile it
Chris@4 425 with MaybeUInt64 set to be an unsigned 64-bit integer.
Chris@4 426
Chris@4 427
Chris@4 428
Chris@4 429 .SH AUTHOR
Chris@4 430 Julian Seward, jseward@bzip.org.
Chris@4 431
Chris@4 432 http://www.bzip.org
Chris@4 433
Chris@4 434 The ideas embodied in
Chris@4 435 .I bzip2
Chris@4 436 are due to (at least) the following
Chris@4 437 people: Michael Burrows and David Wheeler (for the block sorting
Chris@4 438 transformation), David Wheeler (again, for the Huffman coder), Peter
Chris@4 439 Fenwick (for the structured coding model in the original
Chris@4 440 .I bzip,
Chris@4 441 and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
Chris@4 442 (for the arithmetic coder in the original
Chris@4 443 .I bzip).
Chris@4 444 I am much
Chris@4 445 indebted for their help, support and advice. See the manual in the
Chris@4 446 source distribution for pointers to sources of documentation. Christian
Chris@4 447 von Roques encouraged me to look for faster sorting algorithms, so as to
Chris@4 448 speed up compression. Bela Lubkin encouraged me to improve the
Chris@4 449 worst-case compression performance.
Chris@4 450 Donna Robinson XMLised the documentation.
Chris@4 451 The bz* scripts are derived from those of GNU gzip.
Chris@4 452 Many people sent patches, helped
Chris@4 453 with portability problems, lent machines, gave advice and were generally
Chris@4 454 helpful.