.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.  Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags.  Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time.
File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files.  If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output.  In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files.  Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.

As with compression, supplying no
filenames causes decompression from
standard input to standard output.
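
The filename-guessing rules listed above can be sketched in a few lines of Python. This is an illustration only, not the code bzip2 itself uses; the names SUFFIX_RULES and guess_output_name are hypothetical, and the sketch assumes the suffixes are tried in the order the list gives them.

```python
# Suffix rules as listed in the text: (recognised ending, replacement).
# The order is an assumption for this sketch.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name: str) -> str:
    """Return the decompressed file name implied by the rules above."""
    for suffix, replacement in SUFFIX_RULES:
        if name.endswith(suffix):
            # Strip the recognised ending and add its replacement, if any.
            return name[: -len(suffix)] + replacement
    # No recognised ending: fall back to appending ".out".
    return name + ".out"


if __name__ == "__main__":
    for example in ("filename.bz2", "filename.tbz", "anyothername"):
        print(example, "becomes", guess_output_name(example))
```

The sketch covers only the name mapping; the warning that bzip2 issues for an unrecognised ending is not modelled.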

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files.  The result is the
concatenation of the corresponding uncompressed files.  Integrity
testing (\-t)
of concatenated
compressed files is also supported.

You can also compress or decompress files to the standard output by
giving the \-c flag.  Multiple files may be compressed and
decompressed like this.  The resulting outputs are fed sequentially to
stdout.  Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations.  Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later.  Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 -dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line.  This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original.  Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes.  Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
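
The expansion described above can be observed with a small experiment. The sketch below uses the bz2 module from the Python standard library rather than the bzip2 command itself, so exact byte counts may differ from those of the command-line tool; the helper name expansion is hypothetical.

```python
import bz2
import os


def expansion(data: bytes) -> int:
    """Compressed size minus original size; positive means the data grew."""
    return len(bz2.compress(data)) - len(data)


if __name__ == "__main__":
    tiny = b"hello"                      # far below one hundred bytes
    random_block = os.urandom(100_000)   # incompressible input
    print("tiny input grows by", expansion(tiny), "bytes")
    print("random input grows by", expansion(random_block), "bytes")
```

Both inputs should come out larger than they went in: the tiny one because of the fixed overhead, the random one because there is no redundancy to remove.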

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original.  This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely).  The
probability of data corruption going undetected is microscopic, about one
chance in four billion for each file processed.  Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong.  It can't help you
recover the original uncompressed
data.  You can use
.I bzip2recover
to try to recover data from
damaged files.

Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
done on the basis of which name is used.  This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files.  Normally,
.I bzip2
will not overwrite
existing output files.  Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes.  If forced (-f), however, it will pass
such files through unmodified.  This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing.  Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte.  This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio.  In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything.  See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages.  Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k .. 900 k when compressing.  Has no
effect when decompressing.  See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility.  In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash.  This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above.  They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful.  0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks.  The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression.  The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.  At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file.  Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to and so ignored
during decompression.
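
As a rough illustration of how the block-size flag drives memory use, the sketch below applies this section's estimates: 400k + 8 x block size for compression, 100k + 4 x block size for normal decompression, and 100k + 2.5 x block size for decompression with \-s. The function name estimate_memory_k is hypothetical, and the sketch assumes that "k" denotes the same unit throughout, with flag \-N meaning a block of N x 100k.

```python
def estimate_memory_k(level: int) -> tuple[int, int, int]:
    """Estimate memory use in k for block-size flag -1 .. -9.

    Returns (compression, decompression, decompression with -s).
    """
    if not 1 <= level <= 9:
        raise ValueError("block-size flag must be between 1 and 9")
    block_k = level * 100                  # -1 is 100k, ..., -9 is 900k
    compress_k = 400 + 8 * block_k
    decompress_k = 100 + 4 * block_k
    small_k = 100 + (5 * block_k) // 2     # 2.5 x block size, in integers
    return compress_k, decompress_k, small_k


if __name__ == "__main__":
    for level in (1, 5, 9):
        print(f"-{level}:", estimate_memory_k(level))
```

These are estimates of allocated memory only. Files smaller than one block touch less of it.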

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )

Larger block sizes give rapidly diminishing marginal returns.  Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress.  To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes.  Decompression speed is also halved, so you should use this
option only where necessary.  The relevant flag is -s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved.  Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.  The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block.  For example, compressing a file
20,000 bytes long with the flag -9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it.
Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different
block sizes.  Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6100k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long.  Each
block is handled independently.  If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.  Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file.
You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc, containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.

.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks.  It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered.  If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file.  Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal.  Versions 0.9.5 and above fare much
better than previous versions in this respect.  The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1.  You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion.  This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.6 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.  0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.  Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows).  To establish whether or not bzip2recover was
built with such a limitation, run it without arguments.
In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice.  See the manual in the
source distribution for pointers to sources of documentation.  Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression.  Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.