.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding. Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags. Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files. If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output. In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files. Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.

As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files. The result is the
concatenation of the corresponding uncompressed files. Integrity
testing (\-t)
of concatenated
compressed files is also supported.

You can also compress or decompress files to the standard output by
giving the \-c flag.
Multiple files may be compressed and
decompressed like this. The resulting outputs are fed sequentially to
stdout. Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations. Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later. Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 -dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line. This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original. Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes. Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original. This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely). The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed.
Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong. It can't help you
recover the original uncompressed
data. You can use
.I bzip2recover
to try to recover data from
damaged files.

Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
made on the basis of which name is used. This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files. Normally,
.I bzip2
will not overwrite
existing output files. Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes. If forced (-f), however, it will pass
such files through unmodified. This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing. Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte. This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio. In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything. See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages. Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k, and so on up to 900 k when
compressing. Has no
effect when decompressing. See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility. In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash.
This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above. They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful. 0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks. The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression. The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively. At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file. Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to, and so ignored
during, decompression.

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )

Larger block sizes give rapidly diminishing marginal returns. Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
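
The estimates in this section can be written out as a short Python
sketch. It is an illustration only and not part of bzip2: the function
names are invented for this example, and k is assumed to mean 1000
bytes, which is what the figures quoted in this page imply.
.nf
```python
# Illustrative only: the memory formulas from this section, with "k"
# taken to mean 1000 bytes (an assumption that matches the quoted figures).
K = 1000


def block_size(level):
    # Flags -1 through -9 select block sizes of 100,000 through 900,000 bytes.
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100 * K


def compress_mem(level):
    # Compression: 400k + ( 8 x block size )
    return 400 * K + 8 * block_size(level)


def decompress_mem(level, small=False):
    # Decompression: 100k + ( 4 x block size ), or
    # 100k + ( 2.5 x block size ) when the -s flag is used.
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))
```
.fi
For example, compress_mem(9) evaluates to 7600000 and
decompress_mem(9, small=True) to 2350000, in line with the 7600k and
roughly 2300k figures quoted for the default block size.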

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress. To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes. Decompression speed is also halved, so you should use this
option only where necessary. The relevant flag is -s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved. Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size. The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block. For example, compressing a file
20,000 bytes long with the flag -9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it. Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different
block sizes. Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6000k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900kbytes long. Each
block is handled independently. If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty. Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file. You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc., containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.
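
The same in-order processing can be approximated in Python with the
standard bz2 module. This is only a sketch under stated assumptions:
recover_in_order and its parameters are hypothetical names chosen for
illustration, the recovered files are assumed to sit in one directory,
and files that fail to decompress are skipped instead of stopping the run.
.nf
```python
# Illustrative only; recover_in_order() is a made-up helper, not part of bzip2.
import bz2
import glob
import os


def recover_in_order(directory, pattern="rec*.bz2"):
    # Sorting the names restores block order, because the output files
    # carry zero-padded counters (rec00001..., rec00002..., and so on).
    recovered = []
    skipped = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, "rb") as f:
            data = f.read()
        try:
            # bz2.decompress raises OSError on invalid data and
            # ValueError on a truncated stream.
            recovered.append(bz2.decompress(data))
        except (OSError, ValueError):
            skipped.append(path)
    return b"".join(recovered), skipped
```
.fi
The function returns the concatenated data from the readable files
together with the list of paths it had to skip, so the caller can see
which blocks were lost.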

.I bzip2recover
is most useful when dealing with large .bz2
files, as these will contain many blocks. It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered. If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file. Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..." (repeated several hundred times) may
compress more slowly than normal. Versions 0.9.5 and above fare much
better than previous versions in this respect. The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1. You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion. This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.6 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files. 0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long. Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows). To establish whether or not bzip2recover was
built with such a limitation, run it without arguments. In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice.
See the manual in the
source distribution for pointers to sources of documentation. Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression. Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.