NAME
       bzip2, bunzip2 - a block-sorting file compressor, v1.0.6
       bzcat - decompresses files to stdout
       bzip2recover - recovers data from damaged bzip2 files


SYNOPSIS
       bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
       bunzip2 [ -fkvsVL ] [ filenames ... ]
       bzcat [ -s ] [ filenames ... ]
       bzip2recover filename


DESCRIPTION
       bzip2 compresses files using the Burrows-Wheeler block sorting
       text compression algorithm, and Huffman coding. Compression is
       generally considerably better than that achieved by more
       conventional LZ77/LZ78-based compressors, and approaches the
       performance of the PPM family of statistical compressors.

       The command-line options are deliberately very similar to those
       of GNU gzip, but they are not identical.

       bzip2 expects a list of file names to accompany the command-line
       flags. Each file is replaced by a compressed version of itself,
       with the name "original_name.bz2". Each compressed file has the
       same modification date, permissions, and, when possible,
       ownership as the corresponding original, so that these
       properties can be correctly restored at decompression time.
       File name handling is naive in the sense that there is no
       mechanism for preserving original file names, permissions,
       ownerships or dates in filesystems which lack these concepts, or
       have serious file name length restrictions, such as MS-DOS.

       bzip2 and bunzip2 will by default not overwrite existing files.
       If you want this to happen, specify the -f flag.
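
       The replace-in-place behaviour described above can be sketched
       in Python with the standard-library bz2 module, which binds to
       libbzip2 rather than running the bzip2 program. The helper name
       compress_in_place and its force parameter are hypothetical,
       chosen only to mirror the -f flag; ownership is not handled.

```python
import bz2
import os
import shutil


def compress_in_place(path, force=False):
    """Replace `path` with `path + ".bz2"`, roughly as bzip2 does.

    Hypothetical helper for illustration only; not part of bzip2.
    """
    out_path = path + ".bz2"
    # Mirror the default refusal to overwrite; `force` stands in for -f.
    if os.path.exists(out_path) and not force:
        raise FileExistsError(out_path)
    with open(path, "rb") as src, bz2.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Carry over modification time and permission bits (not ownership).
    shutil.copystat(path, out_path)
    os.remove(path)
    return out_path
```

       Calling compress_in_place("notes.txt") would leave only
       "notes.txt.bz2" behind; a second call on a fresh "notes.txt"
       raises FileExistsError unless force=True is passed.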

       If no file names are specified, bzip2 compresses from standard
       input to standard output. In this case, bzip2 will decline to
       write compressed output to a terminal, as this would be entirely
       incomprehensible and therefore pointless.

       bunzip2 (or bzip2 -d) decompresses all specified files. Files
       which were not created by bzip2 will be detected and ignored,
       and a warning issued. bzip2 attempts to guess the filename for
       the decompressed file from that of the compressed file as
       follows:

              filename.bz2    becomes   filename
              filename.bz     becomes   filename
              filename.tbz2   becomes   filename.tar
              filename.tbz    becomes   filename.tar
              anyothername    becomes   anyothername.out

       If the file does not end in one of the recognised endings, .bz2,
       .bz, .tbz2 or .tbz, bzip2 complains that it cannot guess the
       name of the original file, and uses the original name with .out
       appended.

       As with compression, supplying no filenames causes decompression
       from standard input to standard output.

       bunzip2 will correctly decompress a file which is the
       concatenation of two or more compressed files. The result is the
       concatenation of the corresponding uncompressed files. Integrity
       testing (-t) of concatenated compressed files is also supported.

       You can also compress or decompress files to the standard output
       by giving the -c flag. Multiple files may be compressed and
       decompressed like this. The resulting outputs are fed
       sequentially to stdout. Compression of multiple files in this
       manner generates a stream containing multiple compressed file
       representations.
       Such a stream can be decompressed correctly only by bzip2
       version 0.9.0 or later. Earlier versions of bzip2 will stop
       after decompressing the first file in the stream.

       bzcat (or bzip2 -dc) decompresses all specified files to the
       standard output.

       bzip2 will read arguments from the environment variables BZIP2
       and BZIP, in that order, and will process them before any
       arguments read from the command line. This gives a convenient
       way to supply default arguments.

       Compression is always performed, even if the compressed file is
       slightly larger than the original. Files of less than about one
       hundred bytes tend to get larger, since the compression
       mechanism has a constant overhead in the region of 50 bytes.
       Random data (including the output of most file compressors) is
       coded at about 8.05 bits per byte, giving an expansion of around
       0.5%.

       As a self-check for your protection, bzip2 uses 32-bit CRCs to
       make sure that the decompressed version of a file is identical
       to the original. This guards against corruption of the
       compressed data, and against undetected bugs in bzip2 (hopefully
       very unlikely). The chance of data corruption going undetected
       is microscopic, about one in four billion for each file
       processed. Be aware, though, that the check occurs upon
       decompression, so it can only tell you that something is wrong.
       It can't help you recover the original uncompressed data. You
       can use bzip2recover to try to recover data from damaged files.
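
       Two of the behaviours described above, decompression of
       concatenated streams and detection of damage at decompression
       time, can be illustrated with a minimal Python sketch. It uses
       the standard-library bz2 module (a libbzip2 binding) as a
       stand-in for the bzip2 program. The helper is_intact is
       hypothetical, and the choice of byte 10 assumes the usual layout
       of a 4-byte stream header followed by a 6-byte block marker and
       the stored block CRC.

```python
import bz2


def is_intact(data):
    """Trial-decompress `data` and discard the result, loosely like -t."""
    try:
        bz2.decompress(data)
    except (OSError, ValueError):
        # Raised for invalid, corrupt or truncated input.
        return False
    return True


first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Concatenated streams decompress to the concatenated originals.
combined = bz2.decompress(first + second)

# Flip one bit in what is assumed to be the stored block CRC,
# so that the integrity check fails on decompression.
damaged = bytearray(first)
damaged[10] ^= 0x01

print(combined)
print(is_intact(first), is_intact(bytes(damaged)))
```

       As in the text, the check only reports that something is wrong;
       nothing in this sketch attempts to repair the damaged stream.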

       Return values: 0 for a normal exit, 1 for environmental problems
       (file not found, invalid flags, I/O errors, &c), 2 to indicate a
       corrupt compressed file, 3 for an internal consistency error
       (eg, bug) which caused bzip2 to panic.


OPTIONS
       -c --stdout
              Compress or decompress to standard output.

       -d --decompress
              Force decompression. bzip2, bunzip2 and bzcat are really
              the same program, and the decision about what actions to
              take is done on the basis of which name is used. This
              flag overrides that mechanism, and forces bzip2 to
              decompress.

       -z --compress
              The complement to -d: forces compression, regardless of
              the invocation name.

       -t --test
              Check integrity of the specified file(s), but don't
              decompress them. This really performs a trial
              decompression and throws away the result.

       -f --force
              Force overwrite of output files. Normally, bzip2 will not
              overwrite existing output files. Also forces bzip2 to
              break hard links to files, which it otherwise wouldn't
              do.

              bzip2 normally declines to decompress files which don't
              have the correct magic header bytes. If forced (-f),
              however, it will pass such files through unmodified. This
              is how GNU gzip behaves.

       -k --keep
              Keep (don't delete) input files during compression or
              decompression.

       -s --small
              Reduce memory usage, for compression, decompression and
              testing. Files are decompressed and tested using a
              modified algorithm which only requires 2.5 bytes per
              block byte.
              This means any file can be decompressed in 2300k of
              memory, albeit at about half the normal speed.

              During compression, -s selects a block size of 200k,
              which limits memory use to around the same figure, at the
              expense of your compression ratio. In short, if your
              machine is low on memory (8 megabytes or less), use -s
              for everything. See MEMORY MANAGEMENT below.

       -q --quiet
              Suppress non-essential warning messages. Messages
              pertaining to I/O errors and other critical events will
              not be suppressed.

       -v --verbose
              Verbose mode -- show the compression ratio for each file
              processed. Further -v's increase the verbosity level,
              spewing out lots of information which is primarily of
              interest for diagnostic purposes.

       -L --license -V --version
              Display the software version, license terms and
              conditions.

       -1 (or --fast) to -9 (or --best)
              Set the block size to 100 k, 200 k .. 900 k when
              compressing. Has no effect when decompressing. See MEMORY
              MANAGEMENT below. The --fast and --best aliases are
              primarily for GNU gzip compatibility. In particular,
              --fast doesn't make things significantly faster. And
              --best merely selects the default behaviour.

       --     Treats all subsequent arguments as file names, even if
              they start with a dash. This is so you can handle files
              with names beginning with a dash, for example:
              bzip2 -- -myfilename.

       --repetitive-fast --repetitive-best
              These flags are redundant in versions 0.9.5 and above.
              They provided some coarse control over the behaviour of
              the sorting algorithm in earlier versions, which was
              sometimes useful. 0.9.5 and above have an improved
              algorithm which renders these flags irrelevant.


MEMORY MANAGEMENT
       bzip2 compresses large files in blocks. The block size affects
       both the compression ratio achieved, and the amount of memory
       needed for compression and decompression. The flags -1 through
       -9 specify the block size to be 100,000 bytes through 900,000
       bytes (the default) respectively. At decompression time, the
       block size used for compression is read from the header of the
       compressed file, and bunzip2 then allocates itself just enough
       memory to decompress the file. Since block sizes are stored in
       compressed files, it follows that the flags -1 to -9 are
       irrelevant to and so ignored during decompression.

       Compression and decompression requirements, in bytes, can be
       estimated as:

              Compression:   400k + ( 8 x block size )

              Decompression: 100k + ( 4 x block size ), or
                             100k + ( 2.5 x block size )

       Larger block sizes give rapidly diminishing marginal returns.
       Most of the compression comes from the first two or three
       hundred k of block size, a fact worth bearing in mind when using
       bzip2 on small machines. It is also important to appreciate that
       the decompression memory requirement is set at compression time
       by the choice of block size.

       For files compressed with the default 900k block size, bunzip2
       will require about 3700 kbytes to decompress.
       To support decompression of any file on a 4 megabyte machine,
       bunzip2 has an option to decompress using approximately half
       this amount of memory, about 2300 kbytes. Decompression speed is
       also halved, so you should use this option only where necessary.
       The relevant flag is -s.

       In general, try and use the largest block size memory
       constraints allow, since that maximises the compression
       achieved. Compression and decompression speed are virtually
       unaffected by block size.

       Another significant point applies to files which fit in a single
       block -- that means most files you'd encounter using a large
       block size. The amount of real memory touched is proportional to
       the size of the file, since the file is smaller than a block.
       For example, compressing a file 20,000 bytes long with the flag
       -9 will cause the compressor to allocate around 7600k of memory,
       but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly,
       the decompressor will allocate 3700k but only touch
       100k + 20000 * 4 = 180 kbytes.

       Here is a table which summarises the maximum memory usage for
       different block sizes. Also recorded is the total compressed
       size for 14 files of the Calgary Text Compression Corpus
       totalling 3,141,622 bytes. This column gives some feel for how
       compression varies with block size. These figures tend to
       understate the advantage of larger block sizes for larger files,
       since the Corpus is dominated by smaller files.
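
       Before the table, the estimates given earlier in this section
       can be written out as a short Python sketch. The function name
       memory_estimate is hypothetical, and "k" is taken to mean 1000
       bytes, which is an assumption inferred from the worked 20,000
       byte example. Evaluated this way, the formulas agree with the
       table's usage columns except that they give 6000k for -7
       compression where the table lists 6100k.

```python
def memory_estimate(level, data_size=None):
    """Return (compress, decompress, decompress_small) estimates in bytes.

    `level` is the numeric flag 1..9; the block size is taken as
    level * 100,000 bytes. If `data_size` is given and smaller than a
    block, it replaces the block size, which estimates the memory
    actually touched rather than allocated.
    """
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    block = level * 100_000
    if data_size is not None:
        block = min(block, data_size)
    compress = 400_000 + 8 * block
    decompress = 100_000 + 4 * block
    decompress_small = 100_000 + (5 * block) // 2  # 2.5 bytes per block byte
    return compress, decompress, decompress_small


for level in range(1, 10):
    c, d, s = memory_estimate(level)
    print(f"-{level}  {c // 1000}k  {d // 1000}k  {s // 1000}k")
```

       Passing data_size=20_000 with level 9 reproduces the touched
       memory figures from the example above.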

                  Compress   Decompress   Decompress   Corpus
           Flag     usage      usage       -s usage     Size

            -1      1200k       500k         350k      914704
            -2      2000k       900k         600k      877703
            -3      2800k      1300k         850k      860338
            -4      3600k      1700k        1100k      846899
            -5      4400k      2100k        1350k      845160
            -6      5200k      2500k        1600k      838626
            -7      6100k      2900k        1850k      834096
            -8      6800k      3300k        2100k      828642
            -9      7600k      3700k        2350k      828642


RECOVERING DATA FROM DAMAGED FILES
       bzip2 compresses files in blocks, usually 900kbytes long. Each
       block is handled independently. If a media or transmission error
       causes a multi-block .bz2 file to become damaged, it may be
       possible to recover data from the undamaged blocks in the file.

       The compressed representation of each block is delimited by a
       48-bit pattern, which makes it possible to find the block
       boundaries with reasonable certainty. Each block also carries
       its own 32-bit CRC, so damaged blocks can be distinguished from
       undamaged ones.

       bzip2recover is a simple program whose purpose is to search for
       blocks in .bz2 files, and write each block out into its own .bz2
       file. You can then use bzip2 -t to test the integrity of the
       resulting files, and decompress those which are undamaged.

       bzip2recover takes a single argument, the name of the damaged
       file, and writes a number of files "rec00001file.bz2",
       "rec00002file.bz2", etc, containing the extracted blocks. The
       output filenames are designed so that the use of wildcards in
       subsequent processing -- for example,
       "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files
       in the correct order.
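
       The point about filename ordering can be sketched in Python:
       because the block numbers are zero-padded, a plain sort of the
       names matches block order. The helper join_recovered_blocks is
       hypothetical and uses the standard-library bz2 module rather
       than the bzip2 program. Unlike the shell command above, it skips
       blocks that fail to decompress, which is a design choice of this
       sketch and not something bzip2recover does for you.

```python
import bz2
import glob
import os


def join_recovered_blocks(directory, out_path):
    """Decompress rec*.bz2 files in name order into `out_path`.

    Returns the list of files skipped because they failed to decompress.
    Hypothetical helper; bzip2recover itself is not invoked here.
    """
    skipped = []
    # Zero-padded block numbers make lexicographic order match block order.
    names = sorted(glob.glob(os.path.join(directory, "rec*.bz2")))
    with open(out_path, "wb") as out:
        for name in names:
            with open(name, "rb") as f:
                data = f.read()
            try:
                out.write(bz2.decompress(data))
            except (OSError, ValueError):
                # Damaged or invalid block: record it and carry on.
                skipped.append(name)
    return skipped
```

       The returned list shows which recovered blocks were unusable,
       so the gaps in the output can be accounted for.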

       bzip2recover should be of most use dealing with large .bz2
       files, as these will contain many blocks. It is clearly futile
       to use it on damaged single-block files, since a damaged block
       cannot be recovered. If you wish to minimise any potential data
       loss through media or transmission errors, you might consider
       compressing with a smaller block size.


PERFORMANCE NOTES
       The sorting phase of compression gathers together similar
       strings in the file. Because of this, files containing very long
       runs of repeated symbols, like "aabaabaabaab ..." (repeated
       several hundred times) may compress more slowly than normal.
       Versions 0.9.5 and above fare much better than previous versions
       in this respect. The ratio between worst-case and average-case
       compression time is in the region of 10:1. For previous
       versions, this figure was more like 100:1. You can use the -vvvv
       option to monitor progress in great detail, if you want.

       Decompression speed is unaffected by these phenomena.

       bzip2 usually allocates several megabytes of memory to operate
       in, and then charges all over it in a fairly random fashion.
       This means that performance, both for compressing and
       decompressing, is largely determined by the speed at which your
       machine can service cache misses. Because of this, small changes
       to the code to reduce the miss rate have been observed to give
       disproportionately large performance improvements. I imagine
       bzip2 will perform best on machines with very large caches.


CAVEATS
       I/O error messages are not as helpful as they could be.
       bzip2 tries hard to detect I/O errors and exit cleanly, but the
       details of what the problem is sometimes seem rather misleading.

       This manual page pertains to version 1.0.6 of bzip2. Compressed
       data created by this version is entirely forwards and backwards
       compatible with the previous public releases, versions 0.1pl2,
       0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the
       following exception: 0.9.0 and above can correctly decompress
       multiple concatenated compressed files. 0.1pl2 cannot do this;
       it will stop after decompressing just the first file in the
       stream.

       bzip2recover versions prior to 1.0.2 used 32-bit integers to
       represent bit positions in compressed files, so they could not
       handle compressed files more than 512 megabytes long. Versions
       1.0.2 and above use 64-bit ints on some platforms which support
       them (GNU supported targets, and Windows). To establish whether
       or not bzip2recover was built with such a limitation, run it
       without arguments. In any event you can build yourself an
       unlimited version if you can recompile it with MaybeUInt64 set
       to be an unsigned 64-bit integer.


AUTHOR
       Julian Seward, jseward@bzip.org.

       http://www.bzip.org

       The ideas embodied in bzip2 are due to (at least) the following
       people: Michael Burrows and David Wheeler (for the block sorting
       transformation), David Wheeler (again, for the Huffman coder),
       Peter Fenwick (for the structured coding model in the original
       bzip, and many refinements), and Alistair Moffat, Radford Neal
       and Ian Witten (for the arithmetic coder in the original bzip).
       I am much indebted for their help, support and advice. See the
       manual in the source distribution for pointers to sources of
       documentation. Christian von Roques encouraged me to look for
       faster sorting algorithms, so as to speed up compression. Bela
       Lubkin encouraged me to improve the worst-case compression
       performance. Donna Robinson XMLised the documentation. The bz*
       scripts are derived from those of GNU gzip. Many people sent
       patches, helped with portability problems, lent machines, gave
       advice and were generally helpful.