bzip2(1)                                                              bzip2(1)

NAME
       bzip2, bunzip2 - a block-sorting file compressor, v1.0.6
       bzcat - decompresses files to stdout
       bzip2recover - recovers data from damaged bzip2 files

SYNOPSIS
       bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
       bunzip2 [ -fkvsVL ] [ filenames ... ]
       bzcat [ -s ] [ filenames ... ]
       bzip2recover filename

DESCRIPTION
       bzip2 compresses files using the Burrows-Wheeler block sorting
       text compression algorithm, and Huffman coding. Compression is
       generally considerably better than that achieved by more
       conventional LZ77/LZ78-based compressors, and approaches the
       performance of the PPM family of statistical compressors.

       The command-line options are deliberately very similar to those
       of GNU gzip, but they are not identical.

       bzip2 expects a list of file names to accompany the command-line
       flags. Each file is replaced by a compressed version of itself,
       with the name "original_name.bz2". Each compressed file has the
       same modification date, permissions, and, when possible,
       ownership as the corresponding original, so that these
       properties can be correctly restored at decompression time.
       File name handling is naive in the sense that there is no
       mechanism for preserving original file names, permissions,
       ownerships or dates in filesystems which lack these concepts, or
       have serious file name length restrictions, such as MS-DOS.
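
       The replace-and-preserve behaviour described above can be
       sketched in Python. This is an illustration only, not how bzip2
       itself is implemented: it assumes Python's standard bz2 module
       (a binding to the bzip2 library), and compress_in_place is a
       hypothetical helper name. Ownership is not copied in this
       sketch.

```python
import bz2
import os
import shutil


def compress_in_place(path):
    """Replace `path` with `path + '.bz2'`, keeping mode and timestamps.

    Hypothetical helper for illustration; not part of bzip2.
    """
    target = path + ".bz2"
    # Mode "xb" refuses to overwrite an existing target file.
    with open(path, "rb") as src, bz2.open(target, "xb") as dst:
        shutil.copyfileobj(src, dst)
    # Copy permission bits and access/modification times (not ownership).
    shutil.copystat(path, target)
    # The original is removed, so the compressed file replaces it.
    os.remove(path)
    return target
```

       Calling compress_in_place("notes.txt") would leave only
       "notes.txt.bz2" behind, carrying the original's modification
       time and permissions.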

       bzip2 and bunzip2 will by default not overwrite existing files.
       If you want this to happen, specify the -f flag.

       If no file names are specified, bzip2 compresses from standard
       input to standard output. In this case, bzip2 will decline to
       write compressed output to a terminal, as this would be entirely
       incomprehensible and therefore pointless.

       bunzip2 (or bzip2 -d) decompresses all specified files. Files
       which were not created by bzip2 will be detected and ignored,
       and a warning issued. bzip2 attempts to guess the filename for
       the decompressed file from that of the compressed file as
       follows:

              filename.bz2    becomes   filename
              filename.bz     becomes   filename
              filename.tbz2   becomes   filename.tar
              filename.tbz    becomes   filename.tar
              anyothername    becomes   anyothername.out

       If the file does not end in one of the recognised endings, .bz2,
       .bz, .tbz2 or .tbz, bzip2 complains that it cannot guess the
       name of the original file, and uses the original name with .out
       appended.

       As with compression, supplying no filenames causes decompression
       from standard input to standard output.

       bunzip2 will correctly decompress a file which is the
       concatenation of two or more compressed files. The result is the
       concatenation of the corresponding uncompressed files. Integrity
       testing (-t) of concatenated compressed files is also supported.

       You can also compress or decompress files to the standard output
       by giving the -c flag. Multiple files may be compressed and
       decompressed like this.
       The resulting outputs are fed sequentially to stdout.
       Compression of multiple files in this manner generates a stream
       containing multiple compressed file representations. Such a
       stream can be decompressed correctly only by bzip2 version 0.9.0
       or later. Earlier versions of bzip2 will stop after
       decompressing the first file in the stream.

       bzcat (or bzip2 -dc) decompresses all specified files to the
       standard output.

       bzip2 will read arguments from the environment variables BZIP2
       and BZIP, in that order, and will process them before any
       arguments read from the command line. This gives a convenient
       way to supply default arguments.

       Compression is always performed, even if the compressed file is
       slightly larger than the original. Files of less than about one
       hundred bytes tend to get larger, since the compression
       mechanism has a constant overhead in the region of 50 bytes.
       Random data (including the output of most file compressors) is
       coded at about 8.05 bits per byte, giving an expansion of around
       0.5%.

       As a self-check for your protection, bzip2 uses 32-bit CRCs to
       make sure that the decompressed version of a file is identical
       to the original. This guards against corruption of the
       compressed data, and against undetected bugs in bzip2 (hopefully
       very unlikely). The chance of data corruption going undetected
       is microscopic, about one in four billion for each file
       processed. Be aware, though, that the check occurs upon
       decompression, so it can only tell you that something is wrong.
       It can't help you recover the original uncompressed data.
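
       The effect of the CRC check can be seen in a small Python
       sketch. It uses Python's standard bz2 module (a binding to the
       bzip2 library) rather than the bzip2 command, and it assumes
       that the first block's stored CRC occupies byte offsets 10 to 13
       of the stream; that offset is a file format detail not stated on
       this page. crc_detects_damage is a hypothetical name.

```python
import bz2


def crc_detects_damage(payload=b"example data " * 50):
    """Flip one bit of the stored block CRC and try to decompress.

    Returns True if decompression is rejected. Illustrative only.
    """
    stream = bytearray(bz2.compress(payload))
    # Assumed layout: 4-byte stream header, 6-byte (48-bit) block
    # marker, then the 32-bit CRC of the first block at bytes 10..13.
    stream[10] ^= 0x01
    try:
        bz2.decompress(bytes(stream))
    except OSError:
        # The block data still decodes, but its CRC no longer matches.
        return True
    return False
```

       The damage is reported, but nothing in the error says what the
       original data was.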
       You can use bzip2recover to try to recover data from damaged
       files.

       Return values: 0 for a normal exit, 1 for environmental problems
       (file not found, invalid flags, I/O errors, etc.), 2 to indicate
       a corrupt compressed file, 3 for an internal consistency error
       (e.g., a bug) which caused bzip2 to panic.

OPTIONS
       -c --stdout
              Compress or decompress to standard output.

       -d --decompress
              Force decompression. bzip2, bunzip2 and bzcat are really
              the same program, and the decision about what actions to
              take is made on the basis of which name is used. This
              flag overrides that mechanism, and forces bzip2 to
              decompress.

       -z --compress
              The complement to -d: forces compression, regardless of
              the invocation name.

       -t --test
              Check integrity of the specified file(s), but don't
              decompress them. This really performs a trial
              decompression and throws away the result.

       -f --force
              Force overwrite of output files. Normally, bzip2 will not
              overwrite existing output files. Also forces bzip2 to
              break hard links to files, which it otherwise wouldn't
              do.

              bzip2 normally declines to decompress files which don't
              have the correct magic header bytes. If forced (-f),
              however, it will pass such files through unmodified. This
              is how GNU gzip behaves.

       -k --keep
              Keep (don't delete) input files during compression or
              decompression.

       -s --small
              Reduce memory usage, for compression, decompression and
              testing. Files are decompressed and tested using a
              modified algorithm which only requires 2.5 bytes per
              block byte. This means any file can be decompressed in
              2300k of memory, albeit at about half the normal speed.

              During compression, -s selects a block size of 200k,
              which limits memory use to around the same figure, at the
              expense of your compression ratio. In short, if your
              machine is low on memory (8 megabytes or less), use -s
              for everything. See MEMORY MANAGEMENT below.

       -q --quiet
              Suppress non-essential warning messages. Messages
              pertaining to I/O errors and other critical events will
              not be suppressed.

       -v --verbose
              Verbose mode -- show the compression ratio for each file
              processed. Further -v's increase the verbosity level,
              spewing out lots of information which is primarily of
              interest for diagnostic purposes.

       -L --license -V --version
              Display the software version, license terms and
              conditions.

       -1 (or --fast) to -9 (or --best)
              Set the block size to 100 k, 200 k .. 900 k when
              compressing. Has no effect when decompressing. See MEMORY
              MANAGEMENT below. The --fast and --best aliases are
              primarily for GNU gzip compatibility. In particular,
              --fast doesn't make things significantly faster. And
              --best merely selects the default behaviour.

       --     Treats all subsequent arguments as file names, even if
              they start with a dash.
              This is so you can handle files with names beginning with
              a dash, for example: bzip2 -- -myfilename.

       --repetitive-fast --repetitive-best
              These flags are redundant in versions 0.9.5 and above.
              They provided some coarse control over the behaviour of
              the sorting algorithm in earlier versions, which was
              sometimes useful. 0.9.5 and above have an improved
              algorithm which renders these flags irrelevant.

MEMORY MANAGEMENT
       bzip2 compresses large files in blocks. The block size affects
       both the compression ratio achieved, and the amount of memory
       needed for compression and decompression. The flags -1 through
       -9 specify the block size to be 100,000 bytes through 900,000
       bytes (the default) respectively. At decompression time, the
       block size used for compression is read from the header of the
       compressed file, and bunzip2 then allocates itself just enough
       memory to decompress the file. Since block sizes are stored in
       compressed files, it follows that the flags -1 to -9 are
       irrelevant to and so ignored during decompression.

       Compression and decompression requirements, in bytes, can be
       estimated as:

              Compression:   400k + ( 8 x block size )

              Decompression: 100k + ( 4 x block size ), or
                             100k + ( 2.5 x block size )

       Larger block sizes give rapidly diminishing marginal returns.
       Most of the compression comes from the first two or three
       hundred k of block size, a fact worth bearing in mind when using
       bzip2 on small machines.
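
       The estimates above can be tabulated with a short script. This
       is a sketch in Python, not part of bzip2. The function names are
       illustrative, and it assumes that "k" is the same unit on both
       sides of each formula and that flag -N selects a block size of
       N x 100k.

```python
def compress_usage_k(level):
    """Estimated compression memory, in k, for flags -1 to -9."""
    return 400 + 8 * (100 * level)


def decompress_usage_k(level, small=False):
    """Estimated decompression memory, in k.

    small=True models the reduced-memory formula (2.5 x block size).
    """
    per_byte = 2.5 if small else 4
    return 100 + per_byte * (100 * level)


def usage_table():
    """Rows of (flag, compress, decompress, decompress small), in k."""
    return [
        (f"-{n}", compress_usage_k(n), decompress_usage_k(n),
         decompress_usage_k(n, small=True))
        for n in range(1, 10)
    ]
```

       For the default 900k block size, compress_usage_k(9) evaluates
       to 7600 and decompress_usage_k(9) to 3700.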
       It is also important to appreciate that the decompression memory
       requirement is set at compression time by the choice of block
       size.

       For files compressed with the default 900k block size, bunzip2
       will require about 3700 kbytes to decompress. To support
       decompression of any file on a 4 megabyte machine, bunzip2 has
       an option to decompress using approximately half this amount of
       memory, about 2300 kbytes. Decompression speed is also halved,
       so you should use this option only where necessary. The relevant
       flag is -s.

       In general, try to use the largest block size memory constraints
       allow, since that maximises the compression achieved.
       Compression and decompression speed are virtually unaffected by
       block size.

       Another significant point applies to files which fit in a single
       block -- that means most files you'd encounter using a large
       block size. The amount of real memory touched is proportional to
       the size of the file, since the file is smaller than a block.
       For example, compressing a file 20,000 bytes long with the flag
       -9 will cause the compressor to allocate around 7600k of memory,
       but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly,
       the decompressor will allocate 3700k but only touch 100k +
       20000 * 4 = 180 kbytes.

       Here is a table which summarises the maximum memory usage for
       different block sizes. Also recorded is the total compressed
       size for 14 files of the Calgary Text Compression Corpus
       totalling 3,141,622 bytes. This column gives some feel for how
       compression varies with block size.
       These figures tend to understate the advantage of larger block
       sizes for larger files, since the Corpus is dominated by smaller
       files.

                  Compress   Decompress   Decompress   Corpus
           Flag     usage      usage       -s usage     Size

            -1      1200k       500k         350k      914704
            -2      2000k       900k         600k      877703
            -3      2800k      1300k         850k      860338
            -4      3600k      1700k        1100k      846899
            -5      4400k      2100k        1350k      845160
            -6      5200k      2500k        1600k      838626
            -7      6100k      2900k        1850k      834096
            -8      6800k      3300k        2100k      828642
            -9      7600k      3700k        2350k      828642

RECOVERING DATA FROM DAMAGED FILES
       bzip2 compresses files in blocks, usually 900kbytes long. Each
       block is handled independently. If a media or transmission error
       causes a multi-block .bz2 file to become damaged, it may be
       possible to recover data from the undamaged blocks in the file.

       The compressed representation of each block is delimited by a
       48-bit pattern, which makes it possible to find the block
       boundaries with reasonable certainty. Each block also carries
       its own 32-bit CRC, so damaged blocks can be distinguished from
       undamaged ones.

       bzip2recover is a simple program whose purpose is to search for
       blocks in .bz2 files, and write each block out into its own .bz2
       file. You can then use bzip2 -t to test the integrity of the
       resulting files, and decompress those which are undamaged.

       bzip2recover takes a single argument, the name of the damaged
       file, and writes a number of files "rec00001file.bz2",
       "rec00002file.bz2", etc, containing the extracted blocks.
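
       The naming pattern in the two examples above can be sketched in
       Python. The five-digit zero padding is inferred from those
       examples, and recovered_names is a hypothetical helper, not
       something bzip2recover provides.

```python
def recovered_names(damaged_name, block_count):
    """Names of the per-block files, numbered from 1 (illustrative)."""
    return [
        f"rec{n:05d}{damaged_name}"
        for n in range(1, block_count + 1)
    ]


def sorts_in_block_order(damaged_name, block_count):
    """True if plain string sorting keeps the blocks in numeric order."""
    names = recovered_names(damaged_name, block_count)
    return sorted(names) == names
```

       Because the counter is zero-padded to a fixed width, sorting the
       names as strings gives the same order as sorting by block
       number.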
       The output filenames are designed so that the use of wildcards
       in subsequent processing -- for example, "bzip2 -dc rec*file.bz2
       > recovered_data" -- processes the files in the correct order.

       bzip2recover should be of most use dealing with large .bz2
       files, as these will contain many blocks. It is clearly futile
       to use it on damaged single-block files, since a damaged block
       cannot be recovered. If you wish to minimise any potential data
       loss through media or transmission errors, you might consider
       compressing with a smaller block size.

PERFORMANCE NOTES
       The sorting phase of compression gathers together similar
       strings in the file. Because of this, files containing very long
       runs of repeated symbols, like "aabaabaabaab ..." (repeated
       several hundred times) may compress more slowly than normal.
       Versions 0.9.5 and above fare much better than previous versions
       in this respect. The ratio between worst-case and average-case
       compression time is in the region of 10:1. For previous
       versions, this figure was more like 100:1. You can use the -vvvv
       option to monitor progress in great detail, if you want.

       Decompression speed is unaffected by these phenomena.

       bzip2 usually allocates several megabytes of memory to operate
       in, and then charges all over it in a fairly random fashion.
       This means that performance, both for compressing and
       decompressing, is largely determined by the speed at which your
       machine can service cache misses.
       Because of this, small changes to the code to reduce the miss
       rate have been observed to give disproportionately large
       performance improvements. I imagine bzip2 will perform best on
       machines with very large caches.

CAVEATS
       I/O error messages are not as helpful as they could be. bzip2
       tries hard to detect I/O errors and exit cleanly, but the
       details of what the problem is sometimes seem rather misleading.

       This manual page pertains to version 1.0.6 of bzip2. Compressed
       data created by this version is entirely forwards and backwards
       compatible with the previous public releases, versions 0.1pl2,
       0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the
       following exception: 0.9.0 and above can correctly decompress
       multiple concatenated compressed files. 0.1pl2 cannot do this;
       it will stop after decompressing just the first file in the
       stream.

       bzip2recover versions prior to 1.0.2 used 32-bit integers to
       represent bit positions in compressed files, so they could not
       handle compressed files more than 512 megabytes long. Versions
       1.0.2 and above use 64-bit ints on some platforms which support
       them (GNU supported targets, and Windows). To establish whether
       or not bzip2recover was built with such a limitation, run it
       without arguments. In any event you can build yourself an
       unlimited version if you can recompile it with MaybeUInt64 set
       to be an unsigned 64-bit integer.

AUTHOR
       Julian Seward, jseward@bzip.org.

       http://www.bzip.org

       The ideas embodied in bzip2 are due to (at least) the following
       people: Michael Burrows and David Wheeler (for the block sorting
       transformation), David Wheeler (again, for the Huffman coder),
       Peter Fenwick (for the structured coding model in the original
       bzip, and many refinements), and Alistair Moffat, Radford Neal
       and Ian Witten (for the arithmetic coder in the original bzip).
       I am much indebted for their help, support and advice. See the
       manual in the source distribution for pointers to sources of
       documentation. Christian von Roques encouraged me to look for
       faster sorting algorithms, so as to speed up compression. Bela
       Lubkin encouraged me to improve the worst-case compression
       performance. Donna Robinson XMLised the documentation. The bz*
       scripts are derived from those of GNU gzip. Many people sent
       patches, helped with portability problems, lent machines, gave
       advice and were generally helpful.