bzip2(1)                                                        bzip2(1)


NAME
       bzip2, bunzip2 - a block-sorting file compressor, v1.0.6
       bzcat - decompresses files to stdout
       bzip2recover - recovers data from damaged bzip2 files


SYNOPSIS
       bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
       bunzip2 [ -fkvsVL ] [ filenames ... ]
       bzcat [ -s ] [ filenames ... ]
       bzip2recover filename


DESCRIPTION
       bzip2 compresses files using the Burrows-Wheeler block
       sorting text compression algorithm, and Huffman coding.
       Compression is generally considerably better than that
       achieved by more conventional LZ77/LZ78-based compressors,
       and approaches the performance of the PPM family of
       statistical compressors.

       The command-line options are deliberately very similar to
       those of GNU gzip, but they are not identical.

       bzip2 expects a list of file names to accompany the
       command-line flags. Each file is replaced by a compressed
       version of itself, with the name "original_name.bz2".
       Each compressed file has the same modification date,
       permissions, and, when possible, ownership as the
       corresponding original, so that these properties can be
       correctly restored at decompression time. File name
       handling is naive in the sense that there is no mechanism
       for preserving original file names, permissions, ownerships
       or dates in filesystems which lack these concepts, or have
       serious file name length restrictions, such as MS-DOS.

       bzip2 and bunzip2 will by default not overwrite existing
       files.
       If you want this to happen, specify the -f flag.

       If no file names are specified, bzip2 compresses from
       standard input to standard output. In this case, bzip2
       will decline to write compressed output to a terminal, as
       this would be entirely incomprehensible and therefore
       pointless.

       bunzip2 (or bzip2 -d) decompresses all specified files.
       Files which were not created by bzip2 will be detected and
       ignored, and a warning issued. bzip2 attempts to guess
       the filename for the decompressed file from that of the
       compressed file as follows:

              filename.bz2    becomes   filename
              filename.bz     becomes   filename
              filename.tbz2   becomes   filename.tar
              filename.tbz    becomes   filename.tar
              anyothername    becomes   anyothername.out

       If the file does not end in one of the recognised endings,
       .bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot
       guess the name of the original file, and uses the original
       name with .out appended.

       As with compression, supplying no filenames causes
       decompression from standard input to standard output.

       bunzip2 will correctly decompress a file which is the
       concatenation of two or more compressed files. The result
       is the concatenation of the corresponding uncompressed
       files. Integrity testing (-t) of concatenated compressed
       files is also supported.

       You can also compress or decompress files to the standard
       output by giving the -c flag. Multiple files may be
       compressed and decompressed like this. The resulting
       outputs are fed sequentially to stdout.
       Compression of multiple files in this manner generates a
       stream containing multiple compressed file representations.
       Such a stream can be decompressed correctly only by bzip2
       version 0.9.0 or later. Earlier versions of bzip2 will stop
       after decompressing the first file in the stream.

       bzcat (or bzip2 -dc) decompresses all specified files to
       the standard output.

       bzip2 will read arguments from the environment variables
       BZIP2 and BZIP, in that order, and will process them
       before any arguments read from the command line. This
       gives a convenient way to supply default arguments.

       Compression is always performed, even if the compressed
       file is slightly larger than the original. Files of less
       than about one hundred bytes tend to get larger, since the
       compression mechanism has a constant overhead in the
       region of 50 bytes. Random data (including the output of
       most file compressors) is coded at about 8.05 bits per
       byte, giving an expansion of around 0.5%.

       As a self-check for your protection, bzip2 uses 32-bit
       CRCs to make sure that the decompressed version of a file
       is identical to the original. This guards against
       corruption of the compressed data, and against undetected
       bugs in bzip2 (hopefully very unlikely). The chance of
       data corruption going undetected is microscopic, about one
       chance in four billion for each file processed. Be aware,
       though, that the check occurs upon decompression, so it
       can only tell you that something is wrong. It can't help
       you recover the original uncompressed data. You can use
       bzip2recover to try to recover data from damaged files.
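       Two of the points above, that concatenated compressed
       streams decompress to the concatenation of the originals
       and that very small inputs grow when compressed, can be
       checked with a minimal sketch. It assumes Python's
       standard bz2 module, which is a separate binding to the
       bzip2 library and not the bzip2 command itself. The
       function names join_and_decompress and growth are
       illustrative only.

```python
import bz2


def join_and_decompress(parts):
    """Compress each part on its own, join the streams, decompress the lot.

    Mirrors the behaviour described for concatenated .bz2 files: the
    result should be the concatenation of the uncompressed parts.
    """
    stream = b"".join(bz2.compress(part) for part in parts)
    return bz2.decompress(stream)


def growth(data, level=9):
    """Return compressed size minus original size, in bytes.

    A positive value means the "compressed" form is larger, which the
    text says is typical for inputs under about one hundred bytes.
    """
    return len(bz2.compress(data, level)) - len(data)


if __name__ == "__main__":
    print(join_and_decompress([b"first file\n", b"second file\n"]))
    print("growth for a 5-byte input:", growth(b"hello"))
    print("growth for 10,000 repeated bytes:", growth(b"a" * 10_000))
```

       The exact byte counts depend on the library version, so
       the sketch prints them instead of assuming particular
       values.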

       Return values: 0 for a normal exit, 1 for environmental
       problems (file not found, invalid flags, I/O errors, etc.),
       2 to indicate a corrupt compressed file, 3 for an internal
       consistency error (e.g., a bug) which caused bzip2 to panic.


OPTIONS
       -c --stdout
              Compress or decompress to standard output.

       -d --decompress
              Force decompression. bzip2, bunzip2 and bzcat are
              really the same program, and the decision about
              what actions to take is done on the basis of which
              name is used. This flag overrides that mechanism,
              and forces bzip2 to decompress.

       -z --compress
              The complement to -d: forces compression,
              regardless of the invocation name.

       -t --test
              Check integrity of the specified file(s), but don't
              decompress them. This really performs a trial
              decompression and throws away the result.

       -f --force
              Force overwrite of output files. Normally, bzip2
              will not overwrite existing output files. Also
              forces bzip2 to break hard links to files, which it
              otherwise wouldn't do.

              bzip2 normally declines to decompress files which
              don't have the correct magic header bytes. If
              forced (-f), however, it will pass such files
              through unmodified. This is how GNU gzip behaves.

       -k --keep
              Keep (don't delete) input files during compression
              or decompression.

       -s --small
              Reduce memory usage, for compression, decompression
              and testing. Files are decompressed and tested
              using a modified algorithm which only requires 2.5
              bytes per block byte.
              This means any file can be decompressed in 2300k
              of memory, albeit at about half the normal speed.

              During compression, -s selects a block size of
              200k, which limits memory use to around the same
              figure, at the expense of your compression ratio.
              In short, if your machine is low on memory (8
              megabytes or less), use -s for everything. See
              MEMORY MANAGEMENT below.

       -q --quiet
              Suppress non-essential warning messages. Messages
              pertaining to I/O errors and other critical events
              will not be suppressed.

       -v --verbose
              Verbose mode -- show the compression ratio for each
              file processed. Further -v's increase the
              verbosity level, spewing out lots of information
              which is primarily of interest for diagnostic
              purposes.

       -L --license -V --version
              Display the software version, license terms and
              conditions.

       -1 (or --fast) to -9 (or --best)
              Set the block size to 100 k, 200 k ... 900 k when
              compressing. Has no effect when decompressing.
              See MEMORY MANAGEMENT below. The --fast and --best
              aliases are primarily for GNU gzip compatibility.
              In particular, --fast doesn't make things
              significantly faster. And --best merely selects the
              default behaviour.

       --     Treats all subsequent arguments as file names, even
              if they start with a dash. This is so you can
              handle files with names beginning with a dash, for
              example: bzip2 -- -myfilename.

       --repetitive-fast --repetitive-best
              These flags are redundant in versions 0.9.5 and
              above.
              They provided some coarse control over the
              behaviour of the sorting algorithm in earlier
              versions, which was sometimes useful. 0.9.5 and
              above have an improved algorithm which renders
              these flags irrelevant.


MEMORY MANAGEMENT
       bzip2 compresses large files in blocks. The block size
       affects both the compression ratio achieved, and the
       amount of memory needed for compression and decompression.
       The flags -1 through -9 specify the block size to be
       100,000 bytes through 900,000 bytes (the default)
       respectively. At decompression time, the block size used
       for compression is read from the header of the compressed
       file, and bunzip2 then allocates itself just enough memory
       to decompress the file. Since block sizes are stored in
       compressed files, it follows that the flags -1 to -9 are
       irrelevant to and so ignored during decompression.

       Compression and decompression requirements, in bytes, can
       be estimated as:

              Compression:   400k + ( 8 x block size )

              Decompression: 100k + ( 4 x block size ), or
                             100k + ( 2.5 x block size )

       Larger block sizes give rapidly diminishing marginal
       returns. Most of the compression comes from the first two
       or three hundred k of block size, a fact worth bearing in
       mind when using bzip2 on small machines. It is also
       important to appreciate that the decompression memory
       requirement is set at compression time by the choice of
       block size.

       For files compressed with the default 900k block size,
       bunzip2 will require about 3700 kbytes to decompress.
       To support decompression of any file on a 4 megabyte
       machine, bunzip2 has an option to decompress using
       approximately half this amount of memory, about 2300
       kbytes. Decompression speed is also halved, so you should
       use this option only where necessary. The relevant flag
       is -s.

       In general, try to use the largest block size memory
       constraints allow, since that maximises the compression
       achieved. Compression and decompression speed are
       virtually unaffected by block size.

       Another significant point applies to files which fit in a
       single block -- that means most files you'd encounter
       using a large block size. The amount of real memory
       touched is proportional to the size of the file, since the
       file is smaller than a block. For example, compressing a
       file 20,000 bytes long with the flag -9 will cause the
       compressor to allocate around 7600k of memory, but only
       touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the
       decompressor will allocate 3700k but only touch 100k +
       20000 * 4 = 180 kbytes.

       Here is a table which summarises the maximum memory usage
       for different block sizes. Also recorded is the total
       compressed size for 14 files of the Calgary Text
       Compression Corpus totalling 3,141,622 bytes. This column
       gives some feel for how compression varies with block
       size. These figures tend to understate the advantage of
       larger block sizes for larger files, since the Corpus is
       dominated by smaller files.
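       The estimates given earlier in this section, and the
       20,000-byte worked example, can be reproduced with a short
       calculation. This is a minimal sketch: it assumes "k"
       means 1000 bytes, as implied by the 100,000 to 900,000
       byte block sizes, and the function names are illustrative,
       not part of bzip2.

```python
K = 1000  # assumption: the page's "k" is 1000 bytes (-1 selects 100,000 bytes)


def compress_bytes(block_size):
    """Estimated compression memory: 400k + (8 x block size)."""
    return 400 * K + 8 * block_size


def decompress_bytes(block_size, small=False):
    """Estimated decompression memory.

    100k + (4 x block size) normally, or 100k + (2.5 x block size)
    when the -s flag is used.
    """
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size)


def touched(file_size, level=9):
    """Memory actually touched, as (compress, decompress) in bytes.

    For a file smaller than one block, the file size takes the place
    of the block size in the estimates.
    """
    effective = min(file_size, level * 100 * K)
    return compress_bytes(effective), decompress_bytes(effective)


if __name__ == "__main__":
    print("allocated at -9:", compress_bytes(900 * K), decompress_bytes(900 * K))
    print("touched for a 20,000-byte file:", touched(20_000))
```

       For the 20,000-byte file this gives 560,000 and 180,000
       bytes, matching the figures in the text.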

                     Compress   Decompress   Decompress   Corpus
              Flag     usage      usage       -s usage     Size

               -1      1200k       500k         350k      914704
               -2      2000k       900k         600k      877703
               -3      2800k      1300k         850k      860338
               -4      3600k      1700k        1100k      846899
               -5      4400k      2100k        1350k      845160
               -6      5200k      2500k        1600k      838626
               -7      6100k      2900k        1850k      834096
               -8      6800k      3300k        2100k      828642
               -9      7600k      3700k        2350k      828642


RECOVERING DATA FROM DAMAGED FILES
       bzip2 compresses files in blocks, usually 900kbytes long.
       Each block is handled independently. If a media or
       transmission error causes a multi-block .bz2 file to
       become damaged, it may be possible to recover data from
       the undamaged blocks in the file.

       The compressed representation of each block is delimited
       by a 48-bit pattern, which makes it possible to find the
       block boundaries with reasonable certainty. Each block
       also carries its own 32-bit CRC, so damaged blocks can be
       distinguished from undamaged ones.

       bzip2recover is a simple program whose purpose is to
       search for blocks in .bz2 files, and write each block out
       into its own .bz2 file. You can then use bzip2 -t to test
       the integrity of the resulting files, and decompress those
       which are undamaged.

       bzip2recover takes a single argument, the name of the
       damaged file, and writes a number of files
       "rec00001file.bz2", "rec00002file.bz2", etc, containing
       the extracted blocks. The output filenames are designed
       so that the use of wildcards in subsequent processing --
       for example, "bzip2 -dc rec*file.bz2 > recovered_data" --
       processes the files in the correct order.
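       The test-then-decompress step described above can be
       sketched as follows, once bzip2recover has written its
       rec*.bz2 files. The sketch assumes Python's standard bz2
       module in place of bzip2 -t and bzip2 -dc. The names
       salvage and load_blocks, and the default glob pattern, are
       hypothetical.

```python
import bz2
from pathlib import Path


def salvage(named_blocks):
    """Decompress the undamaged blocks in filename order.

    named_blocks is an iterable of (filename, raw_bytes) pairs, one per
    file written by bzip2recover. Returns (recovered_data, damaged_names).
    """
    recovered, damaged = [], []
    # Sorting by name relies on the zero-padded numbering (rec00001...,
    # rec00002...) to put the blocks back in their original order.
    for name, raw in sorted(named_blocks, key=lambda item: item[0]):
        try:
            recovered.append(bz2.decompress(raw))
        except (OSError, ValueError):
            # Invalid or truncated data: skip it, as one would skip a
            # file that fails "bzip2 -t".
            damaged.append(name)
    return b"".join(recovered), damaged


def load_blocks(directory, pattern="rec*.bz2"):
    """Read every file matching the (assumed) pattern from a directory."""
    return [(p.name, p.read_bytes()) for p in Path(directory).glob(pattern)]


if __name__ == "__main__":
    data, bad = salvage(load_blocks("."))
    print(len(data), "bytes recovered;", len(bad), "damaged block files skipped")
```

       Data from skipped blocks is simply missing from the
       output, so the recovered file will have gaps wherever a
       block was damaged.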

       bzip2recover should be of most use dealing with large .bz2
       files, as these will contain many blocks. It is clearly
       futile to use it on damaged single-block files, since a
       damaged block cannot be recovered. If you wish to
       minimise any potential data loss through media or
       transmission errors, you might consider compressing with a
       smaller block size.


PERFORMANCE NOTES
       The sorting phase of compression gathers together similar
       strings in the file. Because of this, files containing
       very long runs of repeated symbols, like "aabaabaabaab
       ..." (repeated several hundred times) may compress more
       slowly than normal. Versions 0.9.5 and above fare much
       better than previous versions in this respect. The ratio
       between worst-case and average-case compression time is in
       the region of 10:1. For previous versions, this figure
       was more like 100:1. You can use the -vvvv option to
       monitor progress in great detail, if you want.

       Decompression speed is unaffected by these phenomena.

       bzip2 usually allocates several megabytes of memory to
       operate in, and then charges all over it in a fairly
       random fashion. This means that performance, both for
       compressing and decompressing, is largely determined by
       the speed at which your machine can service cache misses.
       Because of this, small changes to the code to reduce the
       miss rate have been observed to give disproportionately
       large performance improvements. I imagine bzip2 will
       perform best on machines with very large caches.


CAVEATS
       I/O error messages are not as helpful as they could be.
       bzip2 tries hard to detect I/O errors and exit cleanly,
       but the details of what the problem is sometimes seem
       rather misleading.

       This manual page pertains to version 1.0.6 of bzip2.
       Compressed data created by this version is entirely
       forwards and backwards compatible with the previous public
       releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1,
       1.0.2 and above, but with the following exception: 0.9.0
       and above can correctly decompress multiple concatenated
       compressed files. 0.1pl2 cannot do this; it will stop
       after decompressing just the first file in the stream.

       bzip2recover versions prior to 1.0.2 used 32-bit integers
       to represent bit positions in compressed files, so they
       could not handle compressed files more than 512 megabytes
       long. Versions 1.0.2 and above use 64-bit ints on some
       platforms which support them (GNU supported targets, and
       Windows). To establish whether or not bzip2recover was
       built with such a limitation, run it without arguments.
       In any event you can build yourself an unlimited version
       if you can recompile it with MaybeUInt64 set to be an
       unsigned 64-bit integer.


AUTHOR
       Julian Seward, jseward@bzip.org.

       http://www.bzip.org

       The ideas embodied in bzip2 are due to (at least) the
       following people: Michael Burrows and David Wheeler (for
       the block sorting transformation), David Wheeler (again,
       for the Huffman coder), Peter Fenwick (for the structured
       coding model in the original bzip, and many refinements),
       and Alistair Moffat, Radford Neal and Ian Witten (for the
       arithmetic coder in the original bzip).
       I am much indebted for their help, support and advice.
       See the manual in the source distribution for pointers to
       sources of documentation. Christian von Roques encouraged
       me to look for faster sorting algorithms, so as to speed
       up compression. Bela Lubkin encouraged me to improve the
       worst-case compression performance. Donna Robinson
       XMLised the documentation. The bz* scripts are derived
       from those of GNU gzip. Many people sent patches, helped
       with portability problems, lent machines, gave advice and
       were generally helpful.