Chris@4: .PU
Chris@4: .TH bzip2 1
Chris@4: .SH NAME
Chris@4: bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
Chris@4: .br
Chris@4: bzcat \- decompresses files to stdout
Chris@4: .br
Chris@4: bzip2recover \- recovers data from damaged bzip2 files
Chris@4: 
Chris@4: .SH SYNOPSIS
Chris@4: .ll +8
Chris@4: .B bzip2
Chris@4: .RB [ " \-cdfkqstvzVL123456789 " ]
Chris@4: [
Chris@4: .I "filenames \&..."
Chris@4: ]
Chris@4: .ll -8
Chris@4: .br
Chris@4: .B bunzip2
Chris@4: .RB [ " \-fkvsVL " ]
Chris@4: [ 
Chris@4: .I "filenames \&..."
Chris@4: ]
Chris@4: .br
Chris@4: .B bzcat
Chris@4: .RB [ " \-s " ]
Chris@4: [ 
Chris@4: .I "filenames \&..."
Chris@4: ]
Chris@4: .br
Chris@4: .B bzip2recover
Chris@4: .I "filename"
Chris@4: 
Chris@4: .SH DESCRIPTION
Chris@4: .I bzip2
Chris@4: compresses files using the Burrows-Wheeler block sorting
Chris@4: text compression algorithm, and Huffman coding.  Compression is
Chris@4: generally considerably better than that achieved by more conventional
Chris@4: LZ77/LZ78-based compressors, and approaches the performance of the PPM
Chris@4: family of statistical compressors.
Chris@4: 
Chris@4: The command-line options are deliberately very similar to 
Chris@4: those of 
Chris@4: .I GNU gzip, 
Chris@4: but they are not identical.
Chris@4: 
Chris@4: .I bzip2
Chris@4: expects a list of file names to accompany the
Chris@4: command-line flags.  Each file is replaced by a compressed version of
Chris@4: itself, with the name "original_name.bz2".  
Chris@4: Each compressed file
Chris@4: has the same modification date, permissions, and, when possible,
Chris@4: ownership as the corresponding original, so that these properties can
Chris@4: be correctly restored at decompression time.  File name handling is
Chris@4: naive in the sense that there is no mechanism for preserving original
Chris@4: file names, permissions, ownerships or dates in filesystems which lack
Chris@4: these concepts, or have serious file name length restrictions, such as
Chris@4: MS-DOS.
Chris@4: 
Chris@4: .I bzip2
Chris@4: and
Chris@4: .I bunzip2
Chris@4: will by default not overwrite existing
Chris@4: files.  If you want this to happen, specify the \-f flag.
Chris@4: 
Chris@4: If no file names are specified,
Chris@4: .I bzip2
Chris@4: compresses from standard
Chris@4: input to standard output.  In this case,
Chris@4: .I bzip2
Chris@4: will decline to
Chris@4: write compressed output to a terminal, as this would be entirely
Chris@4: incomprehensible and therefore pointless.
Chris@4: 
Chris@4: .I bunzip2
Chris@4: (or
Chris@4: .I bzip2 \-d) 
Chris@4: decompresses all
Chris@4: specified files.  Files which were not created by 
Chris@4: .I bzip2
Chris@4: will be detected and ignored, and a warning issued.  
Chris@4: .I bzip2
Chris@4: attempts to guess the filename for the decompressed file 
Chris@4: from that of the compressed file as follows:
Chris@4: 
Chris@4:        filename.bz2    becomes   filename
Chris@4:        filename.bz     becomes   filename
Chris@4:        filename.tbz2   becomes   filename.tar
Chris@4:        filename.tbz    becomes   filename.tar
Chris@4:        anyothername    becomes   anyothername.out
Chris@4: 
Chris@4: If the file name does not end in one of the recognised endings,
Chris@4: .I .bz2, 
Chris@4: .I .bz, 
Chris@4: .I .tbz2
Chris@4: or
Chris@4: .I .tbz, 
Chris@4: .I bzip2 
Chris@4: complains that it cannot
Chris@4: guess the name of the original file, and uses the original name
Chris@4: with
Chris@4: .I .out
Chris@4: appended.
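The renaming rules above can be sketched in a few lines of Python. This is only an illustration of the table, not the code bzip2 itself uses; the function name guess_output_name and the SUFFIX_RULES list are hypothetical.

```python
# Suffix rules copied from the table above.
# The names SUFFIX_RULES and guess_output_name are hypothetical.
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name):
    """Return the name a decompressed file would get under the rules above."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # No recognised ending: the original name is kept and ".out" is appended.
    return compressed_name + ".out"
```

For example, guess_output_name("backup.tbz2") gives "backup.tar", while a name with no recognised ending simply gains ".out".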
Chris@4: 
Chris@4: As with compression, supplying no
Chris@4: filenames causes decompression from 
Chris@4: standard input to standard output.
Chris@4: 
Chris@4: .I bunzip2 
Chris@4: will correctly decompress a file which is the
Chris@4: concatenation of two or more compressed files.  The result is the
Chris@4: concatenation of the corresponding uncompressed files.  Integrity
Chris@4: testing (\-t) 
Chris@4: of concatenated 
Chris@4: compressed files is also supported.
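The concatenation property can be observed with the bz2 module from the Python standard library, which reads the same file format. This is a small sketch using that module (Python 3.3 or later), not the bunzip2 program itself; the variable names are illustrative.

```python
import bz2

# Two independently compressed streams, as if from two separate .bz2 files.
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Joining the compressed bytes is like "cat a.bz2 b.bz2 > both.bz2".
combined = first + second

# Decompressing the joined stream yields the joined originals.
restored = bz2.decompress(combined)
```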
Chris@4: 
Chris@4: You can also compress or decompress files to the standard output by
Chris@4: giving the \-c flag.  Multiple files may be compressed and
Chris@4: decompressed like this.  The resulting outputs are fed sequentially to
Chris@4: stdout.  Compression of multiple files 
Chris@4: in this manner generates a stream
Chris@4: containing multiple compressed file representations.  Such a stream
Chris@4: can be decompressed correctly only by
Chris@4: .I bzip2 
Chris@4: version 0.9.0 or
Chris@4: later.  Earlier versions of
Chris@4: .I bzip2
Chris@4: will stop after decompressing
Chris@4: the first file in the stream.
Chris@4: 
Chris@4: .I bzcat
Chris@4: (or
Chris@4: .I bzip2 \-dc)
Chris@4: decompresses all specified files to
Chris@4: the standard output.
Chris@4: 
Chris@4: .I bzip2
Chris@4: will read arguments from the environment variables
Chris@4: .I BZIP2
Chris@4: and
Chris@4: .I BZIP,
Chris@4: in that order, and will process them
Chris@4: before any arguments read from the command line.  This gives a 
Chris@4: convenient way to supply default arguments.
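The ordering described above can be pictured as follows. This is a sketch under assumptions: the function name effective_arguments is hypothetical, and splitting each variable on whitespace is assumed here for illustration rather than stated on this page.

```python
def effective_arguments(environ, command_line):
    """Build the argument list in the order described above.

    Arguments from BZIP2 come first, then those from BZIP, then the
    command line.  Splitting on whitespace is an assumption.
    """
    args = []
    for name in ("BZIP2", "BZIP"):
        args.extend(environ.get(name, "").split())
    args.extend(command_line)
    return args
```

With BZIP2 set to "-9 -k" and a command line of "-c file", the combined list would be -9, -k, -c, file.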
Chris@4: 
Chris@4: Compression is always performed, even if the compressed 
Chris@4: file is slightly
Chris@4: larger than the original.  Files of less than about one hundred bytes
Chris@4: tend to get larger, since the compression mechanism has a constant
Chris@4: overhead in the region of 50 bytes.  Random data (including the output
Chris@4: of most file compressors) is coded at about 8.05 bits per byte, giving
Chris@4: an expansion of around 0.5%.
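Both effects are easy to observe with the bz2 module from the Python standard library, which produces the same format. The sketch below is illustrative only; the helper name size_change is hypothetical, and exact byte counts depend on the input.

```python
import bz2
import os


def size_change(data):
    """Return compressed size minus original size, in bytes."""
    return len(bz2.compress(data)) - len(data)


# A very small input: the fixed overhead outweighs any saving.
tiny = b"hello, world\n"

# Random bytes have no redundancy to exploit, so they grow slightly.
random_block = os.urandom(100_000)

tiny_growth = size_change(tiny)
random_growth = size_change(random_block)
```

Both tiny_growth and random_growth are expected to be positive, that is, the compressed form is larger than the input.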
Chris@4: 
Chris@4: As a self-check for your protection, 
Chris@4: .I 
Chris@4: bzip2
Chris@4: uses 32-bit CRCs to
Chris@4: make sure that the decompressed version of a file is identical to the
Chris@4: original.  This guards against corruption of the compressed data, and
Chris@4: against undetected bugs in
Chris@4: .I bzip2
Chris@4: (hopefully very unlikely).  The
Chris@4: chance of data corruption going undetected is microscopic, about one
Chris@4: chance in four billion for each file processed.  Be aware, though, that
Chris@4: the check occurs upon decompression, so it can only tell you that
Chris@4: something is wrong.  It can't help you 
Chris@4: recover the original uncompressed
Chris@4: data.  You can use 
Chris@4: .I bzip2recover
Chris@4: to try to recover data from
Chris@4: damaged files.
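The following sketch shows the check doing its job: damage is reported at decompression time, but the data is not repaired. It uses the Python standard library bz2 module rather than the bzip2 program, and the helper name corruption_detected is hypothetical.

```python
import bz2


def corruption_detected(compressed, index):
    """Invert one byte of the compressed data and try to decompress it."""
    damaged = bytearray(compressed)
    damaged[index] ^= 0xFF
    try:
        bz2.decompress(bytes(damaged))
    except (OSError, ValueError, EOFError):
        # The decoder rejected the stream (bad data or failed CRC).
        return True
    return False


original = b"The quick brown fox jumps over the lazy dog.\n" * 200
compressed = bz2.compress(original)
```

Calling corruption_detected(compressed, len(compressed) // 2) is expected to return True, while the undamaged stream decompresses to the original bytes.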
Chris@4: 
Chris@4: Return values: 0 for a normal exit, 1 for environmental problems (file
Chris@4: not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
Chris@4: compressed file, 3 for an internal consistency error (e.g. a bug) which
Chris@4: caused
Chris@4: .I bzip2
Chris@4: to panic.
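A calling script might translate these return values into messages along the following lines. The table simply restates the list above; the names EXIT_STATUS_MEANINGS and describe_exit_status are hypothetical.

```python
# Meanings copied from the list of return values above.
EXIT_STATUS_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error",
}


def describe_exit_status(code):
    """Map a bzip2 return value to a short description."""
    return EXIT_STATUS_MEANINGS.get(code, "unknown exit status")
```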
Chris@4: 
Chris@4: .SH OPTIONS
Chris@4: .TP
Chris@4: .B \-c --stdout
Chris@4: Compress or decompress to standard output.
Chris@4: .TP
Chris@4: .B \-d --decompress
Chris@4: Force decompression.  
Chris@4: .I bzip2, 
Chris@4: .I bunzip2 
Chris@4: and
Chris@4: .I bzcat 
Chris@4: are
Chris@4: really the same program, and the decision about what actions to take is
Chris@4: made on the basis of which name is used.  This flag overrides that
Chris@4: mechanism, and forces 
Chris@4: .I bzip2
Chris@4: to decompress.
Chris@4: .TP
Chris@4: .B \-z --compress
Chris@4: The complement to \-d: forces compression, regardless of the
Chris@4: invocation name.
Chris@4: .TP
Chris@4: .B \-t --test
Chris@4: Check integrity of the specified file(s), but don't decompress them.
Chris@4: This really performs a trial decompression and throws away the result.
Chris@4: .TP
Chris@4: .B \-f --force
Chris@4: Force overwrite of output files.  Normally,
Chris@4: .I bzip2 
Chris@4: will not overwrite
Chris@4: existing output files.  Also forces 
Chris@4: .I bzip2 
Chris@4: to break hard links
Chris@4: to files, which it otherwise wouldn't do.
Chris@4: 
Chris@4: bzip2 normally declines to decompress files which don't have the
Chris@4: correct magic header bytes.  If forced (\-f), however, it will pass
Chris@4: such files through unmodified.  This is how GNU gzip behaves.
Chris@4: .TP
Chris@4: .B \-k --keep
Chris@4: Keep (don't delete) input files during compression
Chris@4: or decompression.
Chris@4: .TP
Chris@4: .B \-s --small
Chris@4: Reduce memory usage, for compression, decompression and testing.  Files
Chris@4: are decompressed and tested using a modified algorithm which only
Chris@4: requires 2.5 bytes per block byte.  This means any file can be
Chris@4: decompressed in 2300k of memory, albeit at about half the normal speed.
Chris@4: 
Chris@4: During compression, \-s selects a block size of 200k, which limits
Chris@4: memory use to around the same figure, at the expense of your compression
Chris@4: ratio.  In short, if your machine is low on memory (8 megabytes or
Chris@4: less), use \-s for everything.  See MEMORY MANAGEMENT below.
Chris@4: .TP
Chris@4: .B \-q --quiet
Chris@4: Suppress non-essential warning messages.  Messages pertaining to
Chris@4: I/O errors and other critical events will not be suppressed.
Chris@4: .TP
Chris@4: .B \-v --verbose
Chris@4: Verbose mode -- show the compression ratio for each file processed.
Chris@4: Further \-v's increase the verbosity level, spewing out lots of
Chris@4: information which is primarily of interest for diagnostic purposes.
Chris@4: .TP
Chris@4: .B \-L --license -V --version
Chris@4: Display the software version, license terms and conditions.
Chris@4: .TP
Chris@4: .B \-1 (or \-\-fast) to \-9 (or \-\-best)
Chris@4: Set the block size to 100 k, 200 k ... 900 k when compressing.  Has no
Chris@4: effect when decompressing.  See MEMORY MANAGEMENT below.
Chris@4: The \-\-fast and \-\-best aliases are primarily for GNU gzip 
Chris@4: compatibility.  In particular, \-\-fast doesn't make things
Chris@4: significantly faster.  
Chris@4: And \-\-best merely selects the default behaviour.
Chris@4: .TP
Chris@4: .B \--
Chris@4: Treats all subsequent arguments as file names, even if they start
Chris@4: with a dash.  This is so you can handle files with names beginning
Chris@4: with a dash, for example: bzip2 \-- \-myfilename.
Chris@4: .TP
Chris@4: .B \--repetitive-fast --repetitive-best
Chris@4: These flags are redundant in versions 0.9.5 and above.  They provided
Chris@4: some coarse control over the behaviour of the sorting algorithm in
Chris@4: earlier versions, which was sometimes useful.  0.9.5 and above have an
Chris@4: improved algorithm which renders these flags irrelevant.
Chris@4: 
Chris@4: .SH MEMORY MANAGEMENT
Chris@4: .I bzip2 
Chris@4: compresses large files in blocks.  The block size affects
Chris@4: both the compression ratio achieved, and the amount of memory needed for
Chris@4: compression and decompression.  The flags \-1 through \-9
Chris@4: specify the block size to be 100,000 bytes through 900,000 bytes (the
Chris@4: default) respectively.  At decompression time, the block size used for
Chris@4: compression is read from the header of the compressed file, and
Chris@4: .I bunzip2
Chris@4: then allocates itself just enough memory to decompress
Chris@4: the file.  Since block sizes are stored in compressed files, it follows
Chris@4: that the flags \-1 to \-9 are irrelevant to and so ignored
Chris@4: during decompression.
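The stored block size can be read back from the first four bytes of a compressed stream, which consist of the characters "BZh" followed by the digit 1 to 9 that was used at compression time. The sketch below relies on that header layout and on the Python standard library bz2 module; the function name block_size_from_header is hypothetical.

```python
import bz2


def block_size_from_header(compressed):
    """Return the block size in bytes recorded in a bzip2 stream header."""
    if compressed[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    level = compressed[3] - ord("0")  # header digit '1'..'9'
    if not 1 <= level <= 9:
        raise ValueError("unexpected block size digit")
    return level * 100_000


# compresslevel plays the role of the -1 .. -9 flags.
small_blocks = bz2.compress(b"example data", 3)
default_blocks = bz2.compress(b"example data")
```

Here block_size_from_header(small_blocks) gives 300000, and the default level of 9 gives 900000.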
Chris@4: 
Chris@4: Compression and decompression requirements, 
Chris@4: in bytes, can be estimated as:
Chris@4: 
Chris@4:        Compression:   400k + ( 8 x block size )
Chris@4: 
Chris@4:        Decompression: 100k + ( 4 x block size ), or
Chris@4:                       100k + ( 2.5 x block size )
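These estimates are easy to evaluate for any flag. The sketch below assumes that "k" means 1000 bytes, in line with the 100,000-byte block sizes quoted above; the function names are hypothetical.

```python
K = 1000  # assumption: "k" is taken as 1000 bytes


def block_size(level):
    """Block size in bytes selected by the flags -1 .. -9."""
    return level * 100 * K


def compress_memory(level):
    """Estimated bytes needed to compress: 400k + 8 x block size."""
    return 400 * K + 8 * block_size(level)


def decompress_memory(level, small=False):
    """Estimated bytes needed to decompress.

    Normal mode uses 100k + 4 x block size; with -s (small=True) the
    factor drops to 2.5.
    """
    factor = 2.5 if small else 4
    return int(100 * K + factor * block_size(level))
```

For the default block size, compress_memory(9) is 7,600,000 bytes and decompress_memory(9) is 3,700,000 bytes, or 2,350,000 bytes with small=True.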
Chris@4: 
Chris@4: Larger block sizes give rapidly diminishing marginal returns.  Most of
Chris@4: the compression comes from the first two or three hundred k of block
Chris@4: size, a fact worth bearing in mind when using
Chris@4: .I bzip2
Chris@4: on small machines.
Chris@4: It is also important to appreciate that the decompression memory
Chris@4: requirement is set at compression time by the choice of block size.
Chris@4: 
Chris@4: For files compressed with the default 900k block size,
Chris@4: .I bunzip2
Chris@4: will require about 3700 kbytes to decompress.  To support decompression
Chris@4: of any file on a 4 megabyte machine, 
Chris@4: .I bunzip2
Chris@4: has an option to
Chris@4: decompress using approximately half this amount of memory, about 2300
Chris@4: kbytes.  Decompression speed is also halved, so you should use this
Chris@4: option only where necessary.  The relevant flag is \-s.
Chris@4: 
Chris@4: In general, try to use the largest block size memory constraints allow,
Chris@4: since that maximises the compression achieved.  Compression and
Chris@4: decompression speed are virtually unaffected by block size.
Chris@4: 
Chris@4: Another significant point applies to files which fit in a single block
Chris@4: -- that means most files you'd encounter using a large block size.  The
Chris@4: amount of real memory touched is proportional to the size of the file,
Chris@4: since the file is smaller than a block.  For example, compressing a file
Chris@4: 20,000 bytes long with the flag \-9 will cause the compressor to
Chris@4: allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
Chris@4: kbytes of it.  Similarly, the decompressor will allocate 3700k but only
Chris@4: touch 100k + 20000 * 4 = 180 kbytes.
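The arithmetic in this example generalises to any file smaller than one block. The following sketch reuses the estimates given earlier and assumes "k" means 1000 bytes; the function names are hypothetical.

```python
K = 1000  # assumption: "k" is taken as 1000 bytes


def touched_when_compressing(file_bytes, level=9):
    """Memory actually touched while compressing, in bytes."""
    block = level * 100 * K
    # Only the part of the block that holds data is touched.
    return 400 * K + 8 * min(file_bytes, block)


def touched_when_decompressing(file_bytes, level=9):
    """Memory actually touched while decompressing, in bytes."""
    block = level * 100 * K
    return 100 * K + 4 * min(file_bytes, block)
```

For the 20,000-byte file above, these give 560,000 and 180,000 bytes respectively.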
Chris@4: 
Chris@4: Here is a table which summarises the maximum memory usage for different
Chris@4: block sizes.  Also recorded is the total compressed size for 14 files of
Chris@4: the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
Chris@4: column gives some feel for how compression varies with block size.
Chris@4: These figures tend to understate the advantage of larger block sizes for
Chris@4: larger files, since the Corpus is dominated by smaller files.
Chris@4: 
Chris@4:            Compress   Decompress   Decompress   Corpus
Chris@4:     Flag     usage      usage       -s usage     Size
Chris@4: 
Chris@4:      -1      1200k       500k         350k      914704
Chris@4:      -2      2000k       900k         600k      877703
Chris@4:      -3      2800k      1300k         850k      860338
Chris@4:      -4      3600k      1700k        1100k      846899
Chris@4:      -5      4400k      2100k        1350k      845160
Chris@4:      -6      5200k      2500k        1600k      838626
Chris@4:      -7      6000k      2900k        1850k      834096
Chris@4:      -8      6800k      3300k        2100k      828642
Chris@4:      -9      7600k      3700k        2350k      828642
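To compare rows of the Corpus Size column more directly, each figure can be converted to bits per input byte. The sketch below only does arithmetic on the numbers printed in the table; the names are hypothetical.

```python
CORPUS_BYTES = 3_141_622  # total size of the 14 corpus files

# Corpus Size column from the table above, keyed by flag.
COMPRESSED_SIZES = {
    1: 914704, 2: 877703, 3: 860338,
    4: 846899, 5: 845160, 6: 838626,
    7: 834096, 8: 828642, 9: 828642,
}


def bits_per_byte(level):
    """Compressed bits spent per byte of input at the given flag."""
    return 8 * COMPRESSED_SIZES[level] / CORPUS_BYTES


def saving_over_smallest_block(level):
    """Fraction by which the output shrinks relative to -1."""
    return 1 - COMPRESSED_SIZES[level] / COMPRESSED_SIZES[1]
```

On these figures, going from \-1 to \-9 shrinks the output by a little under ten percent, most of it gained in the first few steps.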
Chris@4: 
Chris@4: .SH RECOVERING DATA FROM DAMAGED FILES
Chris@4: .I bzip2
Chris@4: compresses files in blocks, usually 900 kbytes long.  Each
Chris@4: block is handled independently.  If a media or transmission error causes
Chris@4: a multi-block .bz2
Chris@4: file to become damaged, it may be possible to
Chris@4: recover data from the undamaged blocks in the file.
Chris@4: 
Chris@4: The compressed representation of each block is delimited by a 48-bit
Chris@4: pattern, which makes it possible to find the block boundaries with
Chris@4: reasonable certainty.  Each block also carries its own 32-bit CRC, so
Chris@4: damaged blocks can be distinguished from undamaged ones.
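A search for block boundaries can be sketched as a scan for the 48-bit pattern at every bit offset, since blocks are not aligned to byte boundaries. The pattern value 0x314159265359 and the most-significant-bit-first ordering are taken from the bzip2 file format rather than from this page, and the function name find_block_starts is hypothetical; this is not the code of bzip2recover.

```python
import bz2
import os

BLOCK_MAGIC = 0x314159265359  # 48-bit block header pattern
MASK = (1 << 48) - 1


def find_block_starts(data):
    """Return the bit offsets at which the block pattern begins."""
    window = 0
    bit_position = 0
    starts = []
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK
            bit_position += 1
            if bit_position >= 48 and window == BLOCK_MAGIC:
                starts.append(bit_position - 48)
    return starts


# A single small block starts right after the 4-byte stream header.
single = find_block_starts(bz2.compress(b"hello world"))

# Random data larger than a 100k block, compressed with level 1,
# is split across more than one block.
several = find_block_starts(bz2.compress(os.urandom(120_000), 1))
```

The first block always begins at bit offset 32, immediately after the four header bytes.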
Chris@4: 
Chris@4: .I bzip2recover
Chris@4: is a simple program whose purpose is to search for
Chris@4: blocks in .bz2 files, and write each block out into its own .bz2 
Chris@4: file.  You can then use
Chris@4: .I bzip2 
Chris@4: \-t
Chris@4: to test the
Chris@4: integrity of the resulting files, and decompress those which are
Chris@4: undamaged.
Chris@4: 
Chris@4: .I bzip2recover
Chris@4: takes a single argument, the name of the damaged file, 
Chris@4: and writes a number of files "rec00001file.bz2",
Chris@4: "rec00002file.bz2", etc., containing the extracted blocks.
Chris@4: The output filenames are designed so that the use of
Chris@4: wildcards in subsequent processing -- for example,
Chris@4: "bzip2 \-dc rec*file.bz2 > recovered_data" -- processes the files in
Chris@4: the correct order.
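The reason the names sort correctly is the fixed-width, zero-padded block number. A short sketch follows; the function name recovered_name is hypothetical, and the format is inferred from the two example names above.

```python
def recovered_name(block_number, original_name):
    """Build an output name in the style rec00001file.bz2."""
    return f"rec{block_number:05d}{original_name}"


names = [recovered_name(n, "file.bz2") for n in range(1, 13)]

# Without padding, lexical order would put block 10 before block 2.
unpadded = [f"rec{n}file.bz2" for n in range(1, 13)]
```

Sorting names lexically, as a shell wildcard does, leaves them in block order, whereas the unpadded variant would not be.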
Chris@4: 
Chris@4: .I bzip2recover
Chris@4: should be of most use dealing with large .bz2
Chris@4: files, as these will contain many blocks.  It is clearly
Chris@4: futile to use it on damaged single-block files, since a
Chris@4: damaged block cannot be recovered.  If you wish to minimise
Chris@4: any potential data loss through media or transmission errors,
Chris@4: you might consider compressing with a smaller
Chris@4: block size.
Chris@4: 
Chris@4: .SH PERFORMANCE NOTES
Chris@4: The sorting phase of compression gathers together similar strings in the
Chris@4: file.  Because of this, files containing very long runs of repeated
Chris@4: symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
Chris@4: compress more slowly than normal.  Versions 0.9.5 and above fare much
Chris@4: better than previous versions in this respect.  The ratio between
Chris@4: worst-case and average-case compression time is in the region of 10:1.
Chris@4: For previous versions, this figure was more like 100:1.  You can use the
Chris@4: \-vvvv option to monitor progress in great detail, if you want.
Chris@4: 
Chris@4: Decompression speed is unaffected by these phenomena.
Chris@4: 
Chris@4: .I bzip2
Chris@4: usually allocates several megabytes of memory to operate
Chris@4: in, and then charges all over it in a fairly random fashion.  This means
Chris@4: that performance, both for compressing and decompressing, is largely
Chris@4: determined by the speed at which your machine can service cache misses.
Chris@4: Because of this, small changes to the code to reduce the miss rate have
Chris@4: been observed to give disproportionately large performance improvements.
Chris@4: I imagine 
Chris@4: .I bzip2
Chris@4: will perform best on machines with very large caches.
Chris@4: 
Chris@4: .SH CAVEATS
Chris@4: I/O error messages are not as helpful as they could be.
Chris@4: .I bzip2
Chris@4: tries hard to detect I/O errors and exit cleanly, but the details of
Chris@4: what the problem is sometimes seem rather misleading.
Chris@4: 
Chris@4: This manual page pertains to version 1.0.6 of
Chris@4: .I bzip2.  
Chris@4: Compressed data created by this version is entirely forwards and
Chris@4: backwards compatible with the previous public releases, versions
Chris@4: 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
Chris@4: exception: 0.9.0 and above can correctly decompress multiple
Chris@4: concatenated compressed files.  0.1pl2 cannot do this; it will stop
Chris@4: after decompressing just the first file in the stream.
Chris@4: 
Chris@4: .I bzip2recover
Chris@4: versions prior to 1.0.2 used 32-bit integers to represent
Chris@4: bit positions in compressed files, so they could not handle compressed
Chris@4: files more than 512 megabytes long.  Versions 1.0.2 and above use
Chris@4: 64-bit ints on some platforms which support them (GNU supported
Chris@4: targets, and Windows).  To establish whether or not bzip2recover was
Chris@4: built with such a limitation, run it without arguments.  In any event
Chris@4: you can build yourself an unlimited version if you can recompile it
Chris@4: with MaybeUInt64 set to be an unsigned 64-bit integer.
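The 512 megabyte figure follows directly from counting bit positions in a 32-bit integer, as this short calculation shows. The variable names are illustrative.

```python
# A 32-bit unsigned integer can number 2**32 distinct bit positions.
addressable_bits = 2 ** 32

# Eight bits per byte gives the largest file that can be indexed.
addressable_bytes = addressable_bits // 8

# Expressed in megabytes of 2**20 bytes.
addressable_megabytes = addressable_bytes // 2 ** 20
```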
Chris@4: 
Chris@4: 
Chris@4: 
Chris@4: .SH AUTHOR
Chris@4: Julian Seward, jseward@bzip.org.
Chris@4: 
Chris@4: http://www.bzip.org
Chris@4: 
Chris@4: The ideas embodied in
Chris@4: .I bzip2
Chris@4: are due to (at least) the following
Chris@4: people: Michael Burrows and David Wheeler (for the block sorting
Chris@4: transformation), David Wheeler (again, for the Huffman coder), Peter
Chris@4: Fenwick (for the structured coding model in the original
Chris@4: .I bzip,
Chris@4: and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
Chris@4: (for the arithmetic coder in the original
Chris@4: .I bzip).  
Chris@4: I am much
Chris@4: indebted for their help, support and advice.  See the manual in the
Chris@4: source distribution for pointers to sources of documentation.  Christian
Chris@4: von Roques encouraged me to look for faster sorting algorithms, so as to
Chris@4: speed up compression.  Bela Lubkin encouraged me to improve the
Chris@4: worst-case compression performance.  
Chris@4: Donna Robinson XMLised the documentation.
Chris@4: The bz* scripts are derived from those of GNU gzip.
Chris@4: Many people sent patches, helped
Chris@4: with portability problems, lent machines, gave advice and were generally
Chris@4: helpful.