cannam@89: .PU
cannam@89: .TH bzip2 1
cannam@89: .SH NAME
cannam@89: bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
cannam@89: .br
cannam@89: bzcat \- decompresses files to stdout
cannam@89: .br
cannam@89: bzip2recover \- recovers data from damaged bzip2 files
cannam@89: 
cannam@89: .SH SYNOPSIS
cannam@89: .ll +8
cannam@89: .B bzip2
cannam@89: .RB [ " \-cdfkqstvzVL123456789 " ]
cannam@89: [
cannam@89: .I "filenames \&..."
cannam@89: ]
cannam@89: .ll -8
cannam@89: .br
cannam@89: .B bunzip2
cannam@89: .RB [ " \-fkvsVL " ]
cannam@89: [ 
cannam@89: .I "filenames \&..."
cannam@89: ]
cannam@89: .br
cannam@89: .B bzcat
cannam@89: .RB [ " \-s " ]
cannam@89: [ 
cannam@89: .I "filenames \&..."
cannam@89: ]
cannam@89: .br
cannam@89: .B bzip2recover
cannam@89: .I "filename"
cannam@89: 
cannam@89: .SH DESCRIPTION
cannam@89: .I bzip2
cannam@89: compresses files using the Burrows-Wheeler block sorting
cannam@89: text compression algorithm, and Huffman coding.  Compression is
cannam@89: generally considerably better than that achieved by more conventional
cannam@89: LZ77/LZ78-based compressors, and approaches the performance of the PPM
cannam@89: family of statistical compressors.
cannam@89: 
cannam@89: The command-line options are deliberately very similar to 
cannam@89: those of 
cannam@89: .I GNU gzip, 
cannam@89: but they are not identical.
cannam@89: 
cannam@89: .I bzip2
cannam@89: expects a list of file names to accompany the
cannam@89: command-line flags.  Each file is replaced by a compressed version of
cannam@89: itself, with the name "original_name.bz2".  
cannam@89: Each compressed file
cannam@89: has the same modification date, permissions, and, when possible,
cannam@89: ownership as the corresponding original, so that these properties can
cannam@89: be correctly restored at decompression time.  File name handling is
cannam@89: naive in the sense that there is no mechanism for preserving original
cannam@89: file names, permissions, ownerships or dates in filesystems which lack
cannam@89: these concepts, or have serious file name length restrictions, such as
cannam@89: MS-DOS.
cannam@89: 
cannam@89: .I bzip2
cannam@89: and
cannam@89: .I bunzip2
cannam@89: will by default not overwrite existing
cannam@89: files.  If you want this to happen, specify the \-f flag.
cannam@89: 
cannam@89: If no file names are specified,
cannam@89: .I bzip2
cannam@89: compresses from standard
cannam@89: input to standard output.  In this case,
cannam@89: .I bzip2
cannam@89: will decline to
cannam@89: write compressed output to a terminal, as this would be entirely
cannam@89: incomprehensible and therefore pointless.
cannam@89: 
cannam@89: .I bunzip2
cannam@89: (or
cannam@89: .I bzip2 \-d) 
cannam@89: decompresses all
cannam@89: specified files.  Files which were not created by 
cannam@89: .I bzip2
cannam@89: will be detected and ignored, and a warning issued.  
cannam@89: .I bzip2
cannam@89: attempts to guess the filename for the decompressed file 
cannam@89: from that of the compressed file as follows:
cannam@89: 
cannam@89:        filename.bz2    becomes   filename
cannam@89:        filename.bz     becomes   filename
cannam@89:        filename.tbz2   becomes   filename.tar
cannam@89:        filename.tbz    becomes   filename.tar
cannam@89:        anyothername    becomes   anyothername.out
cannam@89: 
cannam@89: If the file does not end in one of the recognised endings, 
cannam@89: .I .bz2, 
cannam@89: .I .bz, 
cannam@89: .I .tbz2
cannam@89: or
cannam@89: .I .tbz, 
cannam@89: .I bzip2 
cannam@89: complains that it cannot
cannam@89: guess the name of the original file, and uses the original name
cannam@89: with
cannam@89: .I .out
cannam@89: appended.
cannam@89: 
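The suffix rules above can be sketched in a few lines of Python. This is only an illustration of the table, not the code bzip2 itself uses; the names SUFFIX_RULES and guess_output_name are made up for this example.

```python
# Hypothetical helper mirroring the suffix table above.
# Each entry is (recognised ending, what replaces it).
SUFFIX_RULES = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name: str) -> str:
    """Return the decompressed file name implied by the table."""
    for suffix, replacement in SUFFIX_RULES:
        if compressed_name.endswith(suffix):
            # Drop the recognised ending and add its replacement, if any.
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: fall back to appending ".out".
    return compressed_name + ".out"
```

For instance, guess_output_name("backup.tbz2") gives "backup.tar", while a name with no recognised ending simply gains ".out".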
cannam@89: As with compression, supplying no
cannam@89: filenames causes decompression from 
cannam@89: standard input to standard output.
cannam@89: 
cannam@89: .I bunzip2 
cannam@89: will correctly decompress a file which is the
cannam@89: concatenation of two or more compressed files.  The result is the
cannam@89: concatenation of the corresponding uncompressed files.  Integrity
cannam@89: testing (\-t) 
cannam@89: of concatenated 
cannam@89: compressed files is also supported.
cannam@89: 
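The concatenation behaviour can be tried without the command-line tool. The sketch below uses the bz2 module from the Python standard library (not bzip2 itself) and assumes Python 3.3 or later, where bz2.decompress accepts multiple concatenated streams.

```python
import bz2

# Compress two pieces of data independently, as if they were two files.
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Joining the compressed streams is plain byte concatenation.
combined = first + second

# Decompressing the joined stream yields the joined originals.
restored = bz2.decompress(combined)
```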
cannam@89: You can also compress or decompress files to the standard output by
cannam@89: giving the \-c flag.  Multiple files may be compressed and
cannam@89: decompressed like this.  The resulting outputs are fed sequentially to
cannam@89: stdout.  Compression of multiple files 
cannam@89: in this manner generates a stream
cannam@89: containing multiple compressed file representations.  Such a stream
cannam@89: can be decompressed correctly only by
cannam@89: .I bzip2 
cannam@89: version 0.9.0 or
cannam@89: later.  Earlier versions of
cannam@89: .I bzip2
cannam@89: will stop after decompressing
cannam@89: the first file in the stream.
cannam@89: 
cannam@89: .I bzcat
cannam@89: (or
cannam@89: .I bzip2 -dc) 
cannam@89: decompresses all specified files to
cannam@89: the standard output.
cannam@89: 
cannam@89: .I bzip2
cannam@89: will read arguments from the environment variables
cannam@89: .I BZIP2
cannam@89: and
cannam@89: .I BZIP,
cannam@89: in that order, and will process them
cannam@89: before any arguments read from the command line.  This gives a 
cannam@89: convenient way to supply default arguments.
cannam@89: 
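The ordering described above can be pictured as follows. This is a sketch under the assumption that each variable is split on whitespace with no quoting rules; effective_arguments is a hypothetical name, not part of bzip2.

```python
def effective_arguments(environ: dict, command_line: list) -> list:
    """Combine default arguments from BZIP2 and BZIP with the command line."""
    args = []
    # Environment variables are consulted in this order, before the command line.
    for name in ("BZIP2", "BZIP"):
        # Assumption: plain whitespace splitting, no quoting or escaping.
        args.extend(environ.get(name, "").split())
    args.extend(command_line)
    return args
```

With BZIP2 set to "-9 -v", the call effective_arguments({"BZIP2": "-9 -v"}, ["file.txt"]) returns the environment flags first and the file name last.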
cannam@89: Compression is always performed, even if the compressed 
cannam@89: file is slightly
cannam@89: larger than the original.  Files of less than about one hundred bytes
cannam@89: tend to get larger, since the compression mechanism has a constant
cannam@89: overhead in the region of 50 bytes.  Random data (including the output
cannam@89: of most file compressors) is coded at about 8.05 bits per byte, giving
cannam@89: an expansion of around 0.5%.
cannam@89: 
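Both effects, the fixed overhead on tiny inputs and the slight expansion of random data, are easy to observe. The sketch below uses the bz2 module from the Python standard library rather than the bzip2 program, so exact sizes may differ slightly from the figures quoted above; the variable names are illustrative only.

```python
import bz2
import os

# A very small input: the fixed overhead dominates, so the result grows.
tiny = b"hello"
tiny_compressed = bz2.compress(tiny)

# Random bytes are essentially incompressible, so the result also grows.
random_block = os.urandom(100_000)
random_compressed = bz2.compress(random_block)

# Relative growth of the random block, as a fraction of its original size.
random_expansion = len(random_compressed) / len(random_block) - 1
```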
cannam@89: As a self-check for your protection, 
cannam@89: .I bzip2
cannam@89: uses 32-bit CRCs to
cannam@89: make sure that the decompressed version of a file is identical to the
cannam@89: original.  This guards against corruption of the compressed data, and
cannam@89: against undetected bugs in
cannam@89: .I bzip2
cannam@89: (hopefully very unlikely).  The
cannam@89: chance of data corruption going undetected is microscopic, about one
cannam@89: chance in four billion for each file processed.  Be aware, though, that
cannam@89: the check occurs upon decompression, so it can only tell you that
cannam@89: something is wrong.  It can't help you 
cannam@89: recover the original uncompressed
cannam@89: data.  You can use 
cannam@89: .I bzip2recover
cannam@89: to try to recover data from
cannam@89: damaged files.
cannam@89: 
cannam@89: Return values: 0 for a normal exit, 1 for environmental problems (file
cannam@89: not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
cannam@89: compressed file, 3 for an internal consistency error (e.g. a bug) which
cannam@89: caused
cannam@89: .I bzip2
cannam@89: to panic.
cannam@89: 
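A script that drives bzip2 might translate these return values into messages along the following lines. The function name describe_exit_status is hypothetical, and the wording of the messages is paraphrased from the paragraph above.

```python
# Meanings of the documented return values, paraphrased from the text above.
EXIT_MEANINGS = {
    0: "normal exit",
    1: "environmental problem (file not found, invalid flags, I/O error)",
    2: "corrupt compressed file",
    3: "internal consistency error (bzip2 panicked)",
}


def describe_exit_status(code: int) -> str:
    """Return a short description of a bzip2 return value."""
    # Values outside 0..3 are not documented here, so report them as unknown.
    return EXIT_MEANINGS.get(code, f"undocumented return value {code}")
```

A caller could pass it the returncode attribute of the result of subprocess.run(["bzip2", "-t", name]), assuming bzip2 is installed on the machine.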
cannam@89: .SH OPTIONS
cannam@89: .TP
cannam@89: .B \-c --stdout
cannam@89: Compress or decompress to standard output.
cannam@89: .TP
cannam@89: .B \-d --decompress
cannam@89: Force decompression.  
cannam@89: .I bzip2, 
cannam@89: .I bunzip2 
cannam@89: and
cannam@89: .I bzcat 
cannam@89: are
cannam@89: really the same program, and the decision about what actions to take is
cannam@89: made on the basis of which name is used.  This flag overrides that
cannam@89: mechanism, and forces 
cannam@89: .I bzip2
cannam@89: to decompress.
cannam@89: .TP
cannam@89: .B \-z --compress
cannam@89: The complement to \-d: forces compression, regardless of the
cannam@89: invocation name.
cannam@89: .TP
cannam@89: .B \-t --test
cannam@89: Check integrity of the specified file(s), but don't decompress them.
cannam@89: This really performs a trial decompression and throws away the result.
cannam@89: .TP
cannam@89: .B \-f --force
cannam@89: Force overwrite of output files.  Normally,
cannam@89: .I bzip2 
cannam@89: will not overwrite
cannam@89: existing output files.  Also forces 
cannam@89: .I bzip2 
cannam@89: to break hard links
cannam@89: to files, which it otherwise wouldn't do.
cannam@89: 
cannam@89: bzip2 normally declines to decompress files which don't have the
cannam@89: correct magic header bytes.  If forced (-f), however, it will pass
cannam@89: such files through unmodified.  This is how GNU gzip behaves.
cannam@89: .TP
cannam@89: .B \-k --keep
cannam@89: Keep (don't delete) input files during compression
cannam@89: or decompression.
cannam@89: .TP
cannam@89: .B \-s --small
cannam@89: Reduce memory usage, for compression, decompression and testing.  Files
cannam@89: are decompressed and tested using a modified algorithm which only
cannam@89: requires 2.5 bytes per block byte.  This means any file can be
cannam@89: decompressed in 2300k of memory, albeit at about half the normal speed.
cannam@89: 
cannam@89: During compression, \-s selects a block size of 200k, which limits
cannam@89: memory use to around the same figure, at the expense of your compression
cannam@89: ratio.  In short, if your machine is low on memory (8 megabytes or
cannam@89: less), use \-s for everything.  See MEMORY MANAGEMENT below.
cannam@89: .TP
cannam@89: .B \-q --quiet
cannam@89: Suppress non-essential warning messages.  Messages pertaining to
cannam@89: I/O errors and other critical events will not be suppressed.
cannam@89: .TP
cannam@89: .B \-v --verbose
cannam@89: Verbose mode -- show the compression ratio for each file processed.
cannam@89: Further \-v's increase the verbosity level, spewing out lots of
cannam@89: information which is primarily of interest for diagnostic purposes.
cannam@89: .TP
cannam@89: .B \-L --license -V --version
cannam@89: Display the software version, license terms and conditions.
cannam@89: .TP
cannam@89: .B \-1 (or \-\-fast) to \-9 (or \-\-best)
cannam@89: Set the block size to 100 k, 200 k ..  900 k when compressing.  Has no
cannam@89: effect when decompressing.  See MEMORY MANAGEMENT below.
cannam@89: The \-\-fast and \-\-best aliases are primarily for GNU gzip 
cannam@89: compatibility.  In particular, \-\-fast doesn't make things
cannam@89: significantly faster.  
cannam@89: And \-\-best merely selects the default behaviour.
cannam@89: .TP
cannam@89: .B \--
cannam@89: Treats all subsequent arguments as file names, even if they start
cannam@89: with a dash.  This is so you can handle files with names beginning
cannam@89: with a dash, for example: bzip2 \-- \-myfilename.
cannam@89: .TP
cannam@89: .B \--repetitive-fast --repetitive-best
cannam@89: These flags are redundant in versions 0.9.5 and above.  They provided
cannam@89: some coarse control over the behaviour of the sorting algorithm in
cannam@89: earlier versions, which was sometimes useful.  0.9.5 and above have an
cannam@89: improved algorithm which renders these flags irrelevant.
cannam@89: 
cannam@89: .SH MEMORY MANAGEMENT
cannam@89: .I bzip2 
cannam@89: compresses large files in blocks.  The block size affects
cannam@89: both the compression ratio achieved, and the amount of memory needed for
cannam@89: compression and decompression.  The flags \-1 through \-9
cannam@89: specify the block size to be 100,000 bytes through 900,000 bytes (the
cannam@89: default) respectively.  At decompression time, the block size used for
cannam@89: compression is read from the header of the compressed file, and
cannam@89: .I bunzip2
cannam@89: then allocates itself just enough memory to decompress
cannam@89: the file.  Since block sizes are stored in compressed files, it follows
cannam@89: that the flags \-1 to \-9 are irrelevant to decompression, and so are
cannam@89: ignored when decompressing.
cannam@89: 
cannam@89: Compression and decompression requirements, 
cannam@89: in bytes, can be estimated as:
cannam@89: 
cannam@89:        Compression:   400k + ( 8 x block size )
cannam@89: 
cannam@89:        Decompression: 100k + ( 4 x block size ), or
cannam@89:                       100k + ( 2.5 x block size )
cannam@89: 
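The estimates above translate directly into code. In this sketch "k" is taken to mean 1000 bytes, an assumption that matches the block sizes of 100,000 through 900,000 bytes stated earlier; the function names are invented for the example.

```python
def block_size_bytes(level: int) -> int:
    """Block size selected by the flags -1 to -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100_000


def compress_memory(level: int) -> int:
    """Estimated bytes needed to compress: 400k + 8 x block size."""
    return 400_000 + 8 * block_size_bytes(level)


def decompress_memory(level: int, small: bool = False) -> int:
    """Estimated bytes needed to decompress.

    Normal mode: 100k + 4 x block size.
    With -s:     100k + 2.5 x block size.
    """
    block = block_size_bytes(level)
    if small:
        # 2.5 x block, kept in integer arithmetic.
        return 100_000 + (5 * block) // 2
    return 100_000 + 4 * block
```

For the default level 9 this gives 7,600,000 bytes to compress, 3,700,000 to decompress, and 2,350,000 to decompress with -s.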
cannam@89: Larger block sizes give rapidly diminishing marginal returns.  Most of
cannam@89: the compression comes from the first two or three hundred k of block
cannam@89: size, a fact worth bearing in mind when using
cannam@89: .I bzip2
cannam@89: on small machines.
cannam@89: It is also important to appreciate that the decompression memory
cannam@89: requirement is set at compression time by the choice of block size.
cannam@89: 
cannam@89: For files compressed with the default 900k block size,
cannam@89: .I bunzip2
cannam@89: will require about 3700 kbytes to decompress.  To support decompression
cannam@89: of any file on a 4 megabyte machine, 
cannam@89: .I bunzip2
cannam@89: has an option to
cannam@89: decompress using approximately half this amount of memory, about 2300
cannam@89: kbytes.  Decompression speed is also halved, so you should use this
cannam@89: option only where necessary.  The relevant flag is -s.
cannam@89: 
cannam@89: In general, try to use the largest block size memory constraints allow,
cannam@89: since that maximises the compression achieved.  Compression and
cannam@89: decompression speed are virtually unaffected by block size.
cannam@89: 
cannam@89: Another significant point applies to files which fit in a single block
cannam@89: -- that means most files you'd encounter using a large block size.  The
cannam@89: amount of real memory touched is proportional to the size of the file,
cannam@89: since the file is smaller than a block.  For example, compressing a file
cannam@89: 20,000 bytes long with the flag -9 will cause the compressor to
cannam@89: allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
cannam@89: kbytes of it.  Similarly, the decompressor will allocate 3700k but only
cannam@89: touch 100k + 20000 * 4 = 180 kbytes.
cannam@89: 
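The arithmetic in this example generalises to any file size. The sketch below assumes that a file larger than one block touches memory for a full block, and again treats "k" as 1000 bytes; the function names are hypothetical.

```python
def block_size(level: int) -> int:
    """Block size in bytes for the flags -1 to -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100_000


def touched_when_compressing(file_size: int, level: int = 9) -> int:
    """Approximate bytes actually touched by the compressor."""
    # A file smaller than a block only touches memory proportional to its size.
    return 400_000 + 8 * min(file_size, block_size(level))


def touched_when_decompressing(file_size: int, level: int = 9) -> int:
    """Approximate bytes actually touched by the decompressor (without -s)."""
    return 100_000 + 4 * min(file_size, block_size(level))
```

For the 20,000 byte file in the text, these return 560,000 and 180,000 bytes respectively.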
cannam@89: Here is a table which summarises the maximum memory usage for different
cannam@89: block sizes.  Also recorded is the total compressed size for 14 files of
cannam@89: the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
cannam@89: column gives some feel for how compression varies with block size.
cannam@89: These figures tend to understate the advantage of larger block sizes for
cannam@89: larger files, since the Corpus is dominated by smaller files.
cannam@89: 
cannam@89:            Compress   Decompress   Decompress   Corpus
cannam@89:     Flag     usage      usage       -s usage     Size
cannam@89: 
cannam@89:      -1      1200k       500k         350k      914704
cannam@89:      -2      2000k       900k         600k      877703
cannam@89:      -3      2800k      1300k         850k      860338
cannam@89:      -4      3600k      1700k        1100k      846899
cannam@89:      -5      4400k      2100k        1350k      845160
cannam@89:      -6      5200k      2500k        1600k      838626
cannam@89:      -7      6000k      2900k        1850k      834096
cannam@89:      -8      6800k      3300k        2100k      828642
cannam@89:      -9      7600k      3700k        2350k      828642
cannam@89: 
cannam@89: .SH RECOVERING DATA FROM DAMAGED FILES
cannam@89: .I bzip2
cannam@89: compresses files in blocks, usually 900 kbytes long.  Each
cannam@89: block is handled independently.  If a media or transmission error causes
cannam@89: a multi-block .bz2
cannam@89: file to become damaged, it may be possible to
cannam@89: recover data from the undamaged blocks in the file.
cannam@89: 
cannam@89: The compressed representation of each block is delimited by a 48-bit
cannam@89: pattern, which makes it possible to find the block boundaries with
cannam@89: reasonable certainty.  Each block also carries its own 32-bit CRC, so
cannam@89: damaged blocks can be distinguished from undamaged ones.
cannam@89: 
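The search for block boundaries can be sketched as a bit-by-bit scan. The 48-bit value used below, 0x314159265359, is taken from the bzip2 file format and is not stated in this manual page. The function find_block_bit_offsets is a hypothetical illustration, not the algorithm bzip2recover actually implements, and the stream is produced with the bz2 module from the Python standard library.

```python
import bz2

# 48-bit block header pattern from the bzip2 file format (an assumption
# relative to this page, which only says the pattern is 48 bits long).
BLOCK_MAGIC = 0x314159265359


def find_block_bit_offsets(data: bytes) -> list:
    """Return every bit offset at which the 48-bit pattern occurs."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << 48) - 1
    offsets = []
    # Blocks need not start on a byte boundary, so every bit position is tried.
    for bit in range(total_bits - 48 + 1):
        shift = total_bits - 48 - bit
        if (value >> shift) & mask == BLOCK_MAGIC:
            offsets.append(bit)
    return offsets


# A small single-block stream: 4 header bytes ("BZh" plus the level digit),
# followed immediately by the first block's 48-bit pattern.
stream = bz2.compress(b"example data " * 100)
offsets = find_block_bit_offsets(stream)
```

This quadratic scan is only suitable for small inputs; it is meant to show why a distinctive delimiter makes recovery feasible, nothing more.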
cannam@89: .I bzip2recover
cannam@89: is a simple program whose purpose is to search for
cannam@89: blocks in .bz2 files, and write each block out into its own .bz2 
cannam@89: file.  You can then use
cannam@89: .I bzip2 
cannam@89: \-t
cannam@89: to test the
cannam@89: integrity of the resulting files, and decompress those which are
cannam@89: undamaged.
cannam@89: 
cannam@89: .I bzip2recover
cannam@89: takes a single argument, the name of the damaged file, 
cannam@89: and writes a number of files "rec00001file.bz2",
cannam@89: "rec00002file.bz2", etc., containing the extracted blocks.
cannam@89: The output filenames are designed so that the use of
cannam@89: wildcards in subsequent processing -- for example,
cannam@89: "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
cannam@89: the correct order.
cannam@89: 
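The reason the wildcard yields the right order is the zero-padded block number: plain lexicographic sorting then agrees with numeric order. A small sketch, in which the base name "file.bz2" and the helper recovered_name are assumptions made for illustration:

```python
def recovered_name(block_number: int, original: str = "file.bz2") -> str:
    """Build a name in the rec00001file.bz2 style shown above."""
    # Five digits, zero padded, so that string order matches numeric order.
    return f"rec{block_number:05d}{original}"


# Names generated out of order, then sorted the way a shell glob would
# typically list them (plain lexicographic order is assumed here).
names = [recovered_name(n) for n in (10, 2, 1)]
in_glob_order = sorted(names)
```

Without the padding, "rec10file.bz2" would sort before "rec2file.bz2" and the blocks would be reassembled in the wrong order.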
cannam@89: .I bzip2recover
cannam@89: should be of most use dealing with large .bz2
cannam@89: files, as these will contain many blocks.  It is clearly
cannam@89: futile to use it on damaged single-block files, since a
cannam@89: damaged block cannot be recovered.  If you wish to minimise
cannam@89: any potential data loss through media or transmission errors,
cannam@89: you might consider compressing with a smaller
cannam@89: block size.
cannam@89: 
cannam@89: .SH PERFORMANCE NOTES
cannam@89: The sorting phase of compression gathers together similar strings in the
cannam@89: file.  Because of this, files containing very long runs of repeated
cannam@89: symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
cannam@89: compress more slowly than normal.  Versions 0.9.5 and above fare much
cannam@89: better than previous versions in this respect.  The ratio between
cannam@89: worst-case and average-case compression time is in the region of 10:1.
cannam@89: For previous versions, this figure was more like 100:1.  You can use the
cannam@89: \-vvvv option to monitor progress in great detail, if you want.
cannam@89: 
cannam@89: Decompression speed is unaffected by these phenomena.
cannam@89: 
cannam@89: .I bzip2
cannam@89: usually allocates several megabytes of memory to operate
cannam@89: in, and then charges all over it in a fairly random fashion.  This means
cannam@89: that performance, both for compressing and decompressing, is largely
cannam@89: determined by the speed at which your machine can service cache misses.
cannam@89: Because of this, small changes to the code to reduce the miss rate have
cannam@89: been observed to give disproportionately large performance improvements.
cannam@89: I imagine 
cannam@89: .I bzip2
cannam@89: will perform best on machines with very large caches.
cannam@89: 
cannam@89: .SH CAVEATS
cannam@89: I/O error messages are not as helpful as they could be.
cannam@89: .I bzip2
cannam@89: tries hard to detect I/O errors and exit cleanly, but the details of
cannam@89: what the problem is sometimes seem rather misleading.
cannam@89: 
cannam@89: This manual page pertains to version 1.0.6 of
cannam@89: .I bzip2.  
cannam@89: Compressed data created by this version is entirely forwards and
cannam@89: backwards compatible with the previous public releases, versions
cannam@89: 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
cannam@89: exception: 0.9.0 and above can correctly decompress multiple
cannam@89: concatenated compressed files.  0.1pl2 cannot do this; it will stop
cannam@89: after decompressing just the first file in the stream.
cannam@89: 
cannam@89: .I bzip2recover
cannam@89: versions prior to 1.0.2 used 32-bit integers to represent
cannam@89: bit positions in compressed files, so they could not handle compressed
cannam@89: files more than 512 megabytes long.  Versions 1.0.2 and above use
cannam@89: 64-bit ints on some platforms which support them (GNU supported
cannam@89: targets, and Windows).  To establish whether or not bzip2recover was
cannam@89: built with such a limitation, run it without arguments.  In any event
cannam@89: you can build yourself an unlimited version if you can recompile it
cannam@89: with MaybeUInt64 set to be an unsigned 64-bit integer.
cannam@89: 
cannam@89: 
cannam@89: 
cannam@89: .SH AUTHOR
cannam@89: Julian Seward, jseward@bzip.org.
cannam@89: 
cannam@89: http://www.bzip.org
cannam@89: 
cannam@89: The ideas embodied in
cannam@89: .I bzip2
cannam@89: are due to (at least) the following
cannam@89: people: Michael Burrows and David Wheeler (for the block sorting
cannam@89: transformation), David Wheeler (again, for the Huffman coder), Peter
cannam@89: Fenwick (for the structured coding model in the original
cannam@89: .I bzip,
cannam@89: and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
cannam@89: (for the arithmetic coder in the original
cannam@89: .I bzip).  
cannam@89: I am much
cannam@89: indebted for their help, support and advice.  See the manual in the
cannam@89: source distribution for pointers to sources of documentation.  Christian
cannam@89: von Roques encouraged me to look for faster sorting algorithms, so as to
cannam@89: speed up compression.  Bela Lubkin encouraged me to improve the
cannam@89: worst-case compression performance.  
cannam@89: Donna Robinson XMLised the documentation.
cannam@89: The bz* scripts are derived from those of GNU gzip.
cannam@89: Many people sent patches, helped
cannam@89: with portability problems, lent machines, gave advice and were generally
cannam@89: helpful.