bzip2 and libbzip2, version 1.0.6

A program and library for data compression

Julian Seward

http://www.bzip.org

Version 1.0.6 of 6 September 2010

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

PATENTS: To the best of my knowledge, bzip2 and libbzip2 do not use any patented algorithms. However, I do not have the resources to carry out a patent search. Therefore I cannot give any guarantee of the above statement.

Table of Contents

1. Introduction

2. How to use bzip2

2.1. NAME
2.2. SYNOPSIS
2.3. DESCRIPTION
2.4. OPTIONS
2.5. MEMORY MANAGEMENT
2.6. RECOVERING DATA FROM DAMAGED FILES
2.7. PERFORMANCE NOTES
2.8. CAVEATS
2.9. AUTHOR

3. Programming with libbzip2

3.1. Top-level structure

3.1.1. Low-level summary
3.1.2. High-level summary
3.1.3. Utility functions summary

3.2. Error handling

3.3. Low-level interface

3.3.1. BZ2_bzCompressInit
3.3.2. BZ2_bzCompress
3.3.3. BZ2_bzCompressEnd
3.3.4. BZ2_bzDecompressInit
3.3.5. BZ2_bzDecompress
3.3.6. BZ2_bzDecompressEnd

3.4. High-level interface

3.4.1. BZ2_bzReadOpen
3.4.2. BZ2_bzRead
3.4.3. BZ2_bzReadGetUnused
3.4.4. BZ2_bzReadClose
3.4.5. BZ2_bzWriteOpen
3.4.6. BZ2_bzWrite
3.4.7. BZ2_bzWriteClose
3.4.8. Handling embedded compressed data streams
3.4.9. Standard file-reading/writing code

3.5. Utility functions

3.5.1. BZ2_bzBuffToBuffCompress
3.5.2. BZ2_bzBuffToBuffDecompress

3.6. zlib compatibility functions

3.7. Using the library in a stdio-free environment

3.7.1. Getting rid of stdio
3.7.2. Critical error handling

3.8. Making a Windows DLL

4. Miscellanea

4.1. Limitations of the compressed file format
4.2. Portability issues
4.3. Reporting bugs
4.4. Did you get the right package?
4.5. Further Reading

1. Introduction

bzip2 compresses files using the Burrows-Wheeler block-sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.

bzip2 is built on top of libbzip2, a flexible library for handling compressed data in the bzip2 format. This manual describes both how to use the program and how to work with the library interface. Most of the manual is devoted to this library, not the program, which is good news if your interest is only in the program.

How to use bzip2 describes how to use bzip2; this is the only part you need to read if you just want to know how to operate the program.
Programming with libbzip2 describes the programming interfaces in detail, and
Miscellanea records some miscellaneous notes which I thought ought to be recorded somewhere.

2. How to use bzip2

This chapter contains a copy of the bzip2 man page, and nothing else.

2.1. NAME

bzip2, bunzip2 - a block-sorting file compressor, v1.0.6
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files

2.2. SYNOPSIS

bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename

2.3. DESCRIPTION

bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those of GNU gzip, but they are not identical.

bzip2 expects a list of file names to accompany the command-line flags. Each file is replaced by a compressed version of itself, with the name original_name.bz2. Each compressed file has the same modification date, permissions, and, when possible, ownership as the corresponding original, so that these properties can be correctly restored at decompression time. File name handling is naive in the sense that there is no mechanism for preserving original file names, permissions, ownerships or dates in filesystems which lack these concepts, or have serious file name length restrictions, such as MS-DOS.

bzip2 and bunzip2 will by default not overwrite existing files. If you want this to happen, specify the -f flag.

If no file names are specified, bzip2 compresses from standard input to standard output. In this case, bzip2 will decline to write compressed output to a terminal, as this would be entirely incomprehensible and therefore pointless.

bunzip2 (or bzip2 -d) decompresses all specified files. Files which were not created by bzip2 will be detected and ignored, and a warning issued. bzip2 attempts to guess the filename for the decompressed file from that of the compressed file as follows:

filename.bz2 becomes filename
filename.bz becomes filename
filename.tbz2 becomes filename.tar
filename.tbz becomes filename.tar
anyothername becomes anyothername.out

If the file does not end in one of the recognised endings, .bz2, .bz, .tbz2 or .tbz, bzip2 complains that it cannot guess the name of the original file, and uses the original name with .out appended.
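The renaming rules above can be sketched in C as follows. This is only an illustration of the table, not the code bzip2 itself uses; the function name guess_output_name and its return convention are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper illustrating the renaming table above.
   Writes the guessed output name into out (capacity out_size).
   Returns 1 if a recognised ending was replaced, 0 if the ".out"
   fallback was used, and -1 if out is too small. */
int guess_output_name(const char *name, char *out, size_t out_size)
{
    /* { recognised ending, replacement } pairs, taken from the table. */
    static const char *const endings[][2] = {
        { ".bz2",  ""     },
        { ".bz",   ""     },
        { ".tbz2", ".tar" },
        { ".tbz",  ".tar" },
    };
    size_t name_len = strlen(name);
    size_t i;
    int n;

    for (i = 0; i < sizeof endings / sizeof endings[0]; i++) {
        size_t end_len = strlen(endings[i][0]);
        /* Assumption of this sketch: the name must be longer than the
           ending, so a file called just ".bz2" takes the fallback. */
        if (name_len > end_len &&
            strcmp(name + name_len - end_len, endings[i][0]) == 0) {
            n = snprintf(out, out_size, "%.*s%s",
                         (int)(name_len - end_len), name, endings[i][1]);
            return (n >= 0 && (size_t)n < out_size) ? 1 : -1;
        }
    }
    /* No recognised ending: keep the original name and append ".out". */
    n = snprintf(out, out_size, "%s.out", name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

With this sketch, "src.tbz2" maps to "src.tar" and "notes.txt" maps to "notes.txt.out".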

As with compression, supplying no filenames causes decompression from standard input to standard output.

bunzip2 will correctly decompress a file which is the concatenation of two or more compressed files. The result is the concatenation of the corresponding uncompressed files. Integrity testing (-t) of concatenated compressed files is also supported.

You can also compress or decompress files to the standard output by giving the -c flag. Multiple files may be compressed and decompressed like this. The resulting outputs are fed sequentially to stdout. Compression of multiple files in this manner generates a stream containing multiple compressed file representations. Such a stream can be decompressed correctly only by bzip2 version 0.9.0 or later. Earlier versions of bzip2 will stop after decompressing the first file in the stream.

bzcat (or bzip2 -dc) decompresses all specified files to the standard output.

bzip2 will read arguments from the environment variables BZIP2 and BZIP, in that order, and will process them before any arguments read from the command line. This gives a convenient way to supply default arguments.

Compression is always performed, even if the compressed file is slightly larger than the original. Files of less than about one hundred bytes tend to get larger, since the compression mechanism has a constant overhead in the region of 50 bytes. Random data (including the output of most file compressors) is coded at about 8.05 bits per byte, giving an expansion of around 0.5%.

As a self-check for your protection, bzip2 uses 32-bit CRCs to make sure that the decompressed version of a file is identical to the original. This guards against corruption of the compressed data, and against undetected bugs in bzip2 (hopefully very unlikely). The chance of data corruption going undetected is microscopic, about one in four billion for each file processed. Be aware, though, that the check occurs upon decompression, so it can only tell you that something is wrong. It can't help you recover the original uncompressed data. You can use bzip2recover to try to recover data from damaged files.

Return values: 0 for a normal exit, 1 for environmental problems (file not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed file, 3 for an internal consistency error (e.g., a bug) which caused bzip2 to panic.
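A program that runs bzip2 as a child process might translate these return values into messages along the following lines. This is a minimal sketch; the function name bzip2_exit_meaning is hypothetical, and the wording of the messages is simply taken from the paragraph above.

```c
#include <stddef.h>

/* Hypothetical helper: maps a bzip2 exit status to a short
   description, following the list of return values above. */
const char *bzip2_exit_meaning(int status)
{
    switch (status) {
    case 0:  return "normal exit";
    case 1:  return "environmental problem";      /* file not found, bad flags, I/O error */
    case 2:  return "corrupt compressed file";
    case 3:  return "internal consistency error"; /* bzip2 panicked, e.g. because of a bug */
    default: return "unknown status";             /* not documented in the man page */
    }
}
```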

2.4. OPTIONS

-c --stdout

Compress or decompress to standard output.

-d --decompress

Force decompression. bzip2, bunzip2 and bzcat are really the same program, and the decision about what actions to take is made on the basis of which name is used. This flag overrides that mechanism, and forces bzip2 to decompress.

-z --compress

The complement to -d: forces compression, regardless of the invocation name.

-t --test

Check integrity of the specified file(s), but don't decompress them. This really performs a trial decompression and throws away the result.

-f --force

Force overwrite of output files. Normally, bzip2 will not overwrite existing output files. Also forces bzip2 to break hard links to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the correct magic header bytes. If forced (-f), however, it will pass such files through unmodified. This is how GNU gzip behaves.

-k --keep

Keep (don't delete) input files during compression or decompression.

-s --small

Reduce memory usage, for compression, decompression and testing. Files are decompressed and tested using a modified algorithm which only requires 2.5 bytes per block byte. This means any file can be decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, -s selects a block size of 200k, which limits memory use to around the same figure, at the expense of your compression ratio. In short, if your machine is low on memory (8 megabytes or less), use -s for everything. See MEMORY MANAGEMENT below.

-q --quiet

Suppress non-essential warning messages. Messages pertaining to I/O errors and other critical events will not be suppressed.

-v --verbose

Verbose mode -- show the compression ratio for each file processed. Further -v's increase the verbosity level, spewing out lots of information which is primarily of interest for diagnostic purposes.

-L --license -V --version

Display the software version, license terms and conditions.

-1 (or --fast) to -9 (or --best)

Set the block size to 100 k, 200 k ... 900 k when compressing. Has no effect when decompressing. See MEMORY MANAGEMENT below. The --fast and --best aliases are primarily for GNU gzip compatibility. In particular, --fast doesn't make things significantly faster. And --best merely selects the default behaviour.

--

Treats all subsequent arguments as file names, even if they start with a dash. This is so you can handle files with names beginning with a dash, for example: bzip2 -- -myfilename.

--repetitive-fast, --repetitive-best

These flags are redundant in versions 0.9.5 and above. They provided some coarse control over the behaviour of the sorting algorithm in earlier versions, which was sometimes useful. 0.9.5 and above have an improved algorithm which renders these flags irrelevant.

2.5. MEMORY MANAGEMENT

bzip2 compresses large files in blocks. The block size affects both the compression ratio achieved, and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default) respectively. At decompression time, the block size used for compression is read from the header of the compressed file, and bunzip2 then allocates itself just enough memory to decompress the file. Since block sizes are stored in compressed files, it follows that the flags -1 to -9 are irrelevant to and so ignored during decompression.

Compression and decompression requirements, in bytes, can be estimated as:

Compression:   400k + ( 8 x block size )

Decompression: 100k + ( 4 x block size ), or
               100k + ( 2.5 x block size )
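The estimates above are easy to evaluate for a given flag. The following sketch does so in C; the function names are hypothetical, "k" is taken to mean 1000 bytes (matching the 100,000-byte block sizes quoted above), and the 2.5 x figure is the one that applies when decompressing with -s.

```c
/* Hypothetical helpers evaluating the memory estimates above.
   level is the digit of the -1 .. -9 flag; results are in bytes. */

long bz_block_size(int level)
{
    return 100000L * level;                 /* -1 => 100,000 ... -9 => 900,000 */
}

long bz_compress_mem(int level)
{
    return 400000L + 8L * bz_block_size(level);
}

long bz_decompress_mem(int level)
{
    return 100000L + 4L * bz_block_size(level);
}

long bz_decompress_small_mem(int level)     /* decompression with -s */
{
    return 100000L + 5L * bz_block_size(level) / 2L;   /* 2.5 x block size */
}
```

For the default -9 this gives 7,600,000 bytes to compress, 3,700,000 to decompress and 2,350,000 to decompress with -s.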

Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size, bunzip2 will require about 3700 kbytes to decompress. To support decompression of any file on a 4 megabyte machine, bunzip2 has an option to decompress using approximately half this amount of memory, about 2300 kbytes. Decompression speed is also halved, so you should use this option only where necessary. The relevant flag is -s.

In general, try and use the largest block size memory constraints allow, since that maximises the compression achieved. Compression and decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block -- that means most files you'd encounter using a large block size. The amount of real memory touched is proportional to the size of the file, since the file is smaller than a block. For example, compressing a file 20,000 bytes long with the flag -9 will cause the compressor to allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the decompressor will allocate 3700k but only touch 100k + 20000 * 4 = 180 kbytes.
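The difference between memory allocated and memory touched can be sketched as below. The function names are hypothetical, and the sketch assumes that a file larger than one block touches the full per-block amount.

```c
/* Hypothetical helpers: bytes of memory actually touched when the
   input may be smaller than one block. file_size is in bytes and
   level is the digit of the -1 .. -9 flag used for compression. */

static long bz_bytes_in_use(long file_size, int level)
{
    long block = 100000L * level;
    return file_size < block ? file_size : block;   /* at most one block */
}

long bz_compress_touched(long file_size, int level)
{
    return 400000L + 8L * bz_bytes_in_use(file_size, level);
}

long bz_decompress_touched(long file_size, int level)
{
    return 100000L + 4L * bz_bytes_in_use(file_size, level);
}
```

For the 20,000-byte file and -9 from the example above, these return 560,000 and 180,000 bytes respectively.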

Here is a table which summarises the maximum memory usage for different block sizes. Also recorded is the total compressed size for 14 files of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This column gives some feel for how compression varies with block size. These figures tend to understate the advantage of larger block sizes for larger files, since the Corpus is dominated by smaller files.

        Compress   Decompress   Decompress   Corpus
Flag     usage      usage       -s usage     Size

 -1      1200k       500k         350k      914704
 -2      2000k       900k         600k      877703
 -3      2800k      1300k         850k      860338
 -4      3600k      1700k        1100k      846899
 -5      4400k      2100k        1350k      845160
 -6      5200k      2500k        1600k      838626
 -7      6000k      2900k        1850k      834096
 -8      6800k      3300k        2100k      828642
 -9      7600k      3700k        2350k      828642

2.6. RECOVERING DATA FROM DAMAGED FILES

bzip2 compresses files in blocks, usually 900 kbytes long. Each block is handled independently. If a media or transmission error causes a multi-block .bz2 file to become damaged, it may be possible to recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.
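Because blocks are not byte-aligned, finding the delimiter means sliding a 48-bit window over the file one bit at a time. The sketch below illustrates the idea. It is not the bzip2recover source: the function name is hypothetical, and the constant 0x314159265359 is the block-start pattern used by the bzip2 format, which this chapter does not spell out.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC UINT64_C(0x314159265359)   /* 48-bit block-start pattern */
#define MASK_48     UINT64_C(0xFFFFFFFFFFFF)

/* Hypothetical helper: scans buf bit by bit, most significant bit of
   each byte first, and records the bit offset at which each 48-bit
   block-start pattern begins. At most max_hits offsets are stored in
   hits; the return value is the total number of matches seen. */
size_t find_block_starts(const unsigned char *buf, size_t len,
                         uint64_t *hits, size_t max_hits)
{
    uint64_t window = 0;      /* the last 48 bits read */
    uint64_t bits_seen = 0;
    size_t found = 0;
    size_t i;
    int b;

    for (i = 0; i < len; i++) {
        for (b = 7; b >= 0; b--) {
            window = ((window << 1) | ((buf[i] >> b) & 1u)) & MASK_48;
            bits_seen++;
            if (bits_seen >= 48 && window == BLOCK_MAGIC) {
                if (found < max_hits)
                    hits[found] = bits_seen - 48;
                found++;
            }
        }
    }
    return found;
}
```

A match may occasionally be a coincidence inside compressed data, which is why the text above speaks of "reasonable certainty" and why each candidate block should still be checked, for example with bzip2 -t.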

bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files, and decompress those which are undamaged.

bzip2recover takes a single argument, the name of the damaged file, and writes a number of files rec0001file.bz2, rec0002file.bz2, etc, containing the extracted blocks. The output filenames are designed so that the use of wildcards in subsequent processing -- for example, bzip2 -dc rec*file.bz2 > recovered_data -- lists the files in the correct order.
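The reason the wildcard lists the files in the right order is the zero-padded block number: a sorted expansion of rec*file.bz2 then agrees with numeric order. A minimal sketch of such a naming scheme follows; the function name is hypothetical and the four-digit width is taken from the example names above, so treat both as assumptions rather than as what bzip2recover actually does.

```c
#include <stdio.h>

/* Hypothetical helper: builds "rec" + zero-padded block number +
   original file name, e.g. rec0002file.bz2 for block 2 of file.bz2.
   Returns 0 on success, -1 if out (capacity out_size) is too small. */
int recovered_name(char *out, size_t out_size,
                   unsigned int block_no, const char *file)
{
    int n = snprintf(out, out_size, "rec%04u%s", block_no, file);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Without the padding, rec10file.bz2 would sort before rec2file.bz2 and the recovered data would be reassembled out of order.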

bzip2recover should be of most use dealing with large .bz2 files, as these will contain many blocks. It is clearly futile to use it on damaged single-block files, since a damaged block cannot be recovered. If you wish to minimise any potential data loss through media or transmission errors, you might consider compressing with a smaller block size.

2.7. PERFORMANCE NOTES

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect. The ratio between worst-case and average-case compression time is in the region of 10:1. For previous versions, this figure was more like 100:1. You can use the -vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

bzip2 usually allocates several megabytes of memory to operate in, and then charges all over it in a fairly random fashion. This means that performance, both for compressing and decompressing, is largely determined by the speed at which your machine can service cache misses. Because of this, small changes to the code to reduce the miss rate have been observed to give disproportionately large performance improvements. I imagine bzip2 will perform best on machines with very large caches.

2.8. CAVEATS

I/O error messages are not as helpful as they could be. bzip2 tries hard to detect I/O errors and exit cleanly, but the details of what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.6 of bzip2. Compressed data created by this version is entirely forwards and backwards compatible with the previous public releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and 1.0.3, but with the following exception: 0.9.0 and above can correctly decompress multiple concatenated compressed files. 0.1pl2 cannot do this; it will stop after decompressing just the first file in the stream.

bzip2recover versions prior to 1.0.2 used 32-bit integers to represent bit positions in compressed files, so they could not handle compressed files more than 512 megabytes long. Versions 1.0.2 and above use 64-bit ints on some platforms which support them (GNU supported targets, and Windows). To establish whether or not bzip2recover was built with such a limitation, run it without arguments. In any event you can build yourself an unlimited version if you can recompile it with MaybeUInt64 set to be an unsigned 64-bit integer.
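The 512 megabyte figure follows directly from the width of the bit-position counter: a 32-bit integer can number 2^32 bits, which is 2^32 / 8 = 536,870,912 bytes. The arithmetic can be checked with a small sketch; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: the largest file size, in bytes, whose every
   bit can be numbered by an unsigned integer of index_bits bits. */
uint64_t addressable_bytes(unsigned int index_bits)
{
    /* 2^index_bits distinct bit positions, 8 bits per byte. */
    if (index_bits >= 64)
        return UINT64_MAX / 8 + 1;          /* 2^64 / 8 = 2^61 */
    return (UINT64_C(1) << index_bits) / 8;
}
```

With a 64-bit counter the bound rises to 2^61 bytes, which is why such a build is effectively unlimited.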

2.9. AUTHOR

Julian Seward, jseward@bzip.org

The ideas embodied in bzip2 are due to (at least) the following people: Michael Burrows and David Wheeler (for the block sorting transformation), David Wheeler (again, for the Huffman coder), Peter Fenwick (for the structured coding model in the original bzip, and many refinements), and Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic coder in the original bzip). I am much indebted for their help, support and advice. See the manual in the source distribution for pointers to sources of documentation. Christian von Roques encouraged me to look for faster sorting algorithms, so as to speed up compression. Bela Lubkin encouraged me to improve the worst-case compression performance. Donna Robinson XMLised the documentation. Many people sent patches, helped with portability problems, lent machines, gave advice and were generally helpful.

3. Programming with libbzip2

This chapter describes the programming interface to libbzip2.

For general background information, particularly about memory use and performance aspects, you'd be well advised to read How to use bzip2 as well.

3.1. Top-level structure

libbzip2 is a flexible library for compressing and decompressing data in the bzip2 data format. Although packaged as a single entity, it helps to regard the library as three separate parts: the low-level interface, the high-level interface, and some utility functions.

The structure of libbzip2's interfaces is similar to that of Jean-loup Gailly's and Mark Adler's excellent zlib library.

All externally visible symbols have names beginning BZ2_. This is new in version 1.0. The intention is to minimise pollution of the namespaces of library clients.

To use any part of the library, you need to #include <bzlib.h> into your sources.

3.1.1. Low-level summary

This interface provides services for compressing and decompressing data in memory. There's no provision for dealing with files, streams or any other I/O mechanisms, just straight memory-to-memory work. In fact, this part of the library can be compiled without inclusion of stdio.h, which may be helpful for embedded applications.

The low-level part of the library has no global variables and is therefore thread-safe.

Six routines make up the low-level interface: BZ2_bzCompressInit, BZ2_bzCompress, and BZ2_bzCompressEnd for compression, and a corresponding trio BZ2_bzDecompressInit, BZ2_bzDecompress and BZ2_bzDecompressEnd for decompression. The *Init functions allocate memory for compression/decompression and do other initialisations, whilst the *End functions close down operations and release memory.

The real work is done by BZ2_bzCompress and BZ2_bzDecompress. These compress and decompress data from a user-supplied input buffer to a user-supplied output buffer. These buffers can be any size; arbitrary quantities of data are handled by making repeated calls to these functions. This is a flexible mechanism allowing a consumer-pull style of activity, or producer-push, or a mixture of both.

3.1.2. High-level summary

This interface provides some handy wrappers around the low-level interface to facilitate reading and writing bzip2 format files (.bz2 files). The routines provide hooks to facilitate reading files in which the bzip2 data stream is embedded within some larger-scale file structure, or where there are multiple bzip2 data streams concatenated end-to-end.

For reading files, BZ2_bzReadOpen, BZ2_bzRead, BZ2_bzReadClose and BZ2_bzReadGetUnused are supplied. For writing files, BZ2_bzWriteOpen, BZ2_bzWrite and BZ2_bzWriteClose are available.

As with the low-level library, no global variables are used so the library is per se thread-safe. However, if I/O errors occur whilst reading or writing the underlying compressed files, you may have to consult errno to determine the cause of the error. In that case, you'd need a C library which correctly supports errno in a multithreaded environment.

To make the library a little simpler and more portable, BZ2_bzReadOpen and BZ2_bzWriteOpen require you to pass them file handles (FILE*s) which have previously been opened for reading or writing respectively. That avoids portability problems associated with file operations and file attributes, whilst not being much of an imposition on the programmer.

3.1.3. Utility functions summary

For very simple needs, BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress are provided. These compress data in memory from one buffer to another buffer in a single function call. You should assess whether these functions fulfill your memory-to-memory compression/decompression requirements before investing effort in understanding the more general but more complex low-level interface.

Yoshioka Tsuneo (tsuneo@rr.iij4u.or.jp) has contributed some functions to give better zlib compatibility. These functions are BZ2_bzopen, BZ2_bzread, BZ2_bzwrite, BZ2_bzflush, BZ2_bzclose, BZ2_bzerror and BZ2_bzlibVersion. You may find these functions more convenient for simple file reading and writing, than those in the high-level interface. These functions are not (yet) officially part of the library, and are minimally documented here. If they break, you get to keep all the pieces. I hope to document them properly when time permits.

Yoshioka also contributed modifications to allow the library to be built as a Windows DLL.

3.2. Error handling

The library is designed to recover cleanly in all situations, including the worst-case situation of decompressing random data. I'm not 100% sure that it can always do this, so you might want to add a signal handler to catch segmentation violations during decompression if you are feeling especially paranoid. I would be interested in hearing more about the robustness of the library to corrupted compressed data.

Version 1.0.3 is more robust in this respect than any previous version. Investigations with Valgrind (a tool for detecting problems with memory management) indicate that, at least for the few files I tested, all single-bit errors in the compressed data are caught properly, with no segmentation faults, no uses of uninitialised data, no out of range reads or writes, and no infinite looping in the decompressor. So it's certainly pretty robust, although I wouldn't claim it to be totally bombproof.

The file bzlib.h contains all definitions needed to use the library. In particular, you should definitely not include bzlib_private.h.

In bzlib.h, the various return values are defined. The following list is not intended as an exhaustive description of the circumstances in which a given value may be returned -- those descriptions are given later. Rather, it is intended to convey the rough meaning of each return value. The first five values are normal and not intended to denote an error situation.

BZ_OK: The requested action was completed successfully.
BZ_RUN_OK, BZ_FLUSH_OK, BZ_FINISH_OK: In BZ2_bzCompress, the requested flush/finish/nothing-special action was completed successfully.
BZ_STREAM_END: Compression of data was completed, or the logical stream end was detected during decompression.

The following return values indicate an error of some kind.

BZ_CONFIG_ERROR: Indicates that the library has been improperly compiled on your platform -- a major configuration error. Specifically, it means that sizeof(char), sizeof(short) and sizeof(int) are not 1, 2 and 4 respectively, as they should be. Note that the library should still work properly on 64-bit platforms which follow the LP64 programming model -- that is, where sizeof(long) and sizeof(void*) are 8. Under LP64, sizeof(int) is still 4, so libbzip2, which doesn't use the long type, is OK.
BZ_SEQUENCE_ERROR: When using the library, it is important to call the functions in the correct sequence and with data structures (buffers etc) in the correct states. libbzip2 checks as much as it can to ensure this is happening, and returns BZ_SEQUENCE_ERROR if not. Code which complies precisely with the function semantics, as detailed below, should never receive this value; such an event denotes buggy code which you should investigate.
BZ_PARAM_ERROR: Returned when a parameter to a function call is out of range or otherwise manifestly incorrect. As with BZ_SEQUENCE_ERROR, this denotes a bug in the client code. The distinction between BZ_PARAM_ERROR and BZ_SEQUENCE_ERROR is a bit hazy, but still worth making.
BZ_MEM_ERROR: Returned when a request to allocate memory failed. Note that the quantity of memory needed to decompress a stream cannot be determined until the stream's header has been read. So BZ2_bzDecompress and BZ2_bzRead may return BZ_MEM_ERROR even though some of the compressed data has been read. The same is not true for compression; once BZ2_bzCompressInit or BZ2_bzWriteOpen have successfully completed, BZ_MEM_ERROR cannot occur.
BZ_DATA_ERROR: Returned when a data integrity error is detected during decompression. Most importantly, this means when stored and computed CRCs for the data do not match. This value is also returned upon detection of any other anomaly in the compressed data.
BZ_DATA_ERROR_MAGIC: As a special case of BZ_DATA_ERROR, it is sometimes useful to know when the compressed stream does not start with the correct magic bytes ('B' 'Z' 'h').
BZ_IO_ERROR: Returned by BZ2_bzRead and BZ2_bzWrite when there is an error reading or writing in the compressed file, and by BZ2_bzReadOpen and BZ2_bzWriteOpen for attempts to use a file for which the error indicator (viz, ferror(f)) is set. On receipt of BZ_IO_ERROR, the caller should consult errno and/or perror to acquire operating-system specific information about the problem.
BZ_UNEXPECTED_EOF: Returned by BZ2_bzRead when the compressed file finishes before the logical end of stream is detected.
BZ_OUTBUFF_FULL: Returned by BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress to indicate that the output data will not fit into the output buffer provided.

3.3. Low-level interface

3.3.1. BZ2_bzCompressInit

typedef struct {
  char *next_in;
  unsigned int avail_in;
  unsigned int total_in_lo32;
  unsigned int total_in_hi32;

  char *next_out;
  unsigned int avail_out;
  unsigned int total_out_lo32;
  unsigned int total_out_hi32;

  void *state;

  void *(*bzalloc)(void *,int,int);
  void (*bzfree)(void *,void *);
  void *opaque;
} bz_stream;

int BZ2_bzCompressInit ( bz_stream *strm,
                         int blockSize100k,
                         int verbosity,
                         int workFactor );

Prepares for compression. The bz_stream structure holds all data pertaining to the compression activity. A bz_stream structure should be allocated and initialised prior to the call. The fields of bz_stream comprise the entirety of the user-visible data. state is a pointer to the private data structures required for compression.

Custom memory allocators are supported, via fields bzalloc, bzfree, and opaque. The value opaque is passed as the first argument to all calls to bzalloc and bzfree, but is otherwise ignored by the library. The call bzalloc ( opaque, n, m ) is expected to return a pointer p to n * m bytes of memory, and bzfree ( opaque, p ) should free that memory.

If you don't want to use a custom memory allocator, set bzalloc, bzfree and opaque to NULL, and the library will then use the standard malloc / free routines.
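
To make the allocator contract concrete, here is a minimal sketch of a pair of functions with the bzalloc and bzfree signatures. The names counting_alloc, counting_free and struct alloc_stats are hypothetical; the example merely wraps malloc / free and uses opaque to reach a counter of live blocks.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping record, reached through the opaque pointer. */
struct alloc_stats {
    int live_blocks;   /* blocks handed out and not yet freed */
};

/* Has the bzalloc signature: returns n * m bytes, or NULL on failure. */
void *counting_alloc(void *opaque, int n, int m)
{
    struct alloc_stats *st = opaque;
    void *p;

    if (n < 0 || m < 0)
        return NULL;
    p = malloc((size_t)n * (size_t)m);
    if (p != NULL)
        st->live_blocks++;
    return p;
}

/* Has the bzfree signature: releases a block obtained from counting_alloc. */
void counting_free(void *opaque, void *p)
{
    struct alloc_stats *st = opaque;

    if (p != NULL) {
        st->live_blocks--;
        free(p);
    }
}
```

To use them, one would set strm.bzalloc = counting_alloc, strm.bzfree = counting_free and strm.opaque to the address of a struct alloc_stats before the initialisation call. Once the stream has been ended, a nonzero live_blocks would point to a leak.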

Before calling BZ2_bzCompressInit, fields bzalloc, bzfree and opaque should be filled appropriately, as just described. Upon return, the internal state will have been allocated and initialised, and total_in_lo32, total_in_hi32, total_out_lo32 and total_out_hi32 will have been set to zero. These four fields are used by the library to inform the caller of the total amount of data passed into and out of the library, respectively. You should not try to change them. As of version 1.0, 64-bit counts are maintained, even on 32-bit platforms, using the _hi32 fields to store the upper 32 bits of the count. So, for example, the total amount of data in is (total_in_hi32 << 32) + total_in_lo32.
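
Taken literally as C, shifting a 32-bit unsigned int left by 32 is undefined, so the high half has to be widened first. A small sketch, assuming a C99 compiler with uint64_t; the helper name bz_total64 is hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: join the _hi32 and _lo32 halves into one 64-bit count. */
uint64_t bz_total64(unsigned int hi32, unsigned int lo32)
{
    /* Widen before shifting; the shift happens in 64-bit arithmetic. */
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

For example, bz_total64(strm.total_in_hi32, strm.total_in_lo32) would give the total number of input bytes consumed so far.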

Parameter blockSize100k specifies the block size to be used for compression. It should be a value between 1 and 9 inclusive, and the actual block size used is 100000 x this figure. 9 gives the best compression but takes most memory.

Parameter verbosity should be set to a number between 0 and 4 inclusive. 0 is silent, and greater numbers give increasingly verbose monitoring/debugging output. If the library has been compiled with -DBZ_NO_STDIO, no such output will appear for any verbosity setting.

Parameter workFactor controls how the compression phase behaves when presented with worst case, highly repetitive, input data. If compression runs into difficulties caused by repetitive data, the library switches from the standard sorting algorithm to a fallback algorithm. The fallback is slower than the standard algorithm by perhaps a factor of three, but always behaves reasonably, no matter how bad the input.

Lower values of workFactor reduce the amount of effort the standard algorithm will expend before resorting to the fallback. You should set this parameter carefully: too low, and many inputs will be handled by the fallback algorithm and so compress rather slowly; too high, and your average-to-worst case compression times can become very large. The default value of 30 gives reasonable behaviour over a wide range of circumstances.

Allowable values range from 0 to 250 inclusive. 0 is a special case, equivalent to using the default value of 30.

Note that the compressed output generated is the same regardless of whether or not the fallback algorithm is used.

Be aware also that this parameter may disappear entirely in future versions of the library. In principle it should be possible to devise a good way to automatically choose which algorithm to use. Such a mechanism would render the parameter obsolete.

Possible return values:

BZ_CONFIG_ERROR
  if the library has been mis-compiled
BZ_PARAM_ERROR
  if strm is NULL
  or blockSize100k < 1 or blockSize100k > 9
  or verbosity < 0 or verbosity > 4
  or workFactor < 0 or workFactor > 250
BZ_MEM_ERROR
  if not enough memory is available
BZ_OK
  otherwise
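
The parameter rules above are simple enough to check before calling into the library. This sketch only restates the documented ranges and defaults; all three helper names are hypothetical and none of them is part of libbzip2.

```c
/* Hypothetical helper: nonzero if the three numeric parameters are within
   the documented ranges (the strm == NULL check is left to the caller). */
int compress_params_ok(int blockSize100k, int verbosity, int workFactor)
{
    return blockSize100k >= 1 && blockSize100k <= 9
        && verbosity >= 0 && verbosity <= 4
        && workFactor >= 0 && workFactor <= 250;
}

/* Actual block size in bytes: 100000 x blockSize100k. */
long block_size_bytes(int blockSize100k)
{
    return 100000L * blockSize100k;
}

/* workFactor 0 is documented as equivalent to the default of 30. */
int effective_work_factor(int workFactor)
{
    return workFactor == 0 ? 30 : workFactor;
}
```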

Allowable next actions:

BZ2_bzCompress
  if BZ_OK is returned
  no specific action needed in case of error

3.3.2. BZ2_bzCompress

int BZ2_bzCompress ( bz_stream *strm, int action );

Provides more input and/or output buffer space for the library. The caller maintains input and output buffers, and calls BZ2_bzCompress to transfer data between them.

Before each call to BZ2_bzCompress, next_in should point at the data to be compressed, and avail_in should indicate how many bytes the library may read. BZ2_bzCompress updates next_in, avail_in and total_in to reflect the number of bytes it has read.

Similarly, next_out should point to a buffer in which the compressed data is to be placed, with avail_out indicating how much output space is available. BZ2_bzCompress updates next_out, avail_out and total_out to reflect the number of bytes output.

You may provide and remove as little or as much data as you like on each call of BZ2_bzCompress. In the limit, it is acceptable to supply and remove data one byte at a time, although this would be terribly inefficient. You should always ensure that at least one byte of output space is available at each call.

A second purpose of BZ2_bzCompress is to request a change of mode of the compressed stream.

Conceptually, a compressed stream can be in one of four states: IDLE, RUNNING, FLUSHING and FINISHING. Before initialisation (BZ2_bzCompressInit) and after termination (BZ2_bzCompressEnd), a stream is regarded as IDLE.

Upon initialisation (BZ2_bzCompressInit), the stream is placed in the RUNNING state. Subsequent calls to BZ2_bzCompress should pass BZ_RUN as the requested action; other actions are illegal and will result in BZ_SEQUENCE_ERROR.

At some point, the calling program will have provided all the input data it wants to. It will then want to finish up -- in effect, asking the library to process any data it might have buffered internally. In this state, BZ2_bzCompress will no longer attempt to read data from next_in, but it will want to write data to next_out. Because the output buffer supplied by the user can be arbitrarily small, the finishing-up operation cannot necessarily be done with a single call of BZ2_bzCompress.

Instead, the calling program passes BZ_FINISH as an action to BZ2_bzCompress. This changes the stream's state to FINISHING. Any remaining input (ie, next_in[0 .. avail_in-1]) is compressed and transferred to the output buffer. To do this, BZ2_bzCompress must be called repeatedly until all the output has been consumed. At that point, BZ2_bzCompress returns BZ_STREAM_END, and the stream's state is set back to IDLE. BZ2_bzCompressEnd should then be called.

Just to make sure the calling program does not cheat, the library makes a note of avail_in at the time of the first call to BZ2_bzCompress which has BZ_FINISH as an action (ie, at the time the program has announced its intention to not supply any more input). By comparing this value with that of avail_in over subsequent calls to BZ2_bzCompress, the library can detect any attempts to slip in more data to compress. Any calls for which this is detected will return BZ_SEQUENCE_ERROR. This indicates a programming mistake which should be corrected.

Instead of asking to finish, the calling program may ask BZ2_bzCompress to take all the remaining input, compress it and terminate the current (Burrows-Wheeler) compression block. This could be useful for error control purposes. The mechanism is analogous to that for finishing: call BZ2_bzCompress with an action of BZ_FLUSH, remove output data, and persist with the BZ_FLUSH action until the value BZ_RUN_OK is returned. As with finishing, BZ2_bzCompress detects any attempt to provide more input data once the flush has begun.

Once the flush is complete, the stream returns to the normal RUNNING state.

This all sounds pretty complex, but isn't really. Here's a table which shows which actions are allowable in each state, what action will be taken, what the next state is, and what the non-error return values are. Note that you can't explicitly ask what state the stream is in, but nor do you need to -- it can be inferred from the values returned by BZ2_bzCompress.

IDLE/any
  Illegal.  IDLE state only exists after BZ2_bzCompressEnd or
  before BZ2_bzCompressInit.
  Return value = BZ_SEQUENCE_ERROR

RUNNING/BZ_RUN
  Compress from next_in to next_out as much as possible.
  Next state = RUNNING
  Return value = BZ_RUN_OK

RUNNING/BZ_FLUSH
  Remember current value of avail_in. Compress from next_in
  to next_out as much as possible, but do not accept any more input.
  Next state = FLUSHING
  Return value = BZ_FLUSH_OK

RUNNING/BZ_FINISH
  Remember current value of avail_in. Compress from next_in
  to next_out as much as possible, but do not accept any more input.
  Next state = FINISHING
  Return value = BZ_FINISH_OK

FLUSHING/BZ_FLUSH
  Compress from next_in to next_out as much as possible,
  but do not accept any more input.
  If all the existing input has been used up and all compressed
  output has been removed
    Next state = RUNNING; Return value = BZ_RUN_OK
  else
    Next state = FLUSHING; Return value = BZ_FLUSH_OK

FLUSHING/other
  Illegal.
  Return value = BZ_SEQUENCE_ERROR

FINISHING/BZ_FINISH
  Compress from next_in to next_out as much as possible,
  but do not accept any more input.
  If all the existing input has been used up and all compressed
  output has been removed
    Next state = IDLE; Return value = BZ_STREAM_END
  else
    Next state = FINISHING; Return value = BZ_FINISH_OK

FINISHING/other
  Illegal.
  Return value = BZ_SEQUENCE_ERROR
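
The table can also be read as a small transition function. The sketch below is a toy model of the table exactly as written, not the library's implementation: it performs no compression, all type and function names are hypothetical, and the drained flag stands in for "all the existing input has been used up and all compressed output has been removed".

```c
/* Toy model of the state table; every name here is hypothetical. */
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } model_state;
typedef enum { ACT_RUN, ACT_FLUSH, ACT_FINISH } model_action;
typedef enum {
    RET_SEQUENCE_ERROR, RET_RUN_OK, RET_FLUSH_OK, RET_FINISH_OK, RET_STREAM_END
} model_result;

/* Apply one action to *st and report the return value the table gives. */
model_result model_step(model_state *st, model_action act, int drained)
{
    switch (*st) {
    case ST_RUNNING:
        if (act == ACT_RUN)
            return RET_RUN_OK;                  /* stays RUNNING */
        if (act == ACT_FLUSH) {
            *st = ST_FLUSHING;
            return RET_FLUSH_OK;
        }
        *st = ST_FINISHING;                     /* act == ACT_FINISH */
        return RET_FINISH_OK;

    case ST_FLUSHING:
        if (act != ACT_FLUSH)
            return RET_SEQUENCE_ERROR;
        if (drained) {
            *st = ST_RUNNING;
            return RET_RUN_OK;
        }
        return RET_FLUSH_OK;

    case ST_FINISHING:
        if (act != ACT_FINISH)
            return RET_SEQUENCE_ERROR;
        if (drained) {
            *st = ST_IDLE;
            return RET_STREAM_END;
        }
        return RET_FINISH_OK;

    case ST_IDLE:
    default:
        return RET_SEQUENCE_ERROR;              /* IDLE/any is illegal */
    }
}
```

Stepping such a model through a planned call sequence is a cheap way to check that the sequence cannot run into BZ_SEQUENCE_ERROR.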

That still looks complicated? Well, fair enough. The usual sequence of calls for compressing a load of data is:

Get started with BZ2_bzCompressInit.
Shovel data in and shlurp out its compressed form using zero or more calls of BZ2_bzCompress with action = BZ_RUN.
Finish up. Repeatedly call BZ2_bzCompress with action = BZ_FINISH, copying out the compressed output, until BZ_STREAM_END is returned.
Close up and go home. Call BZ2_bzCompressEnd.

If the data you want to compress fits into your input buffer all at once, you can skip the calls of BZ2_bzCompress ( ..., BZ_RUN ) and just do the BZ2_bzCompress ( ..., BZ_FINISH ) calls.

All required memory is allocated by BZ2_bzCompressInit. The compression library can accept any data at all (obviously). So you shouldn't get any error return values from the BZ2_bzCompress calls. If you do, they will be BZ_SEQUENCE_ERROR, and indicate a bug in your programming.

Trivial other possible return values:

BZ_PARAM_ERROR
  if strm is NULL, or strm->state is NULL

3.3.3. BZ2_bzCompressEnd

int BZ2_bzCompressEnd ( bz_stream *strm );

Releases all memory associated with a compression stream.

Possible return values:

BZ_PARAM_ERROR  if strm is NULL or strm->state is NULL
BZ_OK           otherwise

3.3.4. BZ2_bzDecompressInit

int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );

Prepares for decompression. As with BZ2_bzCompressInit, a bz_stream record should be allocated and initialised before the call. Fields bzalloc, bzfree and opaque should be set if a custom memory allocator is required, or made NULL for the normal malloc / free routines. Upon return, the internal state will have been initialised, and total_in and total_out will be zero.

For the meaning of parameter verbosity, see BZ2_bzCompressInit.

If small is nonzero, the library will use an alternative decompression algorithm which uses less memory but at the cost of decompressing more slowly (roughly speaking, half the speed, but the maximum memory requirement drops to around 2300k). See How to use bzip2 for more information on memory management.

Note that the amount of memory needed to decompress a stream cannot be determined until the stream's header has been read, so even if BZ2_bzDecompressInit succeeds, a subsequent BZ2_bzDecompress could fail with BZ_MEM_ERROR.

Possible return values:

BZ_CONFIG_ERROR
  if the library has been mis-compiled
BZ_PARAM_ERROR
  if ( small != 0 && small != 1 )
  or (verbosity < 0 || verbosity > 4)
BZ_MEM_ERROR
  if insufficient memory is available
BZ_OK
  otherwise

Allowable next actions:

BZ2_bzDecompress
  if BZ_OK was returned
  no specific action required in case of error

3.3.5. BZ2_bzDecompress

int BZ2_bzDecompress ( bz_stream *strm );

Provides more input and/or output buffer space for the library. The caller maintains input and output buffers, and uses BZ2_bzDecompress to transfer data between them.

Before each call to BZ2_bzDecompress, next_in should point at the compressed data, and avail_in should indicate how many bytes the library may read. BZ2_bzDecompress updates next_in, avail_in and total_in to reflect the number of bytes it has read.

Similarly, next_out should point to a buffer in which the uncompressed output is to be placed, with avail_out indicating how much output space is available. BZ2_bzDecompress updates next_out, avail_out and total_out to reflect the number of bytes output.

You may provide and remove as little or as much data as you like on each call of BZ2_bzDecompress. In the limit, it is acceptable to supply and remove data one byte at a time, although this would be terribly inefficient. You should always ensure that at least one byte of output space is available at each call.

Use of BZ2_bzDecompress is simpler than BZ2_bzCompress.

You should provide input and remove output as described above, and repeatedly call BZ2_bzDecompress until BZ_STREAM_END is returned. Appearance of BZ_STREAM_END denotes that BZ2_bzDecompress has detected the logical end of the compressed stream. BZ2_bzDecompress will not produce BZ_STREAM_END until all output data has been placed into the output buffer, so once BZ_STREAM_END appears, you are guaranteed to have available all the decompressed output, and BZ2_bzDecompressEnd can safely be called.

In case of an error return value, you should call BZ2_bzDecompressEnd to clean up and release memory.

Possible return values:

BZ_PARAM_ERROR
  if strm is NULL or strm->state is NULL
  or strm->avail_out < 1
BZ_DATA_ERROR
  if a data integrity error is detected in the compressed stream
BZ_DATA_ERROR_MAGIC
  if the compressed stream doesn't begin with the right magic bytes
BZ_MEM_ERROR
  if there wasn't enough memory available
BZ_STREAM_END
  if the logical end of the data stream was detected and all
  output has been consumed, eg strm->avail_out > 0
BZ_OK
  otherwise

Allowable next actions:

BZ2_bzDecompress
  if BZ_OK was returned
BZ2_bzDecompressEnd
  otherwise

3.3.6. BZ2_bzDecompressEnd

int BZ2_bzDecompressEnd ( bz_stream *strm );

Releases all memory associated with a decompression stream.

Possible return values:

BZ_PARAM_ERROR
  if strm is NULL or strm->state is NULL
BZ_OK
  otherwise

Allowable next actions:

  None.

3.4. High-level interface

This interface provides functions for reading and writing bzip2 format files. First, some general points.

All of the functions take an int* first argument, bzerror. After each call, bzerror should be consulted first to determine the outcome of the call. If bzerror is BZ_OK, the call completed successfully, and only then should the return value of the function (if any) be consulted. If bzerror is BZ_IO_ERROR, there was an error reading/writing the underlying compressed file, and you should then consult errno / perror to determine the cause of the difficulty. bzerror may also be set to various other values; precise details are given on a per-function basis below.
If bzerror indicates an error (ie, anything except BZ_OK and BZ_STREAM_END), you should immediately call BZ2_bzReadClose (or BZ2_bzWriteClose, depending on whether you are attempting to read or to write) to free up all resources associated with the stream. Once an error has been indicated, behaviour of all calls except BZ2_bzReadClose (BZ2_bzWriteClose) is undefined. The implication is that (1) bzerror should be checked after each call, and (2) if bzerror indicates an error, BZ2_bzReadClose (BZ2_bzWriteClose) should then be called to clean up.
The FILE* arguments passed to BZ2_bzReadOpen / BZ2_bzWriteOpen should be set to binary mode. Most Unix systems will do this by default, but other platforms, including Windows and Mac, will not. If you omit this, you may encounter problems when moving code to new platforms.
Memory allocation requests are handled by malloc / free. At present there is no facility for user-defined memory allocators in the file I/O functions (could easily be added, though).

3.4.1. BZ2_bzReadOpen

typedef void BZFILE;

BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f,
                        int verbosity, int small,
                        void *unused, int nUnused );

Prepare to read compressed data from file handle f. f should refer to a file which has been opened for reading, and for which the error indicator (ferror(f)) is not set. If small is 1, the library will try to decompress using less memory, at the expense of speed.

For reasons explained below, BZ2_bzRead will decompress the nUnused bytes starting at unused, before starting to read from the file f. At most BZ_MAX_UNUSED bytes may be supplied like this. If this facility is not required, you should pass NULL and 0 for unused and nUnused respectively.

For the meaning of parameters small and verbosity, see BZ2_bzDecompressInit.

The amount of memory needed to decompress a file cannot be determined until the file's header has been read. So it is possible that BZ2_bzReadOpen returns BZ_OK but a subsequent call of BZ2_bzRead will return BZ_MEM_ERROR.

Possible assignments to bzerror:

BZ_CONFIG_ERROR
  if the library has been mis-compiled
BZ_PARAM_ERROR
  if f is NULL
  or small is neither 0 nor 1
  or ( unused == NULL && nUnused != 0 )
  or ( unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED) )
BZ_IO_ERROR
  if ferror(f) is nonzero
BZ_MEM_ERROR
  if insufficient memory is available
BZ_OK
  otherwise.

Possible return values:

Pointer to an abstract BZFILE
  if bzerror is BZ_OK
NULL
  otherwise

Allowable next actions:

BZ2_bzRead
  if bzerror is BZ_OK
BZ2_bzReadClose
  otherwise

3.4.2. BZ2_bzRead

int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );

Reads up to len (uncompressed) bytes from the compressed file b into the buffer buf. If the read was successful, bzerror is set to BZ_OK and the number of bytes read is returned. If the logical end-of-stream was detected, bzerror will be set to BZ_STREAM_END, and the number of bytes read is returned. All other bzerror values denote an error.

BZ2_bzRead will supply len bytes, unless the logical stream end is detected or an error occurs. Because of this, it is possible to detect the stream end by observing when the number of bytes returned is less than the number requested. Nevertheless, this is regarded as inadvisable; you should instead check bzerror after every call and watch out for BZ_STREAM_END.

Internally, BZ2_bzRead copies data from the compressed file in chunks of size BZ_MAX_UNUSED bytes before decompressing it. If the file contains more bytes than strictly needed to reach the logical end-of-stream, BZ2_bzRead will almost certainly read some of the trailing data before signalling BZ_STREAM_END. To collect the read but unused data once BZ_STREAM_END has appeared, call BZ2_bzReadGetUnused immediately before BZ2_bzReadClose.

Possible assignments to bzerror:

BZ_PARAM_ERROR
  if b is NULL or buf is NULL or len < 0
BZ_SEQUENCE_ERROR
  if b was opened with BZ2_bzWriteOpen
BZ_IO_ERROR
  if there is an error reading from the compressed file
BZ_UNEXPECTED_EOF
  if the compressed file ended before
  the logical end-of-stream was detected
BZ_DATA_ERROR
  if a data integrity error was detected in the compressed stream
BZ_DATA_ERROR_MAGIC
  if the stream does not begin with the requisite header bytes
  (ie, is not a bzip2 data file).  This is really
  a special case of BZ_DATA_ERROR.
BZ_MEM_ERROR
  if insufficient memory was available
BZ_STREAM_END
  if the logical end of stream was detected.
BZ_OK
  otherwise.

Possible return values:

number of bytes read
  if bzerror is BZ_OK or BZ_STREAM_END
undefined
  otherwise

Allowable next actions:

collect data from buf, then BZ2_bzRead or BZ2_bzReadClose
  if bzerror is BZ_OK
collect data from buf, then BZ2_bzReadClose or BZ2_bzReadGetUnused
  if bzerror is BZ_STREAM_END
BZ2_bzReadClose
  otherwise

cannam@89:

3.4.3. BZ2_bzReadGetUnused

void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b,
                          void** unused, int* nUnused );

Returns data which was read from the compressed file but was not needed to get to the logical end-of-stream. *unused is set to the address of the data, and *nUnused to the number of bytes. *nUnused will be set to a value between 0 and BZ_MAX_UNUSED inclusive.

This function may only be called once BZ2_bzRead has signalled BZ_STREAM_END but before BZ2_bzReadClose.

Possible assignments to bzerror:

BZ_PARAM_ERROR
  if b is NULL
  or unused is NULL or nUnused is NULL
BZ_SEQUENCE_ERROR
  if BZ_STREAM_END has not been signalled
  or if b was opened with BZ2_bzWriteOpen
BZ_OK
  otherwise

Allowable next actions:

BZ2_bzReadClose

3.4.4. BZ2_bzReadClose

void BZ2_bzReadClose ( int *bzerror, BZFILE *b );

Releases all memory pertaining to the compressed file b. BZ2_bzReadClose does not call fclose on the underlying file handle, so you should do that yourself if appropriate. BZ2_bzReadClose should be called to clean up after all error situations.

Possible assignments to bzerror:

BZ_SEQUENCE_ERROR
  if b was opened with BZ2_bzWriteOpen
BZ_OK
  otherwise

Allowable next actions:

none

3.4.5. BZ2_bzWriteOpen

BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f,
                         int blockSize100k, int verbosity,
                         int workFactor );

Prepare to write compressed data to file handle f. f should refer to a file which has been opened for writing, and for which the error indicator (ferror(f)) is not set.

For the meaning of parameters blockSize100k, verbosity and workFactor, see BZ2_bzCompressInit.

All required memory is allocated at this stage, so if the call completes successfully, BZ_MEM_ERROR cannot be signalled by a subsequent call to BZ2_bzWrite.

Possible assignments to bzerror:

BZ_CONFIG_ERROR
  if the library has been mis-compiled
BZ_PARAM_ERROR
  if f is NULL
  or blockSize100k < 1 or blockSize100k > 9
BZ_IO_ERROR
  if ferror(f) is nonzero
BZ_MEM_ERROR
  if insufficient memory is available
BZ_OK
  otherwise

Possible return values:

Pointer to an abstract BZFILE
  if bzerror is BZ_OK
NULL
  otherwise

Allowable next actions:

BZ2_bzWrite
  if bzerror is BZ_OK
  (you could go directly to BZ2_bzWriteClose, but this would be pretty pointless)
BZ2_bzWriteClose
  otherwise

3.4.6. BZ2_bzWrite

void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );

Absorbs len bytes from the buffer buf, eventually to be compressed and written to the file.

Possible assignments to bzerror:

BZ_PARAM_ERROR
  if b is NULL or buf is NULL or len < 0
BZ_SEQUENCE_ERROR
  if b was opened with BZ2_bzReadOpen
BZ_IO_ERROR
  if there is an error writing the compressed file.
BZ_OK
  otherwise

3.4.7. BZ2_bzWriteClose

void BZ2_bzWriteClose( int *bzerror, BZFILE* b,
                       int abandon,
                       unsigned int* nbytes_in,
                       unsigned int* nbytes_out );

void BZ2_bzWriteClose64( int *bzerror, BZFILE* b,
                         int abandon,
                         unsigned int* nbytes_in_lo32,
                         unsigned int* nbytes_in_hi32,
                         unsigned int* nbytes_out_lo32,
                         unsigned int* nbytes_out_hi32 );

Compresses and flushes to the compressed file all data so far supplied by BZ2_bzWrite. The logical end-of-stream markers are also written, so subsequent calls to BZ2_bzWrite are illegal. All memory associated with the compressed file b is released. fflush is called on the compressed file, but it is not fclose'd.

If BZ2_bzWriteClose is called to clean up after an error, the only action is to release the memory. The library records the error codes issued by previous calls, so this situation will be detected automatically. There is no attempt to complete the compression operation, nor to fflush the compressed file. You can force this behaviour to happen even in the case of no error, by passing a nonzero value to abandon.

If nbytes_in is non-null, *nbytes_in will be set to the total volume of uncompressed data handled. Similarly, if nbytes_out is non-null, *nbytes_out will be set to the total volume of compressed data written. For compatibility with older versions of the library, BZ2_bzWriteClose only yields the lower 32 bits of these counts. Use BZ2_bzWriteClose64 if you want the full 64 bit counts. These two functions are otherwise absolutely identical.

Possible assignments to bzerror:

BZ_SEQUENCE_ERROR
  if b was opened with BZ2_bzReadOpen
BZ_IO_ERROR
  if there is an error writing the compressed file
BZ_OK
  otherwise

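The two 32-bit halves reported by BZ2_bzWriteClose64 can be joined into a single 64-bit count. A minimal sketch, assuming unsigned int is 32 bits wide on your platform; join_count64 is a hypothetical helper name, not part of libbzip2:

```c
#include <stdint.h>

/* Join the low and high 32-bit halves of a byte count, as reported by
   BZ2_bzWriteClose64, into one 64-bit value.
   join_count64 is a hypothetical helper, not a libbzip2 function. */
static uint64_t join_count64(unsigned int lo32, unsigned int hi32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

After a call such as BZ2_bzWriteClose64 ( &bzerror, b, 0, &in_lo, &in_hi, &out_lo, &out_hi ), join_count64 ( in_lo, in_hi ) gives the total uncompressed volume and join_count64 ( out_lo, out_hi ) the total compressed volume.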

3.4.8. Handling embedded compressed data streams

The high-level library facilitates use of bzip2 data streams which form some part of a surrounding, larger data stream.

For writing, the library takes an open file handle, writes compressed data to it, fflushes it but does not fclose it. The calling application can write its own data before and after the compressed data stream, using that same file handle.

Reading is more complex, and the facilities are not as general as they could be since generality is hard to reconcile with efficiency. BZ2_bzRead reads from the compressed file in blocks of size BZ_MAX_UNUSED bytes, and in doing so probably will overshoot the logical end of compressed stream. To recover this data once decompression has ended, call BZ2_bzReadGetUnused after the last call of BZ2_bzRead (the one returning BZ_STREAM_END) but before calling BZ2_bzReadClose.

This mechanism makes it easy to decompress multiple bzip2 streams placed end-to-end. At the end of one stream, when BZ2_bzRead returns BZ_STREAM_END, call BZ2_bzReadGetUnused to collect the unused data (copy it into your own buffer somewhere). That data forms the start of the next compressed stream. To start uncompressing that next stream, call BZ2_bzReadOpen again, feeding in the unused data via the unused / nUnused parameters. Keep doing this until the BZ_STREAM_END return coincides with the physical end of file (feof(f)). In this situation BZ2_bzReadGetUnused will of course return no data.


This should give some feel for how the high-level interface can be used. If you require extra flexibility, you'll have to bite the bullet and get to grips with the low-level interface.

3.4.9. Standard file-reading/writing code

Here's how you'd write data to a compressed file:

FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "wb" );
if ( !f ) {
  /* handle error */
}
/* blockSize100k = 9, verbosity = 0, workFactor = 0 (library default) */
b = BZ2_bzWriteOpen ( &bzerror, f, 9, 0, 0 );
if ( bzerror != BZ_OK ) {
  BZ2_bzWriteClose ( &bzerror, b, 1, NULL, NULL );
  /* handle error */
}

while ( /* condition */ ) {
  /* get data to write into buf, and set nBuf appropriately */
  BZ2_bzWrite ( &bzerror, b, buf, nBuf );
  if ( bzerror == BZ_IO_ERROR ) {
    /* abandon = 1: release memory without finishing the stream */
    BZ2_bzWriteClose ( &bzerror, b, 1, NULL, NULL );
    /* handle error */
  }
}

/* abandon = 0: finish the stream; the byte counts are not needed here */
BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
if ( bzerror == BZ_IO_ERROR ) {
  /* handle error */
}

And to read from a compressed file:

FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "rb" );
if ( !f ) {
  /* handle error */
}
/* verbosity = 0, small = 0, no previously unused data */
b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 );
if ( bzerror != BZ_OK ) {
  BZ2_bzReadClose ( &bzerror, b );
  /* handle error */
}

bzerror = BZ_OK;
while ( bzerror == BZ_OK && /* arbitrary other conditions */) {
  nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
  if ( bzerror == BZ_OK || bzerror == BZ_STREAM_END ) {
    /* do something with buf[0 .. nBuf-1]; the final call can
       return data together with BZ_STREAM_END */
  }
}
if ( bzerror != BZ_STREAM_END ) {
  BZ2_bzReadClose ( &bzerror, b );
  /* handle error */
} else {
  BZ2_bzReadClose ( &bzerror, b );
}

3.5. Utility functions

3.5.1. BZ2_bzBuffToBuffCompress

int BZ2_bzBuffToBuffCompress( char*         dest,
                              unsigned int* destLen,
                              char*         source,
                              unsigned int  sourceLen,
                              int           blockSize100k,
                              int           verbosity,
                              int           workFactor );

Attempts to compress the data in source[0 .. sourceLen-1] into the destination buffer, dest[0 .. *destLen-1]. If the destination buffer is big enough, *destLen is set to the size of the compressed data, and BZ_OK is returned. If the compressed data won't fit, *destLen is unchanged, and BZ_OUTBUFF_FULL is returned.

Compression in this manner is a one-shot event, done with a single call to this function. The resulting compressed data is a complete bzip2 format data stream. There is no mechanism for making additional calls to provide extra input data. If you want that kind of mechanism, use the low-level interface.

For the meaning of parameters blockSize100k, verbosity and workFactor, see BZ2_bzCompressInit.

To guarantee that the compressed data will fit in its buffer, allocate an output buffer of size 1% larger than the uncompressed data, plus six hundred extra bytes.

BZ2_bzBuffToBuffCompress will not write data at or beyond dest[*destLen], even in case of buffer overflow.

Possible return values:

BZ_CONFIG_ERROR
  if the library has been mis-compiled
BZ_PARAM_ERROR
  if dest is NULL or destLen is NULL
  or blockSize100k < 1 or blockSize100k > 9
  or verbosity < 0 or verbosity > 4
  or workFactor < 0 or workFactor > 250
BZ_MEM_ERROR
  if insufficient memory is available
BZ_OUTBUFF_FULL
  if the size of the compressed data exceeds *destLen
BZ_OK
  otherwise

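The buffer-sizing rule given for BZ2_bzBuffToBuffCompress (1% larger than the input, plus six hundred bytes) can be written as a small helper. A minimal sketch; bz_compress_bound is a hypothetical name, not a libbzip2 function, and it rounds the 1% term upwards to stay on the safe side:

```c
#include <stddef.h>

/* Worst-case destination size for BZ2_bzBuffToBuffCompress, following the
   manual's rule: 1% larger than the input, plus 600 bytes.
   bz_compress_bound is a hypothetical helper, not part of libbzip2. */
static size_t bz_compress_bound(size_t sourceLen)
{
    /* (sourceLen + 99) / 100 is 1% of sourceLen, rounded up. */
    return sourceLen + (sourceLen + 99) / 100 + 600;
}
```

Since destLen is an unsigned int, check that the result fits in that type before passing it to the library.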

3.5.2. BZ2_bzBuffToBuffDecompress

int BZ2_bzBuffToBuffDecompress( char*         dest,
                                unsigned int* destLen,
                                char*         source,
                                unsigned int  sourceLen,
                                int           small,
                                int           verbosity );

Attempts to decompress the data in source[0 .. sourceLen-1] into the destination buffer, dest[0 .. *destLen-1]. If the destination buffer is big enough, *destLen is set to the size of the uncompressed data, and BZ_OK is returned. If the uncompressed data won't fit, *destLen is unchanged, and BZ_OUTBUFF_FULL is returned.

source is assumed to hold a complete bzip2 format data stream. BZ2_bzBuffToBuffDecompress tries to decompress the entirety of the stream into the output buffer.

For the meaning of parameters small and verbosity, see BZ2_bzDecompressInit.

Because the compression ratio of the compressed data cannot be known in advance, there is no easy way to guarantee that the output buffer will be big enough. You may of course make arrangements in your code to record the size of the uncompressed data, but such a mechanism is beyond the scope of this library.

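One way to record the uncompressed size yourself is to store it in a small fixed header in front of the compressed data, so that the reader can size the output buffer before calling BZ2_bzBuffToBuffDecompress. A minimal sketch of the header encoding only; the 4-byte little-endian layout and the names put_u32_le and get_u32_le are assumptions made for illustration, not part of libbzip2 or of the bzip2 file format:

```c
#include <stdint.h>

/* Store a 32-bit length in a fixed little-endian layout, so the header
   reads back the same on any host. Hypothetical helper. */
static void put_u32_le(unsigned char out[4], uint32_t n)
{
    out[0] = (unsigned char)(n & 0xFF);
    out[1] = (unsigned char)((n >> 8) & 0xFF);
    out[2] = (unsigned char)((n >> 16) & 0xFF);
    out[3] = (unsigned char)((n >> 24) & 0xFF);
}

/* Read back a length written by put_u32_le. Hypothetical helper. */
static uint32_t get_u32_le(const unsigned char in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

The writer would emit the four header bytes followed by the output of BZ2_bzBuffToBuffCompress; the reader would call get_u32_le, allocate that many bytes, and pass the remainder to BZ2_bzBuffToBuffDecompress.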

BZ2_bzBuffToBuffDecompress will not write data at or beyond dest[*destLen], even in case of buffer overflow.

Possible return values:

BZ_CONFIG_ERROR
  if the library has been mis-compiled
BZ_PARAM_ERROR
  if dest is NULL or destLen is NULL
  or small != 0 && small != 1
  or verbosity < 0 or verbosity > 4
BZ_MEM_ERROR
  if insufficient memory is available
BZ_OUTBUFF_FULL
  if the size of the uncompressed data exceeds *destLen
BZ_DATA_ERROR
  if a data integrity error was detected in the compressed data
BZ_DATA_ERROR_MAGIC
  if the compressed data doesn't begin with the right magic bytes
BZ_UNEXPECTED_EOF
  if the compressed data ends unexpectedly
BZ_OK
  otherwise

3.6. zlib compatibility functions

Yoshioka Tsuneo has contributed some functions to give better zlib compatibility. These functions are BZ2_bzopen, BZ2_bzdopen, BZ2_bzread, BZ2_bzwrite, BZ2_bzflush, BZ2_bzclose, BZ2_bzerror and BZ2_bzlibVersion. These functions are not (yet) officially part of the library. If they break, you get to keep all the pieces. Nevertheless, I think they work ok.

typedef void BZFILE;

const char * BZ2_bzlibVersion ( void );

Returns a string indicating the library version.

BZFILE * BZ2_bzopen  ( const char *path, const char *mode );
BZFILE * BZ2_bzdopen ( int        fd,    const char *mode );

Opens a .bz2 file for reading or writing, using either its name or a pre-existing file descriptor. Analogous to fopen and fdopen.

int BZ2_bzread  ( BZFILE* b, void* buf, int len );
int BZ2_bzwrite ( BZFILE* b, void* buf, int len );

Reads/writes data from/to a previously opened BZFILE. Analogous to fread and fwrite.

int  BZ2_bzflush ( BZFILE* b );
void BZ2_bzclose ( BZFILE* b );

Flushes/closes a BZFILE. BZ2_bzflush doesn't actually do anything. Analogous to fflush and fclose.

const char * BZ2_bzerror ( BZFILE *b, int *errnum );

Returns a string describing the most recent error status of b, and also sets *errnum to its numerical value.

3.7. Using the library in a stdio-free environment

3.7.1. Getting rid of stdio

In a deeply embedded application, you might want to use just the memory-to-memory functions. You can do this conveniently by compiling the library with preprocessor symbol BZ_NO_STDIO defined. Doing this gives you a library containing only the following eight functions:

BZ2_bzCompressInit, BZ2_bzCompress, BZ2_bzCompressEnd
BZ2_bzDecompressInit, BZ2_bzDecompress, BZ2_bzDecompressEnd
BZ2_bzBuffToBuffCompress, BZ2_bzBuffToBuffDecompress

When compiled like this, all functions will ignore verbosity settings.

3.7.2. Critical error handling

libbzip2 contains a number of internal assertion checks which should, needless to say, never be activated. Nevertheless, if an assertion should fail, behaviour depends on whether or not the library was compiled with BZ_NO_STDIO set.

For a normal compile, an assertion failure yields the message:

bzip2/libbzip2: internal error number N.
This is a bug in bzip2/libbzip2, 1.0.6 of 6 September 2010.
Please report it to me at: jseward@bzip.org.  If this happened
when you were using some program which uses libbzip2 as a
component, you should also report this bug to the author(s)
of that program.  Please make an effort to report this bug;
timely and accurate bug reports eventually lead to higher
quality software.  Thanks.  Julian Seward, 6 September 2010.

where N is some error code number. If N == 1007, it also prints some extra text advising the reader that unreliable memory is often associated with internal error 1007. (This is a frequently-observed-phenomenon with versions 1.0.0/1.0.1).

exit(3) is then called.

For a stdio-free library, assertion failures result in a call to a function declared as:

extern void bz_internal_error ( int errcode );

The relevant code is passed as a parameter. You should supply such a function.

In either case, once an assertion failure has occurred, any bz_stream records involved can be regarded as invalid. You should not attempt to resume normal operation with them.

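For a stdio-free build, one way to supply bz_internal_error is to record the code and jump back to a recovery point set up by the caller, so that control never returns into the library. A minimal sketch, assuming setjmp/longjmp are available on the target; bz_recovery_point, bz_last_internal_error, run_guarded and simulate_failed_assertion are hypothetical names used only for illustration:

```c
#include <setjmp.h>

static jmp_buf      bz_recovery_point;          /* hypothetical recovery context */
static volatile int bz_last_internal_error = 0; /* last code reported */

/* The function a BZ_NO_STDIO build expects the application to supply.
   It records the code and jumps out instead of returning. */
void bz_internal_error(int errcode)
{
    bz_last_internal_error = errcode;
    longjmp(bz_recovery_point, 1);
}

/* Run `work`; return 0 if it completed, or the internal error code if
   bz_internal_error fired while it was running. */
int run_guarded(void (*work)(void))
{
    bz_last_internal_error = 0;
    if (setjmp(bz_recovery_point) != 0) {
        return bz_last_internal_error;
    }
    work();
    return 0;
}

/* Stand-ins for library calls, used only to exercise the sketch. */
static void simulate_failed_assertion(void) { bz_internal_error(1007); }
static void do_nothing(void) { }
```

Any bz_stream that was in use when the jump happened should be treated as invalid, and memory the library had allocated for it is not released by this sketch.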

You may, of course, change critical error handling to suit your needs. As I said above, critical errors indicate bugs in the library and should not occur. All "normal" error situations are indicated via error return codes from functions, and can be recovered from.

3.8. Making a Windows DLL

Everything related to Windows has been contributed by Yoshioka Tsuneo (tsuneo@rr.iij4u.or.jp), so you should send your queries to him (but perhaps Cc: me, jseward@bzip.org).

My vague understanding of what to do is: using Visual C++ 5.0, open the project file libbz2.dsp, and build. That's all.

If you can't open the project file for some reason, make a new one, naming these files: blocksort.c, bzlib.c, compress.c, crctable.c, decompress.c, huffman.c, randtable.c and libbz2.def. You will also need to name the header files bzlib.h and bzlib_private.h.

If you don't use VC++, you may need to define the preprocessor symbol _WIN32.

Finally, dlltest.c is a sample program using the DLL. It has a project file, dlltest.dsp.

If you just want a makefile for Visual C, have a look at makefile.msc.

Be aware that if you compile bzip2 itself on Win32, you must set BZ_UNIX to 0 and BZ_LCCWIN32 to 1, in the file bzip2.c, before compiling. Otherwise the resulting binary won't work correctly.

I haven't tried any of this stuff myself, but it all looks plausible.

4. Miscellanea

Table of Contents

4.1. Limitations of the compressed file format
4.2. Portability issues
4.3. Reporting bugs
4.4. Did you get the right package?
4.5. Further Reading

These are just some random thoughts of mine. Your mileage may vary.

4.1. Limitations of the compressed file format

bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the original version, bzip2-0.1. This decision was made in the interests of stability. Creating yet another incompatible compressed file format would create further confusion and disruption for users.

Nevertheless, this is not a painless decision. Development work since the release of bzip2-0.1 in August 1997 has shown complexities in the file format which slow down decompression and, in retrospect, are unnecessary. These are:

The run-length encoder, which is the first of the compression transformations, is entirely irrelevant. The original purpose was to protect the sorting algorithm from the very worst case input: a string of repeated symbols. But algorithm steps Q6a and Q6b in the original Burrows-Wheeler technical report (SRC-124) show how repeats can be handled without difficulty in block sorting.

The randomisation mechanism doesn't really need to be there. Udi Manber and Gene Myers published a suffix array construction algorithm a few years back, which can be employed to sort any block, no matter how repetitive, in O(N log N) time. Subsequent work by Kunihiko Sadakane has produced a derivative O(N (log N)^2) algorithm which usually outperforms the Manber-Myers algorithm.

I could have changed to Sadakane's algorithm, but I find it to be slower than bzip2's existing algorithm for most inputs, and the randomisation mechanism protects adequately against bad cases. I didn't think it was a good tradeoff to make. Partly this is due to the fact that I was not flooded with email complaints about bzip2-0.1's performance on repetitive data, so perhaps it isn't a problem for real inputs.

Probably the best long-term solution, and the one I have incorporated into 0.9.5 and above, is to use the existing sorting algorithm initially, and fall back to a O(N (log N)^2) algorithm if the standard algorithm gets into difficulties.

The compressed file format was never designed to be handled by a library, and I have had to jump through some hoops to produce an efficient implementation of decompression. It's a bit hairy. Try passing decompress.c through the C preprocessor and you'll see what I mean. Much of this complexity could have been avoided if the compressed size of each block of data was recorded in the data stream.

An Adler-32 checksum, rather than a CRC32 checksum, would be faster to compute.

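To illustrate the checksum point, Adler-32 keeps just two running sums modulo 65521 and needs no lookup table. A straightforward, unoptimised sketch; adler32_simple is a hypothetical name, and bzip2 itself uses CRC32, not this:

```c
#include <stddef.h>
#include <stdint.h>

/* Plain Adler-32: a is 1 plus the sum of all bytes, b is the sum of the
   successive values of a, both reduced modulo 65521 (the largest prime
   below 2^16). adler32_simple is a hypothetical helper name. */
static uint32_t adler32_simple(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}
```

Production implementations usually defer the modulo reduction across many bytes, which is where most of the speed advantage comes from.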

It would be fair to say that the bzip2 format was frozen before I properly and fully understood the performance consequences of doing so.

Improvements which I was able to incorporate into 0.9.0, despite using the same file format, are:

Single array implementation of the inverse BWT. This significantly speeds up decompression, presumably because it reduces the number of cache misses.
Faster inverse MTF transform for large MTF values. The new implementation is based on the notion of sliding blocks of values.
bzip2-0.9.0 now reads and writes files with fread and fwrite; version 0.1 used putc and getc. Duh! Well, you live and learn.

Further ahead, it would be nice to be able to do random access into files. This will require some careful design of compressed file formats.

4.2. Portability issues

After some consideration, I have decided not to use GNU autoconf to configure 0.9.5 or 1.0.

autoconf, admirable and wonderful though it is, mainly assists with portability problems between Unix-like platforms. But bzip2 doesn't have much in the way of portability problems on Unix; most of the difficulties appear when porting to the Mac, or to Microsoft's operating systems. autoconf doesn't help in those cases, and brings in a whole load of new complexity.

Most people should be able to compile the library and program under Unix straight out-of-the-box, so to speak, especially if you have a version of GNU C available.

There are a couple of __inline__ directives in the code. GNU C (gcc) should be able to handle them. If you're not using GNU C, your C compiler shouldn't see them at all. If your compiler does, for some reason, see them and doesn't like them, just #define __inline__ to be /* */. One easy way to do this is to compile with the flag -D__inline__=, which should be understood by most Unix compilers.

If you still have difficulties, try compiling with the macro BZ_STRICT_ANSI defined. This should enable you to build the library in a strictly ANSI compliant environment. Building the program itself like this is dangerous and not supported, since you remove bzip2's checks against compressing directories, symbolic links, devices, and other not-really-a-file entities. This could cause filesystem corruption!

One other thing: if you create a bzip2 binary for public distribution, please consider linking it statically (gcc -static). This avoids all sorts of library-version issues that others may encounter later on.

If you build bzip2 on Win32, you must set BZ_UNIX to 0 and BZ_LCCWIN32 to 1, in the file bzip2.c, before compiling. Otherwise the resulting binary won't work correctly.

cannam@89: 4.3. Reporting bugs

cannam@89:

I tried pretty hard to make sure cannam@89: bzip2 is bug free, both by cannam@89: design and by testing. Hopefully you'll never need to read this cannam@89: section for real.

cannam@89:

Nevertheless, if bzip2 dies cannam@89: with a segmentation fault, a bus error or an internal assertion cannam@89: failure, it will ask you to email me a bug report. Experience from cannam@89: years of feedback of bzip2 users indicates that almost all these cannam@89: problems can be traced to either compiler bugs or hardware cannam@89: problems.

cannam@89:

cannam@89:
Recompile the program with no optimisation, and cannam@89: see if it works. And/or try a different compiler. I heard all cannam@89: sorts of stories about various flavours of GNU C (and other cannam@89: compilers) generating bad code for cannam@89: bzip2, and I've run across two cannam@89: such examples myself.
cannam@89:
2.7.X versions of GNU C are known to generate bad code cannam@89: from time to time, at high optimisation levels. If you get cannam@89: problems, try using the flags cannam@89: -O2 cannam@89: -fomit-frame-pointer cannam@89: -fno-strength-reduce. You cannam@89: should specifically not use cannam@89: -funroll-loops.
cannam@89:
You may notice that the Makefile runs six tests as part cannam@89: of the build process. If the program passes all of these, it's cannam@89: a pretty good (but not 100%) indication that the compiler has cannam@89: done its job correctly.
cannam@89:
cannam@89:
If bzip2 cannam@89: crashes randomly, and the crashes are not repeatable, you may cannam@89: have a flaky memory subsystem. cannam@89: bzip2 really hammers your cannam@89: memory hierarchy, and if it's a bit marginal, you may get these cannam@89: problems. Ditto if your disk or I/O subsystem is slowly cannam@89: failing. Yup, this really does happen.
cannam@89:
Try using a different machine of the same type, and see cannam@89: if you can repeat the problem.
cannam@89:
This isn't really a bug, but ... If cannam@89: bzip2 tells you your file is cannam@89: corrupted on decompression, and you obtained the file via FTP, cannam@89: there is a possibility that you forgot to tell FTP to do a cannam@89: binary mode transfer. That absolutely will cause the file to cannam@89: be non-decompressible. You'll have to transfer it cannam@89: again.

cannam@89:

If you've incorporated cannam@89: libbzip2 into your own program cannam@89: and are getting problems, please, please, please, check that the cannam@89: parameters you are passing in calls to the library, are correct, cannam@89: and in accordance with what the documentation says is allowable. cannam@89: I have tried to make the library robust against such problems, cannam@89: but I'm sure I haven't succeeded.

Finally, if the above comments don't help, you'll have to send me a bug report. Now, it's just amazing how many people will send me a bug report saying something like:

bzip2 crashed with segmentation fault on my machine

and absolutely nothing else. Needless to say, such a report is totally, utterly, completely and comprehensively 100% useless; a waste of your time, my time, and net bandwidth. With no details at all, there's no way I can possibly begin to figure out what the problem is.

The rules of the game are: facts, facts, facts. Don't omit them because "oh, they won't be relevant". At the bare minimum:

Machine type.  Operating system version.
Exact version of bzip2 (do bzip2 -V).
Exact version of the compiler used.
Flags passed to the compiler.
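Some of these facts can be gathered from inside a C program. The sketch below is a hypothetical helper (describe_build is not part of bzip2) that relies on the __VERSION__ and __OPTIMIZE__ predefined macros, which GCC and Clang provide but the C standard does not; with other compilers it falls back to "unknown". It is only an aid and does not replace the output of bzip2 -V or the actual compiler command line.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of bzip2.
   Writes a one-line build summary into buf (at most size bytes,
   including the terminating NUL) and returns snprintf's result. */
int describe_build(char *buf, size_t size)
{
#if defined(__VERSION__)
    const char *compiler = __VERSION__;   /* GCC/Clang extension */
#else
    const char *compiler = "unknown";
#endif

#if defined(__OPTIMIZE__)
    const char *optimised = "yes";        /* defined by GCC/Clang when optimising */
#elif defined(__GNUC__)
    const char *optimised = "no";
#else
    const char *optimised = "unknown";    /* macro not available to judge */
#endif

    return snprintf(buf, size, "compiler: %s; optimised: %s",
                    compiler, optimised);
}
```

The resulting line could be pasted into a bug report alongside the machine type, operating system version and the exact flags used.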

However, the most important single thing that will help me is the file that you were trying to compress or decompress at the time the problem happened. Without that, my ability to do anything more than speculate about the cause is limited.

4.4. Did you get the right package?

bzip2 is a resource hog. It soaks up large amounts of CPU cycles and memory. Also, it gives very large latencies. In the worst case, you can feed many megabytes of uncompressed data into the library before getting any compressed output, so this probably rules out applications requiring interactive behaviour.

These aren't faults of my implementation, I hope, but more an intrinsic property of the Burrows-Wheeler transform (unfortunately). Maybe this isn't what you want.

If you want a compressor and/or library which is faster, uses less memory but gets pretty good compression, and has minimal latency, consider Jean-loup Gailly's and Mark Adler's work, zlib-1.2.1 and gzip-1.2.4. Look for them at http://www.zlib.org and http://www.gzip.org respectively.

For something faster and lighter still, you might try Markus F X J Oberhumer's LZO real-time compression/decompression library, at http://www.oberhumer.com/opensource.

4.5. Further Reading

bzip2 is not research work, in the sense that it doesn't present any new ideas. Rather, it's an engineering exercise based on existing ideas.

Four documents describe essentially all the ideas behind bzip2:

Michael Burrows and D. J. Wheeler:
  "A block-sorting lossless data compression algorithm"
   10th May 1994.
   Digital SRC Research Report 124.
   ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
   If you have trouble finding it, try searching at the
   New Zealand Digital Library, http://www.nzdl.org.

Daniel S. Hirschberg and Debra A. LeLewer
  "Efficient Decoding of Prefix Codes"
   Communications of the ACM, April 1990, Vol 33, Number 4.
   You might be able to get an electronic copy of this
   from the ACM Digital Library.

David J. Wheeler
   Program bred3.c and accompanying document bred3.ps.
   This contains the idea behind the multi-table Huffman coding scheme.
   ftp://ftp.cl.cam.ac.uk/users/djw3/

Jon L. Bentley and Robert Sedgewick
  "Fast Algorithms for Sorting and Searching Strings"
   Available from Sedgewick's web page,
   www.cs.princeton.edu/~rs

The following paper gives valuable additional insights into the algorithm, but is not immediately the basis of any code used in bzip2.

Peter Fenwick:
   Block Sorting Text Compression
   Proceedings of the 19th Australasian Computer Science Conference,
     Melbourne, Australia.  Jan 31 - Feb 2, 1996.
   ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps

Kunihiko Sadakane's sorting algorithm, mentioned above, is available from:

http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz

The Manber-Myers suffix array construction algorithm is described in a paper available from:

http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps

Finally, the following papers document some investigations I made into the performance of sorting and decompression algorithms:

Julian Seward
   On the Performance of BWT Sorting Algorithms
   Proceedings of the IEEE Data Compression Conference 2000
     Snowbird, Utah.  28-30 March 2000.

Julian Seward
   Space-time Tradeoffs in the Inverse B-W Transform
   Proceedings of the IEEE Data Compression Conference 2001
     Snowbird, Utah.  27-29 March 2001.


1. Introduction
2. How to use bzip2
  2.1. NAME
  2.2. SYNOPSIS
  2.3. DESCRIPTION
  2.4. OPTIONS
  2.5. MEMORY MANAGEMENT
  2.6. RECOVERING DATA FROM DAMAGED FILES
  2.7. PERFORMANCE NOTES
  2.8. CAVEATS
  2.9. AUTHOR
3. Programming with libbzip2
  3.1. Top-level structure
    3.1.1. Low-level summary
    3.1.2. High-level summary
    3.1.3. Utility functions summary
  3.2. Error handling
  3.3. Low-level interface
    3.3.1. BZ2_bzCompressInit
    3.3.2. BZ2_bzCompress
    3.3.3. BZ2_bzCompressEnd
    3.3.4. BZ2_bzDecompressInit
    3.3.5. BZ2_bzDecompress
    3.3.6. BZ2_bzDecompressEnd
  3.4. High-level interface
    3.4.1. BZ2_bzReadOpen
    3.4.2. BZ2_bzRead
    3.4.3. BZ2_bzReadGetUnused
    3.4.4. BZ2_bzReadClose
    3.4.5. BZ2_bzWriteOpen
    3.4.6. BZ2_bzWrite
    3.4.7. BZ2_bzWriteClose
    3.4.8. Handling embedded compressed data streams
    3.4.9. Standard file-reading/writing code
  3.5. Utility functions
    3.5.1. BZ2_bzBuffToBuffCompress
    3.5.2. BZ2_bzBuffToBuffDecompress
  3.6. zlib compatibility functions
  3.7. Using the library in a stdio-free environment
    3.7.1. Getting rid of stdio
    3.7.2. Critical error handling
  3.8. Making a Windows DLL
4. Miscellanea
  4.1. Limitations of the compressed file format
  4.2. Portability issues
  4.3. Reporting bugs
  4.4. Did you get the right package?
  4.5. Further Reading

3. Programming with `libbzip2`