annotate src/bzip2-1.0.6/manual.xml @ 169:223a55898ab9 tip default

Add null config files
author Chris Cannam <cannam@all-day-breakfast.com>
date Mon, 02 Mar 2020 14:03:47 +0000
parents 8a15ff55d9af
children
rev   line source
cannam@89 1 <?xml version="1.0"?> <!-- -*- sgml -*- -->
cannam@89 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
cannam@89 3 "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"[
cannam@89 4
cannam@89 5 <!-- various strings, dates etc. common to all docs -->
cannam@89 6 <!ENTITY % common-ents SYSTEM "entities.xml"> %common-ents;
cannam@89 7 ]>
cannam@89 8
cannam@89 9 <book lang="en" id="userman" xreflabel="bzip2 Manual">
cannam@89 10
cannam@89 11 <bookinfo>
cannam@89 12 <title>bzip2 and libbzip2, version 1.0.6</title>
cannam@89 13 <subtitle>A program and library for data compression</subtitle>
cannam@89 14 <copyright>
cannam@89 15 <year>&bz-lifespan;</year>
cannam@89 16 <holder>Julian Seward</holder>
cannam@89 17 </copyright>
cannam@89 18 <releaseinfo>Version &bz-version; of &bz-date;</releaseinfo>
cannam@89 19
cannam@89 20 <authorgroup>
cannam@89 21 <author>
cannam@89 22 <firstname>Julian</firstname>
cannam@89 23 <surname>Seward</surname>
cannam@89 24 <affiliation>
cannam@89 25 <orgname>&bz-url;</orgname>
cannam@89 26 </affiliation>
cannam@89 27 </author>
cannam@89 28 </authorgroup>
cannam@89 29
cannam@89 30 <legalnotice>
cannam@89 31
cannam@89 32 <para>This program, <computeroutput>bzip2</computeroutput>, the
cannam@89 33 associated library <computeroutput>libbzip2</computeroutput>, and
cannam@89 34 all documentation, are copyright &copy; &bz-lifespan; Julian Seward.
cannam@89 35 All rights reserved.</para>
cannam@89 36
cannam@89 37 <para>Redistribution and use in source and binary forms, with
cannam@89 38 or without modification, are permitted provided that the
cannam@89 39 following conditions are met:</para>
cannam@89 40
cannam@89 41 <itemizedlist mark='bullet'>
cannam@89 42
cannam@89 43 <listitem><para>Redistributions of source code must retain the
cannam@89 44 above copyright notice, this list of conditions and the
cannam@89 45 following disclaimer.</para></listitem>
cannam@89 46
cannam@89 47 <listitem><para>The origin of this software must not be
cannam@89 48 misrepresented; you must not claim that you wrote the original
cannam@89 49 software. If you use this software in a product, an
cannam@89 50 acknowledgment in the product documentation would be
cannam@89 51 appreciated but is not required.</para></listitem>
cannam@89 52
cannam@89 53 <listitem><para>Altered source versions must be plainly marked
cannam@89 54 as such, and must not be misrepresented as being the original
cannam@89 55 software.</para></listitem>
cannam@89 56
cannam@89 57 <listitem><para>The name of the author may not be used to
cannam@89 58 endorse or promote products derived from this software without
cannam@89 59 specific prior written permission.</para></listitem>
cannam@89 60
cannam@89 61 </itemizedlist>
cannam@89 62
cannam@89 63 <para>THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY
cannam@89 64 EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
cannam@89 65 THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
cannam@89 66 PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
cannam@89 67 AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
cannam@89 68 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
cannam@89 69 TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
cannam@89 70 DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
cannam@89 71 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
cannam@89 72 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
cannam@89 73 IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
cannam@89 74 THE POSSIBILITY OF SUCH DAMAGE.</para>
cannam@89 75
cannam@89 76 <para>PATENTS: To the best of my knowledge,
cannam@89 77 <computeroutput>bzip2</computeroutput> and
cannam@89 78 <computeroutput>libbzip2</computeroutput> do not use any patented
cannam@89 79 algorithms. However, I do not have the resources to carry
cannam@89 80 out a patent search. Therefore I cannot give any guarantee of
cannam@89 81 the above statement.
cannam@89 82 </para>
cannam@89 83
cannam@89 84 </legalnotice>
cannam@89 85
cannam@89 86 </bookinfo>
cannam@89 87
cannam@89 88
cannam@89 89
cannam@89 90 <chapter id="intro" xreflabel="Introduction">
cannam@89 91 <title>Introduction</title>
cannam@89 92
cannam@89 93 <para><computeroutput>bzip2</computeroutput> compresses files
cannam@89 94 using the Burrows-Wheeler block-sorting text compression
cannam@89 95 algorithm, and Huffman coding. Compression is generally
cannam@89 96 considerably better than that achieved by more conventional
cannam@89 97 LZ77/LZ78-based compressors, and approaches the performance of
cannam@89 98 the PPM family of statistical compressors.</para>
cannam@89 99
cannam@89 100 <para><computeroutput>bzip2</computeroutput> is built on top of
cannam@89 101 <computeroutput>libbzip2</computeroutput>, a flexible library for
cannam@89 102 handling compressed data in the
cannam@89 103 <computeroutput>bzip2</computeroutput> format. This manual
cannam@89 104 describes both how to use the program and how to work with the
cannam@89 105 library interface. Most of the manual is devoted to this
cannam@89 106 library, not the program, which is good news if your interest is
cannam@89 107 only in the program.</para>
cannam@89 108
cannam@89 109 <itemizedlist mark='bullet'>
cannam@89 110
cannam@89 111 <listitem><para><xref linkend="using"/> describes how to use
cannam@89 112 <computeroutput>bzip2</computeroutput>; this is the only part
cannam@89 113 you need to read if you just want to know how to operate the
cannam@89 114 program.</para></listitem>
cannam@89 115
cannam@89 116 <listitem><para><xref linkend="libprog"/> describes the
cannam@89 117 programming interfaces in detail, and</para></listitem>
cannam@89 118
cannam@89 119 <listitem><para><xref linkend="misc"/> records some
cannam@89 120 miscellaneous notes which I thought ought to be recorded
cannam@89 121 somewhere.</para></listitem>
cannam@89 122
cannam@89 123 </itemizedlist>
cannam@89 124
cannam@89 125 </chapter>
cannam@89 126
cannam@89 127
cannam@89 128 <chapter id="using" xreflabel="How to use bzip2">
cannam@89 129 <title>How to use bzip2</title>
cannam@89 130
cannam@89 131 <para>This chapter contains a copy of the
cannam@89 132 <computeroutput>bzip2</computeroutput> man page, and nothing
cannam@89 133 else.</para>
cannam@89 134
cannam@89 135 <sect1 id="name" xreflabel="NAME">
cannam@89 136 <title>NAME</title>
cannam@89 137
cannam@89 138 <itemizedlist mark='bullet'>
cannam@89 139
cannam@89 140 <listitem><para><computeroutput>bzip2</computeroutput>,
cannam@89 141 <computeroutput>bunzip2</computeroutput> - a block-sorting file
cannam@89 142 compressor, v1.0.6</para></listitem>
cannam@89 143
cannam@89 144 <listitem><para><computeroutput>bzcat</computeroutput> -
cannam@89 145 decompresses files to stdout</para></listitem>
cannam@89 146
cannam@89 147 <listitem><para><computeroutput>bzip2recover</computeroutput> -
cannam@89 148 recovers data from damaged bzip2 files</para></listitem>
cannam@89 149
cannam@89 150 </itemizedlist>
cannam@89 151
cannam@89 152 </sect1>
cannam@89 153
cannam@89 154
cannam@89 155 <sect1 id="synopsis" xreflabel="SYNOPSIS">
cannam@89 156 <title>SYNOPSIS</title>
cannam@89 157
cannam@89 158 <itemizedlist mark='bullet'>
cannam@89 159
cannam@89 160 <listitem><para><computeroutput>bzip2</computeroutput> [
cannam@89 161 -cdfkqstvzVL123456789 ] [ filenames ... ]</para></listitem>
cannam@89 162
cannam@89 163 <listitem><para><computeroutput>bunzip2</computeroutput> [
cannam@89 164 -fkvsVL ] [ filenames ... ]</para></listitem>
cannam@89 165
cannam@89 166 <listitem><para><computeroutput>bzcat</computeroutput> [ -s ] [
cannam@89 167 filenames ... ]</para></listitem>
cannam@89 168
cannam@89 169 <listitem><para><computeroutput>bzip2recover</computeroutput>
cannam@89 170 filename</para></listitem>
cannam@89 171
cannam@89 172 </itemizedlist>
cannam@89 173
cannam@89 174 </sect1>
cannam@89 175
cannam@89 176
cannam@89 177 <sect1 id="description" xreflabel="DESCRIPTION">
cannam@89 178 <title>DESCRIPTION</title>
cannam@89 179
cannam@89 180 <para><computeroutput>bzip2</computeroutput> compresses files
cannam@89 181 using the Burrows-Wheeler block-sorting text compression
cannam@89 182 algorithm, and Huffman coding. Compression is generally
cannam@89 183 considerably better than that achieved by more conventional
cannam@89 184 LZ77/LZ78-based compressors, and approaches the performance of
cannam@89 185 the PPM family of statistical compressors.</para>
cannam@89 186
cannam@89 187 <para>The command-line options are deliberately very similar to
cannam@89 188 those of GNU <computeroutput>gzip</computeroutput>, but they are
cannam@89 189 not identical.</para>
cannam@89 190
cannam@89 191 <para><computeroutput>bzip2</computeroutput> expects a list of
cannam@89 192 file names to accompany the command-line flags. Each file is
cannam@89 193 replaced by a compressed version of itself, with the name
cannam@89 194 <computeroutput>original_name.bz2</computeroutput>. Each
cannam@89 195 compressed file has the same modification date, permissions, and,
cannam@89 196 when possible, ownership as the corresponding original, so that
cannam@89 197 these properties can be correctly restored at decompression time.
cannam@89 198 File name handling is naive in the sense that there is no
cannam@89 199 mechanism for preserving original file names, permissions,
cannam@89 200 ownerships or dates in filesystems which lack these concepts, or
cannam@89 201 have serious file name length restrictions, such as
cannam@89 202 MS-DOS.</para>
cannam@89 203
cannam@89 204 <para><computeroutput>bzip2</computeroutput> and
cannam@89 205 <computeroutput>bunzip2</computeroutput> will by default not
cannam@89 206 overwrite existing files. If you want this to happen, specify
cannam@89 207 the <computeroutput>-f</computeroutput> flag.</para>
cannam@89 208
cannam@89 209 <para>If no file names are specified,
cannam@89 210 <computeroutput>bzip2</computeroutput> compresses from standard
cannam@89 211 input to standard output. In this case,
cannam@89 212 <computeroutput>bzip2</computeroutput> will decline to write
cannam@89 213 compressed output to a terminal, as this would be entirely
cannam@89 214 incomprehensible and therefore pointless.</para>
cannam@89 215
cannam@89 216 <para><computeroutput>bunzip2</computeroutput> (or
cannam@89 217 <computeroutput>bzip2 -d</computeroutput>) decompresses all
cannam@89 218 specified files. Files which were not created by
cannam@89 219 <computeroutput>bzip2</computeroutput> will be detected and
cannam@89 220 ignored, and a warning issued.
cannam@89 221 <computeroutput>bzip2</computeroutput> attempts to guess the
cannam@89 222 filename for the decompressed file from that of the compressed
cannam@89 223 file as follows:</para>
cannam@89 224
cannam@89 225 <itemizedlist mark='bullet'>
cannam@89 226
cannam@89 227 <listitem><para><computeroutput>filename.bz2 </computeroutput>
cannam@89 228 becomes
cannam@89 229 <computeroutput>filename</computeroutput></para></listitem>
cannam@89 230
cannam@89 231 <listitem><para><computeroutput>filename.bz </computeroutput>
cannam@89 232 becomes
cannam@89 233 <computeroutput>filename</computeroutput></para></listitem>
cannam@89 234
cannam@89 235 <listitem><para><computeroutput>filename.tbz2</computeroutput>
cannam@89 236 becomes
cannam@89 237 <computeroutput>filename.tar</computeroutput></para></listitem>
cannam@89 238
cannam@89 239 <listitem><para><computeroutput>filename.tbz </computeroutput>
cannam@89 240 becomes
cannam@89 241 <computeroutput>filename.tar</computeroutput></para></listitem>
cannam@89 242
cannam@89 243 <listitem><para><computeroutput>anyothername </computeroutput>
cannam@89 244 becomes
cannam@89 245 <computeroutput>anyothername.out</computeroutput></para></listitem>
cannam@89 246
cannam@89 247 </itemizedlist>
cannam@89 248
cannam@89 249 <para>If the file does not end in one of the recognised endings,
cannam@89 250 <computeroutput>.bz2</computeroutput>,
cannam@89 251 <computeroutput>.bz</computeroutput>,
cannam@89 252 <computeroutput>.tbz2</computeroutput> or
cannam@89 253 <computeroutput>.tbz</computeroutput>,
cannam@89 254 <computeroutput>bzip2</computeroutput> complains that it cannot
cannam@89 255 guess the name of the original file, and uses the original name
cannam@89 256 with <computeroutput>.out</computeroutput> appended.</para>
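<para>The renaming rules above can be sketched as a small lookup. This is only an illustration of the listed rules, not the program's actual code; the helper name <computeroutput>guess_output_name</computeroutput> and the table <computeroutput>SUFFIX_MAP</computeroutput> are hypothetical.</para>

```python
# Suffix rules as listed in the manual: recognised ending -> replacement.
# SUFFIX_MAP and guess_output_name are hypothetical names for illustration.
SUFFIX_MAP = [
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
    (".bz2", ""),
    (".bz", ""),
]


def guess_output_name(compressed_name: str) -> str:
    """Return the output name the rules above would produce."""
    for suffix, replacement in SUFFIX_MAP:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: the manual says ".out" is appended.
    return compressed_name + ".out"


for name in ("notes.bz2", "notes.bz", "src.tbz2", "src.tbz", "data.gz"):
    print(name, "->", guess_output_name(name))
```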
cannam@89 257
cannam@89 258 <para>As with compression, supplying no filenames causes
cannam@89 259 decompression from standard input to standard output.</para>
cannam@89 260
cannam@89 261 <para><computeroutput>bunzip2</computeroutput> will correctly
cannam@89 262 decompress a file which is the concatenation of two or more
cannam@89 263 compressed files. The result is the concatenation of the
cannam@89 264 corresponding uncompressed files. Integrity testing
cannam@89 265 (<computeroutput>-t</computeroutput>) of concatenated compressed
cannam@89 266 files is also supported.</para>
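<para>The same behaviour can be observed through Python's standard <computeroutput>bz2</computeroutput> module, which is used here only as a convenient way to exercise the format; the sample data is made up.</para>

```python
import bz2

# Two independently compressed streams, as if from two separate files.
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Concatenating the compressed streams is analogous to `cat a.bz2 b.bz2`.
combined = first + second

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(combined)
print(restored)
```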
cannam@89 267
cannam@89 268 <para>You can also compress or decompress files to the standard
cannam@89 269 output by giving the <computeroutput>-c</computeroutput> flag.
cannam@89 270 Multiple files may be compressed and decompressed like this. The
cannam@89 271 resulting outputs are fed sequentially to stdout. Compression of
cannam@89 272 multiple files in this manner generates a stream containing
cannam@89 273 multiple compressed file representations. Such a stream can be
cannam@89 274 decompressed correctly only by
cannam@89 275 <computeroutput>bzip2</computeroutput> version 0.9.0 or later.
cannam@89 276 Earlier versions of <computeroutput>bzip2</computeroutput> will
cannam@89 277 stop after decompressing the first file in the stream.</para>
cannam@89 278
cannam@89 279 <para><computeroutput>bzcat</computeroutput> (or
cannam@89 280 <computeroutput>bzip2 -dc</computeroutput>) decompresses all
cannam@89 281 specified files to the standard output.</para>
cannam@89 282
cannam@89 283 <para><computeroutput>bzip2</computeroutput> will read arguments
cannam@89 284 from the environment variables
cannam@89 285 <computeroutput>BZIP2</computeroutput> and
cannam@89 286 <computeroutput>BZIP</computeroutput>, in that order, and will
cannam@89 287 process them before any arguments read from the command line.
cannam@89 288 This gives a convenient way to supply default arguments.</para>
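<para>A minimal sketch of the ordering described above. The function name <computeroutput>effective_arguments</computeroutput> is hypothetical, and splitting the variables on whitespace is an assumption made for illustration.</para>

```python
def effective_arguments(environ, command_line):
    """Return arguments in the order they would be processed.

    Hypothetical helper: BZIP2 first, then BZIP, then the command line.
    Splitting each variable on whitespace is an assumption.
    """
    args = []
    for variable in ("BZIP2", "BZIP"):
        args.extend(environ.get(variable, "").split())
    return args + list(command_line)


example_env = {"BZIP2": "-9 -k", "BZIP": "-v"}
print(effective_arguments(example_env, ["-c", "notes.txt"]))
```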
cannam@89 289
cannam@89 290 <para>Compression is always performed, even if the compressed
cannam@89 291 file is slightly larger than the original. Files of less than
cannam@89 292 about one hundred bytes tend to get larger, since the compression
cannam@89 293 mechanism has a constant overhead in the region of 50 bytes.
cannam@89 294 Random data (including the output of most file compressors) is
cannam@89 295 coded at about 8.05 bits per byte, giving an expansion of around
cannam@89 296 0.5%.</para>
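<para>Both effects are easy to observe with Python's standard <computeroutput>bz2</computeroutput> module. The sample inputs below are arbitrary, and the exact sizes printed will depend on the data.</para>

```python
import bz2
import random

# A very small input: the fixed overhead outweighs any saving.
tiny = b"hello, world\n"
tiny_out = bz2.compress(tiny)
print(len(tiny), "bytes ->", len(tiny_out), "bytes")

# Pseudo-random bytes stand in for incompressible data.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100_000))
noise_out = bz2.compress(noise)
bits_per_byte = 8 * len(noise_out) / len(noise)
print(len(noise), "bytes ->", len(noise_out), "bytes")
print(f"{bits_per_byte:.2f} bits per input byte")
```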
cannam@89 297
cannam@89 298 <para>As a self-check for your protection,
cannam@89 299 <computeroutput>bzip2</computeroutput> uses 32-bit CRCs to make
cannam@89 300 sure that the decompressed version of a file is identical to the
cannam@89 301 original. This guards against corruption of the compressed data,
cannam@89 302 and against undetected bugs in
cannam@89 303 <computeroutput>bzip2</computeroutput> (hopefully very unlikely).
cannam@89 304 The chance of data corruption going undetected is microscopic,
cannam@89 305 about one chance in four billion for each file processed. Be
cannam@89 306 aware, though, that the check occurs upon decompression, so it
cannam@89 307 can only tell you that something is wrong. It can't help you
cannam@89 308 recover the original uncompressed data. You can use
cannam@89 309 <computeroutput>bzip2recover</computeroutput> to try to recover
cannam@89 310 data from damaged files.</para>
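<para>The check can be seen in action by damaging a compressed stream on purpose. The sketch below uses Python's standard <computeroutput>bz2</computeroutput> module; the byte offset of the stored block CRC is an assumption about the file layout, not something stated in this chapter.</para>

```python
import bz2

original = b"important data\n" * 100
stream = bytearray(bz2.compress(original))

# Assumed layout: 4-byte stream header, 6-byte block marker, then the
# 4-byte block CRC. Flip one bit of the stored CRC at offset 10.
stream[10] ^= 0x01

try:
    bz2.decompress(bytes(stream))
    detected = False
except OSError as error:
    # The mismatch is reported, but the original data is not recovered.
    detected = True
    print("corruption detected:", error)
```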
cannam@89 311
cannam@89 312 <para>Return values: 0 for a normal exit, 1 for environmental
cannam@89 313 problems (file not found, invalid flags, I/O errors, etc.), 2
cannam@89 314 to indicate a corrupt compressed file, 3 for an internal
cannam@89 315 consistency error (e.g. a bug) which caused
cannam@89 316 <computeroutput>bzip2</computeroutput> to panic.</para>
cannam@89 317
cannam@89 318 </sect1>
cannam@89 319
cannam@89 320
cannam@89 321 <sect1 id="options" xreflabel="OPTIONS">
cannam@89 322 <title>OPTIONS</title>
cannam@89 323
cannam@89 324 <variablelist>
cannam@89 325
cannam@89 326 <varlistentry>
cannam@89 327 <term><computeroutput>-c --stdout</computeroutput></term>
cannam@89 328 <listitem><para>Compress or decompress to standard
cannam@89 329 output.</para></listitem>
cannam@89 330 </varlistentry>
cannam@89 331
cannam@89 332 <varlistentry>
cannam@89 333 <term><computeroutput>-d --decompress</computeroutput></term>
cannam@89 334 <listitem><para>Force decompression.
cannam@89 335 <computeroutput>bzip2</computeroutput>,
cannam@89 336 <computeroutput>bunzip2</computeroutput> and
cannam@89 337 <computeroutput>bzcat</computeroutput> are really the same
cannam@89 338 program, and the decision about what actions to take is done on
cannam@89 339 the basis of which name is used. This flag overrides that
cannam@89 340 mechanism, and forces bzip2 to decompress.</para></listitem>
cannam@89 341 </varlistentry>
cannam@89 342
cannam@89 343 <varlistentry>
cannam@89 344 <term><computeroutput>-z --compress</computeroutput></term>
cannam@89 345 <listitem><para>The complement to
cannam@89 346 <computeroutput>-d</computeroutput>: forces compression,
cannam@89 347 regardless of the invocation name.</para></listitem>
cannam@89 348 </varlistentry>
cannam@89 349
cannam@89 350 <varlistentry>
cannam@89 351 <term><computeroutput>-t --test</computeroutput></term>
cannam@89 352 <listitem><para>Check integrity of the specified file(s), but
cannam@89 353 don't decompress them. This really performs a trial
cannam@89 354 decompression and throws away the result.</para></listitem>
cannam@89 355 </varlistentry>
cannam@89 356
cannam@89 357 <varlistentry>
cannam@89 358 <term><computeroutput>-f --force</computeroutput></term>
cannam@89 359 <listitem><para>Force overwrite of output files. Normally,
cannam@89 360 <computeroutput>bzip2</computeroutput> will not overwrite
cannam@89 361 existing output files. Also forces
cannam@89 362 <computeroutput>bzip2</computeroutput> to break hard links to
cannam@89 363 files, which it otherwise wouldn't do.</para>
cannam@89 364 <para><computeroutput>bzip2</computeroutput> normally declines
cannam@89 365 to decompress files which don't have the correct magic header
cannam@89 366 bytes. If forced (<computeroutput>-f</computeroutput>),
cannam@89 367 however, it will pass such files through unmodified. This is
cannam@89 368 how GNU <computeroutput>gzip</computeroutput> behaves.</para>
cannam@89 369 </listitem>
cannam@89 370 </varlistentry>
cannam@89 371
cannam@89 372 <varlistentry>
cannam@89 373 <term><computeroutput>-k --keep</computeroutput></term>
cannam@89 374 <listitem><para>Keep (don't delete) input files during
cannam@89 375 compression or decompression.</para></listitem>
cannam@89 376 </varlistentry>
cannam@89 377
cannam@89 378 <varlistentry>
cannam@89 379 <term><computeroutput>-s --small</computeroutput></term>
cannam@89 380 <listitem><para>Reduce memory usage, for compression,
cannam@89 381 decompression and testing. Files are decompressed and tested
cannam@89 382 using a modified algorithm which only requires 2.5 bytes per
cannam@89 383 block byte. This means any file can be decompressed in 2300k
cannam@89 384 of memory, albeit at about half the normal speed.</para>
cannam@89 385 <para>During compression, <computeroutput>-s</computeroutput>
cannam@89 386 selects a block size of 200k, which limits memory use to around
cannam@89 387 the same figure, at the expense of your compression ratio. In
cannam@89 388 short, if your machine is low on memory (8 megabytes or less),
cannam@89 389 use <computeroutput>-s</computeroutput> for everything. See
cannam@89 390 <xref linkend="memory-management"/> below.</para></listitem>
cannam@89 391 </varlistentry>
cannam@89 392
cannam@89 393 <varlistentry>
cannam@89 394 <term><computeroutput>-q --quiet</computeroutput></term>
cannam@89 395 <listitem><para>Suppress non-essential warning messages.
cannam@89 396 Messages pertaining to I/O errors and other critical events
cannam@89 397 will not be suppressed.</para></listitem>
cannam@89 398 </varlistentry>
cannam@89 399
cannam@89 400 <varlistentry>
cannam@89 401 <term><computeroutput>-v --verbose</computeroutput></term>
cannam@89 402 <listitem><para>Verbose mode -- show the compression ratio for
cannam@89 403 each file processed. Further
cannam@89 404 <computeroutput>-v</computeroutput>'s increase the verbosity
cannam@89 405 level, spewing out lots of information which is primarily of
cannam@89 406 interest for diagnostic purposes.</para></listitem>
cannam@89 407 </varlistentry>
cannam@89 408
cannam@89 409 <varlistentry>
cannam@89 410 <term><computeroutput>-L --license -V --version</computeroutput></term>
cannam@89 411 <listitem><para>Display the software version, license terms and
cannam@89 412 conditions.</para></listitem>
cannam@89 413 </varlistentry>
cannam@89 414
cannam@89 415 <varlistentry>
cannam@89 416 <term><computeroutput>-1</computeroutput> (or
cannam@89 417 <computeroutput>--fast</computeroutput>) to
cannam@89 418 <computeroutput>-9</computeroutput> (or
cannam@89 419 <computeroutput>--best</computeroutput>)</term>
cannam@89 420 <listitem><para>Set the block size to 100 k, 200 k ... 900 k
cannam@89 421 when compressing. Has no effect when decompressing. See <xref
cannam@89 422 linkend="memory-management" /> below. The
cannam@89 423 <computeroutput>--fast</computeroutput> and
cannam@89 424 <computeroutput>--best</computeroutput> aliases are primarily
cannam@89 425 for GNU <computeroutput>gzip</computeroutput> compatibility.
cannam@89 426 In particular, <computeroutput>--fast</computeroutput> doesn't
cannam@89 427 make things significantly faster. And
cannam@89 428 <computeroutput>--best</computeroutput> merely selects the
cannam@89 429 default behaviour.</para></listitem>
cannam@89 430 </varlistentry>
cannam@89 431
cannam@89 432 <varlistentry>
cannam@89 433 <term><computeroutput>--</computeroutput></term>
cannam@89 434 <listitem><para>Treats all subsequent arguments as file names,
cannam@89 435 even if they start with a dash. This is so you can handle
cannam@89 436 files with names beginning with a dash, for example:
cannam@89 437 <computeroutput>bzip2 --
cannam@89 438 -myfilename</computeroutput>.</para></listitem>
cannam@89 439 </varlistentry>
cannam@89 440
cannam@89 441 <varlistentry>
cannam@89 442 <term><computeroutput>--repetitive-fast</computeroutput></term>
cannam@89 443 <term><computeroutput>--repetitive-best</computeroutput></term>
cannam@89 444 <listitem><para>These flags are redundant in versions 0.9.5 and
cannam@89 445 above. They provided some coarse control over the behaviour of
cannam@89 446 the sorting algorithm in earlier versions, which was sometimes
cannam@89 447 useful. 0.9.5 and above have an improved algorithm which
cannam@89 448 renders these flags irrelevant.</para></listitem>
cannam@89 449 </varlistentry>
cannam@89 450
cannam@89 451 </variablelist>
cannam@89 452
cannam@89 453 </sect1>
cannam@89 454
cannam@89 455
cannam@89 456 <sect1 id="memory-management" xreflabel="MEMORY MANAGEMENT">
cannam@89 457 <title>MEMORY MANAGEMENT</title>
cannam@89 458
cannam@89 459 <para><computeroutput>bzip2</computeroutput> compresses large
cannam@89 460 files in blocks. The block size affects both the compression
cannam@89 461 ratio achieved, and the amount of memory needed for compression
cannam@89 462 and decompression. The flags <computeroutput>-1</computeroutput>
cannam@89 463 through <computeroutput>-9</computeroutput> specify the block
cannam@89 464 size to be 100,000 bytes through 900,000 bytes (the default)
cannam@89 465 respectively. At decompression time, the block size used for
cannam@89 466 compression is read from the header of the compressed file, and
cannam@89 467 <computeroutput>bunzip2</computeroutput> then allocates itself
cannam@89 468 just enough memory to decompress the file. Since block sizes are
cannam@89 469 stored in compressed files, it follows that the flags
cannam@89 470 <computeroutput>-1</computeroutput> to
cannam@89 471 <computeroutput>-9</computeroutput> are irrelevant to and so
cannam@89 472 ignored during decompression.</para>
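<para>The block size recorded in the header can be inspected directly. In this sketch, Python's <computeroutput>compresslevel</computeroutput> argument plays the role of the <computeroutput>-1</computeroutput> to <computeroutput>-9</computeroutput> flags; reading the level from the fourth header byte is an assumption about the file layout.</para>

```python
import bz2

data = b"example payload\n" * 50

for level in (1, 5, 9):
    # The first four bytes are the stream header; the last of them is
    # assumed to be the block-size digit chosen at compression time.
    header = bz2.compress(data, compresslevel=level)[:4]
    print(level, header)

# Decompression needs no level argument: it reads the header instead.
restored = bz2.decompress(bz2.compress(data, compresslevel=1))
```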
cannam@89 473
cannam@89 474 <para>Compression and decompression requirements, in bytes, can be
cannam@89 475 estimated as:</para>
cannam@89 476 <programlisting>
cannam@89 477 Compression: 400k + ( 8 x block size )
cannam@89 478
cannam@89 479 Decompression: 100k + ( 4 x block size ), or
cannam@89 480 100k + ( 2.5 x block size )
cannam@89 481 </programlisting>
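<para>The estimates above are simple enough to evaluate per flag. The function names below are hypothetical; all figures are in units of k, with the block size taken as 100k times the flag value.</para>

```python
def compress_kbytes(level):
    """Estimated compression memory in k: 400k + 8 x block size."""
    return 400 + 8 * (100 * level)


def decompress_kbytes(level, small=False):
    """Estimated decompression memory in k.

    Normal mode: 100k + 4 x block size.
    With -s (small=True): 100k + 2.5 x block size.
    """
    factor = 2.5 if small else 4
    return 100 + factor * (100 * level)


for level in range(1, 10):
    print(
        f"-{level}",
        f"{compress_kbytes(level)}k",
        f"{decompress_kbytes(level)}k",
        f"{decompress_kbytes(level, small=True):.0f}k",
    )
```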
cannam@89 482
cannam@89 483 <para>Larger block sizes give rapidly diminishing marginal
cannam@89 484 returns. Most of the compression comes from the first two or
cannam@89 485 three hundred k of block size, a fact worth bearing in mind when
cannam@89 486 using <computeroutput>bzip2</computeroutput> on small machines.
cannam@89 487 It is also important to appreciate that the decompression memory
cannam@89 488 requirement is set at compression time by the choice of block
cannam@89 489 size.</para>
cannam@89 490
cannam@89 491 <para>For files compressed with the default 900k block size,
cannam@89 492 <computeroutput>bunzip2</computeroutput> will require about 3700
cannam@89 493 kbytes to decompress. To support decompression of any file on a
cannam@89 494 4 megabyte machine, <computeroutput>bunzip2</computeroutput> has
cannam@89 495 an option to decompress using approximately half this amount of
cannam@89 496 memory, about 2300 kbytes. Decompression speed is also halved,
cannam@89 497 so you should use this option only where necessary. The relevant
cannam@89 498 flag is <computeroutput>-s</computeroutput>.</para>
cannam@89 499
cannam@89 500 <para>In general, try to use the largest block size memory
cannam@89 501 constraints allow, since that maximises the compression achieved.
cannam@89 502 Compression and decompression speed are virtually unaffected by
cannam@89 503 block size.</para>
cannam@89 504
cannam@89 505 <para>Another significant point applies to files which fit in a
cannam@89 506 single block -- that means most files you'd encounter using a
cannam@89 507 large block size. The amount of real memory touched is
cannam@89 508 proportional to the size of the file, since the file is smaller
cannam@89 509 than a block. For example, compressing a file 20,000 bytes long
cannam@89 510 with the flag <computeroutput>-9</computeroutput> will cause the
cannam@89 511 compressor to allocate around 7600k of memory, but only touch
cannam@89 512 400k + 20000 * 8 = 560 kbytes of it. Similarly, the decompressor
cannam@89 513 will allocate 3700k but only touch 100k + 20000 * 4 = 180
cannam@89 514 kbytes.</para>
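<para>The arithmetic in this example can be written out as follows. The helper <computeroutput>touched_kbytes</computeroutput> is hypothetical; it simply caps the per-byte term at the smaller of the file size and the block size.</para>

```python
def touched_kbytes(file_bytes, level, fixed_k, per_byte):
    """Estimate memory actually touched, in kbytes.

    fixed_k is the constant term (400 for compression, 100 for
    decompression) and per_byte the multiplier (8 or 4). Only the
    part of the block that the file fills is counted.
    """
    block_bytes = 100_000 * level
    used = min(file_bytes, block_bytes)
    return fixed_k + per_byte * used / 1000


# A 20,000-byte file handled with the -9 block size.
compress_touched = touched_kbytes(20_000, 9, fixed_k=400, per_byte=8)
decompress_touched = touched_kbytes(20_000, 9, fixed_k=100, per_byte=4)
print(compress_touched, decompress_touched)
```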
cannam@89 515
cannam@89 516 <para>Here is a table which summarises the maximum memory usage
cannam@89 517 for different block sizes. Also recorded is the total compressed
cannam@89 518 size for 14 files of the Calgary Text Compression Corpus
cannam@89 519 totalling 3,141,622 bytes. This column gives some feel for how
cannam@89 520 compression varies with block size. These figures tend to
cannam@89 521 understate the advantage of larger block sizes for larger files,
cannam@89 522 since the Corpus is dominated by smaller files.</para>
cannam@89 523
cannam@89 524 <programlisting>
cannam@89 525            Compress   Decompress   Decompress   Corpus
cannam@89 526     Flag     usage      usage       -s usage     Size
cannam@89 527
cannam@89 528      -1      1200k       500k         350k      914704
cannam@89 529      -2      2000k       900k         600k      877703
cannam@89 530      -3      2800k      1300k         850k      860338
cannam@89 531      -4      3600k      1700k        1100k      846899
cannam@89 532      -5      4400k      2100k        1350k      845160
cannam@89 533      -6      5200k      2500k        1600k      838626
cannam@89 534      -7      6000k      2900k        1850k      834096
cannam@89 535      -8      6800k      3300k        2100k      828642
cannam@89 536      -9      7600k      3700k        2350k      828642
cannam@89 537 </programlisting>
cannam@89 538
cannam@89 539 </sect1>
cannam@89 540
cannam@89 541
cannam@89 542 <sect1 id="recovering" xreflabel="RECOVERING DATA FROM DAMAGED FILES">
cannam@89 543 <title>RECOVERING DATA FROM DAMAGED FILES</title>
cannam@89 544
cannam@89 545 <para><computeroutput>bzip2</computeroutput> compresses files in
cannam@89 546 blocks, usually 900 kbytes long. Each block is handled
cannam@89 547 independently. If a media or transmission error causes a
cannam@89 548 multi-block <computeroutput>.bz2</computeroutput> file to become
cannam@89 549 damaged, it may be possible to recover data from the undamaged
cannam@89 550 blocks in the file.</para>
cannam@89 551
cannam@89 552 <para>The compressed representation of each block is delimited by
cannam@89 553 a 48-bit pattern, which makes it possible to find the block
cannam@89 554 boundaries with reasonable certainty. Each block also carries
cannam@89 555 its own 32-bit CRC, so damaged blocks can be distinguished from
cannam@89 556 undamaged ones.</para>
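<para>A short sketch of locating a block boundary. The value of the 48-bit pattern used here, hexadecimal 314159265359, is not given in this chapter and is taken as an assumption about the file format. Only the first block is guaranteed to start on a byte boundary, so this byte-level search is a simplification of the bit-level scan a recovery tool would need.</para>

```python
import bz2

# Assumed 48-bit block delimiter (not stated in the text above).
BLOCK_MARKER = bytes.fromhex("314159265359")

stream = bz2.compress(b"some data to compress\n")

# The stream header occupies the first four bytes; the first block
# marker is expected immediately after it.
print(stream[:4], stream[4:10].hex())
print("first marker at byte offset", stream.find(BLOCK_MARKER))
```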
cannam@89 557
cannam@89 558 <para><computeroutput>bzip2recover</computeroutput> is a simple
cannam@89 559 program whose purpose is to search for blocks in
cannam@89 560 <computeroutput>.bz2</computeroutput> files, and write each block
cannam@89 561 out into its own <computeroutput>.bz2</computeroutput> file. You
cannam@89 562 can then use <computeroutput>bzip2 -t</computeroutput> to test
cannam@89 563 the integrity of the resulting files, and decompress those which
cannam@89 564 are undamaged.</para>
cannam@89 565
cannam@89 566 <para><computeroutput>bzip2recover</computeroutput> takes a
cannam@89 567 single argument, the name of the damaged file, and writes a
cannam@89 568 number of files <computeroutput>rec0001file.bz2</computeroutput>,
cannam@89 569 <computeroutput>rec0002file.bz2</computeroutput>, etc., containing
cannam@89 570 the extracted blocks. The output filenames are designed so that
cannam@89 571 the use of wildcards in subsequent processing -- for example,
cannam@89 572 <computeroutput>bzip2 -dc rec*file.bz2 &#62;
cannam@89 573 recovered_data</computeroutput> -- lists the files in the correct
cannam@89 574 order.</para>
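<para>The reason for the zero-padded numbering can be illustrated as follows, assuming that wildcard expansion sorts names lexicographically. The unpadded names are a hypothetical alternative shown only for contrast.</para>

```python
# Zero-padded names, in the style described above, deliberately shuffled.
padded = [f"rec{n:04d}file.bz2" for n in (10, 2, 1)]

# Hypothetical unpadded names, for contrast only.
unpadded = [f"rec{n}file.bz2" for n in (10, 2, 1)]

# Lexicographic sorting keeps the padded names in block order...
print(sorted(padded))

# ...but puts block 10 ahead of blocks 1 and 2 without padding.
print(sorted(unpadded))
```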
cannam@89 575
cannam@89 576 <para><computeroutput>bzip2recover</computeroutput> should be of
cannam@89 577 most use when dealing with large <computeroutput>.bz2</computeroutput>
cannam@89 578 files, as these will contain many blocks. It is clearly futile
cannam@89 579 to use it on damaged single-block files, since a damaged block
cannam@89 580 cannot be recovered. If you wish to minimise any potential data
cannam@89 581 loss through media or transmission errors, you might consider
cannam@89 582 compressing with a smaller block size.</para>
cannam@89 583
cannam@89 584 </sect1>
cannam@89 585
cannam@89 586
cannam@89 587 <sect1 id="performance" xreflabel="PERFORMANCE NOTES">
cannam@89 588 <title>PERFORMANCE NOTES</title>
cannam@89 589
cannam@89 590 <para>The sorting phase of compression gathers together similar
cannam@89 591 strings in the file. Because of this, files containing very long
cannam@89 592 runs of repeated symbols, like "aabaabaabaab ..." (repeated
cannam@89 593 several hundred times) may compress more slowly than normal.
cannam@89 594 Versions 0.9.5 and above fare much better than previous versions
cannam@89 595 in this respect. The ratio between worst-case and average-case
cannam@89 596 compression time is in the region of 10:1. For previous
cannam@89 597 versions, this figure was more like 100:1. You can use the
cannam@89 598 <computeroutput>-vvvv</computeroutput> option to monitor progress
cannam@89 599 in great detail, if you want.</para>
cannam@89 600
cannam@89 601 <para>Decompression speed is unaffected by these
cannam@89 602 phenomena.</para>
cannam@89 603
cannam@89 604 <para><computeroutput>bzip2</computeroutput> usually allocates
cannam@89 605 several megabytes of memory to operate in, and then charges all
cannam@89 606 over it in a fairly random fashion. This means that performance,
cannam@89 607 both for compressing and decompressing, is largely determined by
cannam@89 608 the speed at which your machine can service cache misses.
cannam@89 609 Because of this, small changes to the code to reduce the miss
cannam@89 610 rate have been observed to give disproportionately large
cannam@89 611 performance improvements. I imagine
cannam@89 612 <computeroutput>bzip2</computeroutput> will perform best on
cannam@89 613 machines with very large caches.</para>
cannam@89 614
cannam@89 615 </sect1>
cannam@89 616
cannam@89 617
cannam@89 618
cannam@89 619 <sect1 id="caveats" xreflabel="CAVEATS">
cannam@89 620 <title>CAVEATS</title>
cannam@89 621
cannam@89 622 <para>I/O error messages are not as helpful as they could be.
cannam@89 623 <computeroutput>bzip2</computeroutput> tries hard to detect I/O
cannam@89 624 errors and exit cleanly, but the details of what the problem is
cannam@89 625 sometimes seem rather misleading.</para>
cannam@89 626
cannam@89 627 <para>This manual page pertains to version &bz-version; of
cannam@89 628 <computeroutput>bzip2</computeroutput>. Compressed data created by
cannam@89 629 this version is entirely forwards and backwards compatible with the
cannam@89 630 previous public releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0,
cannam@89 631 1.0.1, 1.0.2 and 1.0.3, but with the following exception: 0.9.0 and
cannam@89 632 above can correctly decompress multiple concatenated compressed files.
cannam@89 633 0.1pl2 cannot do this; it will stop after decompressing just the first
cannam@89 634 file in the stream.</para>
cannam@89 635
cannam@89 636 <para><computeroutput>bzip2recover</computeroutput> versions
cannam@89 637 prior to 1.0.2 used 32-bit integers to represent bit positions in
cannam@89 638 compressed files, so they could not handle compressed files more
cannam@89 639 than 512 megabytes long. Versions 1.0.2 and above use 64-bit ints
cannam@89 640 on some platforms which support them (GNU supported targets, and
cannam@89 641 Windows). To establish whether or not
cannam@89 642 <computeroutput>bzip2recover</computeroutput> was built with such
cannam@89 643 a limitation, run it without arguments. In any event you can
cannam@89 644 build yourself an unlimited version if you can recompile it with
cannam@89 645 <computeroutput>MaybeUInt64</computeroutput> set to be an
cannam@89 646 unsigned 64-bit integer.</para>
cannam@89 647
cannam@89 648 </sect1>
cannam@89 649
cannam@89 650
cannam@89 651
cannam@89 652 <sect1 id="author" xreflabel="AUTHOR">
cannam@89 653 <title>AUTHOR</title>
cannam@89 654
cannam@89 655 <para>Julian Seward,
cannam@89 656 <computeroutput>&bz-email;</computeroutput></para>
cannam@89 657
cannam@89 658 <para>The ideas embodied in
cannam@89 659 <computeroutput>bzip2</computeroutput> are due to (at least) the
cannam@89 660 following people: Michael Burrows and David Wheeler (for the
cannam@89 661 block sorting transformation), David Wheeler (again, for the
cannam@89 662 Huffman coder), Peter Fenwick (for the structured coding model in
cannam@89 663 the original <computeroutput>bzip</computeroutput>, and many
cannam@89 664 refinements), and Alistair Moffat, Radford Neal and Ian Witten
cannam@89 665 (for the arithmetic coder in the original
cannam@89 666 <computeroutput>bzip</computeroutput>). I am much indebted for
cannam@89 667 their help, support and advice. See the manual in the source
cannam@89 668 distribution for pointers to sources of documentation. Christian
cannam@89 669 von Roques encouraged me to look for faster sorting algorithms,
cannam@89 670 so as to speed up compression. Bela Lubkin encouraged me to
cannam@89 671 improve the worst-case compression performance.
cannam@89 672 Donna Robinson XMLised the documentation.
cannam@89 673 Many people sent
cannam@89 674 patches, helped with portability problems, lent machines, gave
cannam@89 675 advice and were generally helpful.</para>
cannam@89 676
cannam@89 677 </sect1>
cannam@89 678
cannam@89 679 </chapter>
cannam@89 680
cannam@89 681
cannam@89 682
cannam@89 683 <chapter id="libprog" xreflabel="Programming with libbzip2">
cannam@89 684 <title>
cannam@89 685 Programming with <computeroutput>libbzip2</computeroutput>
cannam@89 686 </title>
cannam@89 687
cannam@89 688 <para>This chapter describes the programming interface to
cannam@89 689 <computeroutput>libbzip2</computeroutput>.</para>
cannam@89 690
cannam@89 691 <para>For general background information, particularly about
cannam@89 692 memory use and performance aspects, you'd be well advised to read
cannam@89 693 <xref linkend="using"/> as well.</para>
cannam@89 694
cannam@89 695
cannam@89 696 <sect1 id="top-level" xreflabel="Top-level structure">
cannam@89 697 <title>Top-level structure</title>
cannam@89 698
cannam@89 699 <para><computeroutput>libbzip2</computeroutput> is a flexible
cannam@89 700 library for compressing and decompressing data in the
cannam@89 701 <computeroutput>bzip2</computeroutput> data format. Although
cannam@89 702 packaged as a single entity, it helps to regard the library as
cannam@89 703 three separate parts: the low-level interface, the high-level
cannam@89 704 interface, and some utility functions.</para>
cannam@89 705
cannam@89 706 <para>The structure of
cannam@89 707 <computeroutput>libbzip2</computeroutput>'s interfaces is similar
cannam@89 708 to that of Jean-loup Gailly's and Mark Adler's excellent
cannam@89 709 <computeroutput>zlib</computeroutput> library.</para>
cannam@89 710
cannam@89 711 <para>All externally visible symbols have names beginning
cannam@89 712 <computeroutput>BZ2_</computeroutput>. This is new in version
cannam@89 713 1.0. The intention is to minimise pollution of the namespaces of
cannam@89 714 library clients.</para>
cannam@89 715
cannam@89 716 <para>To use any part of the library, you need to
cannam@89 717 <computeroutput>#include &lt;bzlib.h&gt;</computeroutput>
cannam@89 718 into your sources.</para>
cannam@89 719
cannam@89 720
cannam@89 721
cannam@89 722 <sect2 id="ll-summary" xreflabel="Low-level summary">
cannam@89 723 <title>Low-level summary</title>
cannam@89 724
cannam@89 725 <para>This interface provides services for compressing and
cannam@89 726 decompressing data in memory. There's no provision for dealing
cannam@89 727 with files, streams or any other I/O mechanisms, just straight
cannam@89 728 memory-to-memory work. In fact, this part of the library can be
cannam@89 729 compiled without inclusion of
cannam@89 730 <computeroutput>stdio.h</computeroutput>, which may be helpful
cannam@89 731 for embedded applications.</para>
cannam@89 732
cannam@89 733 <para>The low-level part of the library has no global variables
cannam@89 734 and is therefore thread-safe.</para>
cannam@89 735
cannam@89 736 <para>Six routines make up the low-level interface:
cannam@89 737 <computeroutput>BZ2_bzCompressInit</computeroutput>,
cannam@89 738 <computeroutput>BZ2_bzCompress</computeroutput>, and
cannam@89 739 <computeroutput>BZ2_bzCompressEnd</computeroutput> for
cannam@89 740 compression, and a corresponding trio
cannam@89 741 <computeroutput>BZ2_bzDecompressInit</computeroutput>,
cannam@89 742 <computeroutput>BZ2_bzDecompress</computeroutput> and
cannam@89 743 <computeroutput>BZ2_bzDecompressEnd</computeroutput> for
cannam@89 744 decompression. The <computeroutput>*Init</computeroutput>
cannam@89 745 functions allocate memory for compression/decompression and do
cannam@89 746 other initialisations, whilst the
cannam@89 747 <computeroutput>*End</computeroutput> functions close down
cannam@89 748 operations and release memory.</para>
cannam@89 749
cannam@89 750 <para>The real work is done by
cannam@89 751 <computeroutput>BZ2_bzCompress</computeroutput> and
cannam@89 752 <computeroutput>BZ2_bzDecompress</computeroutput>. These
cannam@89 753 compress and decompress data from a user-supplied input buffer to
cannam@89 754 a user-supplied output buffer. These buffers can be any size;
cannam@89 755 arbitrary quantities of data are handled by making repeated calls
cannam@89 756 to these functions. This is a flexible mechanism allowing a
cannam@89 757 consumer-pull style of activity, or producer-push, or a mixture
cannam@89 758 of both.</para>
cannam@89 759
cannam@89 760 </sect2>
cannam@89 761
cannam@89 762
cannam@89 763 <sect2 id="hl-summary" xreflabel="High-level summary">
cannam@89 764 <title>High-level summary</title>
cannam@89 765
cannam@89 766 <para>This interface provides some handy wrappers around the
cannam@89 767 low-level interface to facilitate reading and writing
cannam@89 768 <computeroutput>bzip2</computeroutput> format files
cannam@89 769 (<computeroutput>.bz2</computeroutput> files). The routines
cannam@89 770 provide hooks to facilitate reading files in which the
cannam@89 771 <computeroutput>bzip2</computeroutput> data stream is embedded
cannam@89 772 within some larger-scale file structure, or where there are
cannam@89 773 multiple <computeroutput>bzip2</computeroutput> data streams
cannam@89 774 concatenated end-to-end.</para>
cannam@89 775
cannam@89 776 <para>For reading files,
cannam@89 777 <computeroutput>BZ2_bzReadOpen</computeroutput>,
cannam@89 778 <computeroutput>BZ2_bzRead</computeroutput>,
cannam@89 779 <computeroutput>BZ2_bzReadClose</computeroutput> and
cannam@89 780 <computeroutput>BZ2_bzReadGetUnused</computeroutput> are
cannam@89 781 supplied. For writing files,
cannam@89 782 <computeroutput>BZ2_bzWriteOpen</computeroutput>,
cannam@89 783 <computeroutput>BZ2_bzWrite</computeroutput> and
cannam@89 784 <computeroutput>BZ2_bzWriteClose</computeroutput> are
cannam@89 785 available.</para>
cannam@89 786
cannam@89 787 <para>As with the low-level library, no global variables are used
cannam@89 788 so the library is per se thread-safe. However, if I/O errors
cannam@89 789 occur whilst reading or writing the underlying compressed files,
cannam@89 790 you may have to consult <computeroutput>errno</computeroutput> to
cannam@89 791 determine the cause of the error. In that case, you'd need a C
cannam@89 792 library which correctly supports
cannam@89 793 <computeroutput>errno</computeroutput> in a multithreaded
cannam@89 794 environment.</para>
cannam@89 795
cannam@89 796 <para>To make the library a little simpler and more portable,
cannam@89 797 <computeroutput>BZ2_bzReadOpen</computeroutput> and
cannam@89 798 <computeroutput>BZ2_bzWriteOpen</computeroutput> require you to
cannam@89 799 pass them file handles (<computeroutput>FILE*</computeroutput>s)
cannam@89 800 which have previously been opened for reading or writing
cannam@89 801 respectively. That avoids portability problems associated with
cannam@89 802 file operations and file attributes, whilst not being much of an
cannam@89 803 imposition on the programmer.</para>
cannam@89 804
cannam@89 805 </sect2>
cannam@89 806
cannam@89 807
cannam@89 808 <sect2 id="util-fns-summary" xreflabel="Utility functions summary">
cannam@89 809 <title>Utility functions summary</title>
cannam@89 810
cannam@89 811 <para>For very simple needs,
cannam@89 812 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and
cannam@89 813 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> are
cannam@89 814 provided. These compress or decompress data in memory from one
cannam@89 815 buffer to another in a single function call. You should assess
cannam@89 816 whether these functions fulfill your memory-to-memory
cannam@89 817 compression/decompression requirements before investing effort in
cannam@89 818 understanding the more general but more complex low-level
cannam@89 819 interface.</para>
cannam@89 820
cannam@89 821 <para>Yoshioka Tsuneo
cannam@89 822 (<computeroutput>tsuneo@rr.iij4u.or.jp</computeroutput>) has
cannam@89 823 contributed some functions to give better
cannam@89 824 <computeroutput>zlib</computeroutput> compatibility. These
cannam@89 825 functions are <computeroutput>BZ2_bzopen</computeroutput>,
cannam@89 826 <computeroutput>BZ2_bzread</computeroutput>,
cannam@89 827 <computeroutput>BZ2_bzwrite</computeroutput>,
cannam@89 828 <computeroutput>BZ2_bzflush</computeroutput>,
cannam@89 829 <computeroutput>BZ2_bzclose</computeroutput>,
cannam@89 830 <computeroutput>BZ2_bzerror</computeroutput> and
cannam@89 831 <computeroutput>BZ2_bzlibVersion</computeroutput>. You may find
cannam@89 832 these functions more convenient for simple file reading and
cannam@89 833 writing than those in the high-level interface. These functions
cannam@89 834 are not (yet) officially part of the library, and are minimally
cannam@89 835 documented here. If they break, you get to keep all the pieces.
cannam@89 836 I hope to document them properly when time permits.</para>
cannam@89 837
cannam@89 838 <para>Yoshioka also contributed modifications to allow the
cannam@89 839 library to be built as a Windows DLL.</para>
cannam@89 840
cannam@89 841 </sect2>
cannam@89 842
cannam@89 843 </sect1>
cannam@89 844
cannam@89 845
cannam@89 846 <sect1 id="err-handling" xreflabel="Error handling">
cannam@89 847 <title>Error handling</title>
cannam@89 848
cannam@89 849 <para>The library is designed to recover cleanly in all
cannam@89 850 situations, including the worst-case situation of decompressing
cannam@89 851 random data. I'm not 100% sure that it can always do this, so
cannam@89 852 you might want to add a signal handler to catch segmentation
cannam@89 853 violations during decompression if you are feeling especially
cannam@89 854 paranoid. I would be interested in hearing more about the
cannam@89 855 robustness of the library to corrupted compressed data.</para>
cannam@89 856
cannam@89 857 <para>Version 1.0.3 is more robust in this respect than any
cannam@89 858 previous version. Investigations with Valgrind (a tool for detecting
cannam@89 859 problems with memory management) indicate
cannam@89 860 that, at least for the few files I tested, all single-bit errors
cannam@89 861 in the compressed data are caught properly, with no
cannam@89 862 segmentation faults, no uses of uninitialised data, no out of
cannam@89 863 range reads or writes, and no infinite looping in the decompressor.
cannam@89 864 So it's certainly pretty robust, although
cannam@89 865 I wouldn't claim it to be totally bombproof.</para>
cannam@89 866
cannam@89 867 <para>The file <computeroutput>bzlib.h</computeroutput> contains
cannam@89 868 all definitions needed to use the library. In particular, you
cannam@89 869 should definitely not include
cannam@89 870 <computeroutput>bzlib_private.h</computeroutput>.</para>
cannam@89 871
cannam@89 872 <para>In <computeroutput>bzlib.h</computeroutput>, the various
cannam@89 873 return values are defined. The following list is not intended as
cannam@89 874 an exhaustive description of the circumstances in which a given
cannam@89 875 value may be returned -- those descriptions are given later.
cannam@89 876 Rather, it is intended to convey the rough meaning of each return
cannam@89 877 value. The first five values are normal and not intended to
cannam@89 878 denote an error situation.</para>
cannam@89 879
cannam@89 880 <variablelist>
cannam@89 881
cannam@89 882 <varlistentry>
cannam@89 883 <term><computeroutput>BZ_OK</computeroutput></term>
cannam@89 884 <listitem><para>The requested action was completed
cannam@89 885 successfully.</para></listitem>
cannam@89 886 </varlistentry>
cannam@89 887
cannam@89 888 <varlistentry>
cannam@89 889 <term><computeroutput>BZ_RUN_OK, BZ_FLUSH_OK,
cannam@89 890 BZ_FINISH_OK</computeroutput></term>
cannam@89 891 <listitem><para>In
cannam@89 892 <computeroutput>BZ2_bzCompress</computeroutput>, the requested
cannam@89 893 flush/finish/nothing-special action was completed
cannam@89 894 successfully.</para></listitem>
cannam@89 895 </varlistentry>
cannam@89 896
cannam@89 897 <varlistentry>
cannam@89 898 <term><computeroutput>BZ_STREAM_END</computeroutput></term>
cannam@89 899 <listitem><para>Compression of data was completed, or the
cannam@89 900 logical stream end was detected during
cannam@89 901 decompression.</para></listitem>
cannam@89 902 </varlistentry>
cannam@89 903
cannam@89 904 </variablelist>
cannam@89 905
cannam@89 906 <para>The following return values indicate an error of some
cannam@89 907 kind.</para>
cannam@89 908
cannam@89 909 <variablelist>
cannam@89 910
cannam@89 911 <varlistentry>
cannam@89 912 <term><computeroutput>BZ_CONFIG_ERROR</computeroutput></term>
cannam@89 913 <listitem><para>Indicates that the library has been improperly
cannam@89 914 compiled on your platform -- a major configuration error.
cannam@89 915 Specifically, it means that
cannam@89 916 <computeroutput>sizeof(char)</computeroutput>,
cannam@89 917 <computeroutput>sizeof(short)</computeroutput> and
cannam@89 918 <computeroutput>sizeof(int)</computeroutput> are not 1, 2 and
cannam@89 919 4 respectively, as they should be. Note that the library
cannam@89 920 should still work properly on 64-bit platforms which follow
cannam@89 921 the LP64 programming model -- that is, where
cannam@89 922 <computeroutput>sizeof(long)</computeroutput> and
cannam@89 923 <computeroutput>sizeof(void*)</computeroutput> are 8. Under
cannam@89 924 LP64, <computeroutput>sizeof(int)</computeroutput> is still 4,
cannam@89 925 so <computeroutput>libbzip2</computeroutput>, which doesn't
cannam@89 926 use the <computeroutput>long</computeroutput> type, is
cannam@89 927 OK.</para></listitem>
cannam@89 928 </varlistentry>
cannam@89 929
cannam@89 930 <varlistentry>
cannam@89 931 <term><computeroutput>BZ_SEQUENCE_ERROR</computeroutput></term>
cannam@89 932 <listitem><para>When using the library, it is important to call
cannam@89 933 the functions in the correct sequence and with data structures
cannam@89 934 (buffers etc) in the correct states.
cannam@89 935 <computeroutput>libbzip2</computeroutput> checks as much as it
cannam@89 936 can to ensure this is happening, and returns
cannam@89 937 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput> if not.
cannam@89 938 Code which complies precisely with the function semantics, as
cannam@89 939 detailed below, should never receive this value; such an event
cannam@89 940 denotes buggy code which you should
cannam@89 941 investigate.</para></listitem>
cannam@89 942 </varlistentry>
cannam@89 943
cannam@89 944 <varlistentry>
cannam@89 945 <term><computeroutput>BZ_PARAM_ERROR</computeroutput></term>
cannam@89 946 <listitem><para>Returned when a parameter to a function call is
cannam@89 947 out of range or otherwise manifestly incorrect. As with
cannam@89 948 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>, this
cannam@89 949 denotes a bug in the client code. The distinction between
cannam@89 950 <computeroutput>BZ_PARAM_ERROR</computeroutput> and
cannam@89 951 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput> is a bit
cannam@89 952 hazy, but still worth making.</para></listitem>
cannam@89 953 </varlistentry>
cannam@89 954
cannam@89 955 <varlistentry>
cannam@89 956 <term><computeroutput>BZ_MEM_ERROR</computeroutput></term>
cannam@89 957 <listitem><para>Returned when a request to allocate memory
cannam@89 958 failed. Note that the quantity of memory needed to decompress
cannam@89 959 a stream cannot be determined until the stream's header has
cannam@89 960 been read. So
cannam@89 961 <computeroutput>BZ2_bzDecompress</computeroutput> and
cannam@89 962 <computeroutput>BZ2_bzRead</computeroutput> may return
cannam@89 963 <computeroutput>BZ_MEM_ERROR</computeroutput> even though some
cannam@89 964 of the compressed data has been read. The same is not true
cannam@89 965 for compression; once
cannam@89 966 <computeroutput>BZ2_bzCompressInit</computeroutput> or
cannam@89 967 <computeroutput>BZ2_bzWriteOpen</computeroutput> have
cannam@89 968 successfully completed,
cannam@89 969 <computeroutput>BZ_MEM_ERROR</computeroutput> cannot
cannam@89 970 occur.</para></listitem>
cannam@89 971 </varlistentry>
cannam@89 972
cannam@89 973 <varlistentry>
cannam@89 974 <term><computeroutput>BZ_DATA_ERROR</computeroutput></term>
cannam@89 975 <listitem><para>Returned when a data integrity error is
cannam@89 976 detected during decompression. Most importantly, this means
cannam@89 977 when stored and computed CRCs for the data do not match. This
cannam@89 978 value is also returned upon detection of any other anomaly in
cannam@89 979 the compressed data.</para></listitem>
cannam@89 980 </varlistentry>
cannam@89 981
cannam@89 982 <varlistentry>
cannam@89 983 <term><computeroutput>BZ_DATA_ERROR_MAGIC</computeroutput></term>
cannam@89 984 <listitem><para>As a special case of
cannam@89 985 <computeroutput>BZ_DATA_ERROR</computeroutput>, it is
cannam@89 986 sometimes useful to know when the compressed stream does not
cannam@89 987 start with the correct magic bytes (<computeroutput>'B' 'Z'
cannam@89 988 'h'</computeroutput>).</para></listitem>
cannam@89 989 </varlistentry>
cannam@89 990
cannam@89 991 <varlistentry>
cannam@89 992 <term><computeroutput>BZ_IO_ERROR</computeroutput></term>
cannam@89 993 <listitem><para>Returned by
cannam@89 994 <computeroutput>BZ2_bzRead</computeroutput> and
cannam@89 995 <computeroutput>BZ2_bzWrite</computeroutput> when there is an
cannam@89 996 error reading or writing in the compressed file, and by
cannam@89 997 <computeroutput>BZ2_bzReadOpen</computeroutput> and
cannam@89 998 <computeroutput>BZ2_bzWriteOpen</computeroutput> for attempts
cannam@89 999 to use a file for which the error indicator (viz,
cannam@89 1000 <computeroutput>ferror(f)</computeroutput>) is set. On
cannam@89 1001 receipt of <computeroutput>BZ_IO_ERROR</computeroutput>, the
cannam@89 1002 caller should consult <computeroutput>errno</computeroutput>
cannam@89 1003 and/or <computeroutput>perror</computeroutput> to acquire
cannam@89 1004 operating-system specific information about the
cannam@89 1005 problem.</para></listitem>
cannam@89 1006 </varlistentry>
cannam@89 1007
cannam@89 1008 <varlistentry>
cannam@89 1009 <term><computeroutput>BZ_UNEXPECTED_EOF</computeroutput></term>
cannam@89 1010 <listitem><para>Returned by
cannam@89 1011 <computeroutput>BZ2_bzRead</computeroutput> when the
cannam@89 1012 compressed file finishes before the logical end of stream is
cannam@89 1013 detected.</para></listitem>
cannam@89 1014 </varlistentry>
cannam@89 1015
cannam@89 1016 <varlistentry>
cannam@89 1017 <term><computeroutput>BZ_OUTBUFF_FULL</computeroutput></term>
cannam@89 1018 <listitem><para>Returned by
cannam@89 1019 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and
cannam@89 1020 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> to
cannam@89 1021 indicate that the output data will not fit into the output
cannam@89 1022 buffer provided.</para></listitem>
cannam@89 1023 </varlistentry>
cannam@89 1024
cannam@89 1025 </variablelist>
cannam@89 1026
cannam@89 1027 </sect1>
cannam@89 1028
cannam@89 1029
cannam@89 1030
cannam@89 1031 <sect1 id="low-level" xreflabel="Low-level interface">
cannam@89 1032 <title>Low-level interface</title>
cannam@89 1033
cannam@89 1034
cannam@89 1035 <sect2 id="bzcompress-init" xreflabel="BZ2_bzCompressInit">
cannam@89 1036 <title>BZ2_bzCompressInit</title>
cannam@89 1037
cannam@89 1038 <programlisting>
cannam@89 1039 typedef struct {
cannam@89 1040    char *next_in;
cannam@89 1041    unsigned int avail_in;
cannam@89 1042    unsigned int total_in_lo32;
cannam@89 1043    unsigned int total_in_hi32;
cannam@89 1044
cannam@89 1045    char *next_out;
cannam@89 1046    unsigned int avail_out;
cannam@89 1047    unsigned int total_out_lo32;
cannam@89 1048    unsigned int total_out_hi32;
cannam@89 1049
cannam@89 1050    void *state;
cannam@89 1051
cannam@89 1052    void *(*bzalloc)(void *,int,int);
cannam@89 1053    void (*bzfree)(void *,void *);
cannam@89 1054    void *opaque;
cannam@89 1055 } bz_stream;
cannam@89 1056
cannam@89 1057 int BZ2_bzCompressInit ( bz_stream *strm,
cannam@89 1058                          int blockSize100k,
cannam@89 1059                          int verbosity,
cannam@89 1060                          int workFactor );
cannam@89 1061 </programlisting>
cannam@89 1062
cannam@89 1063 <para>Prepares for compression. The
cannam@89 1064 <computeroutput>bz_stream</computeroutput> structure holds all
cannam@89 1065 data pertaining to the compression activity. A
cannam@89 1066 <computeroutput>bz_stream</computeroutput> structure should be
cannam@89 1067 allocated and initialised prior to the call. The fields of
cannam@89 1068 <computeroutput>bz_stream</computeroutput> comprise the entirety
cannam@89 1069 of the user-visible data. <computeroutput>state</computeroutput>
cannam@89 1070 is a pointer to the private data structures required for
cannam@89 1071 compression.</para>
cannam@89 1072
cannam@89 1073 <para>Custom memory allocators are supported, via fields
cannam@89 1074 <computeroutput>bzalloc</computeroutput>,
cannam@89 1075 <computeroutput>bzfree</computeroutput>, and
cannam@89 1076 <computeroutput>opaque</computeroutput>. The value
cannam@89 1077 <computeroutput>opaque</computeroutput> is passed as the first
cannam@89 1078 argument to all calls to <computeroutput>bzalloc</computeroutput>
cannam@89 1079 and <computeroutput>bzfree</computeroutput>, but is otherwise
cannam@89 1080 ignored by the library. The call <computeroutput>bzalloc (
cannam@89 1081 opaque, n, m )</computeroutput> is expected to return a pointer
cannam@89 1082 <computeroutput>p</computeroutput> to <computeroutput>n *
cannam@89 1083 m</computeroutput> bytes of memory, and <computeroutput>bzfree (
cannam@89 1084 opaque, p )</computeroutput> should free that memory.</para>
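
<para>A pair of callbacks matching these signatures might look like
the sketch below. The names
<computeroutput>counting_bzalloc</computeroutput>,
<computeroutput>counting_bzfree</computeroutput> and
<computeroutput>alloc_stats</computeroutput> are hypothetical; the
example simply wraps <computeroutput>malloc</computeroutput> and
<computeroutput>free</computeroutput> and uses
<computeroutput>opaque</computeroutput> to carry a counter.</para>

<programlisting><![CDATA[
```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping record handed to the callbacks through
   the opaque pointer; the library never looks inside it. */
typedef struct {
    int live_blocks;   /* allocations not yet freed */
} alloc_stats;

/* Matches the bzalloc field: returns n * m bytes, or NULL on failure. */
void *counting_bzalloc(void *opaque, int n, int m)
{
    alloc_stats *stats = (alloc_stats *)opaque;
    void *p;

    /* Reject negative sizes and products that would overflow int. */
    if (n < 0 || m < 0 || (m != 0 && n > INT_MAX / m))
        return NULL;

    p = malloc((size_t)n * (size_t)m);
    if (p != NULL && stats != NULL)
        stats->live_blocks++;
    return p;
}

/* Matches the bzfree field: releases a block obtained from bzalloc. */
void counting_bzfree(void *opaque, void *p)
{
    alloc_stats *stats = (alloc_stats *)opaque;

    if (p == NULL)
        return;
    if (stats != NULL)
        stats->live_blocks--;
    free(p);
}
```
]]></programlisting>

<para>To install them, store the two function addresses in the
<computeroutput>bzalloc</computeroutput> and
<computeroutput>bzfree</computeroutput> fields and the address of an
<computeroutput>alloc_stats</computeroutput> object in
<computeroutput>opaque</computeroutput> before initialising the
stream.</para>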
cannam@89 1085
cannam@89 1086 <para>If you don't want to use a custom memory allocator, set
cannam@89 1087 <computeroutput>bzalloc</computeroutput>,
cannam@89 1088 <computeroutput>bzfree</computeroutput> and
cannam@89 1089 <computeroutput>opaque</computeroutput> to
cannam@89 1090 <computeroutput>NULL</computeroutput>, and the library will then
cannam@89 1091 use the standard <computeroutput>malloc</computeroutput> /
cannam@89 1092 <computeroutput>free</computeroutput> routines.</para>
cannam@89 1093
cannam@89 1094 <para>Before calling
cannam@89 1095 <computeroutput>BZ2_bzCompressInit</computeroutput>, fields
cannam@89 1096 <computeroutput>bzalloc</computeroutput>,
cannam@89 1097 <computeroutput>bzfree</computeroutput> and
cannam@89 1098 <computeroutput>opaque</computeroutput> should be filled
cannam@89 1099 appropriately, as just described. Upon return, the internal
cannam@89 1100 state will have been allocated and initialised, and
cannam@89 1101 <computeroutput>total_in_lo32</computeroutput>,
cannam@89 1102 <computeroutput>total_in_hi32</computeroutput>,
cannam@89 1103 <computeroutput>total_out_lo32</computeroutput> and
cannam@89 1104 <computeroutput>total_out_hi32</computeroutput> will have been
cannam@89 1105 set to zero. These four fields are used by the library to inform
cannam@89 1106 the caller of the total amount of data passed into and out of the
cannam@89 1107 library, respectively. You should not try to change them. As of
cannam@89 1108 version 1.0, 64-bit counts are maintained, even on 32-bit
cannam@89 1109 platforms, using the <computeroutput>_hi32</computeroutput>
cannam@89 1110 fields to store the upper 32 bits of the count. So, for example,
cannam@89 1111 the total amount of data in is <computeroutput>(total_in_hi32
cannam@89 1112 &#60;&#60; 32) + total_in_lo32</computeroutput>.</para>
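
<para>In C the shift has to be carried out in a 64-bit type, since
shifting a 32-bit <computeroutput>unsigned int</computeroutput> left
by 32 bits is undefined. A minimal sketch, assuming a C99 compiler
and a hypothetical helper named
<computeroutput>combine_count32</computeroutput>:</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper, not part of libbzip2: joins the two 32-bit
   halves of a bz_stream byte counter into one 64-bit value. */
uint64_t combine_count32(unsigned int hi32, unsigned int lo32)
{
    /* Widen before shifting so the shift happens in 64 bits. */
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```
]]></programlisting>

<para>For a stream <computeroutput>strm</computeroutput>, the total
input so far would then be
<computeroutput>combine_count32(strm.total_in_hi32,
strm.total_in_lo32)</computeroutput>.</para>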
cannam@89 1113
cannam@89 1114 <para>Parameter <computeroutput>blockSize100k</computeroutput>
cannam@89 1115 specifies the block size to be used for compression. It should
cannam@89 1116 be a value between 1 and 9 inclusive, and the actual block size
cannam@89 1117 used is 100000 x this figure. A value of 9 gives the best compression
cannam@89 1118 but takes the most memory.</para>
cannam@89 1119
cannam@89 1120 <para>Parameter <computeroutput>verbosity</computeroutput> should
cannam@89 1121 be set to a number between 0 and 4 inclusive. 0 is silent, and
cannam@89 1122 greater numbers give increasingly verbose monitoring/debugging
cannam@89 1123 output. If the library has been compiled with
cannam@89 1124 <computeroutput>-DBZ_NO_STDIO</computeroutput>, no such output
cannam@89 1125 will appear for any verbosity setting.</para>
cannam@89 1126
cannam@89 1127 <para>Parameter <computeroutput>workFactor</computeroutput>
cannam@89 1128 controls how the compression phase behaves when presented with
cannam@89 1129 worst case, highly repetitive, input data. If compression runs
cannam@89 1130 into difficulties caused by repetitive data, the library switches
cannam@89 1131 from the standard sorting algorithm to a fallback algorithm. The
cannam@89 1132 fallback is slower than the standard algorithm by perhaps a
cannam@89 1133 factor of three, but always behaves reasonably, no matter how bad
cannam@89 1134 the input.</para>
cannam@89 1135
cannam@89 1136 <para>Lower values of <computeroutput>workFactor</computeroutput>
cannam@89 1137 reduce the amount of effort the standard algorithm will expend
cannam@89 1138 before resorting to the fallback. You should set this parameter
cannam@89 1139 carefully: too low, and many inputs will be handled by the
cannam@89 1140 fallback algorithm and so compress rather slowly; too high, and
cannam@89 1141 your average-to-worst case compression times can become very
cannam@89 1142 large. The default value of 30 gives reasonable behaviour over a
cannam@89 1143 wide range of circumstances.</para>
cannam@89 1144
cannam@89 1145 <para>Allowable values range from 0 to 250 inclusive. 0 is a
cannam@89 1146 special case, equivalent to using the default value of 30.</para>
cannam@89 1147
cannam@89 1148 <para>Note that the compressed output generated is the same
cannam@89 1149 regardless of whether or not the fallback algorithm is
cannam@89 1150 used.</para>
cannam@89 1151
cannam@89 1152 <para>Be aware also that this parameter may disappear entirely in
cannam@89 1153 future versions of the library. In principle it should be
cannam@89 1154 possible to devise a good way to automatically choose which
cannam@89 1155 algorithm to use. Such a mechanism would render the parameter
cannam@89 1156 obsolete.</para>
cannam@89 1157
cannam@89 1158 <para>Possible return values:</para>
cannam@89 1159
cannam@89 1160 <programlisting>
cannam@89 1161 BZ_CONFIG_ERROR
cannam@89 1162   if the library has been mis-compiled
cannam@89 1163 BZ_PARAM_ERROR
cannam@89 1164   if strm is NULL
cannam@89 1165   or blockSize100k &lt; 1 or blockSize100k &gt; 9
cannam@89 1166   or verbosity &lt; 0 or verbosity &gt; 4
cannam@89 1167   or workFactor &lt; 0 or workFactor &gt; 250
cannam@89 1168 BZ_MEM_ERROR
cannam@89 1169   if not enough memory is available
cannam@89 1170 BZ_OK
cannam@89 1171   otherwise
cannam@89 1172 </programlisting>
cannam@89 1173
cannam@89 1174 <para>Allowable next actions:</para>
cannam@89 1175
cannam@89 1176 <programlisting>
cannam@89 1177 BZ2_bzCompress
cannam@89 1178   if BZ_OK is returned
cannam@89 1179   no specific action needed in case of error
cannam@89 1180 </programlisting>
cannam@89 1181
cannam@89 1182 </sect2>
cannam@89 1183
cannam@89 1184
cannam@89 1185 <sect2 id="bzCompress" xreflabel="BZ2_bzCompress">
cannam@89 1186 <title>BZ2_bzCompress</title>
cannam@89 1187
cannam@89 1188 <programlisting>
cannam@89 1189 int BZ2_bzCompress ( bz_stream *strm, int action );
cannam@89 1190 </programlisting>
cannam@89 1191
cannam@89 1192 <para>Provides more input and/or output buffer space for the
cannam@89 1193 library. The caller maintains input and output buffers, and
cannam@89 1194 calls <computeroutput>BZ2_bzCompress</computeroutput> to transfer
cannam@89 1195 data between them.</para>
cannam@89 1196
cannam@89 1197 <para>Before each call to
cannam@89 1198 <computeroutput>BZ2_bzCompress</computeroutput>,
cannam@89 1199 <computeroutput>next_in</computeroutput> should point at the data
cannam@89 1200 to be compressed, and <computeroutput>avail_in</computeroutput>
cannam@89 1201 should indicate how many bytes the library may read.
cannam@89 1202 <computeroutput>BZ2_bzCompress</computeroutput> updates
cannam@89 1203 <computeroutput>next_in</computeroutput>,
cannam@89 1204 <computeroutput>avail_in</computeroutput> and
cannam@89 1205 <computeroutput>total_in</computeroutput> to reflect the number
cannam@89 1206 of bytes it has read.</para>
cannam@89 1207
cannam@89 1208 <para>Similarly, <computeroutput>next_out</computeroutput> should
cannam@89 1209 point to a buffer in which the compressed data is to be placed,
cannam@89 1210 with <computeroutput>avail_out</computeroutput> indicating how
cannam@89 1211 much output space is available.
cannam@89 1212 <computeroutput>BZ2_bzCompress</computeroutput> updates
cannam@89 1213 <computeroutput>next_out</computeroutput>,
cannam@89 1214 <computeroutput>avail_out</computeroutput> and
cannam@89 1215 <computeroutput>total_out</computeroutput> to reflect the number
cannam@89 1216 of bytes output.</para>
cannam@89 1217
cannam@89 1218 <para>You may provide and remove as little or as much data as you
cannam@89 1219 like on each call of
cannam@89 1220 <computeroutput>BZ2_bzCompress</computeroutput>. In the limit,
cannam@89 1221 it is acceptable to supply and remove data one byte at a time,
cannam@89 1222 although this would be terribly inefficient. You should always
cannam@89 1223 ensure that at least one byte of output space is available at
cannam@89 1224 each call.</para>
cannam@89 1225
cannam@89 1226 <para>A second purpose of
cannam@89 1227 <computeroutput>BZ2_bzCompress</computeroutput> is to request a
cannam@89 1228 change of mode of the compressed stream.</para>
cannam@89 1229
cannam@89 1230 <para>Conceptually, a compressed stream can be in one of four
cannam@89 1231 states: IDLE, RUNNING, FLUSHING and FINISHING. Before
cannam@89 1232 initialisation
cannam@89 1233 (<computeroutput>BZ2_bzCompressInit</computeroutput>) and after
cannam@89 1234 termination (<computeroutput>BZ2_bzCompressEnd</computeroutput>),
cannam@89 1235 a stream is regarded as IDLE.</para>
cannam@89 1236
cannam@89 1237 <para>Upon initialisation
cannam@89 1238 (<computeroutput>BZ2_bzCompressInit</computeroutput>), the stream
cannam@89 1239 is placed in the RUNNING state. Subsequent calls to
cannam@89 1240 <computeroutput>BZ2_bzCompress</computeroutput> should pass
cannam@89 1241 <computeroutput>BZ_RUN</computeroutput> as the requested action;
cannam@89 1242 other actions are illegal and will result in
cannam@89 1243 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>.</para>
cannam@89 1244
cannam@89 1245 <para>At some point, the calling program will have provided all
cannam@89 1246 the input data it wants to. It will then want to finish up -- in
cannam@89 1247 effect, asking the library to process any data it might have
cannam@89 1248 buffered internally. In this state,
cannam@89 1249 <computeroutput>BZ2_bzCompress</computeroutput> will no longer
cannam@89 1250 attempt to read data from
cannam@89 1251 <computeroutput>next_in</computeroutput>, but it will want to
cannam@89 1252 write data to <computeroutput>next_out</computeroutput>. Because
cannam@89 1253 the output buffer supplied by the user can be arbitrarily small,
cannam@89 1254 the finishing-up operation cannot necessarily be done with a
cannam@89 1255 single call of
cannam@89 1256 <computeroutput>BZ2_bzCompress</computeroutput>.</para>
cannam@89 1257
cannam@89 1258 <para>Instead, the calling program passes
cannam@89 1259 <computeroutput>BZ_FINISH</computeroutput> as an action to
cannam@89 1260 <computeroutput>BZ2_bzCompress</computeroutput>. This changes
cannam@89 1261 the stream's state to FINISHING. Any remaining input (ie,
cannam@89 1262 <computeroutput>next_in[0 .. avail_in-1]</computeroutput>) is
cannam@89 1263 compressed and transferred to the output buffer. To do this,
cannam@89 1264 <computeroutput>BZ2_bzCompress</computeroutput> must be called
cannam@89 1265 repeatedly until all the output has been consumed. At that
cannam@89 1266 point, <computeroutput>BZ2_bzCompress</computeroutput> returns
cannam@89 1267 <computeroutput>BZ_STREAM_END</computeroutput>, and the stream's
cannam@89 1268 state is set back to IDLE.
cannam@89 1269 <computeroutput>BZ2_bzCompressEnd</computeroutput> should then be
cannam@89 1270 called.</para>
cannam@89 1271
cannam@89 1272 <para>Just to make sure the calling program does not cheat, the
cannam@89 1273 library makes a note of <computeroutput>avail_in</computeroutput>
cannam@89 1274 at the time of the first call to
cannam@89 1275 <computeroutput>BZ2_bzCompress</computeroutput> which has
cannam@89 1276 <computeroutput>BZ_FINISH</computeroutput> as an action (ie, at
cannam@89 1277 the time the program has announced its intention to not supply
cannam@89 1278 any more input). By comparing this value with that of
cannam@89 1279 <computeroutput>avail_in</computeroutput> over subsequent calls
cannam@89 1280 to <computeroutput>BZ2_bzCompress</computeroutput>, the library
cannam@89 1281 can detect any attempts to slip in more data to compress. Any
cannam@89 1282 calls for which this is detected will return
cannam@89 1283 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>. This
cannam@89 1284 indicates a programming mistake which should be corrected.</para>
cannam@89 1285
cannam@89 1286 <para>Instead of asking to finish, the calling program may ask
cannam@89 1287 <computeroutput>BZ2_bzCompress</computeroutput> to take all the
cannam@89 1288 remaining input, compress it and terminate the current
cannam@89 1289 (Burrows-Wheeler) compression block. This could be useful for
cannam@89 1290 error control purposes. The mechanism is analogous to that for
cannam@89 1291 finishing: call <computeroutput>BZ2_bzCompress</computeroutput>
cannam@89 1292 with an action of <computeroutput>BZ_FLUSH</computeroutput>,
cannam@89 1293 remove output data, and persist with the
cannam@89 1294 <computeroutput>BZ_FLUSH</computeroutput> action until the value
cannam@89 1295 <computeroutput>BZ_RUN_OK</computeroutput> is returned. As with
cannam@89 1296 finishing, <computeroutput>BZ2_bzCompress</computeroutput>
cannam@89 1297 detects any attempt to provide more input data once the flush has
cannam@89 1298 begun.</para>
cannam@89 1299
cannam@89 1300 <para>Once the flush is complete, the stream returns to the
cannam@89 1301 normal RUNNING state.</para>
cannam@89 1302
cannam@89 1303 <para>This all sounds pretty complex, but isn't really. Here's a
cannam@89 1304 table which shows which actions are allowable in each state, what
cannam@89 1305 action will be taken, what the next state is, and what the
cannam@89 1306 non-error return values are. Note that you can't explicitly ask
cannam@89 1307 what state the stream is in, but nor do you need to -- it can be
cannam@89 1308 inferred from the values returned by
cannam@89 1309 <computeroutput>BZ2_bzCompress</computeroutput>.</para>
cannam@89 1310
cannam@89 1311 <programlisting>
cannam@89 1312 IDLE/any
cannam@89 1313 Illegal. IDLE state only exists after BZ2_bzCompressEnd or
cannam@89 1314 before BZ2_bzCompressInit.
cannam@89 1315 Return value = BZ_SEQUENCE_ERROR
cannam@89 1316
cannam@89 1317 RUNNING/BZ_RUN
cannam@89 1318 Compress from next_in to next_out as much as possible.
cannam@89 1319 Next state = RUNNING
cannam@89 1320 Return value = BZ_RUN_OK
cannam@89 1321
cannam@89 1322 RUNNING/BZ_FLUSH
cannam@89 1323 Remember current value of avail_in. Compress from next_in
cannam@89 1324 to next_out as much as possible, but do not accept any more input.
cannam@89 1325 Next state = FLUSHING
cannam@89 1326 Return value = BZ_FLUSH_OK
cannam@89 1327
cannam@89 1328 RUNNING/BZ_FINISH
cannam@89 1329 Remember current value of avail_in. Compress from next_in
cannam@89 1330 to next_out as much as possible, but do not accept any more input.
cannam@89 1331 Next state = FINISHING
cannam@89 1332 Return value = BZ_FINISH_OK
cannam@89 1333
cannam@89 1334 FLUSHING/BZ_FLUSH
cannam@89 1335 Compress from next_in to next_out as much as possible,
cannam@89 1336 but do not accept any more input.
cannam@89 1337 If all the existing input has been used up and all compressed
cannam@89 1338 output has been removed
cannam@89 1339 Next state = RUNNING; Return value = BZ_RUN_OK
cannam@89 1340 else
cannam@89 1341 Next state = FLUSHING; Return value = BZ_FLUSH_OK
cannam@89 1342
cannam@89 1343 FLUSHING/other
cannam@89 1344 Illegal.
cannam@89 1345 Return value = BZ_SEQUENCE_ERROR
cannam@89 1346
cannam@89 1347 FINISHING/BZ_FINISH
cannam@89 1348 Compress from next_in to next_out as much as possible,
cannam@89 1349 but do not accept any more input.
cannam@89 1350 If all the existing input has been used up and all compressed
cannam@89 1351 output has been removed
cannam@89 1352 Next state = IDLE; Return value = BZ_STREAM_END
cannam@89 1353 else
cannam@89 1354 Next state = FINISHING; Return value = BZ_FINISH_OK
cannam@89 1355
cannam@89 1356 FINISHING/other
cannam@89 1357 Illegal.
cannam@89 1358 Return value = BZ_SEQUENCE_ERROR
cannam@89 1359 </programlisting>
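<para>The table can also be read as a transition function. The
following is a model of the table only, under stated assumptions:
the state names, the <computeroutput>drained</computeroutput> flag
and <computeroutput>model_compress_call</computeroutput> are
hypothetical, and the constants are repeated with the values
<computeroutput>bzlib.h</computeroutput> gives them. It does not
compress anything.</para>

```c
/* Action and return codes with the values bzlib.h gives them. */
#define BZ_RUN             0
#define BZ_FLUSH           1
#define BZ_FINISH          2
#define BZ_RUN_OK          1
#define BZ_FLUSH_OK        2
#define BZ_FINISH_OK       3
#define BZ_STREAM_END      4
#define BZ_SEQUENCE_ERROR  (-1)

/* Hypothetical names for the four conceptual states. */
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } StreamState;

/* Models one call. *state is updated and the return code is returned.
   `drained` is nonzero when all existing input has been used up and all
   compressed output has been removed; it only matters in the FLUSHING
   and FINISHING states. Action values the table does not list are
   lumped in with sequence errors here, which is a simplification. */
int model_compress_call(StreamState *state, int action, int drained)
{
    switch (*state) {
    case ST_RUNNING:
        if (action == BZ_RUN)    return BZ_RUN_OK;
        if (action == BZ_FLUSH)  { *state = ST_FLUSHING;  return BZ_FLUSH_OK; }
        if (action == BZ_FINISH) { *state = ST_FINISHING; return BZ_FINISH_OK; }
        break;
    case ST_FLUSHING:
        if (action != BZ_FLUSH) break;
        if (drained) { *state = ST_RUNNING; return BZ_RUN_OK; }
        return BZ_FLUSH_OK;
    case ST_FINISHING:
        if (action != BZ_FINISH) break;
        if (drained) { *state = ST_IDLE; return BZ_STREAM_END; }
        return BZ_FINISH_OK;
    case ST_IDLE:
        break;
    }
    return BZ_SEQUENCE_ERROR;   /* state is left unchanged on error */
}
```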
cannam@89 1360
cannam@89 1361
cannam@89 1362 <para>That still looks complicated? Well, fair enough. The
cannam@89 1363 usual sequence of calls for compressing a load of data is:</para>
cannam@89 1364
cannam@89 1365 <orderedlist>
cannam@89 1366
cannam@89 1367 <listitem><para>Get started with
cannam@89 1368 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para></listitem>
cannam@89 1369
cannam@89 1370 <listitem><para>Shovel data in and shlurp out its compressed form
cannam@89 1371 using zero or more calls of
cannam@89 1372 <computeroutput>BZ2_bzCompress</computeroutput> with action =
cannam@89 1373 <computeroutput>BZ_RUN</computeroutput>.</para></listitem>
cannam@89 1374
cannam@89 1375 <listitem><para>Finish up. Repeatedly call
cannam@89 1376 <computeroutput>BZ2_bzCompress</computeroutput> with action =
cannam@89 1377 <computeroutput>BZ_FINISH</computeroutput>, copying out the
cannam@89 1378 compressed output, until
cannam@89 1379 <computeroutput>BZ_STREAM_END</computeroutput> is
cannam@89 1380 returned.</para></listitem> <listitem><para>Close up and go home. Call
cannam@89 1381 <computeroutput>BZ2_bzCompressEnd</computeroutput>.</para></listitem>
cannam@89 1382
cannam@89 1383 </orderedlist>
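<para>The finishing step of the list above is a loop, and its shape
can be sketched without the library. The callback type
<computeroutput>compress_step_fn</computeroutput>, the driver
<computeroutput>finish_stream</computeroutput> and the stand-in
<computeroutput>fake_step</computeroutput> are all hypothetical; in
real code the callback would be a thin wrapper around
<computeroutput>BZ2_bzCompress</computeroutput> that also copies
out the compressed bytes and resets
<computeroutput>next_out</computeroutput> and
<computeroutput>avail_out</computeroutput>.</para>

```c
/* Values as defined in bzlib.h; repeated so the sketch is self-contained. */
#define BZ_FINISH          2
#define BZ_FINISH_OK       3
#define BZ_STREAM_END      4
#define BZ_SEQUENCE_ERROR  (-1)

/* Hypothetical callback: one call of the compressor with the given action. */
typedef int (*compress_step_fn)(void *strm, int action);

/* Keeps passing BZ_FINISH while the callback answers BZ_FINISH_OK.
   Returns the last return code (BZ_STREAM_END on success, otherwise an
   error code) and stores the number of calls made in *calls. */
int finish_stream(compress_step_fn step, void *strm, int *calls)
{
    int rc;
    *calls = 0;
    do {
        rc = step(strm, BZ_FINISH);
        (*calls)++;
    } while (rc == BZ_FINISH_OK);
    return rc;
}

/* Stand-in for the library, used only to exercise the loop: pretends
   that *(int *)strm further calls are needed to drain the output. */
int fake_step(void *strm, int action)
{
    int *remaining = strm;
    if (action != BZ_FINISH) return BZ_SEQUENCE_ERROR;
    if (*remaining == 0)     return BZ_STREAM_END;
    (*remaining)--;
    return BZ_FINISH_OK;
}
```

<para>Whatever the outcome, the caller would follow this with
<computeroutput>BZ2_bzCompressEnd</computeroutput>.</para>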
cannam@89 1384
cannam@89 1385 <para>If the data you want to compress fits into your input
cannam@89 1386 buffer all at once, you can skip the calls of
cannam@89 1387 <computeroutput>BZ2_bzCompress ( ..., BZ_RUN )</computeroutput>
cannam@89 1388 and just do the <computeroutput>BZ2_bzCompress ( ..., BZ_FINISH
cannam@89 1389 )</computeroutput> calls.</para>
cannam@89 1390
cannam@89 1391 <para>All required memory is allocated by
cannam@89 1392 <computeroutput>BZ2_bzCompressInit</computeroutput>. The
cannam@89 1393 compression library can accept any data at all (obviously). So
cannam@89 1394 you shouldn't get any error return values from the
cannam@89 1395 <computeroutput>BZ2_bzCompress</computeroutput> calls. If you
cannam@89 1396 do, they will be
cannam@89 1397 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>, and indicate
cannam@89 1398 a bug in your programming.</para>
cannam@89 1399
cannam@89 1400 <para>Trivial other possible return values:</para>
cannam@89 1401
cannam@89 1402 <programlisting>
cannam@89 1403 BZ_PARAM_ERROR
cannam@89 1404 if strm is NULL, or strm->s is NULL
cannam@89 1405 </programlisting>
cannam@89 1406
cannam@89 1407 </sect2>
cannam@89 1408
cannam@89 1409
cannam@89 1410 <sect2 id="bzCompress-end" xreflabel="BZ2_bzCompressEnd">
cannam@89 1411 <title>BZ2_bzCompressEnd</title>
cannam@89 1412
cannam@89 1413 <programlisting>
cannam@89 1414 int BZ2_bzCompressEnd ( bz_stream *strm );
cannam@89 1415 </programlisting>
cannam@89 1416
cannam@89 1417 <para>Releases all memory associated with a compression
cannam@89 1418 stream.</para>
cannam@89 1419
cannam@89 1420 <para>Possible return values:</para>
cannam@89 1421
cannam@89 1422 <programlisting>
cannam@89 1423 BZ_PARAM_ERROR if strm is NULL or strm->s is NULL
cannam@89 1424 BZ_OK otherwise
cannam@89 1425 </programlisting>
cannam@89 1426
cannam@89 1427 </sect2>
cannam@89 1428
cannam@89 1429
cannam@89 1430 <sect2 id="bzDecompress-init" xreflabel="BZ2_bzDecompressInit">
cannam@89 1431 <title>BZ2_bzDecompressInit</title>
cannam@89 1432
cannam@89 1433 <programlisting>
cannam@89 1434 int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
cannam@89 1435 </programlisting>
cannam@89 1436
cannam@89 1437 <para>Prepares for decompression. As with
cannam@89 1438 <computeroutput>BZ2_bzCompressInit</computeroutput>, a
cannam@89 1439 <computeroutput>bz_stream</computeroutput> record should be
cannam@89 1440 allocated and initialised before the call. Fields
cannam@89 1441 <computeroutput>bzalloc</computeroutput>,
cannam@89 1442 <computeroutput>bzfree</computeroutput> and
cannam@89 1443 <computeroutput>opaque</computeroutput> should be set if a custom
cannam@89 1444 memory allocator is required, or made
cannam@89 1445 <computeroutput>NULL</computeroutput> for the normal
cannam@89 1446 <computeroutput>malloc</computeroutput> /
cannam@89 1447 <computeroutput>free</computeroutput> routines. Upon return, the
cannam@89 1448 internal state will have been initialised, and
cannam@89 1449 <computeroutput>total_in</computeroutput> and
cannam@89 1450 <computeroutput>total_out</computeroutput> will be zero.</para>
cannam@89 1451
cannam@89 1452 <para>For the meaning of parameter
cannam@89 1453 <computeroutput>verbosity</computeroutput>, see
cannam@89 1454 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para>
cannam@89 1455
cannam@89 1456 <para>If <computeroutput>small</computeroutput> is nonzero, the
cannam@89 1457 library will use an alternative decompression algorithm which
cannam@89 1458 uses less memory but at the cost of decompressing more slowly
cannam@89 1459 (roughly speaking, half the speed, but the maximum memory
cannam@89 1460 requirement drops to around 2300k). See <xref linkend="using"/>
cannam@89 1461 for more information on memory management.</para>
cannam@89 1462
cannam@89 1463 <para>Note that the amount of memory needed to decompress a
cannam@89 1464 stream cannot be determined until the stream's header has been
cannam@89 1465 read, so even if
cannam@89 1466 <computeroutput>BZ2_bzDecompressInit</computeroutput> succeeds, a
cannam@89 1467 subsequent <computeroutput>BZ2_bzDecompress</computeroutput>
cannam@89 1468 could fail with
cannam@89 1469 <computeroutput>BZ_MEM_ERROR</computeroutput>.</para>
cannam@89 1470
cannam@89 1471 <para>Possible return values:</para>
cannam@89 1472
cannam@89 1473 <programlisting>
cannam@89 1474 BZ_CONFIG_ERROR
cannam@89 1475 if the library has been mis-compiled
cannam@89 1476 BZ_PARAM_ERROR
cannam@89 1477 if ( small != 0 && small != 1 )
cannam@89 1478 or (verbosity < 0 || verbosity > 4)
cannam@89 1479 BZ_MEM_ERROR
cannam@89 1480 if insufficient memory is available
cannam@89 1481 </programlisting>
cannam@89 1482
cannam@89 1483 <para>Allowable next actions:</para>
cannam@89 1484
cannam@89 1485 <programlisting>
cannam@89 1486 BZ2_bzDecompress
cannam@89 1487 if BZ_OK was returned
cannam@89 1488 no specific action required in case of error
cannam@89 1489 </programlisting>
cannam@89 1490
cannam@89 1491 </sect2>
cannam@89 1492
cannam@89 1493
cannam@89 1494 <sect2 id="bzDecompress" xreflabel="BZ2_bzDecompress">
cannam@89 1495 <title>BZ2_bzDecompress</title>
cannam@89 1496
cannam@89 1497 <programlisting>
cannam@89 1498 int BZ2_bzDecompress ( bz_stream *strm );
cannam@89 1499 </programlisting>
cannam@89 1500
cannam@89 1501 <para>Provides more input and/or output buffer space for the
cannam@89 1502 library. The caller maintains input and output buffers, and uses
cannam@89 1503 <computeroutput>BZ2_bzDecompress</computeroutput> to transfer
cannam@89 1504 data between them.</para>
cannam@89 1505
cannam@89 1506 <para>Before each call to
cannam@89 1507 <computeroutput>BZ2_bzDecompress</computeroutput>,
cannam@89 1508 <computeroutput>next_in</computeroutput> should point at the
cannam@89 1509 compressed data, and <computeroutput>avail_in</computeroutput>
cannam@89 1510 should indicate how many bytes the library may read.
cannam@89 1511 <computeroutput>BZ2_bzDecompress</computeroutput> updates
cannam@89 1512 <computeroutput>next_in</computeroutput>,
cannam@89 1513 <computeroutput>avail_in</computeroutput> and
cannam@89 1514 <computeroutput>total_in</computeroutput> to reflect the number
cannam@89 1515 of bytes it has read.</para>
cannam@89 1516
cannam@89 1517 <para>Similarly, <computeroutput>next_out</computeroutput> should
cannam@89 1518 point to a buffer in which the uncompressed output is to be
cannam@89 1519 placed, with <computeroutput>avail_out</computeroutput>
cannam@89 1520 indicating how much output space is available.
cannam@89 1521 <computeroutput>BZ2_bzDecompress</computeroutput> updates
cannam@89 1522 <computeroutput>next_out</computeroutput>,
cannam@89 1523 <computeroutput>avail_out</computeroutput> and
cannam@89 1524 <computeroutput>total_out</computeroutput> to reflect the number
cannam@89 1525 of bytes output.</para>
cannam@89 1526
cannam@89 1527 <para>You may provide and remove as little or as much data as you
cannam@89 1528 like on each call of
cannam@89 1529 <computeroutput>BZ2_bzDecompress</computeroutput>. In the limit,
cannam@89 1530 it is acceptable to supply and remove data one byte at a time,
cannam@89 1531 although this would be terribly inefficient. You should always
cannam@89 1532 ensure that at least one byte of output space is available at
cannam@89 1533 each call.</para>
cannam@89 1534
cannam@89 1535 <para>Use of <computeroutput>BZ2_bzDecompress</computeroutput> is
cannam@89 1536 simpler than
cannam@89 1537 <computeroutput>BZ2_bzCompress</computeroutput>.</para>
cannam@89 1538
cannam@89 1539 <para>You should provide input and remove output as described
cannam@89 1540 above, and repeatedly call
cannam@89 1541 <computeroutput>BZ2_bzDecompress</computeroutput> until
cannam@89 1542 <computeroutput>BZ_STREAM_END</computeroutput> is returned.
cannam@89 1543 Appearance of <computeroutput>BZ_STREAM_END</computeroutput>
cannam@89 1544 denotes that <computeroutput>BZ2_bzDecompress</computeroutput>
cannam@89 1545 has detected the logical end of the compressed stream.
cannam@89 1546 <computeroutput>BZ2_bzDecompress</computeroutput> will not
cannam@89 1547 produce <computeroutput>BZ_STREAM_END</computeroutput> until all
cannam@89 1548 output data has been placed into the output buffer, so once
cannam@89 1549 <computeroutput>BZ_STREAM_END</computeroutput> appears, you are
cannam@89 1550 guaranteed to have available all the decompressed output, and
cannam@89 1551 <computeroutput>BZ2_bzDecompressEnd</computeroutput> can safely
cannam@89 1552 be called.</para>
cannam@89 1553
cannam@89 1554 <para>In case of an error return value, you should call
cannam@89 1555 <computeroutput>BZ2_bzDecompressEnd</computeroutput> to clean up
cannam@89 1556 and release memory.</para>
cannam@89 1557
cannam@89 1558 <para>Possible return values:</para>
cannam@89 1559
cannam@89 1560 <programlisting>
cannam@89 1561 BZ_PARAM_ERROR
cannam@89 1562 if strm is NULL or strm->s is NULL
cannam@89 1563 or strm->avail_out < 1
cannam@89 1564 BZ_DATA_ERROR
cannam@89 1565 if a data integrity error is detected in the compressed stream
cannam@89 1566 BZ_DATA_ERROR_MAGIC
cannam@89 1567 if the compressed stream doesn't begin with the right magic bytes
cannam@89 1568 BZ_MEM_ERROR
cannam@89 1569 if there wasn't enough memory available
cannam@89 1570 BZ_STREAM_END
cannam@89 1571 if the logical end of the data stream was detected and all
cannam@89 1572 output has been consumed, eg strm->avail_out > 0
cannam@89 1573 BZ_OK
cannam@89 1574 otherwise
cannam@89 1575 </programlisting>
cannam@89 1576
cannam@89 1577 <para>Allowable next actions:</para>
cannam@89 1578
cannam@89 1579 <programlisting>
cannam@89 1580 BZ2_bzDecompress
cannam@89 1581 if BZ_OK was returned
cannam@89 1582 BZ2_bzDecompressEnd
cannam@89 1583 otherwise
cannam@89 1584 </programlisting>
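<para>The two tables above reduce to a three-way decision on the
return value. A minimal sketch, in which the
<computeroutput>NextStep</computeroutput> names and
<computeroutput>next_after_decompress</computeroutput> are
hypothetical and the constants repeat the values from
<computeroutput>bzlib.h</computeroutput>:</para>

```c
/* Values as defined in bzlib.h; repeated so the sketch is self-contained. */
#define BZ_OK          0
#define BZ_STREAM_END  4

/* Hypothetical labels for what the caller does next. */
typedef enum {
    NEXT_DECOMPRESS,   /* refill input / drain output, call BZ2_bzDecompress again */
    NEXT_END_OK,       /* all output delivered, call BZ2_bzDecompressEnd */
    NEXT_END_ERROR     /* call BZ2_bzDecompressEnd, then report the error */
} NextStep;

/* Maps a BZ2_bzDecompress return value to the allowable next action. */
NextStep next_after_decompress(int rc)
{
    if (rc == BZ_OK)         return NEXT_DECOMPRESS;
    if (rc == BZ_STREAM_END) return NEXT_END_OK;
    return NEXT_END_ERROR;
}
```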
cannam@89 1585
cannam@89 1586 </sect2>
cannam@89 1587
cannam@89 1588
cannam@89 1589 <sect2 id="bzDecompress-end" xreflabel="BZ2_bzDecompressEnd">
cannam@89 1590 <title>BZ2_bzDecompressEnd</title>
cannam@89 1591
cannam@89 1592 <programlisting>
cannam@89 1593 int BZ2_bzDecompressEnd ( bz_stream *strm );
cannam@89 1594 </programlisting>
cannam@89 1595
cannam@89 1596 <para>Releases all memory associated with a decompression
cannam@89 1597 stream.</para>
cannam@89 1598
cannam@89 1599 <para>Possible return values:</para>
cannam@89 1600
cannam@89 1601 <programlisting>
cannam@89 1602 BZ_PARAM_ERROR
cannam@89 1603 if strm is NULL or strm->s is NULL
cannam@89 1604 BZ_OK
cannam@89 1605 otherwise
cannam@89 1606 </programlisting>
cannam@89 1607
cannam@89 1608 <para>Allowable next actions:</para>
cannam@89 1609
cannam@89 1610 <programlisting>
cannam@89 1611 None.
cannam@89 1612 </programlisting>
cannam@89 1613
cannam@89 1614 </sect2>
cannam@89 1615
cannam@89 1616 </sect1>
cannam@89 1617
cannam@89 1618
cannam@89 1619 <sect1 id="hl-interface" xreflabel="High-level interface">
cannam@89 1620 <title>High-level interface</title>
cannam@89 1621
cannam@89 1622 <para>This interface provides functions for reading and writing
cannam@89 1623 <computeroutput>bzip2</computeroutput> format files. First, some
cannam@89 1624 general points.</para>
cannam@89 1625
cannam@89 1626 <itemizedlist mark='bullet'>
cannam@89 1627
cannam@89 1628 <listitem><para>All of the functions take an
cannam@89 1629 <computeroutput>int*</computeroutput> first argument,
cannam@89 1630 <computeroutput>bzerror</computeroutput>. After each call,
cannam@89 1631 <computeroutput>bzerror</computeroutput> should be consulted
cannam@89 1632 first to determine the outcome of the call. If
cannam@89 1633 <computeroutput>bzerror</computeroutput> is
cannam@89 1634 <computeroutput>BZ_OK</computeroutput>, the call completed
cannam@89 1635 successfully, and only then should the return value of the
cannam@89 1636 function (if any) be consulted. If
cannam@89 1637 <computeroutput>bzerror</computeroutput> is
cannam@89 1638 <computeroutput>BZ_IO_ERROR</computeroutput>, there was an
cannam@89 1639 error reading/writing the underlying compressed file, and you
cannam@89 1640 should then consult <computeroutput>errno</computeroutput> /
cannam@89 1641 <computeroutput>perror</computeroutput> to determine the cause
cannam@89 1642 of the difficulty. <computeroutput>bzerror</computeroutput>
cannam@89 1643 may also be set to various other values; precise details are
cannam@89 1644 given on a per-function basis below.</para></listitem>
cannam@89 1645
cannam@89 1646 <listitem><para>If <computeroutput>bzerror</computeroutput> indicates
cannam@89 1647 an error (ie, anything except
cannam@89 1648 <computeroutput>BZ_OK</computeroutput> and
cannam@89 1649 <computeroutput>BZ_STREAM_END</computeroutput>), you should
cannam@89 1650 immediately call
cannam@89 1651 <computeroutput>BZ2_bzReadClose</computeroutput> (or
cannam@89 1652 <computeroutput>BZ2_bzWriteClose</computeroutput>, depending on
cannam@89 1653 whether you are attempting to read or to write) to free up all
cannam@89 1654 resources associated with the stream. Once an error has been
cannam@89 1655 indicated, behaviour of all calls except
cannam@89 1656 <computeroutput>BZ2_bzReadClose</computeroutput>
cannam@89 1657 (<computeroutput>BZ2_bzWriteClose</computeroutput>) is
cannam@89 1658 undefined. The implication is that (1)
cannam@89 1659 <computeroutput>bzerror</computeroutput> should be checked
cannam@89 1660 after each call, and (2) if
cannam@89 1661 <computeroutput>bzerror</computeroutput> indicates an error,
cannam@89 1662 <computeroutput>BZ2_bzReadClose</computeroutput>
cannam@89 1663 (<computeroutput>BZ2_bzWriteClose</computeroutput>) should then
cannam@89 1664 be called to clean up.</para></listitem>
cannam@89 1665
cannam@89 1666 <listitem><para>The <computeroutput>FILE*</computeroutput> arguments
cannam@89 1667 passed to <computeroutput>BZ2_bzReadOpen</computeroutput> /
cannam@89 1668 <computeroutput>BZ2_bzWriteOpen</computeroutput> should be set
cannam@89 1669 to binary mode. Most Unix systems will do this by default, but
cannam@89 1670 other platforms, including Windows and Mac, will not. If you
cannam@89 1671 omit this, you may encounter problems when moving code to new
cannam@89 1672 platforms.</para></listitem>
cannam@89 1673
cannam@89 1674 <listitem><para>Memory allocation requests are handled by
cannam@89 1675 <computeroutput>malloc</computeroutput> /
cannam@89 1676 <computeroutput>free</computeroutput>. At present there is no
cannam@89 1677 facility for user-defined memory allocators in the file I/O
cannam@89 1678 functions (could easily be added, though).</para></listitem>
cannam@89 1679
cannam@89 1680 </itemizedlist>
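<para>The first two points above amount to two small tests on
<computeroutput>bzerror</computeroutput>. The helper names below
are hypothetical, and the constants repeat the values from
<computeroutput>bzlib.h</computeroutput> so that the sketch
compiles on its own:</para>

```c
/* Values as defined in bzlib.h; repeated so the sketch is self-contained. */
#define BZ_OK          0
#define BZ_STREAM_END  4
#define BZ_IO_ERROR    (-6)

/* Nonzero when bzerror calls for an immediate BZ2_bzReadClose
   (or BZ2_bzWriteClose) and no further calls on the stream. */
int bz_is_error(int bzerror)
{
    return bzerror != BZ_OK && bzerror != BZ_STREAM_END;
}

/* Nonzero when errno / perror may say more about the failure. */
int bz_should_check_errno(int bzerror)
{
    return bzerror == BZ_IO_ERROR;
}
```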
cannam@89 1681
cannam@89 1682
cannam@89 1683
cannam@89 1684 <sect2 id="bzreadopen" xreflabel="BZ2_bzReadOpen">
cannam@89 1685 <title>BZ2_bzReadOpen</title>
cannam@89 1686
cannam@89 1687 <programlisting>
cannam@89 1688 typedef void BZFILE;
cannam@89 1689
cannam@89 1690 BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f,
cannam@89 1691 int verbosity, int small,
cannam@89 1692 void *unused, int nUnused );
cannam@89 1693 </programlisting>
cannam@89 1694
cannam@89 1695 <para>Prepare to read compressed data from file handle
cannam@89 1696 <computeroutput>f</computeroutput>.
cannam@89 1697 <computeroutput>f</computeroutput> should refer to a file which
cannam@89 1698 has been opened for reading, and for which the error indicator
cannam@89 1699 (<computeroutput>ferror(f)</computeroutput>) is not set. If
cannam@89 1700 <computeroutput>small</computeroutput> is 1, the library will try
cannam@89 1701 to decompress using less memory, at the expense of speed.</para>
cannam@89 1702
cannam@89 1703 <para>For reasons explained below,
cannam@89 1704 <computeroutput>BZ2_bzRead</computeroutput> will decompress the
cannam@89 1705 <computeroutput>nUnused</computeroutput> bytes starting at
cannam@89 1706 <computeroutput>unused</computeroutput>, before starting to read
cannam@89 1707 from the file <computeroutput>f</computeroutput>. At most
cannam@89 1708 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes may be
cannam@89 1709 supplied like this. If this facility is not required, you should
cannam@89 1710 pass <computeroutput>NULL</computeroutput> and
cannam@89 1711 <computeroutput>0</computeroutput> for
cannam@89 1712 <computeroutput>unused</computeroutput> and
cannam@89 1713 <computeroutput>nUnused</computeroutput> respectively.</para>
cannam@89 1714
cannam@89 1715 <para>For the meaning of parameters
cannam@89 1716 <computeroutput>small</computeroutput> and
cannam@89 1717 <computeroutput>verbosity</computeroutput>, see
cannam@89 1718 <computeroutput>BZ2_bzDecompressInit</computeroutput>.</para>
cannam@89 1719
cannam@89 1720 <para>The amount of memory needed to decompress a file cannot be
cannam@89 1721 determined until the file's header has been read. So it is
cannam@89 1722 possible that <computeroutput>BZ2_bzReadOpen</computeroutput>
cannam@89 1723 returns <computeroutput>BZ_OK</computeroutput> but a subsequent
cannam@89 1724 call of <computeroutput>BZ2_bzRead</computeroutput> will return
cannam@89 1725 <computeroutput>BZ_MEM_ERROR</computeroutput>.</para>
cannam@89 1726
cannam@89 1727 <para>Possible assignments to
cannam@89 1728 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 1729
cannam@89 1730 <programlisting>
cannam@89 1731 BZ_CONFIG_ERROR
cannam@89 1732 if the library has been mis-compiled
cannam@89 1733 BZ_PARAM_ERROR
cannam@89 1734 if f is NULL
cannam@89 1735 or small is neither 0 nor 1
cannam@89 1736 or ( unused == NULL && nUnused != 0 )
cannam@89 1737 or ( unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED) )
cannam@89 1738 BZ_IO_ERROR
cannam@89 1739 if ferror(f) is nonzero
cannam@89 1740 BZ_MEM_ERROR
cannam@89 1741 if insufficient memory is available
cannam@89 1742 BZ_OK
cannam@89 1743 otherwise.
cannam@89 1744 </programlisting>
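<para>The range condition on
<computeroutput>nUnused</computeroutput> in the listing above is
written in mathematical shorthand; in C it has to be spelled as two
comparisons, because a chained comparison would compare the result
of the first test (0 or 1) against the upper bound. A sketch of the
documented check follows, where
<computeroutput>check_unused_args</computeroutput> is a
hypothetical helper and the constants repeat the values from
<computeroutput>bzlib.h</computeroutput>:</para>

```c
#include <stddef.h>

/* Values as defined in bzlib.h; repeated so the sketch is self-contained. */
#define BZ_OK           0
#define BZ_PARAM_ERROR  (-2)
#define BZ_MAX_UNUSED   5000

/* Hypothetical helper: applies the two documented conditions on the
   unused / nUnused pair and returns BZ_PARAM_ERROR if either holds. */
int check_unused_args(const void *unused, int nUnused)
{
    if (unused == NULL && nUnused != 0)
        return BZ_PARAM_ERROR;
    if (unused != NULL && (nUnused < 0 || nUnused > BZ_MAX_UNUSED))
        return BZ_PARAM_ERROR;
    return BZ_OK;
}
```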
cannam@89 1745
cannam@89 1746 <para>Possible return values:</para>
cannam@89 1747
cannam@89 1748 <programlisting>
cannam@89 1749 Pointer to an abstract BZFILE
cannam@89 1750 if bzerror is BZ_OK
cannam@89 1751 NULL
cannam@89 1752 otherwise
cannam@89 1753 </programlisting>
cannam@89 1754
cannam@89 1755 <para>Allowable next actions:</para>
cannam@89 1756
cannam@89 1757 <programlisting>
cannam@89 1758 BZ2_bzRead
cannam@89 1759 if bzerror is BZ_OK
cannam@89 1760 BZ2_bzReadClose
cannam@89 1761 otherwise
cannam@89 1762 </programlisting>
cannam@89 1763
cannam@89 1764 </sect2>
cannam@89 1765
cannam@89 1766
cannam@89 1767 <sect2 id="bzread" xreflabel="BZ2_bzRead">
cannam@89 1768 <title>BZ2_bzRead</title>
cannam@89 1769
cannam@89 1770 <programlisting>
cannam@89 1771 int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
cannam@89 1772 </programlisting>
cannam@89 1773
cannam@89 1774 <para>Reads up to <computeroutput>len</computeroutput>
cannam@89 1775 (uncompressed) bytes from the compressed file
cannam@89 1776 <computeroutput>b</computeroutput> into the buffer
cannam@89 1777 <computeroutput>buf</computeroutput>. If the read was
cannam@89 1778 successful, <computeroutput>bzerror</computeroutput> is set to
cannam@89 1779 <computeroutput>BZ_OK</computeroutput> and the number of bytes
cannam@89 1780 read is returned. If the logical end-of-stream was detected,
cannam@89 1781 <computeroutput>bzerror</computeroutput> will be set to
cannam@89 1782 <computeroutput>BZ_STREAM_END</computeroutput>, and the number of
cannam@89 1783 bytes read is returned. All other
cannam@89 1784 <computeroutput>bzerror</computeroutput> values denote an
cannam@89 1785 error.</para>
cannam@89 1786
cannam@89 1787 <para><computeroutput>BZ2_bzRead</computeroutput> will supply
cannam@89 1788 <computeroutput>len</computeroutput> bytes, unless the logical
cannam@89 1789 stream end is detected or an error occurs. Because of this, it
cannam@89 1790 is possible to detect the stream end by observing when the number
cannam@89 1791 of bytes returned is less than the number requested.
cannam@89 1792 Nevertheless, this is regarded as inadvisable; you should instead
cannam@89 1793 check <computeroutput>bzerror</computeroutput> after every call
cannam@89 1794 and watch out for
cannam@89 1795 <computeroutput>BZ_STREAM_END</computeroutput>.</para>
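<para>A read loop that follows this advice tests
<computeroutput>bzerror</computeroutput> rather than the byte
count. In the sketch below,
<computeroutput>bz_read_fn</computeroutput>,
<computeroutput>read_to_stream_end</computeroutput> and the
stand-in <computeroutput>fake_read</computeroutput> are
hypothetical. The callback type is assumed to match the prototype
of <computeroutput>BZ2_bzRead</computeroutput> shown above, so real
code could probably pass that function directly.</para>

```c
#include <string.h>

/* Values as defined in bzlib.h; repeated so the sketch is self-contained. */
#define BZ_OK          0
#define BZ_STREAM_END  4

/* Same shape as the BZ2_bzRead prototype (BZFILE is a typedef for void). */
typedef int (*bz_read_fn)(int *bzerror, void *b, void *buf, int len);

/* Reads until bzerror is no longer BZ_OK, adding up the bytes delivered.
   Returns the final bzerror value; *total receives the byte count. */
int read_to_stream_end(bz_read_fn rd, void *b, char *buf, int len, long *total)
{
    int bzerror = BZ_OK;
    *total = 0;
    while (bzerror == BZ_OK) {
        int n = rd(&bzerror, b, buf, len);
        if (bzerror == BZ_OK || bzerror == BZ_STREAM_END)
            *total += n;   /* the count is only defined for these two outcomes */
    }
    return bzerror;
}

/* Stand-in reader, used only to exercise the loop: *(int *)b holds the
   number of bytes left in an imaginary stream of 'x' characters. */
int fake_read(int *bzerror, void *b, void *buf, int len)
{
    int *left = b;
    int n = (*left < len) ? *left : len;
    memset(buf, 'x', (size_t)n);
    *left -= n;
    *bzerror = (*left == 0) ? BZ_STREAM_END : BZ_OK;
    return n;
}
```

<para>With the stand-in reader, a stream whose length is an exact
multiple of <computeroutput>len</computeroutput> ends on a call
that returns a full <computeroutput>len</computeroutput> bytes, so
a test on the byte count alone would miss the end of the stream in
that case.</para>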
cannam@89 1796
cannam@89 1797 <para>Internally, <computeroutput>BZ2_bzRead</computeroutput>
cannam@89 1798 copies data from the compressed file in chunks of size
cannam@89 1799 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes before
cannam@89 1800 decompressing it. If the file contains more bytes than strictly
cannam@89 1801 needed to reach the logical end-of-stream,
cannam@89 1802 <computeroutput>BZ2_bzRead</computeroutput> will almost certainly
cannam@89 1803 read some of the trailing data before signalling
cannam@89 1804 <computeroutput>BZ_STREAM_END</computeroutput>. To collect the
cannam@89 1805 read but unused data once
cannam@89 1806 <computeroutput>BZ_STREAM_END</computeroutput> has appeared,
cannam@89 1807 call <computeroutput>BZ2_bzReadGetUnused</computeroutput>
cannam@89 1808 immediately before
cannam@89 1809 <computeroutput>BZ2_bzReadClose</computeroutput>.</para>
cannam@89 1810
cannam@89 1811 <para>Possible assignments to
cannam@89 1812 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 1813
cannam@89 1814 <programlisting>
cannam@89 1815 BZ_PARAM_ERROR
cannam@89 1816 if b is NULL or buf is NULL or len < 0
cannam@89 1817 BZ_SEQUENCE_ERROR
cannam@89 1818 if b was opened with BZ2_bzWriteOpen
cannam@89 1819 BZ_IO_ERROR
cannam@89 1820 if there is an error reading from the compressed file
cannam@89 1821 BZ_UNEXPECTED_EOF
cannam@89 1822 if the compressed file ended before
cannam@89 1823 the logical end-of-stream was detected
cannam@89 1824 BZ_DATA_ERROR
cannam@89 1825 if a data integrity error was detected in the compressed stream
cannam@89 1826 BZ_DATA_ERROR_MAGIC
cannam@89 1827 if the stream does not begin with the requisite header bytes
cannam@89 1828 (ie, is not a bzip2 data file). This is really
cannam@89 1829 a special case of BZ_DATA_ERROR.
cannam@89 1830 BZ_MEM_ERROR
cannam@89 1831 if insufficient memory was available
cannam@89 1832 BZ_STREAM_END
cannam@89 1833 if the logical end of stream was detected.
cannam@89 1834 BZ_OK
cannam@89 1835 otherwise.
cannam@89 1836 </programlisting>
cannam@89 1837
cannam@89 1838 <para>Possible return values:</para>
cannam@89 1839
cannam@89 1840 <programlisting>
cannam@89 1841 number of bytes read
cannam@89 1842 if bzerror is BZ_OK or BZ_STREAM_END
cannam@89 1843 undefined
cannam@89 1844 otherwise
cannam@89 1845 </programlisting>
cannam@89 1846
cannam@89 1847 <para>Allowable next actions:</para>
cannam@89 1848
cannam@89 1849 <programlisting>
cannam@89 1850 collect data from buf, then BZ2_bzRead or BZ2_bzReadClose
cannam@89 1851 if bzerror is BZ_OK
cannam@89 1852 collect data from buf, then BZ2_bzReadClose or BZ2_bzReadGetUnused
cannam@89 1853 if bzerror is BZ_STREAM_END
cannam@89 1854 BZ2_bzReadClose
cannam@89 1855 otherwise
cannam@89 1856 </programlisting>
cannam@89 1857
cannam@89 1858 </sect2>
cannam@89 1859
cannam@89 1860
cannam@89 1861 <sect2 id="bzreadgetunused" xreflabel="BZ2_bzReadGetUnused">
cannam@89 1862 <title>BZ2_bzReadGetUnused</title>
cannam@89 1863
cannam@89 1864 <programlisting>
cannam@89 1865 void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b,
cannam@89 1866 void** unused, int* nUnused );
cannam@89 1867 </programlisting>
cannam@89 1868
cannam@89 1869 <para>Returns data which was read from the compressed file but
cannam@89 1870 was not needed to get to the logical end-of-stream.
cannam@89 1871 <computeroutput>*unused</computeroutput> is set to the address of
cannam@89 1872 the data, and <computeroutput>*nUnused</computeroutput> to the
cannam@89 1873 number of bytes. <computeroutput>*nUnused</computeroutput> will
cannam@89 1874 be set to a value between <computeroutput>0</computeroutput> and
cannam@89 1875 <computeroutput>BZ_MAX_UNUSED</computeroutput> inclusive.</para>
cannam@89 1876
cannam@89 1877 <para>This function may only be called once
cannam@89 1878 <computeroutput>BZ2_bzRead</computeroutput> has signalled
cannam@89 1879 <computeroutput>BZ_STREAM_END</computeroutput> but before
cannam@89 1880 <computeroutput>BZ2_bzReadClose</computeroutput>.</para>
cannam@89 1881
cannam@89 1882 <para>Possible assignments to
cannam@89 1883 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 1884
cannam@89 1885 <programlisting>
cannam@89 1886 BZ_PARAM_ERROR
cannam@89 1887 if b is NULL
cannam@89 1888 or unused is NULL or nUnused is NULL
cannam@89 1889 BZ_SEQUENCE_ERROR
cannam@89 1890 if BZ_STREAM_END has not been signalled
cannam@89 1891 or if b was opened with BZ2_bzWriteOpen
cannam@89 1892 BZ_OK
cannam@89 1893 otherwise
cannam@89 1894 </programlisting>
cannam@89 1895
cannam@89 1896 <para>Allowable next actions:</para>
cannam@89 1897
cannam@89 1898 <programlisting>
cannam@89 1899 BZ2_bzReadClose
cannam@89 1900 </programlisting>
cannam@89 1901
cannam@89 1902 </sect2>
cannam@89 1903
cannam@89 1904
cannam@89 1905 <sect2 id="bzreadclose" xreflabel="BZ2_bzReadClose">
cannam@89 1906 <title>BZ2_bzReadClose</title>
cannam@89 1907
cannam@89 1908 <programlisting>
cannam@89 1909 void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
cannam@89 1910 </programlisting>
cannam@89 1911
cannam@89 1912 <para>Releases all memory pertaining to the compressed file
cannam@89 1913 <computeroutput>b</computeroutput>.
cannam@89 1914 <computeroutput>BZ2_bzReadClose</computeroutput> does not call
cannam@89 1915 <computeroutput>fclose</computeroutput> on the underlying file
cannam@89 1916 handle, so you should do that yourself if appropriate.
cannam@89 1917 <computeroutput>BZ2_bzReadClose</computeroutput> should be called
cannam@89 1918 to clean up after all error situations.</para>
cannam@89 1919
cannam@89 1920 <para>Possible assignments to
cannam@89 1921 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 1922
cannam@89 1923 <programlisting>
cannam@89 1924 BZ_SEQUENCE_ERROR
cannam@89 1925 if b was opened with BZ2_bzWriteOpen
cannam@89 1926 BZ_OK
cannam@89 1927 otherwise
cannam@89 1928 </programlisting>
cannam@89 1929
cannam@89 1930 <para>Allowable next actions:</para>
cannam@89 1931
cannam@89 1932 <programlisting>
cannam@89 1933 none
cannam@89 1934 </programlisting>
cannam@89 1935
cannam@89 1936 </sect2>
cannam@89 1937
cannam@89 1938
cannam@89 1939 <sect2 id="bzwriteopen" xreflabel="BZ2_bzWriteOpen">
cannam@89 1940 <title>BZ2_bzWriteOpen</title>
cannam@89 1941
cannam@89 1942 <programlisting>
cannam@89 1943 BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f,
cannam@89 1944 int blockSize100k, int verbosity,
cannam@89 1945 int workFactor );
cannam@89 1946 </programlisting>
cannam@89 1947
cannam@89 1948 <para>Prepare to write compressed data to file handle
cannam@89 1949 <computeroutput>f</computeroutput>.
cannam@89 1950 <computeroutput>f</computeroutput> should refer to a file which
cannam@89 1951 has been opened for writing, and for which the error indicator
cannam@89 1952 (<computeroutput>ferror(f)</computeroutput>) is not set.</para>
cannam@89 1953
cannam@89 1954 <para>For the meaning of parameters
cannam@89 1955 <computeroutput>blockSize100k</computeroutput>,
cannam@89 1956 <computeroutput>verbosity</computeroutput> and
cannam@89 1957 <computeroutput>workFactor</computeroutput>, see
cannam@89 1958 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para>
cannam@89 1959
cannam@89 1960 <para>All required memory is allocated at this stage, so if the
cannam@89 1961 call completes successfully,
cannam@89 1962 <computeroutput>BZ_MEM_ERROR</computeroutput> cannot be signalled
cannam@89 1963 by a subsequent call to
cannam@89 1964 <computeroutput>BZ2_bzWrite</computeroutput>.</para>
cannam@89 1965
cannam@89 1966 <para>Possible assignments to
cannam@89 1967 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 1968
cannam@89 1969 <programlisting>
cannam@89 1970 BZ_CONFIG_ERROR
cannam@89 1971 if the library has been mis-compiled
cannam@89 1972 BZ_PARAM_ERROR
cannam@89 1973 if f is NULL
cannam@89 1974 or blockSize100k < 1 or blockSize100k > 9
cannam@89 1975 BZ_IO_ERROR
cannam@89 1976 if ferror(f) is nonzero
cannam@89 1977 BZ_MEM_ERROR
cannam@89 1978 if insufficient memory is available
cannam@89 1979 BZ_OK
cannam@89 1980 otherwise
cannam@89 1981 </programlisting>
cannam@89 1982
cannam@89 1983 <para>Possible return values:</para>
cannam@89 1984
cannam@89 1985 <programlisting>
cannam@89 1986 Pointer to an abstract BZFILE
cannam@89 1987 if bzerror is BZ_OK
cannam@89 1988 NULL
cannam@89 1989 otherwise
cannam@89 1990 </programlisting>
cannam@89 1991
cannam@89 1992 <para>Allowable next actions:</para>
cannam@89 1993
cannam@89 1994 <programlisting>
cannam@89 1995 BZ2_bzWrite
cannam@89 1996 if bzerror is BZ_OK
cannam@89 1997 (you could go directly to BZ2_bzWriteClose, but this would be pretty pointless)
cannam@89 1998 BZ2_bzWriteClose
cannam@89 1999 otherwise
cannam@89 2000 </programlisting>
cannam@89 2001
cannam@89 2002 </sect2>
cannam@89 2003
cannam@89 2004
cannam@89 2005 <sect2 id="bzwrite" xreflabel="BZ2_bzWrite">
cannam@89 2006 <title>BZ2_bzWrite</title>
cannam@89 2007
cannam@89 2008 <programlisting>
cannam@89 2009 void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
cannam@89 2010 </programlisting>
cannam@89 2011
cannam@89 2012 <para>Absorbs <computeroutput>len</computeroutput> bytes from the
cannam@89 2013 buffer <computeroutput>buf</computeroutput>, eventually to be
cannam@89 2014 compressed and written to the file.</para>
cannam@89 2015
cannam@89 2016 <para>Possible assignments to
cannam@89 2017 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 2018
cannam@89 2019 <programlisting>
cannam@89 2020 BZ_PARAM_ERROR
cannam@89 2021 if b is NULL or buf is NULL or len < 0
cannam@89 2022 BZ_SEQUENCE_ERROR
cannam@89 2023 if b was opened with BZ2_bzReadOpen
cannam@89 2024 BZ_IO_ERROR
cannam@89 2025 if there is an error writing the compressed file.
cannam@89 2026 BZ_OK
cannam@89 2027 otherwise
cannam@89 2028 </programlisting>
cannam@89 2029
cannam@89 2030 </sect2>
cannam@89 2031
cannam@89 2032
cannam@89 2033 <sect2 id="bzwriteclose" xreflabel="BZ2_bzWriteClose">
cannam@89 2034 <title>BZ2_bzWriteClose</title>
cannam@89 2035
cannam@89 2036 <programlisting>
cannam@89 2037 void BZ2_bzWriteClose( int *bzerror, BZFILE* b,
cannam@89 2038 int abandon,
cannam@89 2039 unsigned int* nbytes_in,
cannam@89 2040 unsigned int* nbytes_out );
cannam@89 2041
cannam@89 2042 void BZ2_bzWriteClose64( int *bzerror, BZFILE* b,
cannam@89 2043 int abandon,
cannam@89 2044 unsigned int* nbytes_in_lo32,
cannam@89 2045 unsigned int* nbytes_in_hi32,
cannam@89 2046 unsigned int* nbytes_out_lo32,
cannam@89 2047 unsigned int* nbytes_out_hi32 );
cannam@89 2048 </programlisting>
cannam@89 2049
cannam@89 2050 <para>Compresses and flushes to the compressed file all data so
cannam@89 2051 far supplied by <computeroutput>BZ2_bzWrite</computeroutput>.
cannam@89 2052 The logical end-of-stream markers are also written, so subsequent
cannam@89 2053 calls to <computeroutput>BZ2_bzWrite</computeroutput> are
cannam@89 2054 illegal. All memory associated with the compressed file
cannam@89 2055 <computeroutput>b</computeroutput> is released.
cannam@89 2056 <computeroutput>fflush</computeroutput> is called on the
cannam@89 2057 compressed file, but it is not
cannam@89 2058 <computeroutput>fclose</computeroutput>'d.</para>
cannam@89 2059
cannam@89 2060 <para>If <computeroutput>BZ2_bzWriteClose</computeroutput> is
cannam@89 2061 called to clean up after an error, the only action is to release
cannam@89 2062 the memory. The library records the error codes issued by
cannam@89 2063 previous calls, so this situation will be detected automatically.
cannam@89 2064 There is no attempt to complete the compression operation, nor to
cannam@89 2065 <computeroutput>fflush</computeroutput> the compressed file. You
cannam@89 2066 can force this behaviour to happen even in the case of no error,
cannam@89 2067 by passing a nonzero value to
cannam@89 2068 <computeroutput>abandon</computeroutput>.</para>
cannam@89 2069
cannam@89 2070 <para>If <computeroutput>nbytes_in</computeroutput> is non-null,
cannam@89 2071 <computeroutput>*nbytes_in</computeroutput> will be set to the
cannam@89 2072 total volume of uncompressed data handled. Similarly,
cannam@89 2073 <computeroutput>*nbytes_out</computeroutput> will be set to the
cannam@89 2074 total volume of compressed data written. For compatibility with
cannam@89 2075 older versions of the library,
cannam@89 2076 <computeroutput>BZ2_bzWriteClose</computeroutput> only yields the
cannam@89 2077 lower 32 bits of these counts. Use
cannam@89 2078 <computeroutput>BZ2_bzWriteClose64</computeroutput> if you want
cannam@89 2079 the full 64 bit counts. These two functions are otherwise
cannam@89 2080 absolutely identical.</para>
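
<para>As a rough illustration, the sketch below joins the two 32-bit
halves reported by <computeroutput>BZ2_bzWriteClose64</computeroutput>
into a single 64-bit count. The helper name
<computeroutput>bz_join_count</computeroutput> is made up for this
example and is not part of the library.</para>

<programlisting>
```c
#include <stdint.h>

/* Hypothetical helper (not part of libbzip2): joins the low and high
   32-bit halves reported by BZ2_bzWriteClose64 into one 64-bit count. */
uint64_t bz_join_count ( unsigned int lo32, unsigned int hi32 )
{
   /* Mask to 32 bits in case unsigned int is wider on this platform. */
   uint64_t lo = (uint64_t)lo32 & 0xFFFFFFFFu;
   uint64_t hi = (uint64_t)hi32 & 0xFFFFFFFFu;
   return ( hi << 32 ) | lo;
}
```
</programlisting>

<para>You would pass it the values stored through
<computeroutput>nbytes_in_lo32</computeroutput> and
<computeroutput>nbytes_in_hi32</computeroutput> (or the corresponding
<computeroutput>nbytes_out</computeroutput> pair) after the close call
has returned.</para>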
cannam@89 2081
cannam@89 2082 <para>Possible assignments to
cannam@89 2083 <computeroutput>bzerror</computeroutput>:</para>
cannam@89 2084
cannam@89 2085 <programlisting>
cannam@89 2086 BZ_SEQUENCE_ERROR
cannam@89 2087 if b was opened with BZ2_bzReadOpen
cannam@89 2088 BZ_IO_ERROR
cannam@89 2089 if there is an error writing the compressed file
cannam@89 2090 BZ_OK
cannam@89 2091 otherwise
cannam@89 2092 </programlisting>
cannam@89 2093
cannam@89 2094 </sect2>
cannam@89 2095
cannam@89 2096
cannam@89 2097 <sect2 id="embed" xreflabel="Handling embedded compressed data streams">
cannam@89 2098 <title>Handling embedded compressed data streams</title>
cannam@89 2099
cannam@89 2100 <para>The high-level library facilitates use of
cannam@89 2101 <computeroutput>bzip2</computeroutput> data streams which form
cannam@89 2102 some part of a surrounding, larger data stream.</para>
cannam@89 2103
cannam@89 2104 <itemizedlist mark='bullet'>
cannam@89 2105
cannam@89 2106 <listitem><para>For writing, the library takes an open file handle,
cannam@89 2107 writes compressed data to it,
cannam@89 2108 <computeroutput>fflush</computeroutput>es it but does not
cannam@89 2109 <computeroutput>fclose</computeroutput> it. The calling
cannam@89 2110 application can write its own data before and after the
cannam@89 2111 compressed data stream, using that same file handle.</para></listitem>
cannam@89 2112
cannam@89 2113 <listitem><para>Reading is more complex, and the facilities are not as
cannam@89 2114 general as they could be since generality is hard to reconcile
cannam@89 2115 with efficiency. <computeroutput>BZ2_bzRead</computeroutput>
cannam@89 2116 reads from the compressed file in blocks of size
cannam@89 2117 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes, and in
cannam@89 2118 doing so will probably overshoot the logical end of the compressed
cannam@89 2119 stream. To recover this data once decompression has ended,
cannam@89 2120 call <computeroutput>BZ2_bzReadGetUnused</computeroutput> after
cannam@89 2121 the last call of <computeroutput>BZ2_bzRead</computeroutput>
cannam@89 2122 (the one returning
cannam@89 2123 <computeroutput>BZ_STREAM_END</computeroutput>) but before
cannam@89 2124 calling
cannam@89 2125 <computeroutput>BZ2_bzReadClose</computeroutput>.</para></listitem>
cannam@89 2126
cannam@89 2127 </itemizedlist>
cannam@89 2128
cannam@89 2129 <para>This mechanism makes it easy to decompress multiple
cannam@89 2130 <computeroutput>bzip2</computeroutput> streams placed end-to-end.
cannam@89 2131 At the end of one stream, when
cannam@89 2132 <computeroutput>BZ2_bzRead</computeroutput> returns
cannam@89 2133 <computeroutput>BZ_STREAM_END</computeroutput>, call
cannam@89 2134 <computeroutput>BZ2_bzReadGetUnused</computeroutput> to collect
cannam@89 2135 the unused data (copy it into your own buffer somewhere). That
cannam@89 2136 data forms the start of the next compressed stream. To start
cannam@89 2137 uncompressing that next stream, call
cannam@89 2138 <computeroutput>BZ2_bzReadOpen</computeroutput> again, feeding in
cannam@89 2139 the unused data via the <computeroutput>unused</computeroutput> /
cannam@89 2140 <computeroutput>nUnused</computeroutput> parameters. Keep doing
cannam@89 2141 this until a <computeroutput>BZ_STREAM_END</computeroutput> return
cannam@89 2142 coincides with the physical end of file
cannam@89 2143 (<computeroutput>feof(f)</computeroutput>). In this situation
cannam@89 2144 <computeroutput>BZ2_bzReadGetUnused</computeroutput> will of
cannam@89 2145 course return no data.</para>
cannam@89 2146
cannam@89 2147 <para>This should give some feel for how the high-level interface
cannam@89 2148 can be used. If you require extra flexibility, you'll have to
cannam@89 2149 bite the bullet and get to grips with the low-level
cannam@89 2150 interface.</para>
cannam@89 2151
cannam@89 2152 </sect2>
cannam@89 2153
cannam@89 2154
cannam@89 2155 <sect2 id="std-rdwr" xreflabel="Standard file-reading/writing code">
cannam@89 2156 <title>Standard file-reading/writing code</title>
cannam@89 2157
cannam@89 2158 <para>Here's how you'd write data to a compressed file:</para>
cannam@89 2159
cannam@89 2160 <programlisting>
cannam@89 2161 FILE* f;
cannam@89 2162 BZFILE* b;
cannam@89 2163 int nBuf;
cannam@89 2164 char buf[ /* whatever size you like */ ];
cannam@89 2165 int bzerror;
cannam@89 2167
cannam@89 2168 f = fopen ( "myfile.bz2", "wb" );
cannam@89 2169 if ( !f ) {
cannam@89 2170 /* handle error */
cannam@89 2171 }
cannam@89 2172 b = BZ2_bzWriteOpen( &bzerror, f, 9, 0, 0 ); /* blockSize100k, verbosity, workFactor */
cannam@89 2173 if (bzerror != BZ_OK) {
cannam@89 2174 BZ2_bzWriteClose ( &bzerror, b, 1, NULL, NULL ); /* abandon: just release memory */
cannam@89 2175 /* handle error */
cannam@89 2176 }
cannam@89 2177
cannam@89 2178 while ( /* condition */ ) {
cannam@89 2179 /* get data to write into buf, and set nBuf appropriately */
cannam@89 2180 BZ2_bzWrite ( &bzerror, b, buf, nBuf ); /* returns void; status is in bzerror */
cannam@89 2181 if (bzerror == BZ_IO_ERROR) {
cannam@89 2182 BZ2_bzWriteClose ( &bzerror, b, 1, NULL, NULL );
cannam@89 2183 /* handle error */
cannam@89 2184 }
cannam@89 2185 }
cannam@89 2186
cannam@89 2187 BZ2_bzWriteClose( &bzerror, b, 0, NULL, NULL ); /* finish and flush; byte counts not wanted */
cannam@89 2188 if (bzerror == BZ_IO_ERROR) {
cannam@89 2189 /* handle error */
cannam@89 2190 }
cannam@89 2191 </programlisting>
cannam@89 2192
cannam@89 2193 <para>And to read from a compressed file:</para>
cannam@89 2194
cannam@89 2195 <programlisting>
cannam@89 2196 FILE* f;
cannam@89 2197 BZFILE* b;
cannam@89 2198 int nBuf;
cannam@89 2199 char buf[ /* whatever size you like */ ];
cannam@89 2200 int bzerror;
cannam@89 2202
cannam@89 2203 f = fopen ( "myfile.bz2", "rb" );
cannam@89 2204 if ( !f ) {
cannam@89 2205 /* handle error */
cannam@89 2206 }
cannam@89 2207 b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 ); /* verbosity, small, unused, nUnused */
cannam@89 2208 if ( bzerror != BZ_OK ) {
cannam@89 2209 BZ2_bzReadClose ( &bzerror, b );
cannam@89 2210 /* handle error */
cannam@89 2211 }
cannam@89 2212
cannam@89 2213 bzerror = BZ_OK;
cannam@89 2214 while ( bzerror == BZ_OK && /* arbitrary other conditions */) {
cannam@89 2215 nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
cannam@89 2216 if ( bzerror == BZ_OK || bzerror == BZ_STREAM_END ) { /* the final call may also return data */
cannam@89 2217 /* do something with buf[0 .. nBuf-1] */
cannam@89 2218 }
cannam@89 2219 }
cannam@89 2220 if ( bzerror != BZ_STREAM_END ) {
cannam@89 2221 BZ2_bzReadClose ( &bzerror, b );
cannam@89 2222 /* handle error */
cannam@89 2223 } else {
cannam@89 2224 BZ2_bzReadClose ( &bzerror, b );
cannam@89 2225 }
cannam@89 2226 </programlisting>
cannam@89 2227
cannam@89 2228 </sect2>
cannam@89 2229
cannam@89 2230 </sect1>
cannam@89 2231
cannam@89 2232
cannam@89 2233 <sect1 id="util-fns" xreflabel="Utility functions">
cannam@89 2234 <title>Utility functions</title>
cannam@89 2235
cannam@89 2236
cannam@89 2237 <sect2 id="bzbufftobuffcompress" xreflabel="BZ2_bzBuffToBuffCompress">
cannam@89 2238 <title>BZ2_bzBuffToBuffCompress</title>
cannam@89 2239
cannam@89 2240 <programlisting>
cannam@89 2241 int BZ2_bzBuffToBuffCompress( char* dest,
cannam@89 2242 unsigned int* destLen,
cannam@89 2243 char* source,
cannam@89 2244 unsigned int sourceLen,
cannam@89 2245 int blockSize100k,
cannam@89 2246 int verbosity,
cannam@89 2247 int workFactor );
cannam@89 2248 </programlisting>
cannam@89 2249
cannam@89 2250 <para>Attempts to compress the data in <computeroutput>source[0
cannam@89 2251 .. sourceLen-1]</computeroutput> into the destination buffer,
cannam@89 2252 <computeroutput>dest[0 .. *destLen-1]</computeroutput>. If the
cannam@89 2253 destination buffer is big enough,
cannam@89 2254 <computeroutput>*destLen</computeroutput> is set to the size of
cannam@89 2255 the compressed data, and <computeroutput>BZ_OK</computeroutput>
cannam@89 2256 is returned. If the compressed data won't fit,
cannam@89 2257 <computeroutput>*destLen</computeroutput> is unchanged, and
cannam@89 2258 <computeroutput>BZ_OUTBUFF_FULL</computeroutput> is
cannam@89 2259 returned.</para>
cannam@89 2260
cannam@89 2261 <para>Compression in this manner is a one-shot event, done with a
cannam@89 2262 single call to this function. The resulting compressed data is a
cannam@89 2263 complete <computeroutput>bzip2</computeroutput> format data
cannam@89 2264 stream. There is no mechanism for making additional calls to
cannam@89 2265 provide extra input data. If you want that kind of mechanism,
cannam@89 2266 use the low-level interface.</para>
cannam@89 2267
cannam@89 2268 <para>For the meaning of parameters
cannam@89 2269 <computeroutput>blockSize100k</computeroutput>,
cannam@89 2270 <computeroutput>verbosity</computeroutput> and
cannam@89 2271 <computeroutput>workFactor</computeroutput>, see
cannam@89 2272 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para>
cannam@89 2273
cannam@89 2274 <para>To guarantee that the compressed data will fit in its
cannam@89 2275 buffer, allocate an output buffer of size 1% larger than the
cannam@89 2276 uncompressed data, plus six hundred extra bytes.</para>
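
<para>That sizing rule could be written down roughly as below. The
name <computeroutput>bz_compress_bound</computeroutput> is hypothetical
(the library offers no such function), and rounding the 1% upwards and
returning 0 on overflow are choices made for this sketch.</para>

<programlisting>
```c
#include <limits.h>

/* Hypothetical helper (not part of libbzip2): output buffer size
   following the rule above, i.e. the input size plus 1% (rounded up)
   plus 600 bytes.  Returns 0 if that does not fit in an unsigned int. */
unsigned int bz_compress_bound ( unsigned int sourceLen )
{
   unsigned int onePercent = sourceLen / 100 + ( sourceLen % 100 != 0 );
   unsigned int extra      = onePercent + 600;
   if ( sourceLen > UINT_MAX - extra ) return 0;
   return sourceLen + extra;
}
```
</programlisting>

<para>The result is what you would allocate for
<computeroutput>dest</computeroutput> and store in
<computeroutput>*destLen</computeroutput> before calling
<computeroutput>BZ2_bzBuffToBuffCompress</computeroutput>.</para>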
cannam@89 2277
cannam@89 2278 <para><computeroutput>BZ2_bzBuffToBuffCompress</computeroutput>
cannam@89 2279 will not write data at or beyond
cannam@89 2280 <computeroutput>dest[*destLen]</computeroutput>, even in case of
cannam@89 2281 buffer overflow.</para>
cannam@89 2282
cannam@89 2283 <para>Possible return values:</para>
cannam@89 2284
cannam@89 2285 <programlisting>
cannam@89 2286 BZ_CONFIG_ERROR
cannam@89 2287 if the library has been mis-compiled
cannam@89 2288 BZ_PARAM_ERROR
cannam@89 2289 if dest is NULL or destLen is NULL
cannam@89 2290 or blockSize100k < 1 or blockSize100k > 9
cannam@89 2291 or verbosity < 0 or verbosity > 4
cannam@89 2292 or workFactor < 0 or workFactor > 250
cannam@89 2293 BZ_MEM_ERROR
cannam@89 2294 if insufficient memory is available
cannam@89 2295 BZ_OUTBUFF_FULL
cannam@89 2296 if the size of the compressed data exceeds *destLen
cannam@89 2297 BZ_OK
cannam@89 2298 otherwise
cannam@89 2299 </programlisting>
cannam@89 2300
cannam@89 2301 </sect2>
cannam@89 2302
cannam@89 2303
cannam@89 2304 <sect2 id="bzbufftobuffdecompress" xreflabel="BZ2_bzBuffToBuffDecompress">
cannam@89 2305 <title>BZ2_bzBuffToBuffDecompress</title>
cannam@89 2306
cannam@89 2307 <programlisting>
cannam@89 2308 int BZ2_bzBuffToBuffDecompress( char* dest,
cannam@89 2309 unsigned int* destLen,
cannam@89 2310 char* source,
cannam@89 2311 unsigned int sourceLen,
cannam@89 2312 int small,
cannam@89 2313 int verbosity );
cannam@89 2314 </programlisting>
cannam@89 2315
cannam@89 2316 <para>Attempts to decompress the data in <computeroutput>source[0
cannam@89 2317 .. sourceLen-1]</computeroutput> into the destination buffer,
cannam@89 2318 <computeroutput>dest[0 .. *destLen-1]</computeroutput>. If the
cannam@89 2319 destination buffer is big enough,
cannam@89 2320 <computeroutput>*destLen</computeroutput> is set to the size of
cannam@89 2321 the uncompressed data, and <computeroutput>BZ_OK</computeroutput>
cannam@89 2322 is returned. If the uncompressed data won't fit,
cannam@89 2323 <computeroutput>*destLen</computeroutput> is unchanged, and
cannam@89 2324 <computeroutput>BZ_OUTBUFF_FULL</computeroutput> is
cannam@89 2325 returned.</para>
cannam@89 2326
cannam@89 2327 <para><computeroutput>source</computeroutput> is assumed to hold
cannam@89 2328 a complete <computeroutput>bzip2</computeroutput> format data
cannam@89 2329 stream.
cannam@89 2330 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> tries
cannam@89 2331 to decompress the entirety of the stream into the output
cannam@89 2332 buffer.</para>
cannam@89 2333
cannam@89 2334 <para>For the meaning of parameters
cannam@89 2335 <computeroutput>small</computeroutput> and
cannam@89 2336 <computeroutput>verbosity</computeroutput>, see
cannam@89 2337 <computeroutput>BZ2_bzDecompressInit</computeroutput>.</para>
cannam@89 2338
cannam@89 2339 <para>Because the compression ratio of the compressed data cannot
cannam@89 2340 be known in advance, there is no easy way to guarantee that the
cannam@89 2341 output buffer will be big enough. You may of course make
cannam@89 2342 arrangements in your code to record the size of the uncompressed
cannam@89 2343 data, but such a mechanism is beyond the scope of this
cannam@89 2344 library.</para>
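
<para>One possible arrangement, shown purely as a sketch, is for the
application to store the uncompressed length in a small header ahead of
the compressed data. The 4-byte little-endian layout and the names
<computeroutput>put_size_header</computeroutput> and
<computeroutput>get_size_header</computeroutput> are assumptions of
this example and have nothing to do with the
<computeroutput>bzip2</computeroutput> format itself.</para>

<programlisting>
```c
#include <stdint.h>

/* Hypothetical framing, not part of libbzip2: the application stores
   the uncompressed length as a 4-byte little-endian header ahead of
   the compressed data, so a reader can size dest before decompressing. */
void put_size_header ( unsigned char hdr[4], uint32_t uncompressedLen )
{
   hdr[0] = (unsigned char)(   uncompressedLen         & 0xFF );
   hdr[1] = (unsigned char)( ( uncompressedLen >>  8 ) & 0xFF );
   hdr[2] = (unsigned char)( ( uncompressedLen >> 16 ) & 0xFF );
   hdr[3] = (unsigned char)( ( uncompressedLen >> 24 ) & 0xFF );
}

/* Reads back the length written by put_size_header. */
uint32_t get_size_header ( const unsigned char hdr[4] )
{
   return   (uint32_t)hdr[0]
        | ( (uint32_t)hdr[1] <<  8 )
        | ( (uint32_t)hdr[2] << 16 )
        | ( (uint32_t)hdr[3] << 24 );
}
```
</programlisting>

<para>A reader would call
<computeroutput>get_size_header</computeroutput> first, allocate that
many bytes for <computeroutput>dest</computeroutput>, and pass the rest
of the data as <computeroutput>source</computeroutput>.</para>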
cannam@89 2345
cannam@89 2346 <para><computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput>
cannam@89 2347 will not write data at or beyond
cannam@89 2348 <computeroutput>dest[*destLen]</computeroutput>, even in case of
cannam@89 2349 buffer overflow.</para>
cannam@89 2350
cannam@89 2351 <para>Possible return values:</para>
cannam@89 2352
cannam@89 2353 <programlisting>
cannam@89 2354 BZ_CONFIG_ERROR
cannam@89 2355 if the library has been mis-compiled
cannam@89 2356 BZ_PARAM_ERROR
cannam@89 2357 if dest is NULL or destLen is NULL
cannam@89 2358 or small != 0 && small != 1
cannam@89 2359 or verbosity < 0 or verbosity > 4
cannam@89 2360 BZ_MEM_ERROR
cannam@89 2361 if insufficient memory is available
cannam@89 2362 BZ_OUTBUFF_FULL
cannam@89 2363 if the size of the uncompressed data exceeds *destLen
cannam@89 2364 BZ_DATA_ERROR
cannam@89 2365 if a data integrity error was detected in the compressed data
cannam@89 2366 BZ_DATA_ERROR_MAGIC
cannam@89 2367 if the compressed data doesn't begin with the right magic bytes
cannam@89 2368 BZ_UNEXPECTED_EOF
cannam@89 2369 if the compressed data ends unexpectedly
cannam@89 2370 BZ_OK
cannam@89 2371 otherwise
cannam@89 2372 </programlisting>
cannam@89 2373
cannam@89 2374 </sect2>
cannam@89 2375
cannam@89 2376 </sect1>
cannam@89 2377
cannam@89 2378
cannam@89 2379 <sect1 id="zlib-compat" xreflabel="zlib compatibility functions">
cannam@89 2380 <title>zlib compatibility functions</title>
cannam@89 2381
cannam@89 2382 <para>Yoshioka Tsuneo has contributed some functions to give
cannam@89 2383 better <computeroutput>zlib</computeroutput> compatibility.
cannam@89 2384 These functions are <computeroutput>BZ2_bzopen</computeroutput>,
cannam@89 2385 <computeroutput>BZ2_bzread</computeroutput>,
cannam@89 2386 <computeroutput>BZ2_bzwrite</computeroutput>,
cannam@89 2387 <computeroutput>BZ2_bzflush</computeroutput>,
cannam@89 2388 <computeroutput>BZ2_bzclose</computeroutput>,
cannam@89 2389 <computeroutput>BZ2_bzerror</computeroutput> and
cannam@89 2390 <computeroutput>BZ2_bzlibVersion</computeroutput>. These
cannam@89 2391 functions are not (yet) officially part of the library. If they
cannam@89 2392 break, you get to keep all the pieces. Nevertheless, I think
cannam@89 2393 they work ok.</para>
cannam@89 2394
cannam@89 2395 <programlisting>
cannam@89 2396 typedef void BZFILE;
cannam@89 2397
cannam@89 2398 const char * BZ2_bzlibVersion ( void );
cannam@89 2399 </programlisting>
cannam@89 2400
cannam@89 2401 <para>Returns a string indicating the library version.</para>
cannam@89 2402
cannam@89 2403 <programlisting>
cannam@89 2404 BZFILE * BZ2_bzopen ( const char *path, const char *mode );
cannam@89 2405 BZFILE * BZ2_bzdopen ( int fd, const char *mode );
cannam@89 2406 </programlisting>
cannam@89 2407
cannam@89 2408 <para>Opens a <computeroutput>.bz2</computeroutput> file for
cannam@89 2409 reading or writing, using either its name or a pre-existing file
cannam@89 2410 descriptor. Analogous to <computeroutput>fopen</computeroutput>
cannam@89 2411 and <computeroutput>fdopen</computeroutput>.</para>
cannam@89 2412
cannam@89 2413 <programlisting>
cannam@89 2414 int BZ2_bzread ( BZFILE* b, void* buf, int len );
cannam@89 2415 int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
cannam@89 2416 </programlisting>
cannam@89 2417
cannam@89 2418 <para>Reads/writes data from/to a previously opened
cannam@89 2419 <computeroutput>BZFILE</computeroutput>. Analogous to
cannam@89 2420 <computeroutput>fread</computeroutput> and
cannam@89 2421 <computeroutput>fwrite</computeroutput>.</para>
cannam@89 2422
cannam@89 2423 <programlisting>
cannam@89 2424 int BZ2_bzflush ( BZFILE* b );
cannam@89 2425 void BZ2_bzclose ( BZFILE* b );
cannam@89 2426 </programlisting>
cannam@89 2427
cannam@89 2428 <para>Flushes/closes a <computeroutput>BZFILE</computeroutput>.
cannam@89 2429 <computeroutput>BZ2_bzflush</computeroutput> doesn't actually do
cannam@89 2430 anything. Analogous to <computeroutput>fflush</computeroutput>
cannam@89 2431 and <computeroutput>fclose</computeroutput>.</para>
cannam@89 2432
cannam@89 2433 <programlisting>
cannam@89 2434 const char * BZ2_bzerror ( BZFILE *b, int *errnum );
cannam@89 2435 </programlisting>
cannam@89 2436
cannam@89 2437 <para>Returns a string describing the most recent error status of
cannam@89 2438 <computeroutput>b</computeroutput>, and also sets
cannam@89 2439 <computeroutput>*errnum</computeroutput> to its numerical
cannam@89 2440 value.</para>
cannam@89 2441
cannam@89 2442 </sect1>
cannam@89 2443
cannam@89 2444
cannam@89 2445 <sect1 id="stdio-free"
cannam@89 2446 xreflabel="Using the library in a stdio-free environment">
cannam@89 2447 <title>Using the library in a stdio-free environment</title>
cannam@89 2448
cannam@89 2449
cannam@89 2450 <sect2 id="stdio-bye" xreflabel="Getting rid of stdio">
cannam@89 2451 <title>Getting rid of stdio</title>
cannam@89 2452
cannam@89 2453 <para>In a deeply embedded application, you might want to use
cannam@89 2454 just the memory-to-memory functions. You can do this
cannam@89 2455 conveniently by compiling the library with preprocessor symbol
cannam@89 2456 <computeroutput>BZ_NO_STDIO</computeroutput> defined. Doing this
cannam@89 2457 gives you a library containing only the following eight
cannam@89 2458 functions:</para>
cannam@89 2459
cannam@89 2460 <para><computeroutput>BZ2_bzCompressInit</computeroutput>,
cannam@89 2461 <computeroutput>BZ2_bzCompress</computeroutput>,
cannam@89 2462 <computeroutput>BZ2_bzCompressEnd</computeroutput>,
cannam@89 2463 <computeroutput>BZ2_bzDecompressInit</computeroutput>,
cannam@89 2464 <computeroutput>BZ2_bzDecompress</computeroutput>,
cannam@89 2465 <computeroutput>BZ2_bzDecompressEnd</computeroutput>,
cannam@89 2466 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and
cannam@89 2467 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput>.</para>
cannam@89 2468
cannam@89 2469 <para>When compiled like this, all functions will ignore
cannam@89 2470 <computeroutput>verbosity</computeroutput> settings.</para>
cannam@89 2471
cannam@89 2472 </sect2>
cannam@89 2473
cannam@89 2474
cannam@89 2475 <sect2 id="critical-error" xreflabel="Critical error handling">
cannam@89 2476 <title>Critical error handling</title>
cannam@89 2477
cannam@89 2478 <para><computeroutput>libbzip2</computeroutput> contains a number
cannam@89 2479 of internal assertion checks which should, needless to say, never
cannam@89 2480 be activated. Nevertheless, if an assertion should fail,
cannam@89 2481 behaviour depends on whether or not the library was compiled with
cannam@89 2482 <computeroutput>BZ_NO_STDIO</computeroutput> set.</para>
cannam@89 2483
cannam@89 2484 <para>For a normal compile, an assertion failure yields the
cannam@89 2485 message:</para>
cannam@89 2486
cannam@89 2487 <blockquote>
cannam@89 2488 <para>bzip2/libbzip2: internal error number N.</para>
cannam@89 2489 <para>This is a bug in bzip2/libbzip2, &bz-version; of &bz-date;.
cannam@89 2490 Please report it to me at: &bz-email;. If this happened
cannam@89 2491 when you were using some program which uses libbzip2 as a
cannam@89 2492 component, you should also report this bug to the author(s)
cannam@89 2493 of that program. Please make an effort to report this bug;
cannam@89 2494 timely and accurate bug reports eventually lead to higher
cannam@89 2495 quality software. Thanks. Julian Seward, &bz-date;.
cannam@89 2496 </para></blockquote>
cannam@89 2497
cannam@89 2498 <para>where <computeroutput>N</computeroutput> is some error code
cannam@89 2499 number. If <computeroutput>N == 1007</computeroutput>, it also
cannam@89 2500 prints some extra text advising the reader that unreliable memory
cannam@89 2501 is often associated with internal error 1007. (This is a
cannam@89 2502 frequently observed phenomenon with versions 1.0.0/1.0.1).</para>
cannam@89 2503
cannam@89 2504 <para><computeroutput>exit(3)</computeroutput> is then
cannam@89 2505 called.</para>
cannam@89 2506
cannam@89 2507 <para>For a <computeroutput>stdio</computeroutput>-free library,
cannam@89 2508 assertion failures result in a call to a function declared
cannam@89 2509 as:</para>
cannam@89 2510
cannam@89 2511 <programlisting>
cannam@89 2512 extern void bz_internal_error ( int errcode );
cannam@89 2513 </programlisting>
cannam@89 2514
cannam@89 2515 <para>The relevant code is passed as a parameter. You should
cannam@89 2516 supply such a function.</para>
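
<para>What such a function does is up to you. One option, sketched
here, is to record the code and jump back to a recovery point in the
application rather than return into the library. Apart from the
<computeroutput>bz_internal_error</computeroutput> prototype, every
name below is hypothetical application code, and the two
<computeroutput>demo_</computeroutput> functions merely stand in for
real calls into the library.</para>

<programlisting>
```c
#include <setjmp.h>

/* Sketch for a library built with BZ_NO_STDIO.  Everything except the
   bz_internal_error prototype is hypothetical application code. */
static jmp_buf      bz_panic_env;       /* where to resume after a failure */
static volatile int bz_panic_code = 0;  /* last assertion code reported    */

/* Called by libbzip2 when an internal assertion fails.  This sketch
   chooses not to return into the library; it jumps to the guard below. */
void bz_internal_error ( int errcode )
{
   bz_panic_code = errcode;
   longjmp ( bz_panic_env, 1 );
}

/* Runs work(); returns 0 on normal completion, or the assertion code
   if bz_internal_error was called while work() was running. */
int run_guarded ( void (*work) ( void ) )
{
   if ( setjmp ( bz_panic_env ) != 0 ) return bz_panic_code;
   work ();
   return 0;
}

/* Stand-ins for real library calls, used only to exercise the guard. */
void demo_ok   ( void ) { }
void demo_fail ( void ) { bz_internal_error ( 1007 ); }
```
</programlisting>

<para>After <computeroutput>run_guarded</computeroutput> reports a
nonzero code, the <computeroutput>bz_stream</computeroutput> records
that were in use should be treated as invalid.</para>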
cannam@89 2517
cannam@89 2518 <para>In either case, once an assertion failure has occurred, any
cannam@89 2519 <computeroutput>bz_stream</computeroutput> records involved can
cannam@89 2520 be regarded as invalid. You should not attempt to resume normal
cannam@89 2521 operation with them.</para>
cannam@89 2522
cannam@89 2523 <para>You may, of course, change critical error handling to suit
cannam@89 2524 your needs. As I said above, critical errors indicate bugs in
cannam@89 2525 the library and should not occur. All "normal" error situations
cannam@89 2526 are indicated via error return codes from functions, and can be
cannam@89 2527 recovered from.</para>
cannam@89 2528
cannam@89 2529 </sect2>
cannam@89 2530
cannam@89 2531 </sect1>
cannam@89 2532
cannam@89 2533
cannam@89 2534 <sect1 id="win-dll" xreflabel="Making a Windows DLL">
cannam@89 2535 <title>Making a Windows DLL</title>
cannam@89 2536
cannam@89 2537 <para>Everything related to Windows has been contributed by
cannam@89 2538 Yoshioka Tsuneo
cannam@89 2539 (<computeroutput>tsuneo@rr.iij4u.or.jp</computeroutput>), so
cannam@89 2540 you should send your queries to him (but perhaps Cc: me,
cannam@89 2541 <computeroutput>&bz-email;</computeroutput>).</para>
cannam@89 2542
cannam@89 2543 <para>My vague understanding of what to do is: using Visual C++
cannam@89 2544 5.0, open the project file
cannam@89 2545 <computeroutput>libbz2.dsp</computeroutput>, and build. That's
cannam@89 2546 all.</para>
cannam@89 2547
cannam@89 2548 <para>If you can't open the project file for some reason, make a
cannam@89 2549 new one, naming these files:
cannam@89 2550 <computeroutput>blocksort.c</computeroutput>,
cannam@89 2551 <computeroutput>bzlib.c</computeroutput>,
cannam@89 2552 <computeroutput>compress.c</computeroutput>,
cannam@89 2553 <computeroutput>crctable.c</computeroutput>,
cannam@89 2554 <computeroutput>decompress.c</computeroutput>,
cannam@89 2555 <computeroutput>huffman.c</computeroutput>,
cannam@89 2556 <computeroutput>randtable.c</computeroutput> and
cannam@89 2557 <computeroutput>libbz2.def</computeroutput>. You will also need
cannam@89 2558 to name the header files <computeroutput>bzlib.h</computeroutput>
cannam@89 2559 and <computeroutput>bzlib_private.h</computeroutput>.</para>
cannam@89 2560
cannam@89 2561 <para>If you don't use VC++, you may need to define the
cannam@89 2562 preprocessor symbol
cannam@89 2563 <computeroutput>_WIN32</computeroutput>.</para>
cannam@89 2564
cannam@89 2565 <para>Finally, <computeroutput>dlltest.c</computeroutput> is a
cannam@89 2566 sample program using the DLL. It has a project file,
cannam@89 2567 <computeroutput>dlltest.dsp</computeroutput>.</para>
cannam@89 2568
cannam@89 2569 <para>If you just want a makefile for Visual C, have a look at
cannam@89 2570 <computeroutput>makefile.msc</computeroutput>.</para>
cannam@89 2571
cannam@89 2572 <para>Be aware that if you compile
cannam@89 2573 <computeroutput>bzip2</computeroutput> itself on Win32, you must
cannam@89 2574 set <computeroutput>BZ_UNIX</computeroutput> to 0 and
cannam@89 2575 <computeroutput>BZ_LCCWIN32</computeroutput> to 1, in the file
cannam@89 2576 <computeroutput>bzip2.c</computeroutput>, before compiling.
cannam@89 2577 Otherwise the resulting binary won't work correctly.</para>
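
<para>As a sketch only, assuming the two settings appear in
<computeroutput>bzip2.c</computeroutput> as ordinary
<computeroutput>#define</computeroutput> lines (an assumption; check
your copy of the source), a Win32 build would want them to
read:</para>

<programlisting><![CDATA[
```c
/* Assumed form of the platform switches in bzip2.c for a Win32
   build.  The values 0 and 1 are the ones given in the text above. */
#define BZ_UNIX      0
#define BZ_LCCWIN32  1
```
]]></programlisting>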
cannam@89 2578
cannam@89 2579 <para>I haven't tried any of this stuff myself, but it all looks
cannam@89 2580 plausible.</para>
cannam@89 2581
cannam@89 2582 </sect1>
cannam@89 2583
cannam@89 2584 </chapter>
cannam@89 2585
cannam@89 2586
cannam@89 2587
cannam@89 2588 <chapter id="misc" xreflabel="Miscellanea">
cannam@89 2589 <title>Miscellanea</title>
cannam@89 2590
cannam@89 2591 <para>These are just some random thoughts of mine. Your mileage
cannam@89 2592 may vary.</para>
cannam@89 2593
cannam@89 2594
cannam@89 2595 <sect1 id="limits" xreflabel="Limitations of the compressed file format">
cannam@89 2596 <title>Limitations of the compressed file format</title>
cannam@89 2597
cannam@89 2598 <para><computeroutput>bzip2-1.0.X</computeroutput>,
cannam@89 2599 <computeroutput>0.9.5</computeroutput> and
cannam@89 2600 <computeroutput>0.9.0</computeroutput> use exactly the same file
cannam@89 2601 format as the original version,
cannam@89 2602 <computeroutput>bzip2-0.1</computeroutput>. This decision was
cannam@89 2603 made in the interests of stability. Creating yet another
cannam@89 2604 incompatible compressed file format would create further
cannam@89 2605 confusion and disruption for users.</para>
cannam@89 2606
cannam@89 2607 <para>Nevertheless, this is not a painless decision. Development
cannam@89 2608 work since the release of
cannam@89 2609 <computeroutput>bzip2-0.1</computeroutput> in August 1997 has
cannam@89 2610 shown complexities in the file format which slow down
cannam@89 2611 decompression and, in retrospect, are unnecessary. These
cannam@89 2612 are:</para>
cannam@89 2613
cannam@89 2614 <itemizedlist mark='bullet'>
cannam@89 2615
cannam@89 2616 <listitem><para>The run-length encoder, which is the first of the
cannam@89 2617 compression transformations, is entirely irrelevant. The
cannam@89 2618 original purpose was to protect the sorting algorithm from the
cannam@89 2619 very worst case input: a string of repeated symbols. But
cannam@89 2620 algorithm steps Q6a and Q6b in the original Burrows-Wheeler
cannam@89 2621 technical report (SRC-124) show how repeats can be handled
cannam@89 2622 without difficulty in block sorting.</para></listitem>
cannam@89 2623
cannam@89 2624 <listitem><para>The randomisation mechanism doesn't really need to be
cannam@89 2625 there. Udi Manber and Gene Myers published a suffix array
cannam@89 2626 construction algorithm a few years back, which can be employed
cannam@89 2627 to sort any block, no matter how repetitive, in O(N log N)
cannam@89 2628 time. Subsequent work by Kunihiko Sadakane has produced a
cannam@89 2629 derivative O(N (log N)^2) algorithm which usually outperforms
cannam@89 2630 the Manber-Myers algorithm.</para>
cannam@89 2631
cannam@89 2632 <para>I could have changed to Sadakane's algorithm, but I find
cannam@89 2633 it to be slower than <computeroutput>bzip2</computeroutput>'s
cannam@89 2634 existing algorithm for most inputs, and the randomisation
cannam@89 2635 mechanism protects adequately against bad cases. I didn't
cannam@89 2636 think it was a good tradeoff to make. Partly this is due to
cannam@89 2637 the fact that I was not flooded with email complaints about
cannam@89 2638 <computeroutput>bzip2-0.1</computeroutput>'s performance on
cannam@89 2639 repetitive data, so perhaps it isn't a problem for real
cannam@89 2640 inputs.</para>
cannam@89 2641
cannam@89 2642 <para>Probably the best long-term solution, and the one I have
cannam@89 2643 incorporated into 0.9.5 and above, is to use the existing
cannam@89 2644 sorting algorithm initially, and fall back to an O(N (log N)^2)
cannam@89 2645 algorithm if the standard algorithm gets into
cannam@89 2646 difficulties.</para></listitem>
cannam@89 2647
cannam@89 2648 <listitem><para>The compressed file format was never designed to be
cannam@89 2649 handled by a library, and I have had to jump through some hoops
cannam@89 2650 to produce an efficient implementation of decompression. It's
cannam@89 2651 a bit hairy. Try passing
cannam@89 2652 <computeroutput>decompress.c</computeroutput> through the C
cannam@89 2653 preprocessor and you'll see what I mean. Much of this
cannam@89 2654 complexity could have been avoided if the compressed size of
cannam@89 2655 each block of data was recorded in the data stream.</para></listitem>
cannam@89 2656
cannam@89 2657 <listitem><para>An Adler-32 checksum, rather than a CRC32 checksum,
cannam@89 2658 would be faster to compute.</para></listitem>
cannam@89 2659
cannam@89 2660 </itemizedlist>
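
<para>The checksum point in the list above can be made concrete with
a rough sketch. The first function is the standard Adler-32
computation. The second is a bit-at-a-time CRC32, processed most
significant bit first with polynomial
<computeroutput>0x04C11DB7</computeroutput>, which is assumed here to
be the CRC variant of interest. Both function names are hypothetical.
The bitwise CRC loop is written that way for brevity; practical
implementations normally use a lookup table, which narrows the gap
but still costs a table access per byte.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* Adler-32: two running sums modulo 65521, updated once per byte. */
uint32_t adler32_sketch ( const unsigned char *buf, size_t len )
{
   uint32_t a = 1, b = 0;
   size_t i;
   for (i = 0; i < len; i++) {
      a = (a + buf[i]) % 65521u;
      b = (b + a) % 65521u;
   }
   return (b << 16) | a;
}

/* CRC32, most significant bit first, polynomial 0x04C11DB7,
   initial value 0xFFFFFFFF, result complemented at the end. */
uint32_t crc32_msb_sketch ( const unsigned char *buf, size_t len )
{
   uint32_t crc = 0xFFFFFFFFu;
   size_t i;
   int bit;
   for (i = 0; i < len; i++) {
      crc ^= (uint32_t)buf[i] << 24;
      for (bit = 0; bit < 8; bit++) {
         if (crc & 0x80000000u)
            crc = (crc << 1) ^ 0x04C11DB7u;
         else
            crc = crc << 1;
      }
   }
   return crc ^ 0xFFFFFFFFu;
}
```
]]></programlisting>

<para>For the nine-byte input
<computeroutput>123456789</computeroutput>, the Adler-32 sketch
yields <computeroutput>0x091E01DE</computeroutput>: the byte sum
plus one is 478, and the sum of the running totals is 2334.</para>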
cannam@89 2661
cannam@89 2662 <para>It would be fair to say that the
cannam@89 2663 <computeroutput>bzip2</computeroutput> format was frozen before I
cannam@89 2664 properly and fully understood the performance consequences of
cannam@89 2665 doing so.</para>
cannam@89 2666
cannam@89 2667 <para>Improvements which I was able to incorporate into 0.9.0,
cannam@89 2668 despite using the same file format, are:</para>
cannam@89 2669
cannam@89 2670 <itemizedlist mark='bullet'>
cannam@89 2671
cannam@89 2672 <listitem><para>Single array implementation of the inverse BWT. This
cannam@89 2673 significantly speeds up decompression, presumably because it
cannam@89 2674 reduces the number of cache misses.</para></listitem>
cannam@89 2675
cannam@89 2676 <listitem><para>Faster inverse MTF transform for large MTF values.
cannam@89 2677 The new implementation is based on the notion of sliding blocks
cannam@89 2678 of values.</para></listitem>
cannam@89 2679
cannam@89 2680 <listitem><para><computeroutput>bzip2-0.9.0</computeroutput> now reads
cannam@89 2681 and writes files with <computeroutput>fread</computeroutput>
cannam@89 2682 and <computeroutput>fwrite</computeroutput>; version 0.1 used
cannam@89 2683 <computeroutput>putc</computeroutput> and
cannam@89 2684 <computeroutput>getc</computeroutput>. Duh! Well, you live
cannam@89 2685 and learn.</para></listitem>
cannam@89 2686
cannam@89 2687 </itemizedlist>
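
<para>To illustrate the last item in the list above, here is a
minimal sketch contrasting the two I/O styles. It is not taken from
the <computeroutput>bzip2</computeroutput> sources: the function
names and the 5000-byte buffer size are arbitrary choices made for
this example.</para>

<programlisting><![CDATA[
```c
#include <stdio.h>

/* Byte-at-a-time copy using getc and putc.
   Returns the number of bytes copied, or -1 on error. */
long copy_bytewise ( FILE *in, FILE *out )
{
   long n = 0;
   int c;
   while ((c = getc ( in )) != EOF) {
      if (putc ( c, out ) == EOF) return -1;
      n++;
   }
   return ferror ( in ) ? -1 : n;
}

/* Block copy using fread and fwrite, so the per-call overhead is
   paid once per buffer rather than once per byte.
   Returns the number of bytes copied, or -1 on error. */
long copy_blockwise ( FILE *in, FILE *out )
{
   unsigned char buf[5000];
   long n = 0;
   size_t got;
   while ((got = fread ( buf, 1, sizeof buf, in )) > 0) {
      if (fwrite ( buf, 1, got, out ) != got) return -1;
      n += (long)got;
   }
   return ferror ( in ) ? -1 : n;
}
```
]]></programlisting>

<para>Both functions produce identical output; the difference is
only in how many library calls are made along the way.</para>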
cannam@89 2688
cannam@89 2689 <para>Further ahead, it would be nice to be able to do random
cannam@89 2690 access into files. This will require some careful design of
cannam@89 2691 compressed file formats.</para>
cannam@89 2692
cannam@89 2693 </sect1>
cannam@89 2694
cannam@89 2695
cannam@89 2696 <sect1 id="port-issues" xreflabel="Portability issues">
cannam@89 2697 <title>Portability issues</title>
cannam@89 2698
cannam@89 2699 <para>After some consideration, I have decided not to use GNU
cannam@89 2700 <computeroutput>autoconf</computeroutput> to configure 0.9.5 or
cannam@89 2701 1.0.</para>
cannam@89 2702
cannam@89 2703 <para><computeroutput>autoconf</computeroutput>, admirable and
cannam@89 2704 wonderful though it is, mainly assists with portability problems
cannam@89 2705 between Unix-like platforms. But
cannam@89 2706 <computeroutput>bzip2</computeroutput> doesn't have much in the
cannam@89 2707 way of portability problems on Unix; most of the difficulties
cannam@89 2708 appear when porting to the Mac, or to Microsoft's operating
cannam@89 2709 systems. <computeroutput>autoconf</computeroutput> doesn't help
cannam@89 2710 in those cases, and brings in a whole load of new
cannam@89 2711 complexity.</para>
cannam@89 2712
cannam@89 2713 <para>Most people should be able to compile the library and
cannam@89 2714 program under Unix straight out-of-the-box, so to speak,
cannam@89 2715 especially if a version of GNU C is available.</para>
cannam@89 2716
cannam@89 2717 <para>There are a couple of
cannam@89 2718 <computeroutput>__inline__</computeroutput> directives in the
cannam@89 2719 code. GNU C (<computeroutput>gcc</computeroutput>) should be
cannam@89 2720 able to handle them. If you're not using GNU C, your C compiler
cannam@89 2721 shouldn't see them at all. If your compiler does, for some
cannam@89 2722 reason, see them and doesn't like them, just
cannam@89 2723 <computeroutput>#define</computeroutput>
cannam@89 2724 <computeroutput>__inline__</computeroutput> to be
cannam@89 2725 <computeroutput>/* */</computeroutput>. One easy way to do this
cannam@89 2726 is to compile with the flag
cannam@89 2727 <computeroutput>-D__inline__=</computeroutput>, which should be
cannam@89 2728 understood by most Unix compilers.</para>
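
<para>The same effect can be had in the source rather than on the
command line. The fragment below is a sketch of that approach:
guarding on <computeroutput>__GNUC__</computeroutput> is an
assumption about how one might arrange it, and
<computeroutput>clamp_block_size</computeroutput> is a hypothetical
helper that exists only to show the keyword in use.</para>

<programlisting><![CDATA[
```c
/* If the compiler is not GNU C, make __inline__ expand to an empty
   comment, which is what -D__inline__= achieves from the command
   line. */
#ifndef __GNUC__
#define __inline__ /* */
#endif

/* Hypothetical helper: force a block-size setting into 1..9. */
static __inline__ int clamp_block_size ( int k )
{
   if (k < 1) return 1;
   if (k > 9) return 9;
   return k;
}
```
]]></programlisting>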
cannam@89 2729
cannam@89 2730 <para>If you still have difficulties, try compiling with the
cannam@89 2731 macro <computeroutput>BZ_STRICT_ANSI</computeroutput> defined.
cannam@89 2732 This should enable you to build the library in a strictly ANSI
cannam@89 2733 compliant environment. Building the program itself like this is
cannam@89 2734 dangerous and not supported, since you remove
cannam@89 2735 <computeroutput>bzip2</computeroutput>'s checks against
cannam@89 2736 compressing directories, symbolic links, devices, and other
cannam@89 2737 not-really-a-file entities. This could cause filesystem
cannam@89 2738 corruption!</para>
cannam@89 2739
cannam@89 2740 <para>One other thing: if you create a
cannam@89 2741 <computeroutput>bzip2</computeroutput> binary for public distribution,
cannam@89 2742 please consider linking it statically (<computeroutput>gcc
cannam@89 2743 -static</computeroutput>). This avoids all sorts of library-version
cannam@89 2744 issues that others may encounter later on.</para>
cannam@89 2745
cannam@89 2746 <para>If you build <computeroutput>bzip2</computeroutput> on
cannam@89 2747 Win32, you must set <computeroutput>BZ_UNIX</computeroutput> to 0
cannam@89 2748 and <computeroutput>BZ_LCCWIN32</computeroutput> to 1, in the
cannam@89 2749 file <computeroutput>bzip2.c</computeroutput>, before compiling.
cannam@89 2750 Otherwise the resulting binary won't work correctly.</para>
cannam@89 2751
cannam@89 2752 </sect1>
cannam@89 2753
cannam@89 2754
cannam@89 2755 <sect1 id="bugs" xreflabel="Reporting bugs">
cannam@89 2756 <title>Reporting bugs</title>
cannam@89 2757
cannam@89 2758 <para>I tried pretty hard to make sure
cannam@89 2759 <computeroutput>bzip2</computeroutput> is bug free, both by
cannam@89 2760 design and by testing. Hopefully you'll never need to read this
cannam@89 2761 section for real.</para>
cannam@89 2762
cannam@89 2763 <para>Nevertheless, if <computeroutput>bzip2</computeroutput> dies
cannam@89 2764 with a segmentation fault, a bus error or an internal assertion
cannam@89 2765 failure, it will ask you to email me a bug report. Experience from
cannam@89 2766 years of feedback from bzip2 users indicates that almost all these
cannam@89 2767 problems can be traced to either compiler bugs or hardware
cannam@89 2768 problems. Before sending a report, please work through the following checks:</para>
cannam@89 2769
cannam@89 2770 <itemizedlist mark='bullet'>
cannam@89 2771
cannam@89 2772 <listitem><para>Recompile the program with no optimisation, and
cannam@89 2773 see if it works. And/or try a different compiler. I have heard all
cannam@89 2774 sorts of stories about various flavours of GNU C (and other
cannam@89 2775 compilers) generating bad code for
cannam@89 2776 <computeroutput>bzip2</computeroutput>, and I've run across two
cannam@89 2777 such examples myself.</para>
cannam@89 2778
cannam@89 2779 <para>2.7.X versions of GNU C are known to generate bad code
cannam@89 2780 from time to time, at high optimisation levels. If you get
cannam@89 2781 problems, try using the flags
cannam@89 2782 <computeroutput>-O2</computeroutput>
cannam@89 2783 <computeroutput>-fomit-frame-pointer</computeroutput>
cannam@89 2784 <computeroutput>-fno-strength-reduce</computeroutput>. You
cannam@89 2785 should specifically <emphasis>not</emphasis> use
cannam@89 2786 <computeroutput>-funroll-loops</computeroutput>.</para>
cannam@89 2787
cannam@89 2788 <para>You may notice that the Makefile runs six tests as part
cannam@89 2789 of the build process. If the program passes all of these, it's
cannam@89 2790 a pretty good (but not 100%) indication that the compiler has
cannam@89 2791 done its job correctly.</para></listitem>
cannam@89 2792
cannam@89 2793 <listitem><para>If <computeroutput>bzip2</computeroutput>
cannam@89 2794 crashes randomly, and the crashes are not repeatable, you may
cannam@89 2795 have a flaky memory subsystem.
cannam@89 2796 <computeroutput>bzip2</computeroutput> really hammers your
cannam@89 2797 memory hierarchy, and if it's a bit marginal, you may get these
cannam@89 2798 problems. Ditto if your disk or I/O subsystem is slowly
cannam@89 2799 failing. Yup, this really does happen.</para>
cannam@89 2800
cannam@89 2801 <para>Try using a different machine of the same type, and see
cannam@89 2802 if you can repeat the problem.</para></listitem>
cannam@89 2803
cannam@89 2804 <listitem><para>This isn't really a bug, but ... If
cannam@89 2805 <computeroutput>bzip2</computeroutput> tells you your file is
cannam@89 2806 corrupted on decompression, and you obtained the file via FTP,
cannam@89 2807 there is a possibility that you forgot to tell FTP to do a
cannam@89 2808 binary mode transfer. That absolutely will cause the file to
cannam@89 2809 be non-decompressible. You'll have to transfer it
cannam@89 2810 again.</para></listitem>
cannam@89 2811
cannam@89 2812 </itemizedlist>
cannam@89 2813
cannam@89 2814 <para>If you've incorporated
cannam@89 2815 <computeroutput>libbzip2</computeroutput> into your own program
cannam@89 2816 and are getting problems, please, please, please, check that the
cannam@89 2817 parameters you are passing in calls to the library are correct,
cannam@89 2818 and in accordance with what the documentation says is allowable.
cannam@89 2819 I have tried to make the library robust against such problems,
cannam@89 2820 but I'm sure I haven't succeeded.</para>
cannam@89 2821
cannam@89 2822 <para>Finally, if the above comments don't help, you'll have to
cannam@89 2823 send me a bug report. Now, it's just amazing how many people
cannam@89 2824 will send me a bug report saying something like:</para>
cannam@89 2825
cannam@89 2826 <programlisting>
cannam@89 2827 bzip2 crashed with segmentation fault on my machine
cannam@89 2828 </programlisting>
cannam@89 2829
cannam@89 2830 <para>and absolutely nothing else. Needless to say, such a
cannam@89 2831 report is <emphasis>totally, utterly, completely and
cannam@89 2832 comprehensively 100% useless; a waste of your time, my time, and
cannam@89 2833 net bandwidth</emphasis>. With no details at all, there's no way
cannam@89 2834 I can possibly begin to figure out what the problem is.</para>
cannam@89 2835
cannam@89 2836 <para>The rules of the game are: facts, facts, facts. Don't omit
cannam@89 2837 them because "oh, they won't be relevant". At the bare
cannam@89 2838 minimum:</para>
cannam@89 2839
cannam@89 2840 <programlisting>
cannam@89 2841 Machine type. Operating system version.
cannam@89 2842 Exact version of bzip2 (do bzip2 -V).
cannam@89 2843 Exact version of the compiler used.
cannam@89 2844 Flags passed to the compiler.
cannam@89 2845 </programlisting>
cannam@89 2846
cannam@89 2847 <para>However, the most important single thing that will help me
cannam@89 2848 is the file that you were trying to compress or decompress at the
cannam@89 2849 time the problem happened. Without that, my ability to do
cannam@89 2850 anything more than speculate about the cause is limited.</para>
cannam@89 2851
cannam@89 2852 </sect1>
cannam@89 2853
cannam@89 2854
cannam@89 2855 <sect1 id="package" xreflabel="Did you get the right package?">
cannam@89 2856 <title>Did you get the right package?</title>
cannam@89 2857
cannam@89 2858 <para><computeroutput>bzip2</computeroutput> is a resource hog.
cannam@89 2859 It soaks up large amounts of CPU cycles and memory. Also, it
cannam@89 2860 gives very large latencies. In the worst case, you can feed many
cannam@89 2861 megabytes of uncompressed data into the library before getting
cannam@89 2862 any compressed output, so this probably rules out applications
cannam@89 2863 requiring interactive behaviour.</para>
cannam@89 2864
cannam@89 2865 <para>These aren't faults of my implementation, I hope, but more
cannam@89 2866 an intrinsic property of the Burrows-Wheeler transform
cannam@89 2867 (unfortunately). Maybe this isn't what you want.</para>
cannam@89 2868
cannam@89 2869 <para>If you want a compressor and/or library which is faster,
cannam@89 2870 uses less memory but gets pretty good compression, and has
cannam@89 2871 minimal latency, consider Jean-loup Gailly's and Mark Adler's
cannam@89 2872 work, <computeroutput>zlib-1.2.1</computeroutput> and
cannam@89 2873 <computeroutput>gzip-1.2.4</computeroutput>. Look for them at
cannam@89 2874 <ulink url="http://www.zlib.org">http://www.zlib.org</ulink> and
cannam@89 2875 <ulink url="http://www.gzip.org">http://www.gzip.org</ulink>
cannam@89 2876 respectively.</para>
cannam@89 2877
cannam@89 2878 <para>For something faster and lighter still, you might try Markus F
cannam@89 2879 X J Oberhumer's <computeroutput>LZO</computeroutput> real-time
cannam@89 2880 compression/decompression library, at
cannam@89 2881 <ulink url="http://www.oberhumer.com/opensource">http://www.oberhumer.com/opensource</ulink>.</para>
cannam@89 2882
cannam@89 2883 </sect1>
cannam@89 2884
cannam@89 2885
cannam@89 2886
cannam@89 2887 <sect1 id="reading" xreflabel="Further Reading">
cannam@89 2888 <title>Further Reading</title>
cannam@89 2889
cannam@89 2890 <para><computeroutput>bzip2</computeroutput> is not research
cannam@89 2891 work, in the sense that it doesn't present any new ideas.
cannam@89 2892 Rather, it's an engineering exercise based on existing
cannam@89 2893 ideas.</para>
cannam@89 2894
cannam@89 2895 <para>Four documents describe essentially all the ideas behind
cannam@89 2896 <computeroutput>bzip2</computeroutput>:</para>
cannam@89 2897
cannam@89 2898 <literallayout>Michael Burrows and D. J. Wheeler:
cannam@89 2899 "A block-sorting lossless data compression algorithm"
cannam@89 2900 10th May 1994.
cannam@89 2901 Digital SRC Research Report 124.
cannam@89 2902 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
cannam@89 2903 If you have trouble finding it, try searching at the
cannam@89 2904 New Zealand Digital Library, http://www.nzdl.org.
cannam@89 2905
cannam@89 2906 Daniel S. Hirschberg and Debra A. Lelewer
cannam@89 2907 "Efficient Decoding of Prefix Codes"
cannam@89 2908 Communications of the ACM, April 1990, Vol 33, Number 4.
cannam@89 2909 You might be able to get an electronic copy of this
cannam@89 2910 from the ACM Digital Library.
cannam@89 2911
cannam@89 2912 David J. Wheeler
cannam@89 2913 Program bred3.c and accompanying document bred3.ps.
cannam@89 2914 This contains the idea behind the multi-table Huffman coding scheme.
cannam@89 2915 ftp://ftp.cl.cam.ac.uk/users/djw3/
cannam@89 2916
cannam@89 2917 Jon L. Bentley and Robert Sedgewick
cannam@89 2918 "Fast Algorithms for Sorting and Searching Strings"
cannam@89 2919 Available from Sedgewick's web page,
cannam@89 2920 www.cs.princeton.edu/~rs
cannam@89 2921 </literallayout>
cannam@89 2922
cannam@89 2923 <para>The following paper gives valuable additional insights into
cannam@89 2924 the algorithm, but is not immediately the basis of any code used
cannam@89 2925 in bzip2.</para>
cannam@89 2926
cannam@89 2927 <literallayout>Peter Fenwick:
cannam@89 2928 Block Sorting Text Compression
cannam@89 2929 Proceedings of the 19th Australasian Computer Science Conference,
cannam@89 2930 Melbourne, Australia. Jan 31 - Feb 2, 1996.
cannam@89 2931 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps</literallayout>
cannam@89 2932
cannam@89 2933 <para>Kunihiko Sadakane's sorting algorithm, mentioned above, is
cannam@89 2934 available from:</para>
cannam@89 2935
cannam@89 2936 <literallayout>http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
cannam@89 2937 </literallayout>
cannam@89 2938
cannam@89 2939 <para>The Manber-Myers suffix array construction algorithm is
cannam@89 2940 described in a paper available from:</para>
cannam@89 2941
cannam@89 2942 <literallayout>http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
cannam@89 2943 </literallayout>
cannam@89 2944
cannam@89 2945 <para>Finally, the following papers document some
cannam@89 2946 investigations I made into the performance of sorting
cannam@89 2947 and decompression algorithms:</para>
cannam@89 2948
cannam@89 2949 <literallayout>Julian Seward
cannam@89 2950 On the Performance of BWT Sorting Algorithms
cannam@89 2951 Proceedings of the IEEE Data Compression Conference 2000
cannam@89 2952 Snowbird, Utah. 28-30 March 2000.
cannam@89 2953
cannam@89 2954 Julian Seward
cannam@89 2955 Space-time Tradeoffs in the Inverse B-W Transform
cannam@89 2956 Proceedings of the IEEE Data Compression Conference 2001
cannam@89 2957 Snowbird, Utah. 27-29 March 2001.
cannam@89 2958 </literallayout>
cannam@89 2959
cannam@89 2960 </sect1>
cannam@89 2961
cannam@89 2962 </chapter>
cannam@89 2963
cannam@89 2964 </book>