annotate src/bzip2-1.0.6/manual.xml @ 83:ae30d91d2ffe

Replace these with versions built using an older toolset (so as to avoid ABI compatibilities when linking on Ubuntu 14.04 for packaging purposes)
author Chris Cannam
date Fri, 07 Feb 2020 11:51:13 +0000
parents e13257ea84a4
children
rev   line source
Chris@4 1 <?xml version="1.0"?> <!-- -*- sgml -*- -->
Chris@4 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
Chris@4 3 "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"[
Chris@4 4
Chris@4 5 <!-- various strings, dates etc. common to all docs -->
Chris@4 6 <!ENTITY % common-ents SYSTEM "entities.xml"> %common-ents;
Chris@4 7 ]>
Chris@4 8
Chris@4 9 <book lang="en" id="userman" xreflabel="bzip2 Manual">
Chris@4 10
Chris@4 11 <bookinfo>
Chris@4 12 <title>bzip2 and libbzip2, version 1.0.6</title>
Chris@4 13 <subtitle>A program and library for data compression</subtitle>
Chris@4 14 <copyright>
Chris@4 15 <year>&bz-lifespan;</year>
Chris@4 16 <holder>Julian Seward</holder>
Chris@4 17 </copyright>
Chris@4 18 <releaseinfo>Version &bz-version; of &bz-date;</releaseinfo>
Chris@4 19
Chris@4 20 <authorgroup>
Chris@4 21 <author>
Chris@4 22 <firstname>Julian</firstname>
Chris@4 23 <surname>Seward</surname>
Chris@4 24 <affiliation>
Chris@4 25 <orgname>&bz-url;</orgname>
Chris@4 26 </affiliation>
Chris@4 27 </author>
Chris@4 28 </authorgroup>
Chris@4 29
Chris@4 30 <legalnotice>
Chris@4 31
Chris@4 32 <para>This program, <computeroutput>bzip2</computeroutput>, the
Chris@4 33 associated library <computeroutput>libbzip2</computeroutput>, and
Chris@4 34 all documentation, are copyright &copy; &bz-lifespan; Julian Seward.
Chris@4 35 All rights reserved.</para>
Chris@4 36
Chris@4 37 <para>Redistribution and use in source and binary forms, with
Chris@4 38 or without modification, are permitted provided that the
Chris@4 39 following conditions are met:</para>
Chris@4 40
Chris@4 41 <itemizedlist mark='bullet'>
Chris@4 42
Chris@4 43 <listitem><para>Redistributions of source code must retain the
Chris@4 44 above copyright notice, this list of conditions and the
Chris@4 45 following disclaimer.</para></listitem>
Chris@4 46
Chris@4 47 <listitem><para>The origin of this software must not be
Chris@4 48 misrepresented; you must not claim that you wrote the original
Chris@4 49 software. If you use this software in a product, an
Chris@4 50 acknowledgment in the product documentation would be
Chris@4 51 appreciated but is not required.</para></listitem>
Chris@4 52
Chris@4 53 <listitem><para>Altered source versions must be plainly marked
Chris@4 54 as such, and must not be misrepresented as being the original
Chris@4 55 software.</para></listitem>
Chris@4 56
Chris@4 57 <listitem><para>The name of the author may not be used to
Chris@4 58 endorse or promote products derived from this software without
Chris@4 59 specific prior written permission.</para></listitem>
Chris@4 60
Chris@4 61 </itemizedlist>
Chris@4 62
Chris@4 63 <para>THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY
Chris@4 64 EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
Chris@4 65 THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
Chris@4 66 PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
Chris@4 67 AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
Chris@4 68 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
Chris@4 69 TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
Chris@4 70 DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
Chris@4 71 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
Chris@4 72 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
Chris@4 73 IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
Chris@4 74 THE POSSIBILITY OF SUCH DAMAGE.</para>
Chris@4 75
Chris@4 76 <para>PATENTS: To the best of my knowledge,
Chris@4 77 <computeroutput>bzip2</computeroutput> and
Chris@4 78 <computeroutput>libbzip2</computeroutput> do not use any patented
Chris@4 79 algorithms. However, I do not have the resources to carry
Chris@4 80 out a patent search. Therefore I cannot give any guarantee of
Chris@4 81 the above statement.
Chris@4 82 </para>
Chris@4 83
Chris@4 84 </legalnotice>
Chris@4 85
Chris@4 86 </bookinfo>
Chris@4 87
Chris@4 88
Chris@4 89
Chris@4 90 <chapter id="intro" xreflabel="Introduction">
Chris@4 91 <title>Introduction</title>
Chris@4 92
Chris@4 93 <para><computeroutput>bzip2</computeroutput> compresses files
Chris@4 94 using the Burrows-Wheeler block-sorting text compression
Chris@4 95 algorithm, and Huffman coding. Compression is generally
Chris@4 96 considerably better than that achieved by more conventional
Chris@4 97 LZ77/LZ78-based compressors, and approaches the performance of
Chris@4 98 the PPM family of statistical compressors.</para>
Chris@4 99
Chris@4 100 <para><computeroutput>bzip2</computeroutput> is built on top of
Chris@4 101 <computeroutput>libbzip2</computeroutput>, a flexible library for
Chris@4 102 handling compressed data in the
Chris@4 103 <computeroutput>bzip2</computeroutput> format. This manual
Chris@4 104 describes both how to use the program and how to work with the
Chris@4 105 library interface. Most of the manual is devoted to this
Chris@4 106 library, not the program, which is good news if your interest is
Chris@4 107 only in the program.</para>
Chris@4 108
Chris@4 109 <itemizedlist mark='bullet'>
Chris@4 110
Chris@4 111 <listitem><para><xref linkend="using"/> describes how to use
Chris@4 112 <computeroutput>bzip2</computeroutput>; this is the only part
Chris@4 113 you need to read if you just want to know how to operate the
Chris@4 114 program.</para></listitem>
Chris@4 115
Chris@4 116 <listitem><para><xref linkend="libprog"/> describes the
Chris@4 117 programming interfaces in detail, and</para></listitem>
Chris@4 118
Chris@4 119 <listitem><para><xref linkend="misc"/> records some
Chris@4 120 miscellaneous notes which I thought ought to be recorded
Chris@4 121 somewhere.</para></listitem>
Chris@4 122
Chris@4 123 </itemizedlist>
Chris@4 124
Chris@4 125 </chapter>
Chris@4 126
Chris@4 127
Chris@4 128 <chapter id="using" xreflabel="How to use bzip2">
Chris@4 129 <title>How to use bzip2</title>
Chris@4 130
Chris@4 131 <para>This chapter contains a copy of the
Chris@4 132 <computeroutput>bzip2</computeroutput> man page, and nothing
Chris@4 133 else.</para>
Chris@4 134
Chris@4 135 <sect1 id="name" xreflabel="NAME">
Chris@4 136 <title>NAME</title>
Chris@4 137
Chris@4 138 <itemizedlist mark='bullet'>
Chris@4 139
Chris@4 140 <listitem><para><computeroutput>bzip2</computeroutput>,
Chris@4 141 <computeroutput>bunzip2</computeroutput> - a block-sorting file
Chris@4 142 compressor, v1.0.6</para></listitem>
Chris@4 143
Chris@4 144 <listitem><para><computeroutput>bzcat</computeroutput> -
Chris@4 145 decompresses files to stdout</para></listitem>
Chris@4 146
Chris@4 147 <listitem><para><computeroutput>bzip2recover</computeroutput> -
Chris@4 148 recovers data from damaged bzip2 files</para></listitem>
Chris@4 149
Chris@4 150 </itemizedlist>
Chris@4 151
Chris@4 152 </sect1>
Chris@4 153
Chris@4 154
Chris@4 155 <sect1 id="synopsis" xreflabel="SYNOPSIS">
Chris@4 156 <title>SYNOPSIS</title>
Chris@4 157
Chris@4 158 <itemizedlist mark='bullet'>
Chris@4 159
Chris@4 160 <listitem><para><computeroutput>bzip2</computeroutput> [
Chris@4 161 -cdfkqstvzVL123456789 ] [ filenames ... ]</para></listitem>
Chris@4 162
Chris@4 163 <listitem><para><computeroutput>bunzip2</computeroutput> [
Chris@4 164 -fkvsVL ] [ filenames ... ]</para></listitem>
Chris@4 165
Chris@4 166 <listitem><para><computeroutput>bzcat</computeroutput> [ -s ] [
Chris@4 167 filenames ... ]</para></listitem>
Chris@4 168
Chris@4 169 <listitem><para><computeroutput>bzip2recover</computeroutput>
Chris@4 170 filename</para></listitem>
Chris@4 171
Chris@4 172 </itemizedlist>
Chris@4 173
Chris@4 174 </sect1>
Chris@4 175
Chris@4 176
Chris@4 177 <sect1 id="description" xreflabel="DESCRIPTION">
Chris@4 178 <title>DESCRIPTION</title>
Chris@4 179
Chris@4 180 <para><computeroutput>bzip2</computeroutput> compresses files
Chris@4 181 using the Burrows-Wheeler block-sorting text compression
Chris@4 182 algorithm, and Huffman coding. Compression is generally
Chris@4 183 considerably better than that achieved by more conventional
Chris@4 184 LZ77/LZ78-based compressors, and approaches the performance of
Chris@4 185 the PPM family of statistical compressors.</para>
Chris@4 186
Chris@4 187 <para>The command-line options are deliberately very similar to
Chris@4 188 those of GNU <computeroutput>gzip</computeroutput>, but they are
Chris@4 189 not identical.</para>
Chris@4 190
Chris@4 191 <para><computeroutput>bzip2</computeroutput> expects a list of
Chris@4 192 file names to accompany the command-line flags. Each file is
Chris@4 193 replaced by a compressed version of itself, with the name
Chris@4 194 <computeroutput>original_name.bz2</computeroutput>. Each
Chris@4 195 compressed file has the same modification date, permissions, and,
Chris@4 196 when possible, ownership as the corresponding original, so that
Chris@4 197 these properties can be correctly restored at decompression time.
Chris@4 198 File name handling is naive in the sense that there is no
Chris@4 199 mechanism for preserving original file names, permissions,
Chris@4 200 ownerships or dates in filesystems which lack these concepts, or
Chris@4 201 have serious file name length restrictions, such as
Chris@4 202 MS-DOS.</para>
Chris@4 203
Chris@4 204 <para><computeroutput>bzip2</computeroutput> and
Chris@4 205 <computeroutput>bunzip2</computeroutput> will by default not
Chris@4 206 overwrite existing files. If you want this to happen, specify
Chris@4 207 the <computeroutput>-f</computeroutput> flag.</para>
Chris@4 208
Chris@4 209 <para>If no file names are specified,
Chris@4 210 <computeroutput>bzip2</computeroutput> compresses from standard
Chris@4 211 input to standard output. In this case,
Chris@4 212 <computeroutput>bzip2</computeroutput> will decline to write
Chris@4 213 compressed output to a terminal, as this would be entirely
Chris@4 214 incomprehensible and therefore pointless.</para>
Chris@4 215
Chris@4 216 <para><computeroutput>bunzip2</computeroutput> (or
Chris@4 217 <computeroutput>bzip2 -d</computeroutput>) decompresses all
Chris@4 218 specified files. Files which were not created by
Chris@4 219 <computeroutput>bzip2</computeroutput> will be detected and
Chris@4 220 ignored, and a warning issued.
Chris@4 221 <computeroutput>bzip2</computeroutput> attempts to guess the
Chris@4 222 filename for the decompressed file from that of the compressed
Chris@4 223 file as follows:</para>
Chris@4 224
Chris@4 225 <itemizedlist mark='bullet'>
Chris@4 226
Chris@4 227 <listitem><para><computeroutput>filename.bz2 </computeroutput>
Chris@4 228 becomes
Chris@4 229 <computeroutput>filename</computeroutput></para></listitem>
Chris@4 230
Chris@4 231 <listitem><para><computeroutput>filename.bz </computeroutput>
Chris@4 232 becomes
Chris@4 233 <computeroutput>filename</computeroutput></para></listitem>
Chris@4 234
Chris@4 235 <listitem><para><computeroutput>filename.tbz2</computeroutput>
Chris@4 236 becomes
Chris@4 237 <computeroutput>filename.tar</computeroutput></para></listitem>
Chris@4 238
Chris@4 239 <listitem><para><computeroutput>filename.tbz </computeroutput>
Chris@4 240 becomes
Chris@4 241 <computeroutput>filename.tar</computeroutput></para></listitem>
Chris@4 242
Chris@4 243 <listitem><para><computeroutput>anyothername </computeroutput>
Chris@4 244 becomes
Chris@4 245 <computeroutput>anyothername.out</computeroutput></para></listitem>
Chris@4 246
Chris@4 247 </itemizedlist>
Chris@4 248
Chris@4 249 <para>If the file does not end in one of the recognised endings,
Chris@4 250 <computeroutput>.bz2</computeroutput>,
Chris@4 251 <computeroutput>.bz</computeroutput>,
Chris@4 252 <computeroutput>.tbz2</computeroutput> or
Chris@4 253 <computeroutput>.tbz</computeroutput>,
Chris@4 254 <computeroutput>bzip2</computeroutput> complains that it cannot
Chris@4 255 guess the name of the original file, and uses the original name
Chris@4 256 with <computeroutput>.out</computeroutput> appended.</para>
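<para>The renaming rule above can be sketched in a few lines of
Python. This is only an illustration of the rule as described here,
not the code used by <computeroutput>bzip2</computeroutput>; the
names <computeroutput>SUFFIX_MAP</computeroutput> and
<computeroutput>guess_output_name</computeroutput> are
hypothetical.</para>

<programlisting>
```python
# Sketch of the suffix rule described above; not bzip2's actual source code.
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(compressed_name):
    """Return the name the decompressed file would get under the rule above."""
    for suffix, replacement in SUFFIX_MAP:
        if compressed_name.endswith(suffix):
            return compressed_name[: -len(suffix)] + replacement
    # Unrecognised ending: keep the name and append ".out".
    return compressed_name + ".out"


for name in ["notes.bz2", "notes.bz", "backup.tbz2", "backup.tbz", "data.bin"]:
    print(name, "->", guess_output_name(name))
```
</programlisting>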
Chris@4 257
Chris@4 258 <para>As with compression, supplying no filenames causes
Chris@4 259 decompression from standard input to standard output.</para>
Chris@4 260
Chris@4 261 <para><computeroutput>bunzip2</computeroutput> will correctly
Chris@4 262 decompress a file which is the concatenation of two or more
Chris@4 263 compressed files. The result is the concatenation of the
Chris@4 264 corresponding uncompressed files. Integrity testing
Chris@4 265 (<computeroutput>-t</computeroutput>) of concatenated compressed
Chris@4 266 files is also supported.</para>
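<para>The same behaviour can be observed without the command-line
program. The sketch below assumes Python 3.3 or later, whose
standard <computeroutput>bz2</computeroutput> module decompresses
concatenated streams; it illustrates the format property and does
not show how <computeroutput>bunzip2</computeroutput> itself is
implemented.</para>

<programlisting>
```python
import bz2

first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")

# Concatenating two complete compressed streams yields a valid multi-stream file.
combined = first + second

# bz2.decompress handles multiple concatenated streams (Python 3.3 and later).
restored = bz2.decompress(combined)
print(restored)
```
</programlisting>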
Chris@4 267
Chris@4 268 <para>You can also compress or decompress files to the standard
Chris@4 269 output by giving the <computeroutput>-c</computeroutput> flag.
Chris@4 270 Multiple files may be compressed and decompressed like this. The
Chris@4 271 resulting outputs are fed sequentially to stdout. Compression of
Chris@4 272 multiple files in this manner generates a stream containing
Chris@4 273 multiple compressed file representations. Such a stream can be
Chris@4 274 decompressed correctly only by
Chris@4 275 <computeroutput>bzip2</computeroutput> version 0.9.0 or later.
Chris@4 276 Earlier versions of <computeroutput>bzip2</computeroutput> will
Chris@4 277 stop after decompressing the first file in the stream.</para>
Chris@4 278
Chris@4 279 <para><computeroutput>bzcat</computeroutput> (or
Chris@4 280 <computeroutput>bzip2 -dc</computeroutput>) decompresses all
Chris@4 281 specified files to the standard output.</para>
Chris@4 282
Chris@4 283 <para><computeroutput>bzip2</computeroutput> will read arguments
Chris@4 284 from the environment variables
Chris@4 285 <computeroutput>BZIP2</computeroutput> and
Chris@4 286 <computeroutput>BZIP</computeroutput>, in that order, and will
Chris@4 287 process them before any arguments read from the command line.
Chris@4 288 This gives a convenient way to supply default arguments.</para>
Chris@4 289
Chris@4 290 <para>Compression is always performed, even if the compressed
Chris@4 291 file is slightly larger than the original. Files of less than
Chris@4 292 about one hundred bytes tend to get larger, since the compression
Chris@4 293 mechanism has a constant overhead in the region of 50 bytes.
Chris@4 294 Random data (including the output of most file compressors) is
Chris@4 295 coded at about 8.05 bits per byte, giving an expansion of around
Chris@4 296 0.5%.</para>
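<para>A quick way to see this overhead is to compress a very short
input and a block of random bytes and compare the sizes. The sketch
uses Python's standard <computeroutput>bz2</computeroutput> module
as a stand-in for the program; the exact byte counts depend on the
input and are not quoted here.</para>

<programlisting>
```python
import bz2
import os

tiny = b"hello"
random_block = os.urandom(100000)  # stands in for already-compressed data

tiny_packed = bz2.compress(tiny)
random_packed = bz2.compress(random_block)

# Both outputs are expected to be larger than their inputs.
print(len(tiny), "bytes in,", len(tiny_packed), "bytes out")
print(len(random_block), "bytes in,", len(random_packed), "bytes out")
```
</programlisting>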
Chris@4 297
Chris@4 298 <para>As a self-check for your protection,
Chris@4 299 <computeroutput>bzip2</computeroutput> uses 32-bit CRCs to make
Chris@4 300 sure that the decompressed version of a file is identical to the
Chris@4 301 original. This guards against corruption of the compressed data,
Chris@4 302 and against undetected bugs in
Chris@4 303 <computeroutput>bzip2</computeroutput> (hopefully very unlikely).
Chris@4 304 The chance of data corruption going undetected is microscopic,
Chris@4 305 about one chance in four billion for each file processed. Be
Chris@4 306 aware, though, that the check occurs upon decompression, so it
Chris@4 307 can only tell you that something is wrong. It can't help you
Chris@4 308 recover the original uncompressed data. You can use
Chris@4 309 <computeroutput>bzip2recover</computeroutput> to try to recover
Chris@4 310 data from damaged files.</para>
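<para>The following sketch simulates damage by inverting one byte
of a compressed stream and then attempting to decompress it. It
assumes Python's standard <computeroutput>bz2</computeroutput>
module, which reports a damaged stream by raising
<computeroutput>OSError</computeroutput> or
<computeroutput>ValueError</computeroutput>; which of the two is
raised, and whether the failure comes from the CRC or from an
earlier structural check, depends on where the damage falls.</para>

<programlisting>
```python
import bz2

original = b"hello world " * 1000
packed = bytearray(bz2.compress(original))

# Invert one byte in the middle of the compressed stream to simulate damage.
middle = len(packed) // 2
packed[middle] = packed[middle] ^ 0xFF

try:
    bz2.decompress(bytes(packed))
    damage_detected = False
except (OSError, ValueError) as error:
    # The decoder rejects the stream; the original data is not recovered.
    damage_detected = True
    print("decompression failed:", error)
```
</programlisting>

<para>As the paragraph above says, the check only reports that
something is wrong; nothing in this sketch recovers the original
data.</para>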
Chris@4 311
Chris@4 312 <para>Return values: 0 for a normal exit, 1 for environmental
Chris@4 313 problems (file not found, invalid flags, I/O errors, etc.), 2
Chris@4 314 to indicate a corrupt compressed file, 3 for an internal
Chris@4 315 consistency error (e.g., a bug) which caused
Chris@4 316 <computeroutput>bzip2</computeroutput> to panic.</para>
Chris@4 317
Chris@4 318 </sect1>
Chris@4 319
Chris@4 320
Chris@4 321 <sect1 id="options" xreflabel="OPTIONS">
Chris@4 322 <title>OPTIONS</title>
Chris@4 323
Chris@4 324 <variablelist>
Chris@4 325
Chris@4 326 <varlistentry>
Chris@4 327 <term><computeroutput>-c --stdout</computeroutput></term>
Chris@4 328 <listitem><para>Compress or decompress to standard
Chris@4 329 output.</para></listitem>
Chris@4 330 </varlistentry>
Chris@4 331
Chris@4 332 <varlistentry>
Chris@4 333 <term><computeroutput>-d --decompress</computeroutput></term>
Chris@4 334 <listitem><para>Force decompression.
Chris@4 335 <computeroutput>bzip2</computeroutput>,
Chris@4 336 <computeroutput>bunzip2</computeroutput> and
Chris@4 337 <computeroutput>bzcat</computeroutput> are really the same
Chris@4 338 program, and the decision about what actions to take is done on
Chris@4 339 the basis of which name is used. This flag overrides that
Chris@4 340 mechanism, and forces bzip2 to decompress.</para></listitem>
Chris@4 341 </varlistentry>
Chris@4 342
Chris@4 343 <varlistentry>
Chris@4 344 <term><computeroutput>-z --compress</computeroutput></term>
Chris@4 345 <listitem><para>The complement to
Chris@4 346 <computeroutput>-d</computeroutput>: forces compression,
Chris@4 347 regardless of the invocation name.</para></listitem>
Chris@4 348 </varlistentry>
Chris@4 349
Chris@4 350 <varlistentry>
Chris@4 351 <term><computeroutput>-t --test</computeroutput></term>
Chris@4 352 <listitem><para>Check integrity of the specified file(s), but
Chris@4 353 don't decompress them. This really performs a trial
Chris@4 354 decompression and throws away the result.</para></listitem>
Chris@4 355 </varlistentry>
Chris@4 356
Chris@4 357 <varlistentry>
Chris@4 358 <term><computeroutput>-f --force</computeroutput></term>
Chris@4 359 <listitem><para>Force overwrite of output files. Normally,
Chris@4 360 <computeroutput>bzip2</computeroutput> will not overwrite
Chris@4 361 existing output files. Also forces
Chris@4 362 <computeroutput>bzip2</computeroutput> to break hard links to
Chris@4 363 files, which it otherwise wouldn't do.</para>
Chris@4 364 <para><computeroutput>bzip2</computeroutput> normally declines
Chris@4 365 to decompress files which don't have the correct magic header
Chris@4 366 bytes. If forced (<computeroutput>-f</computeroutput>),
Chris@4 367 however, it will pass such files through unmodified. This is
Chris@4 368 how GNU <computeroutput>gzip</computeroutput> behaves.</para>
Chris@4 369 </listitem>
Chris@4 370 </varlistentry>
Chris@4 371
Chris@4 372 <varlistentry>
Chris@4 373 <term><computeroutput>-k --keep</computeroutput></term>
Chris@4 374 <listitem><para>Keep (don't delete) input files during
Chris@4 375 compression or decompression.</para></listitem>
Chris@4 376 </varlistentry>
Chris@4 377
Chris@4 378 <varlistentry>
Chris@4 379 <term><computeroutput>-s --small</computeroutput></term>
Chris@4 380 <listitem><para>Reduce memory usage, for compression,
Chris@4 381 decompression and testing. Files are decompressed and tested
Chris@4 382 using a modified algorithm which only requires 2.5 bytes per
Chris@4 383 block byte. This means any file can be decompressed in 2300k
Chris@4 384 of memory, albeit at about half the normal speed.</para>
Chris@4 385 <para>During compression, <computeroutput>-s</computeroutput>
Chris@4 386 selects a block size of 200k, which limits memory use to around
Chris@4 387 the same figure, at the expense of your compression ratio. In
Chris@4 388 short, if your machine is low on memory (8 megabytes or less),
Chris@4 389 use <computeroutput>-s</computeroutput> for everything. See
Chris@4 390 <xref linkend="memory-management"/> below.</para></listitem>
Chris@4 391 </varlistentry>
Chris@4 392
Chris@4 393 <varlistentry>
Chris@4 394 <term><computeroutput>-q --quiet</computeroutput></term>
Chris@4 395 <listitem><para>Suppress non-essential warning messages.
Chris@4 396 Messages pertaining to I/O errors and other critical events
Chris@4 397 will not be suppressed.</para></listitem>
Chris@4 398 </varlistentry>
Chris@4 399
Chris@4 400 <varlistentry>
Chris@4 401 <term><computeroutput>-v --verbose</computeroutput></term>
Chris@4 402 <listitem><para>Verbose mode -- show the compression ratio for
Chris@4 403 each file processed. Further
Chris@4 404 <computeroutput>-v</computeroutput>'s increase the verbosity
Chris@4 405 level, spewing out lots of information which is primarily of
Chris@4 406 interest for diagnostic purposes.</para></listitem>
Chris@4 407 </varlistentry>
Chris@4 408
Chris@4 409 <varlistentry>
Chris@4 410 <term><computeroutput>-L --license -V --version</computeroutput></term>
Chris@4 411 <listitem><para>Display the software version, license terms and
Chris@4 412 conditions.</para></listitem>
Chris@4 413 </varlistentry>
Chris@4 414
Chris@4 415 <varlistentry>
Chris@4 416 <term><computeroutput>-1</computeroutput> (or
Chris@4 417 <computeroutput>--fast</computeroutput>) to
Chris@4 418 <computeroutput>-9</computeroutput> (or
Chris@4 419 <computeroutput>--best</computeroutput>)</term>
Chris@4 420 <listitem><para>Set the block size to 100 k, 200 k ... 900 k
Chris@4 421 when compressing. Has no effect when decompressing. See <xref
Chris@4 422 linkend="memory-management" /> below. The
Chris@4 423 <computeroutput>--fast</computeroutput> and
Chris@4 424 <computeroutput>--best</computeroutput> aliases are primarily
Chris@4 425 for GNU <computeroutput>gzip</computeroutput> compatibility.
Chris@4 426 In particular, <computeroutput>--fast</computeroutput> doesn't
Chris@4 427 make things significantly faster. And
Chris@4 428 <computeroutput>--best</computeroutput> merely selects the
Chris@4 429 default behaviour.</para></listitem>
Chris@4 430 </varlistentry>
Chris@4 431
Chris@4 432 <varlistentry>
Chris@4 433 <term><computeroutput>--</computeroutput></term>
Chris@4 434 <listitem><para>Treats all subsequent arguments as file names,
Chris@4 435 even if they start with a dash. This is so you can handle
Chris@4 436 files with names beginning with a dash, for example:
Chris@4 437 <computeroutput>bzip2 --
Chris@4 438 -myfilename</computeroutput>.</para></listitem>
Chris@4 439 </varlistentry>
Chris@4 440
Chris@4 441 <varlistentry>
Chris@4 442 <term><computeroutput>--repetitive-fast</computeroutput></term>
Chris@4 443 <term><computeroutput>--repetitive-best</computeroutput></term>
Chris@4 444 <listitem><para>These flags are redundant in versions 0.9.5 and
Chris@4 445 above. They provided some coarse control over the behaviour of
Chris@4 446 the sorting algorithm in earlier versions, which was sometimes
Chris@4 447 useful. 0.9.5 and above have an improved algorithm which
Chris@4 448 renders these flags irrelevant.</para></listitem>
Chris@4 449 </varlistentry>
Chris@4 450
Chris@4 451 </variablelist>
Chris@4 452
Chris@4 453 </sect1>
Chris@4 454
Chris@4 455
Chris@4 456 <sect1 id="memory-management" xreflabel="MEMORY MANAGEMENT">
Chris@4 457 <title>MEMORY MANAGEMENT</title>
Chris@4 458
Chris@4 459 <para><computeroutput>bzip2</computeroutput> compresses large
Chris@4 460 files in blocks. The block size affects both the compression
Chris@4 461 ratio achieved, and the amount of memory needed for compression
Chris@4 462 and decompression. The flags <computeroutput>-1</computeroutput>
Chris@4 463 through <computeroutput>-9</computeroutput> specify the block
Chris@4 464 size to be 100,000 bytes through 900,000 bytes (the default)
Chris@4 465 respectively. At decompression time, the block size used for
Chris@4 466 compression is read from the header of the compressed file, and
Chris@4 467 <computeroutput>bunzip2</computeroutput> then allocates itself
Chris@4 468 just enough memory to decompress the file. Since block sizes are
Chris@4 469 stored in compressed files, it follows that the flags
Chris@4 470 <computeroutput>-1</computeroutput> to
Chris@4 471 <computeroutput>-9</computeroutput> are irrelevant to and so
Chris@4 472 ignored during decompression.</para>
Chris@4 473
Chris@4 474 <para>Compression and decompression requirements, in bytes, can be
Chris@4 475 estimated as:</para>
Chris@4 476 <programlisting>
Chris@4 477 Compression:   400k + ( 8 x block size )
Chris@4 478
Chris@4 479 Decompression: 100k + ( 4 x block size ), or
Chris@4 480                100k + ( 2.5 x block size )
Chris@4 481 </programlisting>
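<para>These estimates are easy to evaluate for a given flag. The
sketch below assumes that "k" means 1000 bytes, in line with the
100,000-byte block sizes quoted above; the function names are
hypothetical and are not part of
<computeroutput>bzip2</computeroutput> or
<computeroutput>libbzip2</computeroutput>.</para>

<programlisting>
```python
def block_size_bytes(level):
    """Block size selected by flags -1 to -9: 100,000 bytes per level."""
    return level * 100000


def compress_memory(level):
    """Estimated compression memory: 400k + (8 x block size)."""
    return 400000 + 8 * block_size_bytes(level)


def decompress_memory(level, small=False):
    """Estimated decompression memory: 100k + (4 x block size),
    or 100k + (2.5 x block size) when -s is used."""
    factor = 2.5 if small else 4
    return int(100000 + factor * block_size_bytes(level))


for level in (1, 5, 9):
    print(level, compress_memory(level), decompress_memory(level),
          decompress_memory(level, small=True))
```
</programlisting>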
Chris@4 482
Chris@4 483 <para>Larger block sizes give rapidly diminishing marginal
Chris@4 484 returns. Most of the compression comes from the first two or
Chris@4 485 three hundred k of block size, a fact worth bearing in mind when
Chris@4 486 using <computeroutput>bzip2</computeroutput> on small machines.
Chris@4 487 It is also important to appreciate that the decompression memory
Chris@4 488 requirement is set at compression time by the choice of block
Chris@4 489 size.</para>
Chris@4 490
Chris@4 491 <para>For files compressed with the default 900k block size,
Chris@4 492 <computeroutput>bunzip2</computeroutput> will require about 3700
Chris@4 493 kbytes to decompress. To support decompression of any file on a
Chris@4 494 4 megabyte machine, <computeroutput>bunzip2</computeroutput> has
Chris@4 495 an option to decompress using approximately half this amount of
Chris@4 496 memory, about 2300 kbytes. Decompression speed is also halved,
Chris@4 497 so you should use this option only where necessary. The relevant
Chris@4 498 flag is <computeroutput>-s</computeroutput>.</para>
Chris@4 499
Chris@4 500 <para>In general, try to use the largest block size memory
Chris@4 501 constraints allow, since that maximises the compression achieved.
Chris@4 502 Compression and decompression speed are virtually unaffected by
Chris@4 503 block size.</para>
Chris@4 504
Chris@4 505 <para>Another significant point applies to files which fit in a
Chris@4 506 single block -- that means most files you'd encounter using a
Chris@4 507 large block size. The amount of real memory touched is
Chris@4 508 proportional to the size of the file, since the file is smaller
Chris@4 509 than a block. For example, compressing a file 20,000 bytes long
Chris@4 510 with the flag <computeroutput>-9</computeroutput> will cause the
Chris@4 511 compressor to allocate around 7600k of memory, but only touch
Chris@4 512 400k + 20000 * 8 = 560 kbytes of it. Similarly, the decompressor
Chris@4 513 will allocate 3700k but only touch 100k + 20000 * 4 = 180
Chris@4 514 kbytes.</para>
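<para>The arithmetic in this example can be checked directly. The
sketch assumes the file fits in a single block and that "k" means
1000 bytes; the function names are hypothetical.</para>

<programlisting>
```python
def touched_when_compressing(file_size):
    """Memory touched for a file smaller than one block: 400k + 8 x size."""
    return 400000 + 8 * file_size


def touched_when_decompressing(file_size):
    """Memory touched for a file smaller than one block: 100k + 4 x size."""
    return 100000 + 4 * file_size


size = 20000
print(touched_when_compressing(size))    # 560000, i.e. 560 kbytes
print(touched_when_decompressing(size))  # 180000, i.e. 180 kbytes
```
</programlisting>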
Chris@4 515
Chris@4 516 <para>Here is a table which summarises the maximum memory usage
Chris@4 517 for different block sizes. Also recorded is the total compressed
Chris@4 518 size for 14 files of the Calgary Text Compression Corpus
Chris@4 519 totalling 3,141,622 bytes. This column gives some feel for how
Chris@4 520 compression varies with block size. These figures tend to
Chris@4 521 understate the advantage of larger block sizes for larger files,
Chris@4 522 since the Corpus is dominated by smaller files.</para>
Chris@4 523
Chris@4 524 <programlisting>
Chris@4 525            Compress   Decompress   Decompress   Corpus
Chris@4 526    Flag     usage      usage       -s usage     Size
Chris@4 527
Chris@4 528     -1      1200k       500k         350k      914704
Chris@4 529     -2      2000k       900k         600k      877703
Chris@4 530     -3      2800k      1300k         850k      860338
Chris@4 531     -4      3600k      1700k        1100k      846899
Chris@4 532     -5      4400k      2100k        1350k      845160
Chris@4 533     -6      5200k      2500k        1600k      838626
Chris@4 534     -7      6100k      2900k        1850k      834096
Chris@4 535     -8      6800k      3300k        2100k      828642
Chris@4 536     -9      7600k      3700k        2350k      828642
Chris@4 537 </programlisting>
Chris@4 538
Chris@4 539 </sect1>
Chris@4 540
Chris@4 541
Chris@4 542 <sect1 id="recovering" xreflabel="RECOVERING DATA FROM DAMAGED FILES">
Chris@4 543 <title>RECOVERING DATA FROM DAMAGED FILES</title>
Chris@4 544
Chris@4 545 <para><computeroutput>bzip2</computeroutput> compresses files in
Chris@4 546 blocks, usually 900 kbytes long. Each block is handled
Chris@4 547 independently. If a media or transmission error causes a
Chris@4 548 multi-block <computeroutput>.bz2</computeroutput> file to become
Chris@4 549 damaged, it may be possible to recover data from the undamaged
Chris@4 550 blocks in the file.</para>
Chris@4 551
Chris@4 552 <para>The compressed representation of each block is delimited by
Chris@4 553 a 48-bit pattern, which makes it possible to find the block
Chris@4 554 boundaries with reasonable certainty. Each block also carries
Chris@4 555 its own 32-bit CRC, so damaged blocks can be distinguished from
Chris@4 556 undamaged ones.</para>
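<para>To give a feel for how block boundaries can be located, the
sketch below searches a compressed stream for the block pattern at
every bit offset, since blocks need not start on a byte boundary.
The pattern value <computeroutput>0x314159265359</computeroutput>
is an assumption taken from common descriptions of the format; it
is not stated in this manual. The code is an illustration in
Python, not the method used by
<computeroutput>bzip2recover</computeroutput>, and
<computeroutput>count_block_headers</computeroutput> is a
hypothetical name.</para>

<programlisting>
```python
import bz2

# 48-bit block-header pattern of the bzip2 format (an assumption taken from
# common descriptions of the format; this manual does not list the value).
BLOCK_MAGIC_BITS = format(0x314159265359, "048b")


def count_block_headers(compressed):
    """Count occurrences of the block pattern at any bit offset."""
    bits = "".join(format(byte, "08b") for byte in compressed)
    return bits.count(BLOCK_MAGIC_BITS)


# 256,000 bytes with no repeated neighbours, compressed with 100k blocks.
payload = bytes(range(256)) * 1000
packed = bz2.compress(payload, compresslevel=1)
print(count_block_headers(packed))
```
</programlisting>

<para>Building a string of bits is wasteful for large files; it is
used here only because it keeps the bit-level search short and easy
to read.</para>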
Chris@4 557
Chris@4 558 <para><computeroutput>bzip2recover</computeroutput> is a simple
Chris@4 559 program whose purpose is to search for blocks in
Chris@4 560 <computeroutput>.bz2</computeroutput> files, and write each block
Chris@4 561 out into its own <computeroutput>.bz2</computeroutput> file. You
Chris@4 562 can then use <computeroutput>bzip2 -t</computeroutput> to test
Chris@4 563 the integrity of the resulting files, and decompress those which
Chris@4 564 are undamaged.</para>
Chris@4 565
Chris@4 566 <para><computeroutput>bzip2recover</computeroutput> takes a
Chris@4 567 single argument, the name of the damaged file, and writes a
Chris@4 568 number of files <computeroutput>rec0001file.bz2</computeroutput>,
Chris@4 569 <computeroutput>rec0002file.bz2</computeroutput>, etc, containing
Chris@4 570 the extracted blocks. The output filenames are designed so that
Chris@4 571 the use of wildcards in subsequent processing -- for example,
Chris@4 572 <computeroutput>bzip2 -dc rec*file.bz2 &#62;
Chris@4 573 recovered_data</computeroutput> -- lists the files in the correct
Chris@4 574 order.</para>
Chris@4 575
Chris@4 576 <para><computeroutput>bzip2recover</computeroutput> should be of
Chris@4 577 most use when dealing with large <computeroutput>.bz2</computeroutput>
Chris@4 578 files, as these will contain many blocks. It is clearly futile
Chris@4 579 to use it on damaged single-block files, since a damaged block
Chris@4 580 cannot be recovered. If you wish to minimise any potential data
Chris@4 581 loss through media or transmission errors, you might consider
Chris@4 582 compressing with a smaller block size.</para>
Chris@4 583
Chris@4 584 </sect1>
Chris@4 585
Chris@4 586
Chris@4 587 <sect1 id="performance" xreflabel="PERFORMANCE NOTES">
Chris@4 588 <title>PERFORMANCE NOTES</title>
Chris@4 589
Chris@4 590 <para>The sorting phase of compression gathers together similar
Chris@4 591 strings in the file. Because of this, files containing very long
Chris@4 592 runs of repeated symbols, like "aabaabaabaab ..." (repeated
Chris@4 593 several hundred times) may compress more slowly than normal.
Chris@4 594 Versions 0.9.5 and above fare much better than previous versions
Chris@4 595 in this respect. The ratio between worst-case and average-case
Chris@4 596 compression time is in the region of 10:1. For previous
Chris@4 597 versions, this figure was more like 100:1. You can use the
Chris@4 598 <computeroutput>-vvvv</computeroutput> option to monitor progress
Chris@4 599 in great detail, if you want.</para>
Chris@4 600
Chris@4 601 <para>Decompression speed is unaffected by these
Chris@4 602 phenomena.</para>
Chris@4 603
Chris@4 604 <para><computeroutput>bzip2</computeroutput> usually allocates
Chris@4 605 several megabytes of memory to operate in, and then charges all
Chris@4 606 over it in a fairly random fashion. This means that performance,
Chris@4 607 both for compressing and decompressing, is largely determined by
Chris@4 608 the speed at which your machine can service cache misses.
Chris@4 609 Because of this, small changes to the code to reduce the miss
Chris@4 610 rate have been observed to give disproportionately large
Chris@4 611 performance improvements. I imagine
Chris@4 612 <computeroutput>bzip2</computeroutput> will perform best on
Chris@4 613 machines with very large caches.</para>
Chris@4 614
Chris@4 615 </sect1>
Chris@4 616
Chris@4 617
Chris@4 618
Chris@4 619 <sect1 id="caveats" xreflabel="CAVEATS">
Chris@4 620 <title>CAVEATS</title>
Chris@4 621
Chris@4 622 <para>I/O error messages are not as helpful as they could be.
Chris@4 623 <computeroutput>bzip2</computeroutput> tries hard to detect I/O
Chris@4 624 errors and exit cleanly, but the details of what the problem is
Chris@4 625 sometimes seem rather misleading.</para>
Chris@4 626
Chris@4 627 <para>This manual page pertains to version &bz-version; of
Chris@4 628 <computeroutput>bzip2</computeroutput>. Compressed data created by
Chris@4 629 this version is entirely forwards and backwards compatible with the
Chris@4 630 previous public releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0,
Chris@4 631 1.0.1, 1.0.2 and 1.0.3, with the following exception: 0.9.0 and
Chris@4 632 above can correctly decompress multiple concatenated compressed files.
Chris@4 633 0.1pl2 cannot do this; it will stop after decompressing just the first
Chris@4 634 file in the stream.</para>
Chris@4 635
Chris@4 636 <para><computeroutput>bzip2recover</computeroutput> versions
Chris@4 637 prior to 1.0.2 used 32-bit integers to represent bit positions in
Chris@4 638 compressed files, so they could not handle compressed files more
Chris@4 639 than 512 megabytes long. Versions 1.0.2 and above use 64-bit ints
Chris@4 640 on some platforms which support them (GNU supported targets, and
Chris@4 641 Windows). To establish whether or not
Chris@4 642 <computeroutput>bzip2recover</computeroutput> was built with such
Chris@4 643 a limitation, run it without arguments. In any event you can
Chris@4 644 build yourself an unlimited version if you can recompile it with
Chris@4 645 <computeroutput>MaybeUInt64</computeroutput> set to be an
Chris@4 646 unsigned 64-bit integer.</para>
Chris@4 647
Chris@4 648 </sect1>
Chris@4 649
Chris@4 650
Chris@4 651
Chris@4 652 <sect1 id="author" xreflabel="AUTHOR">
Chris@4 653 <title>AUTHOR</title>
Chris@4 654
Chris@4 655 <para>Julian Seward,
Chris@4 656 <computeroutput>&bz-email;</computeroutput></para>
Chris@4 657
Chris@4 658 <para>The ideas embodied in
Chris@4 659 <computeroutput>bzip2</computeroutput> are due to (at least) the
Chris@4 660 following people: Michael Burrows and David Wheeler (for the
Chris@4 661 block sorting transformation), David Wheeler (again, for the
Chris@4 662 Huffman coder), Peter Fenwick (for the structured coding model in
Chris@4 663 the original <computeroutput>bzip</computeroutput>, and many
Chris@4 664 refinements), and Alistair Moffat, Radford Neal and Ian Witten
Chris@4 665 (for the arithmetic coder in the original
Chris@4 666 <computeroutput>bzip</computeroutput>). I am much indebted for
Chris@4 667 their help, support and advice. See the manual in the source
Chris@4 668 distribution for pointers to sources of documentation. Christian
Chris@4 669 von Roques encouraged me to look for faster sorting algorithms,
Chris@4 670 so as to speed up compression. Bela Lubkin encouraged me to
Chris@4 671 improve the worst-case compression performance.
Chris@4 672 Donna Robinson XMLised the documentation.
Chris@4 673 Many people sent
Chris@4 674 patches, helped with portability problems, lent machines, gave
Chris@4 675 advice and were generally helpful.</para>
Chris@4 676
Chris@4 677 </sect1>
Chris@4 678
Chris@4 679 </chapter>
Chris@4 680
Chris@4 681
Chris@4 682
Chris@4 683 <chapter id="libprog" xreflabel="Programming with libbzip2">
Chris@4 684 <title>
Chris@4 685 Programming with <computeroutput>libbzip2</computeroutput>
Chris@4 686 </title>
Chris@4 687
Chris@4 688 <para>This chapter describes the programming interface to
Chris@4 689 <computeroutput>libbzip2</computeroutput>.</para>
Chris@4 690
Chris@4 691 <para>For general background information, particularly about
Chris@4 692 memory use and performance aspects, you'd be well advised to read
Chris@4 693 <xref linkend="using"/> as well.</para>
Chris@4 694
Chris@4 695
Chris@4 696 <sect1 id="top-level" xreflabel="Top-level structure">
Chris@4 697 <title>Top-level structure</title>
Chris@4 698
Chris@4 699 <para><computeroutput>libbzip2</computeroutput> is a flexible
Chris@4 700 library for compressing and decompressing data in the
Chris@4 701 <computeroutput>bzip2</computeroutput> data format. Although
Chris@4 702 packaged as a single entity, it helps to regard the library as
Chris@4 703 three separate parts: the low-level interface, the high-level
Chris@4 704 interface, and some utility functions.</para>
Chris@4 705
Chris@4 706 <para>The structure of
Chris@4 707 <computeroutput>libbzip2</computeroutput>'s interfaces is similar
Chris@4 708 to that of Jean-loup Gailly's and Mark Adler's excellent
Chris@4 709 <computeroutput>zlib</computeroutput> library.</para>
Chris@4 710
Chris@4 711 <para>All externally visible symbols have names beginning
Chris@4 712 <computeroutput>BZ2_</computeroutput>. This is new in version
Chris@4 713 1.0. The intention is to minimise pollution of the namespaces of
Chris@4 714 library clients.</para>
Chris@4 715
Chris@4 716 <para>To use any part of the library, you need to
Chris@4 717 <computeroutput>#include &lt;bzlib.h&gt;</computeroutput>
Chris@4 718 into your sources.</para>
Chris@4 719
Chris@4 720
Chris@4 721
Chris@4 722 <sect2 id="ll-summary" xreflabel="Low-level summary">
Chris@4 723 <title>Low-level summary</title>
Chris@4 724
Chris@4 725 <para>This interface provides services for compressing and
Chris@4 726 decompressing data in memory. There's no provision for dealing
Chris@4 727 with files, streams or any other I/O mechanisms, just straight
Chris@4 728 memory-to-memory work. In fact, this part of the library can be
Chris@4 729 compiled without inclusion of
Chris@4 730 <computeroutput>stdio.h</computeroutput>, which may be helpful
Chris@4 731 for embedded applications.</para>
Chris@4 732
Chris@4 733 <para>The low-level part of the library has no global variables
Chris@4 734 and is therefore thread-safe.</para>
Chris@4 735
Chris@4 736 <para>Six routines make up the low level interface:
Chris@4 737 <computeroutput>BZ2_bzCompressInit</computeroutput>,
Chris@4 738 <computeroutput>BZ2_bzCompress</computeroutput>, and
Chris@4 739 <computeroutput>BZ2_bzCompressEnd</computeroutput> for
Chris@4 740 compression, and a corresponding trio
Chris@4 741 <computeroutput>BZ2_bzDecompressInit</computeroutput>,
Chris@4 742 <computeroutput>BZ2_bzDecompress</computeroutput> and
Chris@4 743 <computeroutput>BZ2_bzDecompressEnd</computeroutput> for
Chris@4 744 decompression. The <computeroutput>*Init</computeroutput>
Chris@4 745 functions allocate memory for compression/decompression and do
Chris@4 746 other initialisations, whilst the
Chris@4 747 <computeroutput>*End</computeroutput> functions close down
Chris@4 748 operations and release memory.</para>
Chris@4 749
Chris@4 750 <para>The real work is done by
Chris@4 751 <computeroutput>BZ2_bzCompress</computeroutput> and
Chris@4 752 <computeroutput>BZ2_bzDecompress</computeroutput>. These
Chris@4 753 compress and decompress data from a user-supplied input buffer to
Chris@4 754 a user-supplied output buffer. These buffers can be any size;
Chris@4 755 arbitrary quantities of data are handled by making repeated calls
Chris@4 756 to these functions. This is a flexible mechanism allowing a
Chris@4 757 consumer-pull style of activity, or producer-push, or a mixture
Chris@4 758 of both.</para>
Chris@4 759
Chris@4 760 </sect2>
Chris@4 761
Chris@4 762
Chris@4 763 <sect2 id="hl-summary" xreflabel="High-level summary">
Chris@4 764 <title>High-level summary</title>
Chris@4 765
Chris@4 766 <para>This interface provides some handy wrappers around the
Chris@4 767 low-level interface to facilitate reading and writing
Chris@4 768 <computeroutput>bzip2</computeroutput> format files
Chris@4 769 (<computeroutput>.bz2</computeroutput> files). The routines
Chris@4 770 provide hooks to facilitate reading files in which the
Chris@4 771 <computeroutput>bzip2</computeroutput> data stream is embedded
Chris@4 772 within some larger-scale file structure, or where there are
Chris@4 773 multiple <computeroutput>bzip2</computeroutput> data streams
Chris@4 774 concatenated end-to-end.</para>
Chris@4 775
Chris@4 776 <para>For reading files,
Chris@4 777 <computeroutput>BZ2_bzReadOpen</computeroutput>,
Chris@4 778 <computeroutput>BZ2_bzRead</computeroutput>,
Chris@4 779 <computeroutput>BZ2_bzReadClose</computeroutput> and
Chris@4 780 <computeroutput>BZ2_bzReadGetUnused</computeroutput> are
Chris@4 781 supplied. For writing files,
Chris@4 782 <computeroutput>BZ2_bzWriteOpen</computeroutput>,
Chris@4 783 <computeroutput>BZ2_bzWrite</computeroutput> and
Chris@4 784 <computeroutput>BZ2_bzWriteFinish</computeroutput> are
Chris@4 785 available.</para>
Chris@4 786
Chris@4 787 <para>As with the low-level library, no global variables are used
Chris@4 788 so the library is per se thread-safe. However, if I/O errors
Chris@4 789 occur whilst reading or writing the underlying compressed files,
Chris@4 790 you may have to consult <computeroutput>errno</computeroutput> to
Chris@4 791 determine the cause of the error. In that case, you'd need a C
Chris@4 792 library which correctly supports
Chris@4 793 <computeroutput>errno</computeroutput> in a multithreaded
Chris@4 794 environment.</para>
Chris@4 795
Chris@4 796 <para>To make the library a little simpler and more portable,
Chris@4 797 <computeroutput>BZ2_bzReadOpen</computeroutput> and
Chris@4 798 <computeroutput>BZ2_bzWriteOpen</computeroutput> require you to
Chris@4 799 pass them file handles (<computeroutput>FILE*</computeroutput>s)
Chris@4 800 which have previously been opened for reading or writing
Chris@4 801 respectively. That avoids portability problems associated with
Chris@4 802 file operations and file attributes, whilst not being much of an
Chris@4 803 imposition on the programmer.</para>
Chris@4 804
Chris@4 805 </sect2>
Chris@4 806
Chris@4 807
Chris@4 808 <sect2 id="util-fns-summary" xreflabel="Utility functions summary">
Chris@4 809 <title>Utility functions summary</title>
Chris@4 810
Chris@4 811 <para>For very simple needs,
Chris@4 812 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and
Chris@4 813 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> are
Chris@4 814 provided. These compress data in memory from one buffer to
Chris@4 815 another buffer in a single function call. You should assess
Chris@4 816 whether these functions fulfill your memory-to-memory
Chris@4 817 compression/decompression requirements before investing effort in
Chris@4 818 understanding the more general but more complex low-level
Chris@4 819 interface.</para>
Chris@4 820
Chris@4 821 <para>Yoshioka Tsuneo
Chris@4 822 (<computeroutput>tsuneo@rr.iij4u.or.jp</computeroutput>) has
Chris@4 823 contributed some functions to give better
Chris@4 824 <computeroutput>zlib</computeroutput> compatibility. These
Chris@4 825 functions are <computeroutput>BZ2_bzopen</computeroutput>,
Chris@4 826 <computeroutput>BZ2_bzread</computeroutput>,
Chris@4 827 <computeroutput>BZ2_bzwrite</computeroutput>,
Chris@4 828 <computeroutput>BZ2_bzflush</computeroutput>,
Chris@4 829 <computeroutput>BZ2_bzclose</computeroutput>,
Chris@4 830 <computeroutput>BZ2_bzerror</computeroutput> and
Chris@4 831 <computeroutput>BZ2_bzlibVersion</computeroutput>. You may find
Chris@4 832 these functions more convenient for simple file reading and
Chris@4 833 writing than those in the high-level interface. These functions
Chris@4 834 are not (yet) officially part of the library, and are minimally
Chris@4 835 documented here. If they break, you get to keep all the pieces.
Chris@4 836 I hope to document them properly when time permits.</para>
Chris@4 837
Chris@4 838 <para>Yoshioka also contributed modifications to allow the
Chris@4 839 library to be built as a Windows DLL.</para>
Chris@4 840
Chris@4 841 </sect2>
Chris@4 842
Chris@4 843 </sect1>
Chris@4 844
Chris@4 845
Chris@4 846 <sect1 id="err-handling" xreflabel="Error handling">
Chris@4 847 <title>Error handling</title>
Chris@4 848
Chris@4 849 <para>The library is designed to recover cleanly in all
Chris@4 850 situations, including the worst-case situation of decompressing
Chris@4 851 random data. I'm not 100% sure that it can always do this, so
Chris@4 852 you might want to add a signal handler to catch segmentation
Chris@4 853 violations during decompression if you are feeling especially
Chris@4 854 paranoid. I would be interested in hearing more about the
Chris@4 855 robustness of the library to corrupted compressed data.</para>
Chris@4 856
Chris@4 857 <para>Version 1.0.3 is more robust in this respect than any
Chris@4 858 previous version. Investigations with Valgrind (a tool for detecting
Chris@4 859 problems with memory management) indicate
Chris@4 860 that, at least for the few files I tested, all single-bit errors
Chris@4 861 in the compressed data are caught properly, with no
Chris@4 862 segmentation faults, no uses of uninitialised data, no out of
Chris@4 863 range reads or writes, and no infinite looping in the decompressor.
Chris@4 864 So it's certainly pretty robust, although
Chris@4 865 I wouldn't claim it to be totally bombproof.</para>
Chris@4 866
Chris@4 867 <para>The file <computeroutput>bzlib.h</computeroutput> contains
Chris@4 868 all definitions needed to use the library. In particular, you
Chris@4 869 should definitely not include
Chris@4 870 <computeroutput>bzlib_private.h</computeroutput>.</para>
Chris@4 871
Chris@4 872 <para>In <computeroutput>bzlib.h</computeroutput>, the various
Chris@4 873 return values are defined. The following list is not intended as
Chris@4 874 an exhaustive description of the circumstances in which a given
Chris@4 875 value may be returned -- those descriptions are given later.
Chris@4 876 Rather, it is intended to convey the rough meaning of each return
Chris@4 877 value. The first five values are normal and are not intended to
Chris@4 878 denote an error situation.</para>
Chris@4 879
Chris@4 880 <variablelist>
Chris@4 881
Chris@4 882 <varlistentry>
Chris@4 883 <term><computeroutput>BZ_OK</computeroutput></term>
Chris@4 884 <listitem><para>The requested action was completed
Chris@4 885 successfully.</para></listitem>
Chris@4 886 </varlistentry>
Chris@4 887
Chris@4 888 <varlistentry>
Chris@4 889 <term><computeroutput>BZ_RUN_OK, BZ_FLUSH_OK,
Chris@4 890 BZ_FINISH_OK</computeroutput></term>
Chris@4 891 <listitem><para>In
Chris@4 892 <computeroutput>BZ2_bzCompress</computeroutput>, the requested
Chris@4 893 flush/finish/nothing-special action was completed
Chris@4 894 successfully.</para></listitem>
Chris@4 895 </varlistentry>
Chris@4 896
Chris@4 897 <varlistentry>
Chris@4 898 <term><computeroutput>BZ_STREAM_END</computeroutput></term>
Chris@4 899 <listitem><para>Compression of data was completed, or the
Chris@4 900 logical stream end was detected during
Chris@4 901 decompression.</para></listitem>
Chris@4 902 </varlistentry>
Chris@4 903
Chris@4 904 </variablelist>
Chris@4 905
Chris@4 906 <para>The following return values indicate an error of some
Chris@4 907 kind.</para>
Chris@4 908
Chris@4 909 <variablelist>
Chris@4 910
Chris@4 911 <varlistentry>
Chris@4 912 <term><computeroutput>BZ_CONFIG_ERROR</computeroutput></term>
Chris@4 913 <listitem><para>Indicates that the library has been improperly
Chris@4 914 compiled on your platform -- a major configuration error.
Chris@4 915 Specifically, it means that
Chris@4 916 <computeroutput>sizeof(char)</computeroutput>,
Chris@4 917 <computeroutput>sizeof(short)</computeroutput> and
Chris@4 918 <computeroutput>sizeof(int)</computeroutput> are not 1, 2 and
Chris@4 919 4 respectively, as they should be. Note that the library
Chris@4 920 should still work properly on 64-bit platforms which follow
Chris@4 921 the LP64 programming model -- that is, where
Chris@4 922 <computeroutput>sizeof(long)</computeroutput> and
Chris@4 923 <computeroutput>sizeof(void*)</computeroutput> are 8. Under
Chris@4 924 LP64, <computeroutput>sizeof(int)</computeroutput> is still 4,
Chris@4 925 so <computeroutput>libbzip2</computeroutput>, which doesn't
Chris@4 926 use the <computeroutput>long</computeroutput> type, is
Chris@4 927 OK.</para></listitem>
Chris@4 928 </varlistentry>
Chris@4 929
Chris@4 930 <varlistentry>
Chris@4 931 <term><computeroutput>BZ_SEQUENCE_ERROR</computeroutput></term>
Chris@4 932 <listitem><para>When using the library, it is important to call
Chris@4 933 the functions in the correct sequence and with data structures
Chris@4 934 (buffers etc) in the correct states.
Chris@4 935 <computeroutput>libbzip2</computeroutput> checks as much as it
Chris@4 936 can to ensure this is happening, and returns
Chris@4 937 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput> if not.
Chris@4 938 Code which complies precisely with the function semantics, as
Chris@4 939 detailed below, should never receive this value; such an event
Chris@4 940 denotes buggy code which you should
Chris@4 941 investigate.</para></listitem>
Chris@4 942 </varlistentry>
Chris@4 943
Chris@4 944 <varlistentry>
Chris@4 945 <term><computeroutput>BZ_PARAM_ERROR</computeroutput></term>
Chris@4 946 <listitem><para>Returned when a parameter to a function call is
Chris@4 947 out of range or otherwise manifestly incorrect. As with
Chris@4 948 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>, this
Chris@4 949 denotes a bug in the client code. The distinction between
Chris@4 950 <computeroutput>BZ_PARAM_ERROR</computeroutput> and
Chris@4 951 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput> is a bit
Chris@4 952 hazy, but still worth making.</para></listitem>
Chris@4 953 </varlistentry>
Chris@4 954
Chris@4 955 <varlistentry>
Chris@4 956 <term><computeroutput>BZ_MEM_ERROR</computeroutput></term>
Chris@4 957 <listitem><para>Returned when a request to allocate memory
Chris@4 958 failed. Note that the quantity of memory needed to decompress
Chris@4 959 a stream cannot be determined until the stream's header has
Chris@4 960 been read. So
Chris@4 961 <computeroutput>BZ2_bzDecompress</computeroutput> and
Chris@4 962 <computeroutput>BZ2_bzRead</computeroutput> may return
Chris@4 963 <computeroutput>BZ_MEM_ERROR</computeroutput> even though some
Chris@4 964 of the compressed data has been read. The same is not true
Chris@4 965 for compression; once
Chris@4 966 <computeroutput>BZ2_bzCompressInit</computeroutput> or
Chris@4 967 <computeroutput>BZ2_bzWriteOpen</computeroutput> have
Chris@4 968 successfully completed,
Chris@4 969 <computeroutput>BZ_MEM_ERROR</computeroutput> cannot
Chris@4 970 occur.</para></listitem>
Chris@4 971 </varlistentry>
Chris@4 972
Chris@4 973 <varlistentry>
Chris@4 974 <term><computeroutput>BZ_DATA_ERROR</computeroutput></term>
Chris@4 975 <listitem><para>Returned when a data integrity error is
Chris@4 976 detected during decompression. Most importantly, this means
Chris@4 977 when stored and computed CRCs for the data do not match. This
Chris@4 978 value is also returned upon detection of any other anomaly in
Chris@4 979 the compressed data.</para></listitem>
Chris@4 980 </varlistentry>
Chris@4 981
Chris@4 982 <varlistentry>
Chris@4 983 <term><computeroutput>BZ_DATA_ERROR_MAGIC</computeroutput></term>
Chris@4 984 <listitem><para>As a special case of
Chris@4 985 <computeroutput>BZ_DATA_ERROR</computeroutput>, it is
Chris@4 986 sometimes useful to know when the compressed stream does not
Chris@4 987 start with the correct magic bytes (<computeroutput>'B' 'Z'
Chris@4 988 'h'</computeroutput>).</para></listitem>
Chris@4 989 </varlistentry>
Chris@4 990
Chris@4 991 <varlistentry>
Chris@4 992 <term><computeroutput>BZ_IO_ERROR</computeroutput></term>
Chris@4 993 <listitem><para>Returned by
Chris@4 994 <computeroutput>BZ2_bzRead</computeroutput> and
Chris@4 995 <computeroutput>BZ2_bzWrite</computeroutput> when there is an
Chris@4 996 error reading or writing in the compressed file, and by
Chris@4 997 <computeroutput>BZ2_bzReadOpen</computeroutput> and
Chris@4 998 <computeroutput>BZ2_bzWriteOpen</computeroutput> for attempts
Chris@4 999 to use a file for which the error indicator (viz,
Chris@4 1000 <computeroutput>ferror(f)</computeroutput>) is set. On
Chris@4 1001 receipt of <computeroutput>BZ_IO_ERROR</computeroutput>, the
Chris@4 1002 caller should consult <computeroutput>errno</computeroutput>
Chris@4 1003 and/or <computeroutput>perror</computeroutput> to acquire
Chris@4 1004 operating-system specific information about the
Chris@4 1005 problem.</para></listitem>
Chris@4 1006 </varlistentry>
Chris@4 1007
Chris@4 1008 <varlistentry>
Chris@4 1009 <term><computeroutput>BZ_UNEXPECTED_EOF</computeroutput></term>
Chris@4 1010 <listitem><para>Returned by
Chris@4 1011 <computeroutput>BZ2_bzRead</computeroutput> when the
Chris@4 1012 compressed file finishes before the logical end of stream is
Chris@4 1013 detected.</para></listitem>
Chris@4 1014 </varlistentry>
Chris@4 1015
Chris@4 1016 <varlistentry>
Chris@4 1017 <term><computeroutput>BZ_OUTBUFF_FULL</computeroutput></term>
Chris@4 1018 <listitem><para>Returned by
Chris@4 1019 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and
Chris@4 1020 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> to
Chris@4 1021 indicate that the output data will not fit into the output
Chris@4 1022 buffer provided.</para></listitem>
Chris@4 1023 </varlistentry>
Chris@4 1024
Chris@4 1025 </variablelist>
Chris@4 1026
Chris@4 1027 </sect1>
Chris@4 1028
Chris@4 1029
Chris@4 1030
Chris@4 1031 <sect1 id="low-level" xreflabel="Low-level interface">
Chris@4 1032 <title>Low-level interface</title>
Chris@4 1033
Chris@4 1034
Chris@4 1035 <sect2 id="bzcompress-init" xreflabel="BZ2_bzCompressInit">
Chris@4 1036 <title>BZ2_bzCompressInit</title>
Chris@4 1037
Chris@4 1038 <programlisting>
Chris@4 1039 typedef struct {
Chris@4 1040 char *next_in;
Chris@4 1041 unsigned int avail_in;
Chris@4 1042 unsigned int total_in_lo32;
Chris@4 1043 unsigned int total_in_hi32;
Chris@4 1044
Chris@4 1045 char *next_out;
Chris@4 1046 unsigned int avail_out;
Chris@4 1047 unsigned int total_out_lo32;
Chris@4 1048 unsigned int total_out_hi32;
Chris@4 1049
Chris@4 1050 void *state;
Chris@4 1051
Chris@4 1052 void *(*bzalloc)(void *,int,int);
Chris@4 1053 void (*bzfree)(void *,void *);
Chris@4 1054 void *opaque;
Chris@4 1055 } bz_stream;
Chris@4 1056
Chris@4 1057 int BZ2_bzCompressInit ( bz_stream *strm,
Chris@4 1058 int blockSize100k,
Chris@4 1059 int verbosity,
Chris@4 1060 int workFactor );
Chris@4 1061 </programlisting>
Chris@4 1062
Chris@4 1063 <para>Prepares for compression. The
Chris@4 1064 <computeroutput>bz_stream</computeroutput> structure holds all
Chris@4 1065 data pertaining to the compression activity. A
Chris@4 1066 <computeroutput>bz_stream</computeroutput> structure should be
Chris@4 1067 allocated and initialised prior to the call. The fields of
Chris@4 1068 <computeroutput>bz_stream</computeroutput> comprise the entirety
Chris@4 1069 of the user-visible data. <computeroutput>state</computeroutput>
Chris@4 1070 is a pointer to the private data structures required for
Chris@4 1071 compression.</para>
Chris@4 1072
Chris@4 1073 <para>Custom memory allocators are supported, via fields
Chris@4 1074 <computeroutput>bzalloc</computeroutput>,
Chris@4 1075 <computeroutput>bzfree</computeroutput>, and
Chris@4 1076 <computeroutput>opaque</computeroutput>. The value
Chris@4 1077 <computeroutput>opaque</computeroutput> is passed as the first
Chris@4 1078 argument to all calls to <computeroutput>bzalloc</computeroutput>
Chris@4 1079 and <computeroutput>bzfree</computeroutput>, but is otherwise
Chris@4 1080 ignored by the library. The call <computeroutput>bzalloc (
Chris@4 1081 opaque, n, m )</computeroutput> is expected to return a pointer
Chris@4 1082 <computeroutput>p</computeroutput> to <computeroutput>n *
Chris@4 1083 m</computeroutput> bytes of memory, and <computeroutput>bzfree (
Chris@4 1084 opaque, p )</computeroutput> should free that memory.</para>
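
<para>As an illustration, a pair of allocator callbacks with the
required signatures might look like the sketch below. The
<computeroutput>alloc_stats</computeroutput> record and the
function names are assumptions made for this example; the only
things taken from the library are the two function-pointer types
shown in <computeroutput>bz_stream</computeroutput>.</para>

<programlisting><![CDATA[
```c
#include <stdlib.h>

/* Hypothetical bookkeeping record, reached through the opaque pointer. */
typedef struct {
    long live_blocks;       /* allocations not yet freed */
    long bytes_requested;   /* running total of n * m */
} alloc_stats;

/* Matches the bzalloc field: void *(*)(void *, int, int).
   Returns n * m bytes, or NULL on failure. */
void *counting_alloc(void *opaque, int n, int m)
{
    alloc_stats *st = (alloc_stats *)opaque;
    void *p;

    if (n < 0 || m < 0) return NULL;
    p = malloc((size_t)n * (size_t)m);
    if (p != NULL && st != NULL) {
        st->live_blocks++;
        st->bytes_requested += (long)n * (long)m;
    }
    return p;
}

/* Matches the bzfree field: void (*)(void *, void *). */
void counting_free(void *opaque, void *p)
{
    alloc_stats *st = (alloc_stats *)opaque;

    if (p == NULL) return;
    if (st != NULL) st->live_blocks--;
    free(p);
}
```
]]></programlisting>

<para>To use such callbacks, one would store
<computeroutput>counting_alloc</computeroutput>,
<computeroutput>counting_free</computeroutput> and a pointer to an
<computeroutput>alloc_stats</computeroutput> record in the
<computeroutput>bzalloc</computeroutput>,
<computeroutput>bzfree</computeroutput> and
<computeroutput>opaque</computeroutput> fields before calling
<computeroutput>BZ2_bzCompressInit</computeroutput>.</para>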
Chris@4 1085
Chris@4 1086 <para>If you don't want to use a custom memory allocator, set
Chris@4 1087 <computeroutput>bzalloc</computeroutput>,
Chris@4 1088 <computeroutput>bzfree</computeroutput> and
Chris@4 1089 <computeroutput>opaque</computeroutput> to
Chris@4 1090 <computeroutput>NULL</computeroutput>, and the library will then
Chris@4 1091 use the standard <computeroutput>malloc</computeroutput> /
Chris@4 1092 <computeroutput>free</computeroutput> routines.</para>
Chris@4 1093
Chris@4 1094 <para>Before calling
Chris@4 1095 <computeroutput>BZ2_bzCompressInit</computeroutput>, fields
Chris@4 1096 <computeroutput>bzalloc</computeroutput>,
Chris@4 1097 <computeroutput>bzfree</computeroutput> and
Chris@4 1098 <computeroutput>opaque</computeroutput> should be filled
Chris@4 1099 appropriately, as just described. Upon return, the internal
Chris@4 1100 state will have been allocated and initialised, and
Chris@4 1101 <computeroutput>total_in_lo32</computeroutput>,
Chris@4 1102 <computeroutput>total_in_hi32</computeroutput>,
Chris@4 1103 <computeroutput>total_out_lo32</computeroutput> and
Chris@4 1104 <computeroutput>total_out_hi32</computeroutput> will have been
Chris@4 1105 set to zero. These four fields are used by the library to inform
Chris@4 1106 the caller of the total amount of data passed into and out of the
Chris@4 1107 library, respectively. You should not try to change them. As of
Chris@4 1108 version 1.0, 64-bit counts are maintained, even on 32-bit
Chris@4 1109 platforms, using the <computeroutput>_hi32</computeroutput>
Chris@4 1110 fields to store the upper 32 bits of the count. So, for example,
Chris@4 1111 the total amount of data in is <computeroutput>(total_in_hi32
Chris@4 1112 &#60;&#60; 32) + total_in_lo32</computeroutput>.</para>
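
<para>In C, the halves need to be widened before they are
combined, because shifting a 32-bit
<computeroutput>unsigned int</computeroutput> left by 32 bits is
undefined. A small sketch, assuming a compiler that provides
<computeroutput>stdint.h</computeroutput> (the helper name
<computeroutput>bz_total64</computeroutput> is hypothetical):</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

/* Hypothetical helper: combine the two 32-bit halves of a libbzip2 byte
   counter (for example total_in_hi32 and total_in_lo32) into one 64-bit
   value.  The cast to uint64_t happens before the shift. */
uint64_t bz_total64(unsigned int hi32, unsigned int lo32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```
]]></programlisting>

<para>For a stream <computeroutput>strm</computeroutput>, the total
input so far would then be
<computeroutput>bz_total64(strm.total_in_hi32,
strm.total_in_lo32)</computeroutput>.</para>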
Chris@4 1113
Chris@4 1114 <para>Parameter <computeroutput>blockSize100k</computeroutput>
Chris@4 1115 specifies the block size to be used for compression. It should
Chris@4 1116 be a value between 1 and 9 inclusive, and the actual block size
Chris@4 1117 used is 100000 x this figure. 9 gives the best compression but
Chris@4 1118 takes most memory.</para>
Chris@4 1119
Chris@4 1120 <para>Parameter <computeroutput>verbosity</computeroutput> should
Chris@4 1121 be set to a number between 0 and 4 inclusive. 0 is silent, and
Chris@4 1122 greater numbers give increasingly verbose monitoring/debugging
Chris@4 1123 output. If the library has been compiled with
Chris@4 1124 <computeroutput>-DBZ_NO_STDIO</computeroutput>, no such output
Chris@4 1125 will appear for any verbosity setting.</para>
Chris@4 1126
Chris@4 1127 <para>Parameter <computeroutput>workFactor</computeroutput>
Chris@4 1128 controls how the compression phase behaves when presented with
Chris@4 1129 worst case, highly repetitive, input data. If compression runs
Chris@4 1130 into difficulties caused by repetitive data, the library switches
Chris@4 1131 from the standard sorting algorithm to a fallback algorithm. The
Chris@4 1132 fallback is slower than the standard algorithm by perhaps a
Chris@4 1133 factor of three, but always behaves reasonably, no matter how bad
Chris@4 1134 the input.</para>
Chris@4 1135
Chris@4 1136 <para>Lower values of <computeroutput>workFactor</computeroutput>
Chris@4 1137 reduce the amount of effort the standard algorithm will expend
Chris@4 1138 before resorting to the fallback. You should set this parameter
Chris@4 1139 carefully; too low, and many inputs will be handled by the
Chris@4 1140 fallback algorithm and so compress rather slowly; too high, and
Chris@4 1141 your average-to-worst case compression times can become very
Chris@4 1142 large. The default value of 30 gives reasonable behaviour over a
Chris@4 1143 wide range of circumstances.</para>
Chris@4 1144
Chris@4 1145 <para>Allowable values range from 0 to 250 inclusive. 0 is a
Chris@4 1146 special case, equivalent to using the default value of 30.</para>
Chris@4 1147
Chris@4 1148 <para>Note that the compressed output generated is the same
Chris@4 1149 regardless of whether or not the fallback algorithm is
Chris@4 1150 used.</para>
Chris@4 1151
Chris@4 1152 <para>Be aware also that this parameter may disappear entirely in
Chris@4 1153 future versions of the library. In principle it should be
Chris@4 1154 possible to devise a good way to automatically choose which
Chris@4 1155 algorithm to use. Such a mechanism would render the parameter
Chris@4 1156 obsolete.</para>
Chris@4 1157
Chris@4 1158 <para>Possible return values:</para>
Chris@4 1159
Chris@4 1160 <programlisting>
Chris@4 1161 BZ_CONFIG_ERROR
Chris@4 1162 if the library has been mis-compiled
Chris@4 1163 BZ_PARAM_ERROR
Chris@4 1164 if strm is NULL
Chris@4 1165 or blockSize &lt; 1 or blockSize &gt; 9
Chris@4 1166 or verbosity &lt; 0 or verbosity &gt; 4
Chris@4 1167 or workFactor &lt; 0 or workFactor &gt; 250
Chris@4 1168 BZ_MEM_ERROR
Chris@4 1169 if not enough memory is available
Chris@4 1170 BZ_OK
Chris@4 1171 otherwise
Chris@4 1172 </programlisting>
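
<para>The <computeroutput>BZ_PARAM_ERROR</computeroutput>
conditions above can also be checked on the caller's side before
the library is invoked, for instance when the values come from user
input. The following helpers are hypothetical and only restate the
documented ranges; they are not part of
<computeroutput>libbzip2</computeroutput>.</para>

<programlisting><![CDATA[
```c
/* Hypothetical helper: returns 1 if the arguments fall within the ranges
   documented for BZ2_bzCompressInit, and 0 otherwise. */
int compress_params_ok(int blockSize100k, int verbosity, int workFactor)
{
    if (blockSize100k < 1 || blockSize100k > 9) return 0;
    if (verbosity < 0 || verbosity > 4) return 0;
    if (workFactor < 0 || workFactor > 250) return 0;
    return 1;
}

/* Hypothetical helper: workFactor 0 is documented as equivalent to the
   default value of 30; any other value is used as given. */
int effective_work_factor(int workFactor)
{
    return workFactor == 0 ? 30 : workFactor;
}
```
]]></programlisting>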
Chris@4 1173
Chris@4 1174 <para>Allowable next actions:</para>
Chris@4 1175
Chris@4 1176 <programlisting>
Chris@4 1177 BZ2_bzCompress
Chris@4 1178 if BZ_OK is returned
Chris@4 1179 no specific action needed in case of error
Chris@4 1180 </programlisting>
Chris@4 1181
Chris@4 1182 </sect2>
Chris@4 1183
Chris@4 1184
Chris@4 1185 <sect2 id="bzCompress" xreflabel="BZ2_bzCompress">
Chris@4 1186 <title>BZ2_bzCompress</title>
Chris@4 1187
Chris@4 1188 <programlisting>
Chris@4 1189 int BZ2_bzCompress ( bz_stream *strm, int action );
Chris@4 1190 </programlisting>
Chris@4 1191
Chris@4 1192 <para>Provides more input and/or output buffer space for the
Chris@4 1193 library. The caller maintains input and output buffers, and
Chris@4 1194 calls <computeroutput>BZ2_bzCompress</computeroutput> to transfer
Chris@4 1195 data between them.</para>
Chris@4 1196
Chris@4 1197 <para>Before each call to
Chris@4 1198 <computeroutput>BZ2_bzCompress</computeroutput>,
Chris@4 1199 <computeroutput>next_in</computeroutput> should point at the data
Chris@4 1200 to be compressed, and <computeroutput>avail_in</computeroutput>
Chris@4 1201 should indicate how many bytes the library may read.
Chris@4 1202 <computeroutput>BZ2_bzCompress</computeroutput> updates
Chris@4 1203 <computeroutput>next_in</computeroutput>,
Chris@4 1204 <computeroutput>avail_in</computeroutput> and
Chris@4 1205 <computeroutput>total_in</computeroutput> to reflect the number
Chris@4 1206 of bytes it has read.</para>
Chris@4 1207
Chris@4 1208 <para>Similarly, <computeroutput>next_out</computeroutput> should
Chris@4 1209 point to a buffer in which the compressed data is to be placed,
Chris@4 1210 with <computeroutput>avail_out</computeroutput> indicating how
Chris@4 1211 much output space is available.
Chris@4 1212 <computeroutput>BZ2_bzCompress</computeroutput> updates
Chris@4 1213 <computeroutput>next_out</computeroutput>,
Chris@4 1214 <computeroutput>avail_out</computeroutput> and
Chris@4 1215 <computeroutput>total_out</computeroutput> to reflect the number
Chris@4 1216 of bytes output.</para>
Chris@4 1217
Chris@4 1218 <para>You may provide and remove as little or as much data as you
Chris@4 1219 like on each call of
Chris@4 1220 <computeroutput>BZ2_bzCompress</computeroutput>. In the limit,
Chris@4 1221 it is acceptable to supply and remove data one byte at a time,
Chris@4 1222 although this would be terribly inefficient. You should always
Chris@4 1223 ensure that at least one byte of output space is available at
Chris@4 1224 each call.</para>
Chris@4 1225
Chris@4 1226 <para>A second purpose of
Chris@4 1227 <computeroutput>BZ2_bzCompress</computeroutput> is to request a
Chris@4 1228 change of mode of the compressed stream.</para>
Chris@4 1229
Chris@4 1230 <para>Conceptually, a compressed stream can be in one of four
Chris@4 1231 states: IDLE, RUNNING, FLUSHING and FINISHING. Before
Chris@4 1232 initialisation
Chris@4 1233 (<computeroutput>BZ2_bzCompressInit</computeroutput>) and after
Chris@4 1234 termination (<computeroutput>BZ2_bzCompressEnd</computeroutput>),
Chris@4 1235 a stream is regarded as IDLE.</para>
Chris@4 1236
Chris@4 1237 <para>Upon initialisation
Chris@4 1238 (<computeroutput>BZ2_bzCompressInit</computeroutput>), the stream
Chris@4 1239 is placed in the RUNNING state. Subsequent calls to
Chris@4 1240 <computeroutput>BZ2_bzCompress</computeroutput> should pass
Chris@4 1241 <computeroutput>BZ_RUN</computeroutput> as the requested action;
Chris@4 1242 other actions are illegal and will result in
Chris@4 1243 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>.</para>
Chris@4 1244
Chris@4 1245 <para>At some point, the calling program will have provided all
Chris@4 1246 the input data it wants to. It will then want to finish up -- in
Chris@4 1247 effect, asking the library to process any data it might have
Chris@4 1248 buffered internally. In this state,
Chris@4 1249 <computeroutput>BZ2_bzCompress</computeroutput> will no longer
Chris@4 1250 attempt to read data from
Chris@4 1251 <computeroutput>next_in</computeroutput>, but it will want to
Chris@4 1252 write data to <computeroutput>next_out</computeroutput>. Because
Chris@4 1253 the output buffer supplied by the user can be arbitrarily small,
Chris@4 1254 the finishing-up operation cannot necessarily be done with a
Chris@4 1255 single call of
Chris@4 1256 <computeroutput>BZ2_bzCompress</computeroutput>.</para>
Chris@4 1257
Chris@4 1258 <para>Instead, the calling program passes
Chris@4 1259 <computeroutput>BZ_FINISH</computeroutput> as an action to
Chris@4 1260 <computeroutput>BZ2_bzCompress</computeroutput>. This changes
Chris@4 1261 the stream's state to FINISHING. Any remaining input (ie,
Chris@4 1262 <computeroutput>next_in[0 .. avail_in-1]</computeroutput>) is
Chris@4 1263 compressed and transferred to the output buffer. To do this,
Chris@4 1264 <computeroutput>BZ2_bzCompress</computeroutput> must be called
Chris@4 1265 repeatedly until all the output has been consumed. At that
Chris@4 1266 point, <computeroutput>BZ2_bzCompress</computeroutput> returns
Chris@4 1267 <computeroutput>BZ_STREAM_END</computeroutput>, and the stream's
Chris@4 1268 state is set back to IDLE.
Chris@4 1269 <computeroutput>BZ2_bzCompressEnd</computeroutput> should then be
Chris@4 1270 called.</para>
Chris@4 1271
Chris@4 1272 <para>Just to make sure the calling program does not cheat, the
Chris@4 1273 library makes a note of <computeroutput>avail_in</computeroutput>
Chris@4 1274 at the time of the first call to
Chris@4 1275 <computeroutput>BZ2_bzCompress</computeroutput> which has
Chris@4 1276 <computeroutput>BZ_FINISH</computeroutput> as an action (ie, at
Chris@4 1277 the time the program has announced its intention to not supply
Chris@4 1278 any more input). By comparing this value with that of
Chris@4 1279 <computeroutput>avail_in</computeroutput> over subsequent calls
Chris@4 1280 to <computeroutput>BZ2_bzCompress</computeroutput>, the library
Chris@4 1281 can detect any attempts to slip in more data to compress. Any
Chris@4 1282 calls for which this is detected will return
Chris@4 1283 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>. This
Chris@4 1284 indicates a programming mistake which should be corrected.</para>
Chris@4 1285
Chris@4 1286 <para>Instead of asking to finish, the calling program may ask
Chris@4 1287 <computeroutput>BZ2_bzCompress</computeroutput> to take all the
Chris@4 1288 remaining input, compress it and terminate the current
Chris@4 1289 (Burrows-Wheeler) compression block. This could be useful for
Chris@4 1290 error control purposes. The mechanism is analogous to that for
Chris@4 1291 finishing: call <computeroutput>BZ2_bzCompress</computeroutput>
Chris@4 1292 with an action of <computeroutput>BZ_FLUSH</computeroutput>,
Chris@4 1293 remove output data, and persist with the
Chris@4 1294 <computeroutput>BZ_FLUSH</computeroutput> action until the value
Chris@4 1295 <computeroutput>BZ_RUN_OK</computeroutput> is returned. As with
Chris@4 1296 finishing, <computeroutput>BZ2_bzCompress</computeroutput>
Chris@4 1297 detects any attempt to provide more input data once the flush has
Chris@4 1298 begun.</para>
Chris@4 1299
Chris@4 1300 <para>Once the flush is complete, the stream returns to the
Chris@4 1301 normal RUNNING state.</para>
Chris@4 1302
Chris@4 1303 <para>This all sounds pretty complex, but isn't really. Here's a
Chris@4 1304 table which shows which actions are allowable in each state, what
Chris@4 1305 action will be taken, what the next state is, and what the
Chris@4 1306 non-error return values are. Note that you can't explicitly ask
Chris@4 1307 what state the stream is in, but nor do you need to -- it can be
Chris@4 1308 inferred from the values returned by
Chris@4 1309 <computeroutput>BZ2_bzCompress</computeroutput>.</para>
Chris@4 1310
Chris@4 1311 <programlisting>
Chris@4 1312 IDLE/any
Chris@4 1313 Illegal. IDLE state only exists after BZ2_bzCompressEnd or
Chris@4 1314 before BZ2_bzCompressInit.
Chris@4 1315 Return value = BZ_SEQUENCE_ERROR
Chris@4 1316
Chris@4 1317 RUNNING/BZ_RUN
Chris@4 1318 Compress from next_in to next_out as much as possible.
Chris@4 1319 Next state = RUNNING
Chris@4 1320 Return value = BZ_RUN_OK
Chris@4 1321
Chris@4 1322 RUNNING/BZ_FLUSH
Chris@4 1323 Remember current value of avail_in. Compress from next_in
Chris@4 1324 to next_out as much as possible, but do not accept any more input.
Chris@4 1325 Next state = FLUSHING
Chris@4 1326 Return value = BZ_FLUSH_OK
Chris@4 1327
Chris@4 1328 RUNNING/BZ_FINISH
Chris@4 1329 Remember current value of avail_in. Compress from next_in
Chris@4 1330 to next_out as much as possible, but do not accept any more input.
Chris@4 1331 Next state = FINISHING
Chris@4 1332 Return value = BZ_FINISH_OK
Chris@4 1333
Chris@4 1334 FLUSHING/BZ_FLUSH
Chris@4 1335 Compress from next_in to next_out as much as possible,
Chris@4 1336 but do not accept any more input.
Chris@4 1337 If all the existing input has been used up and all compressed
Chris@4 1338 output has been removed
Chris@4 1339 Next state = RUNNING; Return value = BZ_RUN_OK
Chris@4 1340 else
Chris@4 1341 Next state = FLUSHING; Return value = BZ_FLUSH_OK
Chris@4 1342
Chris@4 1343 FLUSHING/other
Chris@4 1344 Illegal.
Chris@4 1345 Return value = BZ_SEQUENCE_ERROR
Chris@4 1346
Chris@4 1347 FINISHING/BZ_FINISH
Chris@4 1348 Compress from next_in to next_out as much as possible,
Chris@4 1349 but do not accept any more input.
Chris@4 1350 If all the existing input has been used up and all compressed
Chris@4 1351 output has been removed
Chris@4 1352 Next state = IDLE; Return value = BZ_STREAM_END
Chris@4 1353 else
Chris@4 1354 Next state = FINISHING; Return value = BZ_FINISH_OK
Chris@4 1355
Chris@4 1356 FINISHING/other
Chris@4 1357 Illegal.
Chris@4 1358 Return value = BZ_SEQUENCE_ERROR
Chris@4 1359 </programlisting>
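<para>The table can also be read as a small state machine. The following is a minimal sketch of it in C, not the library's actual implementation: the enum and function names are hypothetical and do not appear in <computeroutput>bzlib.h</computeroutput>, and the <computeroutput>drained</computeroutput> flag stands in for "all the existing input has been used up and all compressed output has been removed".</para>

```c
/* Hypothetical names: these types model the table above only and are
   not part of bzlib.h. */
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } State;
typedef enum { ACT_RUN, ACT_FLUSH, ACT_FINISH } Action;
typedef enum { RET_SEQUENCE_ERROR, RET_RUN_OK, RET_FLUSH_OK,
               RET_FINISH_OK, RET_STREAM_END } Result;

/* Apply one action to the modelled stream.  `drained` is nonzero when
   all pending input has been used up and all compressed output has
   been removed.  Updates *s and returns the modelled return value. */
static Result step(State *s, Action a, int drained)
{
    switch (*s) {
    case ST_RUNNING:
        if (a == ACT_RUN)   { return RET_RUN_OK; }
        if (a == ACT_FLUSH) { *s = ST_FLUSHING; return RET_FLUSH_OK; }
        *s = ST_FINISHING;
        return RET_FINISH_OK;
    case ST_FLUSHING:
        if (a != ACT_FLUSH) { return RET_SEQUENCE_ERROR; }
        if (drained)        { *s = ST_RUNNING; return RET_RUN_OK; }
        return RET_FLUSH_OK;
    case ST_FINISHING:
        if (a != ACT_FINISH) { return RET_SEQUENCE_ERROR; }
        if (drained)         { *s = ST_IDLE; return RET_STREAM_END; }
        return RET_FINISH_OK;
    case ST_IDLE:
    default:
        /* IDLE/any is illegal. */
        return RET_SEQUENCE_ERROR;
    }
}
```

<para>Starting from <computeroutput>ST_RUNNING</computeroutput>, a run of <computeroutput>ACT_FINISH</computeroutput> steps ends with <computeroutput>RET_STREAM_END</computeroutput> as soon as <computeroutput>drained</computeroutput> becomes nonzero, which mirrors the finishing-up loop described above.</para>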
Chris@4 1360
Chris@4 1361
Chris@4 1362 <para>That still looks complicated? Well, fair enough. The
Chris@4 1363 usual sequence of calls for compressing a load of data is:</para>
Chris@4 1364
Chris@4 1365 <orderedlist>
Chris@4 1366
Chris@4 1367 <listitem><para>Get started with
Chris@4 1368 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para></listitem>
Chris@4 1369
Chris@4 1370 <listitem><para>Shovel data in and shlurp out its compressed form
Chris@4 1371 using zero or more calls of
Chris@4 1372 <computeroutput>BZ2_bzCompress</computeroutput> with action =
Chris@4 1373 <computeroutput>BZ_RUN</computeroutput>.</para></listitem>
Chris@4 1374
Chris@4 1375 <listitem><para>Finish up. Repeatedly call
Chris@4 1376 <computeroutput>BZ2_bzCompress</computeroutput> with action =
Chris@4 1377 <computeroutput>BZ_FINISH</computeroutput>, copying out the
Chris@4 1378 compressed output, until
Chris@4 1379 <computeroutput>BZ_STREAM_END</computeroutput> is
Chris@4 1380 returned.</para></listitem> <listitem><para>Close up and go home. Call
Chris@4 1381 <computeroutput>BZ2_bzCompressEnd</computeroutput>.</para></listitem>
Chris@4 1382
Chris@4 1383 </orderedlist>
Chris@4 1384
Chris@4 1385 <para>If the data you want to compress fits into your input
Chris@4 1386 buffer all at once, you can skip the calls of
Chris@4 1387 <computeroutput>BZ2_bzCompress ( ..., BZ_RUN )</computeroutput>
Chris@4 1388 and just do the <computeroutput>BZ2_bzCompress ( ..., BZ_FINISH
Chris@4 1389 )</computeroutput> calls.</para>
Chris@4 1390
Chris@4 1391 <para>All required memory is allocated by
Chris@4 1392 <computeroutput>BZ2_bzCompressInit</computeroutput>. The
Chris@4 1393 compression library can accept any data at all (obviously). So
Chris@4 1394 you shouldn't get any error return values from the
Chris@4 1395 <computeroutput>BZ2_bzCompress</computeroutput> calls. If you
Chris@4 1396 do, they will be
Chris@4 1397 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>, and indicate
Chris@4 1398 a bug in your programming.</para>
Chris@4 1399
Chris@4 1400 <para>Other possible (trivial) return values:</para>
Chris@4 1401
Chris@4 1402 <programlisting>
Chris@4 1403 BZ_PARAM_ERROR
Chris@4 1404 if strm is NULL, or strm->s is NULL
Chris@4 1405 </programlisting>
Chris@4 1406
Chris@4 1407 </sect2>
Chris@4 1408
Chris@4 1409
Chris@4 1410 <sect2 id="bzCompress-end" xreflabel="BZ2_bzCompressEnd">
Chris@4 1411 <title>BZ2_bzCompressEnd</title>
Chris@4 1412
Chris@4 1413 <programlisting>
Chris@4 1414 int BZ2_bzCompressEnd ( bz_stream *strm );
Chris@4 1415 </programlisting>
Chris@4 1416
Chris@4 1417 <para>Releases all memory associated with a compression
Chris@4 1418 stream.</para>
Chris@4 1419
Chris@4 1420 <para>Possible return values:</para>
Chris@4 1421
Chris@4 1422 <programlisting>
Chris@4 1423 BZ_PARAM_ERROR if strm is NULL or strm->s is NULL
Chris@4 1424 BZ_OK otherwise
Chris@4 1425 </programlisting>
Chris@4 1426
Chris@4 1427 </sect2>
Chris@4 1428
Chris@4 1429
Chris@4 1430 <sect2 id="bzDecompress-init" xreflabel="BZ2_bzDecompressInit">
Chris@4 1431 <title>BZ2_bzDecompressInit</title>
Chris@4 1432
Chris@4 1433 <programlisting>
Chris@4 1434 int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
Chris@4 1435 </programlisting>
Chris@4 1436
Chris@4 1437 <para>Prepares for decompression. As with
Chris@4 1438 <computeroutput>BZ2_bzCompressInit</computeroutput>, a
Chris@4 1439 <computeroutput>bz_stream</computeroutput> record should be
Chris@4 1440 allocated and initialised before the call. Fields
Chris@4 1441 <computeroutput>bzalloc</computeroutput>,
Chris@4 1442 <computeroutput>bzfree</computeroutput> and
Chris@4 1443 <computeroutput>opaque</computeroutput> should be set if a custom
Chris@4 1444 memory allocator is required, or made
Chris@4 1445 <computeroutput>NULL</computeroutput> for the normal
Chris@4 1446 <computeroutput>malloc</computeroutput> /
Chris@4 1447 <computeroutput>free</computeroutput> routines. Upon return, the
Chris@4 1448 internal state will have been initialised, and
Chris@4 1449 <computeroutput>total_in</computeroutput> and
Chris@4 1450 <computeroutput>total_out</computeroutput> will be zero.</para>
Chris@4 1451
Chris@4 1452 <para>For the meaning of parameter
Chris@4 1453 <computeroutput>verbosity</computeroutput>, see
Chris@4 1454 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para>
Chris@4 1455
Chris@4 1456 <para>If <computeroutput>small</computeroutput> is nonzero, the
Chris@4 1457 library will use an alternative decompression algorithm which
Chris@4 1458 uses less memory but at the cost of decompressing more slowly
Chris@4 1459 (roughly speaking, half the speed, but the maximum memory
Chris@4 1460 requirement drops to around 2300k). See <xref linkend="using"/>
Chris@4 1461 for more information on memory management.</para>
Chris@4 1462
Chris@4 1463 <para>Note that the amount of memory needed to decompress a
Chris@4 1464 stream cannot be determined until the stream's header has been
Chris@4 1465 read, so even if
Chris@4 1466 <computeroutput>BZ2_bzDecompressInit</computeroutput> succeeds, a
Chris@4 1467 subsequent <computeroutput>BZ2_bzDecompress</computeroutput>
Chris@4 1468 could fail with
Chris@4 1469 <computeroutput>BZ_MEM_ERROR</computeroutput>.</para>
Chris@4 1470
Chris@4 1471 <para>Possible return values:</para>
Chris@4 1472
Chris@4 1473 <programlisting>
Chris@4 1474 BZ_CONFIG_ERROR
Chris@4 1475 if the library has been mis-compiled
Chris@4 1476 BZ_PARAM_ERROR
Chris@4 1477 if ( small != 0 && small != 1 )
Chris@4 1478 or (verbosity < 0 || verbosity > 4)
Chris@4 1479 BZ_MEM_ERROR
Chris@4 1480 if insufficient memory is available
Chris@4 1481 </programlisting>
Chris@4 1482
Chris@4 1483 <para>Allowable next actions:</para>
Chris@4 1484
Chris@4 1485 <programlisting>
Chris@4 1486 BZ2_bzDecompress
Chris@4 1487 if BZ_OK was returned
Chris@4 1488 no specific action required in case of error
Chris@4 1489 </programlisting>
Chris@4 1490
Chris@4 1491 </sect2>
Chris@4 1492
Chris@4 1493
Chris@4 1494 <sect2 id="bzDecompress" xreflabel="BZ2_bzDecompress">
Chris@4 1495 <title>BZ2_bzDecompress</title>
Chris@4 1496
Chris@4 1497 <programlisting>
Chris@4 1498 int BZ2_bzDecompress ( bz_stream *strm );
Chris@4 1499 </programlisting>
Chris@4 1500
Chris@4 1501 <para>Provides more input and/or output buffer space for the
Chris@4 1502 library. The caller maintains input and output buffers, and uses
Chris@4 1503 <computeroutput>BZ2_bzDecompress</computeroutput> to transfer
Chris@4 1504 data between them.</para>
Chris@4 1505
Chris@4 1506 <para>Before each call to
Chris@4 1507 <computeroutput>BZ2_bzDecompress</computeroutput>,
Chris@4 1508 <computeroutput>next_in</computeroutput> should point at the
Chris@4 1509 compressed data, and <computeroutput>avail_in</computeroutput>
Chris@4 1510 should indicate how many bytes the library may read.
Chris@4 1511 <computeroutput>BZ2_bzDecompress</computeroutput> updates
Chris@4 1512 <computeroutput>next_in</computeroutput>,
Chris@4 1513 <computeroutput>avail_in</computeroutput> and
Chris@4 1514 <computeroutput>total_in</computeroutput> to reflect the number
Chris@4 1515 of bytes it has read.</para>
Chris@4 1516
Chris@4 1517 <para>Similarly, <computeroutput>next_out</computeroutput> should
Chris@4 1518 point to a buffer in which the uncompressed output is to be
Chris@4 1519 placed, with <computeroutput>avail_out</computeroutput>
Chris@4 1520 indicating how much output space is available.
Chris@4 1521 <computeroutput>BZ2_bzDecompress</computeroutput> updates
Chris@4 1522 <computeroutput>next_out</computeroutput>,
Chris@4 1523 <computeroutput>avail_out</computeroutput> and
Chris@4 1524 <computeroutput>total_out</computeroutput> to reflect the number
Chris@4 1525 of bytes output.</para>
Chris@4 1526
Chris@4 1527 <para>You may provide and remove as little or as much data as you
Chris@4 1528 like on each call of
Chris@4 1529 <computeroutput>BZ2_bzDecompress</computeroutput>. In the limit,
Chris@4 1530 it is acceptable to supply and remove data one byte at a time,
Chris@4 1531 although this would be terribly inefficient. You should always
Chris@4 1532 ensure that at least one byte of output space is available at
Chris@4 1533 each call.</para>
Chris@4 1534
Chris@4 1535 <para>Use of <computeroutput>BZ2_bzDecompress</computeroutput> is
Chris@4 1536 simpler than
Chris@4 1537 <computeroutput>BZ2_bzCompress</computeroutput>.</para>
Chris@4 1538
Chris@4 1539 <para>You should provide input and remove output as described
Chris@4 1540 above, and repeatedly call
Chris@4 1541 <computeroutput>BZ2_bzDecompress</computeroutput> until
Chris@4 1542 <computeroutput>BZ_STREAM_END</computeroutput> is returned.
Chris@4 1543 Appearance of <computeroutput>BZ_STREAM_END</computeroutput>
Chris@4 1544 denotes that <computeroutput>BZ2_bzDecompress</computeroutput>
Chris@4 1545 has detected the logical end of the compressed stream.
Chris@4 1546 <computeroutput>BZ2_bzDecompress</computeroutput> will not
Chris@4 1547 produce <computeroutput>BZ_STREAM_END</computeroutput> until all
Chris@4 1548 output data has been placed into the output buffer, so once
Chris@4 1549 <computeroutput>BZ_STREAM_END</computeroutput> appears, you are
Chris@4 1550 guaranteed to have available all the decompressed output, and
Chris@4 1551 <computeroutput>BZ2_bzDecompressEnd</computeroutput> can safely
Chris@4 1552 be called.</para>
Chris@4 1553
Chris@4 1554 <para>In case of an error return value, you should call
Chris@4 1555 <computeroutput>BZ2_bzDecompressEnd</computeroutput> to clean up
Chris@4 1556 and release memory.</para>
Chris@4 1557
Chris@4 1558 <para>Possible return values:</para>
Chris@4 1559
Chris@4 1560 <programlisting>
Chris@4 1561 BZ_PARAM_ERROR
Chris@4 1562 if strm is NULL or strm->s is NULL
Chris@4 1563 or strm->avail_out < 1
Chris@4 1564 BZ_DATA_ERROR
Chris@4 1565 if a data integrity error is detected in the compressed stream
Chris@4 1566 BZ_DATA_ERROR_MAGIC
Chris@4 1567 if the compressed stream doesn't begin with the right magic bytes
Chris@4 1568 BZ_MEM_ERROR
Chris@4 1569 if there wasn't enough memory available
Chris@4 1570 BZ_STREAM_END
Chris@4 1571 if the logical end of the data stream was detected and all
Chris@4 1572 output has been consumed, eg s->avail_out > 0
Chris@4 1573 BZ_OK
Chris@4 1574 otherwise
Chris@4 1575 </programlisting>
Chris@4 1576
Chris@4 1577 <para>Allowable next actions:</para>
Chris@4 1578
Chris@4 1579 <programlisting>
Chris@4 1580 BZ2_bzDecompress
Chris@4 1581 if BZ_OK was returned
Chris@4 1582 BZ2_bzDecompressEnd
Chris@4 1583 otherwise
Chris@4 1584 </programlisting>
Chris@4 1585
Chris@4 1586 </sect2>
Chris@4 1587
Chris@4 1588
Chris@4 1589 <sect2 id="bzDecompress-end" xreflabel="BZ2_bzDecompressEnd">
Chris@4 1590 <title>BZ2_bzDecompressEnd</title>
Chris@4 1591
Chris@4 1592 <programlisting>
Chris@4 1593 int BZ2_bzDecompressEnd ( bz_stream *strm );
Chris@4 1594 </programlisting>
Chris@4 1595
Chris@4 1596 <para>Releases all memory associated with a decompression
Chris@4 1597 stream.</para>
Chris@4 1598
Chris@4 1599 <para>Possible return values:</para>
Chris@4 1600
Chris@4 1601 <programlisting>
Chris@4 1602 BZ_PARAM_ERROR
Chris@4 1603 if strm is NULL or strm->s is NULL
Chris@4 1604 BZ_OK
Chris@4 1605 otherwise
Chris@4 1606 </programlisting>
Chris@4 1607
Chris@4 1608 <para>Allowable next actions:</para>
Chris@4 1609
Chris@4 1610 <programlisting>
Chris@4 1611 None.
Chris@4 1612 </programlisting>
Chris@4 1613
Chris@4 1614 </sect2>
Chris@4 1615
Chris@4 1616 </sect1>
Chris@4 1617
Chris@4 1618
Chris@4 1619 <sect1 id="hl-interface" xreflabel="High-level interface">
Chris@4 1620 <title>High-level interface</title>
Chris@4 1621
Chris@4 1622 <para>This interface provides functions for reading and writing
Chris@4 1623 <computeroutput>bzip2</computeroutput> format files. First, some
Chris@4 1624 general points.</para>
Chris@4 1625
Chris@4 1626 <itemizedlist mark='bullet'>
Chris@4 1627
Chris@4 1628 <listitem><para>All of the functions take an
Chris@4 1629 <computeroutput>int*</computeroutput> first argument,
Chris@4 1630 <computeroutput>bzerror</computeroutput>. After each call,
Chris@4 1631 <computeroutput>bzerror</computeroutput> should be consulted
Chris@4 1632 first to determine the outcome of the call. If
Chris@4 1633 <computeroutput>bzerror</computeroutput> is
Chris@4 1634 <computeroutput>BZ_OK</computeroutput>, the call completed
Chris@4 1635 successfully, and only then should the return value of the
Chris@4 1636 function (if any) be consulted. If
Chris@4 1637 <computeroutput>bzerror</computeroutput> is
Chris@4 1638 <computeroutput>BZ_IO_ERROR</computeroutput>, there was an
Chris@4 1639 error reading/writing the underlying compressed file, and you
Chris@4 1640 should then consult <computeroutput>errno</computeroutput> /
Chris@4 1641 <computeroutput>perror</computeroutput> to determine the cause
Chris@4 1642 of the difficulty. <computeroutput>bzerror</computeroutput>
Chris@4 1643 may also be set to various other values; precise details are
Chris@4 1644 given on a per-function basis below.</para></listitem>
Chris@4 1645
Chris@4 1646 <listitem><para>If <computeroutput>bzerror</computeroutput> indicates
Chris@4 1647 an error (ie, anything except
Chris@4 1648 <computeroutput>BZ_OK</computeroutput> and
Chris@4 1649 <computeroutput>BZ_STREAM_END</computeroutput>), you should
Chris@4 1650 immediately call
Chris@4 1651 <computeroutput>BZ2_bzReadClose</computeroutput> (or
Chris@4 1652 <computeroutput>BZ2_bzWriteClose</computeroutput>, depending on
Chris@4 1653 whether you are attempting to read or to write) to free up all
Chris@4 1654 resources associated with the stream. Once an error has been
Chris@4 1655 indicated, behaviour of all calls except
Chris@4 1656 <computeroutput>BZ2_bzReadClose</computeroutput>
Chris@4 1657 (<computeroutput>BZ2_bzWriteClose</computeroutput>) is
Chris@4 1658 undefined. The implication is that (1)
Chris@4 1659 <computeroutput>bzerror</computeroutput> should be checked
Chris@4 1660 after each call, and (2) if
Chris@4 1661 <computeroutput>bzerror</computeroutput> indicates an error,
Chris@4 1662 <computeroutput>BZ2_bzReadClose</computeroutput>
Chris@4 1663 (<computeroutput>BZ2_bzWriteClose</computeroutput>) should then
Chris@4 1664 be called to clean up.</para></listitem>
Chris@4 1665
Chris@4 1666 <listitem><para>The <computeroutput>FILE*</computeroutput> arguments
Chris@4 1667 passed to <computeroutput>BZ2_bzReadOpen</computeroutput> /
Chris@4 1668 <computeroutput>BZ2_bzWriteOpen</computeroutput> should be set
Chris@4 1669 to binary mode. Most Unix systems will do this by default, but
Chris@4 1670 other platforms, including Windows and Mac, will not. If you
Chris@4 1671 omit this, you may encounter problems when moving code to new
Chris@4 1672 platforms.</para></listitem>
Chris@4 1673
Chris@4 1674 <listitem><para>Memory allocation requests are handled by
Chris@4 1675 <computeroutput>malloc</computeroutput> /
Chris@4 1676 <computeroutput>free</computeroutput>. At present there is no
Chris@4 1677 facility for user-defined memory allocators in the file I/O
Chris@4 1678 functions (could easily be added, though).</para></listitem>
Chris@4 1679
Chris@4 1680 </itemizedlist>
Chris@4 1681
Chris@4 1682
Chris@4 1683
Chris@4 1684 <sect2 id="bzreadopen" xreflabel="BZ2_bzReadOpen">
Chris@4 1685 <title>BZ2_bzReadOpen</title>
Chris@4 1686
Chris@4 1687 <programlisting>
Chris@4 1688 typedef void BZFILE;
Chris@4 1689
Chris@4 1690 BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f,
Chris@4 1691 int verbosity, int small,
Chris@4 1692 void *unused, int nUnused );
Chris@4 1693 </programlisting>
Chris@4 1694
Chris@4 1695 <para>Prepare to read compressed data from file handle
Chris@4 1696 <computeroutput>f</computeroutput>.
Chris@4 1697 <computeroutput>f</computeroutput> should refer to a file which
Chris@4 1698 has been opened for reading, and for which the error indicator
Chris@4 1699 (<computeroutput>ferror(f)</computeroutput>) is not set. If
Chris@4 1700 <computeroutput>small</computeroutput> is 1, the library will try
Chris@4 1701 to decompress using less memory, at the expense of speed.</para>
Chris@4 1702
Chris@4 1703 <para>For reasons explained below,
Chris@4 1704 <computeroutput>BZ2_bzRead</computeroutput> will decompress the
Chris@4 1705 <computeroutput>nUnused</computeroutput> bytes starting at
Chris@4 1706 <computeroutput>unused</computeroutput>, before starting to read
Chris@4 1707 from the file <computeroutput>f</computeroutput>. At most
Chris@4 1708 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes may be
Chris@4 1709 supplied like this. If this facility is not required, you should
Chris@4 1710 pass <computeroutput>NULL</computeroutput> and
Chris@4 1711 <computeroutput>0</computeroutput> for
Chris@4 1712 <computeroutput>unused</computeroutput> and
Chris@4 1713 <computeroutput>nUnused</computeroutput> respectively.</para>
Chris@4 1714
Chris@4 1715 <para>For the meaning of parameters
Chris@4 1716 <computeroutput>small</computeroutput> and
Chris@4 1717 <computeroutput>verbosity</computeroutput>, see
Chris@4 1718 <computeroutput>BZ2_bzDecompressInit</computeroutput>.</para>
Chris@4 1719
Chris@4 1720 <para>The amount of memory needed to decompress a file cannot be
Chris@4 1721 determined until the file's header has been read. So it is
Chris@4 1722 possible that <computeroutput>BZ2_bzReadOpen</computeroutput>
Chris@4 1723 returns <computeroutput>BZ_OK</computeroutput> but a subsequent
Chris@4 1724 call of <computeroutput>BZ2_bzRead</computeroutput> will return
Chris@4 1725 <computeroutput>BZ_MEM_ERROR</computeroutput>.</para>
Chris@4 1726
Chris@4 1727 <para>Possible assignments to
Chris@4 1728 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 1729
Chris@4 1730 <programlisting>
Chris@4 1731 BZ_CONFIG_ERROR
Chris@4 1732 if the library has been mis-compiled
Chris@4 1733 BZ_PARAM_ERROR
Chris@4 1734 if f is NULL
Chris@4 1735 or small is neither 0 nor 1
Chris@4 1736 or ( unused == NULL && nUnused != 0 )
Chris@4 1737 or ( unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED) )
Chris@4 1738 BZ_IO_ERROR
Chris@4 1739 if ferror(f) is nonzero
Chris@4 1740 BZ_MEM_ERROR
Chris@4 1741 if insufficient memory is available
Chris@4 1742 BZ_OK
Chris@4 1743 otherwise.
Chris@4 1744 </programlisting>
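<para>The range condition in the <computeroutput>BZ_PARAM_ERROR</computeroutput> entry above is written in mathematical notation; a chained comparison of that form does not test a range in C. A sketch of the intended check follows. The helper name is hypothetical, and <computeroutput>SKETCH_MAX_UNUSED</computeroutput> is a stand-in for <computeroutput>BZ_MAX_UNUSED</computeroutput> (5000 in the 1.0.6 header) so that the fragment compiles without <computeroutput>bzlib.h</computeroutput>.</para>

```c
#include <stddef.h>

/* Stand-in for BZ_MAX_UNUSED from bzlib.h (5000 in bzip2 1.0.6);
   defined here only to keep the sketch self-contained. */
#define SKETCH_MAX_UNUSED 5000

/* Hypothetical helper: returns nonzero if the (unused, nUnused) pair
   satisfies the conditions listed for BZ2_bzReadOpen, zero if the
   pair would be rejected with BZ_PARAM_ERROR. */
static int unused_args_ok(const void *unused, int nUnused)
{
    if (unused == NULL) {
        /* No buffer supplied: the count must be zero. */
        return nUnused == 0;
    }
    /* Buffer supplied: the count must lie in 0 .. SKETCH_MAX_UNUSED. */
    return nUnused >= 0 && nUnused <= SKETCH_MAX_UNUSED;
}
```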
Chris@4 1745
Chris@4 1746 <para>Possible return values:</para>
Chris@4 1747
Chris@4 1748 <programlisting>
Chris@4 1749 Pointer to an abstract BZFILE
Chris@4 1750 if bzerror is BZ_OK
Chris@4 1751 NULL
Chris@4 1752 otherwise
Chris@4 1753 </programlisting>
Chris@4 1754
Chris@4 1755 <para>Allowable next actions:</para>
Chris@4 1756
Chris@4 1757 <programlisting>
Chris@4 1758 BZ2_bzRead
Chris@4 1759 if bzerror is BZ_OK
Chris@4 1760 BZ2_bzReadClose
Chris@4 1761 otherwise
Chris@4 1762 </programlisting>
Chris@4 1763
Chris@4 1764 </sect2>
Chris@4 1765
Chris@4 1766
Chris@4 1767 <sect2 id="bzread" xreflabel="BZ2_bzRead">
Chris@4 1768 <title>BZ2_bzRead</title>
Chris@4 1769
Chris@4 1770 <programlisting>
Chris@4 1771 int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
Chris@4 1772 </programlisting>
Chris@4 1773
Chris@4 1774 <para>Reads up to <computeroutput>len</computeroutput>
Chris@4 1775 (uncompressed) bytes from the compressed file
Chris@4 1776 <computeroutput>b</computeroutput> into the buffer
Chris@4 1777 <computeroutput>buf</computeroutput>. If the read was
Chris@4 1778 successful, <computeroutput>bzerror</computeroutput> is set to
Chris@4 1779 <computeroutput>BZ_OK</computeroutput> and the number of bytes
Chris@4 1780 read is returned. If the logical end-of-stream was detected,
Chris@4 1781 <computeroutput>bzerror</computeroutput> will be set to
Chris@4 1782 <computeroutput>BZ_STREAM_END</computeroutput>, and the number of
Chris@4 1783 bytes read is returned. All other
Chris@4 1784 <computeroutput>bzerror</computeroutput> values denote an
Chris@4 1785 error.</para>
Chris@4 1786
Chris@4 1787 <para><computeroutput>BZ2_bzRead</computeroutput> will supply
Chris@4 1788 <computeroutput>len</computeroutput> bytes, unless the logical
Chris@4 1789 stream end is detected or an error occurs. Because of this, it
Chris@4 1790 is possible to detect the stream end by observing when the number
Chris@4 1791 of bytes returned is less than the number requested.
Chris@4 1792 Nevertheless, this is regarded as inadvisable; you should instead
Chris@4 1793 check <computeroutput>bzerror</computeroutput> after every call
Chris@4 1794 and watch out for
Chris@4 1795 <computeroutput>BZ_STREAM_END</computeroutput>.</para>
Chris@4 1796
Chris@4 1797 <para>Internally, <computeroutput>BZ2_bzRead</computeroutput>
Chris@4 1798 copies data from the compressed file in chunks of size
Chris@4 1799 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes before
Chris@4 1800 decompressing it. If the file contains more bytes than strictly
Chris@4 1801 needed to reach the logical end-of-stream,
Chris@4 1802 <computeroutput>BZ2_bzRead</computeroutput> will almost certainly
Chris@4 1803 read some of the trailing data before signalling
Chris@4 1804 <computeroutput>BZ_STREAM_END</computeroutput>. To collect the
Chris@4 1805 read but unused data once
Chris@4 1806 <computeroutput>BZ_STREAM_END</computeroutput> has appeared,
Chris@4 1807 call <computeroutput>BZ2_bzReadGetUnused</computeroutput>
Chris@4 1808 immediately before
Chris@4 1809 <computeroutput>BZ2_bzReadClose</computeroutput>.</para>
Chris@4 1810
Chris@4 1811 <para>Possible assignments to
Chris@4 1812 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 1813
Chris@4 1814 <programlisting>
Chris@4 1815 BZ_PARAM_ERROR
Chris@4 1816 if b is NULL or buf is NULL or len < 0
Chris@4 1817 BZ_SEQUENCE_ERROR
Chris@4 1818 if b was opened with BZ2_bzWriteOpen
Chris@4 1819 BZ_IO_ERROR
Chris@4 1820 if there is an error reading from the compressed file
Chris@4 1821 BZ_UNEXPECTED_EOF
Chris@4 1822 if the compressed file ended before
Chris@4 1823 the logical end-of-stream was detected
Chris@4 1824 BZ_DATA_ERROR
Chris@4 1825 if a data integrity error was detected in the compressed stream
Chris@4 1826 BZ_DATA_ERROR_MAGIC
Chris@4 1827 if the stream does not begin with the requisite header bytes
Chris@4 1828 (ie, is not a bzip2 data file). This is really
Chris@4 1829 a special case of BZ_DATA_ERROR.
Chris@4 1830 BZ_MEM_ERROR
Chris@4 1831 if insufficient memory was available
Chris@4 1832 BZ_STREAM_END
Chris@4 1833 if the logical end of stream was detected.
Chris@4 1834 BZ_OK
Chris@4 1835 otherwise.
Chris@4 1836 </programlisting>
Chris@4 1837
Chris@4 1838 <para>Possible return values:</para>
Chris@4 1839
Chris@4 1840 <programlisting>
Chris@4 1841 number of bytes read
Chris@4 1842 if bzerror is BZ_OK or BZ_STREAM_END
Chris@4 1843 undefined
Chris@4 1844 otherwise
Chris@4 1845 </programlisting>
Chris@4 1846
Chris@4 1847 <para>Allowable next actions:</para>
Chris@4 1848
Chris@4 1849 <programlisting>
Chris@4 1850 collect data from buf, then BZ2_bzRead or BZ2_bzReadClose
Chris@4 1851 if bzerror is BZ_OK
Chris@4 1852 collect data from buf, then BZ2_bzReadClose or BZ2_bzReadGetUnused
Chris@4 1853 if bzerror is BZ_STREAM_END
Chris@4 1854 BZ2_bzReadClose
Chris@4 1855 otherwise
Chris@4 1856 </programlisting>
Chris@4 1857
Chris@4 1858 </sect2>
Chris@4 1859
Chris@4 1860
Chris@4 1861 <sect2 id="bzreadgetunused" xreflabel="BZ2_bzReadGetUnused">
Chris@4 1862 <title>BZ2_bzReadGetUnused</title>
Chris@4 1863
Chris@4 1864 <programlisting>
Chris@4 1865 void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b,
Chris@4 1866 void** unused, int* nUnused );
Chris@4 1867 </programlisting>
Chris@4 1868
Chris@4 1869 <para>Returns data which was read from the compressed file but
Chris@4 1870 was not needed to get to the logical end-of-stream.
Chris@4 1871 <computeroutput>*unused</computeroutput> is set to the address of
Chris@4 1872 the data, and <computeroutput>*nUnused</computeroutput> to the
Chris@4 1873 number of bytes. <computeroutput>*nUnused</computeroutput> will
Chris@4 1874 be set to a value between <computeroutput>0</computeroutput> and
Chris@4 1875 <computeroutput>BZ_MAX_UNUSED</computeroutput> inclusive.</para>
Chris@4 1876
Chris@4 1877 <para>This function may only be called once
Chris@4 1878 <computeroutput>BZ2_bzRead</computeroutput> has signalled
Chris@4 1879 <computeroutput>BZ_STREAM_END</computeroutput> but before
Chris@4 1880 <computeroutput>BZ2_bzReadClose</computeroutput>.</para>
Chris@4 1881
Chris@4 1882 <para>Possible assignments to
Chris@4 1883 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 1884
Chris@4 1885 <programlisting>
Chris@4 1886 BZ_PARAM_ERROR
Chris@4 1887 if b is NULL
Chris@4 1888 or unused is NULL or nUnused is NULL
Chris@4 1889 BZ_SEQUENCE_ERROR
Chris@4 1890 if BZ_STREAM_END has not been signalled
Chris@4 1891 or if b was opened with BZ2_bzWriteOpen
Chris@4 1892 BZ_OK
Chris@4 1893 otherwise
Chris@4 1894 </programlisting>
Chris@4 1895
Chris@4 1896 <para>Allowable next actions:</para>
Chris@4 1897
Chris@4 1898 <programlisting>
Chris@4 1899 BZ2_bzReadClose
Chris@4 1900 </programlisting>
Chris@4 1901
Chris@4 1902 </sect2>
Chris@4 1903
Chris@4 1904
Chris@4 1905 <sect2 id="bzreadclose" xreflabel="BZ2_bzReadClose">
Chris@4 1906 <title>BZ2_bzReadClose</title>
Chris@4 1907
Chris@4 1908 <programlisting>
Chris@4 1909 void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
Chris@4 1910 </programlisting>
Chris@4 1911
Chris@4 1912 <para>Releases all memory pertaining to the compressed file
Chris@4 1913 <computeroutput>b</computeroutput>.
Chris@4 1914 <computeroutput>BZ2_bzReadClose</computeroutput> does not call
Chris@4 1915 <computeroutput>fclose</computeroutput> on the underlying file
Chris@4 1916 handle, so you should do that yourself if appropriate.
Chris@4 1917 <computeroutput>BZ2_bzReadClose</computeroutput> should be called
Chris@4 1918 to clean up after all error situations.</para>
Chris@4 1919
Chris@4 1920 <para>Possible assignments to
Chris@4 1921 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 1922
Chris@4 1923 <programlisting>
Chris@4 1924 BZ_SEQUENCE_ERROR
Chris@4 1925 if b was opened with BZ2_bzWriteOpen
Chris@4 1926 BZ_OK
Chris@4 1927 otherwise
Chris@4 1928 </programlisting>
Chris@4 1929
Chris@4 1930 <para>Allowable next actions:</para>
Chris@4 1931
Chris@4 1932 <programlisting>
Chris@4 1933 none
Chris@4 1934 </programlisting>
Chris@4 1935
Chris@4 1936 </sect2>
Chris@4 1937
Chris@4 1938
Chris@4 1939 <sect2 id="bzwriteopen" xreflabel="BZ2_bzWriteOpen">
Chris@4 1940 <title>BZ2_bzWriteOpen</title>
Chris@4 1941
Chris@4 1942 <programlisting>
Chris@4 1943 BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f,
Chris@4 1944 int blockSize100k, int verbosity,
Chris@4 1945 int workFactor );
Chris@4 1946 </programlisting>
Chris@4 1947
Chris@4 1948 <para>Prepare to write compressed data to file handle
Chris@4 1949 <computeroutput>f</computeroutput>.
Chris@4 1950 <computeroutput>f</computeroutput> should refer to a file which
Chris@4 1951 has been opened for writing, and for which the error indicator
Chris@4 1952 (<computeroutput>ferror(f)</computeroutput>) is not set.</para>
Chris@4 1953
Chris@4 1954 <para>For the meaning of parameters
Chris@4 1955 <computeroutput>blockSize100k</computeroutput>,
Chris@4 1956 <computeroutput>verbosity</computeroutput> and
Chris@4 1957 <computeroutput>workFactor</computeroutput>, see
Chris@4 1958 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para>
Chris@4 1959
Chris@4 1960 <para>All required memory is allocated at this stage, so if the
Chris@4 1961 call completes successfully,
Chris@4 1962 <computeroutput>BZ_MEM_ERROR</computeroutput> cannot be signalled
Chris@4 1963 by a subsequent call to
Chris@4 1964 <computeroutput>BZ2_bzWrite</computeroutput>.</para>
Chris@4 1965
Chris@4 1966 <para>Possible assignments to
Chris@4 1967 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 1968
Chris@4 1969 <programlisting>
Chris@4 1970 BZ_CONFIG_ERROR
Chris@4 1971 if the library has been mis-compiled
Chris@4 1972 BZ_PARAM_ERROR
Chris@4 1973 if f is NULL
Chris@4 1974 or blockSize100k < 1 or blockSize100k > 9
Chris@4 1975 BZ_IO_ERROR
Chris@4 1976 if ferror(f) is nonzero
Chris@4 1977 BZ_MEM_ERROR
Chris@4 1978 if insufficient memory is available
Chris@4 1979 BZ_OK
Chris@4 1980 otherwise
Chris@4 1981 </programlisting>
Chris@4 1982
Chris@4 1983 <para>Possible return values:</para>
Chris@4 1984
Chris@4 1985 <programlisting>
Chris@4 1986 Pointer to an abstract BZFILE
Chris@4 1987 if bzerror is BZ_OK
Chris@4 1988 NULL
Chris@4 1989 otherwise
Chris@4 1990 </programlisting>
Chris@4 1991
Chris@4 1992 <para>Allowable next actions:</para>
Chris@4 1993
Chris@4 1994 <programlisting>
Chris@4 1995 BZ2_bzWrite
Chris@4 1996 if bzerror is BZ_OK
Chris@4 1997 (you could go directly to BZ2_bzWriteClose, but this would be pretty pointless)
Chris@4 1998 BZ2_bzWriteClose
Chris@4 1999 otherwise
Chris@4 2000 </programlisting>
Chris@4 2001
Chris@4 2002 </sect2>
Chris@4 2003
Chris@4 2004
Chris@4 2005 <sect2 id="bzwrite" xreflabel="BZ2_bzWrite">
Chris@4 2006 <title>BZ2_bzWrite</title>
Chris@4 2007
Chris@4 2008 <programlisting>
Chris@4 2009 void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
Chris@4 2010 </programlisting>
Chris@4 2011
Chris@4 2012 <para>Absorbs <computeroutput>len</computeroutput> bytes from the
Chris@4 2013 buffer <computeroutput>buf</computeroutput>, eventually to be
Chris@4 2014 compressed and written to the file.</para>
Chris@4 2015
Chris@4 2016 <para>Possible assignments to
Chris@4 2017 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 2018
Chris@4 2019 <programlisting>
Chris@4 2020 BZ_PARAM_ERROR
Chris@4 2021 if b is NULL or buf is NULL or len < 0
Chris@4 2022 BZ_SEQUENCE_ERROR
Chris@4 2023 if b was opened with BZ2_bzReadOpen
Chris@4 2024 BZ_IO_ERROR
Chris@4 2025 if there is an error writing the compressed file.
Chris@4 2026 BZ_OK
Chris@4 2027 otherwise
Chris@4 2028 </programlisting>
Chris@4 2029
Chris@4 2030 </sect2>
Chris@4 2031
Chris@4 2032
Chris@4 2033 <sect2 id="bzwriteclose" xreflabel="BZ2_bzWriteClose">
Chris@4 2034 <title>BZ2_bzWriteClose</title>
Chris@4 2035
Chris@4 2036 <programlisting>
Chris@4 2037 void BZ2_bzWriteClose( int *bzerror, BZFILE* b,
Chris@4 2038 int abandon,
Chris@4 2039 unsigned int* nbytes_in,
Chris@4 2040 unsigned int* nbytes_out );
Chris@4 2041
Chris@4 2042 void BZ2_bzWriteClose64( int *bzerror, BZFILE* b,
Chris@4 2043 int abandon,
Chris@4 2044 unsigned int* nbytes_in_lo32,
Chris@4 2045 unsigned int* nbytes_in_hi32,
Chris@4 2046 unsigned int* nbytes_out_lo32,
Chris@4 2047 unsigned int* nbytes_out_hi32 );
Chris@4 2048 </programlisting>
Chris@4 2049
Chris@4 2050 <para>Compresses and flushes to the compressed file all data so
Chris@4 2051 far supplied by <computeroutput>BZ2_bzWrite</computeroutput>.
Chris@4 2052 The logical end-of-stream markers are also written, so subsequent
Chris@4 2053 calls to <computeroutput>BZ2_bzWrite</computeroutput> are
Chris@4 2054 illegal. All memory associated with the compressed file
Chris@4 2055 <computeroutput>b</computeroutput> is released.
Chris@4 2056 <computeroutput>fflush</computeroutput> is called on the
Chris@4 2057 compressed file, but it is not
Chris@4 2058 <computeroutput>fclose</computeroutput>'d.</para>
Chris@4 2059
Chris@4 2060 <para>If <computeroutput>BZ2_bzWriteClose</computeroutput> is
Chris@4 2061 called to clean up after an error, the only action is to release
Chris@4 2062 the memory. The library records the error codes issued by
Chris@4 2063 previous calls, so this situation will be detected automatically.
Chris@4 2064 There is no attempt to complete the compression operation, nor to
Chris@4 2065 <computeroutput>fflush</computeroutput> the compressed file. You
Chris@4 2066 can force this behaviour to happen even in the case of no error,
Chris@4 2067 by passing a nonzero value to
Chris@4 2068 <computeroutput>abandon</computeroutput>.</para>
Chris@4 2069
Chris@4 2070 <para>If <computeroutput>nbytes_in</computeroutput> is non-null,
Chris@4 2071 <computeroutput>*nbytes_in</computeroutput> will be set to the
Chris@4 2072 total volume of uncompressed data handled. Similarly, if
Chris@4 2073 <computeroutput>nbytes_out</computeroutput> is non-null,
Chris@4 2074 <computeroutput>*nbytes_out</computeroutput> will be set to the total
Chris@4 2075 volume of compressed data written. For compatibility with older versions of the library,
Chris@4 2076 <computeroutput>BZ2_bzWriteClose</computeroutput> only yields the
Chris@4 2077 lower 32 bits of these counts. Use
Chris@4 2078 <computeroutput>BZ2_bzWriteClose64</computeroutput> if you want
Chris@4 2079 the full 64-bit counts. These two functions are otherwise
Chris@4 2080 identical.</para>
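<para>A minimal sketch of how the two halves reported by
<computeroutput>BZ2_bzWriteClose64</computeroutput> might be
recombined into a single count. The helper name
<computeroutput>bz_count64</computeroutput> is hypothetical and not
part of the library, and the sketch assumes a compiler that provides
<computeroutput>stdint.h</computeroutput>.</para>

```c
#include <stdint.h>

/* Combine the low and high 32-bit halves of a byte count into one
   64-bit value. Hypothetical helper, not a libbzip2 function. */
static uint64_t bz_count64(unsigned int lo32, unsigned int hi32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

<para>After a call such as
<computeroutput>BZ2_bzWriteClose64( &bzerror, b, 0, &in_lo, &in_hi, &out_lo, &out_hi )</computeroutput>,
the total input size would then be
<computeroutput>bz_count64( in_lo, in_hi )</computeroutput>.</para>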
Chris@4 2081
Chris@4 2082 <para>Possible assignments to
Chris@4 2083 <computeroutput>bzerror</computeroutput>:</para>
Chris@4 2084
Chris@4 2085 <programlisting>
Chris@4 2086 BZ_SEQUENCE_ERROR
Chris@4 2087 if b was opened with BZ2_bzReadOpen
Chris@4 2088 BZ_IO_ERROR
Chris@4 2089 if there is an error writing the compressed file
Chris@4 2090 BZ_OK
Chris@4 2091 otherwise
Chris@4 2092 </programlisting>
Chris@4 2093
Chris@4 2094 </sect2>
Chris@4 2095
Chris@4 2096
Chris@4 2097 <sect2 id="embed" xreflabel="Handling embedded compressed data streams">
Chris@4 2098 <title>Handling embedded compressed data streams</title>
Chris@4 2099
Chris@4 2100 <para>The high-level library facilitates use of
Chris@4 2101 <computeroutput>bzip2</computeroutput> data streams which form
Chris@4 2102 some part of a surrounding, larger data stream.</para>
Chris@4 2103
Chris@4 2104 <itemizedlist mark='bullet'>
Chris@4 2105
Chris@4 2106 <listitem><para>For writing, the library takes an open file handle,
Chris@4 2107 writes compressed data to it,
Chris@4 2108 <computeroutput>fflush</computeroutput>es it but does not
Chris@4 2109 <computeroutput>fclose</computeroutput> it. The calling
Chris@4 2110 application can write its own data before and after the
Chris@4 2111 compressed data stream, using that same file handle.</para></listitem>
Chris@4 2112
Chris@4 2113 <listitem><para>Reading is more complex, and the facilities are not as
Chris@4 2114 general as they could be since generality is hard to reconcile
Chris@4 2115 with efficiency. <computeroutput>BZ2_bzRead</computeroutput>
Chris@4 2116 reads from the compressed file in blocks of size
Chris@4 2117 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes, and in
Chris@4 2118 doing so will probably overshoot the logical end of the compressed
Chris@4 2119 stream. To recover this data once decompression has ended,
Chris@4 2120 call <computeroutput>BZ2_bzReadGetUnused</computeroutput> after
Chris@4 2121 the last call of <computeroutput>BZ2_bzRead</computeroutput>
Chris@4 2122 (the one returning
Chris@4 2123 <computeroutput>BZ_STREAM_END</computeroutput>) but before
Chris@4 2124 calling
Chris@4 2125 <computeroutput>BZ2_bzReadClose</computeroutput>.</para></listitem>
Chris@4 2126
Chris@4 2127 </itemizedlist>
Chris@4 2128
Chris@4 2129 <para>This mechanism makes it easy to decompress multiple
Chris@4 2130 <computeroutput>bzip2</computeroutput> streams placed end-to-end.
Chris@4 2131 At the end of one stream, when
Chris@4 2132 <computeroutput>BZ2_bzRead</computeroutput> returns
Chris@4 2133 <computeroutput>BZ_STREAM_END</computeroutput>, call
Chris@4 2134 <computeroutput>BZ2_bzReadGetUnused</computeroutput> to collect
Chris@4 2135 the unused data (copy it into your own buffer somewhere). That
Chris@4 2136 data forms the start of the next compressed stream. To start
Chris@4 2137 uncompressing that next stream, call
Chris@4 2138 <computeroutput>BZ2_bzReadOpen</computeroutput> again, feeding in
Chris@4 2139 the unused data via the <computeroutput>unused</computeroutput> /
Chris@4 2140 <computeroutput>nUnused</computeroutput> parameters. Keep doing
Chris@4 2141 this until <computeroutput>BZ_STREAM_END</computeroutput> return
Chris@4 2142 coincides with the physical end of file
Chris@4 2143 (<computeroutput>feof(f)</computeroutput>). In this situation
Chris@4 2144 <computeroutput>BZ2_bzReadGetUnused</computeroutput> will of
Chris@4 2145 course return no data.</para>
Chris@4 2146
Chris@4 2147 <para>This should give some feel for how the high-level interface
Chris@4 2148 can be used. If you require extra flexibility, you'll have to
Chris@4 2149 bite the bullet and get to grips with the low-level
Chris@4 2150 interface.</para>
Chris@4 2151
Chris@4 2152 </sect2>
Chris@4 2153
Chris@4 2154
Chris@4 2155 <sect2 id="std-rdwr" xreflabel="Standard file-reading/writing code">
Chris@4 2156 <title>Standard file-reading/writing code</title>
Chris@4 2157
Chris@4 2158 <para>Here's how you'd write data to a compressed file:</para>
Chris@4 2159
Chris@4 2160 <programlisting>
Chris@4 2161 FILE*   f;
Chris@4 2162 BZFILE* b;
Chris@4 2163 int     nBuf;
Chris@4 2164 char    buf[ /* whatever size you like */ ];
Chris@4 2165 int     bzerror;
Chris@4 2167
Chris@4 2168 f = fopen ( "myfile.bz2", "wb" );
Chris@4 2169 if ( !f ) {
Chris@4 2170   /* handle error */
Chris@4 2171 }
Chris@4 2172 b = BZ2_bzWriteOpen( &bzerror, f, 9, 0, 0 );  /* verbosity 0, default workFactor */
Chris@4 2173 if (bzerror != BZ_OK) {
Chris@4 2174   BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
Chris@4 2175   /* handle error */
Chris@4 2176 }
Chris@4 2177
Chris@4 2178 while ( /* condition */ ) {
Chris@4 2179   /* get data to write into buf, and set nBuf appropriately */
Chris@4 2180   BZ2_bzWrite ( &bzerror, b, buf, nBuf );  /* returns void; check bzerror */
Chris@4 2181   if (bzerror == BZ_IO_ERROR) {
Chris@4 2182     BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
Chris@4 2183     /* handle error */
Chris@4 2184   }
Chris@4 2185 }
Chris@4 2186
Chris@4 2187 BZ2_bzWriteClose( &bzerror, b, 0, NULL, NULL );
Chris@4 2188 if (bzerror == BZ_IO_ERROR) {
Chris@4 2189   /* handle error */
Chris@4 2190 }
Chris@4 2191 </programlisting>
Chris@4 2192
Chris@4 2193 <para>And to read from a compressed file:</para>
Chris@4 2194
Chris@4 2195 <programlisting>
Chris@4 2196 FILE*   f;
Chris@4 2197 BZFILE* b;
Chris@4 2198 int     nBuf;
Chris@4 2199 char    buf[ /* whatever size you like */ ];
Chris@4 2200 int     bzerror;
Chris@4 2202
Chris@4 2203 f = fopen ( "myfile.bz2", "rb" );
Chris@4 2204 if ( !f ) {
Chris@4 2205   /* handle error */
Chris@4 2206 }
Chris@4 2207 b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 );  /* verbosity, small, unused, nUnused */
Chris@4 2208 if ( bzerror != BZ_OK ) {
Chris@4 2209   BZ2_bzReadClose ( &bzerror, b );
Chris@4 2210   /* handle error */
Chris@4 2211 }
Chris@4 2212
Chris@4 2213 bzerror = BZ_OK;
Chris@4 2214 while ( bzerror == BZ_OK && /* arbitrary other conditions */) {
Chris@4 2215   nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
Chris@4 2216   if ( bzerror == BZ_OK || bzerror == BZ_STREAM_END ) {
Chris@4 2217     /* do something with buf[0 .. nBuf-1] */
Chris@4 2218   }
Chris@4 2219 }
Chris@4 2220 if ( bzerror != BZ_STREAM_END ) {
Chris@4 2221   BZ2_bzReadClose ( &bzerror, b );
Chris@4 2222   /* handle error */
Chris@4 2223 } else {
Chris@4 2224   BZ2_bzReadClose ( &bzerror, b );
Chris@4 2225 }
Chris@4 2226 </programlisting>
Chris@4 2227
Chris@4 2228 </sect2>
Chris@4 2229
Chris@4 2230 </sect1>
Chris@4 2231
Chris@4 2232
Chris@4 2233 <sect1 id="util-fns" xreflabel="Utility functions">
Chris@4 2234 <title>Utility functions</title>
Chris@4 2235
Chris@4 2236
Chris@4 2237 <sect2 id="bzbufftobuffcompress" xreflabel="BZ2_bzBuffToBuffCompress">
Chris@4 2238 <title>BZ2_bzBuffToBuffCompress</title>
Chris@4 2239
Chris@4 2240 <programlisting>
Chris@4 2241 int BZ2_bzBuffToBuffCompress( char* dest,
Chris@4 2242 unsigned int* destLen,
Chris@4 2243 char* source,
Chris@4 2244 unsigned int sourceLen,
Chris@4 2245 int blockSize100k,
Chris@4 2246 int verbosity,
Chris@4 2247 int workFactor );
Chris@4 2248 </programlisting>
Chris@4 2249
Chris@4 2250 <para>Attempts to compress the data in <computeroutput>source[0
Chris@4 2251 .. sourceLen-1]</computeroutput> into the destination buffer,
Chris@4 2252 <computeroutput>dest[0 .. *destLen-1]</computeroutput>. If the
Chris@4 2253 destination buffer is big enough,
Chris@4 2254 <computeroutput>*destLen</computeroutput> is set to the size of
Chris@4 2255 the compressed data, and <computeroutput>BZ_OK</computeroutput>
Chris@4 2256 is returned. If the compressed data won't fit,
Chris@4 2257 <computeroutput>*destLen</computeroutput> is unchanged, and
Chris@4 2258 <computeroutput>BZ_OUTBUFF_FULL</computeroutput> is
Chris@4 2259 returned.</para>
Chris@4 2260
Chris@4 2261 <para>Compression in this manner is a one-shot event, done with a
Chris@4 2262 single call to this function. The resulting compressed data is a
Chris@4 2263 complete <computeroutput>bzip2</computeroutput> format data
Chris@4 2264 stream. There is no mechanism for making additional calls to
Chris@4 2265 provide extra input data. If you want that kind of mechanism,
Chris@4 2266 use the low-level interface.</para>
Chris@4 2267
Chris@4 2268 <para>For the meaning of parameters
Chris@4 2269 <computeroutput>blockSize100k</computeroutput>,
Chris@4 2270 <computeroutput>verbosity</computeroutput> and
Chris@4 2271 <computeroutput>workFactor</computeroutput>, see
Chris@4 2272 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para>
Chris@4 2273
Chris@4 2274 <para>To guarantee that the compressed data will fit in its
Chris@4 2275 buffer, allocate an output buffer of size 1% larger than the
Chris@4 2276 uncompressed data, plus six hundred extra bytes.</para>
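<para>As a small worked example of that sizing rule, the following
sketch computes the suggested buffer size. The helper name
<computeroutput>bz_compress_bound</computeroutput> is hypothetical,
the 1% is rounded up to a whole byte, and the caller is assumed to
guard against overflow for very large inputs.</para>

```c
/* Destination size suggested by the text: the source length, plus 1%
   of it (rounded up), plus 600 bytes. Hypothetical helper, not a
   libbzip2 function. */
static unsigned int bz_compress_bound(unsigned int sourceLen)
{
    unsigned int onePercent = sourceLen / 100u + (sourceLen % 100u != 0u);
    return sourceLen + onePercent + 600u;
}
```

<para>For a 1000-byte input this gives 1000 + 10 + 600 = 1610 bytes,
which would be the initial value of
<computeroutput>*destLen</computeroutput>.</para>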
Chris@4 2277
Chris@4 2278 <para><computeroutput>BZ2_bzBuffToBuffCompress</computeroutput>
Chris@4 2279 will not write data at or beyond
Chris@4 2280 <computeroutput>dest[*destLen]</computeroutput>, even in case of
Chris@4 2281 buffer overflow.</para>
Chris@4 2282
Chris@4 2283 <para>Possible return values:</para>
Chris@4 2284
Chris@4 2285 <programlisting>
Chris@4 2286 BZ_CONFIG_ERROR
Chris@4 2287 if the library has been mis-compiled
Chris@4 2288 BZ_PARAM_ERROR
Chris@4 2289 if dest is NULL or destLen is NULL
Chris@4 2290 or blockSize100k < 1 or blockSize100k > 9
Chris@4 2291 or verbosity < 0 or verbosity > 4
Chris@4 2292 or workFactor < 0 or workFactor > 250
Chris@4 2293 BZ_MEM_ERROR
Chris@4 2294 if insufficient memory is available
Chris@4 2295 BZ_OUTBUFF_FULL
Chris@4 2296 if the size of the compressed data exceeds *destLen
Chris@4 2297 BZ_OK
Chris@4 2298 otherwise
Chris@4 2299 </programlisting>
Chris@4 2300
Chris@4 2301 </sect2>
Chris@4 2302
Chris@4 2303
Chris@4 2304 <sect2 id="bzbufftobuffdecompress" xreflabel="BZ2_bzBuffToBuffDecompress">
Chris@4 2305 <title>BZ2_bzBuffToBuffDecompress</title>
Chris@4 2306
Chris@4 2307 <programlisting>
Chris@4 2308 int BZ2_bzBuffToBuffDecompress( char* dest,
Chris@4 2309 unsigned int* destLen,
Chris@4 2310 char* source,
Chris@4 2311 unsigned int sourceLen,
Chris@4 2312 int small,
Chris@4 2313 int verbosity );
Chris@4 2314 </programlisting>
Chris@4 2315
Chris@4 2316 <para>Attempts to decompress the data in <computeroutput>source[0
Chris@4 2317 .. sourceLen-1]</computeroutput> into the destination buffer,
Chris@4 2318 <computeroutput>dest[0 .. *destLen-1]</computeroutput>. If the
Chris@4 2319 destination buffer is big enough,
Chris@4 2320 <computeroutput>*destLen</computeroutput> is set to the size of
Chris@4 2321 the uncompressed data, and <computeroutput>BZ_OK</computeroutput>
Chris@4 2322 is returned. If the uncompressed data won't fit,
Chris@4 2323 <computeroutput>*destLen</computeroutput> is unchanged, and
Chris@4 2324 <computeroutput>BZ_OUTBUFF_FULL</computeroutput> is
Chris@4 2325 returned.</para>
Chris@4 2326
Chris@4 2327 <para><computeroutput>source</computeroutput> is assumed to hold
Chris@4 2328 a complete <computeroutput>bzip2</computeroutput> format data
Chris@4 2329 stream.
Chris@4 2330 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> tries
Chris@4 2331 to decompress the entirety of the stream into the output
Chris@4 2332 buffer.</para>
Chris@4 2333
Chris@4 2334 <para>For the meaning of parameters
Chris@4 2335 <computeroutput>small</computeroutput> and
Chris@4 2336 <computeroutput>verbosity</computeroutput>, see
Chris@4 2337 <computeroutput>BZ2_bzDecompressInit</computeroutput>.</para>
Chris@4 2338
Chris@4 2339 <para>Because the compression ratio of the compressed data cannot
Chris@4 2340 be known in advance, there is no easy way to guarantee that the
Chris@4 2341 output buffer will be big enough. You may of course make
Chris@4 2342 arrangements in your code to record the size of the uncompressed
Chris@4 2343 data, but such a mechanism is beyond the scope of this
Chris@4 2344 library.</para>
Chris@4 2345
Chris@4 2346 <para><computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput>
Chris@4 2347 will not write data at or beyond
Chris@4 2348 <computeroutput>dest[*destLen]</computeroutput>, even in case of
Chris@4 2349 buffer overflow.</para>
Chris@4 2350
Chris@4 2351 <para>Possible return values:</para>
Chris@4 2352
Chris@4 2353 <programlisting>
Chris@4 2354 BZ_CONFIG_ERROR
Chris@4 2355 if the library has been mis-compiled
Chris@4 2356 BZ_PARAM_ERROR
Chris@4 2357 if dest is NULL or destLen is NULL
Chris@4 2358 or small != 0 && small != 1
Chris@4 2359 or verbosity < 0 or verbosity > 4
Chris@4 2360 BZ_MEM_ERROR
Chris@4 2361 if insufficient memory is available
Chris@4 2362 BZ_OUTBUFF_FULL
Chris@4 2363 if the size of the uncompressed data exceeds *destLen
Chris@4 2364 BZ_DATA_ERROR
Chris@4 2365 if a data integrity error was detected in the compressed data
Chris@4 2366 BZ_DATA_ERROR_MAGIC
Chris@4 2367 if the compressed data doesn't begin with the right magic bytes
Chris@4 2368 BZ_UNEXPECTED_EOF
Chris@4 2369 if the compressed data ends unexpectedly
Chris@4 2370 BZ_OK
Chris@4 2371 otherwise
Chris@4 2372 </programlisting>
Chris@4 2373
Chris@4 2374 </sect2>
Chris@4 2375
Chris@4 2376 </sect1>
Chris@4 2377
Chris@4 2378
Chris@4 2379 <sect1 id="zlib-compat" xreflabel="zlib compatibility functions">
Chris@4 2380 <title>zlib compatibility functions</title>
Chris@4 2381
Chris@4 2382 <para>Yoshioka Tsuneo has contributed some functions to give
Chris@4 2383 better <computeroutput>zlib</computeroutput> compatibility.
Chris@4 2384 These functions are <computeroutput>BZ2_bzopen</computeroutput>, <computeroutput>BZ2_bzdopen</computeroutput>,
Chris@4 2385 <computeroutput>BZ2_bzread</computeroutput>,
Chris@4 2386 <computeroutput>BZ2_bzwrite</computeroutput>,
Chris@4 2387 <computeroutput>BZ2_bzflush</computeroutput>,
Chris@4 2388 <computeroutput>BZ2_bzclose</computeroutput>,
Chris@4 2389 <computeroutput>BZ2_bzerror</computeroutput> and
Chris@4 2390 <computeroutput>BZ2_bzlibVersion</computeroutput>. These
Chris@4 2391 functions are not (yet) officially part of the library. If they
Chris@4 2392 break, you get to keep all the pieces. Nevertheless, I think
Chris@4 2393 they work ok.</para>
Chris@4 2394
Chris@4 2395 <programlisting>
Chris@4 2396 typedef void BZFILE;
Chris@4 2397
Chris@4 2398 const char * BZ2_bzlibVersion ( void );
Chris@4 2399 </programlisting>
Chris@4 2400
Chris@4 2401 <para>Returns a string indicating the library version.</para>
Chris@4 2402
Chris@4 2403 <programlisting>
Chris@4 2404 BZFILE * BZ2_bzopen ( const char *path, const char *mode );
Chris@4 2405 BZFILE * BZ2_bzdopen ( int fd, const char *mode );
Chris@4 2406 </programlisting>
Chris@4 2407
Chris@4 2408 <para>Opens a <computeroutput>.bz2</computeroutput> file for
Chris@4 2409 reading or writing, using either its name or a pre-existing file
Chris@4 2410 descriptor. Analogous to <computeroutput>fopen</computeroutput>
Chris@4 2411 and <computeroutput>fdopen</computeroutput>.</para>
Chris@4 2412
Chris@4 2413 <programlisting>
Chris@4 2414 int BZ2_bzread ( BZFILE* b, void* buf, int len );
Chris@4 2415 int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
Chris@4 2416 </programlisting>
Chris@4 2417
Chris@4 2418 <para>Reads/writes data from/to a previously opened
Chris@4 2419 <computeroutput>BZFILE</computeroutput>. Analogous to
Chris@4 2420 <computeroutput>fread</computeroutput> and
Chris@4 2421 <computeroutput>fwrite</computeroutput>.</para>
Chris@4 2422
Chris@4 2423 <programlisting>
Chris@4 2424 int BZ2_bzflush ( BZFILE* b );
Chris@4 2425 void BZ2_bzclose ( BZFILE* b );
Chris@4 2426 </programlisting>
Chris@4 2427
Chris@4 2428 <para>Flushes/closes a <computeroutput>BZFILE</computeroutput>.
Chris@4 2429 <computeroutput>BZ2_bzflush</computeroutput> doesn't actually do
Chris@4 2430 anything. Analogous to <computeroutput>fflush</computeroutput>
Chris@4 2431 and <computeroutput>fclose</computeroutput>.</para>
Chris@4 2432
Chris@4 2433 <programlisting>
Chris@4 2434 const char * BZ2_bzerror ( BZFILE *b, int *errnum );
Chris@4 2435 </programlisting>
Chris@4 2436
Chris@4 2437 <para>Returns a string describing the most recent error status of
Chris@4 2438 <computeroutput>b</computeroutput>, and also sets
Chris@4 2439 <computeroutput>*errnum</computeroutput> to its numerical
Chris@4 2440 value.</para>
Chris@4 2441
Chris@4 2442 </sect1>
Chris@4 2443
Chris@4 2444
Chris@4 2445 <sect1 id="stdio-free"
Chris@4 2446 xreflabel="Using the library in a stdio-free environment">
Chris@4 2447 <title>Using the library in a stdio-free environment</title>
Chris@4 2448
Chris@4 2449
Chris@4 2450 <sect2 id="stdio-bye" xreflabel="Getting rid of stdio">
Chris@4 2451 <title>Getting rid of stdio</title>
Chris@4 2452
Chris@4 2453 <para>In a deeply embedded application, you might want to use
Chris@4 2454 just the memory-to-memory functions. You can do this
Chris@4 2455 conveniently by compiling the library with preprocessor symbol
Chris@4 2456 <computeroutput>BZ_NO_STDIO</computeroutput> defined. Doing this
Chris@4 2457 gives you a library containing only the following eight
Chris@4 2458 functions:</para>
Chris@4 2459
Chris@4 2460 <para><computeroutput>BZ2_bzCompressInit</computeroutput>,
Chris@4 2461 <computeroutput>BZ2_bzCompress</computeroutput>,
Chris@4 2462 <computeroutput>BZ2_bzCompressEnd</computeroutput>,
Chris@4 2463 <computeroutput>BZ2_bzDecompressInit</computeroutput>,
Chris@4 2464 <computeroutput>BZ2_bzDecompress</computeroutput>,
Chris@4 2465 <computeroutput>BZ2_bzDecompressEnd</computeroutput>,
Chris@4 2466 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput>,
Chris@4 2467 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput></para>
Chris@4 2468
Chris@4 2469 <para>When compiled like this, all functions will ignore
Chris@4 2470 <computeroutput>verbosity</computeroutput> settings.</para>
Chris@4 2471
Chris@4 2472 </sect2>
Chris@4 2473
Chris@4 2474
Chris@4 2475 <sect2 id="critical-error" xreflabel="Critical error handling">
Chris@4 2476 <title>Critical error handling</title>
Chris@4 2477
Chris@4 2478 <para><computeroutput>libbzip2</computeroutput> contains a number
Chris@4 2479 of internal assertion checks which should, needless to say, never
Chris@4 2480 be activated. Nevertheless, if an assertion should fail,
Chris@4 2481 behaviour depends on whether or not the library was compiled with
Chris@4 2482 <computeroutput>BZ_NO_STDIO</computeroutput> set.</para>
Chris@4 2483
Chris@4 2484 <para>For a normal compile, an assertion failure yields the
Chris@4 2485 message:</para>
Chris@4 2486
Chris@4 2487 <blockquote>
Chris@4 2488 <para>bzip2/libbzip2: internal error number N.</para>
Chris@4 2489 <para>This is a bug in bzip2/libbzip2, &bz-version; of &bz-date;.
Chris@4 2490 Please report it to me at: &bz-email;. If this happened
Chris@4 2491 when you were using some program which uses libbzip2 as a
Chris@4 2492 component, you should also report this bug to the author(s)
Chris@4 2493 of that program. Please make an effort to report this bug;
Chris@4 2494 timely and accurate bug reports eventually lead to higher
Chris@4 2495 quality software. Thanks. Julian Seward, &bz-date;.
Chris@4 2496 </para></blockquote>
Chris@4 2497
Chris@4 2498 <para>where <computeroutput>N</computeroutput> is some error code
Chris@4 2499 number. If <computeroutput>N == 1007</computeroutput>, it also
Chris@4 2500 prints some extra text advising the reader that unreliable memory
Chris@4 2501 is often associated with internal error 1007. (This is a
Chris@4 2502 frequently observed phenomenon with versions 1.0.0/1.0.1).</para>
Chris@4 2503
Chris@4 2504 <para><computeroutput>exit(3)</computeroutput> is then
Chris@4 2505 called.</para>
Chris@4 2506
Chris@4 2507 <para>For a <computeroutput>stdio</computeroutput>-free library,
Chris@4 2508 assertion failures result in a call to a function declared
Chris@4 2509 as:</para>
Chris@4 2510
Chris@4 2511 <programlisting>
Chris@4 2512 extern void bz_internal_error ( int errcode );
Chris@4 2513 </programlisting>
Chris@4 2514
Chris@4 2515 <para>The relevant code is passed as a parameter. You should
Chris@4 2516 supply such a function.</para>
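<para>What such a function does is up to the application. As one
hedged possibility, the sketch below records the code and jumps back
to a recovery point chosen by the caller, so that control never
returns into the library. The names
<computeroutput>bz_panic_env</computeroutput> and
<computeroutput>bz_panic_code</computeroutput> are hypothetical, and
the sketch assumes <computeroutput>setjmp.h</computeroutput> is
available on the target.</para>

```c
#include <setjmp.h>

/* Hypothetical recovery point and record of the last failure. */
static jmp_buf bz_panic_env;
static volatile int bz_panic_code = 0;

/* Called by a library built with BZ_NO_STDIO when an internal
   assertion fails. */
void bz_internal_error(int errcode)
{
    bz_panic_code = errcode;
    longjmp(bz_panic_env, 1);   /* do not return into the library */
}
```

<para>The application would call
<computeroutput>setjmp( bz_panic_env )</computeroutput> before
entering the library, and treat a nonzero return as a critical
error.</para>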
Chris@4 2517
Chris@4 2518 <para>In either case, once an assertion failure has occurred, any
Chris@4 2519 <computeroutput>bz_stream</computeroutput> records involved can
Chris@4 2520 be regarded as invalid. You should not attempt to resume normal
Chris@4 2521 operation with them.</para>
Chris@4 2522
Chris@4 2523 <para>You may, of course, change critical error handling to suit
Chris@4 2524 your needs. As I said above, critical errors indicate bugs in
Chris@4 2525 the library and should not occur. All "normal" error situations
Chris@4 2526 are indicated via error return codes from functions, and can be
Chris@4 2527 recovered from.</para>
Chris@4 2528
Chris@4 2529 </sect2>
Chris@4 2530
Chris@4 2531 </sect1>
Chris@4 2532
Chris@4 2533
Chris@4 2534 <sect1 id="win-dll" xreflabel="Making a Windows DLL">
Chris@4 2535 <title>Making a Windows DLL</title>
Chris@4 2536
Chris@4 2537 <para>Everything related to Windows has been contributed by
Chris@4 2538 Yoshioka Tsuneo
Chris@4 2539 (<computeroutput>tsuneo@rr.iij4u.or.jp</computeroutput>), so
Chris@4 2540 you should send your queries to him (but perhaps Cc: me,
Chris@4 2541 <computeroutput>&bz-email;</computeroutput>).</para>
Chris@4 2542
Chris@4 2543 <para>My vague understanding of what to do is: using Visual C++
Chris@4 2544 5.0, open the project file
Chris@4 2545 <computeroutput>libbz2.dsp</computeroutput>, and build. That's
Chris@4 2546 all.</para>
Chris@4 2547
Chris@4 2548 <para>If you can't open the project file for some reason, make a
Chris@4 2549 new one, naming these files:
Chris@4 2550 <computeroutput>blocksort.c</computeroutput>,
Chris@4 2551 <computeroutput>bzlib.c</computeroutput>,
Chris@4 2552 <computeroutput>compress.c</computeroutput>,
Chris@4 2553 <computeroutput>crctable.c</computeroutput>,
Chris@4 2554 <computeroutput>decompress.c</computeroutput>,
Chris@4 2555 <computeroutput>huffman.c</computeroutput>,
Chris@4 2556 <computeroutput>randtable.c</computeroutput> and
Chris@4 2557 <computeroutput>libbz2.def</computeroutput>. You will also need
Chris@4 2558 to name the header files <computeroutput>bzlib.h</computeroutput>
Chris@4 2559 and <computeroutput>bzlib_private.h</computeroutput>.</para>
Chris@4 2560
Chris@4 2561 <para>If you don't use VC++, you may need to define the
Chris@4 2562 preprocessor symbol
Chris@4 2563 <computeroutput>_WIN32</computeroutput>.</para>
Chris@4 2564
Chris@4 2565 <para>Finally, <computeroutput>dlltest.c</computeroutput> is a
Chris@4 2566 sample program using the DLL. It has a project file,
Chris@4 2567 <computeroutput>dlltest.dsp</computeroutput>.</para>
Chris@4 2568
Chris@4 2569 <para>If you just want a makefile for Visual C, have a look at
Chris@4 2570 <computeroutput>makefile.msc</computeroutput>.</para>
Chris@4 2571
Chris@4 2572 <para>Be aware that if you compile
Chris@4 2573 <computeroutput>bzip2</computeroutput> itself on Win32, you must
Chris@4 2574 set <computeroutput>BZ_UNIX</computeroutput> to 0 and
Chris@4 2575 <computeroutput>BZ_LCCWIN32</computeroutput> to 1, in the file
Chris@4 2576 <computeroutput>bzip2.c</computeroutput>, before compiling.
Chris@4 2577 Otherwise the resulting binary won't work correctly.</para>
Chris@4 2578
Chris@4 2579 <para>I haven't tried any of this stuff myself, but it all looks
Chris@4 2580 plausible.</para>
Chris@4 2581
Chris@4 2582 </sect1>
Chris@4 2583
Chris@4 2584 </chapter>
Chris@4 2585
Chris@4 2586
Chris@4 2587
Chris@4 2588 <chapter id="misc" xreflabel="Miscellanea">
Chris@4 2589 <title>Miscellanea</title>
Chris@4 2590
Chris@4 2591 <para>These are just some random thoughts of mine. Your mileage
Chris@4 2592 may vary.</para>
Chris@4 2593
Chris@4 2594
Chris@4 2595 <sect1 id="limits" xreflabel="Limitations of the compressed file format">
Chris@4 2596 <title>Limitations of the compressed file format</title>
Chris@4 2597
Chris@4 2598 <para><computeroutput>bzip2-1.0.X</computeroutput>,
Chris@4 2599 <computeroutput>0.9.5</computeroutput> and
Chris@4 2600 <computeroutput>0.9.0</computeroutput> use exactly the same file
Chris@4 2601 format as the original version,
Chris@4 2602 <computeroutput>bzip2-0.1</computeroutput>. This decision was
Chris@4 2603 made in the interests of stability. Creating yet another
Chris@4 2604 incompatible compressed file format would create further
Chris@4 2605 confusion and disruption for users.</para>
Chris@4 2606
Chris@4 2607 <para>Nevertheless, this is not a painless decision. Development
Chris@4 2608 work since the release of
Chris@4 2609 <computeroutput>bzip2-0.1</computeroutput> in August 1997 has
Chris@4 2610 shown complexities in the file format which slow down
Chris@4 2611 decompression and, in retrospect, are unnecessary. These
Chris@4 2612 are:</para>
Chris@4 2613
Chris@4 2614 <itemizedlist mark='bullet'>
Chris@4 2615
Chris@4 2616 <listitem><para>The run-length encoder, which is the first of the
Chris@4 2617 compression transformations, is entirely irrelevant. The
Chris@4 2618 original purpose was to protect the sorting algorithm from the
Chris@4 2619 very worst case input: a string of repeated symbols. But
Chris@4 2620 algorithm steps Q6a and Q6b in the original Burrows-Wheeler
Chris@4 2621 technical report (SRC-124) show how repeats can be handled
Chris@4 2622 without difficulty in block sorting.</para></listitem>
Chris@4 2623
Chris@4 2624 <listitem><para>The randomisation mechanism doesn't really need to be
Chris@4 2625 there. Udi Manber and Gene Myers published a suffix array
Chris@4 2626 construction algorithm a few years back, which can be employed
Chris@4 2627 to sort any block, no matter how repetitive, in O(N log N)
Chris@4 2628 time. Subsequent work by Kunihiko Sadakane has produced a
Chris@4 2629 derivative O(N (log N)^2) algorithm which usually outperforms
Chris@4 2630 the Manber-Myers algorithm.</para>
Chris@4 2631
Chris@4 2632 <para>I could have changed to Sadakane's algorithm, but I find
Chris@4 2633 it to be slower than <computeroutput>bzip2</computeroutput>'s
Chris@4 2634 existing algorithm for most inputs, and the randomisation
Chris@4 2635 mechanism protects adequately against bad cases. I didn't
Chris@4 2636 think it was a good tradeoff to make. Partly this is due to
Chris@4 2637 the fact that I was not flooded with email complaints about
Chris@4 2638 <computeroutput>bzip2-0.1</computeroutput>'s performance on
Chris@4 2639 repetitive data, so perhaps it isn't a problem for real
Chris@4 2640 inputs.</para>
Chris@4 2641
Chris@4 2642 <para>Probably the best long-term solution, and the one I have
Chris@4 2643 incorporated into 0.9.5 and above, is to use the existing
Chris@4 2644 sorting algorithm initially, and fall back to an O(N (log N)^2)
Chris@4 2645 algorithm if the standard algorithm gets into
Chris@4 2646 difficulties.</para></listitem>
Chris@4 2647
Chris@4 2648 <listitem><para>The compressed file format was never designed to be
Chris@4 2649 handled by a library, and I have had to jump through some hoops
Chris@4 2650 to produce an efficient implementation of decompression. It's
Chris@4 2651 a bit hairy. Try passing
Chris@4 2652 <computeroutput>decompress.c</computeroutput> through the C
Chris@4 2653 preprocessor and you'll see what I mean. Much of this
Chris@4 2654 complexity could have been avoided if the compressed size of
Chris@4 2655 each block of data was recorded in the data stream.</para></listitem>
Chris@4 2656
Chris@4 2657 <listitem><para>An Adler-32 checksum, rather than a CRC32 checksum,
Chris@4 2658 would be faster to compute.</para></listitem>
Chris@4 2659
Chris@4 2660 </itemizedlist>
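
<para>As a rough illustration of why Adler-32 is cheap to compute,
the sketch below uses two running sums per input byte, reduced
modulo 65521, and no lookup table. It is not part of
<computeroutput>bzip2</computeroutput>, which uses CRC32. The name
<computeroutput>adler32_sketch</computeroutput> is hypothetical, and
the checksum definition is the one given for zlib in RFC 1950.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime below 2^16 */

/* Adler-32 of len bytes at data: two running sums modulo 65521. */
uint32_t adler32_sketch(const unsigned char *data, size_t len)
{
    uint32_t a = 1;  /* sum of the bytes, seeded with 1 */
    uint32_t b = 0;  /* sum of the successive values of a */
    size_t i;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    /* b occupies the high 16 bits of the result, a the low 16 bits. */
    return (b << 16) | a;
}
```
]]></programlisting>

<para>For the nine ASCII bytes of "Wikipedia" this returns
0x11E60398. Production implementations typically defer the modulo
reduction across many bytes; that refinement is omitted here.</para>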
Chris@4 2661
Chris@4 2662 <para>It would be fair to say that the
Chris@4 2663 <computeroutput>bzip2</computeroutput> format was frozen before I
Chris@4 2664 properly and fully understood the performance consequences of
Chris@4 2665 doing so.</para>
Chris@4 2666
Chris@4 2667 <para>Improvements which I was able to incorporate into 0.9.0,
Chris@4 2668 despite using the same file format, are:</para>
Chris@4 2669
Chris@4 2670 <itemizedlist mark='bullet'>
Chris@4 2671
Chris@4 2672 <listitem><para>Single array implementation of the inverse BWT. This
Chris@4 2673 significantly speeds up decompression, presumably because it
Chris@4 2674 reduces the number of cache misses.</para></listitem>
Chris@4 2675
Chris@4 2676 <listitem><para>Faster inverse MTF transform for large MTF values.
Chris@4 2677 The new implementation is based on the notion of sliding blocks
Chris@4 2678 of values.</para></listitem>
Chris@4 2679
Chris@4 2680 <listitem><para><computeroutput>bzip2-0.9.0</computeroutput> now reads
Chris@4 2681 and writes files with <computeroutput>fread</computeroutput>
Chris@4 2682 and <computeroutput>fwrite</computeroutput>; version 0.1 used
Chris@4 2683 <computeroutput>putc</computeroutput> and
Chris@4 2684 <computeroutput>getc</computeroutput>. Duh! Well, you live
Chris@4 2685 and learn.</para></listitem>
Chris@4 2686
Chris@4 2687 </itemizedlist>
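
<para>The difference between the two I/O styles in the last item
can be sketched as follows. Both functions are hypothetical
illustrations rather than code taken from
<computeroutput>bzip2</computeroutput>, and
<computeroutput>CHUNK</computeroutput> is an arbitrary buffer size
chosen for the example.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdio.h>

#define CHUNK 5000  /* arbitrary block size, an assumption of this sketch */

/* Copy in to out one block at a time with fread and fwrite.
   Returns 0 on success, -1 on an I/O error. */
int copy_by_blocks(FILE *in, FILE *out)
{
    unsigned char buf[CHUNK];
    size_t n;

    assert(in != NULL && out != NULL);
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* The same copy, one character per library call with getc and putc.
   Returns 0 on success, -1 on an I/O error. */
int copy_by_chars(FILE *in, FILE *out)
{
    int c;

    assert(in != NULL && out != NULL);
    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```
]]></programlisting>

<para>Both functions produce identical output. The block version
simply makes far fewer library calls for the same amount of
data.</para>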
Chris@4 2688
Chris@4 2689 <para>Further ahead, it would be nice to be able to do random
Chris@4 2690 access into files. This will require some careful design of
Chris@4 2691 compressed file formats.</para>
Chris@4 2692
Chris@4 2693 </sect1>
Chris@4 2694
Chris@4 2695
Chris@4 2696 <sect1 id="port-issues" xreflabel="Portability issues">
Chris@4 2697 <title>Portability issues</title>
Chris@4 2698
Chris@4 2699 <para>After some consideration, I have decided not to use GNU
Chris@4 2700 <computeroutput>autoconf</computeroutput> to configure 0.9.5 or
Chris@4 2701 1.0.</para>
Chris@4 2702
Chris@4 2703 <para><computeroutput>autoconf</computeroutput>, admirable and
Chris@4 2704 wonderful though it is, mainly assists with portability problems
Chris@4 2705 between Unix-like platforms. But
Chris@4 2706 <computeroutput>bzip2</computeroutput> doesn't have much in the
Chris@4 2707 way of portability problems on Unix; most of the difficulties
Chris@4 2708 appear when porting to the Mac, or to Microsoft's operating
Chris@4 2709 systems. <computeroutput>autoconf</computeroutput> doesn't help
Chris@4 2710 in those cases, and brings in a whole load of new
Chris@4 2711 complexity.</para>
Chris@4 2712
Chris@4 2713 <para>Most people should be able to compile the library and
Chris@4 2714 program under Unix straight out-of-the-box, so to speak,
Chris@4 2715 especially if a version of GNU C is available.</para>
Chris@4 2716
Chris@4 2717 <para>There are a couple of
Chris@4 2718 <computeroutput>__inline__</computeroutput> directives in the
Chris@4 2719 code. GNU C (<computeroutput>gcc</computeroutput>) should be
Chris@4 2720 able to handle them. If you're not using GNU C, your C compiler
Chris@4 2721 shouldn't see them at all. If your compiler does, for some
Chris@4 2722 reason, see them and doesn't like them, just
Chris@4 2723 <computeroutput>#define</computeroutput>
Chris@4 2724 <computeroutput>__inline__</computeroutput> to be
Chris@4 2725 <computeroutput>/* */</computeroutput>. One easy way to do this
Chris@4 2726 is to compile with the flag
Chris@4 2727 <computeroutput>-D__inline__=</computeroutput>, which should be
Chris@4 2728 understood by most Unix compilers.</para>
Chris@4 2729
Chris@4 2730 <para>If you still have difficulties, try compiling with the
Chris@4 2731 macro <computeroutput>BZ_STRICT_ANSI</computeroutput> defined.
Chris@4 2732 This should enable you to build the library in a strictly ANSI
Chris@4 2733 compliant environment. Building the program itself like this is
Chris@4 2734 dangerous and not supported, since you remove
Chris@4 2735 <computeroutput>bzip2</computeroutput>'s checks against
Chris@4 2736 compressing directories, symbolic links, devices, and other
Chris@4 2737 not-really-a-file entities. This could cause filesystem
Chris@4 2738 corruption!</para>
Chris@4 2739
Chris@4 2740 <para>One other thing: if you create a
Chris@4 2741 <computeroutput>bzip2</computeroutput> binary for public distribution,
Chris@4 2742 please consider linking it statically (<computeroutput>gcc
Chris@4 2743 -static</computeroutput>). This avoids all sorts of library-version
Chris@4 2744 issues that others may encounter later on.</para>
Chris@4 2745
Chris@4 2746 <para>If you build <computeroutput>bzip2</computeroutput> on
Chris@4 2747 Win32, you must set <computeroutput>BZ_UNIX</computeroutput> to 0
Chris@4 2748 and <computeroutput>BZ_LCCWIN32</computeroutput> to 1, in the
Chris@4 2749 file <computeroutput>bzip2.c</computeroutput>, before compiling.
Chris@4 2750 Otherwise the resulting binary won't work correctly.</para>
Chris@4 2751
Chris@4 2752 </sect1>
Chris@4 2753
Chris@4 2754
Chris@4 2755 <sect1 id="bugs" xreflabel="Reporting bugs">
Chris@4 2756 <title>Reporting bugs</title>
Chris@4 2757
Chris@4 2758 <para>I tried pretty hard to make sure
Chris@4 2759 <computeroutput>bzip2</computeroutput> is bug free, both by
Chris@4 2760 design and by testing. Hopefully you'll never need to read this
Chris@4 2761 section for real.</para>
Chris@4 2762
Chris@4 2763 <para>Nevertheless, if <computeroutput>bzip2</computeroutput> dies
Chris@4 2764 with a segmentation fault, a bus error or an internal assertion
Chris@4 2765 failure, it will ask you to email me a bug report. Experience from
Chris@4 2766 years of feedback from bzip2 users indicates that almost all these
Chris@4 2767 problems can be traced to either compiler bugs or hardware
Chris@4 2768 problems.</para>
Chris@4 2769
Chris@4 2770 <itemizedlist mark='bullet'>
Chris@4 2771
Chris@4 2772 <listitem><para>Recompile the program with no optimisation, and
Chris@4 2773 see if it works. And/or try a different compiler. I heard all
Chris@4 2774 sorts of stories about various flavours of GNU C (and other
Chris@4 2775 compilers) generating bad code for
Chris@4 2776 <computeroutput>bzip2</computeroutput>, and I've run across two
Chris@4 2777 such examples myself.</para>
Chris@4 2778
Chris@4 2779 <para>2.7.X versions of GNU C are known to generate bad code
Chris@4 2780 from time to time, at high optimisation levels. If you get
Chris@4 2781 problems, try using the flags
Chris@4 2782 <computeroutput>-O2</computeroutput>
Chris@4 2783 <computeroutput>-fomit-frame-pointer</computeroutput>
Chris@4 2784 <computeroutput>-fno-strength-reduce</computeroutput>. You
Chris@4 2785 should specifically <emphasis>not</emphasis> use
Chris@4 2786 <computeroutput>-funroll-loops</computeroutput>.</para>
Chris@4 2787
Chris@4 2788 <para>You may notice that the Makefile runs six tests as part
Chris@4 2789 of the build process. If the program passes all of these, it's
Chris@4 2790 a pretty good (but not 100%) indication that the compiler has
Chris@4 2791 done its job correctly.</para></listitem>
Chris@4 2792
Chris@4 2793 <listitem><para>If <computeroutput>bzip2</computeroutput>
Chris@4 2794 crashes randomly, and the crashes are not repeatable, you may
Chris@4 2795 have a flaky memory subsystem.
Chris@4 2796 <computeroutput>bzip2</computeroutput> really hammers your
Chris@4 2797 memory hierarchy, and if it's a bit marginal, you may get these
Chris@4 2798 problems. Ditto if your disk or I/O subsystem is slowly
Chris@4 2799 failing. Yup, this really does happen.</para>
Chris@4 2800
Chris@4 2801 <para>Try using a different machine of the same type, and see
Chris@4 2802 if you can repeat the problem.</para></listitem>
Chris@4 2803
Chris@4 2804 <listitem><para>This isn't really a bug, but ... If
Chris@4 2805 <computeroutput>bzip2</computeroutput> tells you your file is
Chris@4 2806 corrupted on decompression, and you obtained the file via FTP,
Chris@4 2807 there is a possibility that you forgot to tell FTP to do a
Chris@4 2808 binary mode transfer. That absolutely will cause the file to
Chris@4 2809 be non-decompressible. You'll have to transfer it
Chris@4 2810 again.</para></listitem>
Chris@4 2811
Chris@4 2812 </itemizedlist>
Chris@4 2813
Chris@4 2814 <para>If you've incorporated
Chris@4 2815 <computeroutput>libbzip2</computeroutput> into your own program
Chris@4 2816 and are getting problems, please, please, please, check that the
Chris@4 2817 parameters you are passing in calls to the library are correct,
Chris@4 2818 and in accordance with what the documentation says is allowable.
Chris@4 2819 I have tried to make the library robust against such problems,
Chris@4 2820 but I'm sure I haven't succeeded.</para>
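
<para>One way to make such a check routine is a small guard that
runs before the library call. The sketch below assumes the ranges
documented for
<computeroutput>BZ2_bzCompressInit</computeroutput> elsewhere in
this manual (block size 1 to 9, verbosity 0 to 4, work factor 0 to
250). The helper names are hypothetical and are not part of
<computeroutput>libbzip2</computeroutput>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>

/* Hypothetical helper, not part of libbzip2: returns 1 if the arguments
   lie within the ranges documented for BZ2_bzCompressInit, otherwise 0. */
int compress_params_ok(int blockSize100k, int verbosity, int workFactor)
{
    if (blockSize100k < 1 || blockSize100k > 9) return 0;
    if (verbosity < 0 || verbosity > 4) return 0;
    if (workFactor < 0 || workFactor > 250) return 0;
    return 1;
}

/* Debug-build guard intended to sit just before the real library call. */
void check_compress_params(int blockSize100k, int verbosity, int workFactor)
{
    assert(compress_params_ok(blockSize100k, verbosity, workFactor));
}
```
]]></programlisting>

<para>The library reports out-of-range values itself through
<computeroutput>BZ_PARAM_ERROR</computeroutput>; a guard like this
only makes the mistake visible closer to where it was made.</para>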
Chris@4 2821
Chris@4 2822 <para>Finally, if the above comments don't help, you'll have to
Chris@4 2823 send me a bug report. Now, it's just amazing how many people
Chris@4 2824 will send me a bug report saying something like:</para>
Chris@4 2825
Chris@4 2826 <programlisting>
Chris@4 2827 bzip2 crashed with segmentation fault on my machine
Chris@4 2828 </programlisting>
Chris@4 2829
Chris@4 2830 <para>and absolutely nothing else. Needless to say, such a
Chris@4 2831 report is <emphasis>totally, utterly, completely and
Chris@4 2832 comprehensively 100% useless; a waste of your time, my time, and
Chris@4 2833 net bandwidth</emphasis>. With no details at all, there's no way
Chris@4 2834 I can possibly begin to figure out what the problem is.</para>
Chris@4 2835
Chris@4 2836 <para>The rules of the game are: facts, facts, facts. Don't omit
Chris@4 2837 them because "oh, they won't be relevant". At the bare
Chris@4 2838 minimum:</para>
Chris@4 2839
Chris@4 2840 <programlisting>
Chris@4 2841 Machine type. Operating system version.
Chris@4 2842 Exact version of bzip2 (do bzip2 -V).
Chris@4 2843 Exact version of the compiler used.
Chris@4 2844 Flags passed to the compiler.
Chris@4 2845 </programlisting>
Chris@4 2846
Chris@4 2847 <para>However, the most important single thing that will help me
Chris@4 2848 is the file that you were trying to compress or decompress at the
Chris@4 2849 time the problem happened. Without that, my ability to do
Chris@4 2850 anything more than speculate about the cause is limited.</para>
Chris@4 2851
Chris@4 2852 </sect1>
Chris@4 2853
Chris@4 2854
Chris@4 2855 <sect1 id="package" xreflabel="Did you get the right package?">
Chris@4 2856 <title>Did you get the right package?</title>
Chris@4 2857
Chris@4 2858 <para><computeroutput>bzip2</computeroutput> is a resource hog.
Chris@4 2859 It soaks up large amounts of CPU cycles and memory. Also, it
Chris@4 2860 gives very large latencies. In the worst case, you can feed many
Chris@4 2861 megabytes of uncompressed data into the library before getting
Chris@4 2862 any compressed output, so this probably rules out applications
Chris@4 2863 requiring interactive behaviour.</para>
Chris@4 2864
Chris@4 2865 <para>These aren't faults of my implementation, I hope, but more
Chris@4 2866 an intrinsic property of the Burrows-Wheeler transform
Chris@4 2867 (unfortunately). Maybe this isn't what you want.</para>
Chris@4 2868
Chris@4 2869 <para>If you want a compressor and/or library which is faster,
Chris@4 2870 uses less memory but gets pretty good compression, and has
Chris@4 2871 minimal latency, consider Jean-loup Gailly's and Mark Adler's
Chris@4 2872 work, <computeroutput>zlib-1.2.1</computeroutput> and
Chris@4 2873 <computeroutput>gzip-1.2.4</computeroutput>. Look for them at
Chris@4 2874 <ulink url="http://www.zlib.org">http://www.zlib.org</ulink> and
Chris@4 2875 <ulink url="http://www.gzip.org">http://www.gzip.org</ulink>
Chris@4 2876 respectively.</para>
Chris@4 2877
Chris@4 2878 <para>For something faster and lighter still, you might try Markus F
Chris@4 2879 X J Oberhumer's <computeroutput>LZO</computeroutput> real-time
Chris@4 2880 compression/decompression library, at
Chris@4 2881 <ulink url="http://www.oberhumer.com/opensource">http://www.oberhumer.com/opensource</ulink>.</para>
Chris@4 2882
Chris@4 2883 </sect1>
Chris@4 2884
Chris@4 2885
Chris@4 2886
Chris@4 2887 <sect1 id="reading" xreflabel="Further Reading">
Chris@4 2888 <title>Further Reading</title>
Chris@4 2889
Chris@4 2890 <para><computeroutput>bzip2</computeroutput> is not research
Chris@4 2891 work, in the sense that it doesn't present any new ideas.
Chris@4 2892 Rather, it's an engineering exercise based on existing
Chris@4 2893 ideas.</para>
Chris@4 2894
Chris@4 2895 <para>Four documents describe essentially all the ideas behind
Chris@4 2896 <computeroutput>bzip2</computeroutput>:</para>
Chris@4 2897
Chris@4 2898 <literallayout>Michael Burrows and D. J. Wheeler:
Chris@4 2899 "A block-sorting lossless data compression algorithm"
Chris@4 2900 10th May 1994.
Chris@4 2901 Digital SRC Research Report 124.
Chris@4 2902 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
Chris@4 2903 If you have trouble finding it, try searching at the
Chris@4 2904 New Zealand Digital Library, http://www.nzdl.org.
Chris@4 2905
Chris@4 2906 Daniel S. Hirschberg and Debra A. Lelewer
Chris@4 2907 "Efficient Decoding of Prefix Codes"
Chris@4 2908 Communications of the ACM, April 1990, Vol 33, Number 4.
Chris@4 2909 You might be able to get an electronic copy of this
Chris@4 2910 from the ACM Digital Library.
Chris@4 2911
Chris@4 2912 David J. Wheeler
Chris@4 2913 Program bred3.c and accompanying document bred3.ps.
Chris@4 2914 This contains the idea behind the multi-table Huffman coding scheme.
Chris@4 2915 ftp://ftp.cl.cam.ac.uk/users/djw3/
Chris@4 2916
Chris@4 2917 Jon L. Bentley and Robert Sedgewick
Chris@4 2918 "Fast Algorithms for Sorting and Searching Strings"
Chris@4 2919 Available from Sedgewick's web page,
Chris@4 2920 www.cs.princeton.edu/~rs
Chris@4 2921 </literallayout>
Chris@4 2922
Chris@4 2923 <para>The following paper gives valuable additional insights into
Chris@4 2924 the algorithm, but is not immediately the basis of any code used
Chris@4 2925 in bzip2.</para>
Chris@4 2926
Chris@4 2927 <literallayout>Peter Fenwick:
Chris@4 2928 Block Sorting Text Compression
Chris@4 2929 Proceedings of the 19th Australasian Computer Science Conference,
Chris@4 2930 Melbourne, Australia. Jan 31 - Feb 2, 1996.
Chris@4 2931 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps</literallayout>
Chris@4 2932
Chris@4 2933 <para>Kunihiko Sadakane's sorting algorithm, mentioned above, is
Chris@4 2934 available from:</para>
Chris@4 2935
Chris@4 2936 <literallayout>http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
Chris@4 2937 </literallayout>
Chris@4 2938
Chris@4 2939 <para>The Manber-Myers suffix array construction algorithm is
Chris@4 2940 described in a paper available from:</para>
Chris@4 2941
Chris@4 2942 <literallayout>http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
Chris@4 2943 </literallayout>
Chris@4 2944
Chris@4 2945 <para>Finally, the following papers document some
Chris@4 2946 investigations I made into the performance of sorting
Chris@4 2947 and decompression algorithms:</para>
Chris@4 2948
Chris@4 2949 <literallayout>Julian Seward
Chris@4 2950 On the Performance of BWT Sorting Algorithms
Chris@4 2951 Proceedings of the IEEE Data Compression Conference 2000
Chris@4 2952 Snowbird, Utah. 28-30 March 2000.
Chris@4 2953
Chris@4 2954 Julian Seward
Chris@4 2955 Space-time Tradeoffs in the Inverse B-W Transform
Chris@4 2956 Proceedings of the IEEE Data Compression Conference 2001
Chris@4 2957 Snowbird, Utah. 27-29 March 2001.
Chris@4 2958 </literallayout>
Chris@4 2959
Chris@4 2960 </sect1>
Chris@4 2961
Chris@4 2962 </chapter>
Chris@4 2963
Chris@4 2964 </book>