Mercurial > hg > sv-dependency-builds
comparison src/bzip2-1.0.6/manual.xml @ 89:8a15ff55d9af
Add bzip2, zlib, liblo, portaudio sources
author | Chris Cannam <cannam@all-day-breakfast.com> |
---|---|
date | Wed, 20 Mar 2013 13:59:52 +0000 |
88:fe7c3a0b0259 | 89:8a15ff55d9af |
---|---|
1 <?xml version="1.0"?> <!-- -*- sgml -*- --> | |
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" | |
3 "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"[ | |
4 | |
5 <!-- various strings, dates etc. common to all docs --> | |
6 <!ENTITY % common-ents SYSTEM "entities.xml"> %common-ents; | |
7 ]> | |
8 | |
9 <book lang="en" id="userman" xreflabel="bzip2 Manual"> | |
10 | |
11 <bookinfo> | |
12 <title>bzip2 and libbzip2, version 1.0.6</title> | |
13 <subtitle>A program and library for data compression</subtitle> | |
14 <copyright> | |
15 <year>&bz-lifespan;</year> | |
16 <holder>Julian Seward</holder> | |
17 </copyright> | |
18 <releaseinfo>Version &bz-version; of &bz-date;</releaseinfo> | |
19 | |
20 <authorgroup> | |
21 <author> | |
22 <firstname>Julian</firstname> | |
23 <surname>Seward</surname> | |
24 <affiliation> | |
25 <orgname>&bz-url;</orgname> | |
26 </affiliation> | |
27 </author> | |
28 </authorgroup> | |
29 | |
30 <legalnotice> | |
31 | |
32 <para>This program, <computeroutput>bzip2</computeroutput>, the | |
33 associated library <computeroutput>libbzip2</computeroutput>, and | |
34 all documentation, are copyright © &bz-lifespan; Julian Seward. | |
35 All rights reserved.</para> | |
36 | |
37 <para>Redistribution and use in source and binary forms, with | |
38 or without modification, are permitted provided that the | |
39 following conditions are met:</para> | |
40 | |
41 <itemizedlist mark='bullet'> | |
42 | |
43 <listitem><para>Redistributions of source code must retain the | |
44 above copyright notice, this list of conditions and the | |
45 following disclaimer.</para></listitem> | |
46 | |
47 <listitem><para>The origin of this software must not be | |
48 misrepresented; you must not claim that you wrote the original | |
49 software. If you use this software in a product, an | |
50 acknowledgment in the product documentation would be | |
51 appreciated but is not required.</para></listitem> | |
52 | |
53 <listitem><para>Altered source versions must be plainly marked | |
54 as such, and must not be misrepresented as being the original | |
55 software.</para></listitem> | |
56 | |
57 <listitem><para>The name of the author may not be used to | |
58 endorse or promote products derived from this software without | |
59 specific prior written permission.</para></listitem> | |
60 | |
61 </itemizedlist> | |
62 | |
63 <para>THIS SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY | |
64 EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, | |
65 THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A | |
66 PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE | |
67 AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, | |
68 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED | |
69 TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, | |
70 DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND | |
71 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT | |
72 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING | |
73 IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF | |
74 THE POSSIBILITY OF SUCH DAMAGE.</para> | |
75 | |
76 <para>PATENTS: To the best of my knowledge, | |
77 <computeroutput>bzip2</computeroutput> and | |
78 <computeroutput>libbzip2</computeroutput> do not use any patented | |
79 algorithms. However, I do not have the resources to carry | |
80 out a patent search. Therefore I cannot give any guarantee of | |
81 the above statement. | |
82 </para> | |
83 | |
84 </legalnotice> | |
85 | |
86 </bookinfo> | |
87 | |
88 | |
89 | |
90 <chapter id="intro" xreflabel="Introduction"> | |
91 <title>Introduction</title> | |
92 | |
93 <para><computeroutput>bzip2</computeroutput> compresses files | |
94 using the Burrows-Wheeler block-sorting text compression | |
95 algorithm, and Huffman coding. Compression is generally | |
96 considerably better than that achieved by more conventional | |
97 LZ77/LZ78-based compressors, and approaches the performance of | |
98 the PPM family of statistical compressors.</para> | |
99 | |
100 <para><computeroutput>bzip2</computeroutput> is built on top of | |
101 <computeroutput>libbzip2</computeroutput>, a flexible library for | |
102 handling compressed data in the | |
103 <computeroutput>bzip2</computeroutput> format. This manual | |
104 describes both how to use the program and how to work with the | |
105 library interface. Most of the manual is devoted to this | |
106 library, not the program, which is good news if your interest is | |
107 only in the program.</para> | |
108 | |
109 <itemizedlist mark='bullet'> | |
110 | |
111 <listitem><para><xref linkend="using"/> describes how to use | |
112 <computeroutput>bzip2</computeroutput>; this is the only part | |
113 you need to read if you just want to know how to operate the | |
114 program.</para></listitem> | |
115 | |
116 <listitem><para><xref linkend="libprog"/> describes the | |
117 programming interfaces in detail, and</para></listitem> | |
118 | |
119 <listitem><para><xref linkend="misc"/> records some | |
120 miscellaneous notes which I thought ought to be recorded | |
121 somewhere.</para></listitem> | |
122 | |
123 </itemizedlist> | |
124 | |
125 </chapter> | |
126 | |
127 | |
128 <chapter id="using" xreflabel="How to use bzip2"> | |
129 <title>How to use bzip2</title> | |
130 | |
131 <para>This chapter contains a copy of the | |
132 <computeroutput>bzip2</computeroutput> man page, and nothing | |
133 else.</para> | |
134 | |
135 <sect1 id="name" xreflabel="NAME"> | |
136 <title>NAME</title> | |
137 | |
138 <itemizedlist mark='bullet'> | |
139 | |
140 <listitem><para><computeroutput>bzip2</computeroutput>, | |
141 <computeroutput>bunzip2</computeroutput> - a block-sorting file | |
142 compressor, v1.0.6</para></listitem> | |
143 | |
144 <listitem><para><computeroutput>bzcat</computeroutput> - | |
145 decompresses files to stdout</para></listitem> | |
146 | |
147 <listitem><para><computeroutput>bzip2recover</computeroutput> - | |
148 recovers data from damaged bzip2 files</para></listitem> | |
149 | |
150 </itemizedlist> | |
151 | |
152 </sect1> | |
153 | |
154 | |
155 <sect1 id="synopsis" xreflabel="SYNOPSIS"> | |
156 <title>SYNOPSIS</title> | |
157 | |
158 <itemizedlist mark='bullet'> | |
159 | |
160 <listitem><para><computeroutput>bzip2</computeroutput> [ | |
161 -cdfkqstvzVL123456789 ] [ filenames ... ]</para></listitem> | |
162 | |
163 <listitem><para><computeroutput>bunzip2</computeroutput> [ | |
164 -fkvsVL ] [ filenames ... ]</para></listitem> | |
165 | |
166 <listitem><para><computeroutput>bzcat</computeroutput> [ -s ] [ | |
167 filenames ... ]</para></listitem> | |
168 | |
169 <listitem><para><computeroutput>bzip2recover</computeroutput> | |
170 filename</para></listitem> | |
171 | |
172 </itemizedlist> | |
173 | |
174 </sect1> | |
175 | |
176 | |
177 <sect1 id="description" xreflabel="DESCRIPTION"> | |
178 <title>DESCRIPTION</title> | |
179 | |
180 <para><computeroutput>bzip2</computeroutput> compresses files | |
181 using the Burrows-Wheeler block-sorting text compression | |
182 algorithm, and Huffman coding. Compression is generally | |
183 considerably better than that achieved by more conventional | |
184 LZ77/LZ78-based compressors, and approaches the performance of | |
185 the PPM family of statistical compressors.</para> | |
186 | |
187 <para>The command-line options are deliberately very similar to | |
188 those of GNU <computeroutput>gzip</computeroutput>, but they are | |
189 not identical.</para> | |
190 | |
191 <para><computeroutput>bzip2</computeroutput> expects a list of | |
192 file names to accompany the command-line flags. Each file is | |
193 replaced by a compressed version of itself, with the name | |
194 <computeroutput>original_name.bz2</computeroutput>. Each | |
195 compressed file has the same modification date, permissions, and, | |
196 when possible, ownership as the corresponding original, so that | |
197 these properties can be correctly restored at decompression time. | |
198 File name handling is naive in the sense that there is no | |
199 mechanism for preserving original file names, permissions, | |
200 ownerships or dates in filesystems which lack these concepts, or | |
201 have serious file name length restrictions, such as | |
202 MS-DOS.</para> | |
203 | |
204 <para><computeroutput>bzip2</computeroutput> and | |
205 <computeroutput>bunzip2</computeroutput> will by default not | |
206 overwrite existing files. If you want this to happen, specify | |
207 the <computeroutput>-f</computeroutput> flag.</para> | |
208 | |
209 <para>If no file names are specified, | |
210 <computeroutput>bzip2</computeroutput> compresses from standard | |
211 input to standard output. In this case, | |
212 <computeroutput>bzip2</computeroutput> will decline to write | |
213 compressed output to a terminal, as this would be entirely | |
214 incomprehensible and therefore pointless.</para> | |
215 | |
216 <para><computeroutput>bunzip2</computeroutput> (or | |
217 <computeroutput>bzip2 -d</computeroutput>) decompresses all | |
218 specified files. Files which were not created by | |
219 <computeroutput>bzip2</computeroutput> will be detected and | |
220 ignored, and a warning issued. | |
221 <computeroutput>bzip2</computeroutput> attempts to guess the | |
222 filename for the decompressed file from that of the compressed | |
223 file as follows:</para> | |
224 | |
225 <itemizedlist mark='bullet'> | |
226 | |
227 <listitem><para><computeroutput>filename.bz2 </computeroutput> | |
228 becomes | |
229 <computeroutput>filename</computeroutput></para></listitem> | |
230 | |
231 <listitem><para><computeroutput>filename.bz </computeroutput> | |
232 becomes | |
233 <computeroutput>filename</computeroutput></para></listitem> | |
234 | |
235 <listitem><para><computeroutput>filename.tbz2</computeroutput> | |
236 becomes | |
237 <computeroutput>filename.tar</computeroutput></para></listitem> | |
238 | |
239 <listitem><para><computeroutput>filename.tbz </computeroutput> | |
240 becomes | |
241 <computeroutput>filename.tar</computeroutput></para></listitem> | |
242 | |
243 <listitem><para><computeroutput>anyothername </computeroutput> | |
244 becomes | |
245 <computeroutput>anyothername.out</computeroutput></para></listitem> | |
246 | |
247 </itemizedlist> | |
248 | |
249 <para>If the file does not end in one of the recognised endings, | |
250 <computeroutput>.bz2</computeroutput>, | |
251 <computeroutput>.bz</computeroutput>, | |
252 <computeroutput>.tbz2</computeroutput> or | |
253 <computeroutput>.tbz</computeroutput>, | |
254 <computeroutput>bzip2</computeroutput> complains that it cannot | |
255 guess the name of the original file, and uses the original name | |
256 with <computeroutput>.out</computeroutput> appended.</para> | |
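<para>The naming rules above can be summarised in a few lines of code. The following is a minimal Python sketch of the suffix mapping as described in this section, not the logic from the <computeroutput>bzip2</computeroutput> sources; the names <computeroutput>SUFFIX_MAP</computeroutput> and <computeroutput>guess_output_name</computeroutput> are hypothetical.</para>

```python
# Illustrative sketch of the suffix mapping described in the text.
# SUFFIX_MAP and guess_output_name are hypothetical names, not part of bzip2.

SUFFIX_MAP = [
    (".bz2", ""),       # filename.bz2  -> filename
    (".bz", ""),        # filename.bz   -> filename
    (".tbz2", ".tar"),  # filename.tbz2 -> filename.tar
    (".tbz", ".tar"),   # filename.tbz  -> filename.tar
]


def guess_output_name(name: str) -> str:
    """Return the decompressed file name for a compressed file name."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: keep the original name and append ".out".
    return name + ".out"


for example in ("filename.bz2", "filename.tbz", "anyothername"):
    print(example, "->", guess_output_name(example))
```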
257 | |
258 <para>As with compression, supplying no filenames causes | |
259 decompression from standard input to standard output.</para> | |
260 | |
261 <para><computeroutput>bunzip2</computeroutput> will correctly | |
262 decompress a file which is the concatenation of two or more | |
263 compressed files. The result is the concatenation of the | |
264 corresponding uncompressed files. Integrity testing | |
265 (<computeroutput>-t</computeroutput>) of concatenated compressed | |
266 files is also supported.</para> | |
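<para>The behaviour for concatenated files can be tried without the command-line tools. The sketch below assumes Python 3.3 or later, whose standard <computeroutput>bz2</computeroutput> module is a binding to <computeroutput>libbzip2</computeroutput> and accepts multi-stream input; it illustrates the property and is not the <computeroutput>bunzip2</computeroutput> implementation.</para>

```python
import bz2

# Two independently compressed streams, joined byte for byte,
# as if two .bz2 files had been concatenated.
first = bz2.compress(b"hello, ")
second = bz2.compress(b"world\n")
concatenated = first + second

# Decompressing the joined data yields the concatenation of the
# corresponding uncompressed inputs.
restored = bz2.decompress(concatenated)
print(restored)
```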
267 | |
268 <para>You can also compress or decompress files to the standard | |
269 output by giving the <computeroutput>-c</computeroutput> flag. | |
270 Multiple files may be compressed and decompressed like this. The | |
271 resulting outputs are fed sequentially to stdout. Compression of | |
272 multiple files in this manner generates a stream containing | |
273 multiple compressed file representations. Such a stream can be | |
274 decompressed correctly only by | |
275 <computeroutput>bzip2</computeroutput> version 0.9.0 or later. | |
276 Earlier versions of <computeroutput>bzip2</computeroutput> will | |
277 stop after decompressing the first file in the stream.</para> | |
278 | |
279 <para><computeroutput>bzcat</computeroutput> (or | |
280 <computeroutput>bzip2 -dc</computeroutput>) decompresses all | |
281 specified files to the standard output.</para> | |
282 | |
283 <para><computeroutput>bzip2</computeroutput> will read arguments | |
284 from the environment variables | |
285 <computeroutput>BZIP2</computeroutput> and | |
286 <computeroutput>BZIP</computeroutput>, in that order, and will | |
287 process them before any arguments read from the command line. | |
288 This gives a convenient way to supply default arguments.</para> | |
289 | |
290 <para>Compression is always performed, even if the compressed | |
291 file is slightly larger than the original. Files of less than | |
292 about one hundred bytes tend to get larger, since the compression | |
293 mechanism has a constant overhead in the region of 50 bytes. | |
294 Random data (including the output of most file compressors) is | |
295 coded at about 8.05 bits per byte, giving an expansion of around | |
296 0.5%.</para> | |
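<para>The overhead described above is easy to observe. The following sketch assumes Python's standard <computeroutput>bz2</computeroutput> module; <computeroutput>expansion_ratio</computeroutput> is a hypothetical helper, and the exact sizes printed will vary with the input and library version.</para>

```python
import bz2
import os


def expansion_ratio(data: bytes) -> float:
    """Return compressed size divided by original size."""
    return len(bz2.compress(data)) / len(data)


tiny = b"hello world"               # far below one hundred bytes
random_block = os.urandom(100_000)  # effectively incompressible

print("tiny input:", len(tiny), "bytes ->", len(bz2.compress(tiny)), "bytes")
print(f"random data ratio: {expansion_ratio(random_block):.4f}")
```

<para>A ratio slightly above 1 for the random input corresponds to the small expansion mentioned in the text.</para>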
297 | |
298 <para>As a self-check for your protection, | |
299 <computeroutput>bzip2</computeroutput> uses 32-bit CRCs to make | |
300 sure that the decompressed version of a file is identical to the | |
301 original. This guards against corruption of the compressed data, | |
302 and against undetected bugs in | |
303 <computeroutput>bzip2</computeroutput> (hopefully very unlikely). | |
304 The chance of data corruption going undetected is microscopic, | |
305 about one chance in four billion for each file processed. Be | |
306 aware, though, that the check occurs upon decompression, so it | |
307 can only tell you that something is wrong. It can't help you | |
308 recover the original uncompressed data. You can use | |
309 <computeroutput>bzip2recover</computeroutput> to try to recover | |
310 data from damaged files.</para> | |
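<para>As a rough illustration of this self-check, the sketch below flips a single bit in a compressed stream and shows that a trial decompression reports the damage. It assumes Python's standard <computeroutput>bz2</computeroutput> module; <computeroutput>is_intact</computeroutput> is a hypothetical helper, and the exception types caught are those raised by that module rather than anything defined in this manual.</para>

```python
import bz2

original = b"".join(b"line %d of some sample text\n" % i for i in range(2000))
good = bz2.compress(original)

# Simulate a media error: flip one bit in the middle of the stream.
damaged = bytearray(good)
damaged[len(damaged) // 2] ^= 0x01


def is_intact(blob: bytes) -> bool:
    """Trial decompression: True if the stream decodes without error."""
    try:
        bz2.decompress(blob)
    except (OSError, ValueError):
        # OSError: libbzip2 reported invalid data or a CRC mismatch.
        # ValueError: the data ended before the end-of-stream marker.
        return False
    return True


print("undamaged stream intact:", is_intact(good))
print("damaged stream intact:  ", is_intact(bytes(damaged)))
```

<para>As the text notes, the check only reports that something is wrong; it does not recover the original data.</para>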
311 | |
312 <para>Return values: 0 for a normal exit, 1 for environmental | |
313 problems (file not found, invalid flags, I/O errors, etc.), 2 | |
314 to indicate a corrupt compressed file, 3 for an internal | |
315 consistency error (e.g., a bug) which caused | |
316 <computeroutput>bzip2</computeroutput> to panic.</para> | |
317 | |
318 </sect1> | |
319 | |
320 | |
321 <sect1 id="options" xreflabel="OPTIONS"> | |
322 <title>OPTIONS</title> | |
323 | |
324 <variablelist> | |
325 | |
326 <varlistentry> | |
327 <term><computeroutput>-c --stdout</computeroutput></term> | |
328 <listitem><para>Compress or decompress to standard | |
329 output.</para></listitem> | |
330 </varlistentry> | |
331 | |
332 <varlistentry> | |
333 <term><computeroutput>-d --decompress</computeroutput></term> | |
334 <listitem><para>Force decompression. | |
335 <computeroutput>bzip2</computeroutput>, | |
336 <computeroutput>bunzip2</computeroutput> and | |
337 <computeroutput>bzcat</computeroutput> are really the same | |
338 program, and the decision about what actions to take is done on | |
339 the basis of which name is used. This flag overrides that | |
340 mechanism, and forces bzip2 to decompress.</para></listitem> | |
341 </varlistentry> | |
342 | |
343 <varlistentry> | |
344 <term><computeroutput>-z --compress</computeroutput></term> | |
345 <listitem><para>The complement to | |
346 <computeroutput>-d</computeroutput>: forces compression, | |
347 regardless of the invocation name.</para></listitem> | |
348 </varlistentry> | |
349 | |
350 <varlistentry> | |
351 <term><computeroutput>-t --test</computeroutput></term> | |
352 <listitem><para>Check integrity of the specified file(s), but | |
353 don't decompress them. This really performs a trial | |
354 decompression and throws away the result.</para></listitem> | |
355 </varlistentry> | |
356 | |
357 <varlistentry> | |
358 <term><computeroutput>-f --force</computeroutput></term> | |
359 <listitem><para>Force overwrite of output files. Normally, | |
360 <computeroutput>bzip2</computeroutput> will not overwrite | |
361 existing output files. Also forces | |
362 <computeroutput>bzip2</computeroutput> to break hard links to | |
363 files, which it otherwise wouldn't do.</para> | |
364 <para><computeroutput>bzip2</computeroutput> normally declines | |
365 to decompress files which don't have the correct magic header | |
366 bytes. If forced (<computeroutput>-f</computeroutput>), | |
367 however, it will pass such files through unmodified. This is | |
368 how GNU <computeroutput>gzip</computeroutput> behaves.</para> | |
369 </listitem> | |
370 </varlistentry> | |
371 | |
372 <varlistentry> | |
373 <term><computeroutput>-k --keep</computeroutput></term> | |
374 <listitem><para>Keep (don't delete) input files during | |
375 compression or decompression.</para></listitem> | |
376 </varlistentry> | |
377 | |
378 <varlistentry> | |
379 <term><computeroutput>-s --small</computeroutput></term> | |
380 <listitem><para>Reduce memory usage, for compression, | |
381 decompression and testing. Files are decompressed and tested | |
382 using a modified algorithm which only requires 2.5 bytes per | |
383 block byte. This means any file can be decompressed in 2300k | |
384 of memory, albeit at about half the normal speed.</para> | |
385 <para>During compression, <computeroutput>-s</computeroutput> | |
386 selects a block size of 200k, which limits memory use to around | |
387 the same figure, at the expense of your compression ratio. In | |
388 short, if your machine is low on memory (8 megabytes or less), | |
389 use <computeroutput>-s</computeroutput> for everything. See | |
390 <xref linkend="memory-management"/> below.</para></listitem> | |
391 </varlistentry> | |
392 | |
393 <varlistentry> | |
394 <term><computeroutput>-q --quiet</computeroutput></term> | |
395 <listitem><para>Suppress non-essential warning messages. | |
396 Messages pertaining to I/O errors and other critical events | |
397 will not be suppressed.</para></listitem> | |
398 </varlistentry> | |
399 | |
400 <varlistentry> | |
401 <term><computeroutput>-v --verbose</computeroutput></term> | |
402 <listitem><para>Verbose mode -- show the compression ratio for | |
403 each file processed. Further | |
404 <computeroutput>-v</computeroutput>'s increase the verbosity | |
405 level, spewing out lots of information which is primarily of | |
406 interest for diagnostic purposes.</para></listitem> | |
407 </varlistentry> | |
408 | |
409 <varlistentry> | |
410 <term><computeroutput>-L --license -V --version</computeroutput></term> | |
411 <listitem><para>Display the software version, license terms and | |
412 conditions.</para></listitem> | |
413 </varlistentry> | |
414 | |
415 <varlistentry> | |
416 <term><computeroutput>-1</computeroutput> (or | |
417 <computeroutput>--fast</computeroutput>) to | |
418 <computeroutput>-9</computeroutput> (or | |
419 <computeroutput>--best</computeroutput>)</term> | |
420 <listitem><para>Set the block size to 100 k, 200 k ... 900 k | |
421 when compressing. Has no effect when decompressing. See <xref | |
422 linkend="memory-management" /> below. The | |
423 <computeroutput>--fast</computeroutput> and | |
424 <computeroutput>--best</computeroutput> aliases are primarily | |
425 for GNU <computeroutput>gzip</computeroutput> compatibility. | |
426 In particular, <computeroutput>--fast</computeroutput> doesn't | |
427 make things significantly faster. And | |
428 <computeroutput>--best</computeroutput> merely selects the | |
429 default behaviour.</para></listitem> | |
430 </varlistentry> | |
431 | |
432 <varlistentry> | |
433 <term><computeroutput>--</computeroutput></term> | |
434 <listitem><para>Treats all subsequent arguments as file names, | |
435 even if they start with a dash. This is so you can handle | |
436 files with names beginning with a dash, for example: | |
437 <computeroutput>bzip2 -- | |
438 -myfilename</computeroutput>.</para></listitem> | |
439 </varlistentry> | |
440 | |
441 <varlistentry> | |
442 <term><computeroutput>--repetitive-fast</computeroutput></term> | |
443 <term><computeroutput>--repetitive-best</computeroutput></term> | |
444 <listitem><para>These flags are redundant in versions 0.9.5 and | |
445 above. They provided some coarse control over the behaviour of | |
446 the sorting algorithm in earlier versions, which was sometimes | |
447 useful. 0.9.5 and above have an improved algorithm which | |
448 renders these flags irrelevant.</para></listitem> | |
449 </varlistentry> | |
450 | |
451 </variablelist> | |
452 | |
453 </sect1> | |
454 | |
455 | |
456 <sect1 id="memory-management" xreflabel="MEMORY MANAGEMENT"> | |
457 <title>MEMORY MANAGEMENT</title> | |
458 | |
459 <para><computeroutput>bzip2</computeroutput> compresses large | |
460 files in blocks. The block size affects both the compression | |
461 ratio achieved, and the amount of memory needed for compression | |
462 and decompression. The flags <computeroutput>-1</computeroutput> | |
463 through <computeroutput>-9</computeroutput> specify the block | |
464 size to be 100,000 bytes through 900,000 bytes (the default) | |
465 respectively. At decompression time, the block size used for | |
466 compression is read from the header of the compressed file, and | |
467 <computeroutput>bunzip2</computeroutput> then allocates itself | |
468 just enough memory to decompress the file. Since block sizes are | |
469 stored in compressed files, it follows that the flags | |
470 <computeroutput>-1</computeroutput> to | |
471 <computeroutput>-9</computeroutput> are irrelevant to and so | |
472 ignored during decompression.</para> | |
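<para>The block size recorded in a compressed file can be read back directly. The sketch below relies on a detail of the file format that this section does not spell out, namely that a stream begins with the bytes <computeroutput>BZh</computeroutput> followed by a digit from 1 to 9 giving the block size in units of 100,000 bytes. The helper <computeroutput>block_size_from_header</computeroutput> is hypothetical, and Python's standard <computeroutput>bz2</computeroutput> module is assumed.</para>

```python
import bz2


def block_size_from_header(blob: bytes) -> int:
    """Return the block size in bytes recorded in a bzip2 stream header."""
    digit = blob[3:4].decode("ascii", errors="replace")
    if blob[:3] != b"BZh" or digit not in set("123456789"):
        raise ValueError("not a bzip2 stream")
    return int(digit) * 100_000


# The compression level chosen here plays the role of the flags -1 to -9.
for level in (1, 5, 9):
    blob = bz2.compress(b"example data", compresslevel=level)
    print(f"-{level}: block size {block_size_from_header(blob)} bytes")
```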
473 | |
474 <para>Compression and decompression requirements, in bytes, can be | |
475 estimated as:</para> | |
476 <programlisting> | |
477 Compression: 400k + ( 8 x block size ) | |
478 | |
479 Decompression: 100k + ( 4 x block size ), or | |
480 100k + ( 2.5 x block size ) | |
481 </programlisting> | |
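<para>The estimates above can be turned into a small calculator. This is a sketch of the arithmetic only, taking "k" to mean 1000 bytes as in the block sizes quoted earlier; the function names are hypothetical and nothing here measures real memory use.</para>

```python
def block_size_k(level: int) -> int:
    """Block size, in units of 1000 bytes, for the flags -1 to -9."""
    if level not in range(1, 10):
        raise ValueError("level must be between 1 and 9")
    return 100 * level


def compress_memory_k(level: int) -> int:
    """Compression estimate: 400k + (8 x block size)."""
    return 400 + 8 * block_size_k(level)


def decompress_memory_k(level: int, small: bool = False) -> float:
    """Decompression estimate: 100k + (4 x block size), or 2.5 x with -s."""
    factor = 2.5 if small else 4
    return 100 + factor * block_size_k(level)


for level in range(1, 10):
    print(f"-{level}  compress {compress_memory_k(level)}k  "
          f"decompress {decompress_memory_k(level):g}k  "
          f"decompress -s {decompress_memory_k(level, small=True):g}k")
```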
482 | |
483 <para>Larger block sizes give rapidly diminishing marginal | |
484 returns. Most of the compression comes from the first two or | |
485 three hundred k of block size, a fact worth bearing in mind when | |
486 using <computeroutput>bzip2</computeroutput> on small machines. | |
487 It is also important to appreciate that the decompression memory | |
488 requirement is set at compression time by the choice of block | |
489 size.</para> | |
490 | |
491 <para>For files compressed with the default 900k block size, | |
492 <computeroutput>bunzip2</computeroutput> will require about 3700 | |
493 kbytes to decompress. To support decompression of any file on a | |
494 4 megabyte machine, <computeroutput>bunzip2</computeroutput> has | |
495 an option to decompress using approximately half this amount of | |
496 memory, about 2300 kbytes. Decompression speed is also halved, | |
497 so you should use this option only where necessary. The relevant | |
498 flag is <computeroutput>-s</computeroutput>.</para> | |
499 | |
500 <para>In general, try to use the largest block size memory | |
501 constraints allow, since that maximises the compression achieved. | |
502 Compression and decompression speed are virtually unaffected by | |
503 block size.</para> | |
504 | |
505 <para>Another significant point applies to files which fit in a | |
506 single block -- that means most files you'd encounter using a | |
507 large block size. The amount of real memory touched is | |
508 proportional to the size of the file, since the file is smaller | |
509 than a block. For example, compressing a file 20,000 bytes long | |
510 with the flag <computeroutput>-9</computeroutput> will cause the | |
511 compressor to allocate around 7600k of memory, but only touch | |
512 400k + 20000 * 8 = 560 kbytes of it. Similarly, the decompressor | |
513 will allocate 3700k but only touch 100k + 20000 * 4 = 180 | |
514 kbytes.</para> | |
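<para>The worked example above can be reproduced as follows. The sketch assumes that the memory touched follows the same formulas as the estimates, with the file size standing in for the block size whenever the file is smaller than one block; <computeroutput>touched_kbytes</computeroutput> is a hypothetical name.</para>

```python
def touched_kbytes(file_bytes: int, level: int = 9):
    """Estimate (compress, decompress) memory touched, in kbytes.

    The block size is capped at the file size, since a file shorter
    than one block fills only part of the allocated block.
    """
    block_bytes = 100_000 * level
    used = min(file_bytes, block_bytes)
    compress = 400 + 8 * used / 1000
    decompress = 100 + 4 * used / 1000
    return compress, decompress


# The 20,000-byte file from the text, compressed with -9.
print(touched_kbytes(20_000))
```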
515 | |
516 <para>Here is a table which summarises the maximum memory usage | |
517 for different block sizes. Also recorded is the total compressed | |
518 size for 14 files of the Calgary Text Compression Corpus | |
519 totalling 3,141,622 bytes. This column gives some feel for how | |
520 compression varies with block size. These figures tend to | |
521 understate the advantage of larger block sizes for larger files, | |
522 since the Corpus is dominated by smaller files.</para> | |
523 | |
524 <programlisting> | |
525 Compress Decompress Decompress Corpus | |
526 Flag usage usage -s usage Size | |
527 | |
528 -1 1200k 500k 350k 914704 | |
529 -2 2000k 900k 600k 877703 | |
530 -3 2800k 1300k 850k 860338 | |
531 -4 3600k 1700k 1100k 846899 | |
532 -5 4400k 2100k 1350k 845160 | |
533 -6 5200k 2500k 1600k 838626 | |
534 -7 6000k 2900k 1850k 834096 | |
535 -8 6800k 3300k 2100k 828642 | |
536 -9 7600k 3700k 2350k 828642 | |
537 </programlisting> | |
538 | |
539 </sect1> | |
540 | |
541 | |
542 <sect1 id="recovering" xreflabel="RECOVERING DATA FROM DAMAGED FILES"> | |
543 <title>RECOVERING DATA FROM DAMAGED FILES</title> | |
544 | |
545 <para><computeroutput>bzip2</computeroutput> compresses files in | |
546 blocks, usually 900kbytes long. Each block is handled | |
547 independently. If a media or transmission error causes a | |
548 multi-block <computeroutput>.bz2</computeroutput> file to become | |
549 damaged, it may be possible to recover data from the undamaged | |
550 blocks in the file.</para> | |
551 | |
552 <para>The compressed representation of each block is delimited by | |
553 a 48-bit pattern, which makes it possible to find the block | |
554 boundaries with reasonable certainty. Each block also carries | |
555 its own 32-bit CRC, so damaged blocks can be distinguished from | |
556 undamaged ones.</para> | |
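<para>To make the idea of a delimiting pattern concrete, the sketch below scans a compressed stream bit by bit for a 48-bit value. The constant used, <computeroutput>0x314159265359</computeroutput>, is the block header value of the bzip2 file format and is not stated in this section; the function name is hypothetical, Python's standard <computeroutput>bz2</computeroutput> module is assumed, and this is an illustration rather than the search performed by any bzip2 tool.</para>

```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359  # assumed 48-bit block header pattern
MASK_48 = (1 << 48) - 1


def find_block_starts(blob: bytes):
    """Return the bit offsets at which the 48-bit pattern begins."""
    starts = []
    window = 0
    position = 0  # number of bits consumed so far
    for byte in blob:
        for shift in range(7, -1, -1):  # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            position += 1
            if position >= 48 and window == BLOCK_MAGIC:
                starts.append(position - 48)
    return starts


# 150,000 pseudo-random bytes with a 100k block size need two blocks.
rng = random.Random(0)
data = rng.getrandbits(8 * 150_000).to_bytes(150_000, "big")
blob = bz2.compress(data, compresslevel=1)

starts = find_block_starts(blob)
print(len(starts), "blocks found; first pattern at bit", starts[0])
```

<para>Blocks are not byte-aligned in general, which is why the scan works on individual bits rather than bytes.</para>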
557 | |
558 <para><computeroutput>bzip2recover</computeroutput> is a simple | |
559 program whose purpose is to search for blocks in | |
560 <computeroutput>.bz2</computeroutput> files, and write each block | |
561 out into its own <computeroutput>.bz2</computeroutput> file. You | |
562 can then use <computeroutput>bzip2 -t</computeroutput> to test | |
563 the integrity of the resulting files, and decompress those which | |
564 are undamaged.</para> | |
565 | |
566 <para><computeroutput>bzip2recover</computeroutput> takes a | |
567 single argument, the name of the damaged file, and writes a | |
568 number of files <computeroutput>rec0001file.bz2</computeroutput>, | |
569 <computeroutput>rec0002file.bz2</computeroutput>, etc, containing | |
570 the extracted blocks. The output filenames are designed so that | |
571 the use of wildcards in subsequent processing -- for example, | |
572 <computeroutput>bzip2 -dc rec*file.bz2 > | |
573 recovered_data</computeroutput> -- lists the files in the correct | |
574 order.</para> | |
575 | |
576 <para><computeroutput>bzip2recover</computeroutput> should be of | |
577 most use dealing with large <computeroutput>.bz2</computeroutput> | |
578 files, as these will contain many blocks. It is clearly futile | |
579 to use it on damaged single-block files, since a damaged block | |
580 cannot be recovered. If you wish to minimise any potential data | |
581 loss through media or transmission errors, you might consider | |
582 compressing with a smaller block size.</para> | |
583 | |
584 </sect1> | |
585 | |
586 | |
587 <sect1 id="performance" xreflabel="PERFORMANCE NOTES"> | |
588 <title>PERFORMANCE NOTES</title> | |
589 | |
590 <para>The sorting phase of compression gathers together similar | |
591 strings in the file. Because of this, files containing very long | |
592 runs of repeated symbols, like "aabaabaabaab ..." (repeated | |
593 several hundred times) may compress more slowly than normal. | |
594 Versions 0.9.5 and above fare much better than previous versions | |
595 in this respect. The ratio between worst-case and average-case | |
596 compression time is in the region of 10:1. For previous | |
597 versions, this figure was more like 100:1. You can use the | |
598 <computeroutput>-vvvv</computeroutput> option to monitor progress | |
599 in great detail, if you want.</para> | |
600 | |
601 <para>Decompression speed is unaffected by these | |
602 phenomena.</para> | |
603 | |
604 <para><computeroutput>bzip2</computeroutput> usually allocates | |
605 several megabytes of memory to operate in, and then charges all | |
606 over it in a fairly random fashion. This means that performance, | |
607 both for compressing and decompressing, is largely determined by | |
608 the speed at which your machine can service cache misses. | |
609 Because of this, small changes to the code to reduce the miss | |
610 rate have been observed to give disproportionately large | |
611 performance improvements. I imagine | |
612 <computeroutput>bzip2</computeroutput> will perform best on | |
613 machines with very large caches.</para> | |
614 | |
615 </sect1> | |
616 | |
617 | |
618 | |
619 <sect1 id="caveats" xreflabel="CAVEATS"> | |
620 <title>CAVEATS</title> | |
621 | |
622 <para>I/O error messages are not as helpful as they could be. | |
623 <computeroutput>bzip2</computeroutput> tries hard to detect I/O | |
624 errors and exit cleanly, but the details of what the problem is | |
625 sometimes seem rather misleading.</para> | |
626 | |
627 <para>This manual page pertains to version &bz-version; of | |
628 <computeroutput>bzip2</computeroutput>. Compressed data created by | |
629 this version is entirely forwards and backwards compatible with the | |
630 previous public releases, versions 0.1pl2, 0.9.0, 0.9.5, 1.0.0, | |
631 1.0.1, 1.0.2 and 1.0.3, but with the following exception: 0.9.0 and | |
632 above can correctly decompress multiple concatenated compressed files. | |
633 0.1pl2 cannot do this; it will stop after decompressing just the first | |
634 file in the stream.</para> | |
635 | |
636 <para><computeroutput>bzip2recover</computeroutput> versions | |
637 prior to 1.0.2 used 32-bit integers to represent bit positions in | |
638 compressed files, so they could not handle compressed files more | |
639 than 512 megabytes long. Versions 1.0.2 and above use 64-bit ints | |
640 on some platforms which support them (GNU supported targets, and | |
641 Windows). To establish whether or not | |
642 <computeroutput>bzip2recover</computeroutput> was built with such | |
643 a limitation, run it without arguments. In any event you can | |
644 build yourself an unlimited version if you can recompile it with | |
645 <computeroutput>MaybeUInt64</computeroutput> set to be an | |
646 unsigned 64-bit integer.</para> | |
647 | |
648 </sect1> | |
649 | |
650 | |
651 | |
652 <sect1 id="author" xreflabel="AUTHOR"> | |
653 <title>AUTHOR</title> | |
654 | |
655 <para>Julian Seward, | |
656 <computeroutput>&bz-email;</computeroutput></para> | |
657 | |
658 <para>The ideas embodied in | |
659 <computeroutput>bzip2</computeroutput> are due to (at least) the | |
660 following people: Michael Burrows and David Wheeler (for the | |
661 block sorting transformation), David Wheeler (again, for the | |
662 Huffman coder), Peter Fenwick (for the structured coding model in | |
663 the original <computeroutput>bzip</computeroutput>, and many | |
664 refinements), and Alistair Moffat, Radford Neal and Ian Witten | |
665 (for the arithmetic coder in the original | |
666 <computeroutput>bzip</computeroutput>). I am much indebted for | |
667 their help, support and advice. See the manual in the source | |
668 distribution for pointers to sources of documentation. Christian | |
669 von Roques encouraged me to look for faster sorting algorithms, | |
670 so as to speed up compression. Bela Lubkin encouraged me to | |
671 improve the worst-case compression performance. | |
672 Donna Robinson XMLised the documentation. | |
673 Many people sent | |
674 patches, helped with portability problems, lent machines, gave | |
675 advice and were generally helpful.</para> | |
676 | |
677 </sect1> | |
678 | |
679 </chapter> | |
680 | |
681 | |
682 | |
683 <chapter id="libprog" xreflabel="Programming with libbzip2"> | |
684 <title> | |
685 Programming with <computeroutput>libbzip2</computeroutput> | |
686 </title> | |
687 | |
688 <para>This chapter describes the programming interface to | |
689 <computeroutput>libbzip2</computeroutput>.</para> | |
690 | |
691 <para>For general background information, particularly about | |
692 memory use and performance aspects, you'd be well advised to read | |
693 <xref linkend="using"/> as well.</para> | |
694 | |
695 | |
696 <sect1 id="top-level" xreflabel="Top-level structure"> | |
697 <title>Top-level structure</title> | |
698 | |
699 <para><computeroutput>libbzip2</computeroutput> is a flexible | |
700 library for compressing and decompressing data in the | |
701 <computeroutput>bzip2</computeroutput> data format. Although | |
702 packaged as a single entity, it helps to regard the library as | |
703 three separate parts: the low-level interface, the high-level | |
704 interface, and some utility functions.</para> | |
705 | |
706 <para>The structure of | |
707 <computeroutput>libbzip2</computeroutput>'s interfaces is similar | |
708 to that of Jean-loup Gailly's and Mark Adler's excellent | |
709 <computeroutput>zlib</computeroutput> library.</para> | |
710 | |
711 <para>All externally visible symbols have names beginning | |
712 <computeroutput>BZ2_</computeroutput>. This is new in version | |
713 1.0. The intention is to minimise pollution of the namespaces of | |
714 library clients.</para> | |
715 | |
716 <para>To use any part of the library, you need to | |
717 <computeroutput>#include <bzlib.h></computeroutput> | |
718 into your sources.</para> | |
719 | |
720 | |
721 | |
722 <sect2 id="ll-summary" xreflabel="Low-level summary"> | |
723 <title>Low-level summary</title> | |
724 | |
725 <para>This interface provides services for compressing and | |
726 decompressing data in memory. There's no provision for dealing | |
727 with files, streams or any other I/O mechanisms, just straight | |
728 memory-to-memory work. In fact, this part of the library can be | |
729 compiled without inclusion of | |
730 <computeroutput>stdio.h</computeroutput>, which may be helpful | |
731 for embedded applications.</para> | |
732 | |
733 <para>The low-level part of the library has no global variables | |
734 and is therefore thread-safe.</para> | |
735 | |
736 <para>Six routines make up the low level interface: | |
737 <computeroutput>BZ2_bzCompressInit</computeroutput>, | |
738 <computeroutput>BZ2_bzCompress</computeroutput>, and | |
739 <computeroutput>BZ2_bzCompressEnd</computeroutput> for | |
740 compression, and a corresponding trio | |
741 <computeroutput>BZ2_bzDecompressInit</computeroutput>, | |
742 <computeroutput>BZ2_bzDecompress</computeroutput> and | |
743 <computeroutput>BZ2_bzDecompressEnd</computeroutput> for | |
744 decompression. The <computeroutput>*Init</computeroutput> | |
745 functions allocate memory for compression/decompression and do | |
746 other initialisations, whilst the | |
747 <computeroutput>*End</computeroutput> functions close down | |
748 operations and release memory.</para> | |
749 | |
750 <para>The real work is done by | |
751 <computeroutput>BZ2_bzCompress</computeroutput> and | |
752 <computeroutput>BZ2_bzDecompress</computeroutput>. These | |
753 compress and decompress data from a user-supplied input buffer to | |
754 a user-supplied output buffer. These buffers can be any size; | |
755 arbitrary quantities of data are handled by making repeated calls | |
756 to these functions. This is a flexible mechanism allowing a | |
757 consumer-pull style of activity, or producer-push, or a mixture | |
758 of both.</para> | |
759 | |
760 </sect2> | |
761 | |
762 | |
763 <sect2 id="hl-summary" xreflabel="High-level summary"> | |
764 <title>High-level summary</title> | |
765 | |
766 <para>This interface provides some handy wrappers around the | |
767 low-level interface to facilitate reading and writing | |
768 <computeroutput>bzip2</computeroutput> format files | |
769 (<computeroutput>.bz2</computeroutput> files). The routines | |
770 provide hooks to facilitate reading files in which the | |
771 <computeroutput>bzip2</computeroutput> data stream is embedded | |
772 within some larger-scale file structure, or where there are | |
773 multiple <computeroutput>bzip2</computeroutput> data streams | |
774 concatenated end-to-end.</para> | |
775 | |
776 <para>For reading files, | |
777 <computeroutput>BZ2_bzReadOpen</computeroutput>, | |
778 <computeroutput>BZ2_bzRead</computeroutput>, | |
779 <computeroutput>BZ2_bzReadClose</computeroutput> and | |
780 <computeroutput>BZ2_bzReadGetUnused</computeroutput> are | |
781 supplied. For writing files, | |
782 <computeroutput>BZ2_bzWriteOpen</computeroutput>, | |
783 <computeroutput>BZ2_bzWrite</computeroutput> and | |
784 <computeroutput>BZ2_bzWriteClose</computeroutput> are | |
785 available.</para> | |
786 | |
787 <para>As with the low-level library, no global variables are used | |
788 so the library is per se thread-safe. However, if I/O errors | |
789 occur whilst reading or writing the underlying compressed files, | |
790 you may have to consult <computeroutput>errno</computeroutput> to | |
791 determine the cause of the error. In that case, you'd need a C | |
792 library which correctly supports | |
793 <computeroutput>errno</computeroutput> in a multithreaded | |
794 environment.</para> | |
795 | |
796 <para>To make the library a little simpler and more portable, | |
797 <computeroutput>BZ2_bzReadOpen</computeroutput> and | |
798 <computeroutput>BZ2_bzWriteOpen</computeroutput> require you to | |
799 pass them file handles (<computeroutput>FILE*</computeroutput>s) | |
800 which have previously been opened for reading or writing | |
801 respectively. That avoids portability problems associated with | |
802 file operations and file attributes, whilst not being much of an | |
803 imposition on the programmer.</para> | |
804 | |
805 </sect2> | |
806 | |
807 | |
808 <sect2 id="util-fns-summary" xreflabel="Utility functions summary"> | |
809 <title>Utility functions summary</title> | |
810 | |
811 <para>For very simple needs, | |
812 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and | |
813 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> are | |
814 provided. These compress data in memory from one buffer to | |
815 another buffer in a single function call. You should assess | |
816 whether these functions fulfill your memory-to-memory | |
817 compression/decompression requirements before investing effort in | |
818 understanding the more general but more complex low-level | |
819 interface.</para> | |
820 | |
821 <para>Yoshioka Tsuneo | |
822 (<computeroutput>tsuneo@rr.iij4u.or.jp</computeroutput>) has | |
823 contributed some functions to give better | |
824 <computeroutput>zlib</computeroutput> compatibility. These | |
825 functions are <computeroutput>BZ2_bzopen</computeroutput>, | |
826 <computeroutput>BZ2_bzread</computeroutput>, | |
827 <computeroutput>BZ2_bzwrite</computeroutput>, | |
828 <computeroutput>BZ2_bzflush</computeroutput>, | |
829 <computeroutput>BZ2_bzclose</computeroutput>, | |
830 <computeroutput>BZ2_bzerror</computeroutput> and | |
831 <computeroutput>BZ2_bzlibVersion</computeroutput>. You may find | |
832 these functions more convenient for simple file reading and | |
833 writing, than those in the high-level interface. These functions | |
834 are not (yet) officially part of the library, and are minimally | |
835 documented here. If they break, you get to keep all the pieces. | |
836 I hope to document them properly when time permits.</para> | |
837 | |
838 <para>Yoshioka also contributed modifications to allow the | |
839 library to be built as a Windows DLL.</para> | |
840 | |
841 </sect2> | |
842 | |
843 </sect1> | |
844 | |
845 | |
846 <sect1 id="err-handling" xreflabel="Error handling"> | |
847 <title>Error handling</title> | |
848 | |
849 <para>The library is designed to recover cleanly in all | |
850 situations, including the worst-case situation of decompressing | |
851 random data. I'm not 100% sure that it can always do this, so | |
852 you might want to add a signal handler to catch segmentation | |
853 violations during decompression if you are feeling especially | |
854 paranoid. I would be interested in hearing more about the | |
855 robustness of the library to corrupted compressed data.</para> | |
856 | |
857 <para>Version 1.0.3 is more robust in this respect than any | |
858 previous version. Investigations with Valgrind (a tool for detecting | |
859 problems with memory management) indicate | |
860 that, at least for the few files I tested, all single-bit errors | |
861 in the compressed data are caught properly, with no | |
862 segmentation faults, no uses of uninitialised data, no out of | |
863 range reads or writes, and no infinite looping in the decompressor. | |
864 So it's certainly pretty robust, although | |
865 I wouldn't claim it to be totally bombproof.</para> | |
866 | |
867 <para>The file <computeroutput>bzlib.h</computeroutput> contains | |
868 all definitions needed to use the library. In particular, you | |
869 should definitely not include | |
870 <computeroutput>bzlib_private.h</computeroutput>.</para> | |
871 | |
872 <para>In <computeroutput>bzlib.h</computeroutput>, the various | |
873 return values are defined. The following list is not intended as | |
874 an exhaustive description of the circumstances in which a given | |
875 value may be returned -- those descriptions are given later. | |
876 Rather, it is intended to convey the rough meaning of each return | |
877 value. The first five values are normal and not intended to | |
878 denote an error situation.</para> | |
879 | |
880 <variablelist> | |
881 | |
882 <varlistentry> | |
883 <term><computeroutput>BZ_OK</computeroutput></term> | |
884 <listitem><para>The requested action was completed | |
885 successfully.</para></listitem> | |
886 </varlistentry> | |
887 | |
888 <varlistentry> | |
889 <term><computeroutput>BZ_RUN_OK, BZ_FLUSH_OK, | |
890 BZ_FINISH_OK</computeroutput></term> | |
891 <listitem><para>In | |
892 <computeroutput>BZ2_bzCompress</computeroutput>, the requested | |
893 flush/finish/nothing-special action was completed | |
894 successfully.</para></listitem> | |
895 </varlistentry> | |
896 | |
897 <varlistentry> | |
898 <term><computeroutput>BZ_STREAM_END</computeroutput></term> | |
899 <listitem><para>Compression of data was completed, or the | |
900 logical stream end was detected during | |
901 decompression.</para></listitem> | |
902 </varlistentry> | |
903 | |
904 </variablelist> | |
905 | |
906 <para>The following return values indicate an error of some | |
907 kind.</para> | |
908 | |
909 <variablelist> | |
910 | |
911 <varlistentry> | |
912 <term><computeroutput>BZ_CONFIG_ERROR</computeroutput></term> | |
913 <listitem><para>Indicates that the library has been improperly | |
914 compiled on your platform -- a major configuration error. | |
915 Specifically, it means that | |
916 <computeroutput>sizeof(char)</computeroutput>, | |
917 <computeroutput>sizeof(short)</computeroutput> and | |
918 <computeroutput>sizeof(int)</computeroutput> are not 1, 2 and | |
919 4 respectively, as they should be. Note that the library | |
920 should still work properly on 64-bit platforms which follow | |
921 the LP64 programming model -- that is, where | |
922 <computeroutput>sizeof(long)</computeroutput> and | |
923 <computeroutput>sizeof(void*)</computeroutput> are 8. Under | |
924 LP64, <computeroutput>sizeof(int)</computeroutput> is still 4, | |
925 so <computeroutput>libbzip2</computeroutput>, which doesn't | |
926 use the <computeroutput>long</computeroutput> type, is | |
927 OK.</para></listitem> | |
928 </varlistentry> | |
929 | |
930 <varlistentry> | |
931 <term><computeroutput>BZ_SEQUENCE_ERROR</computeroutput></term> | |
932 <listitem><para>When using the library, it is important to call | |
933 the functions in the correct sequence and with data structures | |
934 (buffers etc) in the correct states. | |
935 <computeroutput>libbzip2</computeroutput> checks as much as it | |
936 can to ensure this is happening, and returns | |
937 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput> if not. | |
938 Code which complies precisely with the function semantics, as | |
939 detailed below, should never receive this value; such an event | |
940 denotes buggy code which you should | |
941 investigate.</para></listitem> | |
942 </varlistentry> | |
943 | |
944 <varlistentry> | |
945 <term><computeroutput>BZ_PARAM_ERROR</computeroutput></term> | |
946 <listitem><para>Returned when a parameter to a function call is | |
947 out of range or otherwise manifestly incorrect. As with | |
948 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>, this | |
949 denotes a bug in the client code. The distinction between | |
950 <computeroutput>BZ_PARAM_ERROR</computeroutput> and | |
951 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput> is a bit | |
952 hazy, but still worth making.</para></listitem> | |
953 </varlistentry> | |
954 | |
955 <varlistentry> | |
956 <term><computeroutput>BZ_MEM_ERROR</computeroutput></term> | |
957 <listitem><para>Returned when a request to allocate memory | |
958 failed. Note that the quantity of memory needed to decompress | |
959 a stream cannot be determined until the stream's header has | |
960 been read. So | |
961 <computeroutput>BZ2_bzDecompress</computeroutput> and | |
962 <computeroutput>BZ2_bzRead</computeroutput> may return | |
963 <computeroutput>BZ_MEM_ERROR</computeroutput> even though some | |
964 of the compressed data has been read. The same is not true | |
965 for compression; once | |
966 <computeroutput>BZ2_bzCompressInit</computeroutput> or | |
967 <computeroutput>BZ2_bzWriteOpen</computeroutput> has | |
968 successfully completed, | |
969 <computeroutput>BZ_MEM_ERROR</computeroutput> cannot | |
970 occur.</para></listitem> | |
971 </varlistentry> | |
972 | |
973 <varlistentry> | |
974 <term><computeroutput>BZ_DATA_ERROR</computeroutput></term> | |
975 <listitem><para>Returned when a data integrity error is | |
976 detected during decompression. Most importantly, this means | |
977 when stored and computed CRCs for the data do not match. This | |
978 value is also returned upon detection of any other anomaly in | |
979 the compressed data.</para></listitem> | |
980 </varlistentry> | |
981 | |
982 <varlistentry> | |
983 <term><computeroutput>BZ_DATA_ERROR_MAGIC</computeroutput></term> | |
984 <listitem><para>As a special case of | |
985 <computeroutput>BZ_DATA_ERROR</computeroutput>, it is | |
986 sometimes useful to know when the compressed stream does not | |
987 start with the correct magic bytes (<computeroutput>'B' 'Z' | |
988 'h'</computeroutput>).</para></listitem> | |
989 </varlistentry> | |
990 | |
991 <varlistentry> | |
992 <term><computeroutput>BZ_IO_ERROR</computeroutput></term> | |
993 <listitem><para>Returned by | |
994 <computeroutput>BZ2_bzRead</computeroutput> and | |
995 <computeroutput>BZ2_bzWrite</computeroutput> when there is an | |
996 error reading or writing in the compressed file, and by | |
997 <computeroutput>BZ2_bzReadOpen</computeroutput> and | |
998 <computeroutput>BZ2_bzWriteOpen</computeroutput> for attempts | |
999 to use a file for which the error indicator (viz, | |
1000 <computeroutput>ferror(f)</computeroutput>) is set. On | |
1001 receipt of <computeroutput>BZ_IO_ERROR</computeroutput>, the | |
1002 caller should consult <computeroutput>errno</computeroutput> | |
1003 and/or <computeroutput>perror</computeroutput> to acquire | |
1004 operating-system specific information about the | |
1005 problem.</para></listitem> | |
1006 </varlistentry> | |
1007 | |
1008 <varlistentry> | |
1009 <term><computeroutput>BZ_UNEXPECTED_EOF</computeroutput></term> | |
1010 <listitem><para>Returned by | |
1011 <computeroutput>BZ2_bzRead</computeroutput> when the | |
1012 compressed file finishes before the logical end of stream is | |
1013 detected.</para></listitem> | |
1014 </varlistentry> | |
1015 | |
1016 <varlistentry> | |
1017 <term><computeroutput>BZ_OUTBUFF_FULL</computeroutput></term> | |
1018 <listitem><para>Returned by | |
1019 <computeroutput>BZ2_bzBuffToBuffCompress</computeroutput> and | |
1020 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> to | |
1021 indicate that the output data will not fit into the output | |
1022 buffer provided.</para></listitem> | |
1023 </varlistentry> | |
1024 | |
1025 </variablelist> | |
1026 | |
1027 </sect1> | |
1028 | |
1029 | |
1030 | |
1031 <sect1 id="low-level" xreflabel="Low-level interface"> | |
1032 <title>Low-level interface</title> | |
1033 | |
1034 | |
1035 <sect2 id="bzcompress-init" xreflabel="BZ2_bzCompressInit"> | |
1036 <title>BZ2_bzCompressInit</title> | |
1037 | |
1038 <programlisting> | |
1039 typedef struct { | |
1040 char *next_in; | |
1041 unsigned int avail_in; | |
1042 unsigned int total_in_lo32; | |
1043 unsigned int total_in_hi32; | |
1044 | |
1045 char *next_out; | |
1046 unsigned int avail_out; | |
1047 unsigned int total_out_lo32; | |
1048 unsigned int total_out_hi32; | |
1049 | |
1050 void *state; | |
1051 | |
1052 void *(*bzalloc)(void *,int,int); | |
1053 void (*bzfree)(void *,void *); | |
1054 void *opaque; | |
1055 } bz_stream; | |
1056 | |
1057 int BZ2_bzCompressInit ( bz_stream *strm, | |
1058 int blockSize100k, | |
1059 int verbosity, | |
1060 int workFactor ); | |
1061 </programlisting> | |
1062 | |
1063 <para>Prepares for compression. The | |
1064 <computeroutput>bz_stream</computeroutput> structure holds all | |
1065 data pertaining to the compression activity. A | |
1066 <computeroutput>bz_stream</computeroutput> structure should be | |
1067 allocated and initialised prior to the call. The fields of | |
1068 <computeroutput>bz_stream</computeroutput> comprise the entirety | |
1069 of the user-visible data. <computeroutput>state</computeroutput> | |
1070 is a pointer to the private data structures required for | |
1071 compression.</para> | |
1072 | |
1073 <para>Custom memory allocators are supported, via fields | |
1074 <computeroutput>bzalloc</computeroutput>, | |
1075 <computeroutput>bzfree</computeroutput>, and | |
1076 <computeroutput>opaque</computeroutput>. The value | |
1077 <computeroutput>opaque</computeroutput> is passed as the first | |
1078 argument to all calls to <computeroutput>bzalloc</computeroutput> | |
1079 and <computeroutput>bzfree</computeroutput>, but is otherwise | |
1080 ignored by the library. The call <computeroutput>bzalloc ( | |
1081 opaque, n, m )</computeroutput> is expected to return a pointer | |
1082 <computeroutput>p</computeroutput> to <computeroutput>n * | |
1083 m</computeroutput> bytes of memory, and <computeroutput>bzfree ( | |
1084 opaque, p )</computeroutput> should free that memory.</para> | |
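<para>The allocator contract just described might be met by a pair of
functions like the following. This is only a sketch:
<computeroutput>alloc_stats</computeroutput>,
<computeroutput>counting_alloc</computeroutput> and
<computeroutput>counting_free</computeroutput> are hypothetical names
chosen for illustration, and the bookkeeping simply counts live
blocks.</para>

```c
#include <stdlib.h>

/* Hypothetical bookkeeping structure; a pointer to it would be
   stored in the opaque field of bz_stream. */
struct alloc_stats {
    int live_blocks;   /* blocks allocated and not yet freed */
};

/* Has the type of the bzalloc field: void *(*)(void *, int, int).
   Returns a pointer to n * m bytes, or NULL on failure. */
void *counting_alloc(void *opaque, int n, int m)
{
    struct alloc_stats *stats = opaque;
    void *p;

    if (n < 0 || m < 0)
        return NULL;
    p = malloc((size_t)n * (size_t)m);
    if (p != NULL && stats != NULL)
        stats->live_blocks++;
    return p;
}

/* Has the type of the bzfree field: void (*)(void *, void *). */
void counting_free(void *opaque, void *p)
{
    struct alloc_stats *stats = opaque;

    if (p == NULL)
        return;
    if (stats != NULL)
        stats->live_blocks--;
    free(p);
}
```

<para>To use them, a caller would assign
<computeroutput>counting_alloc</computeroutput> to
<computeroutput>bzalloc</computeroutput>,
<computeroutput>counting_free</computeroutput> to
<computeroutput>bzfree</computeroutput>, and the address of an
<computeroutput>alloc_stats</computeroutput> object to
<computeroutput>opaque</computeroutput> before calling
<computeroutput>BZ2_bzCompressInit</computeroutput>.</para>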
1085 | |
1086 <para>If you don't want to use a custom memory allocator, set | |
1087 <computeroutput>bzalloc</computeroutput>, | |
1088 <computeroutput>bzfree</computeroutput> and | |
1089 <computeroutput>opaque</computeroutput> to | |
1090 <computeroutput>NULL</computeroutput>, and the library will then | |
1091 use the standard <computeroutput>malloc</computeroutput> / | |
1092 <computeroutput>free</computeroutput> routines.</para> | |
1093 | |
1094 <para>Before calling | |
1095 <computeroutput>BZ2_bzCompressInit</computeroutput>, fields | |
1096 <computeroutput>bzalloc</computeroutput>, | |
1097 <computeroutput>bzfree</computeroutput> and | |
1098 <computeroutput>opaque</computeroutput> should be filled | |
1099 appropriately, as just described. Upon return, the internal | |
1100 state will have been allocated and initialised, and | |
1101 <computeroutput>total_in_lo32</computeroutput>, | |
1102 <computeroutput>total_in_hi32</computeroutput>, | |
1103 <computeroutput>total_out_lo32</computeroutput> and | |
1104 <computeroutput>total_out_hi32</computeroutput> will have been | |
1105 set to zero. These four fields are used by the library to inform | |
1106 the caller of the total amount of data passed into and out of the | |
1107 library, respectively. You should not try to change them. As of | |
1108 version 1.0, 64-bit counts are maintained, even on 32-bit | |
1109 platforms, using the <computeroutput>_hi32</computeroutput> | |
1110 fields to store the upper 32 bits of the count. So, for example, | |
1111 the total amount of data in is <computeroutput>(total_in_hi32 | |
1112 << 32) + total_in_lo32</computeroutput>.</para> | |
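<para>The expression above is conceptual. In C, shifting a 32-bit
<computeroutput>unsigned int</computeroutput> left by 32 bits is
undefined, so the upper half has to be widened first. A minimal
sketch, assuming a compiler that provides
<computeroutput>stdint.h</computeroutput>; the helper name
<computeroutput>bz_total64</computeroutput> is hypothetical:</para>

```c
#include <stdint.h>

/* Hypothetical helper, not part of libbzip2: combine the _hi32 and
   _lo32 halves of a byte count into one 64-bit value.  The cast
   happens before the shift so the shift is done in 64 bits. */
uint64_t bz_total64(unsigned int hi32, unsigned int lo32)
{
    return ((uint64_t)hi32 << 32) | (uint64_t)lo32;
}
```

<para>For example,
<computeroutput>bz_total64(strm.total_in_hi32,
strm.total_in_lo32)</computeroutput> would give the total number of
input bytes consumed so far.</para>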
1113 | |
1114 <para>Parameter <computeroutput>blockSize100k</computeroutput> | |
1115 specifies the block size to be used for compression. It should | |
1116 be a value between 1 and 9 inclusive, and the actual block size | |
1117 used is 100000 x this figure. 9 gives the best compression but | |
1118 takes most memory.</para> | |
1119 | |
1120 <para>Parameter <computeroutput>verbosity</computeroutput> should | |
1121 be set to a number between 0 and 4 inclusive. 0 is silent, and | |
1122 greater numbers give increasingly verbose monitoring/debugging | |
1123 output. If the library has been compiled with | |
1124 <computeroutput>-DBZ_NO_STDIO</computeroutput>, no such output | |
1125 will appear for any verbosity setting.</para> | |
1126 | |
1127 <para>Parameter <computeroutput>workFactor</computeroutput> | |
1128 controls how the compression phase behaves when presented with | |
1129 worst case, highly repetitive, input data. If compression runs | |
1130 into difficulties caused by repetitive data, the library switches | |
1131 from the standard sorting algorithm to a fallback algorithm. The | |
1132 fallback is slower than the standard algorithm by perhaps a | |
1133 factor of three, but always behaves reasonably, no matter how bad | |
1134 the input.</para> | |
1135 | |
1136 <para>Lower values of <computeroutput>workFactor</computeroutput> | |
1137 reduce the amount of effort the standard algorithm will expend | |
1138 before resorting to the fallback. You should set this parameter | |
1139 carefully; too low, and many inputs will be handled by the | |
1140 fallback algorithm and so compress rather slowly; too high, and | |
1141 your average-to-worst case compression times can become very | |
1142 large. The default value of 30 gives reasonable behaviour over a | |
1143 wide range of circumstances.</para> | |
1144 | |
1145 <para>Allowable values range from 0 to 250 inclusive. 0 is a | |
1146 special case, equivalent to using the default value of 30.</para> | |
1147 | |
1148 <para>Note that the compressed output generated is the same | |
1149 regardless of whether or not the fallback algorithm is | |
1150 used.</para> | |
1151 | |
1152 <para>Be aware also that this parameter may disappear entirely in | |
1153 future versions of the library. In principle it should be | |
1154 possible to devise a good way to automatically choose which | |
1155 algorithm to use. Such a mechanism would render the parameter | |
1156 obsolete.</para> | |
1157 | |
1158 <para>Possible return values:</para> | |
1159 | |
1160 <programlisting> | |
1161 BZ_CONFIG_ERROR | |
1162 if the library has been mis-compiled | |
1163 BZ_PARAM_ERROR | |
1164 if strm is NULL | |
1165 or blockSize100k < 1 or blockSize100k > 9 | |
1166 or verbosity < 0 or verbosity > 4 | |
1167 or workFactor < 0 or workFactor > 250 | |
1168 BZ_MEM_ERROR | |
1169 if not enough memory is available | |
1170 BZ_OK | |
1171 otherwise | |
1172 </programlisting> | |
1173 | |
1174 <para>Allowable next actions:</para> | |
1175 | |
1176 <programlisting> | |
1177 BZ2_bzCompress | |
1178 if BZ_OK is returned | |
1179 no specific action needed in case of error | |
1180 </programlisting> | |
1181 | |
1182 </sect2> | |
1183 | |
1184 | |
1185 <sect2 id="bzCompress" xreflabel="BZ2_bzCompress"> | |
1186 <title>BZ2_bzCompress</title> | |
1187 | |
1188 <programlisting> | |
1189 int BZ2_bzCompress ( bz_stream *strm, int action ); | |
1190 </programlisting> | |
1191 | |
1192 <para>Provides more input and/or output buffer space for the | |
1193 library. The caller maintains input and output buffers, and | |
1194 calls <computeroutput>BZ2_bzCompress</computeroutput> to transfer | |
1195 data between them.</para> | |
1196 | |
1197 <para>Before each call to | |
1198 <computeroutput>BZ2_bzCompress</computeroutput>, | |
1199 <computeroutput>next_in</computeroutput> should point at the data | |
1200 to be compressed, and <computeroutput>avail_in</computeroutput> | |
1201 should indicate how many bytes the library may read. | |
1202 <computeroutput>BZ2_bzCompress</computeroutput> updates | |
1203 <computeroutput>next_in</computeroutput>, | |
1204 <computeroutput>avail_in</computeroutput> and | |
1205 <computeroutput>total_in_lo32</computeroutput>/<computeroutput>total_in_hi32</computeroutput> to reflect the number | |
1206 of bytes it has read.</para> | |
1207 | |
1208 <para>Similarly, <computeroutput>next_out</computeroutput> should | |
1209 point to a buffer in which the compressed data is to be placed, | |
1210 with <computeroutput>avail_out</computeroutput> indicating how | |
1211 much output space is available. | |
1212 <computeroutput>BZ2_bzCompress</computeroutput> updates | |
1213 <computeroutput>next_out</computeroutput>, | |
1214 <computeroutput>avail_out</computeroutput> and | |
1215 <computeroutput>total_out_lo32</computeroutput>/<computeroutput>total_out_hi32</computeroutput> to reflect the number | |
1216 of bytes output.</para> | |
1217 | |
1218 <para>You may provide and remove as little or as much data as you | |
1219 like on each call of | |
1220 <computeroutput>BZ2_bzCompress</computeroutput>. In the limit, | |
1221 it is acceptable to supply and remove data one byte at a time, | |
1222 although this would be terribly inefficient. You should always | |
1223 ensure that at least one byte of output space is available at | |
1224 each call.</para> | |
1225 | |
1226 <para>A second purpose of | |
1227 <computeroutput>BZ2_bzCompress</computeroutput> is to request a | |
1228 change of mode of the compressed stream.</para> | |
1229 | |
1230 <para>Conceptually, a compressed stream can be in one of four | |
1231 states: IDLE, RUNNING, FLUSHING and FINISHING. Before | |
1232 initialisation | |
1233 (<computeroutput>BZ2_bzCompressInit</computeroutput>) and after | |
1234 termination (<computeroutput>BZ2_bzCompressEnd</computeroutput>), | |
1235 a stream is regarded as IDLE.</para> | |
1236 | |
1237 <para>Upon initialisation | |
1238 (<computeroutput>BZ2_bzCompressInit</computeroutput>), the stream | |
1239 is placed in the RUNNING state. Subsequent calls to | |
1240 <computeroutput>BZ2_bzCompress</computeroutput> should pass | |
1241 <computeroutput>BZ_RUN</computeroutput> as the requested action; | |
1242 other actions are illegal and will result in | |
1243 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>.</para> | |
1244 | |
1245 <para>At some point, the calling program will have provided all | |
1246 the input data it wants to. It will then want to finish up -- in | |
1247 effect, asking the library to process any data it might have | |
1248 buffered internally. In this state, | |
1249 <computeroutput>BZ2_bzCompress</computeroutput> will no longer | |
1250 attempt to read data from | |
1251 <computeroutput>next_in</computeroutput>, but it will want to | |
1252 write data to <computeroutput>next_out</computeroutput>. Because | |
1253 the output buffer supplied by the user can be arbitrarily small, | |
1254 the finishing-up operation cannot necessarily be done with a | |
1255 single call of | |
1256 <computeroutput>BZ2_bzCompress</computeroutput>.</para> | |
1257 | |
1258 <para>Instead, the calling program passes | |
1259 <computeroutput>BZ_FINISH</computeroutput> as an action to | |
1260 <computeroutput>BZ2_bzCompress</computeroutput>. This changes | |
1261 the stream's state to FINISHING. Any remaining input (ie, | |
1262 <computeroutput>next_in[0 .. avail_in-1]</computeroutput>) is | |
1263 compressed and transferred to the output buffer. To do this, | |
1264 <computeroutput>BZ2_bzCompress</computeroutput> must be called | |
1265 repeatedly until all the output has been consumed. At that | |
1266 point, <computeroutput>BZ2_bzCompress</computeroutput> returns | |
1267 <computeroutput>BZ_STREAM_END</computeroutput>, and the stream's | |
1268 state is set back to IDLE. | |
1269 <computeroutput>BZ2_bzCompressEnd</computeroutput> should then be | |
1270 called.</para> | |
1271 | |
1272 <para>Just to make sure the calling program does not cheat, the | |
1273 library makes a note of <computeroutput>avail_in</computeroutput> | |
1274 at the time of the first call to | |
1275 <computeroutput>BZ2_bzCompress</computeroutput> which has | |
1276 <computeroutput>BZ_FINISH</computeroutput> as an action (ie, at | |
1277 the time the program has announced its intention to not supply | |
1278 any more input). By comparing this value with that of | |
1279 <computeroutput>avail_in</computeroutput> over subsequent calls | |
1280 to <computeroutput>BZ2_bzCompress</computeroutput>, the library | |
1281 can detect any attempts to slip in more data to compress. Any | |
1282 calls for which this is detected will return | |
1283 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>. This | |
1284 indicates a programming mistake which should be corrected.</para> | |
1285 | |
1286 <para>Instead of asking to finish, the calling program may ask | |
1287 <computeroutput>BZ2_bzCompress</computeroutput> to take all the | |
1288 remaining input, compress it and terminate the current | |
1289 (Burrows-Wheeler) compression block. This could be useful for | |
1290 error control purposes. The mechanism is analogous to that for | |
1291 finishing: call <computeroutput>BZ2_bzCompress</computeroutput> | |
1292 with an action of <computeroutput>BZ_FLUSH</computeroutput>, | |
1293 remove output data, and persist with the | |
1294 <computeroutput>BZ_FLUSH</computeroutput> action until the value | |
1295 <computeroutput>BZ_RUN_OK</computeroutput> is returned. As with | |
1296 finishing, <computeroutput>BZ2_bzCompress</computeroutput> | |
1297 detects any attempt to provide more input data once the flush has | |
1298 begun.</para> | |
1299 | |
1300 <para>Once the flush is complete, the stream returns to the | |
1301 normal RUNNING state.</para> | |
1302 | |
1303 <para>This all sounds pretty complex, but isn't really. Here's a | |
1304 table which shows which actions are allowable in each state, what | |
1305 action will be taken, what the next state is, and what the | |
1306 non-error return values are. Note that you can't explicitly ask | |
1307 what state the stream is in, but nor do you need to -- it can be | |
1308 inferred from the values returned by | |
1309 <computeroutput>BZ2_bzCompress</computeroutput>.</para> | |
1310 | |
1311 <programlisting> | |
1312 IDLE/any | |
1313 Illegal. IDLE state only exists after BZ2_bzCompressEnd or | |
1314 before BZ2_bzCompressInit. | |
1315 Return value = BZ_SEQUENCE_ERROR | |
1316 | |
1317 RUNNING/BZ_RUN | |
1318 Compress from next_in to next_out as much as possible. | |
1319 Next state = RUNNING | |
1320 Return value = BZ_RUN_OK | |
1321 | |
1322 RUNNING/BZ_FLUSH | |
1323 Remember current value of avail_in. Compress from next_in | |
1324 to next_out as much as possible, but do not accept any more input. | |
1325 Next state = FLUSHING | |
1326 Return value = BZ_FLUSH_OK | |
1327 | |
1328 RUNNING/BZ_FINISH | |
1329 Remember current value of avail_in. Compress from next_in | |
1330 to next_out as much as possible, but do not accept any more input. | |
1331 Next state = FINISHING | |
1332 Return value = BZ_FINISH_OK | |
1333 | |
1334 FLUSHING/BZ_FLUSH | |
1335 Compress from next_in to next_out as much as possible, | |
1336 but do not accept any more input. | |
1337 If all the existing input has been used up and all compressed | |
1338 output has been removed | |
1339 Next state = RUNNING; Return value = BZ_RUN_OK | |
1340 else | |
1341 Next state = FLUSHING; Return value = BZ_FLUSH_OK | |
1342 | |
1343 FLUSHING/other | |
1344 Illegal. | |
1345 Return value = BZ_SEQUENCE_ERROR | |
1346 | |
1347 FINISHING/BZ_FINISH | |
1348 Compress from next_in to next_out as much as possible, | |
1349 but do not accept any more input. | |
1350 If all the existing input has been used up and all compressed | |
1351 output has been removed | |
1352 Next state = IDLE; Return value = BZ_STREAM_END | |
1353 else | |
1354 Next state = FINISHING; Return value = BZ_FINISH_OK | |
1355 | |
1356 FINISHING/other | |
1357 Illegal. | |
1358 Return value = BZ_SEQUENCE_ERROR | |
1359 </programlisting> | |
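<para>The table above can be read as a small state machine. The
following is a minimal sketch of that reading, not code taken from the
library: the enum names, the <computeroutput>model_step</computeroutput>
function and its <computeroutput>drained</computeroutput> flag are
hypothetical, and only the numeric return codes are meant to match
those in <computeroutput>bzlib.h</computeroutput>.</para>

```c
#include <stdbool.h>

/* States and actions named after the table; local to this sketch. */
typedef enum { ST_IDLE, ST_RUNNING, ST_FLUSHING, ST_FINISHING } bz_state_model;
typedef enum { ACT_RUN, ACT_FLUSH, ACT_FINISH } bz_action_model;

/* Return codes, using the values bzlib.h gives BZ_SEQUENCE_ERROR,
   BZ_RUN_OK, BZ_FLUSH_OK, BZ_FINISH_OK and BZ_STREAM_END. */
enum {
    M_SEQUENCE_ERROR = -1,
    M_RUN_OK = 1,
    M_FLUSH_OK = 2,
    M_FINISH_OK = 3,
    M_STREAM_END = 4
};

/* One transition of the model. `drained` stands for "all existing input
   has been used up and all compressed output has been removed". */
int model_step(bz_state_model *state, bz_action_model action, bool drained)
{
    switch (*state) {
    case ST_RUNNING:
        if (action == ACT_RUN) return M_RUN_OK;
        if (action == ACT_FLUSH) { *state = ST_FLUSHING; return M_FLUSH_OK; }
        *state = ST_FINISHING;
        return M_FINISH_OK;
    case ST_FLUSHING:
        if (action != ACT_FLUSH) return M_SEQUENCE_ERROR;
        if (drained) { *state = ST_RUNNING; return M_RUN_OK; }
        return M_FLUSH_OK;
    case ST_FINISHING:
        if (action != ACT_FINISH) return M_SEQUENCE_ERROR;
        if (drained) { *state = ST_IDLE; return M_STREAM_END; }
        return M_FINISH_OK;
    case ST_IDLE:
    default:
        /* IDLE accepts no action at all. */
        return M_SEQUENCE_ERROR;
    }
}
```

<para>The model only transcribes the table; it says nothing about how
much data any single call actually moves.</para>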
1360 | |
1361 | |
1362 <para>That still looks complicated? Well, fair enough. The | |
1363 usual sequence of calls for compressing a load of data is:</para> | |
1364 | |
1365 <orderedlist> | |
1366 | |
1367 <listitem><para>Get started with | |
1368 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para></listitem> | |
1369 | |
1370 <listitem><para>Shovel data in and shlurp out its compressed form | |
1371 using zero or more calls of | |
1372 <computeroutput>BZ2_bzCompress</computeroutput> with action = | |
1373 <computeroutput>BZ_RUN</computeroutput>.</para></listitem> | |
1374 | |
1375 <listitem><para>Finish up. Repeatedly call | |
1376 <computeroutput>BZ2_bzCompress</computeroutput> with action = | |
1377 <computeroutput>BZ_FINISH</computeroutput>, copying out the | |
1378 compressed output, until | |
1379 <computeroutput>BZ_STREAM_END</computeroutput> is | |
1380 returned.</para></listitem> <listitem><para>Close up and go home. Call | |
1381 <computeroutput>BZ2_bzCompressEnd</computeroutput>.</para></listitem> | |
1382 | |
1383 </orderedlist> | |
1384 | |
1385 <para>If the data you want to compress fits into your input | |
1386 buffer all at once, you can skip the calls of | |
1387 <computeroutput>BZ2_bzCompress ( ..., BZ_RUN )</computeroutput> | |
1388 and just do the <computeroutput>BZ2_bzCompress ( ..., BZ_FINISH | |
1389 )</computeroutput> calls.</para> | |
1390 | |
1391 <para>All required memory is allocated by | |
1392 <computeroutput>BZ2_bzCompressInit</computeroutput>. The | |
1393 compression library can accept any data at all (obviously). So | |
1394 you shouldn't get any error return values from the | |
1395 <computeroutput>BZ2_bzCompress</computeroutput> calls. If you | |
1396 do, they will be | |
1397 <computeroutput>BZ_SEQUENCE_ERROR</computeroutput>, and indicate | |
1398 a bug in your programming.</para> | |
1399 | |
1400 <para>Trivial other possible return values:</para> | |
1401 | |
1402 <programlisting> | |
1403 BZ_PARAM_ERROR | |
1404 if strm is NULL, or strm->s is NULL | |
1405 </programlisting> | |
1406 | |
1407 </sect2> | |
1408 | |
1409 | |
1410 <sect2 id="bzCompress-end" xreflabel="BZ2_bzCompressEnd"> | |
1411 <title>BZ2_bzCompressEnd</title> | |
1412 | |
1413 <programlisting> | |
1414 int BZ2_bzCompressEnd ( bz_stream *strm ); | |
1415 </programlisting> | |
1416 | |
1417 <para>Releases all memory associated with a compression | |
1418 stream.</para> | |
1419 | |
1420 <para>Possible return values:</para> | |
1421 | |
1422 <programlisting> | |
1423 BZ_PARAM_ERROR if strm is NULL or strm->s is NULL | |
1424 BZ_OK otherwise | |
1425 </programlisting> | |
1426 | |
1427 </sect2> | |
1428 | |
1429 | |
1430 <sect2 id="bzDecompress-init" xreflabel="BZ2_bzDecompressInit"> | |
1431 <title>BZ2_bzDecompressInit</title> | |
1432 | |
1433 <programlisting> | |
1434 int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small ); | |
1435 </programlisting> | |
1436 | |
1437 <para>Prepares for decompression. As with | |
1438 <computeroutput>BZ2_bzCompressInit</computeroutput>, a | |
1439 <computeroutput>bz_stream</computeroutput> record should be | |
1440 allocated and initialised before the call. Fields | |
1441 <computeroutput>bzalloc</computeroutput>, | |
1442 <computeroutput>bzfree</computeroutput> and | |
1443 <computeroutput>opaque</computeroutput> should be set if a custom | |
1444 memory allocator is required, or made | |
1445 <computeroutput>NULL</computeroutput> for the normal | |
1446 <computeroutput>malloc</computeroutput> / | |
1447 <computeroutput>free</computeroutput> routines. Upon return, the | |
1448 internal state will have been initialised, and | |
1449 <computeroutput>total_in</computeroutput> and | |
1450 <computeroutput>total_out</computeroutput> will be zero.</para> | |
1451 | |
1452 <para>For the meaning of parameter | |
1453 <computeroutput>verbosity</computeroutput>, see | |
1454 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para> | |
1455 | |
1456 <para>If <computeroutput>small</computeroutput> is nonzero, the | |
1457 library will use an alternative decompression algorithm which | |
1458 uses less memory but at the cost of decompressing more slowly | |
1459 (roughly speaking, half the speed, but the maximum memory | |
1460 requirement drops to around 2300k). See <xref linkend="using"/> | |
1461 for more information on memory management.</para> | |
1462 | |
1463 <para>Note that the amount of memory needed to decompress a | |
1464 stream cannot be determined until the stream's header has been | |
1465 read, so even if | |
1466 <computeroutput>BZ2_bzDecompressInit</computeroutput> succeeds, a | |
1467 subsequent <computeroutput>BZ2_bzDecompress</computeroutput> | |
1468 could fail with | |
1469 <computeroutput>BZ_MEM_ERROR</computeroutput>.</para> | |
1470 | |
1471 <para>Possible return values:</para> | |
1472 | |
1473 <programlisting> | |
1474 BZ_CONFIG_ERROR | |
1475 if the library has been mis-compiled | |
1476 BZ_PARAM_ERROR | |
1477 if ( small != 0 && small != 1 ) | |
1478 or (verbosity < 0 || verbosity > 4) | |
1479 BZ_MEM_ERROR | |
1480 if insufficient memory is available | |
1481 </programlisting> | |
1482 | |
1483 <para>Allowable next actions:</para> | |
1484 | |
1485 <programlisting> | |
1486 BZ2_bzDecompress | |
1487 if BZ_OK was returned | |
1488 no specific action required in case of error | |
1489 </programlisting> | |
1490 | |
1491 </sect2> | |
1492 | |
1493 | |
1494 <sect2 id="bzDecompress" xreflabel="BZ2_bzDecompress"> | |
1495 <title>BZ2_bzDecompress</title> | |
1496 | |
1497 <programlisting> | |
1498 int BZ2_bzDecompress ( bz_stream *strm ); | |
1499 </programlisting> | |
1500 | |
1501 <para>Provides more input and/or output buffer space for the | |
1502 library. The caller maintains input and output buffers, and uses | |
1503 <computeroutput>BZ2_bzDecompress</computeroutput> to transfer | |
1504 data between them.</para> | |
1505 | |
1506 <para>Before each call to | |
1507 <computeroutput>BZ2_bzDecompress</computeroutput>, | |
1508 <computeroutput>next_in</computeroutput> should point at the | |
1509 compressed data, and <computeroutput>avail_in</computeroutput> | |
1510 should indicate how many bytes the library may read. | |
1511 <computeroutput>BZ2_bzDecompress</computeroutput> updates | |
1512 <computeroutput>next_in</computeroutput>, | |
1513 <computeroutput>avail_in</computeroutput> and | |
1514 <computeroutput>total_in</computeroutput> to reflect the number | |
1515 of bytes it has read.</para> | |
1516 | |
1517 <para>Similarly, <computeroutput>next_out</computeroutput> should | |
1518 point to a buffer in which the uncompressed output is to be | |
1519 placed, with <computeroutput>avail_out</computeroutput> | |
1520 indicating how much output space is available. | |
1521 <computeroutput>BZ2_bzDecompress</computeroutput> updates | |
1522 <computeroutput>next_out</computeroutput>, | |
1523 <computeroutput>avail_out</computeroutput> and | |
1524 <computeroutput>total_out</computeroutput> to reflect the number | |
1525 of bytes output.</para> | |
1526 | |
1527 <para>You may provide and remove as little or as much data as you | |
1528 like on each call of | |
1529 <computeroutput>BZ2_bzDecompress</computeroutput>. In the limit, | |
1530 it is acceptable to supply and remove data one byte at a time, | |
1531 although this would be terribly inefficient. You should always | |
1532 ensure that at least one byte of output space is available at | |
1533 each call.</para> | |
1534 | |
1535 <para>Use of <computeroutput>BZ2_bzDecompress</computeroutput> is | |
1536 simpler than | |
1537 <computeroutput>BZ2_bzCompress</computeroutput>.</para> | |
1538 | |
1539 <para>You should provide input and remove output as described | |
1540 above, and repeatedly call | |
1541 <computeroutput>BZ2_bzDecompress</computeroutput> until | |
1542 <computeroutput>BZ_STREAM_END</computeroutput> is returned. | |
1543 Appearance of <computeroutput>BZ_STREAM_END</computeroutput> | |
1544 denotes that <computeroutput>BZ2_bzDecompress</computeroutput> | |
1545 has detected the logical end of the compressed stream. | |
1546 <computeroutput>BZ2_bzDecompress</computeroutput> will not | |
1547 produce <computeroutput>BZ_STREAM_END</computeroutput> until all | |
1548 output data has been placed into the output buffer, so once | |
1549 <computeroutput>BZ_STREAM_END</computeroutput> appears, you are | |
1550 guaranteed to have available all the decompressed output, and | |
1551 <computeroutput>BZ2_bzDecompressEnd</computeroutput> can safely | |
1552 be called.</para> | |
1553 | |
1554 <para>In case of an error return value, you should call | |
1555 <computeroutput>BZ2_bzDecompressEnd</computeroutput> to clean up | |
1556 and release memory.</para> | |
1557 | |
1558 <para>Possible return values:</para> | |
1559 | |
1560 <programlisting> | |
1561 BZ_PARAM_ERROR | |
1562 if strm is NULL or strm->s is NULL | |
1563 or strm->avail_out < 1 | |
1564 BZ_DATA_ERROR | |
1565 if a data integrity error is detected in the compressed stream | |
1566 BZ_DATA_ERROR_MAGIC | |
1567 if the compressed stream doesn't begin with the right magic bytes | |
1568 BZ_MEM_ERROR | |
1569 if there wasn't enough memory available | |
1570 BZ_STREAM_END | |
1571 if the logical end of the data stream was detected and all | |
1572 output has been consumed, eg strm->avail_out > 0 | |
1573 BZ_OK | |
1574 otherwise | |
1575 </programlisting> | |
1576 | |
1577 <para>Allowable next actions:</para> | |
1578 | |
1579 <programlisting> | |
1580 BZ2_bzDecompress | |
1581 if BZ_OK was returned | |
1582 BZ2_bzDecompressEnd | |
1583 otherwise | |
1584 </programlisting> | |
1585 | |
1586 </sect2> | |
1587 | |
1588 | |
1589 <sect2 id="bzDecompress-end" xreflabel="BZ2_bzDecompressEnd"> | |
1590 <title>BZ2_bzDecompressEnd</title> | |
1591 | |
1592 <programlisting> | |
1593 int BZ2_bzDecompressEnd ( bz_stream *strm ); | |
1594 </programlisting> | |
1595 | |
1596 <para>Releases all memory associated with a decompression | |
1597 stream.</para> | |
1598 | |
1599 <para>Possible return values:</para> | |
1600 | |
1601 <programlisting> | |
1602 BZ_PARAM_ERROR | |
1603 if strm is NULL or strm->s is NULL | |
1604 BZ_OK | |
1605 otherwise | |
1606 </programlisting> | |
1607 | |
1608 <para>Allowable next actions:</para> | |
1609 | |
1610 <programlisting> | |
1611 None. | |
1612 </programlisting> | |
1613 | |
1614 </sect2> | |
1615 | |
1616 </sect1> | |
1617 | |
1618 | |
1619 <sect1 id="hl-interface" xreflabel="High-level interface"> | |
1620 <title>High-level interface</title> | |
1621 | |
1622 <para>This interface provides functions for reading and writing | |
1623 <computeroutput>bzip2</computeroutput> format files. First, some | |
1624 general points.</para> | |
1625 | |
1626 <itemizedlist mark='bullet'> | |
1627 | |
1628 <listitem><para>All of the functions take an | |
1629 <computeroutput>int*</computeroutput> first argument, | |
1630 <computeroutput>bzerror</computeroutput>. After each call, | |
1631 <computeroutput>bzerror</computeroutput> should be consulted | |
1632 first to determine the outcome of the call. If | |
1633 <computeroutput>bzerror</computeroutput> is | |
1634 <computeroutput>BZ_OK</computeroutput>, the call completed | |
1635 successfully, and only then should the return value of the | |
1636 function (if any) be consulted. If | |
1637 <computeroutput>bzerror</computeroutput> is | |
1638 <computeroutput>BZ_IO_ERROR</computeroutput>, there was an | |
1639 error reading/writing the underlying compressed file, and you | |
1640 should then consult <computeroutput>errno</computeroutput> / | |
1641 <computeroutput>perror</computeroutput> to determine the cause | |
1642 of the difficulty. <computeroutput>bzerror</computeroutput> | |
1643 may also be set to various other values; precise details are | |
1644 given on a per-function basis below.</para></listitem> | |
1645 | |
1646 <listitem><para>If <computeroutput>bzerror</computeroutput> indicates | |
1647 an error (ie, anything except | |
1648 <computeroutput>BZ_OK</computeroutput> and | |
1649 <computeroutput>BZ_STREAM_END</computeroutput>), you should | |
1650 immediately call | |
1651 <computeroutput>BZ2_bzReadClose</computeroutput> (or | |
1652 <computeroutput>BZ2_bzWriteClose</computeroutput>, depending on | |
1653 whether you are attempting to read or to write) to free up all | |
1654 resources associated with the stream. Once an error has been | |
1655 indicated, behaviour of all calls except | |
1656 <computeroutput>BZ2_bzReadClose</computeroutput> | |
1657 (<computeroutput>BZ2_bzWriteClose</computeroutput>) is | |
1658 undefined. The implication is that (1) | |
1659 <computeroutput>bzerror</computeroutput> should be checked | |
1660 after each call, and (2) if | |
1661 <computeroutput>bzerror</computeroutput> indicates an error, | |
1662 <computeroutput>BZ2_bzReadClose</computeroutput> | |
1663 (<computeroutput>BZ2_bzWriteClose</computeroutput>) should then | |
1664 be called to clean up.</para></listitem> | |
1665 | |
1666 <listitem><para>The <computeroutput>FILE*</computeroutput> arguments | |
1667 passed to <computeroutput>BZ2_bzReadOpen</computeroutput> / | |
1668 <computeroutput>BZ2_bzWriteOpen</computeroutput> should be set | |
1669 to binary mode. Most Unix systems will do this by default, but | |
1670 other platforms, including Windows and Mac, will not. If you | |
1671 omit this, you may encounter problems when moving code to new | |
1672 platforms.</para></listitem> | |
1673 | |
1674 <listitem><para>Memory allocation requests are handled by | |
1675 <computeroutput>malloc</computeroutput> / | |
1676 <computeroutput>free</computeroutput>. At present there is no | |
1677 facility for user-defined memory allocators in the file I/O | |
1678 functions (could easily be added, though).</para></listitem> | |
1679 | |
1680 </itemizedlist> | |
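<para>The first two points above amount to a simple classification of
<computeroutput>bzerror</computeroutput>. A minimal sketch follows; the
helper names <computeroutput>bz_is_error</computeroutput> and
<computeroutput>bz_should_check_errno</computeroutput> are hypothetical
and not part of the library, and the constants are repeated here only
so that the fragment compiles on its own. Real code would include
<computeroutput>bzlib.h</computeroutput> instead.</para>

```c
/* Values as defined in bzlib.h; repeated only to keep the sketch
   self-contained. */
#ifndef BZ_OK
#define BZ_OK          0
#define BZ_STREAM_END  4
#define BZ_IO_ERROR    (-6)
#endif

/* Hypothetical helper: nonzero when bzerror denotes an error, ie,
   anything except BZ_OK and BZ_STREAM_END. After such a value the only
   defined next call is BZ2_bzReadClose (or BZ2_bzWriteClose). */
int bz_is_error(int bzerror)
{
    return bzerror != BZ_OK && bzerror != BZ_STREAM_END;
}

/* Hypothetical helper: nonzero when the failure came from the
   underlying file, so errno / perror can explain the cause. */
int bz_should_check_errno(int bzerror)
{
    return bzerror == BZ_IO_ERROR;
}
```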
1681 | |
1682 | |
1683 | |
1684 <sect2 id="bzreadopen" xreflabel="BZ2_bzReadOpen"> | |
1685 <title>BZ2_bzReadOpen</title> | |
1686 | |
1687 <programlisting> | |
1688 typedef void BZFILE; | |
1689 | |
1690 BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f, | |
1691 int verbosity, int small, | |
1692 void *unused, int nUnused ); | |
1693 </programlisting> | |
1694 | |
1695 <para>Prepare to read compressed data from file handle | |
1696 <computeroutput>f</computeroutput>. | |
1697 <computeroutput>f</computeroutput> should refer to a file which | |
1698 has been opened for reading, and for which the error indicator | |
1699 (<computeroutput>ferror(f)</computeroutput>) is not set. If | |
1700 <computeroutput>small</computeroutput> is 1, the library will try | |
1701 to decompress using less memory, at the expense of speed.</para> | |
1702 | |
1703 <para>For reasons explained below, | |
1704 <computeroutput>BZ2_bzRead</computeroutput> will decompress the | |
1705 <computeroutput>nUnused</computeroutput> bytes starting at | |
1706 <computeroutput>unused</computeroutput>, before starting to read | |
1707 from the file <computeroutput>f</computeroutput>. At most | |
1708 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes may be | |
1709 supplied like this. If this facility is not required, you should | |
1710 pass <computeroutput>NULL</computeroutput> and | |
1711 <computeroutput>0</computeroutput> for | |
1712 <computeroutput>unused</computeroutput> and | |
1713 <computeroutput>nUnused</computeroutput> respectively.</para> | |
1714 | |
1715 <para>For the meaning of parameters | |
1716 <computeroutput>small</computeroutput> and | |
1717 <computeroutput>verbosity</computeroutput>, see | |
1718 <computeroutput>BZ2_bzDecompressInit</computeroutput>.</para> | |
1719 | |
1720 <para>The amount of memory needed to decompress a file cannot be | |
1721 determined until the file's header has been read. So it is | |
1722 possible that <computeroutput>BZ2_bzReadOpen</computeroutput> | |
1723 returns <computeroutput>BZ_OK</computeroutput> but a subsequent | |
1724 call of <computeroutput>BZ2_bzRead</computeroutput> will return | |
1725 <computeroutput>BZ_MEM_ERROR</computeroutput>.</para> | |
1726 | |
1727 <para>Possible assignments to | |
1728 <computeroutput>bzerror</computeroutput>:</para> | |
1729 | |
1730 <programlisting> | |
1731 BZ_CONFIG_ERROR | |
1732 if the library has been mis-compiled | |
1733 BZ_PARAM_ERROR | |
1734 if f is NULL | |
1735 or small is neither 0 nor 1 | |
1736 or ( unused == NULL && nUnused != 0 ) | |
1737 or ( unused != NULL && !(0 <= nUnused <= BZ_MAX_UNUSED) ) | |
1738 BZ_IO_ERROR | |
1739 if ferror(f) is nonzero | |
1740 BZ_MEM_ERROR | |
1741 if insufficient memory is available | |
1742 BZ_OK | |
1743 otherwise. | |
1744 </programlisting> | |
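<para>The <computeroutput>BZ_PARAM_ERROR</computeroutput> conditions
listed above use a chained comparison that is not valid C as written.
One way to spell them out is sketched below. The function
<computeroutput>read_open_args_ok</computeroutput> is hypothetical and
mirrors only the conditions in that list; the value 5000 for
<computeroutput>BZ_MAX_UNUSED</computeroutput> is the one found in
<computeroutput>bzlib.h</computeroutput> for this release and is
repeated here only so that the fragment stands alone.</para>

```c
#include <stddef.h>

#ifndef BZ_MAX_UNUSED
#define BZ_MAX_UNUSED 5000   /* as in bzlib.h; assumption if it differs */
#endif

/* Hypothetical predicate: returns 1 when the arguments would pass the
   parameter checks listed for BZ2_bzReadOpen, 0 when BZ_PARAM_ERROR
   would be expected. */
int read_open_args_ok(const void *f, int small,
                      const void *unused, int nUnused)
{
    if (f == NULL) return 0;
    if (small != 0 && small != 1) return 0;
    if (unused == NULL && nUnused != 0) return 0;
    if (unused != NULL && (nUnused < 0 || nUnused > BZ_MAX_UNUSED)) return 0;
    return 1;
}
```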
1745 | |
1746 <para>Possible return values:</para> | |
1747 | |
1748 <programlisting> | |
1749 Pointer to an abstract BZFILE | |
1750 if bzerror is BZ_OK | |
1751 NULL | |
1752 otherwise | |
1753 </programlisting> | |
1754 | |
1755 <para>Allowable next actions:</para> | |
1756 | |
1757 <programlisting> | |
1758 BZ2_bzRead | |
1759 if bzerror is BZ_OK | |
1760 BZ2_bzReadClose | |
1761 otherwise | |
1762 </programlisting> | |
1763 | |
1764 </sect2> | |
1765 | |
1766 | |
1767 <sect2 id="bzread" xreflabel="BZ2_bzRead"> | |
1768 <title>BZ2_bzRead</title> | |
1769 | |
1770 <programlisting> | |
1771 int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len ); | |
1772 </programlisting> | |
1773 | |
1774 <para>Reads up to <computeroutput>len</computeroutput> | |
1775 (uncompressed) bytes from the compressed file | |
1776 <computeroutput>b</computeroutput> into the buffer | |
1777 <computeroutput>buf</computeroutput>. If the read was | |
1778 successful, <computeroutput>bzerror</computeroutput> is set to | |
1779 <computeroutput>BZ_OK</computeroutput> and the number of bytes | |
1780 read is returned. If the logical end-of-stream was detected, | |
1781 <computeroutput>bzerror</computeroutput> will be set to | |
1782 <computeroutput>BZ_STREAM_END</computeroutput>, and the number of | |
1783 bytes read is returned. All other | |
1784 <computeroutput>bzerror</computeroutput> values denote an | |
1785 error.</para> | |
1786 | |
1787 <para><computeroutput>BZ2_bzRead</computeroutput> will supply | |
1788 <computeroutput>len</computeroutput> bytes, unless the logical | |
1789 stream end is detected or an error occurs. Because of this, it | |
1790 is possible to detect the stream end by observing when the number | |
1791 of bytes returned is less than the number requested. | |
1792 Nevertheless, this is regarded as inadvisable; you should instead | |
1793 check <computeroutput>bzerror</computeroutput> after every call | |
1794 and watch out for | |
1795 <computeroutput>BZ_STREAM_END</computeroutput>.</para> | |
1796 | |
1797 <para>Internally, <computeroutput>BZ2_bzRead</computeroutput> | |
1798 copies data from the compressed file in chunks of size | |
1799 <computeroutput>BZ_MAX_UNUSED</computeroutput> bytes before | |
1800 decompressing it. If the file contains more bytes than strictly | |
1801 needed to reach the logical end-of-stream, | |
1802 <computeroutput>BZ2_bzRead</computeroutput> will almost certainly | |
1803 read some of the trailing data before signalling | |
1804 <computeroutput>BZ_STREAM_END</computeroutput>. To collect the | |
1805 read but unused data once | |
1806 <computeroutput>BZ_STREAM_END</computeroutput> has appeared, | |
1807 call <computeroutput>BZ2_bzReadGetUnused</computeroutput> | |
1808 immediately before | |
1809 <computeroutput>BZ2_bzReadClose</computeroutput>.</para> | |
1810 | |
1811 <para>Possible assignments to | |
1812 <computeroutput>bzerror</computeroutput>:</para> | |
1813 | |
1814 <programlisting> | |
1815 BZ_PARAM_ERROR | |
1816 if b is NULL or buf is NULL or len < 0 | |
1817 BZ_SEQUENCE_ERROR | |
1818 if b was opened with BZ2_bzWriteOpen | |
1819 BZ_IO_ERROR | |
1820 if there is an error reading from the compressed file | |
1821 BZ_UNEXPECTED_EOF | |
1822 if the compressed file ended before | |
1823 the logical end-of-stream was detected | |
1824 BZ_DATA_ERROR | |
1825 if a data integrity error was detected in the compressed stream | |
1826 BZ_DATA_ERROR_MAGIC | |
1827 if the stream does not begin with the requisite header bytes | |
1828 (ie, is not a bzip2 data file). This is really | |
1829 a special case of BZ_DATA_ERROR. | |
1830 BZ_MEM_ERROR | |
1831 if insufficient memory was available | |
1832 BZ_STREAM_END | |
1833 if the logical end of stream was detected. | |
1834 BZ_OK | |
1835 otherwise. | |
1836 </programlisting> | |
1837 | |
1838 <para>Possible return values:</para> | |
1839 | |
1840 <programlisting> | |
1841 number of bytes read | |
1842 if bzerror is BZ_OK or BZ_STREAM_END | |
1843 undefined | |
1844 otherwise | |
1845 </programlisting> | |
1846 | |
1847 <para>Allowable next actions:</para> | |
1848 | |
1849 <programlisting> | |
1850 collect data from buf, then BZ2_bzRead or BZ2_bzReadClose | |
1851 if bzerror is BZ_OK | |
1852 collect data from buf, then BZ2_bzReadClose or BZ2_bzReadGetUnused | |
1853 if bzerror is BZ_STREAM_END | |
1854 BZ2_bzReadClose | |
1855 otherwise | |
1856 </programlisting> | |
1857 | |
1858 </sect2> | |
1859 | |
1860 | |
1861 <sect2 id="bzreadgetunused" xreflabel="BZ2_bzReadGetUnused"> | |
1862 <title>BZ2_bzReadGetUnused</title> | |
1863 | |
1864 <programlisting> | |
1865 void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b, | |
1866 void** unused, int* nUnused ); | |
1867 </programlisting> | |
1868 | |
1869 <para>Returns data which was read from the compressed file but | |
1870 was not needed to get to the logical end-of-stream. | |
1871 <computeroutput>*unused</computeroutput> is set to the address of | |
1872 the data, and <computeroutput>*nUnused</computeroutput> to the | |
1873 number of bytes. <computeroutput>*nUnused</computeroutput> will | |
1874 be set to a value between <computeroutput>0</computeroutput> and | |
1875 <computeroutput>BZ_MAX_UNUSED</computeroutput> inclusive.</para> | |
1876 | |
1877 <para>This function may only be called once | |
1878 <computeroutput>BZ2_bzRead</computeroutput> has signalled | |
1879 <computeroutput>BZ_STREAM_END</computeroutput> but before | |
1880 <computeroutput>BZ2_bzReadClose</computeroutput>.</para> | |
1881 | |
1882 <para>Possible assignments to | |
1883 <computeroutput>bzerror</computeroutput>:</para> | |
1884 | |
1885 <programlisting> | |
1886 BZ_PARAM_ERROR | |
1887 if b is NULL | |
1888 or unused is NULL or nUnused is NULL | |
1889 BZ_SEQUENCE_ERROR | |
1890 if BZ_STREAM_END has not been signalled | |
1891 or if b was opened with BZ2_bzWriteOpen | |
1892 BZ_OK | |
1893 otherwise | |
1894 </programlisting> | |
1895 | |
1896 <para>Allowable next actions:</para> | |
1897 | |
1898 <programlisting> | |
1899 BZ2_bzReadClose | |
1900 </programlisting> | |
1901 | |
1902 </sect2> | |
1903 | |
1904 | |
1905 <sect2 id="bzreadclose" xreflabel="BZ2_bzReadClose"> | |
1906 <title>BZ2_bzReadClose</title> | |
1907 | |
1908 <programlisting> | |
1909 void BZ2_bzReadClose ( int *bzerror, BZFILE *b ); | |
1910 </programlisting> | |
1911 | |
1912 <para>Releases all memory pertaining to the compressed file | |
1913 <computeroutput>b</computeroutput>. | |
1914 <computeroutput>BZ2_bzReadClose</computeroutput> does not call | |
1915 <computeroutput>fclose</computeroutput> on the underlying file | |
1916 handle, so you should do that yourself if appropriate. | |
1917 <computeroutput>BZ2_bzReadClose</computeroutput> should be called | |
1918 to clean up after all error situations.</para> | |
1919 | |
1920 <para>Possible assignments to | |
1921 <computeroutput>bzerror</computeroutput>:</para> | |
1922 | |
1923 <programlisting> | |
1924 BZ_SEQUENCE_ERROR | |
1925 if b was opened with BZ2_bzWriteOpen | |
1926 BZ_OK | |
1927 otherwise | |
1928 </programlisting> | |
1929 | |
1930 <para>Allowable next actions:</para> | |
1931 | |
1932 <programlisting> | |
1933 none | |
1934 </programlisting> | |
1935 | |
1936 </sect2> | |
1937 | |
1938 | |
1939 <sect2 id="bzwriteopen" xreflabel="BZ2_bzWriteOpen"> | |
1940 <title>BZ2_bzWriteOpen</title> | |
1941 | |
1942 <programlisting> | |
1943 BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f, | |
1944 int blockSize100k, int verbosity, | |
1945 int workFactor ); | |
1946 </programlisting> | |
1947 | |
1948 <para>Prepare to write compressed data to file handle | |
1949 <computeroutput>f</computeroutput>. | |
1950 <computeroutput>f</computeroutput> should refer to a file which | |
1951 has been opened for writing, and for which the error indicator | |
1952 (<computeroutput>ferror(f)</computeroutput>) is not set.</para> | |
1953 | |
1954 <para>For the meaning of parameters | |
1955 <computeroutput>blockSize100k</computeroutput>, | |
1956 <computeroutput>verbosity</computeroutput> and | |
1957 <computeroutput>workFactor</computeroutput>, see | |
1958 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para> | |
1959 | |
1960 <para>All required memory is allocated at this stage, so if the | |
1961 call completes successfully, | |
1962 <computeroutput>BZ_MEM_ERROR</computeroutput> cannot be signalled | |
1963 by a subsequent call to | |
1964 <computeroutput>BZ2_bzWrite</computeroutput>.</para> | |
1965 | |
1966 <para>Possible assignments to | |
1967 <computeroutput>bzerror</computeroutput>:</para> | |
1968 | |
1969 <programlisting> | |
1970 BZ_CONFIG_ERROR | |
1971 if the library has been mis-compiled | |
1972 BZ_PARAM_ERROR | |
1973 if f is NULL | |
1974 or blockSize100k < 1 or blockSize100k > 9 | |
1975 BZ_IO_ERROR | |
1976 if ferror(f) is nonzero | |
1977 BZ_MEM_ERROR | |
1978 if insufficient memory is available | |
1979 BZ_OK | |
1980 otherwise | |
1981 </programlisting> | |
1982 | |
1983 <para>Possible return values:</para> | |
1984 | |
1985 <programlisting> | |
1986 Pointer to an abstract BZFILE | |
1987 if bzerror is BZ_OK | |
1988 NULL | |
1989 otherwise | |
1990 </programlisting> | |
1991 | |
1992 <para>Allowable next actions:</para> | |
1993 | |
1994 <programlisting> | |
1995 BZ2_bzWrite | |
1996 if bzerror is BZ_OK | |
1997 (you could go directly to BZ2_bzWriteClose, but this would be pretty pointless) | |
1998 BZ2_bzWriteClose | |
1999 otherwise | |
2000 </programlisting> | |
2001 | |
2002 </sect2> | |
2003 | |
2004 | |
2005 <sect2 id="bzwrite" xreflabel="BZ2_bzWrite"> | |
2006 <title>BZ2_bzWrite</title> | |
2007 | |
2008 <programlisting> | |
2009 void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len ); | |
2010 </programlisting> | |
2011 | |
2012 <para>Absorbs <computeroutput>len</computeroutput> bytes from the | |
2013 buffer <computeroutput>buf</computeroutput>, eventually to be | |
2014 compressed and written to the file.</para> | |
2015 | |
2016 <para>Possible assignments to | |
2017 <computeroutput>bzerror</computeroutput>:</para> | |
2018 | |
2019 <programlisting> | |
2020 BZ_PARAM_ERROR | |
2021 if b is NULL or buf is NULL or len < 0 | |
2022 BZ_SEQUENCE_ERROR | |
2023 if b was opened with BZ2_bzReadOpen | |
2024 BZ_IO_ERROR | |
2025 if there is an error writing the compressed file. | |
2026 BZ_OK | |
2027 otherwise | |
2028 </programlisting> | |
2029 | |
2030 </sect2> | |
2031 | |
2032 | |
2033 <sect2 id="bzwriteclose" xreflabel="BZ2_bzWriteClose"> | |
2034 <title>BZ2_bzWriteClose</title> | |
2035 | |
2036 <programlisting> | |
2037 void BZ2_bzWriteClose( int *bzerror, BZFILE* b, | |
2038 int abandon, | |
2039 unsigned int* nbytes_in, | |
2040 unsigned int* nbytes_out ); | |
2041 | |
2042 void BZ2_bzWriteClose64( int *bzerror, BZFILE* b, | |
2043 int abandon, | |
2044 unsigned int* nbytes_in_lo32, | |
2045 unsigned int* nbytes_in_hi32, | |
2046 unsigned int* nbytes_out_lo32, | |
2047 unsigned int* nbytes_out_hi32 ); | |
2048 </programlisting> | |
2049 | |
2050 <para>Compresses and flushes to the compressed file all data so | |
2051 far supplied by <computeroutput>BZ2_bzWrite</computeroutput>. | |
2052 The logical end-of-stream markers are also written, so subsequent | |
2053 calls to <computeroutput>BZ2_bzWrite</computeroutput> are | |
2054 illegal. All memory associated with the compressed file | |
2055 <computeroutput>b</computeroutput> is released. | |
2056 <computeroutput>fflush</computeroutput> is called on the | |
2057 compressed file, but it is not | |
2058 <computeroutput>fclose</computeroutput>'d.</para> | |
2059 | |
2060 <para>If <computeroutput>BZ2_bzWriteClose</computeroutput> is | |
2061 called to clean up after an error, the only action is to release | |
2062 the memory. The library records the error codes issued by | |
2063 previous calls, so this situation will be detected automatically. | |
2064 There is no attempt to complete the compression operation, nor to | |
2065 <computeroutput>fflush</computeroutput> the compressed file. You | |
2066 can force this behaviour to happen even in the case of no error, | |
2067 by passing a nonzero value to | |
2068 <computeroutput>abandon</computeroutput>.</para> | |
2069 | |
2070 <para>If <computeroutput>nbytes_in</computeroutput> is non-null, | |
2071 <computeroutput>*nbytes_in</computeroutput> will be set to be the | |
2072 total volume of uncompressed data handled. Similarly, | |
2073 if non-null, <computeroutput>*nbytes_out</computeroutput> will be set to the | |
2074 total volume of compressed data written. For compatibility with | |
2075 older versions of the library, | |
2076 <computeroutput>BZ2_bzWriteClose</computeroutput> only yields the | |
2077 lower 32 bits of these counts. Use | |
2078 <computeroutput>BZ2_bzWriteClose64</computeroutput> if you want | |
2079 the full 64 bit counts. These two functions are otherwise | |
2080 absolutely identical.</para> | |
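<para>The two 32-bit halves reported by
<computeroutput>BZ2_bzWriteClose64</computeroutput> can be joined into a
single 64-bit count. A minimal sketch, assuming a C99 compiler with
<computeroutput>stdint.h</computeroutput>; the helper name
<computeroutput>bz_combine_count</computeroutput> is hypothetical and
not part of the library.</para>

```c
#include <stdint.h>

/* Hypothetical helper: joins the low and high 32-bit halves, such as
   *nbytes_in_lo32 and *nbytes_in_hi32, into one 64-bit byte count. */
uint64_t bz_combine_count(unsigned int lo32, unsigned int hi32)
{
    /* Mask both halves in case unsigned int is wider than 32 bits. */
    uint64_t lo = (uint64_t)lo32 & 0xFFFFFFFFu;
    uint64_t hi = (uint64_t)hi32 & 0xFFFFFFFFu;
    return (hi << 32) | lo;
}
```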
2081 | |
2082 <para>Possible assignments to | |
2083 <computeroutput>bzerror</computeroutput>:</para> | |
2084 | |
2085 <programlisting> | |
2086 BZ_SEQUENCE_ERROR | |
2087 if b was opened with BZ2_bzReadOpen | |
2088 BZ_IO_ERROR | |
2089 if there is an error writing the compressed file | |
2090 BZ_OK | |
2091 otherwise | |
2092 </programlisting> | |
2093 | |
2094 </sect2> | |
2095 | |
2096 | |
2097 <sect2 id="embed" xreflabel="Handling embedded compressed data streams"> | |
2098 <title>Handling embedded compressed data streams</title> | |
2099 | |
2100 <para>The high-level library facilitates use of | |
2101 <computeroutput>bzip2</computeroutput> data streams which form | |
2102 some part of a surrounding, larger data stream.</para> | |
2103 | |
2104 <itemizedlist mark='bullet'> | |
2105 | |
2106 <listitem><para>For writing, the library takes an open file handle, | |
2107 writes compressed data to it, | |
2108 <computeroutput>fflush</computeroutput>es it but does not | |
2109 <computeroutput>fclose</computeroutput> it. The calling | |
2110 application can write its own data before and after the | |
2111 compressed data stream, using that same file handle.</para></listitem> | |
2112 | |
<listitem><para>Reading is more complex, and the facilities are not as
general as they could be since generality is hard to reconcile
with efficiency. <computeroutput>BZ2_bzRead</computeroutput>
reads from the compressed file in blocks of size
<computeroutput>BZ_MAX_UNUSED</computeroutput> bytes, and in
doing so will probably overshoot the logical end of the compressed
stream. To recover this data once decompression has ended,
call <computeroutput>BZ2_bzReadGetUnused</computeroutput> after
the last call of <computeroutput>BZ2_bzRead</computeroutput>
(the one returning
<computeroutput>BZ_STREAM_END</computeroutput>) but before
calling
<computeroutput>BZ2_bzReadClose</computeroutput>.</para></listitem>
2126 | |
2127 </itemizedlist> | |
2128 | |
<para>This mechanism makes it easy to decompress multiple
<computeroutput>bzip2</computeroutput> streams placed end-to-end.
At the end of one stream, when
<computeroutput>BZ2_bzRead</computeroutput> returns
<computeroutput>BZ_STREAM_END</computeroutput>, call
<computeroutput>BZ2_bzReadGetUnused</computeroutput> to collect
the unused data (copy it into your own buffer somewhere). That
data forms the start of the next compressed stream. To start
uncompressing that next stream, call
<computeroutput>BZ2_bzReadOpen</computeroutput> again, feeding in
the unused data via the <computeroutput>unused</computeroutput> /
<computeroutput>nUnused</computeroutput> parameters. Keep doing
this until a <computeroutput>BZ_STREAM_END</computeroutput> return
coincides with the physical end of file
(<computeroutput>feof(f)</computeroutput>). In this situation
<computeroutput>BZ2_bzReadGetUnused</computeroutput> will of
course return no data.</para>
2146 | |
2147 <para>This should give some feel for how the high-level interface | |
2148 can be used. If you require extra flexibility, you'll have to | |
2149 bite the bullet and get to grips with the low-level | |
2150 interface.</para> | |
2151 | |
2152 </sect2> | |
2153 | |
2154 | |
2155 <sect2 id="std-rdwr" xreflabel="Standard file-reading/writing code"> | |
2156 <title>Standard file-reading/writing code</title> | |
2157 | |
2158 <para>Here's how you'd write data to a compressed file:</para> | |
2159 | |
2160 <programlisting> | |
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "wb" );
if ( !f ) {
  /* handle error */
}
/* blockSize100k = 9, verbosity = 0, workFactor = 0 (library default) */
b = BZ2_bzWriteOpen ( &bzerror, f, 9, 0, 0 );
if ( bzerror != BZ_OK ) {
  BZ2_bzWriteClose ( &bzerror, b, 1, NULL, NULL );
  /* handle error */
}

while ( /* condition */ ) {
  /* get data to write into buf, and set nBuf appropriately */
  BZ2_bzWrite ( &bzerror, b, buf, nBuf );
  if ( bzerror == BZ_IO_ERROR ) {
    /* abandon = 1: release resources without writing further output */
    BZ2_bzWriteClose ( &bzerror, b, 1, NULL, NULL );
    /* handle error */
  }
}

/* abandon = 0: finish the stream; the byte counts are not needed here */
BZ2_bzWriteClose ( &bzerror, b, 0, NULL, NULL );
if ( bzerror == BZ_IO_ERROR ) {
  /* handle error */
}
2191 </programlisting> | |
2192 | |
2193 <para>And to read from a compressed file:</para> | |
2194 | |
2195 <programlisting> | |
FILE*   f;
BZFILE* b;
int     nBuf;
char    buf[ /* whatever size you like */ ];
int     bzerror;

f = fopen ( "myfile.bz2", "rb" );
if ( !f ) {
  /* handle error */
}
/* verbosity = 0, small = 0, no unused data carried over */
b = BZ2_bzReadOpen ( &bzerror, f, 0, 0, NULL, 0 );
if ( bzerror != BZ_OK ) {
  BZ2_bzReadClose ( &bzerror, b );
  /* handle error */
}

bzerror = BZ_OK;
while ( bzerror == BZ_OK && /* arbitrary other conditions */) {
  nBuf = BZ2_bzRead ( &bzerror, b, buf, /* size of buf */ );
  if ( bzerror == BZ_OK || bzerror == BZ_STREAM_END ) {
    /* do something with buf[0 .. nBuf-1]; the final call
       returns BZ_STREAM_END and may still deliver data */
  }
}
if ( bzerror != BZ_STREAM_END ) {
  BZ2_bzReadClose ( &bzerror, b );
  /* handle error */
} else {
  BZ2_bzReadClose ( &bzerror, b );
}
2226 </programlisting> | |
2227 | |
2228 </sect2> | |
2229 | |
2230 </sect1> | |
2231 | |
2232 | |
2233 <sect1 id="util-fns" xreflabel="Utility functions"> | |
2234 <title>Utility functions</title> | |
2235 | |
2236 | |
2237 <sect2 id="bzbufftobuffcompress" xreflabel="BZ2_bzBuffToBuffCompress"> | |
2238 <title>BZ2_bzBuffToBuffCompress</title> | |
2239 | |
2240 <programlisting> | |
2241 int BZ2_bzBuffToBuffCompress( char* dest, | |
2242 unsigned int* destLen, | |
2243 char* source, | |
2244 unsigned int sourceLen, | |
2245 int blockSize100k, | |
2246 int verbosity, | |
2247 int workFactor ); | |
2248 </programlisting> | |
2249 | |
2250 <para>Attempts to compress the data in <computeroutput>source[0 | |
2251 .. sourceLen-1]</computeroutput> into the destination buffer, | |
2252 <computeroutput>dest[0 .. *destLen-1]</computeroutput>. If the | |
2253 destination buffer is big enough, | |
2254 <computeroutput>*destLen</computeroutput> is set to the size of | |
2255 the compressed data, and <computeroutput>BZ_OK</computeroutput> | |
2256 is returned. If the compressed data won't fit, | |
2257 <computeroutput>*destLen</computeroutput> is unchanged, and | |
2258 <computeroutput>BZ_OUTBUFF_FULL</computeroutput> is | |
2259 returned.</para> | |
2260 | |
2261 <para>Compression in this manner is a one-shot event, done with a | |
2262 single call to this function. The resulting compressed data is a | |
2263 complete <computeroutput>bzip2</computeroutput> format data | |
2264 stream. There is no mechanism for making additional calls to | |
2265 provide extra input data. If you want that kind of mechanism, | |
2266 use the low-level interface.</para> | |
2267 | |
2268 <para>For the meaning of parameters | |
2269 <computeroutput>blockSize100k</computeroutput>, | |
2270 <computeroutput>verbosity</computeroutput> and | |
2271 <computeroutput>workFactor</computeroutput>, see | |
2272 <computeroutput>BZ2_bzCompressInit</computeroutput>.</para> | |
2273 | |
2274 <para>To guarantee that the compressed data will fit in its | |
2275 buffer, allocate an output buffer of size 1% larger than the | |
2276 uncompressed data, plus six hundred extra bytes.</para> | |
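
<para>The sizing rule above can be written down as a small helper. This is a
sketch only: the name <computeroutput>bz_compress_bound</computeroutput> is
hypothetical and not part of the library, and the 1% term is rounded up so
that small inputs are not under-allocated.</para>

```c
#include <stddef.h>

/* Hypothetical helper: destination size that follows the rule
   "1% larger than the uncompressed data, plus six hundred bytes". */
static size_t bz_compress_bound(size_t sourceLen)
{
    return sourceLen + (sourceLen + 99) / 100 + 600;
}
```

<para>Since <computeroutput>destLen</computeroutput> is an
<computeroutput>unsigned int</computeroutput>, a caller would still need to
check that the result fits in that type before passing it on.</para>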
2277 | |
<para><computeroutput>BZ2_bzBuffToBuffCompress</computeroutput>
will not write data at or beyond
<computeroutput>dest[*destLen]</computeroutput>, even in case of
buffer overflow.</para>
2282 | |
2283 <para>Possible return values:</para> | |
2284 | |
2285 <programlisting> | |
2286 BZ_CONFIG_ERROR | |
2287 if the library has been mis-compiled | |
2288 BZ_PARAM_ERROR | |
2289 if dest is NULL or destLen is NULL | |
2290 or blockSize100k < 1 or blockSize100k > 9 | |
2291 or verbosity < 0 or verbosity > 4 | |
2292 or workFactor < 0 or workFactor > 250 | |
2293 BZ_MEM_ERROR | |
2294 if insufficient memory is available | |
2295 BZ_OUTBUFF_FULL | |
2296 if the size of the compressed data exceeds *destLen | |
2297 BZ_OK | |
2298 otherwise | |
2299 </programlisting> | |
2300 | |
2301 </sect2> | |
2302 | |
2303 | |
2304 <sect2 id="bzbufftobuffdecompress" xreflabel="BZ2_bzBuffToBuffDecompress"> | |
2305 <title>BZ2_bzBuffToBuffDecompress</title> | |
2306 | |
2307 <programlisting> | |
2308 int BZ2_bzBuffToBuffDecompress( char* dest, | |
2309 unsigned int* destLen, | |
2310 char* source, | |
2311 unsigned int sourceLen, | |
2312 int small, | |
2313 int verbosity ); | |
2314 </programlisting> | |
2315 | |
<para>Attempts to decompress the data in <computeroutput>source[0
.. sourceLen-1]</computeroutput> into the destination buffer,
<computeroutput>dest[0 .. *destLen-1]</computeroutput>. If the
destination buffer is big enough,
<computeroutput>*destLen</computeroutput> is set to the size of
the uncompressed data, and <computeroutput>BZ_OK</computeroutput>
is returned. If the uncompressed data won't fit,
<computeroutput>*destLen</computeroutput> is unchanged, and
<computeroutput>BZ_OUTBUFF_FULL</computeroutput> is
returned.</para>
2326 | |
2327 <para><computeroutput>source</computeroutput> is assumed to hold | |
2328 a complete <computeroutput>bzip2</computeroutput> format data | |
2329 stream. | |
2330 <computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> tries | |
2331 to decompress the entirety of the stream into the output | |
2332 buffer.</para> | |
2333 | |
2334 <para>For the meaning of parameters | |
2335 <computeroutput>small</computeroutput> and | |
2336 <computeroutput>verbosity</computeroutput>, see | |
2337 <computeroutput>BZ2_bzDecompressInit</computeroutput>.</para> | |
2338 | |
2339 <para>Because the compression ratio of the compressed data cannot | |
2340 be known in advance, there is no easy way to guarantee that the | |
2341 output buffer will be big enough. You may of course make | |
2342 arrangements in your code to record the size of the uncompressed | |
2343 data, but such a mechanism is beyond the scope of this | |
2344 library.</para> | |
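
<para>One possible arrangement for recording the uncompressed size, sketched
here purely as an assumption, is to store it in a small header in front of
the compressed data. The four-byte little-endian layout and the names
<computeroutput>put_len32</computeroutput> and
<computeroutput>get_len32</computeroutput> are hypothetical; nothing in
<computeroutput>libbzip2</computeroutput> reads or writes such a
header.</para>

```c
#include <stdint.h>

/* Stores len as four little-endian bytes at p (hypothetical framing). */
static void put_len32(unsigned char *p, uint32_t len)
{
    p[0] = (unsigned char)(len & 0xFF);
    p[1] = (unsigned char)((len >> 8) & 0xFF);
    p[2] = (unsigned char)((len >> 16) & 0xFF);
    p[3] = (unsigned char)((len >> 24) & 0xFF);
}

/* Reads back a length written by put_len32. */
static uint32_t get_len32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

<para>The reader would then allocate
<computeroutput>get_len32(header)</computeroutput> bytes for
<computeroutput>dest</computeroutput> and pass the remaining bytes as
<computeroutput>source</computeroutput>.</para>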
2345 | |
2346 <para><computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput> | |
2347 will not write data at or beyond | |
2348 <computeroutput>dest[*destLen]</computeroutput>, even in case of | |
2349 buffer overflow.</para> | |
2350 | |
2351 <para>Possible return values:</para> | |
2352 | |
2353 <programlisting> | |
2354 BZ_CONFIG_ERROR | |
2355 if the library has been mis-compiled | |
2356 BZ_PARAM_ERROR | |
2357 if dest is NULL or destLen is NULL | |
2358 or small != 0 && small != 1 | |
2359 or verbosity < 0 or verbosity > 4 | |
2360 BZ_MEM_ERROR | |
2361 if insufficient memory is available | |
BZ_OUTBUFF_FULL
if the size of the uncompressed data exceeds *destLen
2364 BZ_DATA_ERROR | |
2365 if a data integrity error was detected in the compressed data | |
2366 BZ_DATA_ERROR_MAGIC | |
2367 if the compressed data doesn't begin with the right magic bytes | |
2368 BZ_UNEXPECTED_EOF | |
2369 if the compressed data ends unexpectedly | |
2370 BZ_OK | |
2371 otherwise | |
2372 </programlisting> | |
2373 | |
2374 </sect2> | |
2375 | |
2376 </sect1> | |
2377 | |
2378 | |
2379 <sect1 id="zlib-compat" xreflabel="zlib compatibility functions"> | |
2380 <title>zlib compatibility functions</title> | |
2381 | |
<para>Yoshioka Tsuneo has contributed some functions to give
better <computeroutput>zlib</computeroutput> compatibility.
These functions are <computeroutput>BZ2_bzopen</computeroutput>,
<computeroutput>BZ2_bzdopen</computeroutput>,
<computeroutput>BZ2_bzread</computeroutput>,
<computeroutput>BZ2_bzwrite</computeroutput>,
<computeroutput>BZ2_bzflush</computeroutput>,
<computeroutput>BZ2_bzclose</computeroutput>,
<computeroutput>BZ2_bzerror</computeroutput> and
<computeroutput>BZ2_bzlibVersion</computeroutput>. These
functions are not (yet) officially part of the library. If they
break, you get to keep all the pieces. Nevertheless, I think
they work ok.</para>
2394 | |
2395 <programlisting> | |
2396 typedef void BZFILE; | |
2397 | |
2398 const char * BZ2_bzlibVersion ( void ); | |
2399 </programlisting> | |
2400 | |
2401 <para>Returns a string indicating the library version.</para> | |
2402 | |
2403 <programlisting> | |
2404 BZFILE * BZ2_bzopen ( const char *path, const char *mode ); | |
2405 BZFILE * BZ2_bzdopen ( int fd, const char *mode ); | |
2406 </programlisting> | |
2407 | |
2408 <para>Opens a <computeroutput>.bz2</computeroutput> file for | |
2409 reading or writing, using either its name or a pre-existing file | |
2410 descriptor. Analogous to <computeroutput>fopen</computeroutput> | |
2411 and <computeroutput>fdopen</computeroutput>.</para> | |
2412 | |
2413 <programlisting> | |
2414 int BZ2_bzread ( BZFILE* b, void* buf, int len ); | |
2415 int BZ2_bzwrite ( BZFILE* b, void* buf, int len ); | |
2416 </programlisting> | |
2417 | |
2418 <para>Reads/writes data from/to a previously opened | |
2419 <computeroutput>BZFILE</computeroutput>. Analogous to | |
2420 <computeroutput>fread</computeroutput> and | |
2421 <computeroutput>fwrite</computeroutput>.</para> | |
2422 | |
2423 <programlisting> | |
2424 int BZ2_bzflush ( BZFILE* b ); | |
2425 void BZ2_bzclose ( BZFILE* b ); | |
2426 </programlisting> | |
2427 | |
2428 <para>Flushes/closes a <computeroutput>BZFILE</computeroutput>. | |
2429 <computeroutput>BZ2_bzflush</computeroutput> doesn't actually do | |
2430 anything. Analogous to <computeroutput>fflush</computeroutput> | |
2431 and <computeroutput>fclose</computeroutput>.</para> | |
2432 | |
2433 <programlisting> | |
const char * BZ2_bzerror ( BZFILE *b, int *errnum );
2435 </programlisting> | |
2436 | |
<para>Returns a string describing the most recent error status of
<computeroutput>b</computeroutput>, and also sets
<computeroutput>*errnum</computeroutput> to its numerical
value.</para>
2441 | |
2442 </sect1> | |
2443 | |
2444 | |
2445 <sect1 id="stdio-free" | |
2446 xreflabel="Using the library in a stdio-free environment"> | |
2447 <title>Using the library in a stdio-free environment</title> | |
2448 | |
2449 | |
2450 <sect2 id="stdio-bye" xreflabel="Getting rid of stdio"> | |
2451 <title>Getting rid of stdio</title> | |
2452 | |
2453 <para>In a deeply embedded application, you might want to use | |
2454 just the memory-to-memory functions. You can do this | |
2455 conveniently by compiling the library with preprocessor symbol | |
2456 <computeroutput>BZ_NO_STDIO</computeroutput> defined. Doing this | |
2457 gives you a library containing only the following eight | |
2458 functions:</para> | |
2459 | |
<para><computeroutput>BZ2_bzCompressInit</computeroutput>,
<computeroutput>BZ2_bzCompress</computeroutput>,
<computeroutput>BZ2_bzCompressEnd</computeroutput>,
<computeroutput>BZ2_bzDecompressInit</computeroutput>,
<computeroutput>BZ2_bzDecompress</computeroutput>,
<computeroutput>BZ2_bzDecompressEnd</computeroutput>,
<computeroutput>BZ2_bzBuffToBuffCompress</computeroutput>,
<computeroutput>BZ2_bzBuffToBuffDecompress</computeroutput></para>
2468 | |
2469 <para>When compiled like this, all functions will ignore | |
2470 <computeroutput>verbosity</computeroutput> settings.</para> | |
2471 | |
2472 </sect2> | |
2473 | |
2474 | |
2475 <sect2 id="critical-error" xreflabel="Critical error handling"> | |
2476 <title>Critical error handling</title> | |
2477 | |
2478 <para><computeroutput>libbzip2</computeroutput> contains a number | |
2479 of internal assertion checks which should, needless to say, never | |
2480 be activated. Nevertheless, if an assertion should fail, | |
2481 behaviour depends on whether or not the library was compiled with | |
2482 <computeroutput>BZ_NO_STDIO</computeroutput> set.</para> | |
2483 | |
2484 <para>For a normal compile, an assertion failure yields the | |
2485 message:</para> | |
2486 | |
2487 <blockquote> | |
2488 <para>bzip2/libbzip2: internal error number N.</para> | |
2489 <para>This is a bug in bzip2/libbzip2, &bz-version; of &bz-date;. | |
2490 Please report it to me at: &bz-email;. If this happened | |
2491 when you were using some program which uses libbzip2 as a | |
2492 component, you should also report this bug to the author(s) | |
2493 of that program. Please make an effort to report this bug; | |
2494 timely and accurate bug reports eventually lead to higher | |
2495 quality software. Thanks. Julian Seward, &bz-date;. | |
2496 </para></blockquote> | |
2497 | |
<para>where <computeroutput>N</computeroutput> is some error code
number. If <computeroutput>N == 1007</computeroutput>, it also
prints some extra text advising the reader that unreliable memory
is often associated with internal error 1007. (This is a
frequently observed phenomenon with versions 1.0.0/1.0.1).</para>
2503 | |
2504 <para><computeroutput>exit(3)</computeroutput> is then | |
2505 called.</para> | |
2506 | |
2507 <para>For a <computeroutput>stdio</computeroutput>-free library, | |
2508 assertion failures result in a call to a function declared | |
2509 as:</para> | |
2510 | |
2511 <programlisting> | |
2512 extern void bz_internal_error ( int errcode ); | |
2513 </programlisting> | |
2514 | |
2515 <para>The relevant code is passed as a parameter. You should | |
2516 supply such a function.</para> | |
2517 | |
2518 <para>In either case, once an assertion failure has occurred, any | |
2519 <computeroutput>bz_stream</computeroutput> records involved can | |
2520 be regarded as invalid. You should not attempt to resume normal | |
2521 operation with them.</para> | |
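
<para>A minimal sketch of such a function follows, assuming the application
prefers to unwind to a known recovery point rather than halt. The names
<computeroutput>bz_panic_env</computeroutput> and
<computeroutput>bz_panic_code</computeroutput> are hypothetical; only
<computeroutput>bz_internal_error</computeroutput> itself is required by the
library. The function deliberately never returns to its caller, in keeping
with the advice that the <computeroutput>bz_stream</computeroutput> records
involved are invalid afterwards.</para>

```c
#include <setjmp.h>

/* Recovery point and last error code. Both names are hypothetical; the
   library only requires that bz_internal_error itself exists. */
static jmp_buf      bz_panic_env;
static volatile int bz_panic_code = 0;

/* Called by a BZ_NO_STDIO build of libbzip2 when an assertion fails.
   It must not return, because the library state is no longer valid. */
void bz_internal_error(int errcode)
{
    bz_panic_code = errcode;
    longjmp(bz_panic_env, 1);
}
```

<para>The application would call
<computeroutput>setjmp(bz_panic_env)</computeroutput> before entering the
library and, on a non-zero return, discard the
<computeroutput>bz_stream</computeroutput> it was using. Memory the library
had allocated for that stream is not released by this sketch.</para>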
2522 | |
2523 <para>You may, of course, change critical error handling to suit | |
2524 your needs. As I said above, critical errors indicate bugs in | |
2525 the library and should not occur. All "normal" error situations | |
2526 are indicated via error return codes from functions, and can be | |
2527 recovered from.</para> | |
2528 | |
2529 </sect2> | |
2530 | |
2531 </sect1> | |
2532 | |
2533 | |
2534 <sect1 id="win-dll" xreflabel="Making a Windows DLL"> | |
2535 <title>Making a Windows DLL</title> | |
2536 | |
2537 <para>Everything related to Windows has been contributed by | |
2538 Yoshioka Tsuneo | |
2539 (<computeroutput>tsuneo@rr.iij4u.or.jp</computeroutput>), so | |
2540 you should send your queries to him (but perhaps Cc: me, | |
2541 <computeroutput>&bz-email;</computeroutput>).</para> | |
2542 | |
2543 <para>My vague understanding of what to do is: using Visual C++ | |
2544 5.0, open the project file | |
2545 <computeroutput>libbz2.dsp</computeroutput>, and build. That's | |
2546 all.</para> | |
2547 | |
2548 <para>If you can't open the project file for some reason, make a | |
2549 new one, naming these files: | |
2550 <computeroutput>blocksort.c</computeroutput>, | |
2551 <computeroutput>bzlib.c</computeroutput>, | |
2552 <computeroutput>compress.c</computeroutput>, | |
2553 <computeroutput>crctable.c</computeroutput>, | |
2554 <computeroutput>decompress.c</computeroutput>, | |
2555 <computeroutput>huffman.c</computeroutput>, | |
2556 <computeroutput>randtable.c</computeroutput> and | |
2557 <computeroutput>libbz2.def</computeroutput>. You will also need | |
2558 to name the header files <computeroutput>bzlib.h</computeroutput> | |
2559 and <computeroutput>bzlib_private.h</computeroutput>.</para> | |
2560 | |
<para>If you don't use VC++, you may need to define the
preprocessor symbol
<computeroutput>_WIN32</computeroutput>.</para>
2564 | |
2565 <para>Finally, <computeroutput>dlltest.c</computeroutput> is a | |
2566 sample program using the DLL. It has a project file, | |
2567 <computeroutput>dlltest.dsp</computeroutput>.</para> | |
2568 | |
2569 <para>If you just want a makefile for Visual C, have a look at | |
2570 <computeroutput>makefile.msc</computeroutput>.</para> | |
2571 | |
2572 <para>Be aware that if you compile | |
2573 <computeroutput>bzip2</computeroutput> itself on Win32, you must | |
2574 set <computeroutput>BZ_UNIX</computeroutput> to 0 and | |
2575 <computeroutput>BZ_LCCWIN32</computeroutput> to 1, in the file | |
2576 <computeroutput>bzip2.c</computeroutput>, before compiling. | |
2577 Otherwise the resulting binary won't work correctly.</para> | |
2578 | |
2579 <para>I haven't tried any of this stuff myself, but it all looks | |
2580 plausible.</para> | |
2581 | |
2582 </sect1> | |
2583 | |
2584 </chapter> | |
2585 | |
2586 | |
2587 | |
2588 <chapter id="misc" xreflabel="Miscellanea"> | |
2589 <title>Miscellanea</title> | |
2590 | |
2591 <para>These are just some random thoughts of mine. Your mileage | |
2592 may vary.</para> | |
2593 | |
2594 | |
2595 <sect1 id="limits" xreflabel="Limitations of the compressed file format"> | |
2596 <title>Limitations of the compressed file format</title> | |
2597 | |
2598 <para><computeroutput>bzip2-1.0.X</computeroutput>, | |
2599 <computeroutput>0.9.5</computeroutput> and | |
2600 <computeroutput>0.9.0</computeroutput> use exactly the same file | |
2601 format as the original version, | |
2602 <computeroutput>bzip2-0.1</computeroutput>. This decision was | |
2603 made in the interests of stability. Creating yet another | |
2604 incompatible compressed file format would create further | |
2605 confusion and disruption for users.</para> | |
2606 | |
2607 <para>Nevertheless, this is not a painless decision. Development | |
2608 work since the release of | |
2609 <computeroutput>bzip2-0.1</computeroutput> in August 1997 has | |
2610 shown complexities in the file format which slow down | |
2611 decompression and, in retrospect, are unnecessary. These | |
2612 are:</para> | |
2613 | |
2614 <itemizedlist mark='bullet'> | |
2615 | |
2616 <listitem><para>The run-length encoder, which is the first of the | |
2617 compression transformations, is entirely irrelevant. The | |
2618 original purpose was to protect the sorting algorithm from the | |
2619 very worst case input: a string of repeated symbols. But | |
2620 algorithm steps Q6a and Q6b in the original Burrows-Wheeler | |
2621 technical report (SRC-124) show how repeats can be handled | |
2622 without difficulty in block sorting.</para></listitem> | |
2623 | |
2624 <listitem><para>The randomisation mechanism doesn't really need to be | |
2625 there. Udi Manber and Gene Myers published a suffix array | |
2626 construction algorithm a few years back, which can be employed | |
2627 to sort any block, no matter how repetitive, in O(N log N) | |
2628 time. Subsequent work by Kunihiko Sadakane has produced a | |
2629 derivative O(N (log N)^2) algorithm which usually outperforms | |
2630 the Manber-Myers algorithm.</para> | |
2631 | |
2632 <para>I could have changed to Sadakane's algorithm, but I find | |
2633 it to be slower than <computeroutput>bzip2</computeroutput>'s | |
2634 existing algorithm for most inputs, and the randomisation | |
2635 mechanism protects adequately against bad cases. I didn't | |
2636 think it was a good tradeoff to make. Partly this is due to | |
2637 the fact that I was not flooded with email complaints about | |
2638 <computeroutput>bzip2-0.1</computeroutput>'s performance on | |
2639 repetitive data, so perhaps it isn't a problem for real | |
2640 inputs.</para> | |
2641 | |
2642 <para>Probably the best long-term solution, and the one I have | |
2643 incorporated into 0.9.5 and above, is to use the existing | |
2644 sorting algorithm initially, and fall back to a O(N (log N)^2) | |
2645 algorithm if the standard algorithm gets into | |
2646 difficulties.</para></listitem> | |
2647 | |
<listitem><para>The compressed file format was never designed to be
handled by a library, and I have had to jump through some hoops
to produce an efficient implementation of decompression. It's
a bit hairy. Try passing
<computeroutput>decompress.c</computeroutput> through the C
preprocessor and you'll see what I mean. Much of this
complexity could have been avoided if the compressed size of
each block of data was recorded in the data stream.</para></listitem>
2656 | |
2657 <listitem><para>An Adler-32 checksum, rather than a CRC32 checksum, | |
2658 would be faster to compute.</para></listitem> | |
2659 | |
2660 </itemizedlist> | |
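
<para>To make the last point concrete, here is a straightforward sketch of
Adler-32, which needs only two running sums and no lookup table. It is for
illustration only: <computeroutput>bzip2</computeroutput> itself uses CRC32,
the function name <computeroutput>adler32_sketch</computeroutput> is
hypothetical, and a tuned version would defer the modulo operations rather
than apply them on every byte.</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Adler-32 of data[0 .. len-1]: two running sums modulo 65521, packed as
   (b << 16) | a. Illustrative only; bzip2 itself uses CRC32. */
static uint32_t adler32_sketch(const unsigned char *data, size_t len)
{
    uint32_t a = 1, b = 0;
    size_t   i;

    for (i = 0; i < len; i++) {
        a = (a + data[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}
```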
2661 | |
2662 <para>It would be fair to say that the | |
2663 <computeroutput>bzip2</computeroutput> format was frozen before I | |
2664 properly and fully understood the performance consequences of | |
2665 doing so.</para> | |
2666 | |
2667 <para>Improvements which I was able to incorporate into 0.9.0, | |
2668 despite using the same file format, are:</para> | |
2669 | |
2670 <itemizedlist mark='bullet'> | |
2671 | |
2672 <listitem><para>Single array implementation of the inverse BWT. This | |
2673 significantly speeds up decompression, presumably because it | |
2674 reduces the number of cache misses.</para></listitem> | |
2675 | |
2676 <listitem><para>Faster inverse MTF transform for large MTF values. | |
2677 The new implementation is based on the notion of sliding blocks | |
2678 of values.</para></listitem> | |
2679 | |
2680 <listitem><para><computeroutput>bzip2-0.9.0</computeroutput> now reads | |
2681 and writes files with <computeroutput>fread</computeroutput> | |
2682 and <computeroutput>fwrite</computeroutput>; version 0.1 used | |
2683 <computeroutput>putc</computeroutput> and | |
2684 <computeroutput>getc</computeroutput>. Duh! Well, you live | |
2685 and learn.</para></listitem> | |
2686 | |
2687 </itemizedlist> | |
2688 | |
2689 <para>Further ahead, it would be nice to be able to do random | |
2690 access into files. This will require some careful design of | |
2691 compressed file formats.</para> | |
2692 | |
2693 </sect1> | |
2694 | |
2695 | |
2696 <sect1 id="port-issues" xreflabel="Portability issues"> | |
2697 <title>Portability issues</title> | |
2698 | |
2699 <para>After some consideration, I have decided not to use GNU | |
2700 <computeroutput>autoconf</computeroutput> to configure 0.9.5 or | |
2701 1.0.</para> | |
2702 | |
2703 <para><computeroutput>autoconf</computeroutput>, admirable and | |
2704 wonderful though it is, mainly assists with portability problems | |
2705 between Unix-like platforms. But | |
2706 <computeroutput>bzip2</computeroutput> doesn't have much in the | |
2707 way of portability problems on Unix; most of the difficulties | |
2708 appear when porting to the Mac, or to Microsoft's operating | |
2709 systems. <computeroutput>autoconf</computeroutput> doesn't help | |
2710 in those cases, and brings in a whole load of new | |
2711 complexity.</para> | |
2712 | |
2713 <para>Most people should be able to compile the library and | |
2714 program under Unix straight out-of-the-box, so to speak, | |
2715 especially if you have a version of GNU C available.</para> | |
2716 | |
2717 <para>There are a couple of | |
2718 <computeroutput>__inline__</computeroutput> directives in the | |
2719 code. GNU C (<computeroutput>gcc</computeroutput>) should be | |
2720 able to handle them. If you're not using GNU C, your C compiler | |
2721 shouldn't see them at all. If your compiler does, for some | |
2722 reason, see them and doesn't like them, just | |
2723 <computeroutput>#define</computeroutput> | |
2724 <computeroutput>__inline__</computeroutput> to be | |
2725 <computeroutput>/* */</computeroutput>. One easy way to do this | |
2726 is to compile with the flag | |
2727 <computeroutput>-D__inline__=</computeroutput>, which should be | |
2728 understood by most Unix compilers.</para> | |
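
<para>In source form, the same workaround might look like the fragment below.
It is a sketch under assumptions: the guard on
<computeroutput>__GNUC__</computeroutput> and the function
<computeroutput>twice</computeroutput> are illustrative and do not come from
the <computeroutput>bzip2</computeroutput> sources.</para>

```c
/* Only needed when a non-GNU compiler sees __inline__ and rejects it. */
#ifndef __GNUC__
#define __inline__ /* */
#endif

/* Hypothetical function standing in for the library's inlined helpers. */
static __inline__ int twice(int x)
{
    return 2 * x;
}
```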
2729 | |
2730 <para>If you still have difficulties, try compiling with the | |
2731 macro <computeroutput>BZ_STRICT_ANSI</computeroutput> defined. | |
2732 This should enable you to build the library in a strictly ANSI | |
2733 compliant environment. Building the program itself like this is | |
2734 dangerous and not supported, since you remove | |
2735 <computeroutput>bzip2</computeroutput>'s checks against | |
2736 compressing directories, symbolic links, devices, and other | |
2737 not-really-a-file entities. This could cause filesystem | |
2738 corruption!</para> | |
2739 | |
2740 <para>One other thing: if you create a | |
2741 <computeroutput>bzip2</computeroutput> binary for public distribution, | |
2742 please consider linking it statically (<computeroutput>gcc | |
2743 -static</computeroutput>). This avoids all sorts of library-version | |
2744 issues that others may encounter later on.</para> | |
2745 | |
2746 <para>If you build <computeroutput>bzip2</computeroutput> on | |
2747 Win32, you must set <computeroutput>BZ_UNIX</computeroutput> to 0 | |
2748 and <computeroutput>BZ_LCCWIN32</computeroutput> to 1, in the | |
2749 file <computeroutput>bzip2.c</computeroutput>, before compiling. | |
2750 Otherwise the resulting binary won't work correctly.</para> | |
2751 | |
2752 </sect1> | |
2753 | |
2754 | |
2755 <sect1 id="bugs" xreflabel="Reporting bugs"> | |
2756 <title>Reporting bugs</title> | |
2757 | |
2758 <para>I tried pretty hard to make sure | |
2759 <computeroutput>bzip2</computeroutput> is bug free, both by | |
2760 design and by testing. Hopefully you'll never need to read this | |
2761 section for real.</para> | |
2762 | |
<para>Nevertheless, if <computeroutput>bzip2</computeroutput> dies
with a segmentation fault, a bus error or an internal assertion
failure, it will ask you to email me a bug report. Experience from
years of feedback from bzip2 users indicates that almost all these
problems can be traced to either compiler bugs or hardware
problems.</para>
2769 | |
2770 <itemizedlist mark='bullet'> | |
2771 | |
2772 <listitem><para>Recompile the program with no optimisation, and | |
2773 see if it works. And/or try a different compiler. I heard all | |
2774 sorts of stories about various flavours of GNU C (and other | |
2775 compilers) generating bad code for | |
2776 <computeroutput>bzip2</computeroutput>, and I've run across two | |
2777 such examples myself.</para> | |
2778 | |
2779 <para>2.7.X versions of GNU C are known to generate bad code | |
2780 from time to time, at high optimisation levels. If you get | |
2781 problems, try using the flags | |
2782 <computeroutput>-O2</computeroutput> | |
2783 <computeroutput>-fomit-frame-pointer</computeroutput> | |
2784 <computeroutput>-fno-strength-reduce</computeroutput>. You | |
2785 should specifically <emphasis>not</emphasis> use | |
2786 <computeroutput>-funroll-loops</computeroutput>.</para> | |
2787 | |
2788 <para>You may notice that the Makefile runs six tests as part | |
2789 of the build process. If the program passes all of these, it's | |
2790 a pretty good (but not 100%) indication that the compiler has | |
2791 done its job correctly.</para></listitem> | |
2792 | |
2793 <listitem><para>If <computeroutput>bzip2</computeroutput> | |
2794 crashes randomly, and the crashes are not repeatable, you may | |
2795 have a flaky memory subsystem. | |
2796 <computeroutput>bzip2</computeroutput> really hammers your | |
2797 memory hierarchy, and if it's a bit marginal, you may get these | |
2798 problems. Ditto if your disk or I/O subsystem is slowly | |
2799 failing. Yup, this really does happen.</para> | |
2800 | |
2801 <para>Try using a different machine of the same type, and see | |
2802 if you can repeat the problem.</para></listitem> | |
2803 | |
2804 <listitem><para>This isn't really a bug, but ... If | |
2805 <computeroutput>bzip2</computeroutput> tells you your file is | |
2806 corrupted on decompression, and you obtained the file via FTP, | |
2807 there is a possibility that you forgot to tell FTP to do a | |
2808 binary mode transfer. That absolutely will cause the file to | |
2809 be non-decompressible. You'll have to transfer it | |
2810 again.</para></listitem> | |
2811 | |
2812 </itemizedlist> | |
2813 | |
<para>If you've incorporated
<computeroutput>libbzip2</computeroutput> into your own program
and are getting problems, please, please, please, check that the
parameters you are passing in calls to the library are correct
and in accordance with what the documentation says is allowable.
I have tried to make the library robust against such problems,
but I'm sure I haven't succeeded.</para>
2821 | |
2822 <para>Finally, if the above comments don't help, you'll have to | |
2823 send me a bug report. Now, it's just amazing how many people | |
2824 will send me a bug report saying something like:</para> | |
2825 | |
2826 <programlisting> | |
2827 bzip2 crashed with segmentation fault on my machine | |
2828 </programlisting> | |
2829 | |
<para>and absolutely nothing else. Needless to say, such a
report is <emphasis>totally, utterly, completely and
comprehensively 100% useless; a waste of your time, my time, and
net bandwidth</emphasis>. With no details at all, there's no way
I can possibly begin to figure out what the problem is.</para>
2835 | |
2836 <para>The rules of the game are: facts, facts, facts. Don't omit | |
2837 them because "oh, they won't be relevant". At the bare | |
2838 minimum:</para> | |
2839 | |
2840 <programlisting> | |
2841 Machine type. Operating system version. | |
2842 Exact version of bzip2 (do bzip2 -V). | |
2843 Exact version of the compiler used. | |
2844 Flags passed to the compiler. | |
2845 </programlisting> | |
2846 | |
2847 <para>However, the most important single thing that will help me | |
2848 is the file that you were trying to compress or decompress at the | |
2849 time the problem happened. Without that, my ability to do | |
2850 anything more than speculate about the cause is limited.</para> | |
2851 | |
2852 </sect1> | |
2853 | |
2854 | |
2855 <sect1 id="package" xreflabel="Did you get the right package?"> | |
2856 <title>Did you get the right package?</title> | |
2857 | |
2858 <para><computeroutput>bzip2</computeroutput> is a resource hog. | |
2859 It soaks up large amounts of CPU cycles and memory. Also, it | |
2860 gives very large latencies. In the worst case, you can feed many | |
2861 megabytes of uncompressed data into the library before getting | |
2862 any compressed output, so this probably rules out applications | |
2863 requiring interactive behaviour.</para> | |
2864 | |
2865 <para>These aren't faults of my implementation, I hope, but more | |
2866 an intrinsic property of the Burrows-Wheeler transform | |
2867 (unfortunately). Maybe this isn't what you want.</para> | |
2868 | |
2869 <para>If you want a compressor and/or library which is faster, | |
2870 uses less memory but gets pretty good compression, and has | |
2871 minimal latency, consider Jean-loup Gailly's and Mark Adler's | |
2872 work, <computeroutput>zlib-1.2.1</computeroutput> and | |
2873 <computeroutput>gzip-1.2.4</computeroutput>. Look for them at | |
2874 <ulink url="http://www.zlib.org">http://www.zlib.org</ulink> and | |
2875 <ulink url="http://www.gzip.org">http://www.gzip.org</ulink> | |
2876 respectively.</para> | |
2877 | |
2878 <para>For something faster and lighter still, you might try Markus F | |
2879 X J Oberhumer's <computeroutput>LZO</computeroutput> real-time | |
2880 compression/decompression library, at | |
2881 <ulink url="http://www.oberhumer.com/opensource">http://www.oberhumer.com/opensource</ulink>.</para> | |
2882 | |
2883 </sect1> | |
2884 | |
2885 | |
2886 | |
2887 <sect1 id="reading" xreflabel="Further Reading"> | |
2888 <title>Further Reading</title> | |
2889 | |
2890 <para><computeroutput>bzip2</computeroutput> is not research | |
2891 work, in the sense that it doesn't present any new ideas. | |
2892 Rather, it's an engineering exercise based on existing | |
2893 ideas.</para> | |
2894 | |
2895 <para>Four documents describe essentially all the ideas behind | |
2896 <computeroutput>bzip2</computeroutput>:</para> | |
2897 | |
2898 <literallayout>Michael Burrows and D. J. Wheeler: | |
2899 "A block-sorting lossless data compression algorithm" | |
2900 10th May 1994. | |
2901 Digital SRC Research Report 124. | |
2902 ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz | |
2903 If you have trouble finding it, try searching at the | |
2904 New Zealand Digital Library, http://www.nzdl.org. | |
2905 | |
2906 Daniel S. Hirschberg and Debra A. Lelewer | |
2907 "Efficient Decoding of Prefix Codes" | |
2908 Communications of the ACM, April 1990, Vol 33, Number 4. | |
2909 You might be able to get an electronic copy of this | |
2910 from the ACM Digital Library. | |
2911 | |
2912 David J. Wheeler | |
2913 Program bred3.c and accompanying document bred3.ps. | |
2914 This contains the idea behind the multi-table Huffman coding scheme. | |
2915 ftp://ftp.cl.cam.ac.uk/users/djw3/ | |
2916 | |
2917 Jon L. Bentley and Robert Sedgewick | |
2918 "Fast Algorithms for Sorting and Searching Strings" | |
2919 Available from Sedgewick's web page, | |
2920 www.cs.princeton.edu/~rs | |
2921 </literallayout> | |
2922 | |
2923 <para>The following paper gives valuable additional insights into | |
2924 the algorithm, but is not directly the basis of any code used | |
2925 in bzip2.</para> | |
2926 | |
2927 <literallayout>Peter Fenwick: | |
2928 Block Sorting Text Compression | |
2929 Proceedings of the 19th Australasian Computer Science Conference, | |
2930 Melbourne, Australia. Jan 31 - Feb 2, 1996. | |
2931 ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps</literallayout> | |
2932 | |
2933 <para>Kunihiko Sadakane's sorting algorithm, mentioned above, is | |
2934 available from:</para> | |
2935 | |
2936 <literallayout>http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz | |
2937 </literallayout> | |
2938 | |
2939 <para>The Manber-Myers suffix array construction algorithm is | |
2940 described in a paper available from:</para> | |
2941 | |
2942 <literallayout>http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps | |
2943 </literallayout> | |
2944 | |
2945 <para>Finally, the following papers document some | |
2946 investigations I made into the performance of sorting | |
2947 and decompression algorithms:</para> | |
2948 | |
2949 <literallayout>Julian Seward | |
2950 On the Performance of BWT Sorting Algorithms | |
2951 Proceedings of the IEEE Data Compression Conference 2000 | |
2952 Snowbird, Utah. 28-30 March 2000. | |
2953 | |
2954 Julian Seward | |
2955 Space-time Tradeoffs in the Inverse B-W Transform | |
2956 Proceedings of the IEEE Data Compression Conference 2001 | |
2957 Snowbird, Utah. 27-29 March 2001. | |
2958 </literallayout> | |
2959 | |
2960 </sect1> | |
2961 | |
2962 </chapter> | |
2963 | |
2964 </book> |