sv-dependency-builds: src/bzip2-1.0.6/bzip2.1 annotate

annotate src/bzip2-1.0.6/bzip2.1 @ 23:619f715526df sv_v2.1

Update Vamp plugin SDK to 2.5

author	Chris Cannam
date	Thu, 09 May 2013 10:52:46 +0100
parents	e13257ea84a4
children

rev	line source
Chris@4	1 .PU
Chris@4	2 .TH bzip2 1
Chris@4	3 .SH NAME
Chris@4	4 bzip2, bunzip2 \- a block-sorting file compressor, v1.0.6
Chris@4	5 .br
Chris@4	6 bzcat \- decompresses files to stdout
Chris@4	7 .br
Chris@4	8 bzip2recover \- recovers data from damaged bzip2 files
Chris@4	9
Chris@4	10 .SH SYNOPSIS
Chris@4	11 .ll +8
Chris@4	12 .B bzip2
Chris@4	13 .RB [ " \-cdfkqstvzVL123456789 " ]
Chris@4	14 [
Chris@4	15 .I "filenames \&..."
Chris@4	16 ]
Chris@4	17 .ll -8
Chris@4	18 .br
Chris@4	19 .B bunzip2
Chris@4	20 .RB [ " \-fkvsVL " ]
Chris@4	21 [
Chris@4	22 .I "filenames \&..."
Chris@4	23 ]
Chris@4	24 .br
Chris@4	25 .B bzcat
Chris@4	26 .RB [ " \-s " ]
Chris@4	27 [
Chris@4	28 .I "filenames \&..."
Chris@4	29 ]
Chris@4	30 .br
Chris@4	31 .B bzip2recover
Chris@4	32 .I "filename"
Chris@4	33
Chris@4	34 .SH DESCRIPTION
Chris@4	35 .I bzip2
Chris@4	36 compresses files using the Burrows-Wheeler block sorting
Chris@4	37 text compression algorithm, and Huffman coding. Compression is
Chris@4	38 generally considerably better than that achieved by more conventional
Chris@4	39 LZ77/LZ78-based compressors, and approaches the performance of the PPM
Chris@4	40 family of statistical compressors.
Chris@4	41
Chris@4	42 The command-line options are deliberately very similar to
Chris@4	43 those of
Chris@4	44 .I GNU gzip,
Chris@4	45 but they are not identical.
Chris@4	46
Chris@4	47 .I bzip2
Chris@4	48 expects a list of file names to accompany the
Chris@4	49 command-line flags. Each file is replaced by a compressed version of
Chris@4	50 itself, with the name "original_name.bz2".
Chris@4	51 Each compressed file
Chris@4	52 has the same modification date, permissions, and, when possible,
Chris@4	53 ownership as the corresponding original, so that these properties can
Chris@4	54 be correctly restored at decompression time. File name handling is
Chris@4	55 naive in the sense that there is no mechanism for preserving original
Chris@4	56 file names, permissions, ownerships or dates in filesystems which lack
Chris@4	57 these concepts, or have serious file name length restrictions, such as
Chris@4	58 MS-DOS.
Chris@4	59
Chris@4	60 .I bzip2
Chris@4	61 and
Chris@4	62 .I bunzip2
Chris@4	63 will by default not overwrite existing
Chris@4	64 files. If you want this to happen, specify the \-f flag.
Chris@4	65
Chris@4	66 If no file names are specified,
Chris@4	67 .I bzip2
Chris@4	68 compresses from standard
Chris@4	69 input to standard output. In this case,
Chris@4	70 .I bzip2
Chris@4	71 will decline to
Chris@4	72 write compressed output to a terminal, as this would be entirely
Chris@4	73 incomprehensible and therefore pointless.
Chris@4	74
Chris@4	75 .I bunzip2
Chris@4	76 (or
Chris@4	77 .I bzip2 \-d)
Chris@4	78 decompresses all
Chris@4	79 specified files. Files which were not created by
Chris@4	80 .I bzip2
Chris@4	81 will be detected and ignored, and a warning issued.
Chris@4	82 .I bzip2
Chris@4	83 attempts to guess the filename for the decompressed file
Chris@4	84 from that of the compressed file as follows:
Chris@4	85
Chris@4	86 filename.bz2 becomes filename
Chris@4	87 filename.bz becomes filename
Chris@4	88 filename.tbz2 becomes filename.tar
Chris@4	89 filename.tbz becomes filename.tar
Chris@4	90 anyothername becomes anyothername.out
Chris@4	91
Chris@4	92 If the file does not end in one of the recognised endings,
Chris@4	93 .I .bz2,
Chris@4	94 .I .bz,
Chris@4	95 .I .tbz2
Chris@4	96 or
Chris@4	97 .I .tbz,
Chris@4	98 .I bzip2
Chris@4	99 complains that it cannot
Chris@4	100 guess the name of the original file, and uses the original name
Chris@4	101 with
Chris@4	102 .I .out
Chris@4	103 appended.
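The naming rule above can be sketched in a few lines of Python. This is an illustration only, not code taken from bzip2; the names `SUFFIX_MAP` and `guess_output_name` are hypothetical.

```python
# Hypothetical helper mirroring the suffix rules listed above.
# Each pair is (recognised ending, what it is replaced with).
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name implied by the compressed one."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: -len(suffix)] + replacement
    # No recognised ending: keep the name and append ".out".
    return name + ".out"
```

For instance, `guess_output_name("backup.tbz2")` gives `"backup.tar"`, while a name with no recognised ending simply gains `.out`.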
Chris@4	104
Chris@4	105 As with compression, supplying no
Chris@4	106 filenames causes decompression from
Chris@4	107 standard input to standard output.
Chris@4	108
Chris@4	109 .I bunzip2
Chris@4	110 will correctly decompress a file which is the
Chris@4	111 concatenation of two or more compressed files. The result is the
Chris@4	112 concatenation of the corresponding uncompressed files. Integrity
Chris@4	113 testing (\-t)
Chris@4	114 of concatenated
Chris@4	115 compressed files is also supported.
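The concatenation behaviour can be tried out without the command-line tool. The sketch below assumes Python 3.3 or later, whose standard `bz2` module (a wrapper around libbz2, not the bzip2 program itself) accepts multi-stream input.

```python
import bz2

# Two independently compressed pieces, joined byte for byte.
first = bz2.compress(b"hello, ")
second = bz2.compress(b"world\n")
combined = first + second

# Decompressing the concatenation yields the concatenated originals.
restored = bz2.decompress(combined)
```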
Chris@4	116
Chris@4	117 You can also compress or decompress files to the standard output by
Chris@4	118 giving the \-c flag. Multiple files may be compressed and
Chris@4	119 decompressed like this. The resulting outputs are fed sequentially to
Chris@4	120 stdout. Compression of multiple files
Chris@4	121 in this manner generates a stream
Chris@4	122 containing multiple compressed file representations. Such a stream
Chris@4	123 can be decompressed correctly only by
Chris@4	124 .I bzip2
Chris@4	125 version 0.9.0 or
Chris@4	126 later. Earlier versions of
Chris@4	127 .I bzip2
Chris@4	128 will stop after decompressing
Chris@4	129 the first file in the stream.
Chris@4	130
Chris@4	131 .I bzcat
Chris@4	132 (or
Chris@4	133 .I bzip2 -dc)
Chris@4	134 decompresses all specified files to
Chris@4	135 the standard output.
Chris@4	136
Chris@4	137 .I bzip2
Chris@4	138 will read arguments from the environment variables
Chris@4	139 .I BZIP2
Chris@4	140 and
Chris@4	141 .I BZIP,
Chris@4	142 in that order, and will process them
Chris@4	143 before any arguments read from the command line. This gives a
Chris@4	144 convenient way to supply default arguments.
Chris@4	145
Chris@4	146 Compression is always performed, even if the compressed
Chris@4	147 file is slightly
Chris@4	148 larger than the original. Files of less than about one hundred bytes
Chris@4	149 tend to get larger, since the compression mechanism has a constant
Chris@4	150 overhead in the region of 50 bytes. Random data (including the output
Chris@4	151 of most file compressors) is coded at about 8.05 bits per byte, giving
Chris@4	152 an expansion of around 0.5%.
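A rough way to observe this overhead, assuming Python's standard `bz2` module as a stand-in for the bzip2 program; the helper name `bits_per_byte` is hypothetical.

```python
import bz2
import os


def bits_per_byte(data):
    """Compressed size in bits divided by the original size in bytes."""
    return 8 * len(bz2.compress(data)) / len(data)


# A very small input grows, because the fixed overhead dominates.
tiny_cost = bits_per_byte(b"hi")

# Random data is essentially incompressible; the result should sit
# slightly above 8 bits per byte, but the exact value varies per run.
random_cost = bits_per_byte(os.urandom(1_000_000))
```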
Chris@4	153
Chris@4	154 As a self-check for your protection,
Chris@4	155 .I
Chris@4	156 bzip2
Chris@4	157 uses 32-bit CRCs to
Chris@4	158 make sure that the decompressed version of a file is identical to the
Chris@4	159 original. This guards against corruption of the compressed data, and
Chris@4	160 against undetected bugs in
Chris@4	161 .I bzip2
Chris@4	162 (hopefully very unlikely). The
Chris@4	163 chance of data corruption going undetected is microscopic, about one
Chris@4	164 in four billion for each file processed. Be aware, though, that
Chris@4	165 the check occurs upon decompression, so it can only tell you that
Chris@4	166 something is wrong. It can't help you
Chris@4	167 recover the original uncompressed
Chris@4	168 data. You can use
Chris@4	169 .I bzip2recover
Chris@4	170 to try to recover data from
Chris@4	171 damaged files.
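The check can be seen in action with a small experiment. The sketch below uses Python's standard `bz2` module rather than the bzip2 program, and it relies on one layout assumption that this page does not state: that the stored CRC of the first block occupies bytes 10 to 13 of the stream (after a 4-byte stream header and a 6-byte block marker).

```python
import bz2

original = b"block-sorting compression " * 100
packed = bytearray(bz2.compress(original))

# Assumed layout: bytes 0-3 stream header, bytes 4-9 block marker,
# bytes 10-13 the stored CRC of the first block. Flip one bit of it.
packed[10] ^= 0x01

try:
    bz2.decompress(bytes(packed))
    detected = False
except OSError:
    # libbz2 reports the mismatch as a data error; Python raises OSError.
    detected = True
```

As the text says, the failure only signals that something is wrong; nothing in the exception helps to reconstruct the original data.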
Chris@4	172
Chris@4	173 Return values: 0 for a normal exit, 1 for environmental problems (file
Chris@4	174 not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
Chris@4	175 compressed file, 3 for an internal consistency error (e.g. a bug) which
Chris@4	176 caused
Chris@4	177 .I bzip2
Chris@4	178 to panic.
Chris@4	179
Chris@4	180 .SH OPTIONS
Chris@4	181 .TP
Chris@4	182 .B \-c --stdout
Chris@4	183 Compress or decompress to standard output.
Chris@4	184 .TP
Chris@4	185 .B \-d --decompress
Chris@4	186 Force decompression.
Chris@4	187 .I bzip2,
Chris@4	188 .I bunzip2
Chris@4	189 and
Chris@4	190 .I bzcat
Chris@4	191 are
Chris@4	192 really the same program, and the decision about what actions to take is
Chris@4	193 done on the basis of which name is used. This flag overrides that
Chris@4	194 mechanism, and forces
Chris@4	195 .I bzip2
Chris@4	196 to decompress.
Chris@4	197 .TP
Chris@4	198 .B \-z --compress
Chris@4	199 The complement to \-d: forces compression, regardless of the
Chris@4	200 invocation name.
Chris@4	201 .TP
Chris@4	202 .B \-t --test
Chris@4	203 Check integrity of the specified file(s), but don't decompress them.
Chris@4	204 This really performs a trial decompression and throws away the result.
Chris@4	205 .TP
Chris@4	206 .B \-f --force
Chris@4	207 Force overwrite of output files. Normally,
Chris@4	208 .I bzip2
Chris@4	209 will not overwrite
Chris@4	210 existing output files. Also forces
Chris@4	211 .I bzip2
Chris@4	212 to break hard links
Chris@4	213 to files, which it otherwise wouldn't do.
Chris@4	214
Chris@4	215 bzip2 normally declines to decompress files which don't have the
Chris@4	216 correct magic header bytes. If forced (-f), however, it will pass
Chris@4	217 such files through unmodified. This is how GNU gzip behaves.
Chris@4	218 .TP
Chris@4	219 .B \-k --keep
Chris@4	220 Keep (don't delete) input files during compression
Chris@4	221 or decompression.
Chris@4	222 .TP
Chris@4	223 .B \-s --small
Chris@4	224 Reduce memory usage, for compression, decompression and testing. Files
Chris@4	225 are decompressed and tested using a modified algorithm which only
Chris@4	226 requires 2.5 bytes per block byte. This means any file can be
Chris@4	227 decompressed in 2300k of memory, albeit at about half the normal speed.
Chris@4	228
Chris@4	229 During compression, \-s selects a block size of 200k, which limits
Chris@4	230 memory use to around the same figure, at the expense of your compression
Chris@4	231 ratio. In short, if your machine is low on memory (8 megabytes or
Chris@4	232 less), use \-s for everything. See MEMORY MANAGEMENT below.
Chris@4	233 .TP
Chris@4	234 .B \-q --quiet
Chris@4	235 Suppress non-essential warning messages. Messages pertaining to
Chris@4	236 I/O errors and other critical events will not be suppressed.
Chris@4	237 .TP
Chris@4	238 .B \-v --verbose
Chris@4	239 Verbose mode -- show the compression ratio for each file processed.
Chris@4	240 Further \-v's increase the verbosity level, spewing out lots of
Chris@4	241 information which is primarily of interest for diagnostic purposes.
Chris@4	242 .TP
Chris@4	243 .B \-L --license -V --version
Chris@4	244 Display the software version, license terms and conditions.
Chris@4	245 .TP
Chris@4	246 .B \-1 (or \-\-fast) to \-9 (or \-\-best)
Chris@4	247 Set the block size to 100 k, 200 k .. 900 k when compressing. Has no
Chris@4	248 effect when decompressing. See MEMORY MANAGEMENT below.
Chris@4	249 The \-\-fast and \-\-best aliases are primarily for GNU gzip
Chris@4	250 compatibility. In particular, \-\-fast doesn't make things
Chris@4	251 significantly faster.
Chris@4	252 And \-\-best merely selects the default behaviour.
Chris@4	253 .TP
Chris@4	254 .B \--
Chris@4	255 Treats all subsequent arguments as file names, even if they start
Chris@4	256 with a dash. This is so you can handle files with names beginning
Chris@4	257 with a dash, for example: bzip2 \-- \-myfilename.
Chris@4	258 .TP
Chris@4	259 .B \--repetitive-fast --repetitive-best
Chris@4	260 These flags are redundant in versions 0.9.5 and above. They provided
Chris@4	261 some coarse control over the behaviour of the sorting algorithm in
Chris@4	262 earlier versions, which was sometimes useful. 0.9.5 and above have an
Chris@4	263 improved algorithm which renders these flags irrelevant.
Chris@4	264
Chris@4	265 .SH MEMORY MANAGEMENT
Chris@4	266 .I bzip2
Chris@4	267 compresses large files in blocks. The block size affects
Chris@4	268 both the compression ratio achieved, and the amount of memory needed for
Chris@4	269 compression and decompression. The flags \-1 through \-9
Chris@4	270 specify the block size to be 100,000 bytes through 900,000 bytes (the
Chris@4	271 default) respectively. At decompression time, the block size used for
Chris@4	272 compression is read from the header of the compressed file, and
Chris@4	273 .I bunzip2
Chris@4	274 then allocates itself just enough memory to decompress
Chris@4	275 the file. Since block sizes are stored in compressed files, it follows
Chris@4	276 that the flags \-1 to \-9 are irrelevant to and so ignored
Chris@4	277 during decompression.
Chris@4	278
Chris@4	279 Compression and decompression requirements,
Chris@4	280 in bytes, can be estimated as:
Chris@4	281
Chris@4	282 Compression: 400k + ( 8 x block size )
Chris@4	283
Chris@4	284 Decompression: 100k + ( 4 x block size ), or
Chris@4	285 100k + ( 2.5 x block size )
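The estimates above translate directly into code. This sketch assumes that "k" means 1000 bytes and that level N selects a block of N x 100,000 bytes, as described earlier in this section; the function names are hypothetical.

```python
def block_size(level):
    """Block size in bytes for flags -1 through -9."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    return level * 100_000


def compress_memory(level):
    """Estimated bytes needed to compress: 400k + 8 x block size."""
    return 400_000 + 8 * block_size(level)


def decompress_memory(level, small=False):
    """Estimated bytes needed to decompress.

    Normal mode uses 4 bytes per block byte; with -s (small=True)
    the factor drops to 2.5.
    """
    factor = 2.5 if small else 4
    return int(100_000 + factor * block_size(level))
```

With the default level 9 this gives 7,600,000 bytes for compression, 3,700,000 for decompression and 2,350,000 for decompression with -s.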
Chris@4	286
Chris@4	287 Larger block sizes give rapidly diminishing marginal returns. Most of
Chris@4	288 the compression comes from the first two or three hundred k of block
Chris@4	289 size, a fact worth bearing in mind when using
Chris@4	290 .I bzip2
Chris@4	291 on small machines.
Chris@4	292 It is also important to appreciate that the decompression memory
Chris@4	293 requirement is set at compression time by the choice of block size.
Chris@4	294
Chris@4	295 For files compressed with the default 900k block size,
Chris@4	296 .I bunzip2
Chris@4	297 will require about 3700 kbytes to decompress. To support decompression
Chris@4	298 of any file on a 4 megabyte machine,
Chris@4	299 .I bunzip2
Chris@4	300 has an option to
Chris@4	301 decompress using approximately half this amount of memory, about 2300
Chris@4	302 kbytes. Decompression speed is also halved, so you should use this
Chris@4	303 option only where necessary. The relevant flag is -s.
Chris@4	304
Chris@4	305 In general, try to use the largest block size that memory constraints allow,
Chris@4	306 since that maximises the compression achieved. Compression and
Chris@4	307 decompression speed are virtually unaffected by block size.
Chris@4	308
Chris@4	309 Another significant point applies to files which fit in a single block
Chris@4	310 -- that means most files you'd encounter using a large block size. The
Chris@4	311 amount of real memory touched is proportional to the size of the file,
Chris@4	312 since the file is smaller than a block. For example, compressing a file
Chris@4	313 20,000 bytes long with the flag -9 will cause the compressor to
Chris@4	314 allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
Chris@4	315 kbytes of it. Similarly, the decompressor will allocate 3700k but only
Chris@4	316 touch 100k + 20000 * 4 = 180 kbytes.
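The arithmetic in this example can be reproduced as follows. The sketch assumes that the memory touched depends on the smaller of the file size and the block size, which is how the paragraph above reads; the function names are hypothetical and "k" is taken as 1000 bytes.

```python
def touched_when_compressing(file_size, level=9):
    """Bytes actually touched while compressing a file of file_size bytes."""
    working_set = min(file_size, level * 100_000)
    return 400_000 + 8 * working_set


def touched_when_decompressing(file_size, level=9):
    """Bytes actually touched while decompressing (normal mode)."""
    working_set = min(file_size, level * 100_000)
    return 100_000 + 4 * working_set
```

For the 20,000-byte file in the text these return 560,000 and 180,000 bytes respectively.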
Chris@4	317
Chris@4	318 Here is a table which summarises the maximum memory usage for different
Chris@4	319 block sizes. Also recorded is the total compressed size for 14 files of
Chris@4	320 the Calgary Text Compression Corpus totalling 3,141,622 bytes. This
Chris@4	321 column gives some feel for how compression varies with block size.
Chris@4	322 These figures tend to understate the advantage of larger block sizes for
Chris@4	323 larger files, since the Corpus is dominated by smaller files.
Chris@4	324
Chris@4	325            Compress   Decompress   Decompress   Corpus
Chris@4	326     Flag     usage      usage       -s usage     Size
Chris@4	327
Chris@4	328      -1      1200k       500k         350k      914704
Chris@4	329      -2      2000k       900k         600k      877703
Chris@4	330      -3      2800k      1300k         850k      860338
Chris@4	331      -4      3600k      1700k        1100k      846899
Chris@4	332      -5      4400k      2100k        1350k      845160
Chris@4	333      -6      5200k      2500k        1600k      838626
Chris@4	334      -7      6100k      2900k        1850k      834096
Chris@4	335      -8      6800k      3300k        2100k      828642
Chris@4	336      -9      7600k      3700k        2350k      828642
Chris@4	337
Chris@4	338 .SH RECOVERING DATA FROM DAMAGED FILES
Chris@4	339 .I bzip2
Chris@4	340 compresses files in blocks, usually 900kbytes long. Each
Chris@4	341 block is handled independently. If a media or transmission error causes
Chris@4	342 a multi-block .bz2
Chris@4	343 file to become damaged, it may be possible to
Chris@4	344 recover data from the undamaged blocks in the file.
Chris@4	345
Chris@4	346 The compressed representation of each block is delimited by a 48-bit
Chris@4	347 pattern, which makes it possible to find the block boundaries with
Chris@4	348 reasonable certainty. Each block also carries its own 32-bit CRC, so
Chris@4	349 damaged blocks can be distinguished from undamaged ones.
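To make the idea of scanning for block boundaries concrete, here is a minimal sketch in Python. It is not the bzip2recover implementation. It assumes the 48-bit block marker has the value 0x314159265359, which this page does not state, and it uses Python's standard `bz2` module to produce test data. Blocks are not byte-aligned, so the search proceeds bit by bit.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # assumed value of the 48-bit block marker
MASK = (1 << 48) - 1


def find_block_offsets(data):
    """Return the bit offsets at which the block marker appears."""
    total_bits = len(data) * 8
    as_int = int.from_bytes(data, "big")
    offsets = []
    for start in range(total_bits - 48 + 1):
        # Slide a 48-bit window over the stream, one bit at a time.
        shift = total_bits - 48 - start
        if (as_int >> shift) & MASK == BLOCK_MAGIC:
            offsets.append(start)
    return offsets


# 256,000 bytes with no repeated runs, compressed with 100k blocks,
# so the stream should contain more than one block.
payload = bytes(range(256)) * 1000
packed = bz2.compress(payload, compresslevel=1)
offsets = find_block_offsets(packed)
```

The first marker is expected at bit 32, immediately after the 4-byte stream header. A recovery tool would go on to copy each block into its own file and rely on the per-block CRC to tell good blocks from damaged ones.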
Chris@4	350
Chris@4	351 .I bzip2recover
Chris@4	352 is a simple program whose purpose is to search for
Chris@4	353 blocks in .bz2 files, and write each block out into its own .bz2
Chris@4	354 file. You can then use
Chris@4	355 .I bzip2
Chris@4	356 \-t
Chris@4	357 to test the
Chris@4	358 integrity of the resulting files, and decompress those which are
Chris@4	359 undamaged.
Chris@4	360
Chris@4	361 .I bzip2recover
Chris@4	362 takes a single argument, the name of the damaged file,
Chris@4	363 and writes a number of files "rec00001file.bz2",
Chris@4	364 "rec00002file.bz2", etc, containing the extracted blocks.
Chris@4	365 The output filenames are designed so that the use of
Chris@4	366 wildcards in subsequent processing -- for example,
Chris@4	367 "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
Chris@4	368 the correct order.
Chris@4	369
Chris@4	370 .I bzip2recover
Chris@4	371 is of most use when dealing with large .bz2
Chris@4	372 files, as these will contain many blocks. It is clearly
Chris@4	373 futile to use it on damaged single-block files, since a
Chris@4	374 damaged block cannot be recovered. If you wish to minimise
Chris@4	375 any potential data loss through media or transmission errors,
Chris@4	376 you might consider compressing with a smaller
Chris@4	377 block size.
Chris@4	378
Chris@4	379 .SH PERFORMANCE NOTES
Chris@4	380 The sorting phase of compression gathers together similar strings in the
Chris@4	381 file. Because of this, files containing very long runs of repeated
Chris@4	382 symbols, like "aabaabaabaab ..." (repeated several hundred times) may
Chris@4	383 compress more slowly than normal. Versions 0.9.5 and above fare much
Chris@4	384 better than previous versions in this respect. The ratio between
Chris@4	385 worst-case and average-case compression time is in the region of 10:1.
Chris@4	386 For previous versions, this figure was more like 100:1. You can use the
Chris@4	387 \-vvvv option to monitor progress in great detail, if you want.
Chris@4	388
Chris@4	389 Decompression speed is unaffected by these phenomena.
Chris@4	390
Chris@4	391 .I bzip2
Chris@4	392 usually allocates several megabytes of memory to operate
Chris@4	393 in, and then charges all over it in a fairly random fashion. This means
Chris@4	394 that performance, both for compressing and decompressing, is largely
Chris@4	395 determined by the speed at which your machine can service cache misses.
Chris@4	396 Because of this, small changes to the code to reduce the miss rate have
Chris@4	397 been observed to give disproportionately large performance improvements.
Chris@4	398 I imagine
Chris@4	399 .I bzip2
Chris@4	400 will perform best on machines with very large caches.
Chris@4	401
Chris@4	402 .SH CAVEATS
Chris@4	403 I/O error messages are not as helpful as they could be.
Chris@4	404 .I bzip2
Chris@4	405 tries hard to detect I/O errors and exit cleanly, but the details of
Chris@4	406 what the problem is sometimes seem rather misleading.
Chris@4	407
Chris@4	408 This manual page pertains to version 1.0.6 of
Chris@4	409 .I bzip2.
Chris@4	410 Compressed data created by this version is entirely forwards and
Chris@4	411 backwards compatible with the previous public releases, versions
Chris@4	412 0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and above, but with the following
Chris@4	413 exception: 0.9.0 and above can correctly decompress multiple
Chris@4	414 concatenated compressed files. 0.1pl2 cannot do this; it will stop
Chris@4	415 after decompressing just the first file in the stream.
Chris@4	416
Chris@4	417 .I bzip2recover
Chris@4	418 versions prior to 1.0.2 used 32-bit integers to represent
Chris@4	419 bit positions in compressed files, so they could not handle compressed
Chris@4	420 files more than 512 megabytes long. Versions 1.0.2 and above use
Chris@4	421 64-bit ints on some platforms which support them (GNU supported
Chris@4	422 targets, and Windows). To establish whether or not bzip2recover was
Chris@4	423 built with such a limitation, run it without arguments. In any event
Chris@4	424 you can build yourself an unlimited version if you can recompile it
Chris@4	425 with MaybeUInt64 set to be an unsigned 64-bit integer.
Chris@4	426
Chris@4	427
Chris@4	428
Chris@4	429 .SH AUTHOR
Chris@4	430 Julian Seward, jseward@bzip.org.
Chris@4	431
Chris@4	432 http://www.bzip.org
Chris@4	433
Chris@4	434 The ideas embodied in
Chris@4	435 .I bzip2
Chris@4	436 are due to (at least) the following
Chris@4	437 people: Michael Burrows and David Wheeler (for the block sorting
Chris@4	438 transformation), David Wheeler (again, for the Huffman coder), Peter
Chris@4	439 Fenwick (for the structured coding model in the original
Chris@4	440 .I bzip,
Chris@4	441 and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
Chris@4	442 (for the arithmetic coder in the original
Chris@4	443 .I bzip).
Chris@4	444 I am much
Chris@4	445 indebted for their help, support and advice. See the manual in the
Chris@4	446 source distribution for pointers to sources of documentation. Christian
Chris@4	447 von Roques encouraged me to look for faster sorting algorithms, so as to
Chris@4	448 speed up compression. Bela Lubkin encouraged me to improve the
Chris@4	449 worst-case compression performance.
Chris@4	450 Donna Robinson XMLised the documentation.
Chris@4	451 The bz* scripts are derived from those of GNU gzip.
Chris@4	452 Many people sent patches, helped
Chris@4	453 with portability problems, lent machines, gave advice and were generally
Chris@4	454 helpful.
