Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Network Working Group P. Deutsch Chris@43: Request for Comments: 1950 Aladdin Enterprises Chris@43: Category: Informational J-L. Gailly Chris@43: Info-ZIP Chris@43: May 1996 Chris@43: Chris@43: Chris@43: ZLIB Compressed Data Format Specification version 3.3 Chris@43: Chris@43: Status of This Memo Chris@43: Chris@43: This memo provides information for the Internet community. This memo Chris@43: does not specify an Internet standard of any kind. Distribution of Chris@43: this memo is unlimited. Chris@43: Chris@43: IESG Note: Chris@43: Chris@43: The IESG takes no position on the validity of any Intellectual Chris@43: Property Rights statements contained in this document. Chris@43: Chris@43: Notices Chris@43: Chris@43: Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly Chris@43: Chris@43: Permission is granted to copy and distribute this document for any Chris@43: purpose and without charge, including translations into other Chris@43: languages and incorporation into compilations, provided that the Chris@43: copyright notice and this notice are preserved, and that any Chris@43: substantive changes or deletions from the original are clearly Chris@43: marked. Chris@43: Chris@43: A pointer to the latest version of this and related documentation in Chris@43: HTML format can be found at the URL Chris@43: . Chris@43: Chris@43: Abstract Chris@43: Chris@43: This specification defines a lossless compressed data format. The Chris@43: data can be produced or consumed, even for an arbitrarily long Chris@43: sequentially presented input data stream, using only an a priori Chris@43: bounded amount of intermediate storage. The format presently uses Chris@43: the DEFLATE compression method but can be easily extended to use Chris@43: other compression methods. It can be implemented readily in a manner Chris@43: not covered by patents. This specification also defines the ADLER-32 Chris@43: checksum (an extension and improvement of the Fletcher checksum), Chris@43: used for detection of data corruption, and provides an algorithm for Chris@43: computing it. Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 1] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: Table of Contents Chris@43: Chris@43: 1. Introduction ................................................... 2 Chris@43: 1.1. Purpose ................................................... 2 Chris@43: 1.2. Intended audience ......................................... 3 Chris@43: 1.3. Scope ..................................................... 3 Chris@43: 1.4. Compliance ................................................ 3 Chris@43: 1.5. Definitions of terms and conventions used ................ 3 Chris@43: 1.6. Changes from previous versions ............................ 3 Chris@43: 2. Detailed specification ......................................... 3 Chris@43: 2.1. Overall conventions ....................................... 3 Chris@43: 2.2. Data format ............................................... 4 Chris@43: 2.3. Compliance ................................................ 7 Chris@43: 3. References ..................................................... 7 Chris@43: 4. Source code .................................................... 8 Chris@43: 5. Security Considerations ........................................ 8 Chris@43: 6. Acknowledgements ............................................... 8 Chris@43: 7. Authors' Addresses ............................................. 8 Chris@43: 8. Appendix: Rationale ............................................ 9 Chris@43: 9. Appendix: Sample code ..........................................10 Chris@43: Chris@43: 1. Introduction Chris@43: Chris@43: 1.1. Purpose Chris@43: Chris@43: The purpose of this specification is to define a lossless Chris@43: compressed data format that: Chris@43: Chris@43: * Is independent of CPU type, operating system, file system, Chris@43: and character set, and hence can be used for interchange; Chris@43: Chris@43: * Can be produced or consumed, even for an arbitrarily long Chris@43: sequentially presented input data stream, using only an a Chris@43: priori bounded amount of intermediate storage, and hence can Chris@43: be used in data communications or similar structures such as Chris@43: Unix filters; Chris@43: Chris@43: * Can use a number of different compression methods; Chris@43: Chris@43: * Can be implemented readily in a manner not covered by Chris@43: patents, and hence can be practiced freely. Chris@43: Chris@43: The data format defined by this specification does not attempt to Chris@43: allow random access to compressed data. Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 2] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: 1.2. Intended audience Chris@43: Chris@43: This specification is intended for use by implementors of software Chris@43: to compress data into zlib format and/or decompress data from zlib Chris@43: format. Chris@43: Chris@43: The text of the specification assumes a basic background in Chris@43: programming at the level of bits and other primitive data Chris@43: representations. Chris@43: Chris@43: 1.3. Scope Chris@43: Chris@43: The specification specifies a compressed data format that can be Chris@43: used for in-memory compression of a sequence of arbitrary bytes. Chris@43: Chris@43: 1.4. Compliance Chris@43: Chris@43: Unless otherwise indicated below, a compliant decompressor must be Chris@43: able to accept and decompress any data set that conforms to all Chris@43: the specifications presented here; a compliant compressor must Chris@43: produce data sets that conform to all the specifications presented Chris@43: here. Chris@43: Chris@43: 1.5. Definitions of terms and conventions used Chris@43: Chris@43: byte: 8 bits stored or transmitted as a unit (same as an octet). Chris@43: (For this specification, a byte is exactly 8 bits, even on Chris@43: machines which store a character on a number of bits different Chris@43: from 8.) See below, for the numbering of bits within a byte. Chris@43: Chris@43: 1.6. Changes from previous versions Chris@43: Chris@43: Version 3.1 was the first public release of this specification. Chris@43: In version 3.2, some terminology was changed and the Adler-32 Chris@43: sample code was rewritten for clarity. In version 3.3, the Chris@43: support for a preset dictionary was introduced, and the Chris@43: specification was converted to RFC style. Chris@43: Chris@43: 2. Detailed specification Chris@43: Chris@43: 2.1. Overall conventions Chris@43: Chris@43: In the diagrams below, a box like this: Chris@43: Chris@43: +---+ Chris@43: | | <-- the vertical bars might be missing Chris@43: +---+ Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 3] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: represents one byte; a box like this: Chris@43: Chris@43: +==============+ Chris@43: | | Chris@43: +==============+ Chris@43: Chris@43: represents a variable number of bytes. Chris@43: Chris@43: Bytes stored within a computer do not have a "bit order", since Chris@43: they are always treated as a unit. However, a byte considered as Chris@43: an integer between 0 and 255 does have a most- and least- Chris@43: significant bit, and since we write numbers with the most- Chris@43: significant digit on the left, we also write bytes with the most- Chris@43: significant bit on the left. In the diagrams below, we number the Chris@43: bits of a byte so that bit 0 is the least-significant bit, i.e., Chris@43: the bits are numbered: Chris@43: Chris@43: +--------+ Chris@43: |76543210| Chris@43: +--------+ Chris@43: Chris@43: Within a computer, a number may occupy multiple bytes. All Chris@43: multi-byte numbers in the format described here are stored with Chris@43: the MOST-significant byte first (at the lower memory address). Chris@43: For example, the decimal number 520 is stored as: Chris@43: Chris@43: 0 1 Chris@43: +--------+--------+ Chris@43: |00000010|00001000| Chris@43: +--------+--------+ Chris@43: ^ ^ Chris@43: | | Chris@43: | + less significant byte = 8 Chris@43: + more significant byte = 2 x 256 Chris@43: Chris@43: 2.2. Data format Chris@43: Chris@43: A zlib stream has the following structure: Chris@43: Chris@43: 0 1 Chris@43: +---+---+ Chris@43: |CMF|FLG| (more-->) Chris@43: +---+---+ Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 4] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: (if FLG.FDICT set) Chris@43: Chris@43: 0 1 2 3 Chris@43: +---+---+---+---+ Chris@43: | DICTID | (more-->) Chris@43: +---+---+---+---+ Chris@43: Chris@43: +=====================+---+---+---+---+ Chris@43: |...compressed data...| ADLER32 | Chris@43: +=====================+---+---+---+---+ Chris@43: Chris@43: Any data which may appear after ADLER32 are not part of the zlib Chris@43: stream. Chris@43: Chris@43: CMF (Compression Method and flags) Chris@43: This byte is divided into a 4-bit compression method and a 4- Chris@43: bit information field depending on the compression method. Chris@43: Chris@43: bits 0 to 3 CM Compression method Chris@43: bits 4 to 7 CINFO Compression info Chris@43: Chris@43: CM (Compression method) Chris@43: This identifies the compression method used in the file. CM = 8 Chris@43: denotes the "deflate" compression method with a window size up Chris@43: to 32K. This is the method used by gzip and PNG (see Chris@43: references [1] and [2] in Chapter 3, below, for the reference Chris@43: documents). CM = 15 is reserved. It might be used in a future Chris@43: version of this specification to indicate the presence of an Chris@43: extra field before the compressed data. Chris@43: Chris@43: CINFO (Compression info) Chris@43: For CM = 8, CINFO is the base-2 logarithm of the LZ77 window Chris@43: size, minus eight (CINFO=7 indicates a 32K window size). Values Chris@43: of CINFO above 7 are not allowed in this version of the Chris@43: specification. CINFO is not defined in this specification for Chris@43: CM not equal to 8. Chris@43: Chris@43: FLG (FLaGs) Chris@43: This flag byte is divided as follows: Chris@43: Chris@43: bits 0 to 4 FCHECK (check bits for CMF and FLG) Chris@43: bit 5 FDICT (preset dictionary) Chris@43: bits 6 to 7 FLEVEL (compression level) Chris@43: Chris@43: The FCHECK value must be such that CMF and FLG, when viewed as Chris@43: a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG), Chris@43: is a multiple of 31. Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 5] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: FDICT (Preset dictionary) Chris@43: If FDICT is set, a DICT dictionary identifier is present Chris@43: immediately after the FLG byte. The dictionary is a sequence of Chris@43: bytes which are initially fed to the compressor without Chris@43: producing any compressed output. DICT is the Adler-32 checksum Chris@43: of this sequence of bytes (see the definition of ADLER32 Chris@43: below). The decompressor can use this identifier to determine Chris@43: which dictionary has been used by the compressor. Chris@43: Chris@43: FLEVEL (Compression level) Chris@43: These flags are available for use by specific compression Chris@43: methods. The "deflate" method (CM = 8) sets these flags as Chris@43: follows: Chris@43: Chris@43: 0 - compressor used fastest algorithm Chris@43: 1 - compressor used fast algorithm Chris@43: 2 - compressor used default algorithm Chris@43: 3 - compressor used maximum compression, slowest algorithm Chris@43: Chris@43: The information in FLEVEL is not needed for decompression; it Chris@43: is there to indicate if recompression might be worthwhile. Chris@43: Chris@43: compressed data Chris@43: For compression method 8, the compressed data is stored in the Chris@43: deflate compressed data format as described in the document Chris@43: "DEFLATE Compressed Data Format Specification" by L. Peter Chris@43: Deutsch. (See reference [3] in Chapter 3, below) Chris@43: Chris@43: Other compressed data formats are not specified in this version Chris@43: of the zlib specification. Chris@43: Chris@43: ADLER32 (Adler-32 checksum) Chris@43: This contains a checksum value of the uncompressed data Chris@43: (excluding any dictionary data) computed according to Adler-32 Chris@43: algorithm. This algorithm is a 32-bit extension and improvement Chris@43: of the Fletcher algorithm, used in the ITU-T X.224 / ISO 8073 Chris@43: standard. See references [4] and [5] in Chapter 3, below) Chris@43: Chris@43: Adler-32 is composed of two sums accumulated per byte: s1 is Chris@43: the sum of all bytes, s2 is the sum of all s1 values. Both sums Chris@43: are done modulo 65521. s1 is initialized to 1, s2 to zero. The Chris@43: Adler-32 checksum is stored as s2*65536 + s1 in most- Chris@43: significant-byte first (network) order. Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 6] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: 2.3. Compliance Chris@43: Chris@43: A compliant compressor must produce streams with correct CMF, FLG Chris@43: and ADLER32, but need not support preset dictionaries. When the Chris@43: zlib data format is used as part of another standard data format, Chris@43: the compressor may use only preset dictionaries that are specified Chris@43: by this other data format. If this other format does not use the Chris@43: preset dictionary feature, the compressor must not set the FDICT Chris@43: flag. Chris@43: Chris@43: A compliant decompressor must check CMF, FLG, and ADLER32, and Chris@43: provide an error indication if any of these have incorrect values. Chris@43: A compliant decompressor must give an error indication if CM is Chris@43: not one of the values defined in this specification (only the Chris@43: value 8 is permitted in this version), since another value could Chris@43: indicate the presence of new features that would cause subsequent Chris@43: data to be interpreted incorrectly. A compliant decompressor must Chris@43: give an error indication if FDICT is set and DICTID is not the Chris@43: identifier of a known preset dictionary. A decompressor may Chris@43: ignore FLEVEL and still be compliant. When the zlib data format Chris@43: is being used as a part of another standard format, a compliant Chris@43: decompressor must support all the preset dictionaries specified by Chris@43: the other format. When the other format does not use the preset Chris@43: dictionary feature, a compliant decompressor must reject any Chris@43: stream in which the FDICT flag is set. Chris@43: Chris@43: 3. References Chris@43: Chris@43: [1] Deutsch, L.P.,"GZIP Compressed Data Format Specification", Chris@43: available in ftp://ftp.uu.net/pub/archiving/zip/doc/ Chris@43: Chris@43: [2] Thomas Boutell, "PNG (Portable Network Graphics) specification", Chris@43: available in ftp://ftp.uu.net/graphics/png/documents/ Chris@43: Chris@43: [3] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", Chris@43: available in ftp://ftp.uu.net/pub/archiving/zip/doc/ Chris@43: Chris@43: [4] Fletcher, J. G., "An Arithmetic Checksum for Serial Chris@43: Transmissions," IEEE Transactions on Communications, Vol. COM-30, Chris@43: No. 1, January 1982, pp. 247-252. Chris@43: Chris@43: [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms," Chris@43: November, 1993, pp. 144, 145. (Available from Chris@43: gopher://info.itu.ch). ITU-T X.244 is also the same as ISO 8073. Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 7] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: 4. Source code Chris@43: Chris@43: Source code for a C language implementation of a "zlib" compliant Chris@43: library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/. Chris@43: Chris@43: 5. Security Considerations Chris@43: Chris@43: A decoder that fails to check the ADLER32 checksum value may be Chris@43: subject to undetected data corruption. Chris@43: Chris@43: 6. Acknowledgements Chris@43: Chris@43: Trademarks cited in this document are the property of their Chris@43: respective owners. Chris@43: Chris@43: Jean-Loup Gailly and Mark Adler designed the zlib format and wrote Chris@43: the related software described in this specification. Glenn Chris@43: Randers-Pehrson converted this document to RFC and HTML format. Chris@43: Chris@43: 7. Authors' Addresses Chris@43: Chris@43: L. Peter Deutsch Chris@43: Aladdin Enterprises Chris@43: 203 Santa Margarita Ave. Chris@43: Menlo Park, CA 94025 Chris@43: Chris@43: Phone: (415) 322-0103 (AM only) Chris@43: FAX: (415) 322-1734 Chris@43: EMail: Chris@43: Chris@43: Chris@43: Jean-Loup Gailly Chris@43: Chris@43: EMail: Chris@43: Chris@43: Questions about the technical content of this specification can be Chris@43: sent by email to Chris@43: Chris@43: Jean-Loup Gailly and Chris@43: Mark Adler Chris@43: Chris@43: Editorial comments on this specification can be sent by email to Chris@43: Chris@43: L. Peter Deutsch and Chris@43: Glenn Randers-Pehrson Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 8] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: 8. Appendix: Rationale Chris@43: Chris@43: 8.1. Preset dictionaries Chris@43: Chris@43: A preset dictionary is specially useful to compress short input Chris@43: sequences. The compressor can take advantage of the dictionary Chris@43: context to encode the input in a more compact manner. The Chris@43: decompressor can be initialized with the appropriate context by Chris@43: virtually decompressing a compressed version of the dictionary Chris@43: without producing any output. However for certain compression Chris@43: algorithms such as the deflate algorithm this operation can be Chris@43: achieved without actually performing any decompression. Chris@43: Chris@43: The compressor and the decompressor must use exactly the same Chris@43: dictionary. The dictionary may be fixed or may be chosen among a Chris@43: certain number of predefined dictionaries, according to the kind Chris@43: of input data. The decompressor can determine which dictionary has Chris@43: been chosen by the compressor by checking the dictionary Chris@43: identifier. This document does not specify the contents of Chris@43: predefined dictionaries, since the optimal dictionaries are Chris@43: application specific. Standard data formats using this feature of Chris@43: the zlib specification must precisely define the allowed Chris@43: dictionaries. Chris@43: Chris@43: 8.2. The Adler-32 algorithm Chris@43: Chris@43: The Adler-32 algorithm is much faster than the CRC32 algorithm yet Chris@43: still provides an extremely low probability of undetected errors. Chris@43: Chris@43: The modulo on unsigned long accumulators can be delayed for 5552 Chris@43: bytes, so the modulo operation time is negligible. If the bytes Chris@43: are a, b, c, the second sum is 3a + 2b + c + 3, and so is position Chris@43: and order sensitive, unlike the first sum, which is just a Chris@43: checksum. That 65521 is prime is important to avoid a possible Chris@43: large class of two-byte errors that leave the check unchanged. Chris@43: (The Fletcher checksum uses 255, which is not prime and which also Chris@43: makes the Fletcher check insensitive to single byte changes 0 <-> Chris@43: 255.) Chris@43: Chris@43: The sum s1 is initialized to 1 instead of zero to make the length Chris@43: of the sequence part of s2, so that the length does not have to be Chris@43: checked separately. (Any sequence of zeroes has a Fletcher Chris@43: checksum of zero.) Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 9] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: 9. Appendix: Sample code Chris@43: Chris@43: The following C code computes the Adler-32 checksum of a data buffer. Chris@43: It is written for clarity, not for speed. The sample code is in the Chris@43: ANSI C programming language. Non C users may find it easier to read Chris@43: with these hints: Chris@43: Chris@43: & Bitwise AND operator. Chris@43: >> Bitwise right shift operator. When applied to an Chris@43: unsigned quantity, as here, right shift inserts zero bit(s) Chris@43: at the left. Chris@43: << Bitwise left shift operator. Left shift inserts zero Chris@43: bit(s) at the right. Chris@43: ++ "n++" increments the variable n. Chris@43: % modulo operator: a % b is the remainder of a divided by b. Chris@43: Chris@43: #define BASE 65521 /* largest prime smaller than 65536 */ Chris@43: Chris@43: /* Chris@43: Update a running Adler-32 checksum with the bytes buf[0..len-1] Chris@43: and return the updated checksum. The Adler-32 checksum should be Chris@43: initialized to 1. Chris@43: Chris@43: Usage example: Chris@43: Chris@43: unsigned long adler = 1L; Chris@43: Chris@43: while (read_buffer(buffer, length) != EOF) { Chris@43: adler = update_adler32(adler, buffer, length); Chris@43: } Chris@43: if (adler != original_adler) error(); Chris@43: */ Chris@43: unsigned long update_adler32(unsigned long adler, Chris@43: unsigned char *buf, int len) Chris@43: { Chris@43: unsigned long s1 = adler & 0xffff; Chris@43: unsigned long s2 = (adler >> 16) & 0xffff; Chris@43: int n; Chris@43: Chris@43: for (n = 0; n < len; n++) { Chris@43: s1 = (s1 + buf[n]) % BASE; Chris@43: s2 = (s2 + s1) % BASE; Chris@43: } Chris@43: return (s2 << 16) + s1; Chris@43: } Chris@43: Chris@43: /* Return the adler32 of the bytes buf[0..len-1] */ Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 10] Chris@43: Chris@43: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@43: Chris@43: Chris@43: unsigned long adler32(unsigned char *buf, int len) Chris@43: { Chris@43: return update_adler32(1L, buf, len); Chris@43: } Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Chris@43: Deutsch & Gailly Informational [Page 11] Chris@43: