Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Network Working Group P. Deutsch Chris@4: Request for Comments: 1950 Aladdin Enterprises Chris@4: Category: Informational J-L. Gailly Chris@4: Info-ZIP Chris@4: May 1996 Chris@4: Chris@4: Chris@4: ZLIB Compressed Data Format Specification version 3.3 Chris@4: Chris@4: Status of This Memo Chris@4: Chris@4: This memo provides information for the Internet community. This memo Chris@4: does not specify an Internet standard of any kind. Distribution of Chris@4: this memo is unlimited. Chris@4: Chris@4: IESG Note: Chris@4: Chris@4: The IESG takes no position on the validity of any Intellectual Chris@4: Property Rights statements contained in this document. Chris@4: Chris@4: Notices Chris@4: Chris@4: Copyright (c) 1996 L. Peter Deutsch and Jean-Loup Gailly Chris@4: Chris@4: Permission is granted to copy and distribute this document for any Chris@4: purpose and without charge, including translations into other Chris@4: languages and incorporation into compilations, provided that the Chris@4: copyright notice and this notice are preserved, and that any Chris@4: substantive changes or deletions from the original are clearly Chris@4: marked. Chris@4: Chris@4: A pointer to the latest version of this and related documentation in Chris@4: HTML format can be found at the URL Chris@4: . Chris@4: Chris@4: Abstract Chris@4: Chris@4: This specification defines a lossless compressed data format. The Chris@4: data can be produced or consumed, even for an arbitrarily long Chris@4: sequentially presented input data stream, using only an a priori Chris@4: bounded amount of intermediate storage. The format presently uses Chris@4: the DEFLATE compression method but can be easily extended to use Chris@4: other compression methods. It can be implemented readily in a manner Chris@4: not covered by patents. This specification also defines the ADLER-32 Chris@4: checksum (an extension and improvement of the Fletcher checksum), Chris@4: used for detection of data corruption, and provides an algorithm for Chris@4: computing it. Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 1] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: Table of Contents Chris@4: Chris@4: 1. Introduction ................................................... 2 Chris@4: 1.1. Purpose ................................................... 2 Chris@4: 1.2. Intended audience ......................................... 3 Chris@4: 1.3. Scope ..................................................... 3 Chris@4: 1.4. Compliance ................................................ 3 Chris@4: 1.5. Definitions of terms and conventions used ................ 3 Chris@4: 1.6. Changes from previous versions ............................ 3 Chris@4: 2. Detailed specification ......................................... 3 Chris@4: 2.1. Overall conventions ....................................... 3 Chris@4: 2.2. Data format ............................................... 4 Chris@4: 2.3. Compliance ................................................ 7 Chris@4: 3. References ..................................................... 7 Chris@4: 4. Source code .................................................... 8 Chris@4: 5. Security Considerations ........................................ 8 Chris@4: 6. Acknowledgements ............................................... 8 Chris@4: 7. Authors' Addresses ............................................. 8 Chris@4: 8. Appendix: Rationale ............................................ 9 Chris@4: 9. Appendix: Sample code ..........................................10 Chris@4: Chris@4: 1. Introduction Chris@4: Chris@4: 1.1. Purpose Chris@4: Chris@4: The purpose of this specification is to define a lossless Chris@4: compressed data format that: Chris@4: Chris@4: * Is independent of CPU type, operating system, file system, Chris@4: and character set, and hence can be used for interchange; Chris@4: Chris@4: * Can be produced or consumed, even for an arbitrarily long Chris@4: sequentially presented input data stream, using only an a Chris@4: priori bounded amount of intermediate storage, and hence can Chris@4: be used in data communications or similar structures such as Chris@4: Unix filters; Chris@4: Chris@4: * Can use a number of different compression methods; Chris@4: Chris@4: * Can be implemented readily in a manner not covered by Chris@4: patents, and hence can be practiced freely. Chris@4: Chris@4: The data format defined by this specification does not attempt to Chris@4: allow random access to compressed data. Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 2] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: 1.2. Intended audience Chris@4: Chris@4: This specification is intended for use by implementors of software Chris@4: to compress data into zlib format and/or decompress data from zlib Chris@4: format. Chris@4: Chris@4: The text of the specification assumes a basic background in Chris@4: programming at the level of bits and other primitive data Chris@4: representations. Chris@4: Chris@4: 1.3. Scope Chris@4: Chris@4: The specification specifies a compressed data format that can be Chris@4: used for in-memory compression of a sequence of arbitrary bytes. Chris@4: Chris@4: 1.4. Compliance Chris@4: Chris@4: Unless otherwise indicated below, a compliant decompressor must be Chris@4: able to accept and decompress any data set that conforms to all Chris@4: the specifications presented here; a compliant compressor must Chris@4: produce data sets that conform to all the specifications presented Chris@4: here. Chris@4: Chris@4: 1.5. Definitions of terms and conventions used Chris@4: Chris@4: byte: 8 bits stored or transmitted as a unit (same as an octet). Chris@4: (For this specification, a byte is exactly 8 bits, even on Chris@4: machines which store a character on a number of bits different Chris@4: from 8.) See below, for the numbering of bits within a byte. Chris@4: Chris@4: 1.6. Changes from previous versions Chris@4: Chris@4: Version 3.1 was the first public release of this specification. Chris@4: In version 3.2, some terminology was changed and the Adler-32 Chris@4: sample code was rewritten for clarity. In version 3.3, the Chris@4: support for a preset dictionary was introduced, and the Chris@4: specification was converted to RFC style. Chris@4: Chris@4: 2. Detailed specification Chris@4: Chris@4: 2.1. Overall conventions Chris@4: Chris@4: In the diagrams below, a box like this: Chris@4: Chris@4: +---+ Chris@4: | | <-- the vertical bars might be missing Chris@4: +---+ Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 3] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: represents one byte; a box like this: Chris@4: Chris@4: +==============+ Chris@4: | | Chris@4: +==============+ Chris@4: Chris@4: represents a variable number of bytes. Chris@4: Chris@4: Bytes stored within a computer do not have a "bit order", since Chris@4: they are always treated as a unit. However, a byte considered as Chris@4: an integer between 0 and 255 does have a most- and least- Chris@4: significant bit, and since we write numbers with the most- Chris@4: significant digit on the left, we also write bytes with the most- Chris@4: significant bit on the left. In the diagrams below, we number the Chris@4: bits of a byte so that bit 0 is the least-significant bit, i.e., Chris@4: the bits are numbered: Chris@4: Chris@4: +--------+ Chris@4: |76543210| Chris@4: +--------+ Chris@4: Chris@4: Within a computer, a number may occupy multiple bytes. All Chris@4: multi-byte numbers in the format described here are stored with Chris@4: the MOST-significant byte first (at the lower memory address). Chris@4: For example, the decimal number 520 is stored as: Chris@4: Chris@4: 0 1 Chris@4: +--------+--------+ Chris@4: |00000010|00001000| Chris@4: +--------+--------+ Chris@4: ^ ^ Chris@4: | | Chris@4: | + less significant byte = 8 Chris@4: + more significant byte = 2 x 256 Chris@4: Chris@4: 2.2. Data format Chris@4: Chris@4: A zlib stream has the following structure: Chris@4: Chris@4: 0 1 Chris@4: +---+---+ Chris@4: |CMF|FLG| (more-->) Chris@4: +---+---+ Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 4] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: (if FLG.FDICT set) Chris@4: Chris@4: 0 1 2 3 Chris@4: +---+---+---+---+ Chris@4: | DICTID | (more-->) Chris@4: +---+---+---+---+ Chris@4: Chris@4: +=====================+---+---+---+---+ Chris@4: |...compressed data...| ADLER32 | Chris@4: +=====================+---+---+---+---+ Chris@4: Chris@4: Any data which may appear after ADLER32 are not part of the zlib Chris@4: stream. Chris@4: Chris@4: CMF (Compression Method and flags) Chris@4: This byte is divided into a 4-bit compression method and a 4- Chris@4: bit information field depending on the compression method. Chris@4: Chris@4: bits 0 to 3 CM Compression method Chris@4: bits 4 to 7 CINFO Compression info Chris@4: Chris@4: CM (Compression method) Chris@4: This identifies the compression method used in the file. CM = 8 Chris@4: denotes the "deflate" compression method with a window size up Chris@4: to 32K. This is the method used by gzip and PNG (see Chris@4: references [1] and [2] in Chapter 3, below, for the reference Chris@4: documents). CM = 15 is reserved. It might be used in a future Chris@4: version of this specification to indicate the presence of an Chris@4: extra field before the compressed data. Chris@4: Chris@4: CINFO (Compression info) Chris@4: For CM = 8, CINFO is the base-2 logarithm of the LZ77 window Chris@4: size, minus eight (CINFO=7 indicates a 32K window size). Values Chris@4: of CINFO above 7 are not allowed in this version of the Chris@4: specification. CINFO is not defined in this specification for Chris@4: CM not equal to 8. Chris@4: Chris@4: FLG (FLaGs) Chris@4: This flag byte is divided as follows: Chris@4: Chris@4: bits 0 to 4 FCHECK (check bits for CMF and FLG) Chris@4: bit 5 FDICT (preset dictionary) Chris@4: bits 6 to 7 FLEVEL (compression level) Chris@4: Chris@4: The FCHECK value must be such that CMF and FLG, when viewed as Chris@4: a 16-bit unsigned integer stored in MSB order (CMF*256 + FLG), Chris@4: is a multiple of 31. Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 5] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: FDICT (Preset dictionary) Chris@4: If FDICT is set, a DICT dictionary identifier is present Chris@4: immediately after the FLG byte. The dictionary is a sequence of Chris@4: bytes which are initially fed to the compressor without Chris@4: producing any compressed output. DICT is the Adler-32 checksum Chris@4: of this sequence of bytes (see the definition of ADLER32 Chris@4: below). The decompressor can use this identifier to determine Chris@4: which dictionary has been used by the compressor. Chris@4: Chris@4: FLEVEL (Compression level) Chris@4: These flags are available for use by specific compression Chris@4: methods. The "deflate" method (CM = 8) sets these flags as Chris@4: follows: Chris@4: Chris@4: 0 - compressor used fastest algorithm Chris@4: 1 - compressor used fast algorithm Chris@4: 2 - compressor used default algorithm Chris@4: 3 - compressor used maximum compression, slowest algorithm Chris@4: Chris@4: The information in FLEVEL is not needed for decompression; it Chris@4: is there to indicate if recompression might be worthwhile. Chris@4: Chris@4: compressed data Chris@4: For compression method 8, the compressed data is stored in the Chris@4: deflate compressed data format as described in the document Chris@4: "DEFLATE Compressed Data Format Specification" by L. Peter Chris@4: Deutsch. (See reference [3] in Chapter 3, below) Chris@4: Chris@4: Other compressed data formats are not specified in this version Chris@4: of the zlib specification. Chris@4: Chris@4: ADLER32 (Adler-32 checksum) Chris@4: This contains a checksum value of the uncompressed data Chris@4: (excluding any dictionary data) computed according to Adler-32 Chris@4: algorithm. This algorithm is a 32-bit extension and improvement Chris@4: of the Fletcher algorithm, used in the ITU-T X.224 / ISO 8073 Chris@4: standard. See references [4] and [5] in Chapter 3, below) Chris@4: Chris@4: Adler-32 is composed of two sums accumulated per byte: s1 is Chris@4: the sum of all bytes, s2 is the sum of all s1 values. Both sums Chris@4: are done modulo 65521. s1 is initialized to 1, s2 to zero. The Chris@4: Adler-32 checksum is stored as s2*65536 + s1 in most- Chris@4: significant-byte first (network) order. Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 6] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: 2.3. Compliance Chris@4: Chris@4: A compliant compressor must produce streams with correct CMF, FLG Chris@4: and ADLER32, but need not support preset dictionaries. When the Chris@4: zlib data format is used as part of another standard data format, Chris@4: the compressor may use only preset dictionaries that are specified Chris@4: by this other data format. If this other format does not use the Chris@4: preset dictionary feature, the compressor must not set the FDICT Chris@4: flag. Chris@4: Chris@4: A compliant decompressor must check CMF, FLG, and ADLER32, and Chris@4: provide an error indication if any of these have incorrect values. Chris@4: A compliant decompressor must give an error indication if CM is Chris@4: not one of the values defined in this specification (only the Chris@4: value 8 is permitted in this version), since another value could Chris@4: indicate the presence of new features that would cause subsequent Chris@4: data to be interpreted incorrectly. A compliant decompressor must Chris@4: give an error indication if FDICT is set and DICTID is not the Chris@4: identifier of a known preset dictionary. A decompressor may Chris@4: ignore FLEVEL and still be compliant. When the zlib data format Chris@4: is being used as a part of another standard format, a compliant Chris@4: decompressor must support all the preset dictionaries specified by Chris@4: the other format. When the other format does not use the preset Chris@4: dictionary feature, a compliant decompressor must reject any Chris@4: stream in which the FDICT flag is set. Chris@4: Chris@4: 3. References Chris@4: Chris@4: [1] Deutsch, L.P.,"GZIP Compressed Data Format Specification", Chris@4: available in ftp://ftp.uu.net/pub/archiving/zip/doc/ Chris@4: Chris@4: [2] Thomas Boutell, "PNG (Portable Network Graphics) specification", Chris@4: available in ftp://ftp.uu.net/graphics/png/documents/ Chris@4: Chris@4: [3] Deutsch, L.P.,"DEFLATE Compressed Data Format Specification", Chris@4: available in ftp://ftp.uu.net/pub/archiving/zip/doc/ Chris@4: Chris@4: [4] Fletcher, J. G., "An Arithmetic Checksum for Serial Chris@4: Transmissions," IEEE Transactions on Communications, Vol. COM-30, Chris@4: No. 1, January 1982, pp. 247-252. Chris@4: Chris@4: [5] ITU-T Recommendation X.224, Annex D, "Checksum Algorithms," Chris@4: November, 1993, pp. 144, 145. (Available from Chris@4: gopher://info.itu.ch). ITU-T X.244 is also the same as ISO 8073. Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 7] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: 4. Source code Chris@4: Chris@4: Source code for a C language implementation of a "zlib" compliant Chris@4: library is available at ftp://ftp.uu.net/pub/archiving/zip/zlib/. Chris@4: Chris@4: 5. Security Considerations Chris@4: Chris@4: A decoder that fails to check the ADLER32 checksum value may be Chris@4: subject to undetected data corruption. Chris@4: Chris@4: 6. Acknowledgements Chris@4: Chris@4: Trademarks cited in this document are the property of their Chris@4: respective owners. Chris@4: Chris@4: Jean-Loup Gailly and Mark Adler designed the zlib format and wrote Chris@4: the related software described in this specification. Glenn Chris@4: Randers-Pehrson converted this document to RFC and HTML format. Chris@4: Chris@4: 7. Authors' Addresses Chris@4: Chris@4: L. Peter Deutsch Chris@4: Aladdin Enterprises Chris@4: 203 Santa Margarita Ave. Chris@4: Menlo Park, CA 94025 Chris@4: Chris@4: Phone: (415) 322-0103 (AM only) Chris@4: FAX: (415) 322-1734 Chris@4: EMail: Chris@4: Chris@4: Chris@4: Jean-Loup Gailly Chris@4: Chris@4: EMail: Chris@4: Chris@4: Questions about the technical content of this specification can be Chris@4: sent by email to Chris@4: Chris@4: Jean-Loup Gailly and Chris@4: Mark Adler Chris@4: Chris@4: Editorial comments on this specification can be sent by email to Chris@4: Chris@4: L. Peter Deutsch and Chris@4: Glenn Randers-Pehrson Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 8] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: 8. Appendix: Rationale Chris@4: Chris@4: 8.1. Preset dictionaries Chris@4: Chris@4: A preset dictionary is specially useful to compress short input Chris@4: sequences. The compressor can take advantage of the dictionary Chris@4: context to encode the input in a more compact manner. The Chris@4: decompressor can be initialized with the appropriate context by Chris@4: virtually decompressing a compressed version of the dictionary Chris@4: without producing any output. However for certain compression Chris@4: algorithms such as the deflate algorithm this operation can be Chris@4: achieved without actually performing any decompression. Chris@4: Chris@4: The compressor and the decompressor must use exactly the same Chris@4: dictionary. The dictionary may be fixed or may be chosen among a Chris@4: certain number of predefined dictionaries, according to the kind Chris@4: of input data. The decompressor can determine which dictionary has Chris@4: been chosen by the compressor by checking the dictionary Chris@4: identifier. This document does not specify the contents of Chris@4: predefined dictionaries, since the optimal dictionaries are Chris@4: application specific. Standard data formats using this feature of Chris@4: the zlib specification must precisely define the allowed Chris@4: dictionaries. Chris@4: Chris@4: 8.2. The Adler-32 algorithm Chris@4: Chris@4: The Adler-32 algorithm is much faster than the CRC32 algorithm yet Chris@4: still provides an extremely low probability of undetected errors. Chris@4: Chris@4: The modulo on unsigned long accumulators can be delayed for 5552 Chris@4: bytes, so the modulo operation time is negligible. If the bytes Chris@4: are a, b, c, the second sum is 3a + 2b + c + 3, and so is position Chris@4: and order sensitive, unlike the first sum, which is just a Chris@4: checksum. That 65521 is prime is important to avoid a possible Chris@4: large class of two-byte errors that leave the check unchanged. Chris@4: (The Fletcher checksum uses 255, which is not prime and which also Chris@4: makes the Fletcher check insensitive to single byte changes 0 <-> Chris@4: 255.) Chris@4: Chris@4: The sum s1 is initialized to 1 instead of zero to make the length Chris@4: of the sequence part of s2, so that the length does not have to be Chris@4: checked separately. (Any sequence of zeroes has a Fletcher Chris@4: checksum of zero.) Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 9] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: 9. Appendix: Sample code Chris@4: Chris@4: The following C code computes the Adler-32 checksum of a data buffer. Chris@4: It is written for clarity, not for speed. The sample code is in the Chris@4: ANSI C programming language. Non C users may find it easier to read Chris@4: with these hints: Chris@4: Chris@4: & Bitwise AND operator. Chris@4: >> Bitwise right shift operator. When applied to an Chris@4: unsigned quantity, as here, right shift inserts zero bit(s) Chris@4: at the left. Chris@4: << Bitwise left shift operator. Left shift inserts zero Chris@4: bit(s) at the right. Chris@4: ++ "n++" increments the variable n. Chris@4: % modulo operator: a % b is the remainder of a divided by b. Chris@4: Chris@4: #define BASE 65521 /* largest prime smaller than 65536 */ Chris@4: Chris@4: /* Chris@4: Update a running Adler-32 checksum with the bytes buf[0..len-1] Chris@4: and return the updated checksum. The Adler-32 checksum should be Chris@4: initialized to 1. Chris@4: Chris@4: Usage example: Chris@4: Chris@4: unsigned long adler = 1L; Chris@4: Chris@4: while (read_buffer(buffer, length) != EOF) { Chris@4: adler = update_adler32(adler, buffer, length); Chris@4: } Chris@4: if (adler != original_adler) error(); Chris@4: */ Chris@4: unsigned long update_adler32(unsigned long adler, Chris@4: unsigned char *buf, int len) Chris@4: { Chris@4: unsigned long s1 = adler & 0xffff; Chris@4: unsigned long s2 = (adler >> 16) & 0xffff; Chris@4: int n; Chris@4: Chris@4: for (n = 0; n < len; n++) { Chris@4: s1 = (s1 + buf[n]) % BASE; Chris@4: s2 = (s2 + s1) % BASE; Chris@4: } Chris@4: return (s2 << 16) + s1; Chris@4: } Chris@4: Chris@4: /* Return the adler32 of the bytes buf[0..len-1] */ Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 10] Chris@4: Chris@4: RFC 1950 ZLIB Compressed Data Format Specification May 1996 Chris@4: Chris@4: Chris@4: unsigned long adler32(unsigned char *buf, int len) Chris@4: { Chris@4: return update_adler32(1L, buf, len); Chris@4: } Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Chris@4: Deutsch & Gailly Informational [Page 11] Chris@4: