sv-dependency-builds: src/capnproto-git-20161025/doc/encoding.md annotate

annotate src/capnproto-git-20161025/doc/encoding.md @ 83:ae30d91d2ffe

Replace these with versions built using an older toolset (so as to avoid ABI compatibilities when linking on Ubuntu 14.04 for packaging purposes)

author	Chris Cannam
date	Fri, 07 Feb 2020 11:51:13 +0000
parents	9530b331f8c1
children

rev	line source
cannam@48	1 ---
cannam@48	2 layout: page
cannam@48	3 title: Encoding Spec
cannam@48	4 ---
cannam@48	5
cannam@48	6 # Encoding Spec
cannam@48	7
cannam@48	8 ## Organization
cannam@48	9
cannam@48	10 ### 64-bit Words
cannam@48	11
cannam@48	12 For the purpose of Cap'n Proto, a "word" is defined as 8 bytes, or 64 bits. Since alignment of
cannam@48	13 data is important, all objects (structs, lists, and blobs) are aligned to word boundaries, and
cannam@48	14 sizes are usually expressed in terms of words. (Primitive values are aligned to a multiple of
cannam@48	15 their size within a struct or list.)
cannam@48	16
cannam@48	17 ### Messages
cannam@48	18
cannam@48	19 The unit of communication in Cap'n Proto is a "message". A message is a tree of objects, with
cannam@48	20 the root always being a struct.
cannam@48	21
cannam@48	22 Physically, messages may be split into several "segments", each of which is a flat blob of bytes.
cannam@48	23 Typically, a segment must be loaded into a contiguous block of memory before it can be accessed,
cannam@48	24 so that the relative pointers within the segment can be followed quickly. However, when a message
cannam@48	25 has multiple segments, it does not matter where those segments are located in memory relative to
cannam@48	26 each other; inter-segment pointers are encoded differently, as we'll see later.
cannam@48	27
cannam@48	28 Ideally, every message would have only one segment. However, there are a few reasons why splitting
cannam@48	29 a message into multiple segments may be convenient:
cannam@48	30
cannam@48	31 * It can be difficult to predict how large a message might be until you start writing it, and you
cannam@48	32 can't start writing it until you have a segment to write to. If it turns out the segment you
cannam@48	33 allocated isn't big enough, you can allocate additional segments without the need to relocate the
cannam@48	34 data you've already written.
cannam@48	35 * Allocating excessively large blocks of memory can make life difficult for memory allocators,
cannam@48	36 especially on 32-bit systems with limited address space.
cannam@48	37
cannam@48	38 The first word of the first segment of the message is always a pointer pointing to the message's
cannam@48	39 root struct.
cannam@48	40
cannam@48	41 ### Objects
cannam@48	42
cannam@48	43 Each segment in a message contains a series of objects. For the purpose of Cap'n Proto, an "object"
cannam@48	44 is any value which may have a pointer pointing to it. Pointers can only point to the beginning of
cannam@48	45 objects, not into the middle, and no more than one pointer can point at each object. Thus, objects
cannam@48	46 and the pointers connecting them form a tree, not a graph. An object is itself composed of
cannam@48	47 primitive data values and pointers, in a layout that depends on the kind of object.
cannam@48	48
cannam@48	49 At the moment, there are three kinds of objects: structs, lists, and far-pointer landing pads.
cannam@48	50 Blobs might also be considered to be a kind of object, but are encoded identically to lists of
cannam@48	51 bytes.
cannam@48	52
cannam@48	53 ## Value Encoding
cannam@48	54
cannam@48	55 ### Primitive Values
cannam@48	56
cannam@48	57 The built-in primitive types are encoded as follows:
cannam@48	58
cannam@48	59 * `Void`: Not encoded at all. It has only one possible value thus carries no information.
cannam@48	60 * `Bool`: One bit. 1 = true, 0 = false.
cannam@48	61 * Integers: Encoded in little-endian format. Signed integers use two's complement.
cannam@48	62 * Floating-points: Encoded in little-endian IEEE-754 format.
cannam@48	63
cannam@48	64 Primitive types must always be aligned to a multiple of their size. Note that since the size of
cannam@48	65 a `Bool` is one bit, this means eight `Bool` values can be encoded in a single byte -- this differs
cannam@48	66 from C++, where the `bool` type takes a whole byte.
cannam@48	67
cannam@48	68 ### Enums
cannam@48	69
cannam@48	70 Enums are encoded the same as `UInt16`.
cannam@48	71
cannam@48	72 ## Object Encoding
cannam@48	73
cannam@48	74 ### Blobs
cannam@48	75
cannam@48	76 The built-in blob types are encoded as follows:
cannam@48	77
cannam@48	78 * `Data`: Encoded as a pointer, identical to `List(UInt8)`.
cannam@48	79 * `Text`: Like `Data`, but the content must be valid UTF-8, and the last byte of the content must
cannam@48	80 be zero. The encoding allows bytes other than the last to be zero, but some applications
cannam@48	81 (especially ones written in languages that use NUL-terminated strings) may truncate at the first
cannam@48	82 zero. If a particular text field is explicitly intended to support zero bytes, it should
cannam@48	83 document this, but otherwise senders should assume that zero bytes are not allowed to be safe.
cannam@48	84 Note that the NUL terminator is included in the size sent on the wire, but the runtime library
cannam@48	85 should not count it in any size reported to the application.
cannam@48	86
cannam@48	87 ### Structs
cannam@48	88
cannam@48	89 A struct value is encoded as a pointer to its content. The content is split into two sections:
cannam@48	90 data and pointers, with the pointer section appearing immediately after the data section. This
cannam@48	91 split allows structs to be traversed (e.g., copied) without knowing their type.
cannam@48	92
cannam@48	93 A struct pointer looks like this:
cannam@48	94
cannam@48	95 lsb struct pointer msb
cannam@48	96 +-+-----------------------------+---------------+---------------+
cannam@48	97 \|A\| B \| C \| D \|
cannam@48	98 +-+-----------------------------+---------------+---------------+
cannam@48	99
cannam@48	100 A (2 bits) = 0, to indicate that this is a struct pointer.
cannam@48	101 B (30 bits) = Offset, in words, from the end of the pointer to the
cannam@48	102 start of the struct's data section. Signed.
cannam@48	103 C (16 bits) = Size of the struct's data section, in words.
cannam@48	104 D (16 bits) = Size of the struct's pointer section, in words.
cannam@48	105
cannam@48	106 Fields are positioned within the struct according to an algorithm with the following principles:
cannam@48	107
cannam@48	108 * The position of each field depends only on its definition and the definitions of lower-numbered
cannam@48	109 fields, never on the definitions of higher-numbered fields. This ensures backwards-compatibility
cannam@48	110 when new fields are added.
cannam@48	111 * Due to alignment requirements, fields in the data section may be separated by padding. However,
cannam@48	112 later-numbered fields may be positioned into the padding left between earlier-numbered fields.
cannam@48	113 Because of this, a struct will never contain more than 63 bits of padding. Since objects are
cannam@48	114 rounded up to a whole number of words anyway, padding never ends up wasting space.
cannam@48	115 * Unions and groups need not occupy contiguous memory. Indeed, they may have to be split into
cannam@48	116 multiple slots if new fields are added later on.
cannam@48	117
cannam@48	118 Field offsets are computed by the Cap'n Proto compiler. The precise algorithm is too complicated
cannam@48	119 to describe here, but you need not implement it yourself, as the compiler can produce a compiled
cannam@48	120 schema format which includes offset information.
cannam@48	121
cannam@48	122 ### Lists
cannam@48	123
cannam@48	124 A list value is encoded as a pointer to a flat array of values.
cannam@48	125
cannam@48	126 lsb list pointer msb
cannam@48	127 +-+-----------------------------+--+----------------------------+
cannam@48	128 \|A\| B \|C \| D \|
cannam@48	129 +-+-----------------------------+--+----------------------------+
cannam@48	130
cannam@48	131 A (2 bits) = 1, to indicate that this is a list pointer.
cannam@48	132 B (30 bits) = Offset, in words, from the end of the pointer to the
cannam@48	133 start of the first element of the list. Signed.
cannam@48	134 C (3 bits) = Size of each element:
cannam@48	135 0 = 0 (e.g. List(Void))
cannam@48	136 1 = 1 bit
cannam@48	137 2 = 1 byte
cannam@48	138 3 = 2 bytes
cannam@48	139 4 = 4 bytes
cannam@48	140 5 = 8 bytes (non-pointer)
cannam@48	141 6 = 8 bytes (pointer)
cannam@48	142 7 = composite (see below)
cannam@48	143 D (29 bits) = Number of elements in the list, except when C is 7
cannam@48	144 (see below).
cannam@48	145
cannam@48	146 The pointed-to values are tightly-packed. In particular, `Bool`s are packed bit-by-bit in
cannam@48	147 little-endian order (the first bit is the least-significant bit of the first byte).
cannam@48	148
cannam@48	149 When C = 7, the elements of the list are fixed-width composite values -- usually, structs. In
cannam@48	150 this case, the list content is prefixed by a "tag" word that describes each individual element.
cannam@48	151 The tag has the same layout as a struct pointer, except that the pointer offset (B) instead
cannam@48	152 indicates the number of elements in the list. Meanwhile, section (D) of the list pointer -- which
cannam@48	153 normally would store this element count -- instead stores the total number of _words_ in the list
cannam@48	154 (not counting the tag word). The reason we store a word count in the pointer rather than an element
cannam@48	155 count is to ensure that the extents of the list's location can always be determined by inspecting
cannam@48	156 the pointer alone, without having to look at the tag; this may allow more-efficient prefetching in
cannam@48	157 some use cases. The reason we don't store struct lists as a list of pointers is because doing so
cannam@48	158 would take significantly more space (an extra pointer per element) and may be less cache-friendly.
cannam@48	159
cannam@48	160 In the future, we could consider implementing matrixes using the "composite" element type, with the
cannam@48	161 elements being fixed-size lists rather than structs. In this case, the tag would look like a list
cannam@48	162 pointer rather than a struct pointer. As of this writing, no such feature has been implemented.
cannam@48	163
cannam@48	164 A struct list must always be written using C = 7. However, a list of any element size (except
cannam@48	165 C = 1, i.e. 1-bit) may be decoded as a struct list, with each element being interpreted as being
cannam@48	166 a prefix of the struct data. For instance, a list of 2-byte values (C = 3) can be decoded as a
cannam@48	167 struct list where each struct has 2 bytes in their "data" section (and an empty pointer section). A
cannam@48	168 list of pointer values (C = 6) can be decoded as a struct list where each struct has a pointer
cannam@48	169 section with one pointer (and an empty data section). The purpose of this rule is to make it
cannam@48	170 possible to upgrade a list of primitives to a list of structs, as described under the
cannam@48	171 [protocol evolution rules](language.html#evolving-your-protocol).
cannam@48	172 (We make a special exception that boolean lists cannot be upgraded in this way due to the
cannam@48	173 unreasonable implementation burden.) Note that even though struct lists can be decoded from any
cannam@48	174 element size (except C = 1), it is NOT permitted to encode a struct list using any type other than
cannam@48	175 C = 7 because doing so would interfere with the [canonicalization algorithm](#canonicalization).
cannam@48	176
cannam@48	177 #### Default Values
cannam@48	178
cannam@48	179 A default struct is always all-zeros. To achieve this, fields in the data section are stored xor'd
cannam@48	180 with their defined default values. An all-zero pointer is considered "null" (such a pointer would
cannam@48	181 otherwise point to a zero-size struct, which might as well be considered null); accessor methods
cannam@48	182 for pointer fields check for null and return a pointer to their default value in this case.
cannam@48	183
cannam@48	184 There are several reasons why this is desirable:
cannam@48	185
cannam@48	186 * Cap'n Proto messages are often "packed" with a simple compression algorithm that deflates
cannam@48	187 zero-value bytes.
cannam@48	188 * Newly-allocated structs only need to be zero-initialized, which is fast and requires no knowledge
cannam@48	189 of the struct type except its size.
cannam@48	190 * If a newly-added field is placed in space that was previously padding, messages written by old
cannam@48	191 binaries that do not know about this field will still have its default value set correctly --
cannam@48	192 because it is always zero.
cannam@48	193
cannam@48	194 ### Inter-Segment Pointers
cannam@48	195
cannam@48	196 When a pointer needs to point to a different segment, offsets no longer work. We instead encode
cannam@48	197 the pointer as a "far pointer", which looks like this:
cannam@48	198
cannam@48	199 lsb far pointer msb
cannam@48	200 +-+-+---------------------------+-------------------------------+
cannam@48	201 \|A\|B\| C \| D \|
cannam@48	202 +-+-+---------------------------+-------------------------------+
cannam@48	203
cannam@48	204 A (2 bits) = 2, to indicate that this is a far pointer.
cannam@48	205 B (1 bit) = 0 if the landing pad is one word, 1 if it is two words.
cannam@48	206 See explanation below.
cannam@48	207 C (29 bits) = Offset, in words, from the start of the target segment
cannam@48	208 to the location of the far-pointer landing-pad within that
cannam@48	209 segment. Unsigned.
cannam@48	210 D (32 bits) = ID of the target segment. (Segments are numbered
cannam@48	211 sequentially starting from zero.)
cannam@48	212
cannam@48	213 If B == 0, then the "landing pad" of a far pointer is normally just another pointer, which in turn
cannam@48	214 points to the actual object.
cannam@48	215
cannam@48	216 If B == 1, then the "landing pad" is itself another far pointer that is interpreted differently:
cannam@48	217 This far pointer (which always has B = 0) points to the start of the object's _content_, located in
cannam@48	218 some other segment. The landing pad is itself immediately followed by a tag word. The tag word
cannam@48	219 looks exactly like an intra-segment pointer to the target object would look, except that the offset
cannam@48	220 is always zero.
cannam@48	221
cannam@48	222 The reason for the convoluted double-far convention is to make it possible to form a new pointer
cannam@48	223 to an object in a segment that is full. If you can't allocate even one word in the segment where
cannam@48	224 the target resides, then you will need to allocate a landing pad in some other segment, and use
cannam@48	225 this double-far approach. This should be exceedingly rare in practice since pointers are normally
cannam@48	226 set to point to new objects, not existing ones.
cannam@48	227
cannam@48	228 ### Capabilities (Interfaces)
cannam@48	229
cannam@48	230 When using Cap'n Proto for [RPC](rpc.html), every message has an associated "capability table"
cannam@48	231 which is a flat list of all capabilities present in the message body. The details of what this
cannam@48	232 table contains and where it is stored are the responsibility of the RPC system; in some cases, the
cannam@48	233 table may not even be part of the message content.
cannam@48	234
cannam@48	235 A capability pointer, then, simply contains an index into the separate capability table.
cannam@48	236
cannam@48	237 lsb capability pointer msb
cannam@48	238 +-+-----------------------------+-------------------------------+
cannam@48	239 \|A\| B \| C \|
cannam@48	240 +-+-----------------------------+-------------------------------+
cannam@48	241
cannam@48	242 A (2 bits) = 3, to indicate that this is an "other" pointer.
cannam@48	243 B (30 bits) = 0, to indicate that this is a capability pointer.
cannam@48	244 (All other values are reserved for future use.)
cannam@48	245 C (32 bits) = Index of the capability in the message's capability
cannam@48	246 table.
cannam@48	247
cannam@48	248 In [rpc.capnp](https://github.com/sandstorm-io/capnproto/blob/master/c++/src/capnp/rpc.capnp), the
cannam@48	249 capability table is encoded as a list of `CapDescriptors`, appearing along-side the message content
cannam@48	250 in the `Payload` struct. However, some use cases may call for different approaches. A message
cannam@48	251 that is built and consumed within the same process need not encode the capability table at all
cannam@48	252 (it can just keep the table as a separate array). A message that is going to be stored to disk
cannam@48	253 would need to store a table of `SturdyRef`s instead of `CapDescriptor`s.
cannam@48	254
cannam@48	255 ## Serialization Over a Stream
cannam@48	256
cannam@48	257 When transmitting a message, the segments must be framed in some way, i.e. to communicate the
cannam@48	258 number of segments and their sizes before communicating the actual data. The best framing approach
cannam@48	259 may differ depending on the medium -- for example, messages read via `mmap` or shared memory may
cannam@48	260 call for a different approach than messages sent over a socket or a pipe. Cap'n Proto does not
cannam@48	261 attempt to specify a framing format for every situation. However, since byte streams are by far
cannam@48	262 the most common transmission medium, Cap'n Proto does define and implement a recommended framing
cannam@48	263 format for them.
cannam@48	264
cannam@48	265 When transmitting over a stream, the following should be sent. All integers are unsigned and
cannam@48	266 little-endian.
cannam@48	267
cannam@48	268 * (4 bytes) The number of segments, minus one (since there is always at least one segment).
cannam@48	269 * (N * 4 bytes) The size of each segment, in words.
cannam@48	270 * (0 or 4 bytes) Padding up to the next word boundary.
cannam@48	271 * The content of each segment, in order.
cannam@48	272
cannam@48	273 ### Packing
cannam@48	274
cannam@48	275 For cases where bandwidth usage matters, Cap'n Proto defines a simple compression scheme called
cannam@48	276 "packing". This scheme is based on the observation that Cap'n Proto messages contain lots of
cannam@48	277 zero bytes: padding bytes, unset fields, and high-order bytes of small-valued integers.
cannam@48	278
cannam@48	279 In packed format, each word of the message is reduced to a tag byte followed by zero to eight
cannam@48	280 content bytes. The bits of the tag byte correspond to the bytes of the unpacked word, with the
cannam@48	281 least-significant bit corresponding to the first byte. Each zero bit indicates that the
cannam@48	282 corresponding byte is zero. The non-zero bytes are packed following the tag.
cannam@48	283
cannam@48	284 For example, here is some typical Cap'n Proto data (a struct pointer (offset = 2, data size = 3,
cannam@48	285 pointer count = 2) followed by a text pointer (offset = 6, length = 53)) and its packed form:
cannam@48	286
cannam@48	287 unpacked (hex): 08 00 00 00 03 00 02 00 19 00 00 00 aa 01 00 00
cannam@48	288 packed (hex): 51 08 03 02 31 19 aa 01
cannam@48	289
cannam@48	290 In addition to the above, there are two tag values which are treated specially: 0x00 and 0xff.
cannam@48	291
cannam@48	292 * 0x00: The tag is followed by a single byte which indicates a count of consecutive zero-valued
cannam@48	293 words, minus 1. E.g. if the tag 0x00 is followed by 0x05, the sequence unpacks to 6 words of
cannam@48	294 zero.
cannam@48	295
cannam@48	296 Or, put another way: the tag is first decoded as if it were not special. Since none of the bits
cannam@48	297 are set, it is followed by no bytes and expands to a word full of zeros. After that, the next
cannam@48	298 byte is interpreted as a count of _additional_ words that are also all-zero.
cannam@48	299
cannam@48	300 * 0xff: The tag is followed by the bytes of the word (as if it weren't special), but after those
cannam@48	301 bytes is another byte with value N. Following that byte is N unpacked words that should be copied
cannam@48	302 directly. These unpacked words may or may not contain zeros -- it is up to the compressor to
cannam@48	303 decide when to end the unpacked span and return to packing each word. The purpose of this rule
cannam@48	304 is to minimize the impact of packing on data that doesn't contain any zeros -- in particular,
cannam@48	305 long text blobs. Because of this rule, the worst-case space overhead of packing is 2 bytes per
cannam@48	306 2 KiB of input (256 words = 2KiB).
cannam@48	307
cannam@48	308 Examples:
cannam@48	309
cannam@48	310 unpacked (hex): 00 (x 32 bytes)
cannam@48	311 packed (hex): 00 03
cannam@48	312
cannam@48	313 unpacked (hex): 8a (x 32 bytes)
cannam@48	314 packed (hex): ff 8a (x 8 bytes) 03 8a (x 24 bytes)
cannam@48	315
cannam@48	316 Notice that both of the special cases begin by treating the tag as if it weren't special. This
cannam@48	317 is intentionally designed to make encoding faster: you can compute the tag value and encode the
cannam@48	318 bytes in a single pass through the input word. Only after you've finished with that word do you
cannam@48	319 need to check whether the tag ended up being 0x00 or 0xff.
cannam@48	320
cannam@48	321 It is possible to write both an encoder and a decoder which only branch at the end of each word,
cannam@48	322 and only to handle the two special tags. It is not necessary to branch on every byte. See the
cannam@48	323 C++ reference implementation for an example.
cannam@48	324
cannam@48	325 Packing is normally applied on top of the standard stream framing described in the previous
cannam@48	326 section.
cannam@48	327
cannam@48	328 ### Compression
cannam@48	329
cannam@48	330 When Cap'n Proto messages may contain repetitive data (especially, large text blobs), it makes sense
cannam@48	331 to apply a standard compression algorithm in addition to packing. When CPU time is scarce, we
cannam@48	332 recommend [LZ4 compression](https://code.google.com/p/lz4/). Otherwise, [zlib](http://www.zlib.net)
cannam@48	333 is slower but will compress more.
cannam@48	334
cannam@48	335 ## Canonicalization
cannam@48	336
cannam@48	337 Cap'n Proto messages have a well-defined canonical form. Cap'n Proto encoders are NOT required to
cannam@48	338 output messages in canonical form, and in fact they will almost never do so by default. However,
cannam@48	339 it is possible to write code which canonicalizes a Cap'n Proto message without knowing its schema.
cannam@48	340
cannam@48	341 A canonical Cap'n Proto message must adhere to the following rules:
cannam@48	342
cannam@48	343 * The object tree must be encoded in preorder (with respect to the order of the pointers within
cannam@48	344 each object).
cannam@48	345 * The message must be encoded as a single segment. (When signing or hashing a canonical Cap'n Proto
cannam@48	346 message, the segment table shall not be included, because it would be redundant.)
cannam@48	347 * Trailing zero-valued words in a struct's data or pointer segments must be truncated. Since zero
cannam@48	348 represents a default value, this does not change the struct's meaning. This rule is important
cannam@48	349 to ensure that adding a new field to a struct does not affect the canonical encoding of messages
cannam@48	350 that do not set that field.
cannam@48	351 * Similarly, for a struct list, if a trailing word in a section of all structs in the list is zero,
cannam@48	352 then it must be truncated from all structs in the list. (All structs in a struct list must have
cannam@48	353 equal sizes, hence a trailing zero can only be removed if it is zero in all elements.)
cannam@48	354 * Canonical messages are not packed. However, packing can still be applied for transmission
cannam@48	355 purposes; the message must simply be unpacked before checking signatures.
cannam@48	356
cannam@48	357 Note that Cap'n Proto 0.5 introduced the rule that struct lists must always be encoded using
cannam@48	358 C = 7 in the [list pointer](#lists). Prior versions of Cap'n Proto allowed struct lists to be
cannam@48	359 encoded using any element size, so that small structs could be compacted to take less that a word
cannam@48	360 per element, and many encoders in fact implemented this. Unfortunately, this "optimization" made
cannam@48	361 canonicalization impossible without knowing the schema, which is a significant obstacle. Therefore,
cannam@48	362 the rules have been changed in 0.5, but data written by previous versions may not be possible to
cannam@48	363 canonicalize.
cannam@48	364
cannam@48	365 ## Security Considerations
cannam@48	366
cannam@48	367 A naive implementation of a Cap'n Proto reader may be vulnerable to attacks based on various kinds
cannam@48	368 of malicious input. Implementations MUST guard against these.
cannam@48	369
cannam@48	370 ### Pointer Validation
cannam@48	371
cannam@48	372 Cap'n Proto readers must validate pointers, e.g. to check that the target object is within the
cannam@48	373 bounds of its segment. To avoid an upfront scan of the message (which would defeat Cap'n Proto's
cannam@48	374 O(1) parsing performance), validation should occur lazily when the getter method for a pointer is
cannam@48	375 called, throwing an exception or returning a default value if the pointer is invalid.
cannam@48	376
cannam@48	377 ### Amplification attack
cannam@48	378
cannam@48	379 A message containing cyclic (or even just overlapping) pointers can cause the reader to go into
cannam@48	380 an infinite loop while traversing the content.
cannam@48	381
cannam@48	382 To defend against this, as the application traverses the message, each time a pointer is
cannam@48	383 dereferenced, a counter should be incremented by the size of the data to which it points. If this
cannam@48	384 counter goes over some limit, an error should be raised, and/or default values should be returned. We call this limit the "traversal limit" (or, sometimes, the "read limit").
cannam@48	385
cannam@48	386 The C++ implementation currently defaults to a limit of 64MiB, but allows the caller to set a
cannam@48	387 different limit if desired. Another reasonable strategy is to set the limit to some multiple of
cannam@48	388 the original message size; however, most applications should place limits on overall message sizes
cannam@48	389 anyway, so it makes sense to have one check cover both.
cannam@48	390
cannam@48	391 List amplification: A list of `Void` values or zero-size structs can have a very large element count while taking constant space on the wire. If the receiving application expects a list of structs, it will see these zero-sized elements as valid structs set to their default values. If it iterates through the list processing each element, it could spend a large amount of CPU time or other resources despite the message being small. To defend against this, the "traversal limit" should count a list of zero-sized elements as if each element were one word instead. This rule was introduced in the C++ implementation in [commit 1048706](https://github.com/sandstorm-io/capnproto/commit/104870608fde3c698483fdef6b97f093fc15685d).
cannam@48	392
cannam@48	393 ### Stack overflow DoS attack
cannam@48	394
cannam@48	395 A message with deeply-nested objects can cause a stack overflow in typical code which processes
cannam@48	396 messages recursively.
cannam@48	397
cannam@48	398 To defend against this, as the application traverses the message, the pointer depth should be
cannam@48	399 tracked. If it goes over some limit, an error should be raised. The C++ implementation currently
cannam@48	400 defaults to a limit of 64 pointers, but allows the caller to set a different limit.

Mercurial > hg > sv-dependency-builds

annotate src/capnproto-git-20161025/doc/encoding.md @ 83:ae30d91d2ffe