cannam@133
|
1 ---
|
cannam@133
|
2 layout: page
|
cannam@133
|
3 title: Encoding Spec
|
cannam@133
|
4 ---
|
cannam@133
|
5
|
cannam@133
|
6 # Encoding Spec
|
cannam@133
|
7
|
cannam@133
|
8 ## Organization
|
cannam@133
|
9
|
cannam@133
|
10 ### 64-bit Words
|
cannam@133
|
11
|
cannam@133
|
12 For the purpose of Cap'n Proto, a "word" is defined as 8 bytes, or 64 bits. Since alignment of
|
cannam@133
|
13 data is important, all objects (structs, lists, and blobs) are aligned to word boundaries, and
|
cannam@133
|
14 sizes are usually expressed in terms of words. (Primitive values are aligned to a multiple of
|
cannam@133
|
15 their size within a struct or list.)
|
cannam@133
|
16
|
cannam@133
|
17 ### Messages
|
cannam@133
|
18
|
cannam@133
|
19 The unit of communication in Cap'n Proto is a "message". A message is a tree of objects, with
|
cannam@133
|
20 the root always being a struct.
|
cannam@133
|
21
|
cannam@133
|
22 Physically, messages may be split into several "segments", each of which is a flat blob of bytes.
|
cannam@133
|
23 Typically, a segment must be loaded into a contiguous block of memory before it can be accessed,
|
cannam@133
|
24 so that the relative pointers within the segment can be followed quickly. However, when a message
|
cannam@133
|
25 has multiple segments, it does not matter where those segments are located in memory relative to
|
cannam@133
|
26 each other; inter-segment pointers are encoded differently, as we'll see later.
|
cannam@133
|
27
|
cannam@133
|
28 Ideally, every message would have only one segment. However, there are a few reasons why splitting
|
cannam@133
|
29 a message into multiple segments may be convenient:
|
cannam@133
|
30
|
cannam@133
|
31 * It can be difficult to predict how large a message might be until you start writing it, and you
|
cannam@133
|
32 can't start writing it until you have a segment to write to. If it turns out the segment you
|
cannam@133
|
33 allocated isn't big enough, you can allocate additional segments without the need to relocate the
|
cannam@133
|
34 data you've already written.
|
cannam@133
|
35 * Allocating excessively large blocks of memory can make life difficult for memory allocators,
|
cannam@133
|
36 especially on 32-bit systems with limited address space.
|
cannam@133
|
37
|
cannam@133
|
38 The first word of the first segment of the message is always a pointer pointing to the message's
|
cannam@133
|
39 root struct.
|
cannam@133
|
40
|
cannam@133
|
41 ### Objects
|
cannam@133
|
42
|
cannam@133
|
43 Each segment in a message contains a series of objects. For the purpose of Cap'n Proto, an "object"
|
cannam@133
|
44 is any value which may have a pointer pointing to it. Pointers can only point to the beginning of
|
cannam@133
|
45 objects, not into the middle, and no more than one pointer can point at each object. Thus, objects
|
cannam@133
|
46 and the pointers connecting them form a tree, not a graph. An object is itself composed of
|
cannam@133
|
47 primitive data values and pointers, in a layout that depends on the kind of object.
|
cannam@133
|
48
|
cannam@133
|
49 At the moment, there are three kinds of objects: structs, lists, and far-pointer landing pads.
|
cannam@133
|
50 Blobs might also be considered to be a kind of object, but are encoded identically to lists of
|
cannam@133
|
51 bytes.
|
cannam@133
|
52
|
cannam@133
|
53 ## Value Encoding
|
cannam@133
|
54
|
cannam@133
|
55 ### Primitive Values
|
cannam@133
|
56
|
cannam@133
|
57 The built-in primitive types are encoded as follows:
|
cannam@133
|
58
|
cannam@133
|
59 * `Void`: Not encoded at all. It has only one possible value thus carries no information.
|
cannam@133
|
60 * `Bool`: One bit. 1 = true, 0 = false.
|
cannam@133
|
61 * Integers: Encoded in little-endian format. Signed integers use two's complement.
|
cannam@133
|
62 * Floating-points: Encoded in little-endian IEEE-754 format.
|
cannam@133
|
63
|
cannam@133
|
64 Primitive types must always be aligned to a multiple of their size. Note that since the size of
|
cannam@133
|
65 a `Bool` is one bit, this means eight `Bool` values can be encoded in a single byte -- this differs
|
cannam@133
|
66 from C++, where the `bool` type takes a whole byte.
|
cannam@133
|
67
|
cannam@133
|
68 ### Enums
|
cannam@133
|
69
|
cannam@133
|
70 Enums are encoded the same as `UInt16`.
|
cannam@133
|
71
|
cannam@133
|
72 ## Object Encoding
|
cannam@133
|
73
|
cannam@133
|
74 ### Blobs
|
cannam@133
|
75
|
cannam@133
|
76 The built-in blob types are encoded as follows:
|
cannam@133
|
77
|
cannam@133
|
78 * `Data`: Encoded as a pointer, identical to `List(UInt8)`.
|
cannam@133
|
79 * `Text`: Like `Data`, but the content must be valid UTF-8, and the last byte of the content must
|
cannam@133
|
80 be zero. The encoding allows bytes other than the last to be zero, but some applications
|
cannam@133
|
81 (especially ones written in languages that use NUL-terminated strings) may truncate at the first
|
cannam@133
|
82 zero. If a particular text field is explicitly intended to support zero bytes, it should
|
cannam@133
|
83 document this, but otherwise senders should assume that zero bytes are not allowed to be safe.
|
cannam@133
|
84 Note that the NUL terminator is included in the size sent on the wire, but the runtime library
|
cannam@133
|
85 should not count it in any size reported to the application.
|
cannam@133
|
86
|
cannam@133
|
87 ### Structs
|
cannam@133
|
88
|
cannam@133
|
89 A struct value is encoded as a pointer to its content. The content is split into two sections:
|
cannam@133
|
90 data and pointers, with the pointer section appearing immediately after the data section. This
|
cannam@133
|
91 split allows structs to be traversed (e.g., copied) without knowing their type.
|
cannam@133
|
92
|
cannam@133
|
93 A struct pointer looks like this:
|
cannam@133
|
94
|
cannam@133
|
95 lsb struct pointer msb
|
cannam@133
|
96 +-+-----------------------------+---------------+---------------+
|
cannam@133
|
97 |A| B | C | D |
|
cannam@133
|
98 +-+-----------------------------+---------------+---------------+
|
cannam@133
|
99
|
cannam@133
|
100 A (2 bits) = 0, to indicate that this is a struct pointer.
|
cannam@133
|
101 B (30 bits) = Offset, in words, from the end of the pointer to the
|
cannam@133
|
102 start of the struct's data section. Signed.
|
cannam@133
|
103 C (16 bits) = Size of the struct's data section, in words.
|
cannam@133
|
104 D (16 bits) = Size of the struct's pointer section, in words.
|
cannam@133
|
105
|
cannam@133
|
106 Fields are positioned within the struct according to an algorithm with the following principles:
|
cannam@133
|
107
|
cannam@133
|
108 * The position of each field depends only on its definition and the definitions of lower-numbered
|
cannam@133
|
109 fields, never on the definitions of higher-numbered fields. This ensures backwards-compatibility
|
cannam@133
|
110 when new fields are added.
|
cannam@133
|
111 * Due to alignment requirements, fields in the data section may be separated by padding. However,
|
cannam@133
|
112 later-numbered fields may be positioned into the padding left between earlier-numbered fields.
|
cannam@133
|
113 Because of this, a struct will never contain more than 63 bits of padding. Since objects are
|
cannam@133
|
114 rounded up to a whole number of words anyway, padding never ends up wasting space.
|
cannam@133
|
115 * Unions and groups need not occupy contiguous memory. Indeed, they may have to be split into
|
cannam@133
|
116 multiple slots if new fields are added later on.
|
cannam@133
|
117
|
cannam@133
|
118 Field offsets are computed by the Cap'n Proto compiler. The precise algorithm is too complicated
|
cannam@133
|
119 to describe here, but you need not implement it yourself, as the compiler can produce a compiled
|
cannam@133
|
120 schema format which includes offset information.
|
cannam@133
|
121
|
cannam@133
|
122 ### Lists
|
cannam@133
|
123
|
cannam@133
|
124 A list value is encoded as a pointer to a flat array of values.
|
cannam@133
|
125
|
cannam@133
|
126 lsb list pointer msb
|
cannam@133
|
127 +-+-----------------------------+--+----------------------------+
|
cannam@133
|
128 |A| B |C | D |
|
cannam@133
|
129 +-+-----------------------------+--+----------------------------+
|
cannam@133
|
130
|
cannam@133
|
131 A (2 bits) = 1, to indicate that this is a list pointer.
|
cannam@133
|
132 B (30 bits) = Offset, in words, from the end of the pointer to the
|
cannam@133
|
133 start of the first element of the list. Signed.
|
cannam@133
|
134 C (3 bits) = Size of each element:
|
cannam@133
|
135 0 = 0 (e.g. List(Void))
|
cannam@133
|
136 1 = 1 bit
|
cannam@133
|
137 2 = 1 byte
|
cannam@133
|
138 3 = 2 bytes
|
cannam@133
|
139 4 = 4 bytes
|
cannam@133
|
140 5 = 8 bytes (non-pointer)
|
cannam@133
|
141 6 = 8 bytes (pointer)
|
cannam@133
|
142 7 = composite (see below)
|
cannam@133
|
143 D (29 bits) = Number of elements in the list, except when C is 7
|
cannam@133
|
144 (see below).
|
cannam@133
|
145
|
cannam@133
|
146 The pointed-to values are tightly-packed. In particular, `Bool`s are packed bit-by-bit in
|
cannam@133
|
147 little-endian order (the first bit is the least-significant bit of the first byte).
|
cannam@133
|
148
|
cannam@133
|
149 When C = 7, the elements of the list are fixed-width composite values -- usually, structs. In
|
cannam@133
|
150 this case, the list content is prefixed by a "tag" word that describes each individual element.
|
cannam@133
|
151 The tag has the same layout as a struct pointer, except that the pointer offset (B) instead
|
cannam@133
|
152 indicates the number of elements in the list. Meanwhile, section (D) of the list pointer -- which
|
cannam@133
|
153 normally would store this element count -- instead stores the total number of _words_ in the list
|
cannam@133
|
154 (not counting the tag word). The reason we store a word count in the pointer rather than an element
|
cannam@133
|
155 count is to ensure that the extents of the list's location can always be determined by inspecting
|
cannam@133
|
156 the pointer alone, without having to look at the tag; this may allow more-efficient prefetching in
|
cannam@133
|
157 some use cases. The reason we don't store struct lists as a list of pointers is because doing so
|
cannam@133
|
158 would take significantly more space (an extra pointer per element) and may be less cache-friendly.
|
cannam@133
|
159
|
cannam@133
|
160 In the future, we could consider implementing matrixes using the "composite" element type, with the
|
cannam@133
|
161 elements being fixed-size lists rather than structs. In this case, the tag would look like a list
|
cannam@133
|
162 pointer rather than a struct pointer. As of this writing, no such feature has been implemented.
|
cannam@133
|
163
|
cannam@133
|
164 A struct list must always be written using C = 7. However, a list of any element size (except
|
cannam@133
|
165 C = 1, i.e. 1-bit) may be *decoded* as a struct list, with each element being interpreted as being
|
cannam@133
|
166 a prefix of the struct data. For instance, a list of 2-byte values (C = 3) can be decoded as a
|
cannam@133
|
167 struct list where each struct has 2 bytes in their "data" section (and an empty pointer section). A
|
cannam@133
|
168 list of pointer values (C = 6) can be decoded as a struct list where each struct has a pointer
|
cannam@133
|
169 section with one pointer (and an empty data section). The purpose of this rule is to make it
|
cannam@133
|
170 possible to upgrade a list of primitives to a list of structs, as described under the
|
cannam@133
|
171 [protocol evolution rules](language.html#evolving-your-protocol).
|
cannam@133
|
172 (We make a special exception that boolean lists cannot be upgraded in this way due to the
|
cannam@133
|
173 unreasonable implementation burden.) Note that even though struct lists can be decoded from any
|
cannam@133
|
174 element size (except C = 1), it is NOT permitted to encode a struct list using any type other than
|
cannam@133
|
175 C = 7 because doing so would interfere with the [canonicalization algorithm](#canonicalization).
|
cannam@133
|
176
|
cannam@133
|
177 #### Default Values
|
cannam@133
|
178
|
cannam@133
|
179 A default struct is always all-zeros. To achieve this, fields in the data section are stored xor'd
|
cannam@133
|
180 with their defined default values. An all-zero pointer is considered "null" (such a pointer would
|
cannam@133
|
181 otherwise point to a zero-size struct, which might as well be considered null); accessor methods
|
cannam@133
|
182 for pointer fields check for null and return a pointer to their default value in this case.
|
cannam@133
|
183
|
cannam@133
|
184 There are several reasons why this is desirable:
|
cannam@133
|
185
|
cannam@133
|
186 * Cap'n Proto messages are often "packed" with a simple compression algorithm that deflates
|
cannam@133
|
187 zero-value bytes.
|
cannam@133
|
188 * Newly-allocated structs only need to be zero-initialized, which is fast and requires no knowledge
|
cannam@133
|
189 of the struct type except its size.
|
cannam@133
|
190 * If a newly-added field is placed in space that was previously padding, messages written by old
|
cannam@133
|
191 binaries that do not know about this field will still have its default value set correctly --
|
cannam@133
|
192 because it is always zero.
|
cannam@133
|
193
|
cannam@133
|
194 ### Inter-Segment Pointers
|
cannam@133
|
195
|
cannam@133
|
196 When a pointer needs to point to a different segment, offsets no longer work. We instead encode
|
cannam@133
|
197 the pointer as a "far pointer", which looks like this:
|
cannam@133
|
198
|
cannam@133
|
199 lsb far pointer msb
|
cannam@133
|
200 +-+-+---------------------------+-------------------------------+
|
cannam@133
|
201 |A|B| C | D |
|
cannam@133
|
202 +-+-+---------------------------+-------------------------------+
|
cannam@133
|
203
|
cannam@133
|
204 A (2 bits) = 2, to indicate that this is a far pointer.
|
cannam@133
|
205 B (1 bit) = 0 if the landing pad is one word, 1 if it is two words.
|
cannam@133
|
206 See explanation below.
|
cannam@133
|
207 C (29 bits) = Offset, in words, from the start of the target segment
|
cannam@133
|
208 to the location of the far-pointer landing-pad within that
|
cannam@133
|
209 segment. Unsigned.
|
cannam@133
|
210 D (32 bits) = ID of the target segment. (Segments are numbered
|
cannam@133
|
211 sequentially starting from zero.)
|
cannam@133
|
212
|
cannam@133
|
213 If B == 0, then the "landing pad" of a far pointer is normally just another pointer, which in turn
|
cannam@133
|
214 points to the actual object.
|
cannam@133
|
215
|
cannam@133
|
216 If B == 1, then the "landing pad" is itself another far pointer that is interpreted differently:
|
cannam@133
|
217 This far pointer (which always has B = 0) points to the start of the object's _content_, located in
|
cannam@133
|
218 some other segment. The landing pad is itself immediately followed by a tag word. The tag word
|
cannam@133
|
219 looks exactly like an intra-segment pointer to the target object would look, except that the offset
|
cannam@133
|
220 is always zero.
|
cannam@133
|
221
|
cannam@133
|
222 The reason for the convoluted double-far convention is to make it possible to form a new pointer
|
cannam@133
|
223 to an object in a segment that is full. If you can't allocate even one word in the segment where
|
cannam@133
|
224 the target resides, then you will need to allocate a landing pad in some other segment, and use
|
cannam@133
|
225 this double-far approach. This should be exceedingly rare in practice since pointers are normally
|
cannam@133
|
226 set to point to new objects, not existing ones.
|
cannam@133
|
227
|
cannam@133
|
228 ### Capabilities (Interfaces)
|
cannam@133
|
229
|
cannam@133
|
230 When using Cap'n Proto for [RPC](rpc.html), every message has an associated "capability table"
|
cannam@133
|
231 which is a flat list of all capabilities present in the message body. The details of what this
|
cannam@133
|
232 table contains and where it is stored are the responsibility of the RPC system; in some cases, the
|
cannam@133
|
233 table may not even be part of the message content.
|
cannam@133
|
234
|
cannam@133
|
235 A capability pointer, then, simply contains an index into the separate capability table.
|
cannam@133
|
236
|
cannam@133
|
237 lsb capability pointer msb
|
cannam@133
|
238 +-+-----------------------------+-------------------------------+
|
cannam@133
|
239 |A| B | C |
|
cannam@133
|
240 +-+-----------------------------+-------------------------------+
|
cannam@133
|
241
|
cannam@133
|
242 A (2 bits) = 3, to indicate that this is an "other" pointer.
|
cannam@133
|
243 B (30 bits) = 0, to indicate that this is a capability pointer.
|
cannam@133
|
244 (All other values are reserved for future use.)
|
cannam@133
|
245 C (32 bits) = Index of the capability in the message's capability
|
cannam@133
|
246 table.
|
cannam@133
|
247
|
cannam@133
|
248 In [rpc.capnp](https://github.com/sandstorm-io/capnproto/blob/master/c++/src/capnp/rpc.capnp), the
|
cannam@133
|
249 capability table is encoded as a list of `CapDescriptors`, appearing along-side the message content
|
cannam@133
|
250 in the `Payload` struct. However, some use cases may call for different approaches. A message
|
cannam@133
|
251 that is built and consumed within the same process need not encode the capability table at all
|
cannam@133
|
252 (it can just keep the table as a separate array). A message that is going to be stored to disk
|
cannam@133
|
253 would need to store a table of `SturdyRef`s instead of `CapDescriptor`s.
|
cannam@133
|
254
|
cannam@133
|
255 ## Serialization Over a Stream
|
cannam@133
|
256
|
cannam@133
|
257 When transmitting a message, the segments must be framed in some way, i.e. to communicate the
|
cannam@133
|
258 number of segments and their sizes before communicating the actual data. The best framing approach
|
cannam@133
|
259 may differ depending on the medium -- for example, messages read via `mmap` or shared memory may
|
cannam@133
|
260 call for a different approach than messages sent over a socket or a pipe. Cap'n Proto does not
|
cannam@133
|
261 attempt to specify a framing format for every situation. However, since byte streams are by far
|
cannam@133
|
262 the most common transmission medium, Cap'n Proto does define and implement a recommended framing
|
cannam@133
|
263 format for them.
|
cannam@133
|
264
|
cannam@133
|
265 When transmitting over a stream, the following should be sent. All integers are unsigned and
|
cannam@133
|
266 little-endian.
|
cannam@133
|
267
|
cannam@133
|
268 * (4 bytes) The number of segments, minus one (since there is always at least one segment).
|
cannam@133
|
269 * (N * 4 bytes) The size of each segment, in words.
|
cannam@133
|
270 * (0 or 4 bytes) Padding up to the next word boundary.
|
cannam@133
|
271 * The content of each segment, in order.
|
cannam@133
|
272
|
cannam@133
|
273 ### Packing
|
cannam@133
|
274
|
cannam@133
|
275 For cases where bandwidth usage matters, Cap'n Proto defines a simple compression scheme called
|
cannam@133
|
276 "packing". This scheme is based on the observation that Cap'n Proto messages contain lots of
|
cannam@133
|
277 zero bytes: padding bytes, unset fields, and high-order bytes of small-valued integers.
|
cannam@133
|
278
|
cannam@133
|
279 In packed format, each word of the message is reduced to a tag byte followed by zero to eight
|
cannam@133
|
280 content bytes. The bits of the tag byte correspond to the bytes of the unpacked word, with the
|
cannam@133
|
281 least-significant bit corresponding to the first byte. Each zero bit indicates that the
|
cannam@133
|
282 corresponding byte is zero. The non-zero bytes are packed following the tag.
|
cannam@133
|
283
|
cannam@133
|
284 For example, here is some typical Cap'n Proto data (a struct pointer (offset = 2, data size = 3,
|
cannam@133
|
285 pointer count = 2) followed by a text pointer (offset = 6, length = 53)) and its packed form:
|
cannam@133
|
286
|
cannam@133
|
287 unpacked (hex): 08 00 00 00 03 00 02 00 19 00 00 00 aa 01 00 00
|
cannam@133
|
288 packed (hex): 51 08 03 02 31 19 aa 01
|
cannam@133
|
289
|
cannam@133
|
290 In addition to the above, there are two tag values which are treated specially: 0x00 and 0xff.
|
cannam@133
|
291
|
cannam@133
|
292 * 0x00: The tag is followed by a single byte which indicates a count of consecutive zero-valued
|
cannam@133
|
293 words, minus 1. E.g. if the tag 0x00 is followed by 0x05, the sequence unpacks to 6 words of
|
cannam@133
|
294 zero.
|
cannam@133
|
295
|
cannam@133
|
296 Or, put another way: the tag is first decoded as if it were not special. Since none of the bits
|
cannam@133
|
297 are set, it is followed by no bytes and expands to a word full of zeros. After that, the next
|
cannam@133
|
298 byte is interpreted as a count of _additional_ words that are also all-zero.
|
cannam@133
|
299
|
cannam@133
|
300 * 0xff: The tag is followed by the bytes of the word (as if it weren't special), but after those
|
cannam@133
|
301 bytes is another byte with value N. Following that byte is N unpacked words that should be copied
|
cannam@133
|
302 directly. These unpacked words may or may not contain zeros -- it is up to the compressor to
|
cannam@133
|
303 decide when to end the unpacked span and return to packing each word. The purpose of this rule
|
cannam@133
|
304 is to minimize the impact of packing on data that doesn't contain any zeros -- in particular,
|
cannam@133
|
305 long text blobs. Because of this rule, the worst-case space overhead of packing is 2 bytes per
|
cannam@133
|
306 2 KiB of input (256 words = 2KiB).
|
cannam@133
|
307
|
cannam@133
|
308 Examples:
|
cannam@133
|
309
|
cannam@133
|
310 unpacked (hex): 00 (x 32 bytes)
|
cannam@133
|
311 packed (hex): 00 03
|
cannam@133
|
312
|
cannam@133
|
313 unpacked (hex): 8a (x 32 bytes)
|
cannam@133
|
314 packed (hex): ff 8a (x 8 bytes) 03 8a (x 24 bytes)
|
cannam@133
|
315
|
cannam@133
|
316 Notice that both of the special cases begin by treating the tag as if it weren't special. This
|
cannam@133
|
317 is intentionally designed to make encoding faster: you can compute the tag value and encode the
|
cannam@133
|
318 bytes in a single pass through the input word. Only after you've finished with that word do you
|
cannam@133
|
319 need to check whether the tag ended up being 0x00 or 0xff.
|
cannam@133
|
320
|
cannam@133
|
321 It is possible to write both an encoder and a decoder which only branch at the end of each word,
|
cannam@133
|
322 and only to handle the two special tags. It is not necessary to branch on every byte. See the
|
cannam@133
|
323 C++ reference implementation for an example.
|
cannam@133
|
324
|
cannam@133
|
325 Packing is normally applied on top of the standard stream framing described in the previous
|
cannam@133
|
326 section.
|
cannam@133
|
327
|
cannam@133
|
328 ### Compression
|
cannam@133
|
329
|
cannam@133
|
330 When Cap'n Proto messages may contain repetitive data (especially, large text blobs), it makes sense
|
cannam@133
|
331 to apply a standard compression algorithm in addition to packing. When CPU time is scarce, we
|
cannam@133
|
332 recommend [LZ4 compression](https://code.google.com/p/lz4/). Otherwise, [zlib](http://www.zlib.net)
|
cannam@133
|
333 is slower but will compress more.
|
cannam@133
|
334
|
cannam@133
|
335 ## Canonicalization
|
cannam@133
|
336
|
cannam@133
|
337 Cap'n Proto messages have a well-defined canonical form. Cap'n Proto encoders are NOT required to
|
cannam@133
|
338 output messages in canonical form, and in fact they will almost never do so by default. However,
|
cannam@133
|
339 it is possible to write code which canonicalizes a Cap'n Proto message without knowing its schema.
|
cannam@133
|
340
|
cannam@133
|
341 A canonical Cap'n Proto message must adhere to the following rules:
|
cannam@133
|
342
|
cannam@133
|
343 * The object tree must be encoded in preorder (with respect to the order of the pointers within
|
cannam@133
|
344 each object).
|
cannam@133
|
345 * The message must be encoded as a single segment. (When signing or hashing a canonical Cap'n Proto
|
cannam@133
|
346 message, the segment table shall not be included, because it would be redundant.)
|
cannam@133
|
347 * Trailing zero-valued words in a struct's data or pointer segments must be truncated. Since zero
|
cannam@133
|
348 represents a default value, this does not change the struct's meaning. This rule is important
|
cannam@133
|
349 to ensure that adding a new field to a struct does not affect the canonical encoding of messages
|
cannam@133
|
350 that do not set that field.
|
cannam@133
|
351 * Similarly, for a struct list, if a trailing word in a section of all structs in the list is zero,
|
cannam@133
|
352 then it must be truncated from all structs in the list. (All structs in a struct list must have
|
cannam@133
|
353 equal sizes, hence a trailing zero can only be removed if it is zero in all elements.)
|
cannam@133
|
354 * Canonical messages are not packed. However, packing can still be applied for transmission
|
cannam@133
|
355 purposes; the message must simply be unpacked before checking signatures.
|
cannam@133
|
356
|
cannam@133
|
357 Note that Cap'n Proto 0.5 introduced the rule that struct lists must always be encoded using
|
cannam@133
|
358 C = 7 in the [list pointer](#lists). Prior versions of Cap'n Proto allowed struct lists to be
|
cannam@133
|
359 encoded using any element size, so that small structs could be compacted to take less that a word
|
cannam@133
|
360 per element, and many encoders in fact implemented this. Unfortunately, this "optimization" made
|
cannam@133
|
361 canonicalization impossible without knowing the schema, which is a significant obstacle. Therefore,
|
cannam@133
|
362 the rules have been changed in 0.5, but data written by previous versions may not be possible to
|
cannam@133
|
363 canonicalize.
|
cannam@133
|
364
|
cannam@133
|
365 ## Security Considerations
|
cannam@133
|
366
|
cannam@133
|
367 A naive implementation of a Cap'n Proto reader may be vulnerable to attacks based on various kinds
|
cannam@133
|
368 of malicious input. Implementations MUST guard against these.
|
cannam@133
|
369
|
cannam@133
|
370 ### Pointer Validation
|
cannam@133
|
371
|
cannam@133
|
372 Cap'n Proto readers must validate pointers, e.g. to check that the target object is within the
|
cannam@133
|
373 bounds of its segment. To avoid an upfront scan of the message (which would defeat Cap'n Proto's
|
cannam@133
|
374 O(1) parsing performance), validation should occur lazily when the getter method for a pointer is
|
cannam@133
|
375 called, throwing an exception or returning a default value if the pointer is invalid.
|
cannam@133
|
376
|
cannam@133
|
377 ### Amplification attack
|
cannam@133
|
378
|
cannam@133
|
379 A message containing cyclic (or even just overlapping) pointers can cause the reader to go into
|
cannam@133
|
380 an infinite loop while traversing the content.
|
cannam@133
|
381
|
cannam@133
|
382 To defend against this, as the application traverses the message, each time a pointer is
|
cannam@133
|
383 dereferenced, a counter should be incremented by the size of the data to which it points. If this
|
cannam@133
|
384 counter goes over some limit, an error should be raised, and/or default values should be returned. We call this limit the "traversal limit" (or, sometimes, the "read limit").
|
cannam@133
|
385
|
cannam@133
|
386 The C++ implementation currently defaults to a limit of 64MiB, but allows the caller to set a
|
cannam@133
|
387 different limit if desired. Another reasonable strategy is to set the limit to some multiple of
|
cannam@133
|
388 the original message size; however, most applications should place limits on overall message sizes
|
cannam@133
|
389 anyway, so it makes sense to have one check cover both.
|
cannam@133
|
390
|
cannam@133
|
391 **List amplification:** A list of `Void` values or zero-size structs can have a very large element count while taking constant space on the wire. If the receiving application expects a list of structs, it will see these zero-sized elements as valid structs set to their default values. If it iterates through the list processing each element, it could spend a large amount of CPU time or other resources despite the message being small. To defend against this, the "traversal limit" should count a list of zero-sized elements as if each element were one word instead. This rule was introduced in the C++ implementation in [commit 1048706](https://github.com/sandstorm-io/capnproto/commit/104870608fde3c698483fdef6b97f093fc15685d).
|
cannam@133
|
392
|
cannam@133
|
393 ### Stack overflow DoS attack
|
cannam@133
|
394
|
cannam@133
|
395 A message with deeply-nested objects can cause a stack overflow in typical code which processes
|
cannam@133
|
396 messages recursively.
|
cannam@133
|
397
|
cannam@133
|
398 To defend against this, as the application traverses the message, the pointer depth should be
|
cannam@133
|
399 tracked. If it goes over some limit, an error should be raised. The C++ implementation currently
|
cannam@133
|
400 defaults to a limit of 64 pointers, but allows the caller to set a different limit.
|