% -*- mode: latex; TeX-master: "Vorbis_I_spec"; -*-
%!TEX root = Vorbis_I_spec.tex
% $Id$
\section{Introduction and Description} \label{vorbis:spec:intro}

\subsection{Overview}

This document provides a high-level description of the Vorbis codec's
construction. A bit-by-bit specification begins in
\xref{vorbis:spec:codec}.
The later sections assume a high-level
understanding of the Vorbis decode process, which is
provided here.

\subsubsection{Application}
Vorbis is a general-purpose perceptual audio CODEC intended to allow
maximum encoder flexibility, thus allowing it to scale competitively
over an exceptionally wide range of bitrates. At the high
quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
it is in the same league as MPEG-2 and MPC. Similarly, the 1.0
encoder can encode high-quality CD and DAT rate stereo at below 48kbps
without resampling to a lower rate. Vorbis is also intended for
lower and higher sample rates (from 8kHz telephony to 192kHz digital
masters) and a range of channel representations (monaural,
polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
discrete channels).


\subsubsection{Classification}
Vorbis I is a forward-adaptive monolithic transform CODEC based on the
Modified Discrete Cosine Transform. The codec is structured to allow
addition of a hybrid wavelet filterbank in Vorbis II to offer better
transient response and reproduction using a transform better suited to
localized time events.


\subsubsection{Assumptions}

The Vorbis CODEC design assumes a complex, psychoacoustically-aware
encoder and simple, low-complexity decoder. Vorbis decode is
computationally simpler than mp3, although it does require more
working memory as Vorbis has no static probability model; the vector
codebooks used in the first stage of decoding from the bitstream are
packed in their entirety into the Vorbis bitstream headers. In
packed form, these codebooks occupy only a few kilobytes; the extent
to which they are pre-decoded into a cache is the dominant factor in
decoder memory usage.


Vorbis provides none of its own framing, synchronization or protection
against errors; it is solely a method of accepting input audio,
dividing it into individual frames and compressing these frames into
raw, unformatted `packets'. The decoder then accepts these raw
packets in sequence, decodes them, synthesizes audio frames from
them, and reassembles the frames into a facsimile of the original
audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
minimum size, maximum size, or fixed/expected size. Packets
are designed so that they may be truncated (or padded) and remain
decodable; this is not to be considered an error condition and is used
extensively in bitrate management in peeling. Both the transport
mechanism and decoder must allow that a packet may be any size, or
end before or after packet decode expects.

Vorbis packets are thus intended to be used with a transport mechanism
that provides free-form framing, sync, positioning and error correction
in accordance with these design assumptions, such as Ogg (for file
transport) or RTP (for network multicast). For purposes of a few
examples in this document, we will assume that Vorbis is to be
embedded in an Ogg stream specifically, although this is by no means a
requirement or fundamental assumption in the Vorbis design.

The specification for embedding Vorbis into
an Ogg transport stream is in \xref{vorbis:over:ogg}.



\subsubsection{Codec Setup and Probability Model}

Vorbis' heritage is as a research CODEC and its current design
reflects a desire to allow multiple decades of continuous encoder
improvement before running out of room within the codec specification.
For these reasons, configurable aspects of codec setup intentionally
lean toward the extreme of forward adaptive.

The single most controversial design decision in Vorbis (and the most
unusual for a Vorbis developer to keep in mind) is that the entire
probability model of the codec, the Huffman and VQ codebooks, is
packed into the bitstream header along with extensive CODEC setup
parameters (often several hundred fields). Unlike with the MPEG audio
layers, it is therefore impossible to embed a simple frame type
flag in each audio packet, or to begin decode at any frame in the stream
without having previously fetched the codec setup header.


\begin{note}
Vorbis \emph{can} initiate decode at any arbitrary packet within a
bitstream so long as the codec has been initialized/setup with the
setup headers.
\end{note}

Thus, Vorbis headers are both required for decode to begin and
relatively large as bitstream headers go. The header size is
unbounded, although for streaming a rule-of-thumb of 4kB or less is
recommended (and Xiph.Org's Vorbis encoder follows this suggestion).

Our own design work indicates the primary liability of the
required header is in mindshare; it is an unusual design and thus
causes some amount of complaint among engineers as this runs against
current design trends (and also points out limitations in some
existing software/interface designs, such as Windows' ACM codec
framework). However, we find that it does not fundamentally limit
Vorbis' suitable application space.


\subsubsection{Format Specification}
The Vorbis format is well-defined by its decode specification; any
encoder that produces packets that are correctly decoded by the
reference Vorbis decoder described below may be considered a proper
Vorbis encoder. A decoder must faithfully and completely implement
the specification defined below (except where noted) to be considered
a proper Vorbis decoder.

\subsubsection{Hardware Profile}
Although Vorbis decode is computationally simple, it may still run
into specific limitations of an embedded design. For this reason,
embedded designs are allowed to deviate in limited ways from the
`full' decode specification yet still be certified compliant. These
optional omissions are labelled in the spec where relevant.


\subsection{Decoder Configuration}

Decoder setup consists of configuration of multiple, self-contained
component abstractions that perform specific functions in the decode
pipeline. Each different component instance of a specific type is
semantically interchangeable; decoder configuration consists of both
internal component configuration and the arrangement of specific
instances into a decode pipeline. Componentry arrangement is roughly
as follows:

\begin{center}
\includegraphics[width=\textwidth]{components}
\captionof{figure}{decoder pipeline configuration}
\end{center}

\subsubsection{Global Config}
Global codec configuration consists of a few audio-related fields
(sample rate, channels), Vorbis version (always `0' in Vorbis I),
bitrate hints, and the lists of component instances. All other
configuration is in the context of specific components.

\subsubsection{Mode}

Each Vorbis frame is coded according to a master `mode'. A bitstream
may use one or many modes.

The mode mechanism is used to encode a frame according to one of
multiple possible methods with the intention of choosing a method best
suited to that frame. Different modes are, for example, how the frame
size is changed from frame to frame. The mode number of a frame serves as a
top level configuration switch for all other specific aspects of frame
decode.

A `mode' configuration consists of a frame size setting, window type
(always 0, the Vorbis window, in Vorbis I), transform type (always
type 0, the MDCT, in Vorbis I) and a mapping number. The mapping
number specifies which mapping configuration instance to use for
low-level packet decode and synthesis.


\subsubsection{Mapping}

A mapping contains a channel coupling description and a list of
`submaps' that bundle sets of channel vectors together for grouped
encoding and decoding. These submaps are not references to external
components; the submap list is internal and specific to a mapping.

A `submap' is a configuration/grouping that applies to a subset of
floor and residue vectors within a mapping. The submap functions as a
last layer of indirection such that specific special floor or residue
settings can be applied not only to all the vectors in a given mode,
but also specific vectors in a specific mode. Each submap specifies
the proper floor and residue instance number to use for decoding that
submap's spectral floor and spectral residue vectors.

As an example:

Assume a Vorbis stream that contains six channels in the standard 5.1
format. The sixth channel, as is normal in 5.1, is bass only.
Therefore it would be wasteful to encode a full-spectrum version of it
as with the other channels. The submapping mechanism can be used to
apply a full range floor and residue encoding to channels 0 through 4,
and a bass-only representation to the bass channel, thus saving space.
In this example, channels 0-4 belong to submap 0 (which indicates use
of a full-range floor) and channel 5 belongs to submap 1, which uses a
bass-only representation.


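
As a sketch, the 5.1 grouping described in the mapping example can be
tabulated as follows. The floor and residue instance numbers are
hypothetical placeholders chosen only for illustration; an actual
stream assigns them in its setup header.

\begin{center}
\begin{tabular}{llll}
channels & submap & floor instance & residue instance \\
\hline
0--4 (full range) & 0 & 0 (hypothetical) & 0 (hypothetical) \\
5 (bass only) & 1 & 1 (hypothetical) & 1 (hypothetical) \\
\end{tabular}
\end{center}
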
\subsubsection{Floor}

Vorbis encodes a spectral `floor' vector for each PCM channel. This
vector is a low-resolution representation of the audio spectrum for
the given channel in the current frame, generally used akin to a
whitening filter. It is named a `floor' because the Xiph.Org
reference encoder has historically used it as a unit-baseline for
spectral resolution.

A floor encoding may be one of two types. Floor 0 uses a packed LSP
representation on a dB amplitude scale and Bark frequency scale.
Floor 1 represents the curve as a piecewise linear interpolated
representation on a dB amplitude scale and linear frequency scale.
The two floors are semantically interchangeable in
encoding/decoding. However, floor type 1 provides more stable
inter-frame behavior, and so is the preferred choice in all
coupled-stereo and high bitrate modes. Floor 1 is also considerably
less expensive to decode than floor 0.

Floor 0 is not to be considered deprecated, but it is of limited
modern use. No known Vorbis encoder past Xiph.Org's own beta 4 makes
use of floor 0.

The values coded/decoded by a floor are both compactly formatted and
make use of entropy coding to save space. For this reason, a floor
configuration generally refers to multiple codebooks in the codebook
component list. Entropy coding is thus provided as an abstraction,
and each floor instance may choose from any and all available
codebooks when coding/decoding.


\subsubsection{Residue}
The spectral residue is the fine structure of the audio spectrum
once the floor curve has been subtracted out. In simplest terms, it
is coded in the bitstream using cascaded (multi-pass) vector
quantization according to one of three specific packing/coding
algorithms numbered 0 through 2. The packing algorithm details are
configured by residue instance. As with the floor components, the
final VQ/entropy encoding is provided by external codebook instances
and each residue instance may choose from any and all available
codebooks.

\subsubsection{Codebooks}

Codebooks are a self-contained abstraction that performs entropy
decoding and, optionally, uses the entropy-decoded integer value as an
offset into an index of output value vectors, returning the indicated
vector of values.

The entropy coding in a Vorbis I codebook is provided by a standard
Huffman binary tree representation. This tree is tightly packed using
one of several methods, depending on whether codeword lengths are
ordered or unordered, or the tree is sparse.

The codebook vector index is similarly packed according to index
characteristic. Most commonly, the vector index is encoded as a
single list of possible values that are then permuted into
a list of n-dimensional rows (lattice VQ).



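
As a rough sketch of the lattice permutation, write $v$ for the number
of values in the packed list, $d$ for the codebook dimension and $e$
for the entry number selected by entropy decode (all three symbols are
our own shorthand, not field names). Ignoring the scaling, offset and
optional running-sum terms that the full codebook specification
applies, element $i$ of the output vector is taken from position
\[ m_i = \left\lfloor \frac{e}{v^{\,i}} \right\rfloor \bmod v ,
   \qquad i = 0 \ldots d-1 \]
of the value list. For instance, with $v = 3$ and $d = 2$ the list of
three values expands to $3^2 = 9$ rows, and entry $e = 5$ selects
positions $(m_0, m_1) = (2, 1)$.
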
\subsection{High-level Decode Process}

\subsubsection{Decode Setup}

Before decoding can begin, a decoder must initialize using the
bitstream headers matching the stream to be decoded. Vorbis uses
three header packets; all are required, in order, by this
specification. Once set up, decode may begin at any audio packet
belonging to the Vorbis stream. In Vorbis I, all packets after the
three initial headers are audio packets.

The header packets are, in order, the identification
header, the comment header, and the setup header.

\paragraph{Identification Header}
The identification header identifies the bitstream as Vorbis and gives
the Vorbis version and the simple audio characteristics of the stream,
such as sample rate and number of channels.

\paragraph{Comment Header}
The comment header includes user text comments (``tags'') and a vendor
string for the application/library that produced the bitstream. The
encoding and proper use of the comment header is described in \xref{vorbis:spec:comment}.

\paragraph{Setup Header}
The setup header includes extensive CODEC setup information as well as
the complete VQ and Huffman codebooks needed for decode.


\subsubsection{Decode Procedure}

The decoding and synthesis procedure for all audio packets is
fundamentally the same.
\begin{enumerate}
\item decode packet type flag
\item decode mode number
\item decode window shape (long windows only)
\item decode floor
\item decode residue into residue vectors
\item inverse channel coupling of residue vectors
\item generate floor curve from decoded floor data
\item compute dot product of floor and residue, producing audio spectrum vector
\item inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
\item overlap/add left-hand output of transform with right-hand output of previous frame
\item store right-hand data from transform of current frame for future lapping
\item if not first frame, return results of overlap/add as audio result of current frame
\end{enumerate}

Note that clever rearrangement of the synthesis arithmetic is
possible; as an example, one can take advantage of symmetries in the
MDCT to store the right-hand transform data of a partial MDCT for a
50\% inter-frame buffer space savings, and then complete the transform
later before overlap/add with the next frame. This optimization
produces entirely equivalent output and is naturally perfectly legal.
The decoder must be \emph{entirely mathematically equivalent} to the
specification; it need not be a literal semantic implementation.

\paragraph{Packet type decode}

Vorbis I uses four packet types. The first three packet types mark each
of the three Vorbis headers described above. The fourth packet type
marks an audio packet. All other packet types are reserved; packets
marked with a reserved type should be ignored.

Following the three header packets, all packets in a Vorbis I stream
are audio. The first step of audio packet decode is to read and
verify the packet type; \emph{a non-audio packet when audio is expected
indicates stream corruption or a non-compliant stream. The decoder
must ignore the packet and not attempt decoding it to
audio}.




\paragraph{Mode decode}
Vorbis allows an encoder to set up multiple, numbered packet `modes',
as described earlier, all of which may be used in a given Vorbis
stream. The mode is encoded as an integer used as a direct offset into
the mode instance index.


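
As a sketch of how wide that integer is: the audio packet decode
section later in this specification reads the mode number as an
unsigned field whose width depends on the number of modes configured
in the setup header. Writing $N$ for the mode count and $b$ for the
field width (both symbols are our own shorthand), and taking
$\mathrm{ilog}(x)$ to be the position of the highest set bit of $x$
(zero when $x$ is zero),
\[ b = \mathrm{ilog}(N - 1) . \]
A stream with a single mode therefore spends no bits on the mode
number, a stream with two modes spends one bit, and a stream with
three or four modes spends two bits.
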
\paragraph{Window shape decode (long windows only)} \label{vorbis:spec:window}

Vorbis frames may be one of two PCM sample sizes specified during
codec setup. In Vorbis I, legal frame sizes are powers of two from 64
to 8192 samples. Aside from coupling, Vorbis handles channels as
independent vectors and these frame sizes are in samples per channel.

Vorbis uses an overlapping transform, namely the MDCT, to blend one
frame into the next, avoiding most inter-frame block boundary
artifacts. The MDCT output of one frame is windowed according to MDCT
requirements, overlapped 50\% with the output of the previous frame and
added. The window shape assures seamless reconstruction.

This is easy to visualize in the case of equal-sized windows:

\begin{center}
\includegraphics[width=\textwidth]{window1}
\captionof{figure}{overlap of two equal-sized windows}
\end{center}

And slightly more complex in the case of overlapping unequal-sized
windows:

\begin{center}
\includegraphics[width=\textwidth]{window2}
\captionof{figure}{overlap of a long and a short window}
\end{center}

In the unequal-sized window case, the window shape of the long window
must be modified for seamless lapping as above. It is possible to
correctly infer window shape to be applied to the current window from
knowing the sizes of the current, previous and next window. It is
legal for a decoder to use this method. However, in the case of a long
window (short windows require no modification), Vorbis also codes two
flag bits to specify pre- and post- window shape. Although not
strictly necessary for function, this minor redundancy allows a packet
to be fully decoded to the point of lapping entirely independently of
any other packet, allowing easier abstraction of decode layers as well
as allowing a greater level of easy parallelism in encode and
decode.

A description of valid window functions for use with an inverse MDCT
can be found in \cite{Sporer/Brandenburg/Edler}. Vorbis windows
all use the slope function
\[ y = \sin\left(\frac{\pi}{2} \, \sin^2\left(\frac{(x+0.5)\,\pi}{n}\right)\right) . \]



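
To see why this slope laps seamlessly, consider (as a sketch) two
equal-sized windows, assuming $n$ is the full window length and $x$
runs over $0 \ldots n/2-1$; the symbol $\theta$ is introduced here only
for the derivation. With $\theta = (x+0.5)\pi/n$, the sample half a
window later has angle $\theta + \pi/2$, so
\[ y(x) = \sin\left(\frac{\pi}{2}\sin^2\theta\right), \qquad
   y(x + n/2) = \sin\left(\frac{\pi}{2}\cos^2\theta\right)
              = \cos\left(\frac{\pi}{2}\sin^2\theta\right) , \]
and therefore
\[ y(x)^2 + y(x + n/2)^2 = 1 . \]
The rising half of one window and the falling half of the previous
window are thus power complementary, which is the condition an MDCT
window must satisfy for the overlap/add step to reconstruct the signal
without a seam.
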
\paragraph{floor decode}
Each floor is encoded/decoded in channel order; however, each floor
belongs to a `submap' that specifies which floor configuration to
use. All floors are decoded before residue decode begins.


\paragraph{residue decode}

Although the number of residue vectors equals the number of channels,
channel coupling may mean that the raw residue vectors extracted
during decode do not map directly to specific channels. When channel
coupling is in use, some vectors will correspond to coupled magnitude
or angle. The coupling relationships are described in the codec setup
and may differ from frame to frame, due to different mode numbers.

Vorbis codes residue vectors in groups by submap; the coding is done
in submap order from submap 0 through n-1. This differs from floors
which are coded using a configuration provided by submap number, but
are coded individually in channel order.



\paragraph{inverse channel coupling}

A detailed discussion of stereo in the Vorbis codec can be found in
the document \href{stereo.html}{Stereo Channel Coupling in the
Vorbis CODEC}. Vorbis is not limited to only stereo coupling, but
the stereo document also gives a good overview of the generic coupling
mechanism.

Vorbis coupling applies to pairs of residue vectors at a time;
decoupling is done in-place a pair at a time in the order and using
the vectors specified in the current mapping configuration. The
decoupling operation is the same for all pairs, converting square
polar representation (where one vector is magnitude and the second
angle) back to Cartesian representation.

After decoupling each pair of vectors on the coupling list, in order,
the resulting residue vectors represent the fine spectral detail
of each output channel.



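
As a sketch of the per-element square polar conversion, following the
channel decoupling step given in the detailed decode section of this
specification: let $M$ and $A$ be corresponding elements of the
magnitude and angle vectors, and let $M'$ and $A'$ (our own notation)
be the decoupled values that overwrite them. Then
\[ (M', A') = \left\{ \begin{array}{ll}
   (M,\; M - A) & \mbox{if } M > 0 \mbox{ and } A > 0 \\
   (M + A,\; M) & \mbox{if } M > 0 \mbox{ and } A \le 0 \\
   (M,\; M + A) & \mbox{if } M \le 0 \mbox{ and } A > 0 \\
   (M - A,\; M) & \mbox{if } M \le 0 \mbox{ and } A \le 0
   \end{array} \right. \]
For example, $M = 5$, $A = 2$ decouples to $(5, 3)$, while $M = 5$,
$A = -2$ decouples to $(3, 5)$.
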
\paragraph{generate floor curve}

The decoder may choose to generate the floor curve at any appropriate
time. It is reasonable to generate the output curve when the floor
data is decoded from the raw packet, or it can be generated after
inverse coupling and applied to the spectral residue directly,
combining generation and the dot product into one step and eliminating
some working space.

Both floor 0 and floor 1 generate a linear-range, linear-domain output
vector to be multiplied (dot product) by the linear-range,
linear-domain spectral residue.



\paragraph{compute floor/residue dot product}

This step is straightforward; for each output channel, the decoder
multiplies the floor curve and residue vectors element by element,
producing the finished audio spectrum of each channel.

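
Written out as a sketch, with $F_c$, $R_c$ and $S_c$ as our own names
for the floor curve, residue vector and resulting spectrum of channel
$c$, and assuming a frame size of $n$ samples so that each vector holds
$n/2$ elements:
\[ S_c[i] = F_c[i] \cdot R_c[i], \qquad i = 0 \ldots n/2 - 1 . \]
Note that the result is a vector of the same length as its inputs, not
a single summed scalar; `dot product' is used loosely in this document
for this element-wise multiplication.
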
% TODO/FIXME: The following two paragraphs have identical twins
%   in section 4 (under "dot product")
One point is worth mentioning about this dot product; a common mistake
in a fixed-point implementation might be to assume that a 32 bit
fixed-point representation for floor and residue and direct
multiplication of the vectors is sufficient for acceptable spectral
depth in all cases because it happens to mostly work with the current
Xiph.Org reference encoder.

However, floor vector values can span \~{}140dB (\~{}24 bits unsigned), and
the audio spectrum vector should represent a minimum of 120dB (\~{}21
bits with sign), even when output is to a 16 bit PCM device. For the
residue vector to represent full scale if the floor is nailed to
$-140$dB, it must be able to span 0 to $+140$dB. For the residue vector
to reach full scale if the floor is nailed at 0dB, it must be able to
represent $-140$dB to $+0$dB. Thus, in order to handle full range
dynamics, a residue vector may span $-140$dB to $+140$dB entirely within
spec. A 280dB range is approximately 48 bits with sign; thus the
residue vector must be able to represent a 48 bit range and the dot
product must be able to handle an effective 48 bit times 24 bit
multiplication. This range may be achieved using large (64 bit or
larger) integers, or implementing a movable binary point
representation.



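
The bit counts quoted for these ranges follow from the usual rule that
one bit of amplitude resolution corresponds to
$20\log_{10}2 \approx 6.02$dB. As a rough check of the arithmetic:
\[ \frac{140}{6.02} \approx 23.3 \;\Rightarrow\; 24 \mbox{ bits unsigned}, \qquad
   \frac{120}{6.02} \approx 19.9 \;\Rightarrow\; 20 + 1 = 21 \mbox{ bits with sign}, \]
\[ \frac{280}{6.02} \approx 46.5 \;\Rightarrow\; 47 + 1 = 48 \mbox{ bits with sign} . \]
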
\paragraph{inverse monolithic transform (MDCT)}

The audio spectrum is converted back into time domain PCM audio via an
inverse Modified Discrete Cosine Transform (MDCT). A detailed
description of the MDCT is available in \cite{Sporer/Brandenburg/Edler}.

Note that the PCM produced directly from the MDCT is not yet finished
audio; it must be lapped with surrounding frames using an appropriate
window (such as the Vorbis window) before the MDCT can be considered
orthogonal.



\paragraph{overlap/add data}
Windowed MDCT output is overlapped and added with the right-hand data
of the previous window such that the 3/4 point of the previous window
is aligned with the 1/4 point of the current window (as illustrated in
the window overlap diagram). At this point, the audio data between the
center of the previous frame and the center of the current frame is
now finished and ready to be returned.


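
For the simplest case of two equal-sized windows of $n$ samples, the
lapping can be sketched as follows, where $p$ and $c$ are our own names
for the windowed transform output of the previous and current frame
and $o$ is the finished audio:
\[ o[i] = p[n/2 + i] + c[i], \qquad i = 0 \ldots n/2 - 1 . \]
Here sample $3n/4$ of the previous frame lines up with sample $n/4$ of
the current frame, matching the alignment described for this step.
When the two windows differ in size, the overlapped region is narrower
and the samples outside it are passed through from the longer frame
unchanged.
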
\paragraph{cache right hand data}
The decoder must cache the right-hand portion of the current frame to
be lapped with the left-hand portion of the next frame.



\paragraph{return finished audio data}

The overlapped portion produced from overlapping the previous and
current frame data is finished data to be returned by the decoder.
This data spans from the center of the previous window to the center
of the current window. In the case of same-sized windows, the amount
of data to return is one-half block consisting of and only of the
overlapped portions. When overlapping a short and long window, much of
the returned range is not actually overlap. This does not damage
transform orthogonality. Take care, however, to return the
correct data range; the amount of data to be returned is:

\begin{Verbatim}[commandchars=\\\{\}]
window\_blocksize(previous\_window)/4+window\_blocksize(current\_window)/4
\end{Verbatim}

from the center of the previous window to the center of the current
window.

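
As a worked example of this formula, with block sizes chosen only for
illustration from the legal powers of two: a short previous window of
256 samples followed by a long current window of 2048 samples returns
\[ 256/4 + 2048/4 = 64 + 512 = 576 \]
samples, while two consecutive 2048-sample windows return
\[ 2048/4 + 2048/4 = 1024 \]
samples, which is the one-half block expected in the same-sized case.
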
Data is not returned from the first frame; it must be used to `prime'
the decode engine. The encoder accounts for this priming when
calculating PCM offsets; after the first frame, the proper PCM output
offset is `0' (as no data has been returned yet).