annotate src/libvorbis-1.3.3/doc/01-introduction.tex @ 1:05aa0afa9217

Bring in flac, ogg, vorbis
author Chris Cannam
date Tue, 19 Mar 2013 17:37:49 +0000
% -*- mode: latex; TeX-master: "Vorbis_I_spec"; -*-
%!TEX root = Vorbis_I_spec.tex
% $Id$
\section{Introduction and Description} \label{vorbis:spec:intro}

\subsection{Overview}

This document provides a high-level description of the Vorbis codec's
construction. A bit-by-bit specification appears beginning in
\xref{vorbis:spec:codec}.
The later sections assume a high-level
understanding of the Vorbis decode process, which is
provided here.

\subsubsection{Application}
Vorbis is a general-purpose perceptual audio CODEC intended to allow
maximum encoder flexibility, thus allowing it to scale competitively
over an exceptionally wide range of bitrates. At the high
quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
it is in the same league as MPEG-2 and MPC. Similarly, the 1.0
encoder can encode high-quality CD and DAT rate stereo at below 48kbps
without resampling to a lower rate. Vorbis is also intended for
lower and higher sample rates (from 8kHz telephony to 192kHz digital
masters) and a range of channel representations (monaural,
polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
discrete channels).

\subsubsection{Classification}
Vorbis I is a forward-adaptive monolithic transform CODEC based on the
Modified Discrete Cosine Transform. The codec is structured to allow
addition of a hybrid wavelet filterbank in Vorbis II to offer better
transient response and reproduction using a transform better suited to
localized time events.

\subsubsection{Assumptions}

The Vorbis CODEC design assumes a complex, psychoacoustically-aware
encoder and a simple, low-complexity decoder. Vorbis decode is
computationally simpler than MP3, although it does require more
working memory as Vorbis has no static probability model; the vector
codebooks used in the first stage of decoding from the bitstream are
packed in their entirety into the Vorbis bitstream headers. In
packed form, these codebooks occupy only a few kilobytes; the extent
to which they are pre-decoded into a cache is the dominant factor in
decoder memory usage.


Vorbis provides none of its own framing, synchronization or protection
against errors; it is solely a method of accepting input audio,
dividing it into individual frames and compressing these frames into
raw, unformatted `packets'. The decoder then accepts these raw
packets in sequence, decodes them, synthesizes audio frames from
them, and reassembles the frames into a facsimile of the original
audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
minimum size, maximum size, or fixed/expected size. Packets
are designed so that they may be truncated (or padded) and remain
decodable; this is not to be considered an error condition and is used
extensively in bitrate management in peeling. Both the transport
mechanism and decoder must allow that a packet may be any size, or
end before or after packet decode expects.

Vorbis packets are thus intended to be used with a transport mechanism
that provides free-form framing, sync, positioning and error correction
in accordance with these design assumptions, such as Ogg (for file
transport) or RTP (for network multicast). For purposes of a few
examples in this document, we will assume that Vorbis is to be
embedded in an Ogg stream specifically, although this is by no means a
requirement or fundamental assumption in the Vorbis design.

The specification for embedding Vorbis into
an Ogg transport stream is in \xref{vorbis:over:ogg}.

\subsubsection{Codec Setup and Probability Model}

Vorbis' heritage is as a research CODEC and its current design
reflects a desire to allow multiple decades of continuous encoder
improvement before running out of room within the codec specification.
For these reasons, configurable aspects of codec setup intentionally
lean toward the extreme of forward adaptive.

The single most controversial design decision in Vorbis (and the most
unusual for a Vorbis developer to keep in mind) is that the entire
probability model of the codec, the Huffman and VQ codebooks, is
packed into the bitstream header along with extensive CODEC setup
parameters (often several hundred fields). Unlike with the MPEG audio
layers, this makes it impossible to embed a simple frame type
flag in each audio packet, or to begin decode at any frame in the stream
without having previously fetched the codec setup header.


\begin{note}
Vorbis \emph{can} initiate decode at any arbitrary packet within a
bitstream so long as the codec has been initialized/setup with the
setup headers.
\end{note}

Thus, Vorbis headers are both required for decode to begin and
relatively large as bitstream headers go. The header size is
unbounded, although for streaming a rule-of-thumb of 4kB or less is
recommended (and Xiph.Org's Vorbis encoder follows this suggestion).

Our own design work indicates the primary liability of the
required header is in mindshare; it is an unusual design and thus
causes some amount of complaint among engineers as this runs against
current design trends (and also points out limitations in some
existing software/interface designs, such as Windows' ACM codec
framework). However, we find that it does not fundamentally limit
Vorbis' suitable application space.

\subsubsection{Format Specification}
The Vorbis format is well-defined by its decode specification; any
encoder that produces packets that are correctly decoded by the
reference Vorbis decoder described below may be considered a proper
Vorbis encoder. A decoder must faithfully and completely implement
the specification defined below (except where noted) to be considered
a proper Vorbis decoder.

\subsubsection{Hardware Profile}
Although Vorbis decode is computationally simple, it may still run
into specific limitations of an embedded design. For this reason,
embedded designs are allowed to deviate in limited ways from the
`full' decode specification yet still be certified compliant. These
optional omissions are labelled in the spec where relevant.

\subsection{Decoder Configuration}

Decoder setup consists of configuration of multiple, self-contained
component abstractions that perform specific functions in the decode
pipeline. Each different component instance of a specific type is
semantically interchangeable; decoder configuration consists of both
internal component configuration and the arrangement of specific
instances into a decode pipeline. Componentry arrangement is roughly
as follows:

\begin{center}
\includegraphics[width=\textwidth]{components}
\captionof{figure}{decoder pipeline configuration}
\end{center}

\subsubsection{Global Config}
Global codec configuration consists of a few audio-related fields
(sample rate, channels), Vorbis version (always `0' in Vorbis I),
bitrate hints, and the lists of component instances. All other
configuration is in the context of specific components.

\subsubsection{Mode}

Each Vorbis frame is coded according to a master `mode'. A bitstream
may use one or many modes.

The mode mechanism is used to encode a frame according to one of
multiple possible methods with the intention of choosing a method best
suited to that frame. Modes are, for example, how the frame size
is changed from frame to frame. The mode number of a frame serves as a
top level configuration switch for all other specific aspects of frame
decode.

A `mode' configuration consists of a frame size setting, window type
(always 0, the Vorbis window, in Vorbis I), transform type (always
type 0, the MDCT, in Vorbis I) and a mapping number. The mapping
number specifies which mapping configuration instance to use for
low-level packet decode and synthesis.

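As a minimal sketch of the fields just listed, a mode configuration can
be modelled as a small record. The field names and the validity check
below are hypothetical Python illustrations, not identifiers from this
specification, and the frame size setting is assumed to be a flag
selecting the short or the long frame size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeConfig:
    """Hypothetical record for one mode; names are illustrative only."""
    long_block: bool     # frame size setting (assumed: short or long frame)
    window_type: int     # always 0 (the Vorbis window) in Vorbis I
    transform_type: int  # always 0 (the MDCT) in Vorbis I
    mapping: int         # index into the list of mapping instances


def is_valid_vorbis1_mode(mode, mapping_count):
    """Check the Vorbis I constraints stated in the text above."""
    return (mode.window_type == 0
            and mode.transform_type == 0
            and 0 <= mode.mapping < mapping_count)
```

The mapping number is checked against the number of mapping instances
because it is used as an index into that list.
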
\subsubsection{Mapping}

A mapping contains a channel coupling description and a list of
`submaps' that bundle sets of channel vectors together for grouped
encoding and decoding. These submaps are not references to external
components; the submap list is internal and specific to a mapping.

A `submap' is a configuration/grouping that applies to a subset of
floor and residue vectors within a mapping. The submap functions as a
last layer of indirection such that specific special floor or residue
settings can be applied not only to all the vectors in a given mode,
but also specific vectors in a specific mode. Each submap specifies
the proper floor and residue instance number to use for decoding that
submap's spectral floor and spectral residue vectors.

As an example:

Assume a Vorbis stream that contains six channels in the standard 5.1
format. The sixth channel, as is normal in 5.1, is bass only.
Therefore it would be wasteful to encode a full-spectrum version of it
as with the other channels. The submapping mechanism can be used to
apply a full range floor and residue encoding to channels 0 through 4,
and a bass-only representation to the bass channel, thus saving space.
In this example, channels 0-4 belong to submap 0 (which indicates use
of a full-range floor) and channel 5 belongs to submap 1, which uses a
bass-only representation.

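The 5.1 example can be written down as data. The following Python
sketch is only an illustration: the table layout, the names, and the
floor and residue instance numbers are assumptions made for the
example and are not taken from a real stream.

```python
# One submap index per channel: channels 0-4 use submap 0, channel 5 submap 1.
channel_to_submap = [0, 0, 0, 0, 0, 1]

# Each submap names the floor and residue instance it uses.
# The instance numbers are illustrative placeholders.
submaps = [
    {"floor": 0, "residue": 0},  # full-range configuration
    {"floor": 1, "residue": 1},  # bass-only configuration
]


def floor_and_residue_for(channel):
    """Return (floor instance, residue instance) used to decode a channel."""
    submap = submaps[channel_to_submap[channel]]
    return submap["floor"], submap["residue"]
```

Looking up channel 5 yields the bass-only pair, while any of channels
0 through 4 yields the full-range pair.
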
\subsubsection{Floor}

Vorbis encodes a spectral `floor' vector for each PCM channel. This
vector is a low-resolution representation of the audio spectrum for
the given channel in the current frame, generally used akin to a
whitening filter. It is named a `floor' because the Xiph.Org
reference encoder has historically used it as a unit-baseline for
spectral resolution.

A floor encoding may be one of two types. Floor 0 uses a packed LSP
representation on a dB amplitude scale and Bark frequency scale.
Floor 1 represents the curve as a piecewise linear interpolated
representation on a dB amplitude scale and linear frequency scale.
The two floors are semantically interchangeable in
encoding/decoding. However, floor type 1 provides more stable
inter-frame behavior, and so is the preferred choice in all
coupled-stereo and high bitrate modes. Floor 1 is also considerably
less expensive to decode than floor 0.

Floor 0 is not to be considered deprecated, but it is of limited
modern use. No known Vorbis encoder past Xiph.Org's own beta 4 makes
use of floor 0.

The values coded/decoded by a floor are both compactly formatted and
make use of entropy coding to save space. For this reason, a floor
configuration generally refers to multiple codebooks in the codebook
component list. Entropy coding is thus provided as an abstraction,
and each floor instance may choose from any and all available
codebooks when coding/decoding.

\subsubsection{Residue}
The spectral residue is the fine structure of the audio spectrum
once the floor curve has been subtracted out. In simplest terms, it
is coded in the bitstream using cascaded (multi-pass) vector
quantization according to one of three specific packing/coding
algorithms numbered 0 through 2. The packing algorithm details are
configured by residue instance. As with the floor components, the
final VQ/entropy encoding is provided by external codebook instances
and each residue instance may choose from any and all available
codebooks.

\subsubsection{Codebooks}

A codebook is a self-contained abstraction that performs entropy
decoding and, optionally, uses the entropy-decoded integer value as an
offset into an index of output value vectors, returning the indicated
vector of values.

The entropy coding in a Vorbis I codebook is provided by a standard
Huffman binary tree representation. This tree is tightly packed using
one of several methods, depending on whether codeword lengths are
ordered or unordered, or the tree is sparse.

The codebook vector index is similarly packed according to index
characteristic. Most commonly, the vector index is encoded as a
single list of possible values that are then permuted into
a list of n-dimensional rows (lattice VQ).

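The permutation of a single value list into n-dimensional rows can be
sketched as follows. This Python fragment is a simplified illustration
under stated assumptions: the function name and arguments are
hypothetical, and it omits the value unpacking and any scaling that the
codebook section of the specification defines. It treats the
entropy-decoded index as a number written in base $k$, where $k$ is the
length of the value list, and uses each digit to select one element of
the output vector.

```python
def lattice_vector(index, values, dimensions):
    """Build the output vector for one codebook entry from a shared value list.

    The index is read as a base-len(values) number, least significant
    digit first; each digit selects one element of the vector.
    """
    vector = []
    divisor = 1
    for _ in range(dimensions):
        vector.append(values[(index // divisor) % len(values)])
        divisor *= len(values)
    return vector
```

With three possible values and two dimensions, the nine indices 0
through 8 enumerate every pair of values, which is why only the short
value list needs to be stored.
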
\subsection{High-level Decode Process}

\subsubsection{Decode Setup}

Before decoding can begin, a decoder must initialize using the
bitstream headers matching the stream to be decoded. Vorbis uses
three header packets; all are required, in-order, by this
specification. Once set up, decode may begin at any audio packet
belonging to the Vorbis stream. In Vorbis I, all packets after the
three initial headers are audio packets.

The header packets are, in order, the identification
header, the comment header, and the setup header.

\paragraph{Identification Header}
The identification header identifies the bitstream as Vorbis and gives
the Vorbis version and the simple audio characteristics of the stream,
such as sample rate and number of channels.

\paragraph{Comment Header}
The comment header includes user text comments (``tags'') and a vendor
string for the application/library that produced the bitstream. The
encoding and proper use of the comment header are described in \xref{vorbis:spec:comment}.

\paragraph{Setup Header}
The setup header includes extensive CODEC setup information as well as
the complete VQ and Huffman codebooks needed for decode.

\subsubsection{Decode Procedure}

The decoding and synthesis procedure for all audio packets is
fundamentally the same.
\begin{enumerate}
\item decode packet type flag
\item decode mode number
\item decode window shape (long windows only)
\item decode floor
\item decode residue into residue vectors
\item inverse channel coupling of residue vectors
\item generate floor curve from decoded floor data
\item compute dot product of floor and residue, producing audio spectrum vector
\item inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
\item overlap/add left-hand output of transform with right-hand output of previous frame
\item store right-hand data from transform of current frame for future lapping
\item if not first frame, return results of overlap/add as audio result of current frame
\end{enumerate}

Note that clever rearrangement of the synthesis arithmetic is
possible; as an example, one can take advantage of symmetries in the
MDCT to store the right-hand transform data of a partial MDCT for a
50\% inter-frame buffer space savings, and then complete the transform
later before overlap/add with the next frame. This optimization
produces entirely equivalent output and is naturally perfectly legal.
The decoder must be \emph{entirely mathematically equivalent} to the
specification; it need not be a literal semantic implementation.

\paragraph{Packet type decode}

Vorbis I uses four packet types. The first three packet types mark each
of the three Vorbis headers described above. The fourth packet type
marks an audio packet. All other packet types are reserved; packets
marked with a reserved type should be ignored.

Following the three header packets, all packets in a Vorbis I stream
are audio. The first step of audio packet decode is to read and
verify the packet type; \emph{a non-audio packet when audio is expected
indicates stream corruption or a non-compliant stream. The decoder
must ignore the packet and not attempt decoding it to
audio}.

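A decoder front end might classify packets along the following lines.
This Python sketch rests on details that are not stated in this
introduction and should be checked against the header sections of the
specification: it assumes that header packets begin with a type byte
of 1, 3 or 5 followed by the six bytes ``vorbis'', and that an audio
packet is marked by a zero in the first bit read (the least
significant bit of the first byte). The function name and the returned
labels are hypothetical.

```python
# Assumed header type bytes; see the lead-in for the caveat.
HEADER_TYPES = {1: "identification", 3: "comment", 5: "setup"}


def classify_packet(packet):
    """Return a label for a raw packet given as a bytes object."""
    if not packet:
        return "empty"  # choice made for this sketch only
    first = packet[0]
    if first & 1 == 0:
        return "audio"  # first bit read is 0
    if first in HEADER_TYPES and packet[1:7] == b"vorbis":
        return HEADER_TYPES[first]
    return "reserved"  # to be ignored, per the text above
```

Once the three headers have been consumed, any packet for which this
function does not return the audio label would be skipped rather than
decoded.
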
\paragraph{Mode decode}
Vorbis allows an encoder to set up multiple, numbered packet `modes',
as described earlier, all of which may be used in a given Vorbis
stream. The mode is encoded as an integer used as a direct offset into
the mode instance index.

\paragraph{Window shape decode (long windows only)} \label{vorbis:spec:window}

Vorbis frames may be one of two PCM sample sizes specified during
codec setup. In Vorbis I, legal frame sizes are powers of two from 64
to 8192 samples. Aside from coupling, Vorbis handles channels as
independent vectors and these frame sizes are in samples per channel.

Vorbis uses an overlapping transform, namely the MDCT, to blend one
frame into the next, avoiding most inter-frame block boundary
artifacts. The MDCT output of one frame is windowed according to MDCT
requirements, overlapped 50\% with the output of the previous frame and
added. The window shape assures seamless reconstruction.

This is easy to visualize in the case of equal-sized windows:

\begin{center}
\includegraphics[width=\textwidth]{window1}
\captionof{figure}{overlap of two equal-sized windows}
\end{center}

And slightly more complex in the case of overlapping unequal-sized
windows:

\begin{center}
\includegraphics[width=\textwidth]{window2}
\captionof{figure}{overlap of a long and a short window}
\end{center}

In the unequal-sized window case, the window shape of the long window
must be modified for seamless lapping as above. It is possible to
correctly infer the window shape to be applied to the current window from
knowing the sizes of the current, previous and next window. It is
legal for a decoder to use this method. However, in the case of a long
window (short windows require no modification), Vorbis also codes two
flag bits to specify pre- and post-window shape. Although not
strictly necessary for function, this minor redundancy allows a packet
to be fully decoded to the point of lapping entirely independently of
any other packet, allowing easier abstraction of decode layers as well
as allowing a greater level of easy parallelism in encode and
decode.

A description of valid window functions for use with an inverse MDCT
can be found in \cite{Sporer/Brandenburg/Edler}. Vorbis windows
all use the slope function
\[ y = \sin\left(\frac{\pi}{2} \, \sin^2\left(\frac{(x + 0.5)\,\pi}{n}\right)\right) . \]

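The slope function can be evaluated directly. The Python sketch below
assumes that $n$ is the full length of an unmodified window and that
$x$ runs from 0 to $n-1$, which yields both the rising and the falling
slope; the function name is hypothetical.

```python
import math


def vorbis_window(n):
    """Evaluate the slope function above at x = 0 .. n-1."""
    return [
        math.sin(0.5 * math.pi * math.sin((x + 0.5) / n * math.pi) ** 2)
        for x in range(n)
    ]
```

Under that assumption the window is symmetric, and the squares of any
two samples half a window apart sum to one, which is the property that
lets two frames overlapped by 50\% reconstruct seamlessly.
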
\paragraph{floor decode}
Each floor is encoded/decoded in channel order; however, each floor
belongs to a `submap' that specifies which floor configuration to
use. All floors are decoded before residue decode begins.


\paragraph{residue decode}

Although the number of residue vectors equals the number of channels,
channel coupling may mean that the raw residue vectors extracted
during decode do not map directly to specific channels. When channel
coupling is in use, some vectors will correspond to coupled magnitude
or angle. The coupling relationships are described in the codec setup
and may differ from frame to frame, due to different mode numbers.

Vorbis codes residue vectors in groups by submap; the coding is done
in submap order from submap 0 through n-1. This differs from floors,
which are coded using a configuration provided by submap number, but
are coded individually in channel order.

\paragraph{inverse channel coupling}

A detailed discussion of stereo in the Vorbis codec can be found in
the document \href{stereo.html}{Stereo Channel Coupling in the
Vorbis CODEC}. Vorbis is not limited to only stereo coupling, but
the stereo document also gives a good overview of the generic coupling
mechanism.

Vorbis coupling applies to pairs of residue vectors at a time;
decoupling is done in-place a pair at a time in the order and using
the vectors specified in the current mapping configuration. The
decoupling operation is the same for all pairs, converting square
polar representation (where one vector is magnitude and the second
angle) back to Cartesian representation.

After each pair of vectors on the coupling list has been decoupled, in
order, the resulting residue vectors represent the fine spectral detail
of each output channel.

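The in-place conversion of one pair can be sketched in Python as
follows. The branch structure is an assumption based on the square
polar rule given in the detailed decode section of the specification
and should be verified against that section; the function name is
hypothetical.

```python
def decouple(magnitude, angle):
    """Convert one coupled pair of residue vectors in place.

    Both arguments are lists of equal length; on return they hold the
    two decoupled residue vectors.
    """
    for i, (m, a) in enumerate(zip(magnitude, angle)):
        if m > 0:
            if a > 0:
                new_m, new_a = m, m - a
            else:
                new_m, new_a = m + a, m
        else:
            if a > 0:
                new_m, new_a = m, m + a
            else:
                new_m, new_a = m - a, m
        magnitude[i], angle[i] = new_m, new_a
```

Because only additions and subtractions are involved, the operation is
exact for integer residue values.
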
\paragraph{generate floor curve}

The decoder may choose to generate the floor curve at any appropriate
time. It is reasonable to generate the output curve when the floor
data is decoded from the raw packet, or it can be generated after
inverse coupling and applied to the spectral residue directly,
combining generation and the dot product into one step and eliminating
some working space.

Both floor 0 and floor 1 generate a linear-range, linear-domain output
vector to be multiplied (dot product) by the linear-range,
linear-domain spectral residue.



\paragraph{compute floor/residue dot product}

This step is straightforward; for each output channel, the decoder
multiplies the floor curve and residue vectors element by element,
producing the finished audio spectrum of each channel.

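A minimal Python sketch of this step for one channel is shown below.
The function name is hypothetical, and the use of Python floating
point values is an assumption of the sketch that sidesteps the
fixed-point range question.

```python
def apply_floor(floor_curve, residue):
    """Multiply a floor curve and a residue vector element by element."""
    if len(floor_curve) != len(residue):
        raise ValueError("floor and residue must have the same length")
    return [f * r for f, r in zip(floor_curve, residue)]
```

The result has the same length as its inputs and is the spectrum
handed to the inverse transform.
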
% TODO/FIXME: The following two paragraphs have identical twins
%             in section 4 (under "dot product")
One point is worth mentioning about this dot product; a common mistake
in a fixed point implementation might be to assume that a 32 bit
fixed-point representation for floor and residue and direct
multiplication of the vectors is sufficient for acceptable spectral
depth in all cases because it happens to mostly work with the current
Xiph.Org reference encoder.

However, floor vector values can span \~{}140dB (\~{}24 bits unsigned), and
the audio spectrum vector should represent a minimum of 120dB (\~{}21
bits with sign), even when output is to a 16 bit PCM device. For the
residue vector to represent full scale if the floor is nailed to
$-140$dB, it must be able to span 0 to $+140$dB. For the residue vector
to reach full scale if the floor is nailed at 0dB, it must be able to
represent $-140$dB to $+0$dB. Thus, in order to handle full range
dynamics, a residue vector may span $-140$dB to $+140$dB entirely within
spec. A 280dB range is approximately 48 bits with sign; thus the
residue vector must be able to represent a 48 bit range and the dot
product must be able to handle an effective 48 bit times 24 bit
multiplication. This range may be achieved using large (64 bit or
larger) integers, or implementing a movable binary point
representation.

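As a rough check of the bit counts quoted above, note that each bit of
linear range corresponds to $20\log_{10}2 \approx 6.02$dB. Rounding up
to whole bits, and adding one bit where a sign is needed:
\[ 140 / 6.02 \approx 23.3 \;\rightarrow\; 24 \mbox{ bits unsigned} \]
\[ 120 / 6.02 \approx 19.9 \;\rightarrow\; 20 \mbox{ bits} + 1 \mbox{ sign bit} = 21 \mbox{ bits} \]
\[ 280 / 6.02 \approx 46.5 \;\rightarrow\; 47 \mbox{ bits} + 1 \mbox{ sign bit} = 48 \mbox{ bits} \]
The product of a 48 bit value and a 24 bit value can occupy up to 72
bits, which is why a plain 32 bit multiply is not sufficient in the
general case.
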
\paragraph{inverse monolithic transform (MDCT)}

The audio spectrum is converted back into time domain PCM audio via an
inverse Modified Discrete Cosine Transform (MDCT). A detailed
description of the MDCT is available in \cite{Sporer/Brandenburg/Edler}.

Note that the PCM produced directly from the MDCT is not yet finished
audio; it must be lapped with surrounding frames using an appropriate
window (such as the Vorbis window) before the MDCT can be considered
orthogonal.

\paragraph{overlap/add data}
Windowed MDCT output is overlapped and added with the right-hand data
of the previous window such that the 3/4 point of the previous window
is aligned with the 1/4 point of the current window (as illustrated in
the window overlap diagram). At this point, the audio data between the
center of the previous frame and the center of the current frame is
now finished and ready to be returned.


\paragraph{cache right hand data}
The decoder must cache the right-hand portion of the current frame to
be lapped with the left-hand portion of the next frame.

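For the simple case in which every frame has the same size, the
overlap/add and caching steps can be sketched in Python as follows.
The class and method names are hypothetical, the input is assumed to
be a frame of already windowed inverse MDCT output, and the unequal
window case is deliberately not handled.

```python
class Lapper:
    """Overlap/add for equal-sized frames only (illustrative sketch)."""

    def __init__(self):
        self.saved_right = None  # right-hand half of the previous frame

    def push(self, windowed_frame):
        """Lap one frame; return finished samples, or None for the first frame."""
        half = len(windowed_frame) // 2
        left, right = windowed_frame[:half], windowed_frame[half:]
        if self.saved_right is None:
            finished = None  # the first frame only primes the decoder
        else:
            finished = [p + c for p, c in zip(self.saved_right, left)]
        self.saved_right = right  # cache for lapping with the next frame
        return finished
```

Each call after the first returns half a frame of finished audio,
spanning from the center of the previous frame to the center of the
current one.
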
\paragraph{return finished audio data}

The overlapped portion produced from overlapping the previous and
current frame data is finished data to be returned by the decoder.
This data spans from the center of the previous window to the center
of the current window. In the case of same-sized windows, the amount
of data to return is one-half block consisting only of the
overlapped portions. When overlapping a short and long window, much of
the returned range is not actually overlap. This does not damage
transform orthogonality. Pay attention, however, to returning the
correct data range; the amount of data to be returned is:

\begin{Verbatim}[commandchars=\\\{\}]
window\_blocksize(previous\_window)/4+window\_blocksize(current\_window)/4
\end{Verbatim}

from the center of the previous window to the center of the current
window.

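The expression above translates directly into code. In this Python
sketch the function name is hypothetical, and integer division is used
on the assumption that block sizes are the powers of two from 64 to
8192 stated earlier, so the division by four is exact.

```python
def samples_returned(previous_blocksize, current_blocksize):
    """Number of finished samples per channel for one decoded frame."""
    return previous_blocksize // 4 + current_blocksize // 4
```

For two 2048-sample windows this gives 1024 samples, one-half block as
stated above; for a 256-sample window followed by a 2048-sample window
it gives 576 samples.
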
Data is not returned from the first frame; it must be used to `prime'
the decode engine. The encoder accounts for this priming when
calculating PCM offsets; after the first frame, the proper PCM output
offset is `0' (as no data has been returned yet).