annotate src/libvorbis-1.3.3/doc/stereo.html @ 83:ae30d91d2ffe

Replace these with versions built using an older toolset (so as to avoid ABI compatibilities when linking on Ubuntu 14.04 for packaging purposes)
author Chris Cannam
date Fri, 07 Feb 2020 11:51:13 +0000
parents 05aa0afa9217
children
rev   line source
Chris@1 1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Chris@1 2 <html>
Chris@1 3 <head>
Chris@1 4
Chris@1 5 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
Chris@1 6 <title>Ogg Vorbis Documentation</title>
Chris@1 7
Chris@1 8 <style type="text/css">
Chris@1 9 body {
Chris@1 10 margin: 0 18px 0 18px;
Chris@1 11 padding-bottom: 30px;
Chris@1 12 font-family: Verdana, Arial, Helvetica, sans-serif;
Chris@1 13 color: #333333;
Chris@1 14 font-size: .8em;
Chris@1 15 }
Chris@1 16
Chris@1 17 a {
Chris@1 18 color: #3366cc;
Chris@1 19 }
Chris@1 20
Chris@1 21 img {
Chris@1 22 border: 0;
Chris@1 23 }
Chris@1 24
Chris@1 25 #xiphlogo {
Chris@1 26 margin: 30px 0 16px 0;
Chris@1 27 }
Chris@1 28
Chris@1 29 #content p {
Chris@1 30 line-height: 1.4;
Chris@1 31 }
Chris@1 32
Chris@1 33 h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
Chris@1 34 font-weight: bold;
Chris@1 35 color: #ff9900;
Chris@1 36 margin: 1.3em 0 8px 0;
Chris@1 37 }
Chris@1 38
Chris@1 39 h1 {
Chris@1 40 font-size: 1.3em;
Chris@1 41 }
Chris@1 42
Chris@1 43 h2 {
Chris@1 44 font-size: 1.2em;
Chris@1 45 }
Chris@1 46
Chris@1 47 h3 {
Chris@1 48 font-size: 1.1em;
Chris@1 49 }
Chris@1 50
Chris@1 51 li {
Chris@1 52 line-height: 1.4;
Chris@1 53 }
Chris@1 54
Chris@1 55 #copyright {
Chris@1 56 margin-top: 30px;
Chris@1 57 line-height: 1.5em;
Chris@1 58 text-align: center;
Chris@1 59 font-size: .8em;
Chris@1 60 color: #888888;
Chris@1 61 clear: both;
Chris@1 62 }
Chris@1 63 </style>
Chris@1 64
Chris@1 65 </head>
Chris@1 66
Chris@1 67 <body>
Chris@1 68
Chris@1 69 <div id="xiphlogo">
Chris@1 70 <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.Org"/></a>
Chris@1 71 </div>
Chris@1 72
Chris@1 73 <h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>
Chris@1 74
Chris@1 75 <h2>Abstract</h2>
Chris@1 76
Chris@1 77 <p>The Vorbis audio CODEC provides channel coupling
Chris@1 78 mechanisms designed to reduce effective bitrate by both eliminating
Chris@1 79 interchannel redundancy and eliminating stereo image information
Chris@1 80 labeled inaudible or undesirable according to spatial psychoacoustic
Chris@1 81 models. This document describes the mechanical coupling
Chris@1 82 mechanisms available within the Vorbis specification, as well as the
Chris@1 83 specific stereo coupling models used by the reference
Chris@1 84 <tt>libvorbis</tt> codec provided by xiph.org.</p>
Chris@1 85
Chris@1 86 <h2>Mechanisms</h2>
Chris@1 87
Chris@1 88 <p>In encoder release beta 4 and earlier, Vorbis supported multiple
Chris@1 89 channel encoding, but the channels were encoded entirely separately
Chris@1 90 with no cross-analysis or redundancy elimination between channels.
Chris@1 91 This multichannel strategy is very similar to mp3's <em>dual
Chris@1 92 stereo</em> mode and Vorbis uses the same name for its analogous
Chris@1 93 uncoupled multichannel modes.</p>
Chris@1 94
Chris@1 95 <p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
Chris@1 96 later implement, a coupled channel strategy. Vorbis has two specific
Chris@1 97 mechanisms that may be used alone or in conjunction to implement
Chris@1 98 channel coupling. The first is <em>channel interleaving</em> via
Chris@1 99 residue backend type 2, and the second is <em>square polar
Chris@1 100 mapping</em>. These two general mechanisms are particularly well
Chris@1 101 suited to coupling due to the structure of Vorbis encoding, as we'll
Chris@1 102 explore below, and using both we can implement totally
Chris@1 103 <em>lossless stereo image coupling</em> [bit-for-bit decode-identical
Chris@1 104 to uncoupled modes], as well as various lossy models that seek to
Chris@1 105 eliminate inaudible or unimportant aspects of the stereo image in
Chris@1 106 order to reduce bitrate. The exact coupling implementation is
Chris@1 107 generalized to allow the encoder a great deal of flexibility in
Chris@1 108 implementation of a stereo or surround model without requiring any
Chris@1 109 significant complexity increase over the combinatorially simpler
Chris@1 110 mid/side joint stereo of mp3 and other current audio codecs.</p>
Chris@1 111
Chris@1 112 <p>A particular Vorbis bitstream may apply channel coupling directly to
Chris@1 113 more than a pair of channels; polar mapping is hierarchical such that
Chris@1 114 polar coupling may be extrapolated to an arbitrary number of channels
Chris@1 115 and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
Chris@1 116 surround. However, the scope of this document restricts itself to the
Chris@1 117 stereo coupling case.</p>
Chris@1 118
Chris@1 119 <a name="sqpm"></a>
Chris@1 120 <h3>Square Polar Mapping</h3>
Chris@1 121
Chris@1 122 <h4>maximal correlation</h4>
Chris@1 123
Chris@1 124 <p>Recall that the basic structure of a Vorbis I stream first generates
Chris@1 125 from input audio a spectral 'floor' function that serves as an
Chris@1 126 MDCT-domain whitening filter. This floor is meant to represent the
Chris@1 127 rough envelope of the frequency spectrum, using whatever metric the
Chris@1 128 encoder cares to define. This floor is subtracted from the log
Chris@1 129 frequency spectrum, effectively normalizing the spectrum by frequency.
Chris@1 130 Each input channel is associated with a unique floor function.</p>
Chris@1 131
Chris@1 132 <p>The basic idea behind any stereo coupling is that the left and right
Chris@1 133 channels usually correlate. This correlation is even stronger if one
Chris@1 134 first accounts for energy differences in any given frequency band
Chris@1 135 across left and right; think for example of individual instruments
Chris@1 136 mixed into different portions of the stereo image, or a stereo
Chris@1 137 recording with a dominant feature not perfectly in the center. The
Chris@1 138 floor functions, each specific to a channel, provide the perfect means
Chris@1 139 of normalizing left and right energies across the spectrum to maximize
Chris@1 140 correlation before coupling. This feature of the Vorbis format is not
Chris@1 141 a convenient accident.</p>
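<p>As a rough illustration of this normalization (a toy sketch, not
libvorbis code; the spectra, the floor values and the helper name
<tt>residue_db</tt> are invented for the example), subtracting each
channel's own floor in the log domain leaves matching residues even
though the right channel is 6 dB quieter than the left:</p>

```python
# Toy log-magnitude spectra in dB for four frequency bands (made-up values).
left_db = [60.0, 48.0, 42.0, 30.0]
right_db = [54.0, 42.0, 36.0, 24.0]   # same shape, 6 dB quieter

# Hypothetical per-channel floors: a rough envelope of each spectrum.
left_floor_db = [58.0, 50.0, 40.0, 32.0]
right_floor_db = [52.0, 44.0, 34.0, 26.0]


def residue_db(spectrum_db, floor_db):
    """Subtract the floor from the log spectrum, band by band."""
    return [s - f for s, f in zip(spectrum_db, floor_db)]


left_residue = residue_db(left_db, left_floor_db)
right_residue = residue_db(right_db, right_floor_db)

# The 6 dB level difference is absorbed by the floors, so the residues match.
print(left_residue)
print(right_residue)
```

<p>Both lines print the same residue vector; the level difference lives
entirely in the two floors.</p>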
Chris@1 142
Chris@1 143 <p>Because we strive to maximally correlate the left and right channels
Chris@1 144 and generally succeed in doing so, left and right residue is typically
Chris@1 145 nearly identical. We could use channel interleaving (discussed below)
Chris@1 146 alone to efficiently remove the redundancy between the left and right
Chris@1 147 channels as a side effect of entropy encoding, but a polar
Chris@1 148 representation gives benefits when left/right correlation is
Chris@1 149 strong.</p>
Chris@1 150
Chris@1 151 <h4>point and diffuse imaging</h4>
Chris@1 152
Chris@1 153 <p>The first advantage of a polar representation is that it effectively
Chris@1 154 separates the spatial audio information into a 'point image'
Chris@1 155 (magnitude) at a given frequency and located somewhere in the sound
Chris@1 156 field, and a 'diffuse image' (angle) that fills a large amount of
Chris@1 157 space simultaneously. Even if we preserve only the magnitude (point)
Chris@1 158 data, a detailed and carefully chosen floor function in each channel
Chris@1 159 provides us with a free, fine-grained, frequency relative intensity
Chris@1 160 stereo*. Angle information represents diffuse sound fields, such as
Chris@1 161 reverberation that fills the entire space simultaneously.</p>
Chris@1 162
Chris@1 163 <p>*<em>Because the Vorbis model supports a number of different possible
Chris@1 164 stereo models and these models may be mixed, we do not use the term
Chris@1 165 'intensity stereo' when talking about Vorbis; instead we use the terms
Chris@1 166 'point stereo', 'phase stereo' and subcategories of each.</em></p>
Chris@1 167
Chris@1 168 <p>The majority of a stereo image is representable by polar magnitude
Chris@1 169 alone, as strong sounds tend to be produced at near-point sources;
Chris@1 170 even non-diffuse, fast, sharp echoes track very accurately using
Chris@1 171 magnitude representation almost alone (for those experimenting with
Chris@1 172 Vorbis tuning, this strategy works much better with the precise,
Chris@1 173 piecewise control of floor 1; the continuous approximation of floor 0
Chris@1 174 results in unstable imaging). Reverberation and diffuse sounds tend
Chris@1 175 to contain less energy and be psychoacoustically dominated by the
Chris@1 176 point sources embedded in them. Thus, we again tend to concentrate
Chris@1 177 more represented energy into a predictably smaller number of values.
Chris@1 178 Separating representation of point and diffuse imaging also allows us
Chris@1 179 to model and manipulate point and diffuse qualities separately.</p>
Chris@1 180
Chris@1 181 <h4>controlling bit leakage and symbol crosstalk</h4>
Chris@1 182
Chris@1 183 <p>Because polar
Chris@1 184 representation concentrates represented energy into fewer large
Chris@1 185 values, we reduce bit 'leakage' during cascading (multistage VQ
Chris@1 186 encoding) as a secondary benefit. A single large, monolithic VQ
Chris@1 187 codebook is more efficient than a cascaded book due to entropy
Chris@1 188 'crosstalk' among symbols between different stages of a multistage cascade.
Chris@1 189 Polar representation is a way of further concentrating entropy into
Chris@1 190 predictable locations so that codebook design can take steps to
Chris@1 191 improve multistage codebook efficiency. It also allows us to cascade
Chris@1 192 various elements of the stereo image independently.</p>
Chris@1 193
Chris@1 194 <h4>eliminating trigonometry and rounding</h4>
Chris@1 195
Chris@1 196 <p>Rounding and computational complexity are potential problems with a
Chris@1 197 polar representation. As our encoding process involves quantization,
Chris@1 198 mixing a polar representation and quantization makes it potentially
Chris@1 199 impossible, depending on implementation, to construct a coupled stereo
Chris@1 200 mechanism that results in bit-identical decompressed output compared
Chris@1 201 to an uncoupled encoding should the encoder desire it.</p>
Chris@1 202
Chris@1 203 <p>Vorbis uses a mapping that preserves the most useful qualities of
Chris@1 204 polar representation, relies only on addition/subtraction (during
Chris@1 205 decode; high quality encoding still requires some trig), and makes it
Chris@1 206 trivial before or after quantization to represent an angle/magnitude
Chris@1 207 through a one-to-one mapping from possible left/right value
Chris@1 208 permutations. We do this by basing our polar representation on the
Chris@1 209 unit square rather than the unit circle.</p>
Chris@1 210
Chris@1 211 <p>Given a magnitude and angle, we recover left and right using the
Chris@1 212 following function (note that A/B may be left/right or right/left
Chris@1 213 depending on the coupling definition used by the encoder):</p>
Chris@1 214
Chris@1 215 <pre>
Chris@1 216 if(magnitude>0){           /* positive magnitude */
Chris@1 217   if(angle>0){
Chris@1 218     A=magnitude;
Chris@1 219     B=magnitude-angle;
Chris@1 220   }else{
Chris@1 221     B=magnitude;
Chris@1 222     A=magnitude+angle;
Chris@1 223   }
Chris@1 224 }else{                     /* zero or negative magnitude: signs of angle flip */
Chris@1 225   if(angle>0){
Chris@1 226     A=magnitude;
Chris@1 227     B=magnitude+angle;
Chris@1 228   }else{
Chris@1 229     B=magnitude;
Chris@1 230     A=magnitude-angle;
Chris@1 231   }
Chris@1 232 }
Chris@1 233 </pre>
Chris@1 234
Chris@1 235 <p>The function is antisymmetric for positive and negative magnitudes in
Chris@1 236 order to eliminate a redundant value when quantizing. For example, if
Chris@1 237 we're quantizing to integer values, we can visualize a magnitude of 5
Chris@1 238 and an angle of -2 as follows:</p>
Chris@1 239
Chris@1 240 <p><img src="squarepolar.png" alt="square polar"/></p>
Chris@1 241
Chris@1 242 <p>This representation loses or replicates no values; if A
Chris@1 243 and B each range over the integers -5 through 5, the number of possible Cartesian
Chris@1 244 permutations is 121. Represented in square polar notation, the
Chris@1 245 possible values are:</p>
Chris@1 246
Chris@1 247 <pre>
Chris@1 248 0, 0
Chris@1 249
Chris@1 250 -1,-2 -1,-1 -1, 0 -1, 1
Chris@1 251
Chris@1 252 1,-2 1,-1 1, 0 1, 1
Chris@1 253
Chris@1 254 -2,-4 -2,-3 -2,-2 -2,-1 -2, 0 -2, 1 -2, 2 -2, 3
Chris@1 255
Chris@1 256 2,-4 2,-3 ... following the pattern ...
Chris@1 257
Chris@1 258 ... 5, 1 5, 2 5, 3 5, 4 5, 5 5, 6 5, 7 5, 8 5, 9
Chris@1 259
Chris@1 260 </pre>
Chris@1 261
Chris@1 262 <p>...for a grand total of 121 possible values, the same number as in
Chris@1 263 Cartesian representation (note that, for example, <tt>5,-10</tt> is
Chris@1 264 the same as <tt>-5,10</tt>, so there's no reason to represent
Chris@1 265 both. 2,10 cannot happen, and there's no reason to account for it.)
Chris@1 266 It's also obvious that this mapping is exactly reversible.</p>
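<p>The count and the reversibility can be checked mechanically. The
sketch below transcribes the decode function above into Python and pairs
it with an inverse that we derived by hand from that function;
<tt>square_polar_encode</tt> is our own illustrative name and is not
taken from libvorbis, whose encoder also has to deal with quantized
floating point values.</p>

```python
def square_polar_decode(magnitude, angle):
    """Recover (A, B) from magnitude/angle, following the function above."""
    if magnitude > 0:
        if angle > 0:
            return magnitude, magnitude - angle
        return magnitude + angle, magnitude
    if angle > 0:
        return magnitude, magnitude + angle
    return magnitude - angle, magnitude


def square_polar_encode(a, b):
    """Assumed inverse of the decode function (not taken from libvorbis)."""
    if abs(a) > abs(b):
        # A dominates: it becomes the magnitude and the angle is positive.
        return a, (a - b if a > 0 else b - a)
    # B dominates or ties: it becomes the magnitude and the angle is <= 0.
    return b, (a - b if b > 0 else b - a)


values = range(-5, 6)
codes = {(a, b): square_polar_encode(a, b) for a in values for b in values}

# Every Cartesian pair gets its own code, and decoding restores the pair.
print(len(set(codes.values())))
print(all(square_polar_decode(m, ang) == ab for ab, (m, ang) in codes.items()))
print(square_polar_decode(5, -2))
```

<p>This prints 121, then True, then the pair (3, 5) for the magnitude 5,
angle -2 example pictured above.</p>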
Chris@1 267
Chris@1 268 <h3>Channel interleaving</h3>
Chris@1 269
Chris@1 270 <p>We can remap an A/B vector using polar mapping into a magnitude/angle
Chris@1 271 vector, and it's clear that, in general, this concentrates energy in
Chris@1 272 the magnitude vector and reduces the amount of information to encode
Chris@1 273 in the angle vector. Encoding these vectors independently with
Chris@1 274 residue backend #0 or residue backend #1 will result in bitrate
Chris@1 275 savings. However, there are still implicit correlations between the
Chris@1 276 magnitude and angle vectors. The most obvious is that the amplitude
Chris@1 277 of the angle is bounded by twice its corresponding magnitude value.</p>
Chris@1 278
Chris@1 279 <p>Entropy coding the results, then, further benefits from the entropy
Chris@1 280 model being able to compress magnitude and angle simultaneously. For
Chris@1 281 this reason, Vorbis implements residue backend #2 which pre-interleaves
Chris@1 282 a number of input vectors (in the stereo case, two, A and B) into a
Chris@1 283 single output vector (with the elements in the order of
Chris@1 284 A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
Chris@1 285 each vector to be coded by the vector quantization backend consists of
Chris@1 286 matching magnitude and angle values.</p>
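<p>The interleaving step by itself is simple. A minimal sketch follows,
covering only the reordering and none of the partitioning or codebook
machinery of the residue backend; the function names and sample values
are ours.</p>

```python
def interleave(vectors):
    """Merge equal-length vectors element by element: A_0, B_0, A_1, B_1, ..."""
    return [value for group in zip(*vectors) for value in group]


def deinterleave(stream, channels):
    """Split an interleaved stream back into its original vectors."""
    return [stream[i::channels] for i in range(channels)]


magnitude = [5, -3, 2, 0]
angle = [-2, 1, 0, 0]

stream = interleave([magnitude, angle])
print(stream)
print(deinterleave(stream, 2))
```

<p>Each magnitude ends up next to its own angle in the interleaved
vector, so a codebook that spans adjacent elements sees the two values
together.</p>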
Chris@1 287
Chris@1 288 <p>The astute reader, at this point, will notice that in the theoretical
Chris@1 289 case in which we can use monolithic codebooks of arbitrarily large
Chris@1 290 size, we can directly interleave and encode left and right without
Chris@1 291 polar mapping; in fact, the polar mapping does not appear to lend any
Chris@1 292 benefit whatsoever to the efficiency of the entropy coding. Indeed,
Chris@1 293 it is perfectly possible and reasonable to build a Vorbis encoder that
Chris@1 294 dispenses with polar mapping entirely and merely interleaves the
Chris@1 295 channels. Libvorbis-based encoders may configure such an encoding and
Chris@1 296 it will work as intended.</p>
Chris@1 297
Chris@1 298 <p>However, when we leave the ideal/theoretical domain, we notice that
Chris@1 299 polar mapping does give additional practical benefits, as discussed in
Chris@1 300 the above section on polar mapping and summarized again here:</p>
Chris@1 301
Chris@1 302 <ul>
Chris@1 303 <li>Polar mapping aids in controlling entropy 'leakage' between stages
Chris@1 304 of a cascaded codebook.</li>
Chris@1 305 <li>Polar mapping separates the stereo image
Chris@1 306 into point and diffuse components which may be analyzed and handled
Chris@1 307 differently.</li>
Chris@1 308 </ul>
Chris@1 309
Chris@1 310 <h2>Stereo Models</h2>
Chris@1 311
Chris@1 312 <h3>Dual Stereo</h3>
Chris@1 313
Chris@1 314 <p>Dual stereo refers to stereo encoding where the channels are entirely
Chris@1 315 separate; they are analyzed and encoded as entirely distinct entities.
Chris@1 316 This terminology is familiar from mp3.</p>
Chris@1 317
Chris@1 318 <h3>Lossless Stereo</h3>
Chris@1 319
Chris@1 320 <p>Using polar mapping and/or channel interleaving, it's possible to
Chris@1 321 couple Vorbis channels losslessly, that is, construct a stereo
Chris@1 322 coupling encoding that both saves space and decodes
Chris@1 323 bit-identically to dual stereo. OggEnc 1.0 and later use this
Chris@1 324 mode in all high-bitrate encoding.</p>
Chris@1 325
Chris@1 326 <p>Overall, this stereo mode is overkill; however, it offers a safe
Chris@1 327 alternative to users concerned about the slightest possible
Chris@1 328 degradation to the stereo image or archival quality audio.</p>
Chris@1 329
Chris@1 330 <h3>Phase Stereo</h3>
Chris@1 331
Chris@1 332 <p>Phase stereo is the least aggressive means of gracefully dropping
Chris@1 333 resolution from the stereo image; it affects only diffuse imaging.</p>
Chris@1 334
Chris@1 335 <p>It's often quoted that the human ear is deaf to signal phase above
Chris@1 336 about 4kHz; this is nearly true and a passable rule of thumb, but it
Chris@1 337 can be demonstrated that even an average user can tell the difference
Chris@1 338 between high frequency in-phase and out-of-phase noise. Obviously
Chris@1 339 then, the statement is not entirely true. However, it's also the case
Chris@1 340 that one must resort to a demonstration nearly this extreme before
Chris@1 341 finding the counterexample.</p>
Chris@1 342
Chris@1 343 <p>'Phase stereo' is simply a more aggressive quantization of the polar
Chris@1 344 angle vector; above 4kHz it's generally quite safe to quantize noise
Chris@1 345 and noisy elements to only a handful of allowed phases, or to thin the
Chris@1 346 phase with respect to the magnitude. The phases of high amplitude
Chris@1 347 pure tones may or may not be preserved more carefully (they are
Chris@1 348 relatively rare and L/R tend to be in phase, so there is generally
Chris@1 349 little reason not to spend a few more bits on them).</p>
Chris@1 350
Chris@1 351 <h4>example: eight phase stereo</h4>
Chris@1 352
Chris@1 353 <p>Vorbis may implement phase stereo coupling by preserving the entirety
Chris@1 354 of the magnitude vector (essential to fine amplitude and energy
Chris@1 355 resolution overall) and quantizing the angle vector to one of only
Chris@1 356 four possible values. Given that the magnitude vector may be positive
Chris@1 357 or negative, this results in left and right phase having eight
Chris@1 358 possible permutations, thus 'eight phase stereo':</p>
Chris@1 359
Chris@1 360 <p><img src="eightphase.png" alt="eight phase"/></p>
Chris@1 361
Chris@1 362 <p>Left and right may be in phase (positive or negative), the most common
Chris@1 363 case by far, or out of phase by 90 or 180 degrees.</p>
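<p>One way to picture this in square polar terms is sketched below. It
rests on an assumption that the text does not spell out: that the four
allowed angles are 0, +|m|, -|m| and -2|m| for a magnitude m, which
correspond to the corners and edge midpoints of the square. The
nearest-value rounding rule and the name <tt>quantize_eight_phase</tt>
are ours, not libvorbis's.</p>

```python
def square_polar_decode(magnitude, angle):
    """Square polar decode, transcribed from the function given earlier."""
    if magnitude > 0:
        if angle > 0:
            return magnitude, magnitude - angle
        return magnitude + angle, magnitude
    if angle > 0:
        return magnitude, magnitude + angle
    return magnitude - angle, magnitude


def quantize_eight_phase(magnitude, angle):
    """Snap a square polar angle to one of four assumed values.

    Assumption (not from the text): the allowed angles are 0, +|m|,
    -|m| and -2|m|, i.e. in phase, 90 degrees either way, and 180
    degrees out of phase.
    """
    size = abs(magnitude)
    if 2 * angle > 3 * size:
        # Closest to +2|m|, which is the same point as -2|m| with the
        # sign of the magnitude flipped.
        return -magnitude, -2 * size
    allowed = [-2 * size, -size, 0, size]
    return magnitude, min(allowed, key=lambda q: abs(q - angle))


# Collect every left/right pair reachable with |magnitude| = 4.
points = set()
for magnitude in (4, -4):
    for angle in range(-8, 8):
        points.add(square_polar_decode(*quantize_eight_phase(magnitude, angle)))

print(sorted(points))
```

<p>Under these assumptions exactly eight left/right pairs remain: two in
phase, two 180 degrees out of phase, and four with one channel at
zero.</p>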
Chris@1 364
Chris@1 365 <h4>example: four phase stereo</h4>
Chris@1 366
Chris@1 367 <p>Similarly, four phase stereo takes the quantization one step further;
Chris@1 368 it allows only in-phase and 180 degree out-of-phase signals:</p>
Chris@1 369
Chris@1 370 <p><img src="fourphase.png" alt="four phase"/></p>
Chris@1 371
Chris@1 372 <h3>Point Stereo</h3>
Chris@1 373
Chris@1 374 <p>Point stereo eliminates the possibility of out-of-phase signal
Chris@1 375 entirely. Any diffuse quality to a sound source tends to collapse
Chris@1 376 inward to a point somewhere within the stereo image. A practical
Chris@1 377 example would be balanced reverberations within a large, live space;
Chris@1 378 normally the sound is diffuse and soft, giving a sonic impression of
Chris@1 379 volume. In point-stereo, the reverberations would still exist, but
Chris@1 380 sound fairly firmly centered within the image (assuming the
Chris@1 381 reverberation was centered overall; if the reverberation is stronger
Chris@1 382 to the left, then the point of localization in point stereo would be
Chris@1 383 to the left). This effect is most noticeable at low and mid
Chris@1 384 frequencies and using headphones (which grant perfect stereo
Chris@1 385 separation). Point stereo is a graceful but generally easy to
Chris@1 386 detect degradation to the sound quality and is thus used in frequency
Chris@1 387 ranges where it is least noticeable.</p>
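<p>A minimal sketch of the decode side of point stereo follows, assuming
the angle is simply forced to zero and that each floor acts as a linear
scale factor on the residue at that frequency. The function name and the
numbers are illustrative only.</p>

```python
def point_stereo_decode(magnitude, left_floor, right_floor):
    """Rebuild one spectral line per channel from magnitude alone.

    With the angle forced to zero, square polar decoding yields the same
    residue value for both channels (A = B = magnitude); any left/right
    level difference comes only from the per-channel floors.  Treating
    the floors as linear scale factors is an assumption of this sketch.
    """
    return magnitude * left_floor, magnitude * right_floor


left, right = point_stereo_decode(0.5, left_floor=8.0, right_floor=2.0)
print(left, right)
```

<p>The two outputs always share a sign, so the result is in phase and
localizes to a single point whose position is set by the ratio of the
floors.</p>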
Chris@1 388
Chris@1 389 <h3>Mixed Stereo</h3>
Chris@1 390
Chris@1 391 <p>Mixed stereo is the simultaneous use of more than one of the above
Chris@1 392 stereo encoding models, generally using more aggressive modes in
Chris@1 393 higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>
Chris@1 394
Chris@1 395 <p>It is also the case that near-DC frequencies should be encoded using
Chris@1 396 lossless coupling to avoid frame blocking artifacts.</p>
Chris@1 397
Chris@1 398 <h3>Vorbis Stereo Modes</h3>
Chris@1 399
Chris@1 400 <p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
Chris@1 401 constructed out of lossless and point stereo. Phase stereo was used
Chris@1 402 in the rc2 encoder, but is not currently used for simplicity's sake. It
Chris@1 403 will likely be re-added to the stereo model in the future.</p>
Chris@1 404
Chris@1 405 <div id="copyright">
Chris@1 406 The Xiph Fish Logo is a
Chris@1 407 trademark (&trade;) of Xiph.Org.<br/>
Chris@1 408
Chris@1 409 These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
Chris@1 410 </div>
Chris@1 411
Chris@1 412 </body>
Chris@1 413 </html>