annotate src/libvorbis-1.3.3/doc/stereo.html @ 1:05aa0afa9217

Bring in flac, ogg, vorbis
author Chris Cannam
date Tue, 19 Mar 2013 17:37:49 +0000
rev   line source
Chris@1 1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Chris@1 2 <html>
Chris@1 3 <head>
Chris@1 4
Chris@1 5 <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-15"/>
Chris@1 6 <title>Ogg Vorbis Documentation</title>
Chris@1 7
Chris@1 8 <style type="text/css">
Chris@1 9 body {
Chris@1 10 margin: 0 18px 0 18px;
Chris@1 11 padding-bottom: 30px;
Chris@1 12 font-family: Verdana, Arial, Helvetica, sans-serif;
Chris@1 13 color: #333333;
Chris@1 14 font-size: .8em;
Chris@1 15 }
Chris@1 16
Chris@1 17 a {
Chris@1 18 color: #3366cc;
Chris@1 19 }
Chris@1 20
Chris@1 21 img {
Chris@1 22 border: 0;
Chris@1 23 }
Chris@1 24
Chris@1 25 #xiphlogo {
Chris@1 26 margin: 30px 0 16px 0;
Chris@1 27 }
Chris@1 28
Chris@1 29 #content p {
Chris@1 30 line-height: 1.4;
Chris@1 31 }
Chris@1 32
Chris@1 33 h1, h1 a, h2, h2 a, h3, h3 a, h4, h4 a {
Chris@1 34 font-weight: bold;
Chris@1 35 color: #ff9900;
Chris@1 36 margin: 1.3em 0 8px 0;
Chris@1 37 }
Chris@1 38
Chris@1 39 h1 {
Chris@1 40 font-size: 1.3em;
Chris@1 41 }
Chris@1 42
Chris@1 43 h2 {
Chris@1 44 font-size: 1.2em;
Chris@1 45 }
Chris@1 46
Chris@1 47 h3 {
Chris@1 48 font-size: 1.1em;
Chris@1 49 }
Chris@1 50
Chris@1 51 li {
Chris@1 52 line-height: 1.4;
Chris@1 53 }
Chris@1 54
Chris@1 55 #copyright {
Chris@1 56 margin-top: 30px;
Chris@1 57 line-height: 1.5em;
Chris@1 58 text-align: center;
Chris@1 59 font-size: .8em;
Chris@1 60 color: #888888;
Chris@1 61 clear: both;
Chris@1 62 }
Chris@1 63 </style>
Chris@1 64
Chris@1 65 </head>
Chris@1 66
Chris@1 67 <body>
Chris@1 68
Chris@1 69 <div id="xiphlogo">
Chris@1 70 <a href="http://www.xiph.org/"><img src="fish_xiph_org.png" alt="Fish Logo and Xiph.Org"/></a>
Chris@1 71 </div>
Chris@1 72
Chris@1 73 <h1>Ogg Vorbis stereo-specific channel coupling discussion</h1>
Chris@1 74
Chris@1 75 <h2>Abstract</h2>
Chris@1 76
Chris@1 77 <p>The Vorbis audio CODEC provides channel coupling
Chris@1 78 mechanisms designed to reduce effective bitrate by both eliminating
Chris@1 79 interchannel redundancy and eliminating stereo image information
Chris@1 80 labeled inaudible or undesirable according to spatial psychoacoustic
Chris@1 81 models. This document describes the mechanical coupling
Chris@1 82 mechanisms available within the Vorbis specification, as well as the
Chris@1 83 specific stereo coupling models used by the reference
Chris@1 84 <tt>libvorbis</tt> codec provided by xiph.org.</p>
Chris@1 85
Chris@1 86 <h2>Mechanisms</h2>
Chris@1 87
Chris@1 88 <p>In encoder release beta 4 and earlier, Vorbis supported multiple
Chris@1 89 channel encoding, but the channels were encoded entirely separately
Chris@1 90 with no cross-analysis or redundancy elimination between channels.
Chris@1 91 This multichannel strategy is very similar to mp3's <em>dual
Chris@1 92 stereo</em> mode and Vorbis uses the same name for its analogous
Chris@1 93 uncoupled multichannel modes.</p>
Chris@1 94
Chris@1 95 <p>However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and
Chris@1 96 later implement, a coupled channel strategy. Vorbis has two specific
Chris@1 97 mechanisms that may be used alone or in conjunction to implement
Chris@1 98 channel coupling. The first is <em>channel interleaving</em> via
Chris@1 99 residue backend type 2, and the second is <em>square polar
Chris@1 100 mapping</em>. These two general mechanisms are particularly well
Chris@1 101 suited to coupling due to the structure of Vorbis encoding, as we'll
Chris@1 102 explore below, and using both we can implement totally
Chris@1 103 <em>lossless stereo image coupling</em> [bit-for-bit decode-identical
Chris@1 104 to uncoupled modes], as well as various lossy models that seek to
Chris@1 105 eliminate inaudible or unimportant aspects of the stereo image in
Chris@1 106 order to enhance bitrate. The exact coupling implementation is
Chris@1 107 generalized to allow the encoder a great deal of flexibility in
Chris@1 108 implementation of a stereo or surround model without requiring any
Chris@1 109 significant complexity increase over the combinatorially simpler
Chris@1 110 mid/side joint stereo of mp3 and other current audio codecs.</p>
Chris@1 111
Chris@1 112 <p>A particular Vorbis bitstream may apply channel coupling directly to
Chris@1 113 more than a pair of channels; polar mapping is hierarchical such that
Chris@1 114 polar coupling may be extrapolated to an arbitrary number of channels
Chris@1 115 and is not restricted to only stereo, quadraphonics, ambisonics or 5.1
Chris@1 116 surround. However, the scope of this document restricts itself to the
Chris@1 117 stereo coupling case.</p>
Chris@1 118
Chris@1 119 <a name="sqpm"></a>
Chris@1 120 <h3>Square Polar Mapping</h3>
Chris@1 121
Chris@1 122 <h4>maximal correlation</h4>
Chris@1 123
Chris@1 124 <p>Recall that the basic structure of a Vorbis I stream first generates
Chris@1 125 from input audio a spectral 'floor' function that serves as an
Chris@1 126 MDCT-domain whitening filter. This floor is meant to represent the
Chris@1 127 rough envelope of the frequency spectrum, using whatever metric the
Chris@1 128 encoder cares to define. This floor is subtracted from the log
Chris@1 129 frequency spectrum, effectively normalizing the spectrum by frequency.
Chris@1 130 Each input channel is associated with a unique floor function.</p>
Chris@1 131
Chris@1 132 <p>The basic idea behind any stereo coupling is that the left and right
Chris@1 133 channels usually correlate. This correlation is even stronger if one
Chris@1 134 first accounts for energy differences in any given frequency band
Chris@1 135 across left and right; think for example of individual instruments
Chris@1 136 mixed into different portions of the stereo image, or a stereo
Chris@1 137 recording with a dominant feature not perfectly in the center. The
Chris@1 138 floor functions, each specific to a channel, provide the perfect means
Chris@1 139 of normalizing left and right energies across the spectrum to maximize
Chris@1 140 correlation before coupling. This feature of the Vorbis format is not
Chris@1 141 a convenient accident.</p>
Chris@1 142
Chris@1 143 <p>Because we strive to maximally correlate the left and right channels
Chris@1 144 and generally succeed in doing so, left and right residue is typically
Chris@1 145 nearly identical. We could use channel interleaving (discussed below)
Chris@1 146 alone to efficiently remove the redundancy between the left and right
Chris@1 147 channels as a side effect of entropy encoding, but a polar
Chris@1 148 representation gives benefits when left/right correlation is
Chris@1 149 strong.</p>
Chris@1 150
Chris@1 151 <h4>point and diffuse imaging</h4>
Chris@1 152
Chris@1 153 <p>The first advantage of a polar representation is that it effectively
Chris@1 154 separates the spatial audio information into a 'point image'
Chris@1 155 (magnitude) at a given frequency and located somewhere in the sound
Chris@1 156 field, and a 'diffuse image' (angle) that fills a large amount of
Chris@1 157 space simultaneously. Even if we preserve only the magnitude (point)
Chris@1 158 data, a detailed and carefully chosen floor function in each channel
Chris@1 159 provides us with a free, fine-grained, frequency relative intensity
Chris@1 160 stereo*. Angle information represents diffuse sound fields, such as
Chris@1 161 reverberation that fills the entire space simultaneously.</p>
Chris@1 162
Chris@1 163 <p>*<em>Because the Vorbis model supports a number of different possible
Chris@1 164 stereo models and these models may be mixed, we do not use the term
Chris@1 165 'intensity stereo' when talking about Vorbis; instead we use the terms
Chris@1 166 'point stereo', 'phase stereo' and subcategories of each.</em></p>
Chris@1 167
Chris@1 168 <p>The majority of a stereo image is representable by polar magnitude
Chris@1 169 alone, as strong sounds tend to be produced at near-point sources;
Chris@1 170 even non-diffuse, fast, sharp echoes track very accurately using
Chris@1 171 magnitude representation almost alone (for those experimenting with
Chris@1 172 Vorbis tuning, this strategy works much better with the precise,
Chris@1 173 piecewise control of floor 1; the continuous approximation of floor 0
Chris@1 174 results in unstable imaging). Reverberation and diffuse sounds tend
Chris@1 175 to contain less energy and be psychoacoustically dominated by the
Chris@1 176 point sources embedded in them. Thus, we again tend to concentrate
Chris@1 177 more represented energy into a predictably smaller number of values.
Chris@1 178 Separating representation of point and diffuse imaging also allows us
Chris@1 179 to model and manipulate point and diffuse qualities separately.</p>
Chris@1 180
Chris@1 181 <h4>controlling bit leakage and symbol crosstalk</h4>
Chris@1 182
Chris@1 183 <p>Because polar
Chris@1 184 representation concentrates represented energy into fewer large
Chris@1 185 values, we reduce bit 'leakage' during cascading (multistage VQ
Chris@1 186 encoding) as a secondary benefit. A single large, monolithic VQ
Chris@1 187 codebook is more efficient than a cascaded book due to entropy
Chris@1 188 'crosstalk' among symbols between different stages of a multistage cascade.
Chris@1 189 Polar representation is a way of further concentrating entropy into
Chris@1 190 predictable locations so that codebook design can take steps to
Chris@1 191 improve multistage codebook efficiency. It also allows us to cascade
Chris@1 192 various elements of the stereo image independently.</p>
Chris@1 193
Chris@1 194 <h4>eliminating trigonometry and rounding</h4>
Chris@1 195
Chris@1 196 <p>Rounding and computational complexity are potential problems with a
Chris@1 197 polar representation. As our encoding process involves quantization,
Chris@1 198 mixing a polar representation and quantization makes it potentially
Chris@1 199 impossible, depending on implementation, to construct a coupled stereo
Chris@1 200 mechanism that results in bit-identical decompressed output compared
Chris@1 201 to an uncoupled encoding should the encoder desire it.</p>
Chris@1 202
Chris@1 203 <p>Vorbis uses a mapping that preserves the most useful qualities of
Chris@1 204 polar representation, relies only on addition/subtraction (during
Chris@1 205 decode; high quality encoding still requires some trig), and makes it
Chris@1 206 trivial before or after quantization to represent an angle/magnitude
Chris@1 207 through a one-to-one mapping from possible left/right value
Chris@1 208 permutations. We do this by basing our polar representation on the
Chris@1 209 unit square rather than the unit-circle.</p>
Chris@1 210
Chris@1 211 <p>Given a magnitude and angle, we recover left and right using the
Chris@1 212 following function (note that A/B may be left/right or right/left
Chris@1 213 depending on the coupling definition used by the encoder):</p>
Chris@1 214
Chris@1 215 <pre>
Chris@1 216 if(magnitude>0){
Chris@1 217   if(angle>0){            /* A is the larger channel */
Chris@1 218     A=magnitude;
Chris@1 219     B=magnitude-angle;
Chris@1 220   }else{                  /* B is the larger channel */
Chris@1 221     B=magnitude;
Chris@1 222     A=magnitude+angle;
Chris@1 223   }
Chris@1 224 }else{                    /* antisymmetric case: magnitude<=0 */
Chris@1 225   if(angle>0){
Chris@1 226     A=magnitude;
Chris@1 227     B=magnitude+angle;
Chris@1 228   }else{
Chris@1 229     B=magnitude;
Chris@1 230     A=magnitude-angle;
Chris@1 231   }
Chris@1 232 }
Chris@1 233 </pre>
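<p>As a minimal sketch, the decode function above can be written as
self-contained C and paired with an inverse. The names
<tt>sqpolar_decode</tt>, <tt>sqpolar_encode</tt> and
<tt>sqpolar_roundtrip_ok</tt> are hypothetical and are not part of
libvorbis. The encoder shown is only one possible inverse, an assumption
chosen so that every integer A/B pair receives exactly one
magnitude/angle code:</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Decode: square polar (magnitude, angle) -> (A, B), following the
   function given in the text. Only addition and subtraction are used. */
static void sqpolar_decode(int magnitude, int angle, int *A, int *B)
{
  if (magnitude > 0) {
    if (angle > 0) { *A = magnitude; *B = magnitude - angle; }
    else           { *B = magnitude; *A = magnitude + angle; }
  } else {
    if (angle > 0) { *A = magnitude; *B = magnitude + angle; }
    else           { *B = magnitude; *A = magnitude - angle; }
  }
}

/* Encode: (A, B) -> (magnitude, angle). Hypothetical inverse of the
   decoder above. |magnitude| is the larger of |A| and |B|; the sign of
   the magnitude and the angle select a point on that square. */
static void sqpolar_encode(int A, int B, int *magnitude, int *angle)
{
  int m = abs(A) > abs(B) ? abs(A) : abs(B);

  if (B == m)       { *magnitude =  m; *angle = A - m;  } /* top edge, angle <= 0    */
  else if (B == -m) { *magnitude = -m; *angle = -m - A; } /* bottom edge, angle <= 0 */
  else if (A == m)  { *magnitude =  m; *angle = m - B;  } /* right edge, angle > 0   */
  else              { *magnitude = -m; *angle = B + m;  } /* left edge, angle > 0    */

  /* Angles stay inside the canonical range -2|m| .. 2|m|-1. */
  assert(*angle >= -2 * m && (m == 0 || *angle <= 2 * m - 1));
}

/* Returns 1 if every (A, B) in [-range, range]^2 survives
   encode -> decode unchanged, 0 otherwise. */
static int sqpolar_roundtrip_ok(int range)
{
  int A, B, mag, ang, A2, B2;

  for (A = -range; A <= range; A++) {
    for (B = -range; B <= range; B++) {
      sqpolar_encode(A, B, &mag, &ang);
      sqpolar_decode(mag, ang, &A2, &B2);
      if (A2 != A || B2 != B) return 0;
    }
  }
  return 1;
}
```

<p>Calling <tt>sqpolar_roundtrip_ok(5)</tt> walks every A/B pair with
values from -5 through 5. Because each pair must decode back to itself,
no two pairs can share a magnitude/angle code.</p>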
Chris@1 234
Chris@1 235 <p>The function is antisymmetric for positive and negative magnitudes in
Chris@1 236 order to eliminate a redundant value when quantizing. For example, if
Chris@1 237 we're quantizing to integer values, we can visualize a magnitude of 5
Chris@1 238 and an angle of -2 as follows:</p>
Chris@1 239
Chris@1 240 <p><img src="squarepolar.png" alt="square polar"/></p>
Chris@1 241
Chris@1 242 <p>This representation loses or replicates no values; if A and B each
Chris@1 243 range over the integers -5 through 5, the number of possible Cartesian
Chris@1 244 permutations is 121. Represented in square polar notation, the
Chris@1 245 possible values are:</p>
Chris@1 246
Chris@1 247 <pre>
Chris@1 248 0, 0
Chris@1 249
Chris@1 250 -1,-2 -1,-1 -1, 0 -1, 1
Chris@1 251
Chris@1 252 1,-2 1,-1 1, 0 1, 1
Chris@1 253
Chris@1 254 -2,-4 -2,-3 -2,-2 -2,-1 -2, 0 -2, 1 -2, 2 -2, 3
Chris@1 255
Chris@1 256 2,-4 2,-3 ... following the pattern ...
Chris@1 257
Chris@1 258 ... 5, 1 5, 2 5, 3 5, 4 5, 5 5, 6 5, 7 5, 8 5, 9
Chris@1 259
Chris@1 260 </pre>
Chris@1 261
Chris@1 262 <p>...for a grand total of 121 possible values, the same number as in
Chris@1 263 Cartesian representation (note that, for example, <tt>5,-10</tt> is
Chris@1 264 the same as <tt>-5,10</tt>, so there's no reason to represent
Chris@1 265 both. 2,10 cannot happen, and there's no reason to account for it.)
Chris@1 266 It's also obvious that this mapping is exactly reversible.</p>
Chris@1 267
Chris@1 268 <h3>Channel interleaving</h3>
Chris@1 269
Chris@1 270 <p>We can remap an A/B vector using polar mapping into a magnitude/angle
Chris@1 271 vector, and it's clear that, in general, this concentrates energy in
Chris@1 272 the magnitude vector and reduces the amount of information to encode
Chris@1 273 in the angle vector. Encoding these vectors independently with
Chris@1 274 residue backend #0 or residue backend #1 will result in bitrate
Chris@1 275 savings. However, there are still implicit correlations between the
Chris@1 276 magnitude and angle vectors. The most obvious is that the amplitude
Chris@1 277 of the angle is bounded by its corresponding magnitude value.</p>
Chris@1 278
Chris@1 279 <p>Entropy coding the results, then, further benefits from the entropy
Chris@1 280 model being able to compress magnitude and angle simultaneously. For
Chris@1 281 this reason, Vorbis implements residue backend #2 which pre-interleaves
Chris@1 282 a number of input vectors (in the stereo case, two, A and B) into a
Chris@1 283 single output vector (with the elements in the order of
Chris@1 284 A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus
Chris@1 285 each vector to be coded by the vector quantization backend consists of
Chris@1 286 matching magnitude and angle values.</p>
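<p>The interleaving order described above can be sketched in C as
follows. This illustrates only the element ordering, not the residue
backend itself; <tt>interleave2</tt> and <tt>deinterleave2</tt> are
hypothetical names, and plain <tt>int</tt> vectors are an assumption made
for brevity:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two n-element vectors into out[0 .. 2n-1] in the order
   a[0], b[0], a[1], b[1], ..., a[n-1], b[n-1]. */
static void interleave2(const int *a, const int *b, size_t n, int *out)
{
  size_t i;

  assert(a && b && out);
  for (i = 0; i < n; i++) {
    out[2 * i]     = a[i];
    out[2 * i + 1] = b[i];
  }
}

/* Inverse operation: split an interleaved 2n-element vector back into
   its two n-element source vectors. */
static void deinterleave2(const int *in, size_t n, int *a, int *b)
{
  size_t i;

  assert(in && a && b);
  for (i = 0; i < n; i++) {
    a[i] = in[2 * i];
    b[i] = in[2 * i + 1];
  }
}
```

<p>When the two inputs are the magnitude and angle vectors, each
adjacent pair in the output holds a matching magnitude and angle, so a
codebook spanning an even number of elements always sees them
together.</p>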
Chris@1 287
Chris@1 288 <p>The astute reader, at this point, will notice that in the theoretical
Chris@1 289 case in which we can use monolithic codebooks of arbitrarily large
Chris@1 290 size, we can directly interleave and encode left and right without
Chris@1 291 polar mapping; in fact, the polar mapping does not appear to lend any
Chris@1 292 benefit whatsoever to the efficiency of the entropy coding. In fact,
Chris@1 293 it is perfectly possible and reasonable to build a Vorbis encoder that
Chris@1 294 dispenses with polar mapping entirely and merely interleaves the
Chris@1 295 channels. Libvorbis-based encoders may configure such an encoding and
Chris@1 296 it will work as intended.</p>
Chris@1 297
Chris@1 298 <p>However, when we leave the ideal/theoretical domain, we notice that
Chris@1 299 polar mapping does give additional practical benefits, as discussed in
Chris@1 300 the above section on polar mapping and summarized again here:</p>
Chris@1 301
Chris@1 302 <ul>
Chris@1 303 <li>Polar mapping aids in controlling entropy 'leakage' between stages
Chris@1 304 of a cascaded codebook.</li>
Chris@1 305 <li>Polar mapping separates the stereo image
Chris@1 306 into point and diffuse components which may be analyzed and handled
Chris@1 307 differently.</li>
Chris@1 308 </ul>
Chris@1 309
Chris@1 310 <h2>Stereo Models</h2>
Chris@1 311
Chris@1 312 <h3>Dual Stereo</h3>
Chris@1 313
Chris@1 314 <p>Dual stereo refers to stereo encoding where the channels are entirely
Chris@1 315 separate; they are analyzed and encoded as entirely distinct entities.
Chris@1 316 This terminology is familiar from mp3.</p>
Chris@1 317
Chris@1 318 <h3>Lossless Stereo</h3>
Chris@1 319
Chris@1 320 <p>Using polar mapping and/or channel interleaving, it's possible to
Chris@1 321 couple Vorbis channels losslessly, that is, construct a stereo
Chris@1 322 coupling encoding that both saves space and decodes
Chris@1 323 bit-identically to dual stereo. OggEnc 1.0 and later use this
Chris@1 324 mode in all high-bitrate encoding.</p>
Chris@1 325
Chris@1 326 <p>Overall, this stereo mode is overkill; however, it offers a safe
Chris@1 327 alternative to users concerned about the slightest possible
Chris@1 328 degradation to the stereo image or archival quality audio.</p>
Chris@1 329
Chris@1 330 <h3>Phase Stereo</h3>
Chris@1 331
Chris@1 332 <p>Phase stereo is the least aggressive means of gracefully dropping
Chris@1 333 resolution from the stereo image; it affects only diffuse imaging.</p>
Chris@1 334
Chris@1 335 <p>It's often quoted that the human ear is deaf to signal phase above
Chris@1 336 about 4kHz; this is nearly true and a passable rule of thumb, but it
Chris@1 337 can be demonstrated that even an average user can tell the difference
Chris@1 338 between high frequency in-phase and out-of-phase noise. Obviously
Chris@1 339 then, the statement is not entirely true. However, it's also the case
Chris@1 340 that one must resort to nearly such an extreme demonstration before
Chris@1 341 finding the counterexample.</p>
Chris@1 342
Chris@1 343 <p>'Phase stereo' is simply a more aggressive quantization of the polar
Chris@1 344 angle vector; above 4kHz it's generally quite safe to quantize noise
Chris@1 345 and noisy elements to only a handful of allowed phases, or to thin the
Chris@1 346 phase with respect to the magnitude. The phases of high amplitude
Chris@1 347 pure tones may or may not be preserved more carefully (they are
Chris@1 348 relatively rare and L/R tend to be in phase, so there is generally
Chris@1 349 little reason not to spend a few more bits on them).</p>
Chris@1 350
Chris@1 351 <h4>example: eight phase stereo</h4>
Chris@1 352
Chris@1 353 <p>Vorbis may implement phase stereo coupling by preserving the entirety
Chris@1 354 of the magnitude vector (essential to fine amplitude and energy
Chris@1 355 resolution overall) and quantizing the angle vector to one of only
Chris@1 356 four possible values. Given that the magnitude vector may be positive
Chris@1 357 or negative, this results in left and right phase having eight
Chris@1 358 possible permutations, thus 'eight phase stereo':</p>
Chris@1 359
Chris@1 360 <p><img src="eightphase.png" alt="eight phase"/></p>
Chris@1 361
Chris@1 362 <p>Left and right may be in phase (positive or negative), the most common
Chris@1 363 case by far, or out of phase by 90 or 180 degrees.</p>
Chris@1 364
Chris@1 365 <h4>example: four phase stereo</h4>
Chris@1 366
Chris@1 367 <p>Similarly, four phase stereo takes the quantization one step further;
Chris@1 368 it allows only in-phase and 180 degree out-of-phase signals:</p>
Chris@1 369
Chris@1 370 <p><img src="fourphase.png" alt="four phase"/></p>
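<p>The two examples above can be sketched as a quantizer on the square
polar angle. This is an illustration under stated assumptions, not the
libvorbis implementation: <tt>quantize_phase</tt> is a hypothetical name,
and the sketch assumes that "in phase" corresponds to an angle of 0,
"90 degrees" to an angle of plus or minus |magnitude| (one channel at
zero), and "180 degrees" to an angle of -2|magnitude| (channels equal and
opposite). Snapping to the nearest allowed angle is likewise an assumed
policy:</p>

```c
#include <assert.h>
#include <stdlib.h>

/* Snap a square polar angle to the nearest allowed phase.
   phases == 8: allowed angles are multiples of |magnitude|.
   phases == 4: allowed angles are multiples of 2*|magnitude|.
   The magnitude is preserved except for a possible sign flip, which
   keeps the result inside the canonical range -2|m| .. 2|m|-1. */
static void quantize_phase(int *magnitude, int *angle, int phases)
{
  int m = abs(*magnitude);
  int step, best, c;

  assert(phases == 8 || phases == 4);
  if (m == 0) { *angle = 0; return; }

  step = (phases == 8) ? m : 2 * m;

  /* Linear search over the candidates -2m, -2m+step, ..., +2m. */
  best = -2 * m;
  for (c = -2 * m + step; c <= 2 * m; c += step)
    if (abs(*angle - c) < abs(*angle - best)) best = c;

  /* An angle of +2m names the same A/B point as the opposite-sign
     magnitude with an angle of -2m; use that canonical form. */
  if (best == 2 * m) {
    *magnitude = -*magnitude;
    best = -2 * m;
  }
  *angle = best;
}
```

<p>With a magnitude of 5 and an angle of -2, for instance, both settings
snap the angle to 0, so left and right decode as equal and in phase.</p>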
Chris@1 371
Chris@1 372 <h3>Point Stereo</h3>
Chris@1 373
Chris@1 374 <p>Point stereo eliminates the possibility of out-of-phase signal
Chris@1 375 entirely. Any diffuse quality to a sound source tends to collapse
Chris@1 376 inward to a point somewhere within the stereo image. A practical
Chris@1 377 example would be balanced reverberations within a large, live space;
Chris@1 378 normally the sound is diffuse and soft, giving a sonic impression of
Chris@1 379 volume. In point-stereo, the reverberations would still exist, but
Chris@1 380 sound fairly firmly centered within the image (assuming the
Chris@1 381 reverberation was centered overall; if the reverberation is stronger
Chris@1 382 to the left, then the point of localization in point stereo would be
Chris@1 383 to the left). This effect is most noticeable at low and mid
Chris@1 384 frequencies and using headphones (which grant perfect stereo
Chris@1 385 separation). Point stereo is a graceful but generally easy to
Chris@1 386 detect degradation to the sound quality and is thus used in frequency
Chris@1 387 ranges where it is least noticeable.</p>
Chris@1 388
Chris@1 389 <h3>Mixed Stereo</h3>
Chris@1 390
Chris@1 391 <p>Mixed stereo is the simultaneous use of more than one of the above
Chris@1 392 stereo encoding models, generally using more aggressive modes in
Chris@1 393 higher frequencies, lower amplitudes or 'nearly' in-phase sound.</p>
Chris@1 394
Chris@1 395 <p>It is also the case that near-DC frequencies should be encoded using
Chris@1 396 lossless coupling to avoid frame blocking artifacts.</p>
Chris@1 397
Chris@1 398 <h3>Vorbis Stereo Modes</h3>
Chris@1 399
Chris@1 400 <p>Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes
Chris@1 401 constructed out of lossless and point stereo. Phase stereo was used
Chris@1 402 in the rc2 encoder, but is not currently used for simplicity's sake. It
Chris@1 403 will likely be re-added to the stereo model in the future.</p>
Chris@1 404
Chris@1 405 <div id="copyright">
Chris@1 406 The Xiph Fish Logo is a
Chris@1 407 trademark (&trade;) of Xiph.Org.<br/>
Chris@1 408
Chris@1 409 These pages &copy; 1994 - 2005 Xiph.Org. All rights reserved.
Chris@1 410 </div>
Chris@1 411
Chris@1 412 </body>
Chris@1 413 </html>
Chris@1 414
Chris@1 415
Chris@1 416
Chris@1 417
Chris@1 418
Chris@1 419