The Vorbis audio CODEC provides channel coupling mechanisms designed to reduce effective bitrate by both eliminating interchannel redundancy and eliminating stereo image information labeled inaudible or undesirable according to spatial psychoacoustic models. This document describes both the mechanical coupling mechanisms available within the Vorbis specification and the specific stereo coupling models used by the reference libvorbis codec provided by xiph.org.
In encoder release beta 4 and earlier, Vorbis supported multiple channel encoding, but the channels were encoded entirely separately with no cross-analysis or redundancy elimination between channels. This multichannel strategy is very similar to mp3's dual stereo mode, and Vorbis uses the same name for its analogous uncoupled multichannel modes.
However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and later implement, a coupled channel strategy. Vorbis has two specific mechanisms that may be used alone or in conjunction to implement channel coupling. The first is channel interleaving via residue backend type 2, and the second is square polar mapping. These two general mechanisms are particularly well suited to coupling due to the structure of Vorbis encoding, as we'll explore below. Using both, we can implement totally lossless stereo image coupling (bit-for-bit decode-identical to uncoupled modes), as well as various lossy models that seek to eliminate inaudible or unimportant aspects of the stereo image in order to reduce bitrate. The exact coupling implementation is generalized to allow the encoder a great deal of flexibility in implementation of a stereo or surround model without requiring any significant complexity increase over the combinatorially simpler mid/side joint stereo of mp3 and other current audio codecs.
A particular Vorbis bitstream may apply channel coupling directly to more than a pair of channels; polar mapping is hierarchical such that polar coupling may be extrapolated to an arbitrary number of channels and is not restricted to only stereo, quadraphonics, ambisonics or 5.1 surround. However, the scope of this document restricts itself to the stereo coupling case.
Recall that the basic structure of a Vorbis I stream first generates from input audio a spectral 'floor' function that serves as an MDCT-domain whitening filter. This floor is meant to represent the rough envelope of the frequency spectrum, using whatever metric the encoder cares to define. This floor is subtracted from the log frequency spectrum, effectively normalizing the spectrum by frequency. Each input channel is associated with a unique floor function.
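As a rough illustration of the normalization just described: subtracting the floor from the log spectrum is equivalent to dividing by the floor in the linear domain. The sketch below assumes a strictly positive, linear-amplitude floor curve with one value per MDCT bin. The name `remove_floor` and the array layout are hypothetical and are not taken from libvorbis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libvorbis: divide each MDCT
   coefficient by the floor value for its bin.  This is the
   linear-domain equivalent of subtracting the floor from the log
   spectrum.  ASSUMPTION: every floor value is strictly positive. */
void remove_floor(const float *spectrum, const float *floor_curve,
                  float *residue, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(floor_curve[i] > 0.0f);
        residue[i] = spectrum[i] / floor_curve[i];
    }
}
```

Running this once per channel, each with its own floor, yields the per-channel residue vectors that the coupling mechanisms operate on.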
The basic idea behind any stereo coupling is that the left and right channels usually correlate. This correlation is even stronger if one first accounts for energy differences in any given frequency band across left and right; think for example of individual instruments mixed into different portions of the stereo image, or a stereo recording with a dominant feature not perfectly in the center. The floor functions, each specific to a channel, provide the perfect means of normalizing left and right energies across the spectrum to maximize correlation before coupling. This feature of the Vorbis format is not a convenient accident.
Because we strive to maximally correlate the left and right channels and generally succeed in doing so, left and right residue is typically nearly identical. We could use channel interleaving (discussed below) alone to efficiently remove the redundancy between the left and right channels as a side effect of entropy encoding, but a polar representation gives benefits when left/right correlation is strong.
The first advantage of a polar representation is that it effectively separates the spatial audio information into a 'point image' (magnitude) at a given frequency and located somewhere in the sound field, and a 'diffuse image' (angle) that fills a large amount of space simultaneously. Even if we preserve only the magnitude (point) data, a detailed and carefully chosen floor function in each channel provides us with a free, fine-grained, frequency-relative intensity stereo*. Angle information represents diffuse sound fields, such as reverberation that fills the entire space simultaneously.
*Because the Vorbis model supports a number of different possible stereo models and these models may be mixed, we do not use the term 'intensity stereo' when talking about Vorbis; instead we use the terms 'point stereo', 'phase stereo' and subcategories of each.
The majority of a stereo image is representable by polar magnitude alone, as strong sounds tend to be produced at near-point sources; even non-diffuse, fast, sharp echoes track very accurately using magnitude representation almost alone (for those experimenting with Vorbis tuning, this strategy works much better with the precise, piecewise control of floor 1; the continuous approximation of floor 0 results in unstable imaging). Reverberation and diffuse sounds tend to contain less energy and be psychoacoustically dominated by the point sources embedded in them. Thus, we again tend to concentrate more represented energy into a predictably smaller number of values. Separating representation of point and diffuse imaging also allows us to model and manipulate point and diffuse qualities separately.
Because polar representation concentrates represented energy into fewer large values, we reduce bit 'leakage' during cascading (multistage VQ encoding) as a secondary benefit. A single large, monolithic VQ codebook is more efficient than a cascaded book due to entropy 'crosstalk' among symbols between different stages of a multistage cascade. Polar representation is a way of further concentrating entropy into predictable locations so that codebook design can take steps to improve multistage codebook efficiency. It also allows us to cascade various elements of the stereo image independently.
Rounding and computational complexity are potential problems with a polar representation. Because our encoding process involves quantization, mixing a polar representation with quantization can make it impossible, depending on implementation, to construct a coupled stereo mechanism that produces bit-identical decompressed output compared to an uncoupled encoding, should the encoder desire it.
Vorbis uses a mapping that preserves the most useful qualities of polar representation, relies only on addition/subtraction (during decode; high quality encoding still requires some trig), and makes it trivial before or after quantization to represent an angle/magnitude through a one-to-one mapping from possible left/right value permutations. We do this by basing our polar representation on the unit square rather than the unit circle.
Given a magnitude and angle, we recover left and right using the following function (note that A/B may be left/right or right/left depending on the coupling definition used by the encoder):
    if (magnitude > 0) {
        if (angle > 0) {
            A = magnitude;
            B = magnitude - angle;
        } else {
            B = magnitude;
            A = magnitude + angle;
        }
    } else {
        /* non-positive magnitude: the sign of the angle term flips */
        if (angle > 0) {
            A = magnitude;
            B = magnitude + angle;
        } else {
            B = magnitude;
            A = magnitude - angle;
        }
    }

The function is antisymmetric for positive and negative magnitudes in order to eliminate a redundant value when quantizing. For example, if we're quantizing to integer values, we can visualize a magnitude of 5 and an angle of -2 (which the function above maps to A=3, B=5) as follows:
This representation loses or replicates no values; if A and B each range over the integers -5 through 5, the number of possible Cartesian permutations is 121. Represented in square polar notation, the possible values are:
     0, 0

    -1,-2  -1,-1  -1, 0  -1, 1

     1,-2   1,-1   1, 0   1, 1

    -2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3

     2,-4   2,-3   ... following the pattern ...

     ...    5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9

...for a grand total of 121 possible values, the same number as in Cartesian representation (note that, for example, 5,-10 is the same as -5,10, so there's no reason to represent both. 2,10 cannot happen, and there's no reason to account for it.) It's also obvious that this mapping is exactly reversible.
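The count above can be checked mechanically. The following is a minimal sketch, not libvorbis code: it wraps the decode rule in a function, walks every magnitude/angle pair in the table (angles from -2|m| through 2|m|-1 for each nonzero magnitude m, plus the single pair 0,0), and confirms that each decoded A/B pair is in range and produced only once. The function names and the `LIMIT` bound are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Decode rule from the text: recover A and B from a square polar
   (magnitude, angle) pair. */
void square_polar_decode(int magnitude, int angle, int *A, int *B)
{
    if (magnitude > 0) {
        if (angle > 0) { *A = magnitude; *B = magnitude - angle; }
        else           { *B = magnitude; *A = magnitude + angle; }
    } else {
        if (angle > 0) { *A = magnitude; *B = magnitude + angle; }
        else           { *B = magnitude; *A = magnitude - angle; }
    }
}

/* Enumerate the table for |magnitude| <= max_mag, decode every entry,
   and return the number of distinct (A, B) pairs produced.  Returns -1
   if any pair is produced twice or falls outside -max_mag..max_mag. */
int count_square_polar_values(int max_mag)
{
    enum { LIMIT = 16 };  /* arbitrary bound for this sketch */
    unsigned char seen[2 * LIMIT + 1][2 * LIMIT + 1];
    int count = 0;

    assert(max_mag >= 0 && max_mag <= LIMIT);
    memset(seen, 0, sizeof seen);

    for (int m = -max_mag; m <= max_mag; m++) {
        int span = 2 * (m < 0 ? -m : m);
        int lo = -span;
        int hi = (m == 0) ? 0 : span - 1;
        for (int angle = lo; angle <= hi; angle++) {
            int A, B;
            square_polar_decode(m, angle, &A, &B);
            if (A < -max_mag || A > max_mag || B < -max_mag || B > max_mag)
                return -1;
            if (seen[A + LIMIT][B + LIMIT])
                return -1;
            seen[A + LIMIT][B + LIMIT] = 1;
            count++;
        }
    }
    return count;
}
```

Calling `count_square_polar_values(5)` returns 121: each ring of the square at distance m from the origin holds 8m points, split evenly between magnitude +m and magnitude -m.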
We can remap an A/B vector using polar mapping into a magnitude/angle vector, and it's clear that, in general, this concentrates energy in the magnitude vector and reduces the amount of information to encode in the angle vector. Encoding these vectors independently with residue backend #0 or residue backend #1 will result in bitrate savings. However, there are still implicit correlations between the magnitude and angle vectors. The most obvious is that the amplitude of the angle is bounded by its corresponding magnitude value.
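The forward (encoder-side) remapping is not spelled out in the text, but it can be derived by inverting the decode function branch by branch. The sketch below is that derivation for integer values, not the libvorbis implementation; the name `square_polar_encode` is hypothetical, and inputs are assumed to be small enough that `abs()` and the subtractions do not overflow.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical inverse of the decode function, derived from it rather
   than copied from libvorbis. */
void square_polar_encode(int A, int B, int *magnitude, int *angle)
{
    /* The value with the larger absolute value becomes the magnitude;
       ties go to B, matching the decode branches taken when angle <= 0. */
    *magnitude = (abs(A) > abs(B)) ? A : B;

    /* The sign convention of the angle flips with the sign of the
       magnitude, mirroring the antisymmetry of the decode function. */
    *angle = (*magnitude > 0) ? A - B : B - A;

    /* The angle is bounded by its magnitude:
       -2|m| <= angle <= 2|m| - 1, and angle == 0 when m == 0. */
    assert(*angle >= -2 * abs(*magnitude));
    assert(*magnitude == 0 ? *angle == 0
                           : *angle <= 2 * abs(*magnitude) - 1);
}
```

For example, A=3, B=5 maps to magnitude 5 and angle -2. The closing assertions make the bound on the angle explicit, which is the correlation the entropy coder can exploit.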
Entropy coding the results, then, further benefits from the entropy model being able to compress magnitude and angle simultaneously. For this reason, Vorbis implements residue backend #2, which pre-interleaves a number of input vectors (in the stereo case, two, A and B) into a single output vector (with the elements in the order A_0, B_0, A_1, B_1, A_2 ... A_(n-1), B_(n-1)) before entropy encoding. Thus each vector to be coded by the vector quantization backend consists of matching magnitude and angle values.
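The interleaving order described above can be sketched in a few lines. This shows only the element ordering for the two-vector stereo case; it is not the residue backend itself, and the name `interleave2` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two n-element vectors into one 2n-element vector in the
   order a[0], b[0], a[1], b[1], ..., a[n-1], b[n-1]. */
void interleave2(const int *a, const int *b, int *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}
```

With magnitude as the first vector and angle as the second, every adjacent pair in the output holds a magnitude and its matching angle, so a codebook of even dimension sees them together.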
The astute reader, at this point, will notice that in the theoretical case in which we can use monolithic codebooks of arbitrarily large size, we can directly interleave and encode left and right without polar mapping; in that case, the polar mapping does not appear to lend any benefit whatsoever to the efficiency of the entropy coding. In fact, it is perfectly possible and reasonable to build a Vorbis encoder that dispenses with polar mapping entirely and merely interleaves the channels. Libvorbis-based encoders may configure such an encoding and it will work as intended.
However, when we leave the ideal/theoretical domain, we notice that polar mapping does give additional practical benefits, as discussed in the above section on polar mapping and summarized again here:
Dual stereo refers to stereo encoding where the channels are entirely separate; they are analyzed and encoded as entirely distinct entities. This terminology is familiar from mp3.
Using polar mapping and/or channel interleaving, it's possible to couple Vorbis channels losslessly, that is, construct a stereo coupling encoding that both saves space and decodes bit-identically to dual stereo. OggEnc 1.0 and later use this mode in all high-bitrate encoding.
Overall, this stereo mode is overkill; however, it offers a safe alternative to users concerned about the slightest possible degradation to the stereo image or about archival-quality audio.
Phase stereo is the least aggressive means of gracefully dropping resolution from the stereo image; it affects only diffuse imaging.
It's often quoted that the human ear is deaf to signal phase above about 4kHz; this is nearly true and a passable rule of thumb, but it can be demonstrated that even an average user can tell the difference between high frequency in-phase and out-of-phase noise. Obviously then, the statement is not entirely true. However, it's also the case that one must resort to nearly such an extreme demonstration before finding the counterexample.
'Phase stereo' is simply a more aggressive quantization of the polar angle vector; above 4kHz it's generally quite safe to quantize noise and noisy elements to only a handful of allowed phases, or to thin the phase with respect to the magnitude. The phases of high amplitude pure tones may or may not be preserved more carefully (they are relatively rare and L/R tend to be in phase, so there is generally little reason not to spend a few more bits on them).
Vorbis may implement phase stereo coupling by preserving the entirety of the magnitude vector (essential to fine amplitude and energy resolution overall) and quantizing the angle vector to one of only four possible values. Given that the magnitude vector may be positive or negative, this results in left and right phase having eight possible permutations, thus 'eight phase stereo':
Left and right may be in phase (positive or negative), the most common case by far, or out of phase by 90 or 180 degrees.
Similarly, four phase stereo takes the quantization one step further; it allows only in-phase and 180 degree out-of-phase signals:
Point stereo eliminates the possibility of out-of-phase signal entirely. Any diffuse quality to a sound source tends to collapse inward to a point somewhere within the stereo image. A practical example would be balanced reverberations within a large, live space; normally the sound is diffuse and soft, giving a sonic impression of volume. In point stereo, the reverberations would still exist, but sound fairly firmly centered within the image (assuming the reverberation was centered overall; if the reverberation is stronger to the left, then the point of localization in point stereo would be to the left). This effect is most noticeable at low and mid frequencies and using headphones (which grant perfect stereo separation). Point stereo is a graceful but generally easy to detect degradation to the sound quality and is thus used in frequency ranges where it is least noticeable.
Mixed stereo is the simultaneous use of more than one of the above stereo encoding models, generally using more aggressive modes in higher frequencies, lower amplitudes or 'nearly' in-phase sound.
It is also the case that near-DC frequencies should be encoded using lossless coupling to avoid frame blocking artifacts.
Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes constructed out of lossless and point stereo. Phase stereo was used in the rc2 encoder, but is not currently used, for simplicity's sake. It will likely be re-added to the stereo model in the future.