The Vorbis audio CODEC provides channel coupling mechanisms designed to reduce effective bitrate by both eliminating interchannel redundancy and eliminating stereo image information labeled inaudible or undesirable according to spatial psychoacoustic models. This document describes both the mechanical coupling mechanisms available within the Vorbis specification and the specific stereo coupling models used by the reference libvorbis codec provided by xiph.org.
In encoder release beta 4 and earlier, Vorbis supported multiple channel encoding, but the channels were encoded entirely separately with no cross-analysis or redundancy elimination between channels. This multichannel strategy is very similar to mp3's dual stereo mode, and Vorbis uses the same name for its analogous uncoupled multichannel modes.
However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and later implement, a coupled channel strategy. Vorbis has two specific mechanisms that may be used alone or in conjunction to implement channel coupling. The first is channel interleaving via residue backend type 2, and the second is square polar mapping. These two general mechanisms are particularly well suited to coupling due to the structure of Vorbis encoding, as we'll explore below. Using both, we can implement totally lossless stereo image coupling [bit-for-bit decode-identical to uncoupled modes] as well as various lossy models that seek to eliminate inaudible or unimportant aspects of the stereo image in order to reduce bitrate. The exact coupling implementation is generalized to allow the encoder a great deal of flexibility in implementation of a stereo or surround model without requiring any significant complexity increase over the combinatorially simpler mid/side joint stereo of mp3 and other current audio codecs.
A particular Vorbis bitstream may apply channel coupling directly to more than a pair of channels; polar mapping is hierarchical such that polar coupling may be extrapolated to an arbitrary number of channels and is not restricted to only stereo, quadraphonics, ambisonics or 5.1 surround. However, the scope of this document restricts itself to the stereo coupling case.
Recall that Vorbis I encoding first generates from the input audio a spectral 'floor' function that serves as an MDCT-domain whitening filter. This floor is meant to represent the rough envelope of the frequency spectrum, using whatever metric the encoder cares to define. This floor is subtracted from the log frequency spectrum, effectively normalizing the spectrum by frequency. Each input channel is associated with a unique floor function.
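The whitening step described above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not code from libvorbis: the name `whiten_spectrum` and the use of plain linear-amplitude arrays are hypothetical. Dividing by the floor in the linear domain is equivalent to subtracting the floor from the spectrum in the log domain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a libvorbis API: normalize one channel's
   spectrum by that channel's floor curve. Both arrays hold linear
   amplitudes; dividing them corresponds to subtracting the floor from
   the spectrum in the log domain. */
static void whiten_spectrum(const float *spectrum, const float *floor_curve,
                            float *residue, size_t n) {
  size_t i;
  for (i = 0; i < n; i++) {
    /* Assumption of this sketch: the floor is strictly positive. */
    assert(floor_curve[i] > 0.0f);
    residue[i] = spectrum[i] / floor_curve[i];
  }
}
```

Because each channel has its own floor, running this per channel is what equalizes left and right energies before any coupling takes place.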
The basic idea behind any stereo coupling is that the left and right channels usually correlate. This correlation is even stronger if one first accounts for energy differences in any given frequency band across left and right; think for example of individual instruments mixed into different portions of the stereo image, or a stereo recording with a dominant feature not perfectly in the center. The floor functions, each specific to a channel, provide the perfect means of normalizing left and right energies across the spectrum to maximize correlation before coupling. This feature of the Vorbis format is not a convenient accident.
Because we strive to maximally correlate the left and right channels and generally succeed in doing so, left and right residue is typically nearly identical. We could use channel interleaving (discussed below) alone to efficiently remove the redundancy between the left and right channels as a side effect of entropy encoding, but a polar representation gives benefits when left/right correlation is strong.
The first advantage of a polar representation is that it effectively separates the spatial audio information into a 'point image' (magnitude) at a given frequency and located somewhere in the sound field, and a 'diffuse image' (angle) that fills a large amount of space simultaneously. Even if we preserve only the magnitude (point) data, a detailed and carefully chosen floor function in each channel provides us with a free, fine-grained, frequency-relative intensity stereo*. Angle information represents diffuse sound fields, such as reverberation that fills the entire space simultaneously.
*Because the Vorbis model supports a number of different possible stereo models and these models may be mixed, we do not use the term 'intensity stereo' when talking about Vorbis; instead we use the terms 'point stereo', 'phase stereo' and subcategories of each.
The majority of a stereo image is representable by polar magnitude alone, as strong sounds tend to be produced at near-point sources; even non-diffuse, fast, sharp echoes track very accurately using magnitude representation almost alone (for those experimenting with Vorbis tuning, this strategy works much better with the precise, piecewise control of floor 1; the continuous approximation of floor 0 results in unstable imaging). Reverberation and diffuse sounds tend to contain less energy and be psychoacoustically dominated by the point sources embedded in them. Thus, we again tend to concentrate more represented energy into a predictably smaller number of values. Separating representation of point and diffuse imaging also allows us to model and manipulate point and diffuse qualities separately.
Because polar representation concentrates represented energy into fewer large values, we reduce bit 'leakage' during cascading (multistage VQ encoding) as a secondary benefit. A single large, monolithic VQ codebook is more efficient than a cascaded book due to entropy 'crosstalk' among symbols between different stages of a multistage cascade. Polar representation is a way of further concentrating entropy into predictable locations so that codebook design can take steps to improve multistage codebook efficiency. It also allows us to cascade various elements of the stereo image independently.
Rounding and computational complexity are potential problems with a polar representation. As our encoding process involves quantization, mixing a polar representation and quantization makes it potentially impossible, depending on implementation, to construct a coupled stereo mechanism that results in bit-identical decompressed output compared to an uncoupled encoding should the encoder desire it.
Vorbis uses a mapping that preserves the most useful qualities of polar representation, relies only on addition/subtraction (during decode; high quality encoding still requires some trig), and makes it trivial before or after quantization to represent an angle/magnitude through a one-to-one mapping from possible left/right value permutations. We do this by basing our polar representation on the unit square rather than the unit circle.
Given a magnitude and angle, we recover left and right using the following function (note that A/B may be left/right or right/left depending on the coupling definition used by the encoder):

    /* magnitude, angle: coupled input values; A, B: recovered channel values */
    if(magnitude>0){
      if(angle>0){
        A=magnitude;
        B=magnitude-angle;
      }else{
        B=magnitude;
        A=magnitude+angle;
      }
    }else{
      if(angle>0){
        A=magnitude;
        B=magnitude+angle;
      }else{
        B=magnitude;
        A=magnitude-angle;
      }
    }

The function is antisymmetric for positive and negative magnitudes in order to eliminate a redundant value when quantizing. For example, if we're quantizing to integer values, a magnitude of 5 and an angle of -2 decode to A=3, B=5: a point on the square of 'radius' 5, two steps along the edge from the in-phase corner A=B=5.
This representation loses or replicates no values; if A and B each range over the integers -5 through 5, the number of possible Cartesian permutations is 121. Represented in square polar notation, the possible values are:

     0, 0

    -1,-2 -1,-1 -1, 0 -1, 1

     1,-2  1,-1  1, 0  1, 1

    -2,-4 -2,-3 -2,-2 -2,-1 -2, 0 -2, 1 -2, 2 -2, 3

     2,-4  2,-3   ... following the pattern ...

     ...    5, 1  5, 2  5, 3  5, 4  5, 5  5, 6  5, 7  5, 8  5, 9

...for a grand total of 121 possible values, the same number as in Cartesian representation (note that, for example, 5,-10 is the same as -5,10, so there's no reason to represent both. 2,10 cannot happen, and there's no reason to account for it.) It's also obvious that this mapping is exactly reversible.
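The reversibility claim can be checked with a short sketch. The decode function below is the one given in the text; the encode function is an inverse derived from it for illustration only, and it is not taken from libvorbis, whose encoder may use different conventions. All function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Decode: the square polar function given in the text. */
static void sq_polar_decode(int magnitude, int angle, int *A, int *B) {
  if (magnitude > 0) {
    if (angle > 0) {
      *A = magnitude;
      *B = magnitude - angle;
    } else {
      *B = magnitude;
      *A = magnitude + angle;
    }
  } else {
    if (angle > 0) {
      *A = magnitude;
      *B = magnitude + angle;
    } else {
      *B = magnitude;
      *A = magnitude - angle;
    }
  }
}

/* Encode: an inverse derived from the decode above (an assumption of
   this sketch). The value with the larger absolute value becomes the
   magnitude; ties go to B. */
static void sq_polar_encode(int A, int B, int *magnitude, int *angle) {
  if (abs(A) > abs(B)) {
    *magnitude = A;
    *angle = (A > 0) ? A - B : B - A; /* always > 0 in this branch */
  } else {
    *magnitude = B;
    *angle = (B > 0) ? A - B : B - A; /* always <= 0 in this branch */
  }
}

/* Count the Cartesian pairs in [-range, range] x [-range, range] that
   do not survive encode followed by decode. */
static int sq_polar_roundtrip_failures(int range) {
  int A, B, failures = 0;
  assert(range >= 0);
  for (A = -range; A <= range; A++) {
    for (B = -range; B <= range; B++) {
      int m, a, A2, B2;
      sq_polar_encode(A, B, &m, &a);
      sq_polar_decode(m, a, &A2, &B2);
      if (A2 != A || B2 != B) failures++;
    }
  }
  return failures;
}
```

With a range of 5, every one of the 121 Cartesian pairs round-trips, and encoding A=3, B=5 gives magnitude 5 and angle -2.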
We can remap an A/B vector using polar mapping into a magnitude/angle vector, and it's clear that, in general, this concentrates energy in the magnitude vector and reduces the amount of information to encode in the angle vector. Encoding these vectors independently with residue backend #0 or residue backend #1 will result in bitrate savings. However, there are still implicit correlations between the magnitude and angle vectors. The most obvious is that the amplitude of the angle is bounded by its corresponding magnitude value.
Entropy coding the results, then, further benefits from the entropy model being able to compress magnitude and angle simultaneously. For this reason, Vorbis implements residue backend #2 which pre-interleaves a number of input vectors (in the stereo case, two, A and B) into a single output vector (with the elements in the order of A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus each vector to be coded by the vector quantization backend consists of matching magnitude and angle values.
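The interleaving order described above can be sketched for the two-vector stereo case as follows. The function names are hypothetical, and this shows only the element ordering, not the residue backend's partitioning or codebook handling.

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two n-element vectors into out (2*n elements) in the
   order a[0], b[0], a[1], b[1], ..., a[n-1], b[n-1]. */
static void interleave2(const int *a, const int *b, int *out, size_t n) {
  size_t i;
  assert(a != NULL && b != NULL && out != NULL);
  for (i = 0; i < n; i++) {
    out[2 * i] = a[i];
    out[2 * i + 1] = b[i];
  }
}

/* Undo interleave2: split 2*n interleaved elements back into a and b. */
static void deinterleave2(const int *in, int *a, int *b, size_t n) {
  size_t i;
  assert(in != NULL && a != NULL && b != NULL);
  for (i = 0; i < n; i++) {
    a[i] = in[2 * i];
    b[i] = in[2 * i + 1];
  }
}
```

When a and b hold the magnitude and angle vectors, each adjacent pair in the output is one magnitude with its matching angle, which is what lets a single codebook entry capture their correlation.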
The astute reader, at this point, will notice that in the theoretical case in which we can use monolithic codebooks of arbitrarily large size, we can directly interleave and encode left and right without polar mapping; the polar mapping does not appear to lend any benefit whatsoever to the efficiency of the entropy coding. In fact, it is perfectly possible and reasonable to build a Vorbis encoder that dispenses with polar mapping entirely and merely interleaves the channels. Libvorbis-based encoders may configure such an encoding and it will work as intended.
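However, when we leave the ideal/theoretical domain, we notice that polar mapping does give additional practical benefits, as discussed in the above section on polar mapping: it concentrates represented energy into fewer large values, which reduces bit 'leakage' between the stages of a cascaded codebook, and it separates the stereo image into point and diffuse components that can be modeled and manipulated separately.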
Dual stereo refers to stereo encoding where the channels are entirely separate; they are analyzed and encoded as entirely distinct entities. This terminology is familiar from mp3.
Using polar mapping and/or channel interleaving, it's possible to couple Vorbis channels losslessly, that is, construct a stereo coupling encoding that both saves space and decodes bit-identically to dual stereo. OggEnc 1.0 and later uses this mode in all high-bitrate encoding.
Overall, this stereo mode is overkill; however, it offers a safe alternative to users concerned about the slightest possible degradation to the stereo image or archival quality audio.
Phase stereo is the least aggressive means of gracefully dropping resolution from the stereo image; it affects only diffuse imaging.
It's often quoted that the human ear is deaf to signal phase above about 4kHz; this is nearly true and a passable rule of thumb, but it can be demonstrated that even an average user can tell the difference between high frequency in-phase and out-of-phase noise. Obviously then, the statement is not entirely true. However, it's also the case that one must resort to nearly such an extreme demonstration before finding the counterexample.
'Phase stereo' is simply a more aggressive quantization of the polar angle vector; above 4kHz it's generally quite safe to quantize noise and noisy elements to only a handful of allowed phases, or to thin the phase with respect to the magnitude. The phases of high amplitude pure tones may or may not be preserved more carefully (they are relatively rare and L/R tend to be in phase, so there is generally little reason not to spend a few more bits on them).
Vorbis may implement phase stereo coupling by preserving the entirety of the magnitude vector (essential to fine amplitude and energy resolution overall) and quantizing the angle vector to one of only four possible values. Given that the magnitude vector may be positive or negative, this results in left and right phase having eight possible permutations, thus 'eight phase stereo':
Left and right may be in phase (positive or negative), the most common case by far, or out of phase by 90 or 180 degrees.
Similarly, four phase stereo takes the quantization one step further; it allows only in-phase and 180 degree out-of-phase signals.
Point stereo eliminates the possibility of out-of-phase signal entirely. Any diffuse quality to a sound source tends to collapse inward to a point somewhere within the stereo image. A practical example would be balanced reverberations within a large, live space; normally the sound is diffuse and soft, giving a sonic impression of volume. In point stereo, the reverberations would still exist, but sound fairly firmly centered within the image (assuming the reverberation was centered overall; if the reverberation is stronger to the left, then the point of localization in point stereo would be to the left). This effect is most noticeable at low and mid frequencies and using headphones (which grant perfect stereo separation). Point stereo is a graceful but generally easy to detect degradation to the sound quality and is thus used in frequency ranges where it is least noticeable.
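The three lossy models above can be viewed as progressively coarser quantizations of the angle value. The following is a minimal sketch under loudly stated assumptions, not the libvorbis implementation: it operates on the integer square polar values from the decode function given earlier, it assumes the allowed angles are 0 (in phase), +|magnitude| and -|magnitude| (90 degrees, one channel zero) and -2|magnitude| (180 degrees), and it simply picks the nearest allowed angle while leaving the magnitude untouched. The names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { EIGHT_PHASE, FOUR_PHASE, POINT_STEREO } phase_model;

/* Hypothetical sketch: snap a square polar angle to the nearest angle
   permitted by the chosen model. The magnitude is never modified. */
static int quantize_angle(int magnitude, int angle, phase_model model) {
  int m = abs(magnitude);
  int allowed[4];
  int count = 0, best, i;

  /* Valid square polar angles lie in [-2m, 2m-1], or are 0 when m is 0. */
  assert(m == 0 ? angle == 0 : (angle >= -2 * m && angle < 2 * m));

  allowed[count++] = 0;          /* in phase: A == B */
  if (model != POINT_STEREO)
    allowed[count++] = -2 * m;   /* 180 degrees out of phase: A == -B */
  if (model == EIGHT_PHASE) {
    allowed[count++] = m;        /* 90 degrees: B == 0 */
    allowed[count++] = -m;       /* 90 degrees: A == 0 */
  }

  best = allowed[0];
  for (i = 1; i < count; i++)
    if (abs(angle - allowed[i]) < abs(angle - best))
      best = allowed[i];
  return best;
}
```

Under these assumptions, eight phase stereo keeps four angles per magnitude sign, four phase stereo keeps two, and point stereo always yields an angle of 0, so both channels decode to the magnitude.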
Mixed stereo is the simultaneous use of more than one of the above stereo encoding models, generally using more aggressive modes in higher frequencies, lower amplitudes or 'nearly' in-phase sound.
It is also the case that near-DC frequencies should be encoded using lossless coupling to avoid frame blocking artifacts.
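One way to picture a mixed model is as a per-frequency choice among the models above. The sketch below is purely illustrative: the type and function names are hypothetical, the choice depends on frequency only (ignoring amplitude and phase proximity), and the crossover frequencies are supplied by the caller because the text does not specify any.

```c
#include <assert.h>

typedef enum {
  COUPLING_LOSSLESS,
  COUPLING_PHASE,
  COUPLING_POINT
} coupling_model;

/* Hypothetical policy: caller-chosen crossover frequencies in Hz. */
typedef struct {
  double lossless_below_hz; /* near-DC region kept lossless */
  double point_above_hz;    /* region where point stereo is used */
} mixed_stereo_policy;

/* Pick a coupling model for one frequency: lossless near DC, the most
   aggressive model (point stereo) at the top, phase stereo between. */
static coupling_model choose_coupling(const mixed_stereo_policy *p,
                                      double freq_hz) {
  assert(p->lossless_below_hz <= p->point_above_hz);
  if (freq_hz < p->lossless_below_hz) return COUPLING_LOSSLESS;
  if (freq_hz >= p->point_above_hz) return COUPLING_POINT;
  return COUPLING_PHASE;
}
```

A real encoder would also weigh amplitude and how nearly in phase the channels are, as described above; this sketch leaves those factors out for brevity.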
Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes constructed out of lossless and point stereo. Phase stereo was used in the rc2 encoder, but is not currently used for simplicity's sake. It will likely be re-added to the stereo model in the future.