Ogg Vorbis Documentation

Ogg Vorbis stereo-specific channel coupling discussion

Abstract

The Vorbis audio codec provides channel coupling mechanisms designed to reduce effective bitrate by both eliminating interchannel redundancy and eliminating stereo image information labeled inaudible or undesirable according to spatial psychoacoustic models. This document describes both the mechanical coupling mechanisms available within the Vorbis specification and the specific stereo coupling models used by the reference libvorbis codec provided by xiph.org.

Mechanisms

In encoder release beta 4 and earlier, Vorbis supported multiple channel encoding, but the channels were encoded entirely separately with no cross-analysis or redundancy elimination between channels. This multichannel strategy is very similar to mp3's dual stereo mode, and Vorbis uses the same name for its analogous uncoupled multichannel modes.

However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and later implement, a coupled channel strategy. Vorbis has two specific mechanisms that may be used alone or in conjunction to implement channel coupling. The first is channel interleaving via residue backend type 2, and the second is square polar mapping. These two general mechanisms are particularly well suited to coupling due to the structure of Vorbis encoding, as we'll explore below. Using both, we can implement totally lossless stereo image coupling [bit-for-bit decode-identical to uncoupled modes] as well as various lossy models that seek to eliminate inaudible or unimportant aspects of the stereo image in order to reduce bitrate. The exact coupling implementation is generalized to allow the encoder a great deal of flexibility in implementation of a stereo or surround model without requiring any significant complexity increase over the combinatorially simpler mid/side joint stereo of mp3 and other current audio codecs.

A particular Vorbis bitstream may apply channel coupling directly to more than a pair of channels; polar mapping is hierarchical such that polar coupling may be extrapolated to an arbitrary number of channels and is not restricted to only stereo, quadraphonics, ambisonics or 5.1 surround. However, the scope of this document restricts itself to the stereo coupling case.

Square Polar Mapping

maximal correlation

Recall that the basic structure of a Vorbis I stream first generates from input audio a spectral 'floor' function that serves as an MDCT-domain whitening filter. This floor is meant to represent the rough envelope of the frequency spectrum, using whatever metric the encoder cares to define. This floor is subtracted from the log frequency spectrum, effectively normalizing the spectrum by frequency. Each input channel is associated with a unique floor function.
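As a rough illustration of the whitening step just described, the following C sketch subtracts a per-channel floor curve from a log-magnitude spectrum and leaves the residue. The function name remove_floor, the choice of dB units, and the one-value-per-bin array layout are assumptions made for this example; none of them are taken from libvorbis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a libvorbis function: subtract a floor curve
   from a log-magnitude spectrum (both in dB, one value per MDCT bin).
   Subtracting in the log domain equals dividing in the linear domain,
   so the result is the spectrum normalized by its rough envelope. */
static void remove_floor(const float *spectrum_db, const float *floor_db,
                         float *residue_db, size_t n)
{
    assert(spectrum_db && floor_db && residue_db);
    for (size_t i = 0; i < n; i++)
        residue_db[i] = spectrum_db[i] - floor_db[i];
}
```

Running this once per channel, each with its own floor, gives the per-channel residues that the coupling steps operate on.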

The basic idea behind any stereo coupling is that the left and right channels usually correlate. This correlation is even stronger if one first accounts for energy differences in any given frequency band across left and right; think for example of individual instruments mixed into different portions of the stereo image, or a stereo recording with a dominant feature not perfectly in the center. The floor functions, each specific to a channel, provide the perfect means of normalizing left and right energies across the spectrum to maximize correlation before coupling. This feature of the Vorbis format is not a convenient accident.

Because we strive to maximally correlate the left and right channels and generally succeed in doing so, left and right residue is typically nearly identical. We could use channel interleaving (discussed below) alone to efficiently remove the redundancy between the left and right channels as a side effect of entropy encoding, but a polar representation gives benefits when left/right correlation is strong.

point and diffuse imaging

The first advantage of a polar representation is that it effectively separates the spatial audio information into a 'point image' (magnitude) at a given frequency and located somewhere in the sound field, and a 'diffuse image' (angle) that fills a large amount of space simultaneously. Even if we preserve only the magnitude (point) data, a detailed and carefully chosen floor function in each channel provides us with a free, fine-grained, frequency relative intensity stereo*. Angle information represents diffuse sound fields, such as reverberation that fills the entire space simultaneously.

*Because the Vorbis model supports a number of different possible stereo models and these models may be mixed, we do not use the term 'intensity stereo' when talking about Vorbis; instead we use the terms 'point stereo', 'phase stereo' and subcategories of each.

The majority of a stereo image is representable by polar magnitude alone, as strong sounds tend to be produced at near-point sources; even non-diffuse, fast, sharp echoes track very accurately using magnitude representation almost alone (for those experimenting with Vorbis tuning, this strategy works much better with the precise, piecewise control of floor 1; the continuous approximation of floor 0 results in unstable imaging). Reverberation and diffuse sounds tend to contain less energy and be psychoacoustically dominated by the point sources embedded in them. Thus, we again tend to concentrate more represented energy into a predictably smaller number of values. Separating representation of point and diffuse imaging also allows us to model and manipulate point and diffuse qualities separately.

controlling bit leakage and symbol crosstalk

Because polar representation concentrates represented energy into fewer large values, we reduce bit 'leakage' during cascading (multistage VQ encoding) as a secondary benefit. A single large, monolithic VQ codebook is more efficient than a cascaded book due to entropy 'crosstalk' among symbols between different stages of a multistage cascade. Polar representation is a way of further concentrating entropy into predictable locations so that codebook design can take steps to improve multistage codebook efficiency. It also allows us to cascade various elements of the stereo image independently.

eliminating trigonometry and rounding

Rounding and computational complexity are potential problems with a polar representation. As our encoding process involves quantization, mixing a polar representation and quantization makes it potentially impossible, depending on implementation, to construct a coupled stereo mechanism that results in bit-identical decompressed output compared to an uncoupled encoding should the encoder desire it.

Vorbis uses a mapping that preserves the most useful qualities of polar representation, relies only on addition/subtraction (during decode; high quality encoding still requires some trig), and makes it trivial before or after quantization to represent an angle/magnitude through a one-to-one mapping from possible left/right value permutations. We do this by basing our polar representation on the unit square rather than the unit circle.

Given a magnitude and angle, we recover left and right using the following function (note that A/B may be left/right or right/left depending on the coupling definition used by the encoder):

      /* Recover A and B from a square polar magnitude/angle pair. */
      if(magnitude>0){
        if(angle>0){
          A=magnitude;
          B=magnitude-angle;
        }else{
          B=magnitude;
          A=magnitude+angle;
        }
      }else{
        /* Zero or negative magnitude: the sign of the angle term flips. */
        if(angle>0){
          A=magnitude;
          B=magnitude+angle;
        }else{
          B=magnitude;
          A=magnitude-angle;
        }
      }

The function is antisymmetric for positive and negative magnitudes in order to eliminate a redundant value when quantizing. For example, if we're quantizing to integer values, we can visualize a magnitude of 5 and an angle of -2 as follows:

[Figure: square polar]

This representation loses or replicates no values; if A and B each range over the integers -5 through 5, the number of possible Cartesian permutations is 121. Represented in square polar notation, the possible values are:

 0, 0

-1,-2  -1,-1  -1, 0  -1, 1

 1,-2   1,-1   1, 0   1, 1

-2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3

 2,-4   2,-3   ... following the pattern ...

 ...   5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9

...for a grand total of 121 possible values, the same number as in Cartesian representation. Note that, for example, 5,-10 is the same as -5,10, so there's no reason to represent both. 2,10 cannot happen, because an angle never exceeds twice its magnitude in absolute value, and there's no reason to account for it. It's also obvious that this mapping is exactly reversible.
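The reversibility claim can be checked mechanically. The sketch below pairs the decode function from this section with an encoder-side inverse and counts how many integer pairs survive a round trip. The encode function is an assumption derived by inverting the decode rule, not code taken from libvorbis, and all three function names are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Decode: the function given earlier in this section, for integer values. */
static void square_polar_decode(int mag, int ang, int *a, int *b)
{
    assert(a && b);
    if (mag > 0) {
        if (ang > 0) { *a = mag; *b = mag - ang; }
        else         { *b = mag; *a = mag + ang; }
    } else {
        if (ang > 0) { *a = mag; *b = mag + ang; }
        else         { *b = mag; *a = mag - ang; }
    }
}

/* Encode (an assumed inverse, derived from the decode above): the value
   with the larger absolute value becomes the magnitude. Ties go to B,
   which places A == -B at angle -2*|magnitude|, as in the table above. */
static void square_polar_encode(int a, int b, int *mag, int *ang)
{
    assert(mag && ang);
    if (abs(a) > abs(b)) {
        *mag = a;
        *ang = (a > 0) ? a - b : b - a;
    } else {
        *mag = b;
        *ang = (b > 0) ? a - b : b - a;
    }
}

/* Count the pairs in [-range, range]^2 that survive encode then decode. */
static int count_round_trips(int range)
{
    int ok = 0;
    for (int a = -range; a <= range; a++) {
        for (int b = -range; b <= range; b++) {
            int mag, ang, a2, b2;
            square_polar_encode(a, b, &mag, &ang);
            square_polar_decode(mag, ang, &a2, &b2);
            if (a2 == a && b2 == b)
                ok++;
        }
    }
    return ok;
}
```

With range set to 5, count_round_trips visits the 121 Cartesian pairs discussed above, and encoding A=3, B=5 yields the magnitude 5, angle -2 pair used in the earlier example.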

Channel interleaving

We can remap an A/B vector using polar mapping into a magnitude/angle vector, and it's clear that, in general, this concentrates energy in the magnitude vector and reduces the amount of information to encode in the angle vector. Encoding these vectors independently with residue backend #0 or residue backend #1 will result in bitrate savings. However, there are still implicit correlations between the magnitude and angle vectors. The most obvious is that the amplitude of the angle is bounded by its corresponding magnitude value.

Entropy coding the results, then, further benefits from the entropy model being able to compress magnitude and angle simultaneously. For this reason, Vorbis implements residue backend #2 which pre-interleaves a number of input vectors (in the stereo case, two, A and B) into a single output vector (with the elements in the order of A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus each vector to be coded by the vector quantization backend consists of matching magnitude and angle values.
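The element ordering described above can be sketched in C as follows. This shows only the interleave step and its inverse for the two-vector stereo case; the partitioning and codebook stages of a real residue backend are left out, and the function names interleave2 and deinterleave2 are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two n-element vectors as A_0, B_0, A_1, B_1, ...
   `out` must have room for 2*n elements. */
static void interleave2(const int *a, const int *b, size_t n, int *out)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Inverse step: split an interleaved vector of 2*n elements back into
   its two n-element source vectors. */
static void deinterleave2(const int *in, size_t n, int *a, int *b)
{
    assert(in && a && b);
    for (size_t i = 0; i < n; i++) {
        a[i] = in[2 * i];
        b[i] = in[2 * i + 1];
    }
}
```

When A and B hold magnitude and angle respectively, each adjacent pair in the output is one matching magnitude/angle value, which is what lets a single codebook entry cover both.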

The astute reader, at this point, will notice that in the theoretical case in which we can use monolithic codebooks of arbitrarily large size, we can directly interleave and encode left and right without polar mapping; in that case, the polar mapping does not appear to lend any benefit whatsoever to the efficiency of the entropy coding. In fact, it is perfectly possible and reasonable to build a Vorbis encoder that dispenses with polar mapping entirely and merely interleaves the channels. Libvorbis-based encoders may configure such an encoding and it will work as intended.

However, when we leave the ideal/theoretical domain, we notice that polar mapping does give additional practical benefits, as discussed in the above section on polar mapping and summarized again here:

Polar mapping aids in controlling entropy 'leakage' between stages of a cascaded codebook.
Polar mapping separates the stereo image into point and diffuse components which may be analyzed and handled differently.

Stereo Models

Dual Stereo

Dual stereo refers to stereo encoding where the channels are entirely separate; they are analyzed and encoded as entirely distinct entities. This terminology is familiar from mp3.

Lossless Stereo

Using polar mapping and/or channel interleaving, it's possible to couple Vorbis channels losslessly, that is, construct a stereo coupling encoding that both saves space and decodes bit-identically to dual stereo. OggEnc 1.0 and later uses this mode in all high-bitrate encoding.

Overall, this stereo mode is overkill; however, it offers a safe alternative to users concerned about the slightest possible degradation to the stereo image or archival quality audio.

Phase Stereo

Phase stereo is the least aggressive means of gracefully dropping resolution from the stereo image; it affects only diffuse imaging.

It's often quoted that the human ear is deaf to signal phase above about 4kHz; this is nearly true and a passable rule of thumb, but it can be demonstrated that even an average user can tell the difference between high frequency in-phase and out-of-phase noise. Obviously then, the statement is not entirely true. However, it's also the case that one must resort to nearly such an extreme demonstration before finding the counterexample.

'Phase stereo' is simply a more aggressive quantization of the polar angle vector; above 4kHz it's generally quite safe to quantize noise and noisy elements to only a handful of allowed phases, or to thin the phase with respect to the magnitude. The phases of high amplitude pure tones may or may not be preserved more carefully (they are relatively rare and L/R tend to be in phase, so there is generally little reason not to spend a few more bits on them).

example: eight phase stereo

Vorbis may implement phase stereo coupling by preserving the entirety of the magnitude vector (essential to fine amplitude and energy resolution overall) and quantizing the angle vector to one of only four possible values. Given that the magnitude vector may be positive or negative, this results in left and right phase having eight possible permutations, thus 'eight phase stereo':

[Figure: eight phase]

Left and right may be in phase (positive or negative), the most common case by far, or out of phase by 90 or 180 degrees.

example: four phase stereo

Similarly, four phase stereo takes the quantization one step further; it allows only in-phase and 180 degree out-of-phase signals:

[Figure: four phase]

example: point stereo

Point stereo eliminates the possibility of out-of-phase signal entirely. Any diffuse quality to a sound source tends to collapse inward to a point somewhere within the stereo image. A practical example would be balanced reverberations within a large, live space; normally the sound is diffuse and soft, giving a sonic impression of volume. In point stereo, the reverberations would still exist, but sound fairly firmly centered within the image (assuming the reverberation was centered overall; if the reverberation is stronger to the left, then the point of localization in point stereo would be to the left). This effect is most noticeable at low and mid frequencies and using headphones (which grant perfect stereo separation). Point stereo is a graceful but generally easy to detect degradation to the sound quality and is thus used in frequency ranges where it is least noticeable.
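The three lossy examples above differ only in how many angle values they keep, which the following C sketch tries to make concrete for integer square polar pairs. It is a hypothetical nearest-value quantizer, not the one libvorbis uses. It assumes that the 90 degree cases correspond to angles of plus or minus the absolute magnitude (one channel silent) and the 180 degree case to twice the absolute magnitude. The names stereo_mode, round_to_multiple and thin_angle are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { POINT_STEREO, FOUR_PHASE, EIGHT_PHASE } stereo_mode;

/* Round x to the nearest multiple of step (step > 0), ties away from zero. */
static int round_to_multiple(int x, int step)
{
    assert(step > 0);
    if (x >= 0)
        return ((x + step / 2) / step) * step;
    return -(((-x + step / 2) / step) * step);
}

/* Hypothetical quantizer: snap a square polar angle to the values allowed
   by `mode`. The magnitude keeps its absolute value; its sign flips only
   when the result lands on +2*|mag|, which is the same point as
   (-mag, -2*|mag|). Input angles are assumed to lie in [-2*|mag|, 2*|mag|]. */
static void thin_angle(int *mag, int *ang, stereo_mode mode)
{
    assert(mag && ang);
    int m = abs(*mag);
    if (mode == POINT_STEREO || m == 0) {
        *ang = 0;                 /* in phase only */
        return;
    }
    /* Four phase keeps 0 and the 180 degree value; eight phase also keeps
       the halfway values +-|mag| (assumed here to be the 90 degree cases). */
    int step = (mode == FOUR_PHASE) ? 2 * m : m;
    int q = round_to_multiple(*ang, step);
    if (q == 2 * m) {
        *mag = -*mag;
        q = -2 * m;
    }
    *ang = q;
}
```

For instance, the pair 5,-2 collapses to 5,0 in all three modes, while 5,-6 becomes 5,-5 in eight phase, 5,-10 in four phase and 5,0 in point stereo.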

Mixed Stereo

Mixed stereo is the simultaneous use of more than one of the above stereo encoding models, generally using more aggressive modes in higher frequencies, lower amplitudes or 'nearly' in-phase sound.

It is also the case that near-DC frequencies should be encoded using lossless coupling to avoid frame blocking artifacts.

Vorbis Stereo Modes

Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes constructed out of lossless and point stereo. Phase stereo was used in the rc2 encoder, but is not currently used for simplicity's sake. It will likely be re-added to the stereo model in the future.
