Ogg Vorbis Documentation

Ogg Vorbis stereo-specific channel coupling discussion


Abstract

The Vorbis audio CODEC provides channel coupling mechanisms designed to reduce effective bitrate by both eliminating interchannel redundancy and eliminating stereo image information labeled inaudible or undesirable according to spatial psychoacoustic models. This document describes both the mechanical coupling mechanisms available within the Vorbis specification and the specific stereo coupling models used by the reference libvorbis codec provided by xiph.org.


Mechanisms

In encoder release beta 4 and earlier, Vorbis supported multiple channel encoding, but the channels were encoded entirely separately with no cross-analysis or redundancy elimination between channels. This multichannel strategy is very similar to mp3's dual stereo mode, and Vorbis uses the same name for its analogous uncoupled multichannel modes.

However, the Vorbis spec provides for, and Vorbis release 1.0 rc1 and later implement, a coupled channel strategy. Vorbis has two specific mechanisms that may be used alone or in conjunction to implement channel coupling. The first is channel interleaving via residue backend type 2, and the second is square polar mapping. These two general mechanisms are particularly well suited to coupling due to the structure of Vorbis encoding, as we'll explore below. Using both, we can implement totally lossless stereo image coupling [bit-for-bit decode-identical to uncoupled modes], as well as various lossy models that seek to eliminate inaudible or unimportant aspects of the stereo image in order to reduce bitrate. The exact coupling implementation is generalized to allow the encoder a great deal of flexibility in implementation of a stereo or surround model without requiring any significant complexity increase over the combinatorially simpler mid/side joint stereo of mp3 and other current audio codecs.

A particular Vorbis bitstream may apply channel coupling directly to more than a pair of channels; polar mapping is hierarchical such that polar coupling may be extrapolated to an arbitrary number of channels and is not restricted to only stereo, quadraphonics, ambisonics or 5.1 surround. However, the scope of this document restricts itself to the stereo coupling case.


Square Polar Mapping


maximal correlation

Recall that Vorbis I encoding first generates from the input audio a spectral 'floor' function that serves as an MDCT-domain whitening filter. This floor is meant to represent the rough envelope of the frequency spectrum, using whatever metric the encoder cares to define. This floor is subtracted from the log frequency spectrum, effectively normalizing the spectrum by frequency. Each input channel is associated with a unique floor function.
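
The floor subtraction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name whiten_log_spectrum, the use of double, and the choice of dB as the log unit are all hypothetical and are not taken from libvorbis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: remove a per-channel floor from a log-magnitude
   spectrum. All three arrays hold n values in the same log unit
   (for example dB); the unit is an assumption made for this sketch. */
void whiten_log_spectrum(const double *log_spectrum, const double *log_floor,
                         double *residue, size_t n)
{
    assert(log_spectrum != NULL && log_floor != NULL && residue != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Subtraction in the log domain is division in the linear
           domain, so the residue is the spectrum normalized by the floor. */
        residue[i] = log_spectrum[i] - log_floor[i];
    }
}
```

Called once per channel with that channel's own floor, a routine like this leaves left and right residues on a common scale, which is what the coupling stages described later rely on.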

The basic idea behind any stereo coupling is that the left and right channels usually correlate. This correlation is even stronger if one first accounts for energy differences in any given frequency band across left and right; think for example of individual instruments mixed into different portions of the stereo image, or a stereo recording with a dominant feature not perfectly in the center. The floor functions, each specific to a channel, provide the perfect means of normalizing left and right energies across the spectrum to maximize correlation before coupling. This feature of the Vorbis format is not a convenient accident.

Because we strive to maximally correlate the left and right channels and generally succeed in doing so, left and right residue is typically nearly identical. We could use channel interleaving (discussed below) alone to efficiently remove the redundancy between the left and right channels as a side effect of entropy encoding, but a polar representation gives benefits when left/right correlation is strong.


point and diffuse imaging

The first advantage of a polar representation is that it effectively separates the spatial audio information into a 'point image' (magnitude) at a given frequency and located somewhere in the sound field, and a 'diffuse image' (angle) that fills a large amount of space simultaneously. Even if we preserve only the magnitude (point) data, a detailed and carefully chosen floor function in each channel provides us with a free, fine-grained, frequency relative intensity stereo*. Angle information represents diffuse sound fields, such as reverberation that fills the entire space simultaneously.

*Because the Vorbis model supports a number of different possible stereo models and these models may be mixed, we do not use the term 'intensity stereo' when talking about Vorbis; instead we use the terms 'point stereo', 'phase stereo' and subcategories of each.

The majority of a stereo image is representable by polar magnitude alone, as strong sounds tend to be produced at near-point sources; even non-diffuse, fast, sharp echoes track very accurately using magnitude representation almost alone (for those experimenting with Vorbis tuning, this strategy works much better with the precise, piecewise control of floor 1; the continuous approximation of floor 0 results in unstable imaging). Reverberation and diffuse sounds tend to contain less energy and be psychoacoustically dominated by the point sources embedded in them. Thus, we again tend to concentrate more represented energy into a predictably smaller number of values. Separating representation of point and diffuse imaging also allows us to model and manipulate point and diffuse qualities separately.


controlling bit leakage and symbol crosstalk

Because polar representation concentrates represented energy into fewer large values, we reduce bit 'leakage' during cascading (multistage VQ encoding) as a secondary benefit. A single large, monolithic VQ codebook is more efficient than a cascaded book due to entropy 'crosstalk' among symbols between different stages of a multistage cascade. Polar representation is a way of further concentrating entropy into predictable locations so that codebook design can take steps to improve multistage codebook efficiency. It also allows us to cascade various elements of the stereo image independently.


eliminating trigonometry and rounding

Rounding and computational complexity are potential problems with a polar representation. As our encoding process involves quantization, mixing a polar representation and quantization makes it potentially impossible, depending on implementation, to construct a coupled stereo mechanism that results in bit-identical decompressed output compared to an uncoupled encoding should the encoder desire it.

Vorbis uses a mapping that preserves the most useful qualities of polar representation, relies only on addition/subtraction (during decode; high quality encoding still requires some trig), and makes it trivial before or after quantization to represent an angle/magnitude through a one-to-one mapping from possible left/right value permutations. We do this by basing our polar representation on the unit square rather than the unit circle.

Given a magnitude and angle, we recover left and right using the following function (note that A/B may be left/right or right/left depending on the coupling definition used by the encoder):

      if(magnitude>0){
        if(angle>0){
          /* A carries the magnitude; B lies below it */
          A=magnitude;
          B=magnitude-angle;
        }else{
          /* B carries the magnitude; A lies at or below it */
          B=magnitude;
          A=magnitude+angle;
        }
      }else{
        if(angle>0){
          /* mirror of the first case for a non-positive magnitude */
          A=magnitude;
          B=magnitude+angle;
        }else{
          /* mirror of the second case for a non-positive magnitude */
          B=magnitude;
          A=magnitude-angle;
        }
      }

The function is antisymmetric for positive and negative magnitudes in order to eliminate a redundant value when quantizing. For example, if we're quantizing to integer values, we can visualize a magnitude of 5 and an angle of -2 as follows:

[Figure: square polar]

This representation loses or replicates no values; if A and B each range over the integers -5 through 5, the number of possible Cartesian permutations is 121. Represented in square polar notation, the possible values are:

     0, 0

    -1,-2  -1,-1  -1, 0  -1, 1

     1,-2   1,-1   1, 0   1, 1

    -2,-4  -2,-3  -2,-2  -2,-1  -2, 0  -2, 1  -2, 2  -2, 3

     2,-4   2,-3   ... following the pattern ...

     ...   5, 1   5, 2   5, 3   5, 4   5, 5   5, 6   5, 7   5, 8   5, 9

...for a grand total of 121 possible values, the same number as in Cartesian representation (note that, for example, 5,-10 is the same as -5,10, so there's no reason to represent both. 2,10 cannot happen, and there's no reason to account for it.) It's also obvious that this mapping is exactly reversible.
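
The reversibility claim can be checked mechanically. In the sketch below, square_polar_decode restates the function given earlier in this section, while square_polar_encode is an inverse worked out here by reading the four decode branches backwards; it is an assumption of this example and is not presented as the libvorbis encoder. All function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Decode: the square polar function from this section, wrapped so that
   it returns A and B through pointers. */
void square_polar_decode(int magnitude, int angle, int *A, int *B)
{
    assert(A != NULL && B != NULL);
    if (magnitude > 0) {
        if (angle > 0) { *A = magnitude; *B = magnitude - angle; }
        else           { *B = magnitude; *A = magnitude + angle; }
    } else {
        if (angle > 0) { *A = magnitude; *B = magnitude + angle; }
        else           { *B = magnitude; *A = magnitude - angle; }
    }
}

/* Encode: an inverse derived for this sketch (assumption). The value
   with the larger absolute value becomes the magnitude; ties go to B,
   which matches the decode branches taken when angle <= 0. */
void square_polar_encode(int A, int B, int *magnitude, int *angle)
{
    assert(magnitude != NULL && angle != NULL);
    *magnitude = (abs(A) > abs(B)) ? A : B;
    *angle = (*magnitude > 0) ? A - B : B - A;
}

/* Count the (A,B) pairs with both values in -range..range that come
   back unchanged after encode followed by decode. */
int count_round_trips(int range)
{
    int ok = 0;
    for (int A = -range; A <= range; A++) {
        for (int B = -range; B <= range; B++) {
            int m, a, A2, B2;
            square_polar_encode(A, B, &m, &a);
            square_polar_decode(m, a, &A2, &B2);
            if (A2 == A && B2 == B) ok++;
        }
    }
    return ok;
}
```

With range set to 5, every one of the 121 Cartesian pairs survives the round trip, so no two pairs share a magnitude/angle code. Decoding 5,-10 and -5,10 with the same function yields the same pair, A=-5 and B=5, which is the redundancy the text says need not be represented twice.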


Channel interleaving

We can remap an A/B vector using polar mapping into a magnitude/angle vector, and it's clear that, in general, this concentrates energy in the magnitude vector and reduces the amount of information to encode in the angle vector. Encoding these vectors independently with residue backend #0 or residue backend #1 will result in bitrate savings. However, there are still implicit correlations between the magnitude and angle vectors. The most obvious is that the amplitude of the angle is bounded by its corresponding magnitude value.

Entropy coding the results, then, further benefits from the entropy model being able to compress magnitude and angle simultaneously. For this reason, Vorbis implements residue backend #2 which pre-interleaves a number of input vectors (in the stereo case, two, A and B) into a single output vector (with the elements in the order of A_0, B_0, A_1, B_1, A_2 ... A_n-1, B_n-1) before entropy encoding. Thus each vector to be coded by the vector quantization backend consists of matching magnitude and angle values.
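
The element ordering described above can be sketched as a pair of small routines. This covers only the two-vector stereo case, and the names interleave2 and deinterleave2 are hypothetical; nothing here is taken from the libvorbis residue code.

```c
#include <assert.h>
#include <stddef.h>

/* Interleave two n-element vectors into out[0..2n-1] in the order
   a[0], b[0], a[1], b[1], ... (stereo case only, hypothetical helper). */
void interleave2(const int *a, const int *b, int *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Inverse operation: split an interleaved vector of 2n elements back
   into its two n-element halves, as a decoder would need to. */
void deinterleave2(const int *in, int *a, int *b, size_t n)
{
    assert(in != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] = in[2 * i];
        b[i] = in[2 * i + 1];
    }
}
```

When the two inputs are the magnitude and angle vectors, each adjacent pair in the output holds one magnitude with its matching angle, so a codebook that spans an even number of elements sees them together.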

The astute reader, at this point, will notice that in the theoretical case in which we can use monolithic codebooks of arbitrarily large size, we can directly interleave and encode left and right without polar mapping; in fact, the polar mapping does not appear to lend any benefit whatsoever to the efficiency of the entropy coding. In fact, it is perfectly possible and reasonable to build a Vorbis encoder that dispenses with polar mapping entirely and merely interleaves the channels. Libvorbis-based encoders may configure such an encoding and it will work as intended.

However, when we leave the ideal/theoretical domain, we notice that polar mapping does give additional practical benefits, as discussed in the above section on polar mapping and summarized again here: polar mapping helps control entropy 'leakage' between the stages of a cascaded codebook, and it separates the stereo image into point and diffuse components that may be modeled and handled separately.

Stereo Models


Dual Stereo

Dual stereo refers to stereo encoding where the channels are entirely separate; they are analyzed and encoded as entirely distinct entities. This terminology is familiar from mp3.


Lossless Stereo

Using polar mapping and/or channel interleaving, it's possible to couple Vorbis channels losslessly, that is, construct a stereo coupling encoding that both saves space and decodes bit-identically to dual stereo. OggEnc 1.0 and later uses this mode in all high-bitrate encoding.

Overall, this stereo mode is overkill; however, it offers a safe alternative to users concerned about the slightest possible degradation to the stereo image or to archival-quality audio.


Phase Stereo

Phase stereo is the least aggressive means of gracefully dropping resolution from the stereo image; it affects only diffuse imaging.

It's often quoted that the human ear is deaf to signal phase above about 4kHz; this is nearly true and a passable rule of thumb, but it can be demonstrated that even an average user can tell the difference between high frequency in-phase and out-of-phase noise. Obviously then, the statement is not entirely true. However, it's also the case that one must resort to nearly such an extreme demonstration before finding the counterexample.

'Phase stereo' is simply a more aggressive quantization of the polar angle vector; above 4kHz it's generally quite safe to quantize noise and noisy elements to only a handful of allowed phases, or to thin the phase with respect to the magnitude. The phases of high amplitude pure tones may or may not be preserved more carefully (they are relatively rare and L/R tend to be in phase, so there is generally little reason not to spend a few more bits on them).


example: eight phase stereo

Vorbis may implement phase stereo coupling by preserving the entirety of the magnitude vector (essential to fine amplitude and energy resolution overall) and quantizing the angle vector to one of only four possible values. Given that the magnitude vector may be positive or negative, this results in left and right phase having eight possible permutations, thus 'eight phase stereo':

[Figure: eight phase]

Left and right may be in phase (positive or negative), the most common case by far, or out of phase by 90 or 180 degrees.


example: four phase stereo

Similarly, four phase stereo takes the quantization one step further; it allows only in-phase and 180 degree out-of-phase signals:

[Figure: four phase]


example: point stereo

Point stereo eliminates the possibility of out-of-phase signal entirely. Any diffuse quality to a sound source tends to collapse inward to a point somewhere within the stereo image. A practical example would be balanced reverberations within a large, live space; normally the sound is diffuse and soft, giving a sonic impression of volume. In point stereo, the reverberations would still exist, but sound fairly firmly centered within the image (assuming the reverberation was centered overall; if the reverberation is stronger to the left, then the point of localization in point stereo would be to the left). This effect is most noticeable at low and mid frequencies and using headphones (which grant perfect stereo separation). Point stereo is a graceful but generally easy to detect degradation to the sound quality and is thus used in frequency ranges where it is least noticeable.
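
In square polar terms, the three example models differ only in which angle values survive quantization. The sketch below makes that concrete under stated assumptions: the allowed angle sets are derived here from the decode function (angle 0 gives A equal to B, an angle of minus twice the magnitude's absolute value gives A equal to -B, and an angle of plus or minus the magnitude's absolute value zeroes one channel), and snapping to the nearest allowed value is a rule chosen purely for illustration. The enum and function names are hypothetical, and none of this is presented as the libvorbis encoder's actual decision logic.

```c
#include <assert.h>
#include <stdlib.h>

enum stereo_model { EIGHT_PHASE, FOUR_PHASE, POINT };

/* Snap a square polar angle to the nearest value the model allows.
   Nearest-value rounding is an assumption made for this sketch. */
int quantize_angle(enum stereo_model model, int magnitude, int angle)
{
    assert(model == EIGHT_PHASE || model == FOUR_PHASE || model == POINT);

    int m = abs(magnitude);
    int allowed[4];
    int count = 0;

    allowed[count++] = 0;              /* in phase: A == B */
    if (model != POINT)
        allowed[count++] = -2 * m;     /* 180 degrees: A == -B */
    if (model == EIGHT_PHASE) {
        allowed[count++] = m;          /* 90 degrees: B == 0 */
        allowed[count++] = -m;         /* 90 degrees: A == 0 */
    }

    /* Pick the allowed value closest to the input angle; on a tie the
       earlier entry in the list wins. */
    int best = allowed[0];
    for (int i = 1; i < count; i++) {
        if (abs(angle - allowed[i]) < abs(angle - best))
            best = allowed[i];
    }
    return best;
}
```

The magnitude is passed through untouched in every model, which reflects the point made above that the magnitude vector carries the fine amplitude and energy resolution. In the POINT case the angle is always 0, so A and B decode to the same residue value and any left/right difference comes only from the per-channel floors.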


Mixed Stereo

Mixed stereo is the simultaneous use of more than one of the above stereo encoding models, generally using more aggressive modes in higher frequencies, lower amplitudes or 'nearly' in-phase sound.

It is also the case that near-DC frequencies should be encoded using lossless coupling to avoid frame blocking artifacts.


Vorbis Stereo Modes

Vorbis, as of 1.0, uses lossless stereo and a number of mixed modes constructed out of lossless and point stereo. Phase stereo was used in the rc2 encoder, but is not currently used for simplicity's sake. It will likely be re-added to the stereo model in the future.
