The problem we're trying to solve is:
We have two (or more) recordings of a particular score. We want to carry out some task such as audio alignment on them, but they differ in pitch (tuning frequency) sufficiently to confuse the feature extractor we want to use. We therefore want to first detect the difference in pitch between the two, so as to compensate for it in our subsequent processing.
This differs from the problem of deducing the tuning frequency of a recording in isolation, because the frequency difference may be quite extreme: more than a semitone. If you ran a tuning-frequency detector (of the type that works by calculating the difference between predominant frequencies and chroma bin centre frequencies) on both recordings, you would believe them closer in frequency than they actually are, because the whole semitones spanned by the difference would be perceived as a change of key rather than tuning frequency.
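To make the folding concrete, here is a minimal sketch of the arithmetic only (not of any particular detector). It assumes a detector that can only see the deviation from the nearest semitone bin centre; the function names are hypothetical.

```python
import math

def cents_between(f_hz, ref_hz=440.0):
    """Signed distance in cents from ref_hz to f_hz."""
    return 1200.0 * math.log2(f_hz / ref_hz)

def split_offset(f_hz, ref_hz=440.0):
    """Split the offset into whole semitones plus a residual in [-50, 50) cents.

    Assumption: a bin-centre-based tuning detector only observes the residual;
    the whole semitones look like a transposition (change of key).
    """
    cents = cents_between(f_hz, ref_hz)
    semitones = math.floor((cents + 50.0) / 100.0)
    residual = cents - 100.0 * semitones
    return semitones, residual

# A recording at roughly A=400Hz compared against A=440Hz.
semitones, residual = split_offset(400.0)
# The tuning frequency such a detector would appear to see.
apparent_a = 440.0 * 2.0 ** (residual / 1200.0)
print(semitones, round(residual, 1), round(apparent_a, 1))
```

Under that assumption, a recording at A=400Hz looks like one tuned slightly sharp of 440Hz and played two semitones lower, rather than one tuned about 165 cents flat.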
- Recording of Bach BWV846 (C major prelude from the Well-Tempered Clavier) by Richard Egarr, harpsichord, at roughly A=400Hz
- MIDI rendering at A=440 for comparison
(based on the experiments below)
Our feature is the normalised mean across the whole input duration of 60-bin-per-octave chroma with an adjustable tuning frequency. (Or a matrix of such features at half a dozen intervals through the file?)
Our metric is the Manhattan distance between two feature vectors.
- Calculate the chroma feature from the reference input at A=440
- Calculate it from the other input at A=440
- Rotate the second chroma feature up by successive 1-bin increments, calculating the distance at each rotation, until a local minimum is found. Repeat in the downward direction. This gives an approximate tuning frequency to 20 cent resolution.
- Starting with the approximate tuning frequency, recalculate the second chroma feature at each tuning frequency adjusting upward by 1-cent steps until a local minimum is found. Repeat in the downward direction. This gives (very slowly) a tuning frequency to 1 cent resolution.
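The coarse (rotation) stage of the steps above can be sketched as follows. This is a sketch under assumptions: the chroma means are plain lists of 60 non-negative values, normalisation is to unit sum, and all function names are hypothetical. The 1-cent refinement stage is not shown, since it needs the chroma extractor itself.

```python
import math

def normalise(v):
    """Scale a vector so its entries sum to 1 (left unchanged if all zero)."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else list(v)

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rotate(v, k):
    """Rotate v upward by k bins: the value in bin i moves to bin (i + k) % len(v)."""
    n = len(v)
    return [v[(i - k) % n] for i in range(n)]

def walk_to_local_minimum(score, start, step):
    """Step from start in one direction while the score keeps decreasing."""
    best, best_score = start, score(start)
    while True:
        candidate = best + step
        s = score(candidate)
        if s >= best_score:
            return best, best_score
        best, best_score = candidate, s

def coarse_shift_bins(ref_chroma, other_chroma):
    """Rotation (in bins) of other_chroma that best matches ref_chroma."""
    ref = normalise(ref_chroma)
    other = normalise(other_chroma)
    score = lambda k: manhattan(ref, rotate(other, k))
    up = walk_to_local_minimum(score, 0, +1)
    down = walk_to_local_minimum(score, 0, -1)
    return min(up, down, key=lambda result: result[1])[0]

# Synthetic check: a smooth single-peaked 60-bin profile and a copy 9 bins lower.
BINS = 60
ref = [math.exp(-(min(abs(i - 30), BINS - abs(i - 30)) / 10.0) ** 2)
       for i in range(BINS)]
other = rotate(ref, -9)
shift = coarse_shift_bins(ref, other)
shift_cents = shift * 1200 / BINS  # 20 cents per bin at 60 bins per octave
approx_tuning_hz = 440.0 * 2.0 ** (-shift_cents / 1200.0)
print(shift, shift_cents, round(approx_tuning_hz, 2))
```

Note that a local-minimum walk like this only finds the right answer if the distance falls steadily towards the true shift. Real chroma means have several peaks, so the walk could stop early at a nearer local minimum.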
Iterative chroma comparison
Using the script
This extracts chroma means (using CQ Chromagram) from the first 30 sec of the reference (at 440Hz), and then repeatedly extracts chroma means from the first 30 sec of the test recording with the chroma tuned to a range of offsets around 440Hz (from -400 to +400 cents in 10-cent steps). At each step it calculates the Euclidean distance between the chroma vector just extracted and that from the reference. The frequency yielding the lowest distance is reported.
This takes 2m11sec to run and reports the best tuning frequency as 548Hz.
The closest probe frequencies to the actual tuning are 398.85Hz (-170c) and 401.16Hz (-160c). Both score worse than 440Hz does.
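For reference, the probe frequencies follow from the usual cents-to-frequency conversion. A minimal sketch (the names are hypothetical, not taken from the script):

```python
def cents_to_hz(cents, ref_hz=440.0):
    """Frequency that lies `cents` cents above (or, if negative, below) ref_hz."""
    return ref_hz * 2.0 ** (cents / 1200.0)

# Probe grid as described: -400 to +400 cents around 440Hz in 10-cent steps.
probes = [(c, cents_to_hz(c)) for c in range(-400, 401, 10)]

for c, hz in probes:
    if c in (-170, -160):
        print(c, round(hz, 2))
```

This gives 81 probes, including the two nearest to A=400Hz quoted above.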
Switching to a 36-bin or 60-bin chromagram gets us an estimate of 529.3Hz. That seems very hard to believe -- I'm sure those should work! I think I need to take a closer look at this.
OK, I think it's a lack of normalisation...
Switching to the QM Vamp Plugins chromagram which has a normalisation option -- better than nothing, though we should really be normalising the means rather than taking the mean of the normalised chroma -- gets us an estimate of 396.56Hz with a fairly clear curve having a minimum somewhere between that probe value and the next one (398.85). There is another minimum around 529Hz. Searching a narrower range in 1 cent increments gets us a more precise estimate of 397.24Hz.
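The difference between the two orders of normalisation can be seen in a toy example. This is only a sketch: it assumes unit-sum normalisation (the plugin's option may normalise differently) and uses made-up three-bin "chroma" frames.

```python
def normalise(v):
    """Scale a vector so its entries sum to 1 (left unchanged if all zero)."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else list(v)

def column_mean(frames):
    """Mean of each bin across all frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def normalised_mean(frames):
    """Average first, then normalise: loud frames dominate the result."""
    return normalise(column_mean(frames))

def mean_of_normalised(frames):
    """Normalise each frame, then average: every frame counts equally."""
    return column_mean([normalise(f) for f in frames])

# One loud frame and one very quiet (noise-like) frame.
frames = [[8.0, 2.0, 0.0], [0.0, 0.1, 0.1]]
print(normalised_mean(frames))
print(mean_of_normalised(frames))
```

In the mean-of-normalised version the near-silent frame contributes as much as the loud one, which is why normalising the means seems the better choice here.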
Iterative MATCH path comparison
Using the script
As above, except that the score at each probe is the overall MATCH path cost between the two files (with the tuning frequency adjusted appropriately for the second one), and the probe with the lowest cost is reported.
This takes 4m12sec to run and reports the best tuning frequency as 363.6Hz.
Once again the two closest probe frequencies score worse than 440Hz does.
MATCH does actually seem to find a reasonable alignment if you feed it the test file pitch-shifted to A=440Hz along with the reference, so this doesn't seem to be a fault in the aligner. I think I am simply misinterpreting the underlying meaning of the overall path cost.
Switching to chroma features
The iterative MATCH path comparison using chroma features performs much better: it estimates 398.85 Hz, which is one of the two probe frequencies closest to the actual tuning. This may have potential, although it does require that MATCH alignment works at least somewhat on the pieces in question.
A plugin written for this purpose, in spectrum-compare. It works by calculating a mean harmonic spectrum for each of its two input channels, then repeatedly frequency-scaling one using a multiplicative factor and comparing the values for each rescaled version, within a limited frequency range, with the reference version.
It probes shifts up to 2400 cents in both directions and reports a shift of -1021 cents as the best result. There are various other local minima but the true difference is nowhere near any of them. This may be down to arithmetic error; it seems hard to believe that there wouldn't be a minimum nearby. Must review.
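The general rescale-and-compare idea can be sketched like this. It is not the plugin's code: the linear interpolation, the L1 distance, the synthetic spectra and all names are assumptions made for illustration.

```python
import math

def harmonic_spectrum(f0_bin, n_bins=2048, n_harmonics=8, width=3.0):
    """Synthetic magnitude spectrum: Gaussian peaks at multiples of f0_bin."""
    spec = [0.0] * n_bins
    for h in range(1, n_harmonics + 1):
        centre = f0_bin * h
        for b in range(n_bins):
            spec[b] += math.exp(-((b - centre) / width) ** 2) / h
    return spec

def rescale_spectrum(spectrum, factor):
    """Stretch the frequency axis so energy at bin s moves to bin s * factor."""
    n = len(spectrum)
    out = []
    for b in range(n):
        src = b / factor
        lo = int(src)
        if lo + 1 >= n:
            out.append(0.0)  # source position falls off the end of the spectrum
            continue
        frac = src - lo
        out.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return out

def best_shift_cents(ref, other, cents_range, lo_bin, hi_bin):
    """Shift (in cents) of `other` that minimises the L1 distance to `ref`."""
    best = None
    for cents in cents_range:
        scaled = rescale_spectrum(other, 2.0 ** (cents / 1200.0))
        dist = sum(abs(ref[b] - scaled[b]) for b in range(lo_bin, hi_bin))
        if best is None or dist < best[1]:
            best = (cents, dist)
    return best[0]

# The "other" spectrum is 200 cents flat of the reference.
ref = harmonic_spectrum(100.0)
other = harmonic_spectrum(100.0 * 2.0 ** (-200.0 / 1200.0))
estimate = best_shift_cents(ref, other, range(-400, 401, 10), 50, 1000)
print(estimate)
```

On clean synthetic input like this the global minimum sits at the true shift, which is part of why the result on the real recordings looks suspicious.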
The TempEst plugin tries to estimate temperament and tuning frequency. I haven't read up yet on how it does this. For the Egarr recording it estimates A=415.98 Hz, shifted quarter-comma meantone.
Separate tuning and key estimation
NNLS Chroma Tuning plugin
Run on the Egarr recording alone, this takes 0.54 sec to produce an estimate of A=445.8 Hz. This is strangely fast! But obviously on its own it has no way to tell that the tuning is more than a semitone different.
If concert A actually was 445.8Hz, our actual tuning frequency of 400Hz would be just above G. Can we adjust for the greater-than-a-semitone shift using a key detector as well?
NNLS Chroma and QM Key Detector
The QM Key Detector reports a modal key of C major for the reference and B major for the test piece.
The tuning plugin had reported a tuning frequency of 445.8Hz. If we take this at face value and apply the pitch shift necessary to adjust from 445.8Hz to 440Hz, and then run the key detector, we get a modal key of Bb major.
From 445.8Hz to 440Hz is about 23 cents, so we have shifted the piece down by 23 cents and found it to be a whole tone lower than the reference -- implying that the original tuning frequency was about 177 cents below the reference, or about 397Hz.
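The arithmetic above, written out as a small sketch. The function names and the pitch-class spelling table are hypothetical; the sketch only combines a tuning estimate with a pair of detected keys and does no audio processing.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def semitones_from_to(ref_key, test_key):
    """Smallest signed semitone interval from ref_key to test_key, in [-6, 5]."""
    diff = (PITCH_CLASSES.index(test_key) - PITCH_CLASSES.index(ref_key)) % 12
    return diff - 12 if diff >= 6 else diff

def implied_tuning(estimated_hz, ref_key, key_after_shift, ref_hz=440.0):
    """Offset in cents and tuning frequency of the original, unshifted recording.

    estimated_hz    -- tuning reported for the recording in isolation
    key_after_shift -- modal key detected after shifting estimated_hz to ref_hz
    """
    # Pitch shift applied to bring the estimated tuning to ref_hz (negative = down).
    applied_cents = 1200.0 * math.log2(ref_hz / estimated_hz)
    # Remaining offset of the shifted audio, read off from the two detected keys.
    key_cents = 100.0 * semitones_from_to(ref_key, key_after_shift)
    # Undo the applied shift to get back to the original recording.
    original_offset = key_cents - applied_cents
    return original_offset, ref_hz * 2.0 ** (original_offset / 1200.0)

offset_cents, tuning_hz = implied_tuning(445.8, "C", "Bb")
print(round(offset_cents, 1), round(tuning_hz, 1))
```

Everything here hangs on the key detector returning the right pitch class: an error of one semitone in the detected key moves the result by a full 100 cents.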
That's pretty good, but relying on a key detector feels fragile. We got the right modal key here, but this is an easy piece (in fact it's one of the pieces the key detector's reference templates came from). We could easily have got a complementary key as the modal key and ended up miles out.
I started implementing this in a script (tuning-and-key/keycompare.sh), also taking advantage of the fact that the key detector has a tuning frequency parameter. But then I discovered that the tuning frequency estimation from NNLS Chroma was not reliable after all -- it produces a different result (433.753Hz) if you resample the audio first (from 48 to 44.1 kHz).
Taking the same approach, if we shift the audio up from 433.753Hz to 440Hz and then run the key detector, we get a modal key of A major, three semitones below the reference. We have shifted the piece up by 24.8 cents and found it to be three semitones lower than the reference, so the original tuning frequency must have been 324.8 cents below the reference, or about 364.7Hz.
That's no good -- back to the drawing board.