Piano Evaluation for Level Normalisation

Lack of normalisation for Vamp plugin inputs is a problem when analysing quiet recordings (see #1028).

The tests here use a small set of piano recordings and quickly evaluate performance over the first 30 seconds of each, under a number of different normalisation / level-management regimes.

Input files

Filename                          Approx. signal max
31.wav                            0.57
MAPS_MUS-bach_846_AkPnBcht.wav    0.12
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  0.33
MAPS_MUS-scn15_7_SptkBGAm.wav     0.13
mz_333_1MINp_align.wav            0.10

The plugin has one internal threshold parameter, which can be lowered to find quieter notes (at the expense, of course, of more false positives). We don't really want to expose this (or any continuous control) as a parameter, but for the threshold to be meaningful the input levels need to be approximately predictable.


Name            Hg revision   Description
norm            d721a17f3e14  Normalise to 0.50 max before running plugin (can't do this in plugin: it's here as the reference case)
as-is           d721a17f3e14  No normalisation
to-date         d9b688700819  Track max signal level so far, adjust each sample so that max is at 0.50
r2,r3,r4,r5,r6  b5a8836dd2a4  Preprocess with Flatten Dynamics at 0.02, 0.03, 0.04, 0.05, 0.06 target RMS levels respectively
q8              4ac067799e0b  With Flatten Dynamics second attempt with max RMS targeted to 0.08
t4              d67fae2bb29e  With Flatten Dynamics attempt 2a with max RMS targeted to 0.04
u4              70773820e719  With Flatten Dynamics attempt 2b with max RMS targeted to 0.04
s5              1d5258a37cdd  Drop back to slightly simpler version (see discussion below)
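
As a rough illustration of the norm and to-date regimes, here is a minimal Python sketch. It is not taken from the plugin source: the function names and the handling of silence are assumptions, and only the 0.50 target comes from the descriptions above. The norm case needs the peak of the whole recording before it can scale anything, which is presumably why it can't be done inside a plugin that receives audio a block at a time, whereas to-date only ever looks backwards.

```python
def normalise_to_peak(samples, target_peak=0.5):
    """'norm' regime: scale the whole signal so its largest |sample| is target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale (assumed behaviour)
    gain = target_peak / peak
    return [s * gain for s in samples]


def normalise_to_date(samples, target_peak=0.5):
    """'to-date' regime: scale each sample using the largest |sample| seen so far."""
    out = []
    peak_so_far = 0.0
    for s in samples:
        peak_so_far = max(peak_so_far, abs(s))
        # Before any non-zero sample arrives there is no level to work from.
        gain = target_peak / peak_so_far if peak_so_far > 0.0 else 1.0
        out.append(s * gain)
    return out
```

Note that in this sketch the to-date gain can only fall over time, so the quiet opening of a piece is boosted more than the same material would be later on.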


The table below reports only the note-onset F-measure, as a percentage, for the first 30 seconds of each piece.

Filename                          norm  as-is  to-date  r2  r3  r4  r5  r6  q8  t4  u4  s5
31.wav                            50    33     40       45  47  48  45  43  42  49  45  45
MAPS_MUS-bach_846_AkPnBcht.wav    87    15     62       64  85  87  87  86  81  86  87  88
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  33    31     31       11  25  31  32  31  32  34  35  33
MAPS_MUS-scn15_7_SptkBGAm.wav     73    16     61       50  57  67  74  75  70  69  68  71
mz_333_1MINp_align.wav            66    3      58       42  60  64  66  63  66  63  65  66
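
One possible way to compare the regimes at a glance is the mean F-measure across the five files. The short Python sketch below computes that from the figures in the table above. The unweighted mean is an assumption made here for illustration, not necessarily the criterion used when judging the results, and with only five files it should be read loosely.

```python
# F-measures (%) copied from the table above, one row per file.
regimes = ["norm", "as-is", "to-date", "r2", "r3", "r4",
           "r5", "r6", "q8", "t4", "u4", "s5"]
scores = {
    "31.wav":                           [50, 33, 40, 45, 47, 48, 45, 43, 42, 49, 45, 45],
    "MAPS_MUS-bach_846_AkPnBcht.wav":   [87, 15, 62, 64, 85, 87, 87, 86, 81, 86, 87, 88],
    "MAPS_MUS-chpn_op7_1_ENSTDkAm.wav": [33, 31, 31, 11, 25, 31, 32, 31, 32, 34, 35, 33],
    "MAPS_MUS-scn15_7_SptkBGAm.wav":    [73, 16, 61, 50, 57, 67, 74, 75, 70, 69, 68, 71],
    "mz_333_1MINp_align.wav":           [66, 3, 58, 42, 60, 64, 66, 63, 66, 63, 65, 66],
}


def mean_per_regime(regimes, scores):
    """Unweighted mean F-measure per regime across all files."""
    n = len(scores)
    return {
        name: sum(row[i] for row in scores.values()) / n
        for i, name in enumerate(regimes)
    }


means = mean_per_regime(regimes, scores)
# List regimes from best to worst mean, with the gap to the norm reference case.
for name in sorted(means, key=means.get, reverse=True):
    print(f"{name:8s} {means[name]:5.1f}  ({means[name] - means['norm']:+.1f} vs norm)")
```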

The precision (the proportion of detected onsets that are correct, i.e. 1 minus the false-discovery rate) and recall (the proportion of ground-truth onsets that are correctly detected, i.e. the true-positive rate) vary as you would hope:

  • when the resulting audio level is quieter than the norm case, precision is high and recall is low but the F-measure is worse than the norm case
  • when the resulting audio level is louder than the norm case, precision is low and recall is high and the F-measure is still worse than the norm case

This suggests that our threshold (which happens to be 6) is moderately well-suited to the norm case, at least to optimise F-measure (this might not be the most perceptually useful measure though).
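
For reference, the relationship between precision, recall and F-measure described above can be written out as a small Python helper. This is a generic sketch of the standard definitions, not the evaluation code used for these tests. It starts from already-matched counts, so the onset-matching step (and whatever timing tolerance it uses) is outside its scope. The counts in the example loop are invented purely to show the two failure modes.

```python
def onset_scores(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-measure) from matched onset counts."""
    detected = true_positives + false_positives    # everything the detector reported
    reference = true_positives + false_negatives   # everything in the ground truth
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / reference if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts for illustration only; these are NOT measurements
# from the recordings discussed here.
for label, tp, fp, fn in [("input too quiet", 40, 5, 60),
                          ("input too loud", 90, 120, 10)]:
    p, r, f = onset_scores(tp, fp, fn)
    print(f"{label}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the F-measure is a harmonic mean, it is pulled toward whichever of precision and recall is lower, which is why both the too-quiet and too-loud cases score below a balanced one.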

The best results (apart from norm) above seem to be r5 and u4. Let's try to refine the parameters for each of those and see if any patterns emerge.

Flatten Dynamics fine-tuning

The adjustable parameters within r5, with their defaults, are

Parameter       Description                  Default
historySeconds  Length of RMS window         4.0 sec
catchUpSeconds  Length of gain slide window  0.5 sec
targetRMS       Target RMS value             0.05
maxGain         Hard limit on gain           20.0

The targetRMS is the one we have been varying across r2, r3 etc -- for r5 it is fixed at 0.05. We don't need to test maxGain variation.
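
To make the roles of these four parameters concrete, here is a minimal Python sketch of a gain control built from them. It is an illustration only and is not the Flatten Dynamics source: the causal (backward-looking) RMS window, the exponential approach to the wanted gain, and the behaviour on silence are all assumptions chosen to fit the one-line parameter descriptions.

```python
import math
from collections import deque


def flatten_dynamics(samples, sample_rate, history_seconds=4.0,
                     catch_up_seconds=0.5, target_rms=0.05, max_gain=20.0):
    """Scale samples so the RMS over the recent history approaches target_rms."""
    history_len = max(1, int(round(history_seconds * sample_rate)))
    catch_up_len = max(1, int(round(catch_up_seconds * sample_rate)))
    window = deque()      # the most recent history_len input samples
    sum_squares = 0.0     # running sum of squares over the window
    gain = 1.0
    out = []
    for s in samples:
        window.append(s)
        sum_squares += s * s
        if len(window) > history_len:
            old = window.popleft()
            sum_squares -= old * old
        rms = math.sqrt(max(sum_squares, 0.0) / len(window))
        # Gain that would bring the window RMS to the target, capped at max_gain.
        # On silence, hold the current gain (assumed behaviour).
        wanted = min(target_rms / rms, max_gain) if rms > 0.0 else gain
        # Slide toward the wanted gain over roughly catch_up_len samples
        # (an exponential approach; the real slide may well be shaped differently).
        gain += (wanted - gain) / catch_up_len
        out.append(s * gain)
    return out
```

In this sketch maxGain is what stops near-silent passages from being amplified without limit, which fits its description as a hard limit rather than a tuning control.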

Here r5hNcM denotes the r5 method with historySeconds = N and catchUpSeconds = M/10, so plain r5 is the same as r5h4c05. The r5 test was run again for this table, hence the differences from the r5 results above.
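
The naming scheme can be decoded mechanically. The helper below is a hypothetical convenience written for this page (it is not part of the test scripts) that maps a variant name to its historySeconds and catchUpSeconds values.

```python
import re


def parse_r5_variant(name):
    """Return (historySeconds, catchUpSeconds) for 'r5' or a name like 'r5h4c05'."""
    if name == "r5":
        return 4.0, 0.5  # the r5 defaults
    match = re.fullmatch(r"r5h(\d+)c(\d+)", name)
    if match is None:
        raise ValueError(f"not an r5 variant name: {name!r}")
    history_seconds = float(match.group(1))
    catch_up_seconds = int(match.group(2)) / 10  # M is in tenths of a second
    return history_seconds, catch_up_seconds
```

For example, r5h8c05 lengthens the RMS window to 8 seconds with the default slide, while r5h4c10 keeps the 4 second window and lengthens the slide to 1 second.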

Filename                          norm  r5  r5h2c05  r5h5c05  r5h6c05  r5h8c05  r5h4c01  r5h4c10
31.wav                            50    47  38       47       48       46       46       53
MAPS_MUS-bach_846_AkPnBcht.wav    87    87  87       87       87       88       86       88
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  33    32  33       32       29       31       32       31
MAPS_MUS-scn15_7_SptkBGAm.wav     73    73  66       72       76       73       73       73
mz_333_1MINp_align.wav            66    66  64       64       66       63       65       66

The adjustable parameters within u4, with their defaults, are

Parameter         Description                                     Default
longTermSeconds   Length of long-term RMS window                  4.0 sec
shortTermSeconds  Length of short-term RMS window                 1.0 sec
catchUpSeconds    Length of gain slide window                     0.2 sec
targetMaxRMS      Target RMS value                                0.04
rmsMaxDecay       Fallback multiplier for max RMS per sample      0.999
squashFactor      Exponent to skew 0,1 range toward top of range  0.3
maxGain           Hard limit on gain                              20.0
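
The squashFactor description is terse, so here is a small Python sketch of what an exponent below 1 does to values in the 0 to 1 range. Only the mapping itself is shown. Which quantity u4 applies it to is not stated on this page, so the function name and its use in isolation are assumptions.

```python
def squash(x, squash_factor=0.3):
    """Raise x in [0, 1] to squash_factor; exponents below 1 push values upward."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in the range 0 to 1")
    return x ** squash_factor


# The end points stay fixed while everything in between moves toward 1.
for x in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(f"{x:.2f} -> {squash(x):.3f}")
```

With squashFactor = 1.0 the mapping is the identity, so that setting amounts to no squashing at all.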

Start by varying squashFactor with others at defaults:

Filename                          norm  r5  u4/0.1  u4/0.3  u4/0.5  u4/1.0
31.wav                            50    47  42      40      41      45
MAPS_MUS-bach_846_AkPnBcht.wav    87    87  81      82      82      85
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  33    32  29      30      33      30
MAPS_MUS-scn15_7_SptkBGAm.wav     73    73  59      64      68      63
mz_333_1MINp_align.wav            66    66  65      67      64      59

The 0.3 results use the u4 defaults, yet they are worse than the u4 results obtained earlier on four of the five files (by 4 to 5 points), even though this is the same code. Run-to-run variance is evidently high.

I don't think u4 is showing good enough results to justify its complexity over the global-only r5 code, and the squash factor seems to offer little.

Let's supersede the u-series with an s-series that uses the long-term window (only) from r5 but with some decay in max RMS value to account for pieces that go loud-soft alternately. Parameters:

Parameter       Description                                 Default
historySeconds  Length of long-term RMS window              4.0 sec
catchUpSeconds  Length of gain slide window                 0.2 sec
targetMaxRMS    Target RMS value                            0.05
rmsMaxDecay     Fallback multiplier for max RMS per sample  0.999
maxGain         Hard limit on gain                          20.0
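
The part that distinguishes the s-series from r5 is the decaying maximum. A minimal Python sketch of that idea follows. It is an assumption-laden illustration rather than the s5 code: it takes a sequence of already-computed RMS values, lets the held maximum fall by rmsMaxDecay at each step unless a louder value arrives, and derives the gain from targetMaxRMS and maxGain.

```python
def decaying_max(rms_values, rms_max_decay=0.999):
    """Track the maximum RMS seen, letting it fall back by rms_max_decay per step."""
    tracked = []
    current_max = 0.0
    for rms in rms_values:
        # Either a new, louder value takes over, or the old maximum decays a little.
        current_max = max(rms, current_max * rms_max_decay)
        tracked.append(current_max)
    return tracked


def gain_for(max_rms, target_max_rms=0.05, max_gain=20.0):
    """Gain that would bring max_rms to target_max_rms, capped at max_gain."""
    if max_rms <= 0.0:
        return max_gain  # assumed behaviour when nothing has been heard yet
    return min(target_max_rms / max_rms, max_gain)
```

With rmsMaxDecay = 1.0 the maximum never falls back, so after one loud passage every later quiet passage keeps the low gain. Values below 1 let the gain recover, which is the loud-soft alternation case the decay is meant to address.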

We have not yet tuned the target RMS for this, never mind the other parameters. Here is the effect of varying the target RMS; following the naming of the r-series, sN uses targetMaxRMS = 0.0N:

Filename                          norm  r5  s3  s4  s5  s6  s7
31.wav                            50    47  45  46  42  44  45
MAPS_MUS-bach_846_AkPnBcht.wav    87    87  84  84  83  81  76
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  33    32  21  31  33  30  30
MAPS_MUS-scn15_7_SptkBGAm.wav     73    73  57  64  68  66  63
mz_333_1MINp_align.wav            66    66  56  60  63  63  63

Varying the fallback multiplier (rmsMaxDecay) for s5, whose default is 0.999:

Filename                          norm  r5  s5  s5/0.9  s5/0.99  s5/1.0
31.wav                            50    47  42  44      45       47
MAPS_MUS-bach_846_AkPnBcht.wav    87    87  83  83      84       83
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  33    32  33  31      29       1(??)
MAPS_MUS-scn15_7_SptkBGAm.wav     73    73  68  67      69       57
mz_333_1MINp_align.wav            66    66  63  63      63       57

Might as well stick with the default.
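
For a sense of scale, the multipliers tried above can be converted into half-lives, i.e. the number of applications after which a held maximum has fallen to half its value. The Python sketch below does that arithmetic. The 44100 Hz sample rate is an assumption made here, and the conversion to milliseconds only holds if the multiplier really is applied once per audio sample; if it is applied once per processing block instead, the step counts apply to blocks.

```python
import math


def decay_half_life(multiplier):
    """Number of applications of multiplier needed to halve a held value."""
    if not 0.0 < multiplier <= 1.0:
        raise ValueError("multiplier must be in the range (0, 1]")
    if multiplier == 1.0:
        return math.inf  # a multiplier of 1.0 never decays
    return math.log(0.5) / math.log(multiplier)


ASSUMED_SAMPLE_RATE = 44100  # Hz; an assumption, not stated on this page

for multiplier in (0.9, 0.99, 0.999, 1.0):
    steps = decay_half_life(multiplier)
    millis = 1000.0 * steps / ASSUMED_SAMPLE_RATE
    print(f"{multiplier}: {steps:.1f} steps, {millis:.2f} ms if applied per sample")
```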

For different piano template sets

The above results are all generated using four piano templates, numbered 1-3 plus pianorwc.

Here are results using the norm and as-is methods, but with different sets of piano templates: first with three templates (1-3) and then with each template in turn as the only one.

The template turns out not to make an enormous difference -- perhaps because these recordings contain nothing but piano?

Filename                          norm/all  as-is/all  norm/3of4  as-is/3of4  norm/1  as-is/1  norm/2  as-is/2  norm/3  as-is/3  norm/rwc  as-is/rwc
31.wav                            50        33         51         30          50      34       44      42       50      32       56        36
MAPS_MUS-bach_846_AkPnBcht.wav    87        15         86         16          86      24       75      20       73      10       71        18
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  33        31         32         32          31      22       29      31       35      34       32        28
MAPS_MUS-scn15_7_SptkBGAm.wav     73        16         71         19          71      12       68      14       72      17       70        15
mz_333_1MINp_align.wav            66        3          68         1           63      4        67      2        67      1        63        3

For "generic" template set

The above results all use template sets with only piano templates in them.

Here are results using the norm and as-is methods, but with the full set of instrument templates (four pianos plus all the rest).

Filename                          norm  as-is
31.wav                            49    37
MAPS_MUS-bach_846_AkPnBcht.wav    79    34
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav  31    28
MAPS_MUS-scn15_7_SptkBGAm.wav     67    16
mz_333_1MINp_align.wav            63    5

Cross-checking with non-piano test data

The results need to be roughly comparable with those obtained from pre-normalised data using other datasets as well as the piano one. Here is a subset of the TRIOS dataset. The norm result is that obtained from the plugin prior to doing this work, using pre-normalised data.

The mirex result is that from the MIREX 2012 submission in MATLAB. Note that it always uses all instrument templates, while the plugin results are based on selecting the "right" instrument for the piece (which is assumed to be the best choice, though we aren't actually testing that here).

File                 mirex  norm  u4  s5
mozart/piano         60     64    56  59
mozart/viola         33     37    35  39
mozart/mix           51     58    55  52
mozart/clarinet      74     80    86  89
lussier/piano        45     52    63  59
lussier/mix          36     43    40  38
lussier/bassoon      43     75    80  79
lussier/trumpet      43     46    51  47
take_five/piano      61     46    69  64
take_five/mix        62     73    69  70
take_five/saxophone  78     80    84  86