Version 10/47
Chris Cannam, 2014-07-16 07:05 PM


Piano Evaluation for Level Normalisation

Lack of normalisation for Vamp plugin inputs is a problem when analysing quiet recordings (see #1028).

This page tests a small set of piano recordings, quickly evaluating performance across the first 30 seconds of each under a number of different normalisation / level management regimes.

Input files

Filename                            Approx. signal max
31.wav                              0.57
MAPS_MUS-bach_846_AkPnBcht.wav      0.12
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav    0.33
MAPS_MUS-scn15_7_SptkBGAm.wav       0.13
mz_333_1MINp_align.wav              0.10

The plugin has one internal threshold parameter, which can be lowered to find quieter notes, at the cost of more false positives. We don't really want to expose this (or any other continuous control) as a parameter, but for the threshold to be meaningful the input levels need to be approximately predictable.

Methods

Name      Hg revision     Description
as-is     d721a17f3e14    No normalisation
norm      d721a17f3e14    Normalise to 0.50 max before running plugin (can't do this in plugin)
to-date   d9b688700819    Track max signal level so far, adjust each sample so that max is at 0.50
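
The norm and to-date regimes in the table above can be illustrated with a minimal Python sketch. This is not the code at either Hg revision: the function names are hypothetical, and the to-date variant assumes that "adjust each sample so that max is at 0.50" means applying a per-sample gain of 0.50 divided by the largest absolute sample value seen so far.

```python
from typing import Iterable, Iterator, List

TARGET_PEAK = 0.50  # target maximum level named in the Methods table


def normalise_whole(samples: List[float], target: float = TARGET_PEAK) -> List[float]:
    """Sketch of 'norm': scale the whole signal so its absolute peak equals target.

    This needs the complete signal up front, so it has to happen before
    the plugin runs rather than inside it.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # all-zero input: nothing to scale
    gain = target / peak
    return [s * gain for s in samples]


def normalise_to_date(samples: Iterable[float], target: float = TARGET_PEAK) -> Iterator[float]:
    """Sketch of 'to-date' (assumed interpretation): causal, per-sample gain.

    Each sample is scaled by target / (largest absolute value seen so far),
    so the gain can only decrease as louder samples arrive.
    """
    peak_so_far = 0.0
    for s in samples:
        peak_so_far = max(peak_so_far, abs(s))
        # Until a non-zero sample has been seen there is no level to scale against.
        yield s * target / peak_so_far if peak_so_far > 0.0 else s


if __name__ == "__main__":
    signal = [0.05, -0.1, 0.02, 0.2, -0.1]  # made-up samples, not from the test files
    print(normalise_whole(signal))
    print(list(normalise_to_date(signal)))
```

In this sketch the to-date variant scales the very first non-zero sample up to the full 0.50 level, so audio early in a file comes out louder than it would under norm until the true peak has been seen.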

Results

Only the note onset F-measure for the first 30 seconds of each piece is reported here.

Filename                            norm   as-is   to-date
31.wav                              50     33      40
MAPS_MUS-bach_846_AkPnBcht.wav      87     15      62
MAPS_MUS-chpn_op7_1_ENSTDkAm.wav    33     31      31
MAPS_MUS-scn15_7_SptkBGAm.wav       73     16      61
mz_333_1MINp_align.wav              66     3       58

The precision (proportion of correct onsets among detected onsets, or 1 minus the false discovery rate) and recall (proportion of correctly-detected onsets among all ground-truth onsets, or true-positive rate) vary as you would hope:

  • when the resulting audio level is quieter than the norm case, precision is high and recall is low but the F-measure is worse than the norm case
  • when the resulting audio level is louder than the norm case, precision is low and recall is high and the F-measure is still worse than the norm case

This suggests that our threshold is moderately well-suited to the norm case, at least for optimising F-measure, though this might not be the most perceptually useful measure.
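
The relationship between precision, recall and F-measure described above can be sketched in a few lines of Python. This assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the page does not state explicitly. The function name is hypothetical, and the onset counts in the example are invented to illustrate the quiet, loud and balanced cases; they are not taken from the evaluation.

```python
from typing import Tuple


def precision_recall_f(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from onset counts.

    true_pos:  detected onsets that match a ground-truth onset
    false_pos: detected onsets with no matching ground-truth onset
    false_neg: ground-truth onsets that were not detected
    """
    detected = true_pos + false_pos    # everything the plugin reported
    reference = true_pos + false_neg   # everything in the ground truth
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / reference if reference else 0.0
    total = precision + recall
    # Assumed: balanced F1, i.e. the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Invented counts for illustration only.
    cases = {
        "too quiet (few detections)": (20, 2, 80),
        "too loud (many detections)": (90, 210, 10),
        "balanced": (70, 30, 30),
    }
    for label, counts in cases.items():
        p, r, f = precision_recall_f(*counts)
        print(f"{label}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a run with high precision but low recall (or the reverse) scores a lower F-measure than a run where the two are balanced.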