Chris Cannam, 2014-02-13 12:38 PM


Method and implementation notes

Publications

We are aiming to produce an "online" implementation (i.e. not requiring all audio before processing starts) in Vamp plugin form, based on the MIREX 2012 method.

About the method

The basic flow is audio -> Constant-Q transform -> Probabilistic Latent Component Analysis (PLCA) using Expectation-Maximization (EM) -> some note-clustering algorithm.
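
To make the PLCA/EM stage concrete, here is a minimal sketch in plain Python. It is not the published method: it assumes a single set of fixed, pre-extracted pitch templates W[f][p] (each template summing to 1 over frequency bins), ignores instrument sources and pitch shifting, and uses a uniform rather than random initialisation so that the output is repeatable. The function name and argument layout are hypothetical.

```python
def plca_activations(V, W, iterations=50):
    """Estimate pitch activations H[p][t] for a spectrogram V[f][t].

    Assumptions (not from the published method): templates W[f][p] are
    fixed, there is one template per pitch, and H starts out uniform.
    """
    n_bins = len(V)
    n_frames = len(V[0])
    n_pitches = len(W[0])
    H = [[1.0 / n_pitches] * n_frames for _ in range(n_pitches)]

    for _ in range(iterations):
        # With fixed templates each frame can be updated independently.
        for t in range(n_frames):
            new = [0.0] * n_pitches
            for f in range(n_bins):
                if V[f][t] == 0:
                    continue
                # Model value at (f, t): sum over pitches of W * H.
                denom = sum(W[f][p] * H[p][t] for p in range(n_pitches))
                if denom <= 0:
                    continue
                for p in range(n_pitches):
                    # E-step posterior P(p | f, t), weighted by the
                    # observed energy and accumulated for the M-step.
                    new[p] += V[f][t] * W[f][p] * H[p][t] / denom
            total = sum(new)
            if total > 0:
                # M-step: renormalise so activations sum to 1 per frame.
                for p in range(n_pitches):
                    H[p][t] = new[p] / total
    return H


# Two toy templates: pitch 0 occupies bins 0-1, pitch 1 occupies bins 2-3.
W = [[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]]
# Three frames: pitch 0 alone, pitch 1 alone, then an equal mix.
V = [[0.5, 0.0, 0.25], [0.5, 0.0, 0.25], [0.0, 0.5, 0.25], [0.0, 0.5, 0.25]]
H = plca_activations(V, W)
```

Because the update for one frame never looks at another frame, the same loop body could in principle run one frame at a time in an online setting.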

The CMJ paper uses a hidden Markov model for note clustering, but the MIREX submission used a simple thresholding method. The thresholding method worked quite well, so we should probably use that.
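
As an illustration of what a simple thresholding stage might look like, the sketch below turns a pitch-activation matrix into note events. The function name, the threshold value, and the min_frames parameter are assumptions made for the example; the exact rule used in the MIREX submission is not reproduced here.

```python
def threshold_notes(activations, threshold, min_frames=1):
    """Convert activations[p][t] into (pitch, onset, offset) tuples.

    A pitch is "on" in a frame when its activation exceeds the
    threshold. Offsets are exclusive frame indices. min_frames is a
    hypothetical extra that discards very short notes.
    """
    notes = []
    for p, row in enumerate(activations):
        onset = None
        for t, value in enumerate(row):
            active = value > threshold
            if active and onset is None:
                onset = t  # note starts here
            elif not active and onset is not None:
                if t - onset >= min_frames:
                    notes.append((p, onset, t))
                onset = None
        # Close any note still sounding at the end of the input.
        if onset is not None and len(row) - onset >= min_frames:
            notes.append((p, onset, len(row)))
    return notes


example = [
    [0.0, 0.6, 0.7, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.8, 0.8],
]
all_notes = threshold_notes(example, threshold=0.5)
long_notes = threshold_notes(example, threshold=0.5, min_frames=2)
```

Since a note can only be emitted once its offset is seen, an online version would need to carry the current onset for each pitch across process calls.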

The published method has a convolution stage to handle fine pitch variations with a 20 cent resolution. This makes the process many times slower for a gain of perhaps 2% in overall transcription performance, so we should probably omit it to begin with. The diagrams in Benetos 2013 illustrate this stage.
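
For reference, the arithmetic behind a 20 cent resolution can be worked through as follows. The helper names are hypothetical; the only facts used are that a semitone is 100 cents and an octave is 1200 cents.

```python
def bins_per_octave(cents_per_bin):
    """Number of equal-width log-frequency bins needed per octave."""
    return round(1200 / cents_per_bin)


def bin_frequency_ratio(cents_per_bin):
    """Frequency ratio between the centres of adjacent bins."""
    return 2.0 ** (cents_per_bin / 1200.0)


fine_bins = bins_per_octave(20)       # 5 bins per semitone
coarse_bins = bins_per_octave(100)    # 1 bin per semitone
fine_ratio = bin_frequency_ratio(20)  # a little over 1.0116
```

So working at 20 cents means five bins per semitone (60 per octave) rather than one, which gives a rough sense of how much more data the fine-pitch stage has to consider.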

There is no accommodation for percussion; one might preprocess to remove broadband percussive events.

There is no accommodation for typical qualities of vocal performance (melisma, vibrato, etc.).

Implementation notes

Constant-Q transform: the existing code uses Anssi's MATLAB toolbox, which is substantially better than the Constant-Q in the qm-dsp library. A good first step would be to write a good new C++ Constant-Q implementation. See this project for that work.

Other notes:

  • We could potentially parameterise the sparsity level on z ("sz" variable in MATLAB) so that it corresponds roughly to the number of simultaneous notes
  • The MIREX method also eliminated any polyphony > 4 by dropping weaker notes
  • There are no particular temporal constraints (the templates have no "duration"), so input can be broken up as required; it could even be processed completely frame-by-frame, at the cost of having to reinitialise EM at each frame
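
The first two notes above can be sketched per frame as follows. Both functions are illustrative assumptions: sparsify assumes that the sparsity level acts as an exponent applied to the activations before renormalising (a common way of imposing sparsity in PLCA, but not confirmed here as what "sz" does in the MATLAB code), and cap_polyphony applies the limit of 4 directly to one frame of activations, which may not be the point at which the MIREX method applied it.

```python
def sparsify(column, sz):
    """Raise one frame of activations to the power sz and renormalise.

    Assumption: sz > 1 sharpens the distribution towards fewer active
    pitches, and sz = 1 leaves a normalised frame unchanged.
    """
    powered = [v ** sz for v in column]
    total = sum(powered)
    if total == 0:
        return list(column)
    return [v / total for v in powered]


def cap_polyphony(column, max_notes=4):
    """Keep the max_notes strongest activations in a frame, zero the rest."""
    if max_notes <= 0:
        return [0.0] * len(column)
    ranked = sorted(range(len(column)), key=lambda p: column[p], reverse=True)
    keep = set(ranked[:max_notes])
    return [v if p in keep else 0.0 for p, v in enumerate(column)]


sharper = sparsify([0.75, 0.25], sz=2)
capped = cap_polyphony([0.1, 0.5, 0.2, 0.05, 0.3, 0.4], max_notes=4)
```

Both operate on a single frame at a time, so neither adds any temporal dependency.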

How to test

  • Constant-Q -- compare with Anssi's MATLAB
  • Random initialisers for EM mean the method doesn't always produce the same output, but it generally converges to within about 1%
  • Can compare pitch-activations (equation 12 in the paper, "z" variable in the MATLAB code)
  • Test data: Trios dataset (in C4DM datasets) + MIREX development dataset + RWC + MAPS
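
One way to compare pitch activations between two runs (or between the MATLAB output and a new implementation) is a tolerance check like the sketch below. The metric is an assumption: it measures the largest element-wise difference relative to the largest value in either matrix, and treats the 1% figure above as the default tolerance. The function names are hypothetical.

```python
def max_relative_difference(a, b):
    """Largest |a - b| over two matrices, relative to their peak value."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("activation matrices must have the same shape")
    scale = max(max(abs(v) for row in m for v in row) for m in (a, b))
    if scale == 0:
        return 0.0
    worst = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return worst / scale


def activations_match(a, b, tolerance=0.01):
    """True when two activation matrices agree to within the tolerance."""
    return max_relative_difference(a, b) <= tolerance


run_a = [[1.0, 0.5], [0.0, 0.25]]
run_b = [[1.005, 0.5], [0.0, 0.25]]
run_c = [[1.0, 0.55], [0.0, 0.25]]
close_enough = activations_match(run_a, run_b)
too_different = activations_match(run_a, run_c)
```

A relative measure is used so that the same tolerance works whether or not the activations have been normalised.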

See also: Joachim's work and comparison of different CQT methods: http://www.eecs.qmul.ac.uk/~jga/eusipco2012.html