
Method and implementation notes

Publications

Main reference. A Shift-Invariant Latent Variable Model for Automatic Music Transcription
MIREX 2012: a simplified method, without the HMM block. Multiple-F0 Estimation and Note Tracking for MIREX 2012 using a Shift-Invariant Latent Variable Model
Updated in 2013 as Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model
For more recent work in MATLAB, see the project Automatic Music Transcription using efficient Multi-Source SI-PLCA.

We are aiming to produce an "online" implementation (i.e. not requiring all audio before processing starts) in Vamp plugin form, based on the MIREX 2012 method.

About the method

The basic flow is audio -> Constant-Q transform -> Probabilistic Latent Component Analysis using Expectation-Maximization -> some note-clustering algorithm.
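To make the EM step concrete, here is a minimal sketch for a single Constant-Q frame. It is a deliberately simplified PLCA: fixed, normalised templates, no shift-invariance, no sparsity and no instrument dimension, so it is not the model from the paper. The function name, the argument layout and the uniform initialisation (chosen so the result is repeatable) are assumptions for illustration only.

```python
def plca_activations(frame, templates, iterations=50):
    """Estimate per-template weights for one spectral frame by EM.

    frame: non-negative magnitudes, one per frequency bin.
    templates: one spectrum per pitch, each normalised to sum to 1.
    Returns weights that sum to 1 (or all zeros for a silent frame).
    """
    total = sum(frame)
    n = len(templates)
    if total == 0:
        return [0.0] * n
    v = [x / total for x in frame]          # treat the frame as a distribution
    h = [1.0 / n] * n                       # assumption: uniform initialisation
    for _ in range(iterations):
        new_h = [0.0] * n
        for f, vf in enumerate(v):
            if vf == 0.0:
                continue
            # E-step: posterior of each template given this frequency bin
            denom = sum(templates[p][f] * h[p] for p in range(n))
            if denom == 0.0:
                continue
            # M-step accumulation: redistribute the bin's energy
            for p in range(n):
                new_h[p] += vf * templates[p][f] * h[p] / denom
        s = sum(new_h)
        h = [x / s for x in new_h] if s > 0 else new_h
    return h
```

With two non-overlapping templates the weights simply recover the share of energy under each template, which makes a convenient sanity check.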

The CMJ paper uses a hidden Markov model for note clustering, but the MIREX submission used a simple thresholding method. The thresholding method worked quite well, so we should probably use that.
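A thresholding note tracker could look roughly like the sketch below. The threshold value, the minimum duration and the output format are not given in these notes, so they appear here as hypothetical parameters; this is an illustration of the idea rather than the MIREX code.

```python
def threshold_notes(activations, threshold, min_frames=1):
    """Turn a pitch-activation matrix into note events by thresholding.

    activations: list of frames, each a list of per-pitch strengths.
    Returns (pitch_index, onset_frame, offset_frame) tuples, with the
    offset frame exclusive, sorted by onset and then pitch.
    """
    if not activations:
        return []
    n_frames = len(activations)
    notes = []
    for p in range(len(activations[0])):
        onset = None
        for t, frame in enumerate(activations):
            active = frame[p] > threshold
            if active and onset is None:
                onset = t                      # note starts
            elif not active and onset is not None:
                if t - onset >= min_frames:    # drop very short blips
                    notes.append((p, onset, t))
                onset = None
        # close a note still sounding at the end of the input
        if onset is not None and n_frames - onset >= min_frames:
            notes.append((p, onset, n_frames))
    return sorted(notes, key=lambda note: (note[1], note[0]))
```

Because each pitch is handled independently and only the current state is needed, the same logic would carry over to an online implementation.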

The published method has a convolution stage to handle fine pitch variations with a 20 cent resolution. This makes the process many times slower, gaining maybe 2% in overall performance, so we should probably omit it to begin with. The diagrams in Benetos 2013 illustrate this stage.
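For a sense of scale, the arithmetic behind a 20 cent resolution is sketched here, assuming the log-frequency axis is sampled at that spacing: 1200 cents per octave gives 60 bins per octave, or 5 per semitone. The function names and the 27.5 Hz starting frequency used in the usage line are assumptions for illustration.

```python
CENTS_PER_OCTAVE = 1200

def bins_per_octave(resolution_cents):
    """Number of log-frequency bins per octave at the given spacing."""
    return CENTS_PER_OCTAVE // resolution_cents

def bin_frequency(index, min_freq_hz, resolution_cents):
    """Centre frequency of bin `index`, counting up from min_freq_hz."""
    return min_freq_hz * 2.0 ** (index * resolution_cents / CENTS_PER_OCTAVE)

# Assumed starting frequency of 27.5 Hz (A0): bin 60 is one octave higher.
octave_up = bin_frequency(60, 27.5, 20)
```

Dropping the fine-pitch stage corresponds to working at 100 cents, i.e. 12 bins per octave, which is one fifth as many pitch positions to evaluate.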

There is no accommodation for percussion; one might preprocess to remove broadband percussive events.

There is no accommodation for typical qualities of vocal performance (portamento, vibrato etc).

Implementation notes

Constant-Q transform: the existing code uses Anssi's MATLAB toolbox, which is substantially better than the Constant-Q in the qm-dsp library. A good first step would be to write a good new C++ Constant-Q implementation. See this project for that work.

Other notes:

We could potentially parameterise the sparsity level on z (the "sz" variable in the MATLAB code) as a rough proxy for the number of simultaneous notes.
The MIREX method also capped polyphony at 4 by dropping the weaker notes.
There are no particular temporal constraints (the template has no "duration"), so the input can be broken up as required. It could even be processed completely frame-by-frame, at the cost of having to reinitialise EM at each frame.
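The polyphony cap mentioned above can be sketched as follows. Whether the MIREX code applied it per frame or per note event is not stated here, so the per-frame version below is an assumption, as is the function name.

```python
def limit_polyphony(frame, max_notes=4):
    """Keep the max_notes strongest activations in a frame, zero the rest."""
    if max_notes <= 0:
        return [0.0] * len(frame)
    # Pitch indices ordered from strongest to weakest activation
    ranked = sorted(range(len(frame)), key=lambda p: frame[p], reverse=True)
    keep = set(ranked[:max_notes])
    return [x if p in keep else 0.0 for p, x in enumerate(frame)]
```

Since it only looks at one frame at a time, this step does not get in the way of frame-by-frame processing.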

How to test

Constant-Q: compare with the output of Anssi's MATLAB toolbox.
Random initialisers for EM mean the method doesn't always produce the same output, but it generally converges to within about 1%.
We can compare pitch activations (equation 12 in the paper, the "z" variable in the MATLAB code).
Test data: Trios dataset (in C4DM datasets) + MIREX development dataset + RWC + MAPS
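One way to compare pitch activations between two runs, or between a port and the MATLAB output, is a relative difference against a tolerance. The notes do not say how the 1% figure is measured, so the metric below (summed absolute difference relative to the reference total) and the function names are assumptions.

```python
def relative_difference(reference, candidate):
    """Summed absolute difference, relative to the reference's total."""
    if len(reference) != len(candidate) or any(
        len(r) != len(c) for r, c in zip(reference, candidate)
    ):
        raise ValueError("activation matrices must have the same shape")
    num = sum(abs(x - y)
              for r, c in zip(reference, candidate)
              for x, y in zip(r, c))
    den = sum(abs(x) for r in reference for x in r)
    if den == 0:
        return 0.0 if num == 0 else float("inf")
    return num / den

def within_tolerance(reference, candidate, tolerance=0.01):
    """True if the two matrices agree to within the given fraction."""
    return relative_difference(reference, candidate) <= tolerance
```

A regression test would then run the transcription on a fixed excerpt and check within_tolerance against stored reference activations.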

See also: Joachim's work and comparison of different CQT methods: http://www.eecs.qmul.ac.uk/~jga/eusipco2012.html

