h1. Method and Implementation Notes

h2. Publications

* Main reference: "A Shift-Invariant Latent Variable Model for Automatic Music Transcription":http://www.mitpressjournals.org/doi/abs/10.1162/COMJ_a_00146
* MIREX 2012: a simplified method, without the HMM block. "Multiple-F0 Estimation and Note Tracking for MIREX 2012 using a Shift-Invariant Latent Variable Model":http://www.music-ir.org/mirex/abstracts/2012/BD1.pdf
* Updated in 2013 as "Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model":http://openaccess.city.ac.uk/2155/
* For more recent work in MATLAB, see "Automatic Music Transcription using efficient Multi-Source SI-PLCA (+GPU support)":/projects/amt_mssiplca_fast.

We are aiming to produce an "online" implementation (i.e. not requiring all audio before processing starts) in Vamp plugin form, based on the MIREX 2012 method.

h2. About the method

The basic flow is audio -> "Constant-Q transform":/projects/constant-q-cpp -> "Probabilistic Latent Component Analysis":http://www.cs.illinois.edu/~paris/pubs/plca-report.pdf using "Expectation-Maximization":http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm -> some note-clustering algorithm.
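To make that concrete, here is a minimal sketch of one EM iteration for plain (non-shift-invariant) PLCA over a magnitude constant-Q spectrogram, assuming fixed pre-learned note templates. The struct layout and names are illustrative only; the published model adds shift-invariance and instrument-source components on top of this.

<pre><code class="cpp">
#include <vector>
#include <cstddef>

// Minimal PLCA sketch: factorise a magnitude spectrogram
// V[f][t] (nBins x nFrames) as P(f,t) ~= sum_z P(f|z) P(z|t),
// where the templates Pfz are fixed, pre-learned note spectra
// and only the per-frame activations Pzt are re-estimated.
struct PlcaSketch {
    std::size_t nBins, nFrames, nZ;
    std::vector<std::vector<double>> V;   // observed, non-negative
    std::vector<std::vector<double>> Pfz; // Pfz[f][z], each z-column sums to 1
    std::vector<std::vector<double>> Pzt; // Pzt[z][t], each t-column sums to 1

    void emIteration() {
        std::vector<std::vector<double>>
            acc(nZ, std::vector<double>(nFrames, 0.0));
        for (std::size_t t = 0; t < nFrames; ++t) {
            for (std::size_t f = 0; f < nBins; ++f) {
                // E-step: posterior P(z|f,t) is proportional to P(f|z) P(z|t)
                double norm = 0.0;
                for (std::size_t z = 0; z < nZ; ++z)
                    norm += Pfz[f][z] * Pzt[z][t];
                if (norm <= 0.0) continue;
                // M-step accumulation, weighting by the observed magnitude
                for (std::size_t z = 0; z < nZ; ++z)
                    acc[z][t] += V[f][t] * Pfz[f][z] * Pzt[z][t] / norm;
            }
            // M-step: renormalise this frame's activations
            double sum = 0.0;
            for (std::size_t z = 0; z < nZ; ++z) sum += acc[z][t];
            if (sum > 0.0)
                for (std::size_t z = 0; z < nZ; ++z)
                    Pzt[z][t] = acc[z][t] / sum;
        }
    }
};
</code></pre>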

The CMJ paper uses a hidden Markov model for note clustering, but the MIREX submission used a simple thresholding method. The thresholding method worked quite well, so we should probably use that.
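A rough sketch of the thresholding idea, in the C++ we would target (the threshold and minimum-duration parameters are placeholders, not the submission's tuned values):

<pre><code class="cpp">
#include <vector>
#include <cstddef>

struct NoteEvent {
    std::size_t pitch;       // template index z
    std::size_t onsetFrame;
    std::size_t offsetFrame; // exclusive
};

// Turn a pitch-activation matrix act[z][t] into note events: a note
// spans any run of consecutive frames whose activation exceeds
// `threshold`, provided the run lasts at least `minFrames`.
std::vector<NoteEvent>
notesByThresholding(const std::vector<std::vector<double>> &act,
                    double threshold, std::size_t minFrames)
{
    std::vector<NoteEvent> notes;
    for (std::size_t z = 0; z < act.size(); ++z) {
        bool active = false;
        std::size_t start = 0;
        for (std::size_t t = 0; t <= act[z].size(); ++t) {
            bool on = t < act[z].size() && act[z][t] > threshold;
            if (on && !active) { start = t; active = true; }
            else if (!on && active) {
                if (t - start >= minFrames) notes.push_back({ z, start, t });
                active = false;
            }
        }
    }
    return notes;
}
</code></pre>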

The published method has a convolution stage to handle fine pitch variations at 20-cent resolution. This makes the process many times slower while gaining perhaps 2% in overall transcription accuracy, so we should probably omit it to begin with. The diagrams in "Benetos 2013":http://openaccess.city.ac.uk/2155/ illustrate this stage.

There is no accommodation for percussion; one might preprocess to remove broadband percussive events.
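Such preprocessing is not part of the published method, but one possible approach would be a median-comparison pass in the spirit of harmonic/percussive separation; the sketch below is an assumption for illustration only.

<pre><code class="cpp">
#include <algorithm>
#include <vector>
#include <cstddef>

static double medianOf(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];
}

// Zero any time-frequency cell whose median magnitude across
// frequency (broadband/percussive evidence) exceeds its median
// across time (sustained/harmonic evidence). S[f][t] is a
// magnitude spectrogram; halfLen is the median half-window.
void suppressPercussion(std::vector<std::vector<double>> &S,
                        std::size_t halfLen)
{
    const std::size_t nF = S.size(), nT = S[0].size();
    std::vector<std::vector<double>> out = S;
    for (std::size_t f = 0; f < nF; ++f) {
        for (std::size_t t = 0; t < nT; ++t) {
            std::vector<double> timeWin, freqWin;
            for (std::size_t i = (t > halfLen ? t - halfLen : 0);
                 i < std::min(nT, t + halfLen + 1); ++i)
                timeWin.push_back(S[f][i]);
            for (std::size_t i = (f > halfLen ? f - halfLen : 0);
                 i < std::min(nF, f + halfLen + 1); ++i)
                freqWin.push_back(S[i][t]);
            if (medianOf(freqWin) > medianOf(timeWin)) out[f][t] = 0.0;
        }
    }
    S.swap(out);
}
</code></pre>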

There is no accommodation for typical qualities of vocal performance (melisma, vibrato etc).

h2. Implementation notes

*Constant-Q transform*: the existing code uses Anssi's "MATLAB toolbox":/projects/constant-q-toolbox. This is substantially better than the Constant-Q in the "qm-dsp library":/projects/qm-dsp. A good first step would be to produce a solid new C++ Constant-Q implementation; see "this project":/projects/constant-q-cpp for that work.

*Other notes*:

* We could potentially parameterise the sparsity level on z (the "sz" variable in the MATLAB code) so that it corresponds roughly to the number of simultaneous notes
* The MIREX method also eliminated any polyphony > 4 by dropping weaker notes (both of these points are sketched after this list)
* No particular temporal constraints (a template has no "duration"), so the input can be broken up as required and could even be processed completely frame-by-frame, at the cost of reinitialising EM for each frame
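A sketch of the first two points above (the function names, the sparsity exponent, and its mapping to polyphony are our assumptions, not taken from the MATLAB code):

<pre><code class="cpp">
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>
#include <cstddef>

// Sparsity on z: raise one frame's activation distribution to a
// power alpha > 1 and renormalise, sharpening it towards fewer
// simultaneously active pitches. How alpha maps to an expected
// note count is a tuning decision.
void applySparsity(std::vector<double> &frame, double alpha)
{
    double sum = 0.0;
    for (double &v : frame) { v = std::pow(v, alpha); sum += v; }
    if (sum > 0.0) for (double &v : frame) v /= sum;
}

// Polyphony cap: zero all but the `maxNotes` strongest activations
// in a frame (the MIREX method used a cap of 4).
void capPolyphony(std::vector<double> &frame, std::size_t maxNotes)
{
    if (frame.size() <= maxNotes) return;
    std::vector<double> sorted = frame;
    std::nth_element(sorted.begin(), sorted.begin() + maxNotes - 1,
                     sorted.end(), std::greater<double>());
    const double cutoff = sorted[maxNotes - 1]; // maxNotes-th largest value
    std::size_t kept = 0;
    for (double &v : frame) {
        if (v >= cutoff && kept < maxNotes) ++kept;
        else v = 0.0;
    }
}
</code></pre>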

h2. How to test

* Constant-Q -- compare with Anssi's MATLAB
* Random initialisation of EM means the method does not always produce identical output, but it generally converges to within about 1%
* We can compare pitch activations (equation 12 in the paper, the "z" variable in the MATLAB code); see the sketch after this list
* Test data: Trios dataset (in C4DM datasets) + MIREX development dataset + RWC + MAPS
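One way to implement the activation comparison (the relative-L1 measure and exact tolerance here are assumptions):

<pre><code class="cpp">
#include <cmath>
#include <vector>
#include <cstddef>

// Compare two pitch-activation matrices (e.g. C++ output against
// the MATLAB reference) by relative L1 difference. Because EM is
// randomly initialised, we expect agreement only to within a small
// tolerance such as 0.01, never bitwise equality.
bool activationsMatch(const std::vector<std::vector<double>> &ours,
                      const std::vector<std::vector<double>> &reference,
                      double tolerance = 0.01)
{
    if (ours.size() != reference.size()) return false;
    double diff = 0.0, total = 0.0;
    for (std::size_t z = 0; z < ours.size(); ++z) {
        if (ours[z].size() != reference[z].size()) return false;
        for (std::size_t t = 0; t < ours[z].size(); ++t) {
            diff += std::fabs(ours[z][t] - reference[z][t]);
            total += std::fabs(reference[z][t]);
        }
    }
    return total > 0.0 && diff / total < tolerance;
}
</code></pre>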

See also "Joachim's work comparing different CQT methods":http://www.eecs.qmul.ac.uk/~jga/eusipco2012.html.