# Method and Implementation Notes

## Publications

- Main reference: A Shift-Invariant Latent Variable Model for Automatic Music Transcription
- MIREX 2012 (a simplified method without the HMM block): Multiple-F0 Estimation and Note Tracking for MIREX 2012 using a Shift-Invariant Latent Variable Model
- Updated in 2013 as Multiple-instrument polyphonic music transcription using a temporally constrained shift-invariant model
- For more recent work in MATLAB, see the project Automatic Music Transcription using efficient Multi-Source SI-PLCA.

We are aiming to produce an "online" implementation (i.e. one that does not require all the audio to be available before processing starts) in Vamp plugin form, based on the MIREX 2012 method.

## About the method

The basic flow is audio -> Constant-Q transform -> Probabilistic Latent Component Analysis (PLCA) fitted with Expectation-Maximization -> a note-clustering step.
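To make the middle stage concrete, here is a minimal sketch of plain (non-shift-invariant) PLCA fitted by EM on a magnitude spectrogram, written as the equivalent multiplicative KL updates. The function name, iteration count, and the absence of shift-invariance and instrument sources are all simplifications for illustration, not the published method.

```python
import numpy as np

def plca_em(V, n_components, n_iters=200, rng=None):
    """Fit plain PLCA to a non-negative spectrogram V (n_bins, n_frames).

    Returns spectral templates W, with columns P(f|z), and activations H,
    interpreted as P(z, t). Illustrative sketch only: the published method
    adds shift-invariance and multiple instrument sources on top of
    updates of this general shape.
    """
    rng = np.random.default_rng(rng)
    V = V / V.sum()                        # treat the spectrogram as P(f, t)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components))
    W /= W.sum(axis=0, keepdims=True)      # normalise templates to P(f|z)
    H = rng.random((n_components, n_frames))
    H /= H.sum()                           # normalise activations to P(z, t)
    for _ in range(n_iters):
        # E and M steps folded into multiplicative updates; for a
        # normalised V this coincides with PLCA's EM iteration
        Q = V / np.maximum(W @ H, 1e-12)
        W = W * (Q @ H.T)
        W /= np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
        Q = V / np.maximum(W @ H, 1e-12)
        H = H * (W.T @ Q)
        H /= H.sum()
    return W, H
```

Each iteration computes the ratio of the observed spectrogram to the current reconstruction, reweights the templates and activations by it, and renormalises so both remain probability distributions.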

The CMJ paper uses a hidden Markov model for note clustering, but the MIREX submission used a simple thresholding method. The thresholding method worked quite well, so we should probably use that.
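A thresholding step of that kind can be sketched as follows: mark a pitch active where its activation exceeds a threshold relative to the global maximum, then keep only runs of sufficient length. The threshold value and minimum-duration figure here are illustrative assumptions, not the settings used in the MIREX submission.

```python
import numpy as np

def activations_to_notes(H, threshold=0.1, min_frames=3):
    """Turn a pitch-activation matrix into note events by thresholding.

    H: (n_pitches, n_frames) non-negative activations. The threshold is
    relative to the global maximum; both it and min_frames are
    illustrative values. Returns (pitch_index, onset_frame, offset_frame)
    tuples, with the offset exclusive.
    """
    active = H >= threshold * H.max()
    notes = []
    for p in range(H.shape[0]):
        onset = None
        for t in range(H.shape[1] + 1):          # extra step flushes a final run
            on = t < H.shape[1] and active[p, t]
            if on and onset is None:
                onset = t                        # run starts
            elif not on and onset is not None:
                if t - onset >= min_frames:      # discard very short blips
                    notes.append((p, onset, t))
                onset = None
    return notes
```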

The published method has a convolution stage to handle fine pitch variations at 20-cent resolution. This makes the process many times slower while gaining perhaps 2% in overall performance, so we should probably omit it to begin with. The diagrams in Benetos 2013 illustrate this stage.

There is no accommodation for percussion; one might preprocess to remove broadband percussive events.

There is no accommodation for typical qualities of vocal performance (portamento, vibrato etc).

## Implementation notes

**Constant-Q transform**: the existing code uses Anssi's MATLAB toolbox, which is substantially better than the Constant-Q in the qm-dsp library. A good first step would be to write a solid new C++ Constant-Q implementation. See this project for that work.
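As a reference point for testing a faster implementation, a brute-force Constant-Q can be computed by taking the inner product of the signal with a windowed complex exponential per bin, with window length inversely proportional to bin frequency. This is deliberately naive (not the efficient kernel-based scheme a real C++ implementation would use), and all parameter values are illustrative.

```python
import numpy as np

def cq_frame(x, sr, fmin=55.0, bins_per_octave=12, n_bins=48):
    """Naive Constant-Q magnitudes for one frame, by direct inner products.

    Brute-force reference only; parameter defaults are illustrative.
    """
    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)  # constant quality factor
    out = np.zeros(n_bins)
    for k in range(n_bins):
        fk = fmin * 2 ** (k / bins_per_octave)
        n = min(int(round(Q * sr / fk)), len(x))  # Q cycles of this bin's frequency
        win = np.hanning(n)
        # per-sample phase increment 2*pi*Q/n = 2*pi*fk/sr, i.e. frequency fk
        kernel = win * np.exp(-2j * np.pi * Q * np.arange(n) / n)
        out[k] = abs(np.dot(x[:n], kernel)) / n
    return out
```

The output of something like this, and of the eventual C++ code, can both be checked against the MATLAB toolbox on the same test signals.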

**Other notes**:

- We could potentially parameterise the sparsity level on z (the "sz" variable in the MATLAB code), which corresponds roughly to the number of simultaneous notes
- The MIREX method also capped polyphony at 4 by dropping the weaker notes
- There are no particular temporal constraints -- a template has no "duration" -- so the input can be broken up as required, even processed completely frame-by-frame, at the cost of reinitialising EM for each frame
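The polyphony cap in the second point amounts to keeping only the strongest few activations per frame. A minimal sketch (the function name is ours; only the keep-the-top-4 rule comes from the MIREX description):

```python
import numpy as np

def cap_polyphony(H, max_poly=4):
    """Zero out all but the max_poly strongest activations in each frame.

    Mirrors the MIREX rule of dropping weaker notes when polyphony
    exceeds four; H is a pitch-activation matrix (pitches x frames).
    """
    H = H.copy()
    for t in range(H.shape[1]):
        order = np.argsort(H[:, t])          # ascending by strength
        H[order[:-max_poly], t] = 0.0        # keep only the top max_poly
    return H
```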

## How to test

- Constant-Q -- compare output against Anssi's MATLAB toolbox
- Random initialisation of EM means the method doesn't always produce identical output, but it generally converges to within roughly 1%
- We can compare pitch activations (equation 12 in the paper, the "z" variable in the MATLAB code)
- Test data: Trios dataset (in C4DM datasets) + MIREX development dataset + RWC + MAPS
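Given the randomly initialised EM, a regression test has to compare pitch activations up to a tolerance rather than exactly. One possible check (the normalisation and mean-absolute-difference metric here are our assumptions; only the roughly-1% figure comes from the note above):

```python
import numpy as np

def activations_agree(z_a, z_b, tol=0.01):
    """Check two pitch-activation matrices match to within a tolerance.

    Normalises each matrix to unit sum (making the check scale-invariant),
    then compares the mean absolute difference against tol times the mean
    activation level. Metric is an assumption; tol=0.01 reflects the
    roughly-1% convergence noted above.
    """
    a = z_a / z_a.sum()
    b = z_b / z_b.sum()
    return bool(np.abs(a - b).mean() <= tol * max(a.mean(), b.mean()))
```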

See also Joachim's work comparing different CQT methods: http://www.eecs.qmul.ac.uk/~jga/eusipco2012.html