Chris Cannam, 2014-05-08 12:08 PM


Speed

Aims

We want to make the plugin as fast as possible, but I think there's a case to be made for providing fast and slow modes (see Possibilities for Plugin Parameters).

In "fast" mode we should aim to produce a reasonable transcription faster than real time on any computer from the past five years or so. "Slow" mode has no particular speed constraint: it is simply the fastest implementation we can manage of the best results we can easily produce.

See the timing directory in the repo for timing tests. These are all carried out on a Thinkpad T540p with Intel i5-4330M under 64-bit Linux. See the end of the results file, and "slower computers" below, for some figures from older hardware.

Work so far

  • ce64d11ef336, pre-optimisation (release build) takes 104 seconds to process a 43.5-second file. (For reference, a debug build takes over 850 seconds.)
  • Experiments to test where the time is spent:
    • 78a7bf247016 removing the unused Vamp plugin outputs: no more than 1% difference
    • f3bf6503e6c6 removing debug printouts: no more than 1% difference
    • Adjusting the CQ resampler parameters to allow a lower SNR: no more than 1% difference
    • 5314d3361dfb halving the number of EM iterations: reduces runtime by 43% (to 59 sec). If EM time is linear in the number of iterations and nothing else changes, then EM must be taking around 86% of the total.
  • Optimising EM:
    • 97b77e7cb94c storing the templates as double instead of single-precision floats saves around 4% overall, for 100 sec
    • (Alternatively, 840c0d703bbb storing them as floats and using single-precision arithmetic throughout saves around 14%, but presumably produces different results -- not pursued at this point)
    • 19f6832fdc8a using bqvec library for raw vector allocation and manipulation instead of std::vector saves a further 10%, for 89 sec
    • A couple of experiments to try to get the template arrays better aligned failed
    • 6890dea115c3 factoring out a further loop saves another 11%, for 78 sec
  • Multi-threading:
    • df05f855f63b using OpenMP for the loop through columns when calling out to EM halves the runtime again (for 41s total), though now consuming 122s "user" time
    • the same code with OMP_NUM_THREADS=1 now runs in 78 sec
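
The estimate of EM's share in the iteration-halving experiment above is simple arithmetic. A minimal sketch, assuming EM cost is linear in the iteration count and the rest of the runtime is unaffected; the function name is illustrative and not taken from the plugin code:

```cpp
#include <cassert>

// Estimate the fraction of total runtime spent in one stage, assuming the
// stage's cost is linear in its workload and all other costs are fixed.
// workScale is the factor applied to the workload (0.5 = halved).
double estimateStageFraction(double baselineSec, double reducedSec,
                             double workScale)
{
    assert(baselineSec > 0.0 && workScale >= 0.0 && workScale < 1.0);
    // Fraction of the total runtime that disappeared.
    double saved = (baselineSec - reducedSec) / baselineSec;
    // Only (1 - workScale) of the stage was removed, so scale back up.
    return saved / (1.0 - workScale);
}
```

Calling estimateStageFraction(104.0, 59.0, 0.5) gives a value between 0.86 and 0.87, in line with the "around 86%" figure obtained from the rounded 43% saving.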

That work was merged to default, for a new baseline time of 41s.
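
For illustration, the multi-threading step might look something like the following. This is only a sketch of the general pattern (an OpenMP parallel loop over independent columns), not the plugin's actual code; processColumn is a hypothetical stand-in for the per-column EM call.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for the per-column EM call. Any function that
// reads one input column and returns its own result would fit here.
static double processColumn(const std::vector<double> &column)
{
    double sum = 0.0;
    for (double v : column) sum += std::sqrt(std::fabs(v));
    return sum;
}

// Process every column independently. When built with OpenMP enabled
// (e.g. -fopenmp with GCC or Clang) the iterations are shared between
// threads; otherwise the pragma is ignored and the loop runs serially,
// producing the same results.
std::vector<double>
processAllColumns(const std::vector<std::vector<double>> &columns)
{
    // A signed loop index keeps the loop valid for older OpenMP versions.
    assert(columns.size() <=
           static_cast<std::size_t>(std::numeric_limits<int>::max()));
    const int n = static_cast<int>(columns.size());
    std::vector<double> results(columns.size());
#pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Each iteration writes only its own slot, so no locking is needed.
        results[i] = processColumn(columns[i]);
    }
    return results;
}
```

This pattern only works because each column is handled without touching shared state. Setting the environment variable OMP_NUM_THREADS=1 forces a single thread, which is how the single-threaded timings above were obtained.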

  • Optimising EM again:
    • f25b8e7de0ed not processing templates that are out of range for an instrument: cuts runtime to about 58% of the baseline (a saving of around 41%), for 24 sec, or 41 sec single-threaded
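
The idea of skipping out-of-range templates can be sketched as below. The InstrumentRange type and the note-index layout are assumptions made for illustration; the plugin's real data structures may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one instrument: the inclusive range of
// note indices for which it has meaningful templates.
struct InstrumentRange {
    int lowestNote;
    int highestNote;
};

// Return the note indices whose templates need processing for this
// instrument, so that the inner EM loops can skip everything else.
std::vector<int> notesInRange(const InstrumentRange &range, int noteCount)
{
    assert(noteCount >= 0);
    std::vector<int> notes;
    for (int n = 0; n < noteCount; ++n) {
        // Out of range for this instrument: no work to do.
        if (n < range.lowestNote || n > range.highestNote) continue;
        notes.push_back(n);
    }
    return notes;
}
```

The saving depends on how narrow the instrument ranges are relative to the full note range, since the work avoided is proportional to the number of templates skipped.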

Slower computers

Thinkpad T40p, single-core 1.6GHz Pentium-M. This is about ten years old and quite a lot slower than any reasonable target for real-time performance.

  • ce64d11ef336 (104s on reference computer): 541 sec
  • f25b8e7de0ed (24s on reference computer or 41s single-threaded): 415 sec (only 23% faster, and still running at less than 11% of real-time speed)
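
The real-time percentages here come from dividing the audio duration by the processing time, assuming the same 43.5-second test file was used on both machines. A small helper (the name is illustrative) makes the comparison explicit:

```cpp
#include <cassert>

// Ratio of audio duration to processing time. Values of 1.0 or more mean
// processing is faster than real time.
double realTimeFactor(double audioSec, double processingSec)
{
    assert(audioSec > 0.0 && processingSec > 0.0);
    return audioSec / processingSec;
}
```

For the 43.5-second file, 415 sec on the T40p gives a factor of roughly 0.10, while 24 sec on the reference computer gives about 1.8.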

Other possibilities

  • Compare the quality of results using float arithmetic to those using doubles
  • Adaptively select the number of EM iterations -- if the process is converging more quickly, break off sooner (how to measure convergence?)
  • Optimise the constant-Q -- it wasn't a very significant part of the runtime to start with, but is presumably becoming more significant now
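
One possible answer to the convergence question is to stop when the relative change in some objective value drops below a threshold. The sketch below assumes such an objective is available from each EM iteration (for example, a divergence between the observed and reconstructed spectra); the step callback and the choice of objective are hypothetical, and whether this measure suits the plugin's EM is untested.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Run up to maxIterations steps, stopping early once the relative change
// in the objective returned by step() falls below tolerance.
// step() is a hypothetical callback that performs one EM iteration and
// returns the current objective value.
// Returns the number of iterations actually run.
int iterateUntilConverged(const std::function<double()> &step,
                          int maxIterations, double tolerance)
{
    assert(maxIterations > 0 && tolerance > 0.0);
    double previous = step();
    for (int i = 1; i < maxIterations; ++i) {
        double current = step();
        double change = std::fabs(current - previous);
        // Guard against division by zero when the objective is near zero.
        double scale = std::fmax(std::fabs(previous), 1e-12);
        if (change / scale < tolerance) return i + 1;
        previous = current;
    }
    return maxIterations;
}
```

The iteration cap keeps the worst case no slower than the current fixed-count behaviour, apart from the cost of evaluating the objective, which would itself need to be cheap for this to pay off.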