ThinkPad T540p i5-4330M @2.80GHz with 16GB RAM, plugged in
Arch Linux, gcc 4.8.2
Using sonic-annotator v1.0 (commit:41c4de1e05d8), release build

Debug flags: -g -fPIC
Release flags: -O3 -ffast-math -msse -mfpmath=sse -ftree-vectorize -fPIC

Release flags for qm-dsp also include -fomit-frame-pointer

The input file is 1-channel 16-bit PCM at 44100Hz, duration 0m43.5s.


DEBUG/RELEASE:

commit:ce64d11ef336, release build of Silvet, release build of qm-dsp

real 1m44.456s
user 1m44.343s
sys 0m0.210s

commit:ce64d11ef336, debug build of Silvet, release build of qm-dsp

real 14m16.124s
user 14m16.907s
sys 0m0.217s

commit:ce64d11ef336, release build of Silvet, debug build of qm-dsp

real 1m55.204s
user 1m55.053s
sys 0m0.253s

Subsequent tests use release builds of both.
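To put the DEBUG/RELEASE numbers on a common scale, the wall-clock ("real") figures can be converted to seconds and divided by the all-release baseline. This is a minimal Python sketch. It assumes the timings come from bash's `time` keyword in its default `XmY.YYYs` format. `parse_time`, `slowdowns` and `RUNS` are illustrative names and are not part of Silvet or sonic-annotator.

```python
import re

# Assumption: fields look like bash `time` output, e.g. "1m44.456s" (minutes, then seconds)
_TIME_FIELD = re.compile(r"(\d+)m(\d+(?:\.\d+)?)s")


def parse_time(field: str) -> float:
    """Convert a `time` field like '1m44.456s' into seconds."""
    match = _TIME_FIELD.fullmatch(field.strip())
    if match is None:
        raise ValueError(f"unrecognised time field: {field!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" timings for commit:ce64d11ef336, keyed by (Silvet build, qm-dsp build)
RUNS = {
    ("release", "release"): "1m44.456s",
    ("debug", "release"): "14m16.124s",
    ("release", "debug"): "1m55.204s",
}


def slowdowns(runs: dict) -> dict:
    """Ratio of each run's wall-clock time to the all-release baseline."""
    baseline = parse_time(runs[("release", "release")])
    return {key: parse_time(value) / baseline for key, value in runs.items()}


if __name__ == "__main__":
    for (silvet, qm_dsp), ratio in slowdowns(RUNS).items():
        print(f"Silvet {silvet:<7} qm-dsp {qm_dsp:<7} {ratio:5.2f}x")
```

This gives about 8.2x for the debug Silvet build and about 1.1x for the debug qm-dsp build. That result is consistent with the choice to use release builds of both from here on.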


VAMP FEATURE SUPPRESSION:

commit:7133f78ccbf6, as commit:ce64d11ef336 but with the CQ output
feature return commented out

real 1m46.162s
user 1m46.093s
sys 0m0.157s

commit:78a7bf247016, as commit:ce64d11ef336 but with both the CQ and
FCQ output feature returns commented out

real 1m45.206s
user 1m45.153s
sys 0m0.147s

conclusion: no advantage in removing these


DEBUG PRINTOUTS:

commit:f3bf6503e6c6, as commit:ce64d11ef336 but with debug printouts
removed

real 1m43.744s
user 1m43.657s
sys 0m0.203s

conclusion: obviously we want to remove these eventually, but might as
well keep them in during testing


EM ITERATIONS:

commit:5314d3361dfb, as commit:ce64d11ef336 but with only 6 EM
iterations instead of 12

real 0m59.055s
user 0m58.897s
sys 0m0.193s

conclusion: EM dominates the time taken, not CQ or note forming
(halving the iterations cuts the run from 1m44s to 59s)


CQ DECIMATOR CONFIGURATION:

Uncommitted revision (because the changes are in the CQ subrepo) that
is as commit:ce64d11ef336 but with resampler SNR=30 and BW=0.04
instead of SNR=60 and BW=0.02

real 1m43.176s
user 1m43.067s
sys 0m0.190s

conclusion: supports the previous test; a lower-quality (and
presumably cheaper) CQ resampler barely changes the run time


OPENMP:

commit:62b7be1226d5, as commit:ce64d11ef336 but with an OpenMP parallel
"for" in the main EM iteration loop (4 threads: the i5-4330M has 2
cores with hyper-threading)

real 0m56.400s
user 2m59.740s
sys 0m0.237s


EM TWEAKS:

commit:a0dedcbfa628, as commit:ce64d11ef336 but with variables hoisted
out of loops and consts added wherever applicable

real 1m44.548s
user 1m44.460s
sys 0m0.183s

conclusion: the compiler already knows this stuff

commit:64b08cc12da0, as commit:ce64d11ef336 but with loops merged, in
theory to reduce intermediate calculations

real 3m46.969s
user 3m46.850s
sys 0m0.220s

commit:6075e92d63ab, as commit:64b08cc12da0 but with the innermost loop
reverted to three loops with simple bodies instead of one with a more
complex body

real 1m44.767s
user 1m44.490s
sys 0m0.190s

commit:97b77e7cb94c, as commit:6075e92d63ab but with templates stored
as doubles instead of floats (doubling the size of the plugin binary)

real 1m40.135s
user 1m39.820s
sys 0m0.230s

commit:a6e136aaa202, as commit:97b77e7cb94c but with target vectors &
grids initialised to epsilon instead of copied & then overwritten
(this one also makes the intention clearer, I think, so is worth doing)

real 1m39.277s
user 1m39.000s
sys 0m0.183s

commit:840c0d703bbb, as commit:a6e136aaa202 but using single-precision
floats for all EM code (and templates). This is probably not wise
without separately testing the quality of the results, but it's
interesting to compare.

real 1m29.003s
user 1m28.697s
sys 0m0.197s

commit:91bb029a847a, as commit:a6e136aaa202 but with the series of
calculations reordered to match that in the recent bqvec code,
commit:b2f0967cb8d1. Just testing whether it is the replacement of
std::vector or the reordering of vector operations that was saving the
time in the bqvec branch.

real 2m52.922s
user 2m52.480s
sys 0m0.263s


BQVEC:

commit:81eaba98985b, as commit:a6e136aaa202 but converted to use bqvec
for basic allocation etc.; processing logic unchanged

real 1m37.320s
user 1m36.863s
sys 0m0.240s

commit:891cbcf1e4d2, as commit:81eaba98985b but with some calculations
vectorised [note: has silly bug]

real 1m24.961s
user 1m24.663s
sys 0m0.177s

commit:853b2d750688, as commit:891cbcf1e4d2 but with the silly bug fixed

real 1m26.876s
user 1m26.387s
sys 0m0.267s

commit:9ecad4c9c2a2, as commit:853b2d750688 but using a couple of
bqvec calls in the expectation function

real 1m9.153s
user 1m8.837s
sys 0m0.187s

(this seems unlikely -- what have I broken?)

commit:8259193b3b16, as commit:9ecad4c9c2a2 but avoiding some
allocations

real 1m10.631s
user 1m10.327s
sys 0m0.180s

(still broken?)

commit:19f6832fdc8a, as commit:9ecad4c9c2a2 but with the arguments to
v_add_with_gain supplied in the right order (that's what I'd broken!)

real 1m28.957s
user 1m28.437s
sys 0m0.213s


BQVEC AND OPENMP:

commit:ac750e222ad3, result of merging the openmp branch
commit:62b7be1226d into the bqvec branch commit:19f6832fdc8a

real 0m44.650s
user 2m19.997s
sys 0m0.343s

commit:c4eae816bdb3, as commit:ac750e222ad3 but with some logic to
make using the shifts optional (though on by default). Performance
*should* be unchanged here.

real 0m43.979s
user 2m19.297s
sys 0m0.360s

commit:b2f0967cb8d1, as commit:c4eae816bdb3 but storing the templates
as float arrays and then pulling them out into individual
one-per-shift-factor double arrays, each of which is explicitly
allocated with the proper alignment. Uses more memory, and the code is
ugly, but gets aligned starts for slightly more of the vector ops.

real 0m50.856s
user 2m44.937s
sys 0m0.463s

commit:6890dea115c3, as commit:c4eae816bdb3 but with a loop factored out

real 0m40.565s
user 2m3.883s
sys 0m0.307s

commit:230920148ee5, as commit:6890dea115c3 but with a simpler OpenMP loop

real 0m39.761s
user 2m3.093s
sys 0m0.347s

commit:df05f855f63b, same code as commit:230920148ee5 but with bqvec
as a subrepo (just checking)

real 0m40.799s
user 2m2.603s
sys 0m0.313s

commit:df05f855f63b again, with OMP_NUM_THREADS=1

real 1m18.265s
user 1m17.707s
sys 0m0.223s


FURTHER WORK FOLLOWING BQVEC AND OPENMP MERGE:

commit:f25b8e7de0ed, as commit:df05f855f63b but not processing
templates that are out of range for an instrument (since they should
be all zeros anyway)

real 0m23.640s
user 0m59.903s
sys 0m0.277s

commit:f25b8e7de0ed in draft mode (no shifts):

real 0m13.970s
user 0m21.670s
sys 0m0.260s


COMPARATIVE TIMINGS FROM OTHER COMPUTERS:

ThinkPad T40p Pentium M (Centrino) @1.6GHz with 1.5GB RAM, plugged in
Arch Linux, gcc 4.8.2
Using sonic-annotator v0.8, release build

Release flags: -O3 -ffast-math -msse -mfpmath=sse -ftree-vectorize -fPIC

commit:f25b8e7de0ed

real 6m54.670s
user 6m31.817s
sys 0m14.753s

commit:ce64d11ef336

real 9m0.637s
user 8m16.800s
sys 0m25.877s

commit:f25b8e7de0ed with -msse2 -march=pentium-m added to the compiler flags:

real 7m4.231s
user 6m41.760s
sys 0m13.807s

commit:f25b8e7de0ed in draft mode (no shifts):

real 3m30.218s
user 3m10.527s
sys 0m15.887s
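The headline numbers can also be summarised as a real-time factor (RTF). Here RTF means wall-clock time divided by the 43.5 s duration of the input file, so a value below 1.0 means faster than real time. The notes don't report this figure. The sketch below only derives it, together with the speedup over commit:ce64d11ef336 on the same machine, from the "real" timings above. The T40p runs used sonic-annotator v0.8 rather than v1.0, so the cross-machine comparison is not strictly like-for-like. `mins`, `REAL_SECONDS` and `summarise` are illustrative names only.

```python
AUDIO_SECONDS = 43.5  # duration of the test input file


def mins(minutes: int, seconds: float) -> float:
    """Convert a `time`-style (minutes, seconds) pair to seconds."""
    return minutes * 60 + seconds


# "real" (wall-clock) timings copied from the notes above
REAL_SECONDS = {
    "T540p": {
        "ce64d11ef336": mins(1, 44.456),
        "f25b8e7de0ed": mins(0, 23.640),
        "f25b8e7de0ed draft": mins(0, 13.970),
    },
    "T40p": {
        "ce64d11ef336": mins(9, 0.637),
        "f25b8e7de0ed": mins(6, 54.670),
        "f25b8e7de0ed draft": mins(3, 30.218),
    },
}


def summarise(machine: str) -> list:
    """Return (run, real-time factor, speedup vs ce64d11ef336) tuples for one machine."""
    runs = REAL_SECONDS[machine]
    baseline = runs["ce64d11ef336"]
    return [(name, secs / AUDIO_SECONDS, baseline / secs) for name, secs in runs.items()]


if __name__ == "__main__":
    for machine in REAL_SECONDS:
        for name, rtf, speedup in summarise(machine):
            print(f"{machine:<6} {name:<19} RTF {rtf:5.2f}  speedup {speedup:4.2f}x")
```

On the T540p, this puts commit:f25b8e7de0ed at roughly 0.54x real time, about 4.4x faster than commit:ce64d11ef336. On the single-core T40p, the same commit is still roughly 9.5x slower than real time and only about 1.3x faster than commit:ce64d11ef336, presumably because the OpenMP gains are not available there.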