view testdata/timing/results.txt @ 117:c3c768ac4340 bqvec-openmp

Another timing report
author Chris Cannam
date Wed, 07 May 2014 09:59:20 +0100
parents fbf9b824aaf3

Thinkpad T540p i5-4330M @2.80GHz with 16GB RAM, plugged in
Arch Linux, gcc 4.8.2
Using sonic-annotator v1.0 (commit:41c4de1e05d8), release build

Debug flags: -g -fPIC
Release flags: -O3 -ffast-math -msse -mfpmath=sse -ftree-vectorize -fPIC

Release flags for qm-dsp also include -fomit-frame-pointer

The input file is 1-channel 16-bit PCM at 44100Hz, duration 0m43.5s.


DEBUG/RELEASE:

commit:ce64d11ef336, release build of Silvet, release build of qm-dsp

real	1m44.456s
user	1m44.343s
sys	0m0.210s

commit:ce64d11ef336, debug build of Silvet, release build of qm-dsp

real	14m16.124s
user	14m16.907s
sys	0m0.217s

commit:ce64d11ef336, release build of Silvet, debug build of qm-dsp

real	1m55.204s
user	1m55.053s
sys	0m0.253s

Subsequent tests use release builds of both.
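
To compare the figures above, it helps to convert the shell `time` values to seconds. The following is a minimal sketch; the helper name `parse_time` is made up for illustration and is not part of Silvet or sonic-annotator. It shows the debug build of Silvet costing roughly 8.2 times the release run, and the debug build of qm-dsp costing only about 10% extra.

```python
import re

def parse_time(text):
    """Convert a shell `time` value such as '14m16.124s' to seconds.

    Hypothetical helper for this report only.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# "real" times copied from the three DEBUG/RELEASE runs above
release_both = parse_time("1m44.456s")
debug_silvet = parse_time("14m16.124s")
debug_qmdsp = parse_time("1m55.204s")

# Slowdown of each debug configuration relative to the all-release run
silvet_slowdown = debug_silvet / release_both
qmdsp_slowdown = debug_qmdsp / release_both

print(f"debug Silvet: {silvet_slowdown:.2f}x the release time")
print(f"debug qm-dsp: {qmdsp_slowdown:.2f}x the release time")
```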


VAMP FEATURE SUPPRESSION:

commit:7133f78ccbf6, as commit:ce64d11ef336 but with CQ output feature
return commented out

real	1m46.162s
user	1m46.093s
sys	0m0.157s

commit:78a7bf247016, as commit:ce64d11ef336 but with CQ output and FCQ
output feature return commented out

real	1m45.206s
user	1m45.153s
sys	0m0.147s

conclusion: no advantage in removing these


DEBUG PRINTOUTS:

commit:f3bf6503e6c6, as commit:ce64d11ef336 but with debug printouts
removed

real	1m43.744s
user	1m43.657s
sys	0m0.203s

conclusion: obviously we want to remove these eventually, but we might as
well keep them in during testing


EM ITERATIONS:

commit:5314d3361dfb, as commit:ce64d11ef336 but with only 6 EM
iterations instead of 12

real	0m59.055s
user	0m58.897s
sys	0m0.193s

conclusion: EM dominates the time taken, not CQ or note forming
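
A rough way to check this conclusion is to fit a straight line through the two measurements. This assumes the run time is a fixed cost plus a constant cost per EM iteration, which the report does not verify; the two data points are the "real" times for 12 and 6 iterations. Under that assumption EM would account for somewhere near 87% of the 12-iteration run.

```python
# "real" times in seconds, from commit:ce64d11ef336 (12 EM iterations)
# and commit:5314d3361dfb (6 EM iterations)
time_12_iterations = 104.456
time_6_iterations = 59.055

# Assumed model (not verified in the report):
#   total = fixed + iterations * per_iteration
per_iteration = (time_12_iterations - time_6_iterations) / (12 - 6)
fixed = time_6_iterations - 6 * per_iteration

# Share of the 12-iteration run attributable to EM under this model
em_share = 12 * per_iteration / time_12_iterations

print(f"per EM iteration: {per_iteration:.2f}s")
print(f"everything else:  {fixed:.2f}s")
print(f"EM share at 12 iterations: {em_share:.0%}")
```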


CQ DECIMATOR CONFIGURATION:

Uncommitted revision (because changes are in CQ subrepo) that is as
commit:ce64d11ef336 but with resampler SNR=30 and BW=0.04 instead of
SNR=60 and BW=0.02

real	1m43.176s
user	1m43.067s
sys	0m0.190s

conclusion: no real change, which supports the previous test (the CQ is
not where the time goes)


OPENMP:

commit:62b7be1226d5, as commit:ce64d11ef336 but with OpenMP parallel
"for" in the main EM iteration loop (4 cores)

real	0m56.400s
user	2m59.740s
sys	0m0.237s
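
The real and user times together indicate how well the parallel loop scales. The sketch below works this out against the serial commit:ce64d11ef336 run; the figure of 4 is taken from the "(4 cores)" note above, and the variable names are illustrative only. The wall-clock time improves by about 1.85x, while total CPU time rises by about 1.7x.

```python
# Times in seconds, converted from the `time` output above
serial_real, serial_user = 104.456, 104.343   # commit:ce64d11ef336
openmp_real, openmp_user = 56.400, 179.740    # commit:62b7be1226d5
cores = 4                                     # as stated in the report

# Wall-clock improvement from adding the OpenMP parallel "for"
speedup = serial_real / openmp_real

# Average number of cores kept busy during the OpenMP run
cores_busy = openmp_user / openmp_real

# Total CPU time spent, relative to the serial run
cpu_cost = openmp_user / serial_user

# Fraction of the ideal 4x speedup actually achieved
efficiency = speedup / cores

print(f"speedup:     {speedup:.2f}x")
print(f"cores busy:  {cores_busy:.2f}")
print(f"CPU cost:    {cpu_cost:.2f}x")
print(f"efficiency:  {efficiency:.0%}")
```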


EM TWEAKS:

commit:a0dedcbfa628, as commit:ce64d11ef336 but with variables hoisted
out of loops and consts added wherever applicable

real	1m44.548s
user	1m44.460s
sys	0m0.183s

conclusion: compiler already knows this stuff

commit:64b08cc12da0, as commit:ce64d11ef336 but with loops merged, in
theory to reduce intermediate calculations

real	3m46.969s
user	3m46.850s
sys	0m0.220s

commit:6075e92d63ab, as commit:64b08cc12da0 but with innermost loop
reverted to three loops with simple bodies instead of one with a more
complex body

real	1m44.767s
user	1m44.490s
sys	0m0.190s

commit:97b77e7cb94c, as commit:6075e92d63ab but with templates stored
as doubles instead of floats (doubling the size of the plugin binary)

real	1m40.135s
user	1m39.820s
sys	0m0.230s

commit:a6e136aaa202, as commit:97b77e7cb94c but with target vectors &
grids initialised to epsilon instead of copied & then overwritten
(I think this one also makes the intention clearer, so it is worth doing)

real	1m39.277s
user	1m39.000s
sys	0m0.183s

commit:840c0d703bbb, as commit:a6e136aaa202 but using single-precision
floats for all EM code (and templates). This is probably not wise
without separately testing the quality of the results but it's
interesting to compare

real	1m29.003s
user	1m28.697s
sys	0m0.197s

commit:91bb029a847a, as commit:a6e136aaa202 but with the series of
calculations reordered to match that in the recent bqvec code
commit:b2f0967cb8d1. This just tests whether it was the replacement of
std::vector or the reordering of vector operations that was saving the
time in the bqvec branch.

real	2m52.922s
user	2m52.480s
sys	0m0.263s


BQVEC:

commit:81eaba98985b, as commit:a6e136aaa202 but converted to use bqvec
for basic allocation etc; processing logic unchanged

real	1m37.320s
user	1m36.863s
sys	0m0.240s

commit:891cbcf1e4d2, as commit:81eaba98985b but with some calculations
vectorised [note: has silly bug]

real	1m24.961s
user	1m24.663s
sys	0m0.177s

commit:853b2d750688, as commit:891cbcf1e4d2 but with silly bug fixed

real	1m26.876s
user	1m26.387s
sys	0m0.267s

commit:9ecad4c9c2a2, as commit:853b2d750688 but using a couple of
bqvec calls in expectation function

real	1m9.153s
user	1m8.837s
sys	0m0.187s

(this seems unlikely -- what have I broken?)

commit:8259193b3b16, as commit:9ecad4c9c2a2 but avoiding some
allocations

real	1m10.631s
user	1m10.327s
sys	0m0.180s

(still broken?)

commit:19f6832fdc8a, as commit:9ecad4c9c2a2 but with the arguments to
v_add_with_gain supplied in the right order (that's what I'd broken!)

real	1m28.957s
user	1m28.437s
sys	0m0.213s


BQVEC AND OPENMP:

commit:ac750e222ad3, result of merging openmp branch
commit:62b7be1226d into bqvec branch commit:19f6832fdc8a

real	0m44.650s
user	2m19.997s
sys	0m0.343s

commit:c4eae816bdb3, as commit:ac750e222ad3 but with some logic to
make using the shifts optional (though on by default). Performance
*should* be unchanged here.

real	0m43.979s
user	2m19.297s
sys	0m0.360s

commit:b2f0967cb8d1, as commit:c4eae816bdb3 but storing the templates
as float arrays and then pulling them out into individual
one-per-shift-factor double arrays, each of which is explicitly
allocated with the proper alignment. Uses more memory, and the code is
ugly, but gets aligned starts for slightly more of the vector ops.

real	0m50.856s
user	2m44.937s
sys	0m0.463s
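
As a summary, the sketch below compares the main milestones in this report against the 43.5s duration of the input file. The selection of commits and the "real-time factor" label (processing time divided by audio duration) are my own framing rather than anything stated in the measurements. On these numbers the combined bqvec and OpenMP builds run at close to real time, and the aligned-template change is slower than the commit it was based on.

```python
audio_duration = 43.5  # seconds, from the input file description

# "real" times in seconds for selected commits in this report
runs = {
    "ce64d11ef336 (baseline)": 104.456,
    "62b7be1226d5 (OpenMP)": 56.400,
    "19f6832fdc8a (bqvec)": 88.957,
    "ac750e222ad3 (bqvec + OpenMP)": 44.650,
    "c4eae816bdb3 (optional shifts)": 43.979,
    "b2f0967cb8d1 (aligned templates)": 50.856,
}

baseline = runs["ce64d11ef336 (baseline)"]

# Real-time factor: 1.0 means processing takes as long as the audio lasts
summary = {
    name: (real / audio_duration, baseline / real)
    for name, real in runs.items()
}

for name, (rt_factor, speedup) in summary.items():
    print(f"{name:34s} {rt_factor:5.2f}x real time, "
          f"{speedup:4.2f}x vs baseline")

fastest = min(runs, key=runs.get)
print("fastest:", fastest)
```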