Quote:
Timings for 2048K FFT length (6 cores, 6 workers): 18.79, 19.54, 18.55, 18.68, 18.50, 18.08 ms. Throughput: 321.19 iter/sec.
There are six workers, so there are six average times per iteration at the stated fft length. Each worker's iterations/sec is 1000 ms/sec divided by its average iteration time in ms. The total throughput is the sum of those six figures. The same reasoning generalizes from six workers to N.
1000/18.79 = 53.22 iter/sec
1000/19.54 = 51.18
1000/18.55 = 53.91
1000/18.68 = 53.53
1000/18.50 = 54.05
1000/18.08 = 55.31
Sum of six = 321.20 iter/sec (the 0.01 iter/sec difference is probably due to roundoff in the 2-decimal timings)
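The same arithmetic for any number of workers, as a minimal Python sketch. The function name is made up for illustration; the times are the six from the quoted benchmark line.

```python
def total_throughput(iteration_times_ms):
    """Return per-worker iter/sec and their sum, given average ms per iteration."""
    per_worker = [1000.0 / t for t in iteration_times_ms]
    return per_worker, sum(per_worker)

# Average iteration times (ms) for the six workers in the quoted line.
times_ms = [18.79, 19.54, 18.55, 18.68, 18.50, 18.08]

per_worker, total = total_throughput(times_ms)
for t, rate in zip(times_ms, per_worker):
    print(f"{t:6.2f} ms -> {rate:6.2f} iter/sec")
print(f"total: {total:.2f} iter/sec")
```

Pass a list of N times and you get the N-worker total the same way.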
Work required for an iteration is roughly exponent * log(exponent) * log(log(exponent)), and fft length is a nearly linear function of exponent, while the processor's rate of work is fairly constant. So iteration time grows a little faster than linearly with exponent. See for example the last two attachments of
https://www.mersenneforum.org/showpo...19&postcount=5, right columns; the rate is constant within ±20% over 2M-64M fft length. (Numerous processor types have been exhaustively benchmarked and posted in that thread.) Large multiprecision multiplication scales this way for some rather fundamental reasons; see Donald Knuth, Seminumerical Algorithms, or
https://www.mersenneforum.org/showpo...21&postcount=7
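To get a feel for that scaling, here is a rough sketch that evaluates the exponent * log(exponent) * log(log(exponent)) estimate relative to a reference exponent. The function name, the use of natural logs, and the sample exponents are my assumptions for illustration only; this is an order-of-growth model, not a prediction of actual iteration times.

```python
import math

def relative_work(exponent):
    """Rough cost model: p * log(p) * log(log(p)), natural logs assumed."""
    return exponent * math.log(exponent) * math.log(math.log(exponent))

base = 40_000_000  # hypothetical reference exponent
for p in (40_000_000, 80_000_000, 160_000_000):
    ratio = relative_work(p) / relative_work(base)
    print(f"exponent {p:>11,}: about {ratio:.2f}x the work per iteration")
```

Doubling the exponent slightly more than doubles the estimated work per iteration, which is why iteration time rises with fft length.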
If it still doesn't make sense that iteration time depends on fft length or exponent, time yourself squaring a one-digit decimal number, then a 4-digit number, then a 10-digit number.
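If you would rather let the computer do the squaring, here is a small sketch with Python's built-in integers. Note the assumptions: the digit counts and repeat count are arbitrary, and Python's big-integer multiply is not the fft-based method the primality testing programs use, so only the trend (more digits, more time per squaring) carries over.

```python
import timeit

def time_squaring(digits, repeats=20):
    """Average seconds to square one integer with the given number of decimal digits."""
    n = 10**digits - 1  # a number made of `digits` nines
    return timeit.timeit(lambda: n * n, number=repeats) / repeats

# Arbitrary sizes chosen only to show the trend.
results = {d: time_squaring(d) for d in (1_000, 10_000, 100_000)}
for d, seconds in results.items():
    print(f"{d:>7,} digits: {seconds * 1e6:10.1f} microseconds per squaring")
```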
What is most efficient on a given system depends on system and processor details and on fft length. The optimal number of workers can change with exponent or fft length. Dual-Xeon systems do MUCH better with two or more workers than with one; total throughput with a single worker on the Knights Landing I'm benchmarking now is positively dreadful (less than 10% of maximum at some fft lengths).
Hyperthreading is usually not an advantage in fft-based multiplication, though in some cases it helps.
Benchmarking the alternatives on your own system is the right thing to do.
Welcome to the forum. And the learning curve.