Hi Ernst, thanks for looking at this, and apologies for the delays on my end.
Quote:
Do you recall which precise radix set you saw the warning at in your case? To see it for 4 threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12,20,28,36,44,52,60. That's no problem, it just means that in using the selftests to create the mlucas.cfg file for your particular cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT radix combo at the given FFT length to run best, which will be reflected in the corresponding mlucas.cfg file entry.

Does this mean that the selftest run is taking longer because it's, in effect, weeding out the unsuitable radices? I think this makes sense given what I see in the resulting cfg files: at any given FFT length, the msec/iter scales (roughly) with the number of cores used, even when the 4-core selftest takes unexpectedly long overall.
Also, it seems important to note that all of the radices that actually get saved in the mlucas.cfg when running cpu 0:3 are evenly divisible by NTHREADS*2 (in this case, NTHREADS=4).
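For what it's worth, here is a minimal sketch of the condition as the warning text states it (radix0/2 not exactly divisible by NTHREADS). The helper name hurts_performance is hypothetical and not taken from the Mlucas source, and the candidate range (multiples of 4 from 8 to 60) is only my assumption about which small leading radices are worth checking:

```python
# Hypothetical helper, NOT from the Mlucas source: it only mirrors the
# condition named in the warning message.
def hurts_performance(radix0: int, nthreads: int) -> bool:
    """Return True when radix0/2 is not exactly divisible by nthreads."""
    return (radix0 // 2) % nthreads != 0

# Assumption: candidate leading radices are the multiples of 4 from 8 to 60.
candidates = range(8, 64, 4)
flagged = [r for r in candidates if hurts_performance(r, 4)]
print(flagged)  # [12, 20, 28, 36, 44, 52, 60]
```

Under that assumption, the flagged values for 4 threads are exactly the radix0 values listed in the quote, and the unflagged ones are the multiples of 8, i.e. of NTHREADS*2.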
Here's some of the output with the radix sets that gave the "This will hurt performance" message (these runs seem to take about 50% more time than the other runs at the same FFT size):
M43765019: using FFT length 2304K = 2359296 8byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M48515021: using FFT length 2560K = 2621440 8byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M53254447: using FFT length 2816K = 2883584 8byte floats, initial residue shift count = 35280290
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M53254447: using FFT length 2816K = 2883584 8byte floats, initial residue shift count = 23722047
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M62705077: using FFT length 3328K = 3407872 8byte floats, initial residue shift count = 61480382
this gives an average 18.400068136361931 bits per digit
Using complex FFT radices 52 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M67417873: using FFT length 3584K = 3670016 8byte floats, initial residue shift count = 63290971
this gives an average 18.369912556239537 bits per digit
Using complex FFT radices 28 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M72123137: using FFT length 3840K = 3932160 8byte floats, initial residue shift count = 65799790
this gives an average 18.341862233479819 bits per digit
Using complex FFT radices 60 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M86198291: using FFT length 4608K = 4718592 8byte floats, initial residue shift count = 21266494
this gives an average 18.267799165513779 bits per digit
Using complex FFT radices 36 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M95551873: using FFT length 5120K = 5242880 8byte floats, initial residue shift count = 93620243
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M95551873: using FFT length 5120K = 5242880 8byte floats, initial residue shift count = 43929528
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M104884309: using FFT length 5632K = 5767168 8byte floats, initial residue shift count = 24783492
this gives an average 18.186449397693981 bits per digit
Using complex FFT radices 44 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M123493333: using FFT length 6656K = 6815744 8byte floats, initial residue shift count = 30371346
this gives an average 18.118833835308369 bits per digit
Using complex FFT radices 52 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M132772789: using FFT length 7168K = 7340032 8byte floats, initial residue shift count = 24638813
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M132772789: using FFT length 7168K = 7340032 8byte floats, initial residue shift count = 92450206
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
M142037359: using FFT length 7680K = 7864320 8byte floats, initial residue shift count = 90349695
this gives an average 18.060984166463218 bits per digit
Using complex FFT radices 60 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
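In case it helps with checking longer logs, here is a minimal sketch that pairs each warning with the most recent "Using complex FFT radices" line before it and prints the leading radix together with (radix0/2) mod 4. The function name flagged_radix_sets is my own, and the matching is based only on the line formats visible in the output above, so treat it as an assumption about the log format rather than anything official:

```python
import re

# Line patterns copied from the output above; the parsing itself is a guess
# at the format, not something documented by Mlucas.
WARNING = "radix0/2 not exactly divisible by NTHREADS"
RADICES = re.compile(r"Using complex FFT radices\s+((?:\d+\s*)+)")

def flagged_radix_sets(log_text):
    """Return the radix set most recently seen before each warning line."""
    flagged = []
    last = None
    for line in log_text.splitlines():
        match = RADICES.search(line)
        if match:
            last = [int(x) for x in match.group(1).split()]
        elif WARNING in line and last is not None:
            flagged.append(last)
            last = None
    return flagged

sample = """\
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS  This will hurt performance.
"""

for radices in flagged_radix_sets(sample):
    radix0 = radices[0]
    # A nonzero remainder means radix0/2 is not divisible by 4 threads.
    print(radix0, (radix0 // 2) % 4)
```

For the two sample entries this prints a remainder of 2 in both cases, consistent with the warning being triggered by the leading radix alone.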