2021-01-12, 21:02   #65
ewmayer
Sep 2002
República de California

2×5,813 Posts


The self-tests are intended to do two things:

[1] Check correctness of the compiled code;

[2] Find the best-performing combination of radices for each FFT length on the user's platform. That means trying each combination of radices available for assembling each FFT length and picking the one which runs fastest. The one exception is when the fastest combo shows unacceptably high levels of roundoff error (ROE), in which case the combo which runs fastest *and* has acceptable ROE levels is picked instead. The winner gets stored to the mlucas.cfg file.
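The selection rule in [2] can be sketched roughly as follows. This is only a Python illustration of the logic described above, not the program's actual code: the `Candidate` record, the `pick_radix_combo` name and the `max_roe` threshold parameter are all hypothetical, and the real ROE acceptance criterion is not spelled out in this post.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One timed radix combination (hypothetical record, not an Mlucas type)."""
    radices: Tuple[int, ...]   # radices used to assemble the FFT length
    msec_per_iter: float       # measured timing for this combo
    roe: float                 # roundoff error level seen during the timing run


def pick_radix_combo(candidates: Sequence[Candidate],
                     max_roe: float) -> Optional[Candidate]:
    """Return the fastest candidate with acceptable ROE, or None if none qualifies.

    `max_roe` is an assumed, caller-supplied threshold.
    """
    acceptable = [c for c in candidates if c.roe <= max_roe]
    if not acceptable:
        return None
    # Among the acceptable combos, the lowest msec/iter wins.
    return min(acceptable, key=lambda c: c.msec_per_iter)
```

Filtering on ROE first and then taking the minimum timing gives the same answer as "take the fastest unless its ROE is too high", since a fast combo with bad ROE never enters the comparison.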

The mlucas.cfg file is read at the start of each LL or PRP test: for the current exponent being tested, the program computes the default FFT length based on expected levels of roundoff error, then reads the radix-combo data for that FFT length from mlucas.cfg and uses those FFT radices for the run.

The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try. Some examples:

o On my Intel Haswell quad, there are 4 physical cores, no hyperthreading: run self-tests with '-s m -cpu 0:3' to use all 4 cores;

o On my Intel Broadwell NUC mini, there are 2 physical cores, but with hyperthreading: I ran self-tests with '-s m -cpu 0:1' to use just the 2 physical cores, then 'mv mlucas.cfg mlucas.cfg.2' to keep those timings from getting mixed up with the next self-test. Next I ran with '-s m -cpu 0:3' to use all 4 cores (2 physical, 2 logical), then 'mv mlucas.cfg mlucas.cfg.4'. Comparing the msec/iter numbers between the 2 files showed the latter set of timings to be 5-10% faster, meaning the hyperthreading was beneficial, so that's the run mode I use: 'ln -s -f mlucas.cfg.4 mlucas.cfg' links the .4-renamed cfg-file to the name 'mlucas.cfg' which the code looks for at runtime. Then I queue up some work using the script and fire up the program using the flag '-cpu 0:3'.
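The "compare the msec/iter numbers between the 2 files" step can be done by eye, but here is a minimal Python sketch of the arithmetic. The function name `percent_faster` is made up for illustration, and the inputs are plain dicts mapping FFT length to msec/iter that you fill in yourself from the two cfg files; no parsing is shown because the cfg file layout is not described in this post.

```python
from typing import Dict


def percent_faster(baseline: Dict[int, float],
                   candidate: Dict[int, float]) -> Dict[int, float]:
    """Per-FFT-length speedup of `candidate` over `baseline`, in percent.

    Keys are FFT lengths, values are msec/iter (assumed positive).
    A positive result means `candidate` is faster. FFT lengths that
    appear in only one of the two inputs are skipped.
    """
    common = sorted(set(baseline) & set(candidate))
    return {n: 100.0 * (baseline[n] - candidate[n]) / baseline[n]
            for n in common}
```

With the timings from mlucas.cfg.2 as `baseline` and those from mlucas.cfg.4 as `candidate`, values in the 5 to 10 range would correspond to the hyperthreading benefit described above.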

On manycore and multisocket systems, finding the run mode which gives the best total throughput takes a bit more work, but "don't split runs across sockets" is rule #1. So first find the way to max out throughput on an individual socket, then duplicate that setup on socket 2 by incrementing the low:high indices following the -cpu flag appropriately.
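As a rough sketch of "incrementing the low:high indices", the Python helper below builds one low:high string per socket. The name `cpu_ranges` is hypothetical, and it covers only the simplest case of one run per socket using all of that socket's cores. It also assumes core indices are numbered contiguously socket by socket, which is not true on every machine, so check your own core numbering first.

```python
from typing import List


def cpu_ranges(cores_per_socket: int, n_sockets: int) -> List[str]:
    """Build one 'low:high' string per socket, for use after the -cpu flag.

    ASSUMPTION: socket 0 owns cores 0..cores_per_socket-1, socket 1 owns
    the next cores_per_socket indices, and so on. Verify this against
    your machine's actual numbering before relying on it.
    """
    if cores_per_socket < 1 or n_sockets < 1:
        raise ValueError("need at least one core and one socket")
    ranges = []
    for socket in range(n_sockets):
        low = socket * cores_per_socket
        high = low + cores_per_socket - 1   # high index is inclusive, as in 0:3
        ranges.append(f"{low}:{high}")
    return ranges
```

For a hypothetical 2-socket box with 4 cores per socket, this yields '0:3' for the run on socket 1 and '4:7' for the run on socket 2, so neither run is split across sockets.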

Regarding your other observations:

o It's not surprising that all of the radix sets that appear in your mlucas.cfg when running -cpu 0:3 have a leading radix evenly divisible by NTHREADS*2. Like the runtime warning says, if that does not hold (say radix0 = 12 and 4 threads using -cpu 0:3), it will generally hurt performance: such combos run more slowly due to suboptimal thread utilization, and will nearly always be bested by one or more radix combos which satisfy the divisibility criterion. This is nothing the user need worry about; it's all automated, and whichever combo runs fastest appears in the cfg file.

o The reason the self-tests with 4 threads (-cpu 0:3) take longer than you expected is that for 4 or more threads the default #iters used for each timing test gets raised from 100 to 1000, in order to get a more accurate timing sample. You can override that by specifying -iters 100 for such tests.
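Both observations above boil down to simple rules, sketched here in Python. The function names are made up for illustration and this is not the program's own code; it just restates the divisibility criterion (leading radix evenly divisible by NTHREADS*2) and the default iteration counts (100, raised to 1000 for 4 or more threads) as given above.

```python
def satisfies_radix_criterion(radix0: int, nthreads: int) -> bool:
    """True if the leading radix is evenly divisible by NTHREADS*2."""
    if radix0 < 1 or nthreads < 1:
        raise ValueError("radix0 and nthreads must be positive")
    return radix0 % (2 * nthreads) == 0


def default_timing_iters(nthreads: int) -> int:
    """Default #iters per timing test as described in the post."""
    if nthreads < 1:
        raise ValueError("nthreads must be positive")
    return 1000 if nthreads >= 4 else 100
```

For the example in the first observation, radix0 = 12 with 4 threads fails the check since 12 is not a multiple of 8. The second function shows why a 4-thread self-test does 10 times as many iterations per timing as a 2-thread one unless -iters 100 is given.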

Cheers, and have fun,