@tdulcet: Glad to be of service to someone else who wants to be of service, or something. :)

o Re. KNL, yes I have a barebones one sitting next to me and running a big 64M-FFT primality test, 1 thread on each of physical cores 0:63. On KNL I've never found any advantage from running this kind of code with more than 1 thread per physical core.

o One of your timing samples above mentioned getting a nearly 2x speedup from running 2 threads on 1 physical core, with the other cores unused. I suspect that may be the OS actually putting 1 thread on each of 2 physical cores. Remember, those pthread affinity settings are treated as *hints* to the OS; we hope that under heavy load the OS will respect them, because then there are no otherwise-idle physical cores it can bounce threads to.

o You mentioned the mi64.c missing-x86-preprocessor-flag-wrapper was keeping you from building on your Raspberry Pi - that was even with -O3? And did you as a result just use the precompiled Arm/Linux binaries on that machine?
