2017-02-27, 00:37   #287
ewmayer (Sep 2002, República de California)
Originally Posted by Prime95
Perhaps this question belongs in the programming subforum:

Prime95 multithreading basically divides a pass of the FFT into big chunks, and each thread operates on its chunk independently. When running two threads on a hyperthreaded CPU, this strategy doubles the pressure on the L1, L2, and L3 caches. This is a major problem, as Prime95 is optimized to use a significant portion of these caches when running single-threaded.

My idea for a solution is to have the two hyperthreads run in a tightly-coupled manner, processing a single big chunk. To do this I need to break the big chunk into much finer pieces. Such finer granularity would require many more instances where the threads need to "sync up" (wait for the slower hyperthread to catch up).
That fine granularity kills it, IMO - my first line of attack would be to arrange things so that instead of 1 thread using all the available L1 and L2, each thread aims at half(ish) of that, allowing a companion hyperthread to share the resource, with just 1 sync needed.

I suspect we are already seeing the benefits of that in some fashion, on systems where using 2x as many threads as physical cores shows a gain. E.g. for my code, I get no gain from such a strategy on my 4-core Haswell, but on my 2-core Broadwell NUC using 4 threads gains me 5-10%. As I noted a few posts ago here, in my F33 timing tests on the KNL, going from 64 to 128 threads dropped the per-iteration time from 500 to 440 ms. (But I think the Haswell/Broadwell comparison is more salient for cache-based consumer CPUs, where there is no massive blob of superfast RAM acting as a giant L3 cache to save us from our sins at the L1 and L2 levels.)