Addendum: OK, I think the roadmap needs to look something like this. Abbreviation-wise, 'c' refers to physical cores and 't' to thread count:
1. Based on the user's HW topology, identify a set of 'most likely to succeed' core/thread combos, like tdulcet did in his above post. For x86 this needs to take into account the different core-numbering conventions used by Intel and AMD;
2. For each combo in [1], run the automated self-tests, and save the resulting mlucas.cfg file under a unique name, e.g. for 4c/8t call it mlucas.cfg.4c.8t;
3. The various cfg-files hold the best FFT-radix combo to use at each FFT length for the given c/t combo, i.e. in terms of maximizing total throughput on the user's system we can focus on just those. So let's take a hypothetical example: Say on my 8c/16t AMD processor the round of self-tests in [2] has shown that, using just 1 physical core, 1c2t is 10% faster than 1c1t. We now need to see how 1c2t scales to all physical cores, across the various FFT lengths in the self-test. E.g. at FFT length 4096K, say the best radix combo found for 1c2t is 64,32,32,32 (note the product of those = 2048K rather than 4096K because, to match general GIMPS convention, "FFT length" refers to #doubles, but Mlucas uses an underlying complex FFT, so the individual radices are complex and refer to pairs-of-doubles). So we next want to fire up 8 separate 1c2t jobs at 4096K, each using that radix combo and running on a distinct physical core, thus our 8 jobs would use -cpu flags (I used AMD for my example to avoid the comma confusion Intel's convention would cause here) 0:1, 2:3, 4:5, 6:7, 8:9, 10:11, 12:13 and 14:15, respectively. I would further like to specify the foregoing radix combo via the -radset flag, but here we hit a small snag: at present, there is no way to specify an actual radix combo. Instead one must find the target FFT length in the big case() table in get_fft_radices.c and match the desired radix combo to a case index. For 4096K, we see 64,32,32,32 maps to 'case 7', so we'd use -radset 7 for each of our 8 launch-at-same-time jobs. I may need to do some code-fiddling to make that less awkward.
Anyhow, since we're now using just 1 radix combo at each FFT length and we want a decent timing sample not dominated by startup-init and thread-management overhead, we might use -iters 1000 for each of our 8 jobs. Launched at more-or-less the same time, they will have a range of msec/iter timings t0-t7, which we convert into total throughput in iters/sec via 1000*(1/t0+1/t1+1/t2+1/t3+1/t4+1/t5+1/t6+1/t7). Repeat for each FFT length of interest, generating a set of total-throughput numbers.
4. Repeat [3] for each c/t combo in [1]. It may well prove the case that a single c/t combo does not give the best total throughput across all FFT lengths, but for a first cut it seems best to generate some kind of weighted average-across-all-FFT-lengths for each c/t combo and pick the best one. In [3] we generated total-throughput iters/sec numbers at each FFT length; maybe multiply each by its corresponding FFT length and sum over all FFT lengths.
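To make the bookkeeping in [3] and [4] concrete, here is a rough Python sketch, not anything that exists in Mlucas. The function names, the "adjacent" vs "strided" labels for the two core-numbering styles, and the comma form used for the strided case are all my own assumptions for illustration; the throughput formula and the FFT-length weighting are the ones described above.

```python
from typing import Dict, List


def cpu_args(n_cores: int, smt: int, threads_used: int, numbering: str) -> List[str]:
    """Build one affinity string per physical core, one job per core.

    numbering="adjacent": logical CPUs of core k are k*smt .. k*smt+smt-1
                          (the AMD-style layout used in the example above).
    numbering="strided":  logical CPUs of core k are k, k+n_cores, ...
                          (assumed Intel-style layout; hypothetical label).
    """
    if not 1 <= threads_used <= smt:
        raise ValueError("threads_used must be between 1 and smt")
    args = []
    for core in range(n_cores):
        if numbering == "adjacent":
            lo = core * smt
            hi = lo + threads_used - 1
            args.append(f"{lo}:{hi}" if hi > lo else str(lo))
        elif numbering == "strided":
            ids = (core + j * n_cores for j in range(threads_used))
            args.append(",".join(str(i) for i in ids))
        else:
            raise ValueError(f"unknown numbering convention: {numbering}")
    return args


def total_throughput(msec_per_iter: List[float]) -> float:
    """Sum of per-job rates in iters/sec: 1000*(1/t0 + 1/t1 + ...)."""
    return sum(1000.0 / t for t in msec_per_iter)


def weighted_score(throughput_by_fft_k: Dict[int, float]) -> float:
    """First-cut figure of merit for one c/t combo: sum over FFT lengths
    of (FFT length in K) * (total throughput at that length)."""
    return sum(fft_k * thr for fft_k, thr in throughput_by_fft_k.items())


if __name__ == "__main__":
    # 8c/16t, one 1c2t job per physical core, AMD-style numbering.
    print(cpu_args(8, 2, 2, "adjacent"))
    # Made-up timings (msec/iter) for 8 simultaneous jobs at one FFT length.
    timings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.3, 9.8, 10.0]
    print(total_throughput(timings))
```

One would call weighted_score once per c/t combo, feeding it the total_throughput result at each FFT length, and pick the combo with the largest score. The timings in the demo are invented purely to exercise the arithmetic.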
Last fiddled with by ewmayer on 2021-01-16 at 22:24
