2021-01-16, 22:23   #71
ewmayer
Sep 2002
República de California

3×3,877 Posts

Addendum: OK, I think the roadmap needs to look something like this - abbreviation-wise, 'c' refers to physical cores, 't' to threadcount:

1. Based on the user's HW topology, identify a set of 'most likely to succeed' core/thread combos, like tdulcet did in his above post. For x86 this needs to take into account the different core-numbering conventions used by Intel and AMD;

2. For each combo in [1], run the automated self-tests, and save the resulting mlucas.cfg file under a unique name, e.g. for 4c/8t call it mlucas.cfg.4c.8t;

3. The various cfg-files hold the best FFT-radix combo to use at each FFT length for the given c/t combo, i.e. in terms of maximizing total throughput on the user's system we can focus on just those. So let's take a hypothetical example: Say on my 8c/16t AMD processor the round of self-tests in [2] has shown that, when using just 1 physical core, 1c2t is 10% faster than 1c1t. We now need to see how 1c2t scales to all physical cores, across the various FFT lengths in the self-test. E.g. at FFT length 4096K, say the best radix combo found for 1c2t is 64,32,32,32 (note the product of those = 2048K rather than 4096K because, to match general GIMPS convention, "FFT length" refers to #doubles, but Mlucas uses an underlying complex FFT, so the individual radices are complex and refer to pairs-of-doubles).

So we next want to fire up 8 separate 1c2t jobs at 4096K, each using that radix combo and running on a distinct physical core, thus our 8 jobs would use -cpu flags (I used AMD for my example to avoid the comma confusion Intel's convention would cause here) 0:1, 2:3, 4:5, 6:7, 8:9, 10:11, 12:13 and 14:15, respectively.

I would further like to specify the foregoing radix combo via the -radset flag, but here we hit a small snag: at present, there is no way to specify an actual radix combo. Instead one must find the target FFT length in the big case() table in get_fft_radices.c and match the desired radix combo to a case-index. For 4096K, we see 64,32,32,32 maps to 'case 7', so we'd use -radset 7 for each of our 8 launch-at-same-time jobs. I may need to do some code-fiddling to make that less awkward.
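To make the bookkeeping in that example concrete, here is a minimal Python sketch. The helper names (amd_cpu_args, fft_length_in_k) are hypothetical and not part of Mlucas, and the numbering assumes the AMD-style layout where the logical CPUs of physical core k are consecutive. It does not launch anything; it only builds the per-job -cpu strings and checks the radix arithmetic.

```python
from math import prod


def amd_cpu_args(n_cores, threads_per_core=2):
    """Return one -cpu argument string per physical core, e.g. '0:1', '2:3'.

    Assumption: AMD-style numbering, i.e. the logical CPUs belonging to
    physical core k are consecutive and start at k * threads_per_core.
    """
    args = []
    for core in range(n_cores):
        lo = core * threads_per_core
        hi = lo + threads_per_core - 1
        # With one thread per core there is no range to write; a bare index
        # is used here (an assumption about the accepted syntax).
        args.append(str(lo) if hi == lo else f"{lo}:{hi}")
    return args


def fft_length_in_k(radices):
    """FFT length in Kdoubles implied by a list of complex radices.

    The product of the radices counts complex elements (pairs of doubles),
    so the length in doubles is twice that product.
    """
    return 2 * prod(radices) // 1024


print(amd_cpu_args(8))
print(fft_length_in_k([64, 32, 32, 32]))
```

For 8 cores at 2 threads each this reproduces the eight -cpu values listed above, and the radix set 64,32,32,32 comes out as 4096K doubles.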

Anyhow, since we're now using just 1 radix combo at each FFT length and we want a decent timing sample not dominated by start-up init and thread-management overhead, we might use -iters 1000 for each of our 8 jobs. Launched at more or less the same time, the jobs will yield a range of msec/iter timings t0-t7, which we convert into total throughput in iters/sec via 1000*(1/t0+1/t1+1/t2+1/t3+1/t4+1/t5+1/t6+1/t7). Repeat for each FFT length of interest, generating a set of total throughput numbers.
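The throughput conversion can be sketched as follows. The function name total_throughput is hypothetical, and the timings are made-up placeholders rather than measurements.

```python
def total_throughput(msec_per_iter):
    """Total iters/sec for jobs running simultaneously.

    A job taking t msec per iteration completes 1000/t iterations per
    second, so the total is 1000 * sum(1/t) over all jobs.
    """
    if any(t <= 0 for t in msec_per_iter):
        raise ValueError("timings must be positive")
    return 1000.0 * sum(1.0 / t for t in msec_per_iter)


# Made-up msec/iter values for 8 simultaneous jobs (illustration only).
timings = [20.0, 20.0, 25.0, 25.0, 20.0, 25.0, 20.0, 25.0]
print(total_throughput(timings))
```

With these placeholder values, four jobs contribute 50 iters/sec each and four contribute 40 iters/sec each, for roughly 360 iters/sec in total.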

4. Repeat [3] for each c/t combo in [1]. It may well prove the case that no single c/t combo gives the best total throughput across all FFT lengths, but for a first cut it seems best to generate some kind of weighted average across all FFT lengths for each c/t combo and pick the best one. In [3] we generated total-throughput iters/sec numbers at each FFT length; one option is to multiply each by its corresponding FFT length and sum over all FFT lengths.
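One possible reading of that weighting, as a Python sketch: score each c/t combo by the sum of (iters/sec times FFT length) and take the maximum. The names weighted_score and best_combo are hypothetical, the dictionary layout is an assumption, and the numbers are placeholders, not benchmark results.

```python
def weighted_score(throughput_by_fft):
    """Sum of (FFT length in K) * (total iters/sec) over all FFT lengths."""
    return sum(fft_k * rate for fft_k, rate in throughput_by_fft.items())


def best_combo(results):
    """Return the c/t label whose weighted score is largest."""
    return max(results, key=lambda combo: weighted_score(results[combo]))


# Placeholder data: {combo label: {FFT length in K: total iters/sec}}.
results = {
    "1c1t": {2048: 700.0, 4096: 330.0},
    "1c2t": {2048: 720.0, 4096: 360.0},
}
print(best_combo(results))
```

Weighting by FFT length keeps the small, fast FFT lengths from dominating the sum purely because they run more iterations per second; other weightings would fit the same structure.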

Last fiddled with by ewmayer on 2021-01-16 at 22:24