2021-01-17, 16:05   #72
tdulcet
 
"Teal Dulcet"
Jun 2018

29 Posts

Quote:
Originally Posted by ewmayer
@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/llvm buildability on Arm, problem was first IDed on the new Apple M1 CPU but is more general), alas no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)
No problem. I will look forward to your new v19.1 release.

Quote:
Originally Posted by ewmayer
First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code, as the mods in that file are small I will attach it here
Thanks for the fix. This had been preventing me from running Mlucas on my Raspberry Pis for a couple of years, so it is great that it will now work.

Quote:
Originally Posted by ewmayer
Re. your core/thread-combos-to-try on an example 8c/16t system, those look correct. The remaining trick, though, is figuring out which of the most promising c/t combos give the best total-throughput on the user's system. For example - sticking to just 1-thread-per-physical-core for the moment - we expect 1t to run roughly 2x slower than 2t. Say the ratio is 1.8, and the user has an 8-core system. The real question is, how does the total-throughput compare for 8x1t jobs versus 4x2t?
Yes, this will be difficult. I implemented a preliminary version that follows the instructions in the Mlucas README. Specifically, it multiplies the 4x2t msec/iter times by 1.5 before comparing them. Multiplying by 2 would of course produce different results in this case.
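To make the 8x1t versus 4x2t question from the quote concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration, not the preliminary implementation mentioned above: the function name `total_throughput` and the 10 msec/iter starting value are made up, and it assumes each worker keeps its standalone msec/iter when all workers run at once, which ignores exactly the kind of scaling effects that make this hard in practice.

```python
def total_throughput(workers, msec_per_iter):
    """Iterations per second summed over all simultaneously running workers.

    Assumption: every worker sustains the same msec/iter it shows when
    measured on its own (no memory-bandwidth or cache contention).
    """
    return workers * 1000.0 / msec_per_iter


# Hypothetical timing: a 2-thread job at 10 msec/iter.
t2 = 10.0
# Ratio from the quoted example: the 1-thread job is 1.8x slower.
t1 = 1.8 * t2

eight_by_one = total_throughput(8, t1)  # 8 workers, 1 thread each
four_by_two = total_throughput(4, t2)   # 4 workers, 2 threads each

# Ratio > 1 means 8x1t gets more total work done than 4x2t.
ratio = eight_by_one / four_by_two
print(f"8x1t: {eight_by_one:.1f} iter/s, 4x2t: {four_by_two:.1f} iter/s, ratio {ratio:.3f}")
```

Under these assumptions the ratio is 2 / 1.8, so 8x1t would come out about 11% ahead; with a measured ratio of exactly 2 the two layouts would tie.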

Quote:
Originally Posted by ewmayer
Addendum: OK, I think the roadmap needs to look something like this
Wow, thanks for the detailed roadmap, it is very helpful!

1. OK, I wrote Bash code to automatically generate the combinations from my previous post above, based on the user's CPU and its number of CPU cores/threads. It generates a table like one of these, for example:
Code:
The CPU is Intel.
#  Workers/Runs  Threads         -cpu arguments
1  1             16, 2 per core  0:15
2  1             8, 1 per core   0:7
3  2             4, 1 per core   0:3 4:7
4  4             2, 1 per core   0:1 2:3 4:5 6:7
5  8             1, 1 per core   0 1 2 3 4 5 6 7
6  2             4, 2 per core   0:3,8:11 4:7,12:15
7  4             2, 2 per core   0:1,8:9 2:3,10:11 4:5,12:13 6:7,14:15
8  8             1, 2 per core   0,8 1,9 2,10 3,11 4,12 5,13 6,14 7,15

The CPU is AMD.
#  Workers/Runs  Threads         -cpu arguments
1  1             16, 2 per core  0:15
2  1             8, 1 per core   0:15:2
3  2             4, 1 per core   0:7:2 8:15:2
4  4             2, 1 per core   0:3:2 4:7:2 8:11:2 12:15:2
5  8             1, 1 per core   0 2 4 6 8 10 12 14
6  2             4, 2 per core   0:7 8:15
7  4             2, 2 per core   0:3 4:7 8:11 12:15
8  8             1, 2 per core   0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15

The CPU is ARM.
#  Workers/Runs  Threads  -cpu arguments
1  1             8        0:7
2  2             4        0:3 4:7
3  4             2        0:1 2:3 4:5 6:7
4  8             1        0 1 2 3 4 5 6 7
The combinations are the same as my previous post above, except I added a 2-threaded combination for ARM and the ordering is different.
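As an illustration of how the -cpu arguments in the tables above follow from the core count, here is a minimal Python sketch. It is not the Bash code mentioned above. The function name `cpu_args` and the layout labels `"split"` and `"adjacent"` are hypothetical, and the logical CPU numbering is an assumption read off the tables: on the Intel example the second hardware thread of core i is CPU i + cores, while on the AMD example the two threads of a core are neighbouring CPU numbers. Real systems may number their CPUs differently.

```python
def cpu_args(cores, workers, per_core, layout):
    """Return one -cpu argument string per worker.

    cores:    number of physical cores
    workers:  number of simultaneous runs (must divide cores)
    per_core: 1 or 2 hardware threads used on each core
    layout:   "split"    -> sibling of CPU i is i + cores (Intel table, ARM)
              "adjacent" -> siblings are CPUs 2i and 2i + 1 (AMD table)
    """
    c = cores // workers  # physical cores given to each worker
    args = []
    for k in range(workers):
        if layout == "adjacent":
            lo = 2 * k * c
            hi = lo + 2 * c - 1
            if per_core == 2:
                args.append(f"{lo}:{hi}")        # both threads of each core
            elif c == 1:
                args.append(str(lo))             # a single CPU
            else:
                args.append(f"{lo}:{hi}:2")      # stride 2 skips the siblings
        else:
            lo = k * c
            hi = lo + c - 1
            first = str(lo) if c == 1 else f"{lo}:{hi}"
            if per_core == 1:
                args.append(first)
            elif workers == 1:
                args.append(f"0:{2 * cores - 1}")  # the two halves are contiguous
            else:
                second = str(lo + cores) if c == 1 else f"{lo + cores}:{hi + cores}"
                args.append(f"{first},{second}")
    return args


# Row 6 of the Intel table: 2 workers, 4 cores each, 2 threads per core.
print(" ".join(cpu_args(8, 2, 2, "split")))
# Row 3 of the AMD table: 2 workers, 4 cores each, 1 thread per core.
print(" ".join(cpu_args(8, 2, 1, "adjacent")))
```

The ARM table corresponds to the `"split"` layout with `per_core=1`, since there is no second hardware thread to place.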

2. Done.
3./4. Interesting, this is going to be a lot more complex to implement than I originally thought. The switch statement in get_fft_radices.c is too big to store in my script, and creating an awk command to extract the case number based on the FFT length and radix combo would obviously be extremely difficult, particularly because there are nested switch statements. I am going to have to think about how best to do this... I welcome suggestions from anyone who is reading this. In the meantime, I wrote code to directly compare the adjusted msec/iter times from the mlucas.cfg files from step #2. This of course does not account for any of the scaling issues that @ewmayer described. It generates two tables (the fastest combination and the rest of the combinations tested) like these, for example, for my 6-core/12-thread Intel system:
Code:
Fastest
#  Workers/Runs  Threads        First -cpu argument  Adjusted msec/iter times
6  6             1, 2 per core  0,6                  8.47  9.69  10.72  12.26  12.71  14.53  14.76  16.54  16.1  18.89  20.94  23.94  26.39  28.85  29.16  32.98

Mean/Average faster     #  Workers/Runs  Threads         First -cpu argument  Adjusted msec/iter times
3.248 ± 0.101 (324.8%)  1  1             12, 2 per core  0:11                 28.92  31.74  33.78  38.64  42.66  44.52  46.56  51.06  52.26  61.8  70.14  79.38  88.62  92.94  97.26  106.2
3.627 ± 0.146 (362.7%)  2  1             6, 1 per core   0:5                  34.14  34.8  39.66  45.48  47.28  51.18  51.3  56.64  60.78  66.24  73.02  87.78  96.12  102.6  108.12  116.22
2.607 ± 0.068 (260.7%)  3  2             3, 1 per core   0:2                  22.98  25.53  27.3  30.12  33.63  37.83  36.72  42.15  42.69  48.9  54.66  61.68  71.19  76.44  77.49  87.36
1.736 ± 0.029 (173.6%)  4  6             1, 1 per core   0                    14.41  17.1  18.46  20.88  22.72  24.99  25.36  28.38  28.82  32.67  36.06  40.67  46.12  49.87  51.32  58.26
1.816 ± 0.047 (181.6%)  5  2             3, 2 per core   0:2,6:8              16.11  18.09  19.32  21.99  23.64  25.41  25.92  29.85  30.57  33.78  38.7  42.42  48.57  51.12  52.53  59.19
The two tables show, for example, that 1-threaded with 2 threads per core is ~1.7 times faster than 1-threaded with 1 thread per core.
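The ~1.7 figure can be checked from the two tables with a few lines of Python. This is only a sketch: it assumes the "Mean/Average faster" column is the mean of the per-FFT-length ratios between a combination's adjusted times and the fastest combination's times, and that the ± value is a standard deviation of those ratios (the post does not say which kind, so the population form is used here as a guess). The timings are copied from rows 6 and 4 above.

```python
from statistics import mean, pstdev

# Adjusted msec/iter times for combination #6 (the fastest) ...
fastest = [8.47, 9.69, 10.72, 12.26, 12.71, 14.53, 14.76, 16.54,
           16.1, 18.89, 20.94, 23.94, 26.39, 28.85, 29.16, 32.98]
# ... and for combination #4 (6 workers, 1 thread per core).
other = [14.41, 17.1, 18.46, 20.88, 22.72, 24.99, 25.36, 28.38,
         28.82, 32.67, 36.06, 40.67, 46.12, 49.87, 51.32, 58.26]

# One ratio per FFT length: how many times slower #4 is than #6.
ratios = [o / f for o, f in zip(other, fastest)]

avg = mean(ratios)       # comes out at about 1.736, matching the table row
spread = pstdev(ratios)  # assumed to correspond to the table's "±" value

print(f"{avg:.3f} ± {spread:.3f} ({avg:.1%})")
```

The same loop over the other rows would give their "Mean/Average faster" values, under the same assumptions.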