2021-01-14, 13:21   #69
"Teal Dulcet"
Jun 2018

368 Posts

Originally Posted by pvn
Dockerfile etc can be found here:

I will review your script as well, it looks like you've thought a lot more about this than I have :)
Nice! Thanks. With my script you should be able to compile Mlucas on demand: since it uses a parallel Makefile with one job for each CPU thread, it should take only a couple of minutes or less to compile on most systems. It uses the -march=native compile flag on x86 systems, so the resulting binaries should also be slightly faster, although they are generally not portable. What was the issue you had building on ARM?
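The build step described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual script: the function name build_command is hypothetical, and it assumes the generated Makefile honors a CFLAGS variable. It uses os.cpu_count(), which counts logical CPUs (threads), for the job count, and adds -march=native only when the machine is x86.

```python
import os
import platform
from typing import List, Optional

# x86 machine names as reported by platform.machine() on common OSes
X86_MACHINES = {"x86_64", "amd64", "i386", "i486", "i586", "i686", "x86"}


def build_command(machine: Optional[str] = None, jobs: Optional[int] = None) -> List[str]:
    """Return a (hypothetical) make invocation with one job per CPU thread.

    -march=native is added only on x86: it tunes the binaries for the build
    host, so they are usually faster there but not portable to other CPUs.
    """
    machine = (machine or platform.machine()).lower()
    # os.cpu_count() counts logical CPUs (hardware threads); it may return None
    jobs = jobs or os.cpu_count() or 1
    cflags = ["-O3"]
    if machine in X86_MACHINES:
        cflags.append("-march=native")
    return ["make", f"-j{jobs}", "CFLAGS=" + " ".join(cflags)]
```

On a real system the returned list could be handed to subprocess.run(build_command(), check=True); passing machine and jobs explicitly just makes the function easy to check on any host.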

There is a longstanding issue with 32-bit ARM, where GCC hangs when compiling the mi64.c file. If you remove the -O3 optimization, you get these errors instead:
../src/mi64.c: In function ‘mi64_shl_short’:
../src/mi64.c:1038:2: error: unknown register name ‘rsi’ in ‘asm’
__asm__ volatile (\
../src/mi64.c:1038:2: error: unknown register name ‘rcx’ in ‘asm’
../src/mi64.c:1038:2: error: unknown register name ‘rbx’ in ‘asm’
../src/mi64.c:1038:2: error: unknown register name ‘rax’ in ‘asm’
../src/mi64.c: In function ‘mi64_shrl_short’:
../src/mi64.c:1536:2: error: unknown register name ‘rsi’ in ‘asm’
__asm__ volatile (\
../src/mi64.c:1536:2: error: unknown register name ‘rcx’ in ‘asm’
../src/mi64.c:1536:2: error: unknown register name ‘rbx’ in ‘asm’
../src/mi64.c:1536:2: error: unknown register name ‘rax’ in ‘asm’
Originally Posted by ewmayer
(Basically, there's just no good reason to omit the above flag anymore).
OK, thanks for the info. That is what I thought. I just wanted to make sure that there was not some edge case where my script should omit the flag.

Originally Posted by ewmayer
Re. some kind of script to automate the self-testing using various suitable candidate -cpu arguments, that would indeed be useful. George uses the freeware hwloc library in his Prime95 code to suss out the topology of the machine running the code - I'd considered using it for my own as well in the past, but had seen a few too many threads that boiled down to "hwloc doesn't work properly on my machine" and needing some intervention re. that library by George for my taste. In any event, let me think on it more, and perhaps some playing-around with that library by those of you interested in this aspect would be a good starting point.
OK, I was just thinking that there might be some procedure my script could use, given the CPU (Intel, AMD or ARM), the number of CPU cores and the number of CPU threads, to generate all the candidate combinations for the -cpu argument that could realistically give the best performance. It could then benchmark each candidate (as described in the two examples of your previous post) and pick the one with the best performance.
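The try-and-compare step could look something like this minimal Python sketch. The names pick_best and measure are hypothetical: measure stands in for whatever runs the Mlucas timing test with a given -cpu argument, and it is assumed to return a whole-machine throughput figure (higher is better). That assumption matters because a candidate using fewer threads than the machine has would presumably be run as several instances side by side, so raw single-instance timings would not be comparable.

```python
from typing import Callable, Dict, Iterable, Tuple


def pick_best(candidates: Iterable[str],
              measure: Callable[[str], float]) -> Tuple[str, float]:
    """Benchmark every candidate -cpu argument and return the best one.

    `measure` is a hypothetical callback: it should run the timing test with
    the given -cpu argument and return whole-machine throughput (higher is
    better), e.g. iterations per second summed over all running instances.
    """
    results: Dict[str, float] = {}
    for cpu_arg in candidates:
        results[cpu_arg] = measure(cpu_arg)
    if not results:
        raise ValueError("no candidates to benchmark")
    # On a tie, max() keeps the first candidate tested
    best = max(results, key=results.__getitem__)
    return best, results[best]
```

Keeping the benchmark behind a callback lets the selection logic stay the same whether the script times full self-tests or a shorter run.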

Based on the "Advanced Users" and "Advanced Usage" sections of the Mlucas README, this is my best guess at the candidate combinations to try with the -cpu argument on an example 8 core/16 thread system and on an 8 core/8 thread system:

(8 core/16 thread, logical CPUs i and i+8 share a core)
0     (1-threaded)
0:1     (2-threaded)
0:3     (4-threaded)
0:7     (8-threaded)
0:15     (16-threaded)
0,8     (2 threads per core, 1-threaded) (current default)
0:1,8:9     (2 threads per core, 2-threaded)
0:3,8:11     (2 threads per core, 4-threaded)

(8 core/16 thread, logical CPUs 2i and 2i+1 share a core)
0     (1-threaded)
0:3:2     (2-threaded)
0:7:2     (4-threaded)
0:15:2     (8-threaded)
0:1     (2 threads per core, 1-threaded) (current default)
0:3     (2 threads per core, 2-threaded)
0:7     (2 threads per core, 4-threaded)
0:15     (2 threads per core, 8-threaded)

(8 core/8 thread)
0     (1-threaded)
0:3     (4-threaded) (current default)
0:7     (8-threaded)
I am not sure if these are all the combinations worth testing or if we could rule any of them out.
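Rather than hard-coding these lists, a script could generate them. The following is a minimal Python sketch, assuming the -cpu syntax used in the lists above (comma-separated terms lo, lo:hi or lo:hi:stride, with hi inclusive) and only the two sibling numberings the 16-thread lists imply; the function names and the "split"/"adjacent" layout labels are hypothetical. It reproduces the two 8 core/16 thread lists exactly, while for 8 core/8 thread it also emits 0:1 (2-threaded), which the third list leaves out.

```python
from typing import List


def _rng(lo: int, hi: int) -> str:
    """Format one -cpu term: "lo" for a single CPU, else "lo:hi" (hi inclusive)."""
    return str(lo) if lo == hi else f"{lo}:{hi}"


def core_counts(cores: int) -> List[int]:
    """Powers of two below `cores`, plus `cores` itself (8 -> [1, 2, 4, 8])."""
    sizes, k = [], 1
    while k < cores:
        sizes.append(k)
        k *= 2
    sizes.append(cores)
    return sizes


def candidate_cpu_args(cores: int, threads_per_core: int, layout: str) -> List[str]:
    """Generate candidate -cpu strings.

    layout "split":    sibling threads of core i are logical CPUs i and i+cores
    layout "adjacent": sibling threads of core i are logical CPUs 2i and 2i+1
    """
    sizes = core_counts(cores)
    if threads_per_core == 1:
        return [_rng(0, k - 1) for k in sizes]
    if threads_per_core != 2:
        raise ValueError("this sketch only handles 1 or 2 threads per core")
    if layout == "split":
        one_per_core = [_rng(0, k - 1) for k in sizes]
        all_threads = [_rng(0, 2 * cores - 1)]
        # Both threads on every core is the same set as all_threads, so skip it
        two_per_core = [f"{_rng(0, k - 1)},{_rng(cores, cores + k - 1)}"
                        for k in sizes[:-1]]
        return one_per_core + all_threads + two_per_core
    if layout == "adjacent":
        one_per_core = ["0" if k == 1 else f"0:{2 * k - 1}:2" for k in sizes]
        two_per_core = [_rng(0, 2 * k - 1) for k in sizes]
        return one_per_core + two_per_core
    raise ValueError(f"unknown layout: {layout!r}")


def expand_cpu_arg(cpu_arg: str) -> List[int]:
    """Expand a -cpu string (terms lo, lo:hi or lo:hi:stride) into logical CPU ids."""
    cpus: List[int] = []
    for term in cpu_arg.split(","):
        fields = [int(f) for f in term.split(":")]
        lo = fields[0]
        hi = fields[1] if len(fields) > 1 else lo
        step = fields[2] if len(fields) > 2 else 1
        cpus.extend(range(lo, hi + 1, step))
    return cpus
```

expand_cpu_arg is a handy cross-check on the hand-written lists: for example, it turns 0:15:2 into CPUs 0, 2, ..., 14, one thread on each of the 8 cores.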

Last fiddled with by tdulcet on 2021-01-14 at 13:22