2020-07-13, 19:26   #56
ewmayer

Quote (Originally Posted by chris2be8):
As you may be able to tell, I've had to use it on several platforms. But I prefer scp or sftp if they are available.
I always use scp when available; perhaps my expectations re. fs-path handling have been colored by that. But on this particular server (or perhaps given my remote-access privileges to it), only ftp is available.

2020-07-29, 16:48   #57
kriesel
PRP proof

Are you implementing patnashev's PRP proof generation in Mlucas?

2020-07-29, 20:17   #58
Uncwilly

Hadn't thought to ask that myself. If it had come to mind, I would have. It will be useful when we find the next candidate prime.

2020-07-29, 22:27   #59
ewmayer

PRP-proof support will be in v20, yes. I am alas behind the curve there - between the pandemic and a series of non-life-threatening but still frequently day-week-and-month-ruining health bugaboos, this year has been one of continual annoying distractions. And at the end of the month my housemates-of-2-years (a young professional couple who just bought a starter home in the area) are vacating the master-bedroom suite of our large shared apartment, so I have tons of busywork to do getting the place ready to show to prospective renters. What a year...

My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)

2020-07-30, 04:01   #60
kriesel

Quote (Originally Posted by ewmayer):
My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)
Low-power proofs are better than none. Standalone devices could drop to proof power 6 (or even 5 if necessary) and still save ~90+% of a DC.

Per https://mersenneforum.org/showpost.p...5&postcount=46, power 7 takes 1.5 GB of disk space for residues at p ≈ 100M. Since the Odroid runs Ubuntu and has GigE, why not pile residues onto a network-shared drive and clean them up after the proof file exists? A farm of Odroids, Pis, or phones could share a single TB drive.
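
As a rough sanity check on that figure: a power-7 proof stores about 2^7 = 128 interim residues, each roughly p/8 bytes ≈ 12.5 MB at p ≈ 100M, i.e. about 1.6 GB in total - consistent with the 1.5 GB quoted.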
Quote (Originally Posted by ewmayer):
PRP-proof support will be in v20, yes. I am alas behind the curve
Right is more important than soon. And life happening affects how soon is practical.

2020-11-28, 19:46   #61
Dylan14

I have posted a working PKGBUILD for the latest Mlucas to the AUR. You can find it here.

There are two patches that I had to make to the source to get it to build correctly:
1. In the file platform.h, I had to comment out line 1304:
Code:
#include <sys/sysctl.h>
This is because the sysctl.h header was removed in Linux kernel 5.5, per this issue on the PowerShell GitHub.
2. In the file Mlucas.c, I removed the *fp part of the FILE declaration on line 100, because the linker (gcc 10.2.0) was complaining that fp was already defined elsewhere (namely, in gcd_lehmer.c).
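
The likely root cause of the second issue: gcc 10 switched its default from -fcommon to -fno-common, so two translation units that each define the same file-scope variable now fail at link time instead of being silently merged. A minimal standalone reproduction (hypothetical files, not the actual Mlucas sources):

Code:
cat > a.c <<'EOF'
#include <stdio.h>
FILE *fp;   /* tentative definition #1 */
int main(void) { return 0; }
EOF
cat > b.c <<'EOF'
#include <stdio.h>
FILE *fp;   /* tentative definition #2 */
EOF
gcc a.c b.c           # gcc >= 10: "multiple definition of 'fp'"
gcc -fcommon a.c b.c  # restores the old pre-10 behavior: links fine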

2020-11-30, 18:54   #62
pvn

I just built v19 and I'm fairly new to the Arm universe. I am poking around on some of the AWS EC2 instances with "graviton" processors. I notice that if I run with 4 cores, using a command line like this:


Code:
#  ./Mlucas -s m -cpu 0:3

then I get this message in the output quite a bit:


Code:
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

And, sure enough, runs with four cores tend to be (much) slower than with 2 cores or even 1 core on the same instance. Is there something I should be doing differently?

2020-12-11, 21:41   #63
ewmayer

@pvn: Sorry for the belated reply - that warning message is more common for larger thread counts; it's basically telling you that part of the FFT code needs the leading radix (leftmost in the "Using complex FFT radices" info-print) to be divisible by #threads in order to run optimally. Example from a DC my last functioning bought-cheap-used Android phone is currently doing:

Using complex FFT radices 192 32 16 16

The leading radix here is radix0 = 192, thus radix0/2 = 96 = 32*3. Sticking to power-of-2 thread counts (which the other main part of my 2-phases-per-iteration FFT code needs to run optimally), we'd be fine for #threads = 2, 4, 8, 16, 32, but 64 would give you the warning you saw.
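
Expressed as a quick shell check (just the divisibility criterion from the warning, not the actual Mlucas code):

Code:
# radix0 = 192, so radix0/2 = 96; the warning fires when 96 % NTHREADS != 0
for t in 2 4 8 16 32 64; do
    echo "NTHREADS=$t -> remainder $(( 96 % t ))"
done
# remainder is 0 for 2,4,8,16,32; 96 % 64 = 32, so only 64 threads would warn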

Do you recall which precise radix set you saw the warning at in your case? Seeing it for 4 threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12, 20, 28, 36, 44, 52, 60. That's no problem; it just means that when using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix combo at the given FFT length to run best, and that will be reflected in the corresponding mlucas.cfg file entry.

I've always gotten quite good multithreaded scaling on my Arm devices (Odroid mini-PC and Android phone) up to 4 threads - did you run separate self-tests for -cpu 0, -cpu 0:1 and -cpu 0:3 and compare the resulting mlucas.cfg files?

On the Graviton instance you're using, what does /proc/cpuinfo show in terms of #cores?

2021-01-10, 17:03   #64
pvn

Hi Ernst, thanks for looking at this, and apologies for the delays on my end.

Quote (Originally Posted by ewmayer):
Do you recall which precise radix set you saw the warning at in your case? Seeing it for 4 threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12, 20, 28, 36, 44, 52, 60. That's no problem; it just means that when using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix combo at the given FFT length to run best, and that will be reflected in the corresponding mlucas.cfg file entry.
Does this mean that the self-test run is taking longer because it's... weeding out the unsuitable radices? I think this makes sense given what I see in the resulting cfg files: at any given FFT length, the msec/iter (roughly) scales with the number of cores used, even when the 4-core self-test takes unexpectedly long overall.


Also, it seems important to note that all of the leading radices that actually get saved in mlucas.cfg when running -cpu 0:3 are evenly divisible by NTHREADS*2 (in this case, NTHREADS = 4).


Here's some of the output with the radix sets that gave the "this will hurt performance" message (these runs seem to take about 50% more time than the other runs at the same FFT size):

Code:
M43765019: using FFT length 2304K = 2359296 8-byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M48515021: using FFT length 2560K = 2621440 8-byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 35280290
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 23722047
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M62705077: using FFT length 3328K = 3407872 8-byte floats, initial residue shift count = 61480382
this gives an average 18.400068136361931 bits per digit
Using complex FFT radices 52 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M67417873: using FFT length 3584K = 3670016 8-byte floats, initial residue shift count = 63290971
this gives an average 18.369912556239537 bits per digit
Using complex FFT radices 28 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M72123137: using FFT length 3840K = 3932160 8-byte floats, initial residue shift count = 65799790
this gives an average 18.341862233479819 bits per digit
Using complex FFT radices 60 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M86198291: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 21266494
this gives an average 18.267799165513779 bits per digit
Using complex FFT radices 36 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 93620243
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 43929528
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M104884309: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 24783492
this gives an average 18.186449397693981 bits per digit
Using complex FFT radices 44 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M123493333: using FFT length 6656K = 6815744 8-byte floats, initial residue shift count = 30371346
this gives an average 18.118833835308369 bits per digit
Using complex FFT radices 52 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 24638813
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 92450206
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M142037359: using FFT length 7680K = 7864320 8-byte floats, initial residue shift count = 90349695
this gives an average 18.060984166463218 bits per digit
Using complex FFT radices 60 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

2021-01-12, 21:02   #65
ewmayer

@pvn:

The self-tests are intended to do two things:

[1] Check correctness of the compiled code;

[2] Find the best-performing combination of radices for each FFT length on the user's platform. That means trying each combination of radices available for assembling each FFT length and picking the one which runs fastest, unless the fastest happens to show unacceptably high levels of roundoff error, in which case the combo which runs fastest *and* has acceptable ROE levels gets stored to the mlucas.cfg file.

The mlucas.cfg file is read at the start of each LL or PRP test: for the current exponent being tested, the program computes the default FFT length based on expected levels of roundoff error, then reads the radix-combo data for that FFT length from mlucas.cfg and uses those FFT radices for the run.

The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try. Some examples:

o On my Intel Haswell quad, there are 4 physical cores, no hyperthreading: run self-tests with '-s m -cpu 0:3' to use all 4 cores;

o On my Intel Broadwell NUC mini, there are 2 physical cores, but with hyperthreading. I ran self-tests with '-s m -cpu 0:1' to use just the 2 physical cores, then 'mv mlucas.cfg mlucas.cfg.2' so as not to mix those timings up with the next self-test. Next I ran with '-s m -cpu 0:3' to use all 4 cores (2 physical, 2 logical), then 'mv mlucas.cfg mlucas.cfg.4'. Comparing the msec/iter numbers between the 2 files showed the latter set of timings to be 5-10% faster, meaning the hyperthreading was beneficial, so that's the run mode I use: 'ln -s -f mlucas.cfg.4 mlucas.cfg' links the desired .4-renamed cfg-file to the 'mlucas.cfg' name the code looks for at runtime; then queue up some work using the primenet.py script and fire up the program with flags '-cpu 0:3'.
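
In command form (assuming the usual ./Mlucas invocation used elsewhere in this thread):

Code:
./Mlucas -s m -cpu 0:1       # self-test on the 2 physical cores
mv mlucas.cfg mlucas.cfg.2
./Mlucas -s m -cpu 0:3       # self-test on all 4 cores (2 physical + 2 logical)
mv mlucas.cfg mlucas.cfg.4
# compare msec/iter between the two files, then link the winner:
ln -s -f mlucas.cfg.4 mlucas.cfg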

On manycore and multisocket systems, finding the run mode which gives the best total throughput takes a bit more work, but "don't split runs across sockets" is rule #1. After that, you find the way to max out throughput on an individual socket, then duplicate that setup on socket 2 by incrementing the low:high indices following the -cpu flag appropriately.

Regarding your other observations:

o It's not surprising that all of the radix sets that appear in your mlucas.cfg when running -cpu 0:3 have a leading radix evenly divisible by NTHREADS*2 - like the runtime warning says, if that does not hold (say radix0 = 12 with 4 threads via -cpu 0:3), it will generally hurt performance, meaning such combos will run more slowly due to suboptimal thread utilization and will nearly always be bested by one or more radix combos which do satisfy the divisibility criterion. Nothing the user need worry about; it's all automated, and whichever combo runs fastest appears in the cfg file.

o The reason the self-tests with 4 threads (-cpu 0:3) take longer than you expected is that for 4 or more threads the default #iters used for each timing test gets raised from 100 to 1000, in order to get a more accurate timing sample. You can override that by specifying -iters 100 for such tests.
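
For example:

Code:
./Mlucas -s m -cpu 0:3 -iters 100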

Cheers, and have fun,
-E

2021-01-13, 12:10   #66
tdulcet

Quote (Originally Posted by ewmayer):
The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try.
My install script for Linux currently follows the recommended instructions in the Mlucas README for each architecture, to hopefully provide the best performance for most users. I would be interested in adding this feature to automatically try different combinations of CPU cores/threads and then pick the one with the best performance, although I am not sure what the correct procedure is for each architecture and CPU, or how the -DUSE_THREADS compile flag factors in. The script's goal is to automate the entire download, build, setup and run process for Mlucas, so I think this could be an important component of that. I have not received any feedback on the script so far, so I am not even sure whether there is interest in this feature or what percentage of systems it would affect.
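
Something like the following is what I have in mind (a rough sketch only, modeled on Ernst's procedure above; the candidate -cpu ranges here are placeholders that a real script would derive from the detected CPU topology):

Code:
# Sketch: run self-tests for several candidate core ranges,
# save each resulting cfg file, then link whichever ran fastest.
for range in 0 0:1 0:3; do
    ./Mlucas -s m -cpu "$range" -iters 100
    mv mlucas.cfg "mlucas.cfg.$range"
done
# compare the msec/iter columns of the saved files, then e.g.:
# ln -s -f mlucas.cfg.0:3 mlucas.cfg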