mersenneforum.org Mlucas v19 available

2020-07-13, 19:26   #56
ewmayer
∂²ω=0

Sep 2002
República de California

3×53×73 Posts

Quote:
 Originally Posted by chris2be8 As you may be able to tell I've had to use it on several platforms. But I prefer scp or sftp if they are available.
I always use scp when available; perhaps my expectations re. fs-path handling have been colored by that. But on this particular server (or perhaps with my remote-access privileges to it), only ftp is available.

2020-07-29, 16:48   #57
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5×983 Posts

PRP proof

Are you implementing patnashev's PRP proof generation in Mlucas?

2020-07-29, 20:17   #58
Uncwilly
6809 > 6502

"""""""""""""""""""
Aug 2003

101×103 Posts

Hadn't thought to ask that myself. If it had come to mind, I would have. It will be useful when we find the next candidate prime.

2020-07-29, 22:27   #59
ewmayer
∂²ω=0

Sep 2002
República de California

26527₈ Posts

PRP-proof support will be in v20, yes. I am alas behind the curve there - between the pandemic and a series of non-life-threatening but still frequently day-week-and-month-ruining health bugaboos, this year has been one of continual annoying distractions. And EOM my housemates-of-2-years (young professional couple who just bought a starter home in the area) are vacating the MBR suite of our large shared apartment, so I have tons of busywork to do getting the place ready to show to prospective renters. What a year...

My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)

2020-07-30, 04:01   #60
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5×983 Posts

Quote:
 Originally Posted by ewmayer My one main concern re. PRP-proof support is that it appears that the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)
Low power proofs are better than none. Standalone devices could drop to 6 (or even 5 if necessary?) and still save ~90+% of a DC.

Per https://mersenneforum.org/showpost.p...5&postcount=46 power 7 takes 1.5GB disk space for residues at 100M p. Since Odroid is Ubuntu and GigE, why not pile residues on a network shared drive and then clean them up after the proof file exists? A Droid, Pi or phone farm could share a single TB drive.
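
As a rough check of the figures above, here is a minimal sketch of the storage arithmetic. It assumes that a proof of power k keeps 2^k full-size residues of about p/8 bytes each, and that verifying the proof costs roughly p/2^k squarings, ignoring smaller overheads. The function names are made up for illustration and are not part of Mlucas or any other client.

```python
def proof_residue_storage_bytes(exponent: int, power: int) -> int:
    """Approximate disk space for the interim residues of one PRP proof.

    Assumption: 2**power residues are kept, each ceil(exponent / 8) bytes.
    """
    residue_bytes = (exponent + 7) // 8
    return (2 ** power) * residue_bytes


def approx_dc_saving(power: int) -> float:
    """Rough fraction of a full double-check saved by a proof of this power.

    Assumption: verification costs about exponent / 2**power squarings,
    versus about `exponent` squarings for a full double-check.
    """
    return 1.0 - 1.0 / (2 ** power)


if __name__ == "__main__":
    p = 100_000_000  # "100M p" from the post above
    for power in (5, 6, 7):
        gb = proof_residue_storage_bytes(p, power) / 1e9
        print(f"power {power}: ~{gb:.1f} GB of residues, "
              f"~{100 * approx_dc_saving(power):.1f}% of a DC saved")
```

Under these assumptions power 7 at a 100M exponent comes to 1.6e9 bytes, which is in line with the 1.5GB quoted, and each step down in power halves the disk space needed.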
Quote:
 Originally Posted by ewmayer PRP-proof support will be in v20, yes. I am alas behind the curve
Right is more important than soon. And life happening affects how soon is practical.

Last fiddled with by kriesel on 2020-07-30 at 04:09

2020-11-28, 19:46   #61
Dylan14

"Dylan"
Mar 2017

56210 Posts

I have posted a working PKGBUILD for the latest Mlucas to the AUR. You can find it here. There are two patches that I had to make to the source to get it to build correctly:

1. In the file platform.h, I had to comment out line 1304:
Code:
#include 
This is because the sysctl.h header was removed in Linux Kernel 5.5, per this issue on the PowerShell GitHub.

2. In the file Mlucas.c, I removed the *fp part of FILE on line 100. This is because the linker (gcc 10.2.0) was complaining that fp was defined elsewhere (namely, in gcd_lehmer.c).

2020-11-30, 18:54   #62
pvn

Nov 2020

22 Posts

I just built v19 and I'm fairly new to the Arm universe. I am poking around on some of the AWS EC2 instances with "Graviton" processors. I notice that if I run with 4 cores, using a command line like this:
Code:
# ./Mlucas -s m -cpu 0:3
then I get this message in the output quite a bit:
Code:
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
and, sure enough, runs with four cores tend to be (much) slower than 2 cores or even 1 core on the same instance. Is there something I should be doing differently?

2021-01-10, 17:03   #64
pvn

Nov 2020

22 Posts

Hi ernst, thanks for looking at this and apologies for delays on my end.

Quote:
 Do you recall which precise radix set you saw the warning at in your case? To see it for 4-threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12,20,28,36,44,52,60. That's no problem, it just means that in using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix-combo at the given FFT length to run best, which will be reflected in the corresponding mlucas.cfg file entry.
Does this mean that the self-test run is taking longer because it's... weeding out the unsuitable radices? I think this makes sense given what I see in the resulting cfg files: at any given FFT length, the msec/iter (roughly) scales with the number of cores used, even when the 4-core self-test takes unexpectedly long overall.

Also, it seems important to note that all of the leading radices that actually get saved in the mlucas.cfg when running -cpu 0:3 are evenly divisible by NTHREADS*2 (in this case, NTHREADS=4).
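
To make the divisibility condition concrete, here is a small sketch that flags which leading radices would trip the warning for a given thread count. The test itself is inferred from the wording of the warning message, not taken from the Mlucas source, and the candidate list (multiples of 4 from 8 to 64) is purely illustrative rather than the set of radices Mlucas actually supports.

```python
def triggers_warning(radix0: int, nthreads: int) -> bool:
    """True if radix0/2 is not exactly divisible by nthreads.

    Assumption: this mirrors the condition behind the
    "radix0/2 not exactly divisible by NTHREADS" message.
    """
    return (radix0 // 2) % nthreads != 0


def flagged_radices(candidates, nthreads: int):
    """Return the candidate leading radices that would trip the warning."""
    return [r for r in candidates if triggers_warning(r, nthreads)]


if __name__ == "__main__":
    # Hypothetical candidate leading radices, all even.
    candidates = list(range(8, 65, 4))
    print(flagged_radices(candidates, nthreads=4))
```

For 4 threads this flags exactly the values listed in the quote (12, 20, 28, 36, 44, 52, 60), and every candidate that passes is a multiple of 8, i.e. of NTHREADS*2.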

Here's some of the output with the radix sets that gave the "This will hurt performance" message (these runs seem to take about 50% more time than the other runs at the same FFT size):

M43765019: using FFT length 2304K = 2359296 8-byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M48515021: using FFT length 2560K = 2621440 8-byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 35280290
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 23722047
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M62705077: using FFT length 3328K = 3407872 8-byte floats, initial residue shift count = 61480382
this gives an average 18.400068136361931 bits per digit
Using complex FFT radices 52 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M67417873: using FFT length 3584K = 3670016 8-byte floats, initial residue shift count = 63290971
this gives an average 18.369912556239537 bits per digit
Using complex FFT radices 28 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M72123137: using FFT length 3840K = 3932160 8-byte floats, initial residue shift count = 65799790
this gives an average 18.341862233479819 bits per digit
Using complex FFT radices 60 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M86198291: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 21266494
this gives an average 18.267799165513779 bits per digit
Using complex FFT radices 36 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 93620243
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 43929528
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M104884309: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 24783492
this gives an average 18.186449397693981 bits per digit
Using complex FFT radices 44 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M123493333: using FFT length 6656K = 6815744 8-byte floats, initial residue shift count = 30371346
this gives an average 18.118833835308369 bits per digit
Using complex FFT radices 52 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 24638813
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 92450206
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.

M142037359: using FFT length 7680K = 7864320 8-byte floats, initial residue shift count = 90349695
this gives an average 18.060984166463218 bits per digit
Using complex FFT radices 60 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
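
For anyone reading these log lines, the numbers can be cross-checked with a few lines of arithmetic. The sketch below uses only values copied from the output above. The two relations it checks (bits per digit = exponent / FFT length, and the product of the complex radices being half the real FFT length) are what the logged numbers are consistent with, not something taken from the Mlucas source. The helper names are made up for illustration.

```python
from math import prod

# (exponent, FFT length in K, complex FFT radices), copied from the log above.
LOG_ENTRIES = [
    (43765019, 2304, [36, 32, 32, 32]),
    (48515021, 2560, [20, 16, 16, 16, 16]),
    (53254447, 2816, [44, 8, 16, 16, 16]),
]


def fft_length_floats(fft_k: int) -> int:
    """FFT length in 8-byte floats for a length given in K (1K = 1024)."""
    return fft_k * 1024


def bits_per_digit(exponent: int, fft_k: int) -> float:
    """Average number of bits of the residue stored per FFT element."""
    return exponent / fft_length_floats(fft_k)


def radices_match_length(fft_k: int, radices) -> bool:
    """Check that the complex radices multiply out to half the real length."""
    return 2 * prod(radices) == fft_length_floats(fft_k)


if __name__ == "__main__":
    for exponent, fft_k, radices in LOG_ENTRIES:
        print(f"M{exponent}: {fft_k}K = {fft_length_floats(fft_k)} floats, "
              f"{bits_per_digit(exponent, fft_k):.6f} bits/digit, "
              f"radices consistent: {radices_match_length(fft_k, radices)}")
```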

Last fiddled with by pvn on 2021-01-10 at 17:06

2021-01-13, 12:10   #66
tdulcet

"Teal Dulcet"
Jun 2018

2110 Posts

Quote:
 Originally Posted by ewmayer The user is still expected to have a basic understanding of their hardware's multicore aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try.
My install script for Linux currently follows the recommended instructions in the Mlucas README for each architecture, to hopefully provide the best performance for most users. I would be interested in adding a feature that automatically tries different combinations of CPU cores/threads and picks the one with the best performance, but I am not sure what the correct procedure is for each architecture and CPU, or how the -DUSE_THREADS compile flag factors in. The script's goal is to automate the entire download, build, setup and run process for Mlucas, so I think this could be an important component of that. I have not received any feedback on the script so far, so I am also not sure whether there is any interest in this feature or what percentage of systems it would affect.
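
One possible starting point, sketched here under loud assumptions: generate a self-test command line for each power-of-two core count, run them one at a time, and compare the resulting timings by hand. The only flags used are the `-s m -cpu lo:hi` form that appears earlier in this thread. Writing a single core as `0:0`, the power-of-two stepping, and the helper names are all assumptions of this sketch. It only builds the command strings and does not run anything or parse mlucas.cfg.

```python
def candidate_cpu_ranges(ncores: int):
    """Return lo:hi core ranges covering 1, 2, 4, ... cores starting at core 0.

    Assumption: contiguous ranges starting at core 0 are sensible candidates;
    real core topologies may call for other combinations.
    """
    ranges = []
    width = 1
    while width <= ncores:
        ranges.append(f"0:{width - 1}")
        width *= 2
    return ranges


def self_test_commands(ncores: int, binary: str = "./Mlucas"):
    """Build one self-test command line per candidate core range."""
    return [f"{binary} -s m -cpu {r}" for r in candidate_cpu_ranges(ncores)]


if __name__ == "__main__":
    for cmd in self_test_commands(4):
        print(cmd)
```

Since each self-test writes its results to mlucas.cfg, a script built on this would presumably need to save or rename that file between runs before comparing them.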


All times are UTC.
