2020-07-13, 19:26  #56
∂^{2}ω=0
Sep 2002
República de California
26552_{8} Posts 
I always use scp when available; perhaps my expectations re. fspath handling have been colored by that. But on this particular server (or perhaps with my remote-access privileges to it), only ftp is available.

2020-07-29, 16:48  #57
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×11×229 Posts 
PRP proof
Are you implementing patnashev's PRP proof generation in Mlucas?

2020-07-29, 20:17  #58
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
2×3×1,583 Posts 
Hadn't thought to ask that myself. If it had come to mind, I would have. It will be useful when we find the next candidate prime. 
2020-07-29, 22:27  #59
∂^{2}ω=0
Sep 2002
República de California
26552_{8} Posts 
PRP-proof support will be in v20, yes. I am, alas, behind the curve there: between the pandemic and a series of non-life-threatening but still frequently day-week-and-month-ruining health bugaboos, this year has been one of continual annoying distractions. And at the end of the month my housemates-of-2-years (a young professional couple who just bought a starter home in the area) are vacating the MBR suite of our large shared apartment, so I have tons of busywork to do getting the place ready to show to prospective renters. What a year...
My one main concern re. PRP-proof support is that it appears the memory needs will relegate many smaller compute devices (Android phones, Odroid and RPi-style micros) to doing LL-DC and cleanup PRP-DC. It's downright undemocratic elitism, it is. ;)
2020-07-30, 04:01  #60
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×11×229 Posts 
Quote:
Per https://mersenneforum.org/showpost.p...5&postcount=46, power 7 takes 1.5 GB of disk space for residues at p = 100M. Since the Odroid runs Ubuntu and has GigE, why not pile the residues on a network shared drive and then clean them up after the proof file exists? A Droid, Pi or phone farm could share a single TB drive.
Right is more important than soon. And life happening affects how soon is practical.
Last fiddled with by kriesel on 2020-07-30 at 04:09
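The 1.5 GB figure quoted above can be sanity-checked with a little arithmetic. The sketch below assumes, which the post itself does not state, that a proof of power k keeps 2^k interim residues and that each residue is about p bits long; the function name is made up for illustration.

```python
# Rough disk-space estimate for the interim residues of a PRP proof.
# ASSUMPTION (not from the post): a proof of power k stores 2**k residues,
# each about p bits long for the Mersenne number 2**p - 1.

def proof_residue_bytes(p: int, power: int) -> int:
    """Approximate total bytes needed for 2**power residues of p bits each."""
    bytes_per_residue = (p + 7) // 8  # round p bits up to whole bytes
    return (2 ** power) * bytes_per_residue


if __name__ == "__main__":
    total = proof_residue_bytes(100_000_000, 7)
    print(f"{total} bytes = {total / 2**30:.2f} GiB")
```

Under those assumptions, power 7 at p = 100M comes to 1.6e9 bytes, or roughly 1.5 GiB, which is in line with the number cited in the post.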

2020-11-28, 19:46  #61
"Dylan"
Mar 2017
2·7·41 Posts 
I have posted a working PKGBUILD for the latest Mlucas to the AUR. You can find it here.
There are two patches that I had to make to the source to get it to build correctly:
1. In the file platform.h, I had to comment out line 1304:
Code:
#include <sys/sysctl.h>
2. In the file Mlucas.c, I removed the *fp part of FILE on line 100. This is because the linker (gcc 10.2.0) was complaining that fp was defined elsewhere (namely, in gcd_lehmer.c).
2020-11-30, 18:54  #62
Nov 2020
2^{2} Posts 
I just built v19 and I'm fairly new to the Arm universe. I am poking around on some of the AWS EC2 instances with "graviton" processors. I notice that if I run with 4 cores, using a command line like this:
Code:
# ./Mlucas -s m -cpu 0:3
then I get this message in the output quite a bit:
Code:
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
and, sure enough, runs with four cores tend to be (much) slower than runs with 2 cores or even 1 core on the same instance. Is there something I should be doing differently?
2020-12-11, 21:41  #63
∂^{2}ω=0
Sep 2002
República de California
2×5,813 Posts 
@pvn: Sorry for the belated reply. That warning message is more common for larger thread-counts; it's basically telling you that part of the FFT code needs the leading radix (leftmost in the "Using complex FFT radices" info-print), or more precisely radix0/2, to be divisible by #threads in order to run optimally. Example from a DC that my last functioning bought-cheap-used Android phone is currently doing:
Using complex FFT radices 192 32 16 16
The leading radix here is radix0 = 192, thus radix0/2 = 96 = 32*3. Sticking to power-of-2 thread counts (which the other main part of my 2-phases-per-iteration FFT code needs to run optimally) we'd be fine for #threads = 2,4,8,16,32, but 64 would give you the warning you saw.
Do you recall which precise radix set you saw the warning at in your case? To see it for 4 threads implies radix0/2 is not divisible by 4, which is only true for a handful of small leading radices: radix0 = 12,20,28,36,44,52,60. That's no problem; it just means that in using the self-tests to create the mlucas.cfg file for your particular -cpu [lo:hi] choice, the above suboptimality will likely cause a different FFT-radix-combo at the given FFT length to run best, which will be reflected in the corresponding mlucas.cfg file entry.
I've always gotten quite good multithreaded scaling on my Arm devices (Odroid mini-PC and Android phone) up to 4 threads. Did you run separate self-tests for -cpu 0, -cpu 0:1 and -cpu 0:3 and compare the resulting mlucas.cfg files? On the Graviton instance you're using, what does /proc/cpuinfo show in terms of #cores?
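The divisibility rule described above is easy to check by hand, but a small sketch may help when scanning many radix sets. This is only an illustration of the rule as stated in the post (radix0/2 must be divisible by the thread count); the function name is invented here and is not taken from the Mlucas source.

```python
# Illustration of the rule from the post: the warning appears when
# radix0/2 is not exactly divisible by the number of threads.
# NOTE: `triggers_warning` is a hypothetical helper, not an Mlucas function.

def triggers_warning(radix0: int, nthreads: int) -> bool:
    """True if radix0/2 is not exactly divisible by nthreads."""
    return (radix0 // 2) % nthreads != 0


# Example from the post: radix0 = 192, so radix0/2 = 96 = 32*3.
thread_counts = [2, 4, 8, 16, 32, 64]
ok_for_192 = [n for n in thread_counts if not triggers_warning(192, n)]

# Small leading radices the post lists as problematic for 4 threads.
small_radices = [12, 20, 28, 36, 44, 52, 60]
warn_at_4 = [r for r in small_radices if triggers_warning(r, 4)]

if __name__ == "__main__":
    print("192 is fine for thread counts:", ok_for_192)
    print("warns at 4 threads:", warn_at_4)
```

For radix0 = 192 the check passes for 2 through 32 threads and fails at 64, and every radix in the short list fails at 4 threads, matching the reasoning in the post.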
2021-01-10, 17:03  #64
Nov 2020
2^{2} Posts 
Hi ernst, thanks for looking at this and apologies for delays on my end.
Quote:
Also, it seems important to note that all of the radices that actually get saved in the mlucas.cfg when running -cpu 0:3 are evenly divisible by NTHREADS*2 (in this case, NTHREADS=4).
Here's some of the output with the radix sets that gave the "this will hurt performance" message (these runs seem to take about 50% more time than the other runs at the same FFT size):
M43765019: using FFT length 2304K = 2359296 8-byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M48515021: using FFT length 2560K = 2621440 8-byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 35280290
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M53254447: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 23722047
this gives an average 18.468144850297406 bits per digit
Using complex FFT radices 44 8 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M62705077: using FFT length 3328K = 3407872 8-byte floats, initial residue shift count = 61480382
this gives an average 18.400068136361931 bits per digit
Using complex FFT radices 52 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M67417873: using FFT length 3584K = 3670016 8-byte floats, initial residue shift count = 63290971
this gives an average 18.369912556239537 bits per digit
Using complex FFT radices 28 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M72123137: using FFT length 3840K = 3932160 8-byte floats, initial residue shift count = 65799790
this gives an average 18.341862233479819 bits per digit
Using complex FFT radices 60 32 32 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M86198291: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 21266494
this gives an average 18.267799165513779 bits per digit
Using complex FFT radices 36 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 93620243
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M95551873: using FFT length 5120K = 5242880 8-byte floats, initial residue shift count = 43929528
this gives an average 18.225073432922365 bits per digit
Using complex FFT radices 20 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M104884309: using FFT length 5632K = 5767168 8-byte floats, initial residue shift count = 24783492
this gives an average 18.186449397693981 bits per digit
Using complex FFT radices 44 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M123493333: using FFT length 6656K = 6815744 8-byte floats, initial residue shift count = 30371346
this gives an average 18.118833835308369 bits per digit
Using complex FFT radices 52 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 24638813
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 16 16 16 32
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M132772789: using FFT length 7168K = 7340032 8-byte floats, initial residue shift count = 92450206
this gives an average 18.088856969560897 bits per digit
Using complex FFT radices 28 32 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
M142037359: using FFT length 7680K = 7864320 8-byte floats, initial residue shift count = 90349695
this gives an average 18.060984166463218 bits per digit
Using complex FFT radices 60 16 16 16 16
mers_mod_square: radix0/2 not exactly divisible by NTHREADS - This will hurt performance.
Last fiddled with by pvn on 2021-01-10 at 17:06
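For anyone reading logs like the one above, the numbers in each entry can be cross-checked against one another. The sketch below is not part of Mlucas; the regular expression and helper names are invented for illustration, and the two relations it tests (twice the product of the complex radices equals the count of 8-byte floats, and exponent divided by that count equals the reported bits per digit) are simply what the figures in this log bear out.

```python
import re
from math import prod

# Two entries copied from the log above (warning lines omitted).
LOG = """\
M43765019: using FFT length 2304K = 2359296 8-byte floats, initial residue shift count = 29224505
this gives an average 18.550033145480686 bits per digit
Using complex FFT radices 36 32 32 32
M48515021: using FFT length 2560K = 2621440 8-byte floats, initial residue shift count = 31467905
this gives an average 18.507011795043944 bits per digit
Using complex FFT radices 20 16 16 16 16
"""

# Hypothetical pattern written for this log layout only.
ENTRY = re.compile(
    r"M(\d+): using FFT length (\d+)K = (\d+) 8-?byte floats.*?"
    r"average ([\d.]+) bits per digit.*?"
    r"Using complex FFT radices ((?:\d+ ?)+)",
    re.S,
)


def parse_log(text):
    """Return one dict per log entry found in text."""
    entries = []
    for m in ENTRY.finditer(text):
        entries.append({
            "exponent": int(m.group(1)),
            "fft_k": int(m.group(2)),
            "n_floats": int(m.group(3)),
            "bits_per_digit": float(m.group(4)),
            "radices": [int(r) for r in m.group(5).split()],
        })
    return entries


def is_consistent(e):
    """Check the relations the log figures appear to satisfy."""
    return (
        e["fft_k"] * 1024 == e["n_floats"]
        and 2 * prod(e["radices"]) == e["n_floats"]
        and abs(e["exponent"] / e["n_floats"] - e["bits_per_digit"]) < 1e-9
    )


if __name__ == "__main__":
    for e in parse_log(LOG):
        print(e["exponent"], e["radices"], "consistent:", is_consistent(e))
```

The same check can be pointed at the full log; the leading entry of each radix list is the radix0 that the warning refers to.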

2021-01-12, 21:02  #65
∂^{2}ω=0
Sep 2002
República de California
2·5,813 Posts 
@pvn:
The self-tests are intended to do two things:
[1] Check correctness of the compiled code;
[2] Find the best-performing combination of radices for each FFT length on the user's platform.
That means trying each combination of radices available for assembling each FFT length and picking the one which runs fastest, unless the fastest happens to show unacceptably high levels of roundoff error, in which case the combo which runs fastest *and* has acceptable ROE levels gets stored to the mlucas.cfg file.
The mlucas.cfg file is read at the start of each LL or PRP test: for the current exponent being tested, the program computes the default FFT length based on expected levels of roundoff error, then reads the radix-combo data for that FFT length from mlucas.cfg and uses those FFT radices for the run.
The user is still expected to have a basic understanding of their hardware's multi-core aspects in terms of running the self-tests using one or more -cpu [core number range] settings. I haven't found a good way to automate this "identify best core topology" step, but it's usually pretty obvious which candidate core-combos to try. Some examples:
o On my Intel Haswell quad, there are 4 physical cores, no hyperthreading: run self-tests with '-s m -cpu 0:3' to use all 4 cores;
o On my Intel Broadwell NUC mini, there are 2 physical cores, but with hyperthreading: I ran self-tests with '-s m -cpu 0:1' to use just the 2 physical cores, then 'mv mlucas.cfg mlucas.cfg.2' to not get those timings mixed up with the next self-test. Next I ran with '-s m -cpu 0:3' to use all 4 cores (2 physical, 2 logical), then 'mv mlucas.cfg mlucas.cfg.4'. Comparing the msec/iter numbers between the 2 files showed the latter set of timings to be 5-10% faster, meaning the hyperthreading was beneficial, so that's the run mode I use: 'ln -s -f mlucas.cfg.4 mlucas.cfg' to link the desired .4-renamed cfg-file to the name 'mlucas.cfg' looked for by the code at runtime, then queue up some work using the primenet.py script and fire up the program using flags '-cpu 0:3'.
On many-core and multi-socket systems, finding the run mode which gives the best total throughput takes a bit more work, but "don't split runs across sockets" is rule #1, so you find the way to max out throughput on an individual socket, and then duplicate that setup on socket 2 by incrementing the low:high indices following the -cpu flag appropriately.
Regarding your other observations:
o It's not surprising that all of the radix sets that appear in your mlucas.cfg when running -cpu 0:3 have a leading radix evenly divisible by NTHREADS*2. Like the runtime warning says, if that does not hold (say radix0 = 12 and 4 threads using -cpu 0:3), it will generally hurt performance, meaning such combos will run more slowly due to suboptimal thread utilization, and will nearly always be bested by one or more radix combos which satisfy the divisibility criterion. It's nothing the user need worry about; it's all automated, and whichever combo runs fastest appears in the cfg file.
o The reason the self-tests with 4 threads (-cpu 0:3) take longer than you expected is that for 4 or more threads the default #iters used for each timing test gets raised from 100 to 1000, in order to get a more accurate timing sample. You can override that by specifying -iters 100 for such tests.
Cheers, and have fun,
E
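The selection and comparison steps described in this post can be sketched in a few lines. This is a hedged illustration only: the class, function names, ROE limit and timing numbers below are all hypothetical, and nothing here reads or reproduces the real mlucas.cfg format or the program's actual thresholds.

```python
from typing import NamedTuple, Optional


class Timing(NamedTuple):
    radices: tuple        # FFT radix combo, leading radix first
    msec_per_iter: float  # measured self-test timing
    max_roe: float        # worst roundoff error seen in the self-test


def pick_best(timings, roe_limit: float) -> Optional[Timing]:
    """Fastest combo among those whose roundoff error is acceptable."""
    acceptable = [t for t in timings if t.max_roe <= roe_limit]
    if not acceptable:
        return None
    return min(acceptable, key=lambda t: t.msec_per_iter)


def percent_faster(base: dict, other: dict) -> dict:
    """Per FFT length present in both runs: how much faster `other` is, in %."""
    common = sorted(base.keys() & other.keys())
    return {n: 100.0 * (base[n] - other[n]) / base[n] for n in common}


if __name__ == "__main__":
    # Made-up timings for one FFT length; the fastest combo has too much ROE.
    candidates = [
        Timing((36, 32, 32, 32), 14.0, 0.30),
        Timing((144, 16, 16, 16), 9.0, 0.45),
        Timing((72, 32, 16, 16), 10.0, 0.31),
    ]
    print(pick_best(candidates, roe_limit=0.40))

    # Made-up msec/iter per FFT length (in K) for a 2-core and a 4-core run.
    two_core = {2048: 10.0, 4096: 20.0}
    four_core = {2048: 9.0, 4096: 19.0}
    print(percent_faster(two_core, four_core))
```

In this toy data the 144-radix combo is fastest but is rejected for its roundoff error, so the 72-radix combo is picked, which mirrors the "fastest with acceptable ROE" rule; the second helper does the kind of msec/iter comparison described for the two renamed cfg files.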
2021-01-13, 12:10  #66
"Teal Dulcet"
Jun 2018
2×3×5 Posts 
Quote:

