 mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)

 Prime95 2021-09-30 20:05

[QUOTE=ixfd64;589095]Any chance we could write PRP results to [C]results.txt[/C] too?
[/QUOTE]

Try "OutputComposites=1" and, in case you might get very lucky, "OutputPrimes=1"

 chalsall 2021-09-30 20:31

[QUOTE=James Heinrich;589096]If by pretty-print you mean presenting JSON over multiple lines with indenting and such then no, as this will break manual results which is based on the assumption that one-line=one-result.[/QUOTE]

IMHO, JSON was designed for machines, not humans.

For the latter tools are available, for development, testing, and QA purposes.

[CODE]chalsall@hobbit:~\$ echo '{"hello":"World"}' | jq
{
"hello": "World"
}
[/CODE]

 ixfd64 2021-09-30 22:40

[QUOTE=Prime95;589100]Try "OutputComposites=1" and, in case you might get very lucky, "OutputPrimes=1"[/QUOTE]

Yes, that's exactly what I was looking for. I checked [C]undoc.txt[/C] and can't believe I didn't see these. Thank you!

 Zhangrc 2021-10-01 12:27

[QUOTE=kriesel;588502]Assuming the ~p[SUP]2.1[/SUP] scaling also applies to GCD operations[/QUOTE]
The time complexity is almost linear logarithmic, O(log(ab)).
[CODE]
int gcd(int x,int y )
{
if(x < y) return gcd(y,x); // x>y
if( y == 0) return x; // if y=0, x is GCD
else
{
if( !(x%2) )
{
if( !(y%2) ) //x,y both even
return 2*gcd(x >> 1, y >> 1);
else // x is even, y is odd
return gcd(x >> 1, y );
}
else
{
if( !(y%2) ) // x is odd, y is even
return gcd(x, y >> 1);
else // x, y both odd
return gcd(y,x-y);
}
}
}
[/CODE]
and AFAIK there's no speed to gain using multithreaded GCD. It's a kind of function iteration where each step depends on the results from the last step.

 tha 2021-10-01 14:44

What cause of action would be recommended for this core segmentation fault? Nothing else was running at the time.

[CODE]
[Comm thread Oct 1 13:40] Sending result to server: UID: Tha/Z-170, M9262289 completed P-1, B1=2000000, B2=146000000, Wi8: E7182B42
[Work thread Oct 1 13:40] P-1 on M9194659 with B1=5000000, B2=TBD
[Work thread Oct 1 13:40] Setting affinity to run helper thread 1 on CPU core #2
[Work thread Oct 1 13:40] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads
[Work thread Oct 1 13:40] Setting affinity to run helper thread 2 on CPU core #3
[Work thread Oct 1 13:40] Setting affinity to run helper thread 3 on CPU core #4
[Comm thread Oct 1 13:40] CPU credit is 3.9457 GHz-days.
[Comm thread Oct 1 13:40] Done communicating with server.
[Work thread Oct 1 13:48] M9194659 stage 1 is 13.85% complete. Time: 458.543 sec.
[Work thread Oct 1 13:56] M9194659 stage 1 is 27.71% complete. Time: 454.645 sec.
[Work thread Oct 1 14:03] M9194659 stage 1 is 41.58% complete. Time: 458.479 sec.
[Work thread Oct 1 14:11] M9194659 stage 1 is 55.44% complete. Time: 456.359 sec.
[Work thread Oct 1 14:18] M9194659 stage 1 is 69.30% complete. Time: 460.225 sec.
[Work thread Oct 1 14:26] M9194659 stage 1 is 83.16% complete. Time: 436.106 sec.
[Work thread Oct 1 14:33] M9194659 stage 1 is 97.02% complete. Time: 436.115 sec.
[Work thread Oct 1 14:35] M9194659 stage 1 complete. 14429932 transforms. Time: 3254.242 sec.
[Work thread Oct 1 14:35] With trial factoring done to 2^71, optimal B2 is 91*B1 = 455000000.
[Work thread Oct 1 14:35] If no prior P-1, chance of a new factor is 8.92%
[Work thread Oct 1 14:35] D: 1050, relative primes: 2848, stage 2 primes: 23755332, pair%=94.69
[Work thread Oct 1 14:35] D: 1050, relative primes: 2819, stage 2 primes: 23755332, pair%=94.61
[Work thread Oct 1 14:35] Using 11021MB of memory.
[Work thread Oct 1 14:35] Stage 2 init complete. 27785 transforms. Time: 16.280 sec.
[Work thread Oct 1 14:44] M9194659 stage 2 is 7.41% complete. Time: 526.239 sec.
[Work thread Oct 1 14:52] M9194659 stage 2 is 14.96% complete. Time: 526.555 sec.
[Work thread Oct 1 15:01] M9194659 stage 2 is 22.61% complete. Time: 526.607 sec.
[Work thread Oct 1 15:10] M9194659 stage 2 is 30.29% complete. Time: 526.931 sec.
[Work thread Oct 1 15:19] M9194659 stage 2 is 37.96% complete. Time: 526.598 sec.
[Work thread Oct 1 15:27] M9194659 stage 2 is 45.63% complete. Time: 526.763 sec.
[Work thread Oct 1 15:36] M9194659 stage 2 is 53.28% complete. Time: 526.889 sec.
[Work thread Oct 1 15:45] M9194659 stage 2 is 60.82% complete. Time: 526.498 sec.
[Work thread Oct 1 15:54] M9194659 stage 2 is 68.22% complete. Time: 526.293 sec.
[Work thread Oct 1 16:03] M9194659 stage 2 is 75.58% complete. Time: 526.346 sec.
[Work thread Oct 1 16:11] M9194659 stage 2 is 82.96% complete. Time: 532.123 sec.
[Work thread Oct 1 16:20] M9194659 stage 2 is 90.33% complete. Time: 526.133 sec.
[Work thread Oct 1 16:29] M9194659 stage 2 is 97.73% complete. Time: 526.361 sec.
[Work thread Oct 1 16:32] M9194659 stage 2 complete. 26612490 transforms. Time: 7011.609 sec.
[Work thread Oct 1 16:32] Stage 2 GCD complete. Time: 1.302 sec.
[Work thread Oct 1 16:32] P-1 found a factor in stage #2, B1=5000000, B2=455000000.
[Work thread Oct 1 16:32] M9194659 has a factor: 1184258844011627168698855977003308763574671960534764772350463733066848456139362163010122679897458456388791012515229078287455071 (P-1, B1=5000000, B2=455000000)
[Comm thread Oct 1 16:32] Sending result to server: UID: Tha/Z-170, M9194659 has a factor: 1184258844011627168698855977003308763574671960534764772350463733066848456139362163010122679897458456388791012515229078287455071 (P-1, B1=5000000, B2=455000000)
[Work thread Oct 1 16:32] P-1 on M9262387 with B1=2000000, B2=TBD
[Work thread Oct 1 16:32] Setting affinity to run helper thread 1 on CPU core #2
[Work thread Oct 1 16:32] Using FMA3 FFT length 480K, Pass1=384, Pass2=1280, clm=4, 4 threads
[Work thread Oct 1 16:32] Setting affinity to run helper thread 2 on CPU core #3
[Work thread Oct 1 16:32] Setting affinity to run helper thread 3 on CPU core #4
[Comm thread Oct 1 16:32] Composite factor 1184258844011627168698855977003308763574671960534764772350463733066848456139362163010122679897458456388791012515229078287455071 = 367786361 * 201803650751527169 * 267438738796927 * 101994518728598164687111 * 441343633 * 422426099229769 * 23010527886744119 * 62376566657 * 2185975009297
[Comm thread Oct 1 16:32] CPU credit is 11.8294 GHz-days.
Segmentation fault (core dumped)
henk@Z170:~/mersenne\$ ^C

[/CODE]
[CODE]

Linux64,Prime95,v30.6,build 3

Consult readme.txt prior to changing any of these settings.

Temporary disk space limit in GB/worker (6.000000):
Daytime P-1/P+1/ECM stage 2 memory in GB (10.800000):
Nighttime P-1/P+1/ECM stage 2 memory in GB (10.800000):
Upload bandwidth limit in Mbps (5.000000):
Upload large files time period start (00:00):
Upload large files time period end (24:00):
Skip advanced resource settings (Y): [/CODE]

 kriesel 2021-10-01 14:49

Parallelism in GCD computation; runtime order

[QUOTE=Zhangrc;589131]The time complexity is almost linear logarithmic, O(log(ab)).
and AFAIK there's no speed to gain using multithreaded GCD. It's a kind of function iteration where each step depends on the results from the last step.[/QUOTE]Thanks for that for single-precision operands. The larger operand would be Mp. The smaller would be ~Mp/2. o(log Mp[SUP]2[/SUP]/2) would be o(p[SUP]2[/SUP]). [URL]https://stackoverflow.com/questions/18137019/running-time-of-gcd-function-recursively-euclid-algorithm[/URL]

Now consider the usual GIMPS case where one of the operands is Mp, p~10[SUP]8[/SUP], requiring millions of words to store, not just one, and even simple operations may take time ~proportional to p to execute once. The other starting operand is 3[SUP]large_power[/SUP] Mod Mp, so ~p-1 bits large on average. Also note that potential factors are typically >2[SUP]64[/SUP] before P-1 is attempted, due to prior TF depth, so the final operands are likely multi-word also. Even assuming the operands are stored in multi-word arrays of >1M packed 64-bit-unsigned initially, one can imagine dividing x>>1 into some small m parallel threads for most of the work, and alternatively computing a new least significant word and bit offset and mask. I don't see how you avoid at least proportional to p iterations, since the algorithm posted removes one bit out of p bits per pass. What is the run time for multiprecision general subtraction, o(n) each where n is the number of bits in the lesser operand?

Anything less than ~p[SUP]2.1[/SUP] scaling (really O(p[SUP]2[/SUP] log p log log p)) makes my case stronger, that scaling downward from larger exponents to the wavefront exponents indicates significant single-core-GCD time at the wavefront.

GIMPS code parallelizes individual iterations of PRP or LL or P-1 powering. The fact that one iteration depends on the previous does not preclude parallelizing in iterations or individual operations within iterations, and gaining considerably in speed with available numbers of processor cores. (Four cores / task optimal throughput is routine in CPU-based code prime95, Mlucas. GPU code uses much higher parallelism for the same algorithms.)

Making use of parallelism to speed individual iterations of GCD is possible, as a quick web search reveals: [URL]https://www.sciencedirect.com/science/article/pii/S1570866707000585[/URL]. No one's coded it yet for GIMPS, choosing to focus optimization efforts on larger portions of the P-1 factoring computation first. That paper gives o(n / log n) for n[SUP]1+e[/SUP] processors "where [I]n[/I] is the bit-length of the larger input" which is p for Mp. That is a LOT of processors. GPUs have lots of processors, but not that many. Converting that back to sequential not parallel gives o(p[SUP]2+e[/SUP]/log p).

There are probably slight savings to be had based on knowledge of Mp and its possible factors always being odd, and knowing an Mp-oriented GCD can always be called with a specific operand order, avoiding the initial x>y test.

The order of GMP's GCD is given in [URL]https://oaciss.uoregon.edu/icpp18/publications/pos123s2-file1.pdf[/URL] as [QUOTE]T[FONT=sans-serif]he GCD algorithms used by[/FONT]
[FONT=sans-serif]GMP for large input are essentially[/FONT][FONT=sans-serif]O[/FONT][FONT=sans-serif]([/FONT][FONT=sans-serif]N[/FONT][FONT=sans-serif]^(1[/FONT][FONT=sans-serif]+[/FONT][FONT=sans-serif]ε)[/FONT][FONT=sans-serif] log[/FONT][FONT=sans-serif]N[/FONT][FONT=sans-serif])[/FONT][FONT=sans-serif] for a fairly large[/FONT]
[FONT=sans-serif]value of[/FONT][FONT=monospace]ε[/FONT][FONT=sans-serif][6, section 15.3.3][/FONT][/QUOTE](An interesting paper in its own right from 2018, extrapolating that modular integers may be useful to compute GCDs faster using multiple GPUs.)

[URL]http://www.iaeng.org/IJCS/issues_v42/issue_4/IJCS_42_4_01.pdf[/URL] is a somewhat different approach, where many GCDs are done in parallel on CPU or GPU.

 Prime95 2021-10-01 15:19

[QUOTE=tha;589140]What cause of action would be recommended for this core segmentation fault? Nothing else was running at the time.[/QUOTE]

Probably already fixed (in 30.7). Looks like the "message too large from server" bug.

 Zhangrc 2021-10-02 03:05

[QUOTE=kriesel;589141]o(log Mp[SUP]2[/SUP]/2) would be o(p[SUP]2[/SUP]). [/QUOTE]
No because log(ab)=log(a)+log(b). O(log Mp[SUP]2[/SUP]/2) is O(log 2[SUP]2p-1[/SUP]) = O(2p-1).

 kriesel 2021-10-02 14:51

[QUOTE=Zhangrc;589190]No.[/QUOTE]
In gmp, & gmplib, which IIRC is used in prime95 / mprime, gpuowl, and now also Mlucas v20.x, P-1 GCD (and probably P+1 GCD in mprime/ prime95), per [URL]https://gmplib.org/manual/Nomenclature-and-Types[/URL]
"A limb means the part of a multi-precision number that fits in a single machine word. (We chose this word because a limb of the human body is analogous to a digit, only larger, and containing several digits.) Normally a limb is 32 or 64 bits."

[URL]https://gmplib.org/manual/Binary-GCD[/URL]
"At small sizes GMP uses an [B]O(N^2)[/B] binary style GCD." where N is number of limbs per operand.
"Currently, the binary algorithm is used for GCD only when N < 3." So x,y each less than 129 bits at most.

[URL]https://gmplib.org/manual/Lehmer_0027s-Algorithm[/URL]
Per iteration, reduces inputs by almost a word size.
"The resulting algorithm is asymptotically O(N^2), just as the Euclidean algorithm and the binary algorithm." where N is number of limbs per operand.
For Mp of interest to GIMPS wavefront activity, N= p/limbsize = ~10^8/64 or /32.

"For inputs larger than GCD_DC_THRESHOLD, GCD is computed via the HGCD (Half GCD) function, as a generalization to Lehmer’s algorithm."
"The asymptotic running time of both HGCD and GCD is O(M(N)*log(N)), where M(N) is the time for multiplying two N-limb numbers." and where N is number of limbs per operand.
M(Mp) is O(p log p log log p) as per Knuth and many other sources for large operands using fft methods. Log2 Mp is ~p; N is p/64 or p/32 depending on unsigned-64-bit or 32-bit in gmp; log N is log (p-64 or p-32) = ~log p for large p such as the >10^8 bit operands common in GIMPS first test or preparatory P-1.
So O(M(N)*log(N)) ~ O(p log p log log p * log p) = [B]O(p (log p)^2 log log p)[/B]

(Hmm, where does one find or determine a value for GCD_DC_THRESHOLD?)

[URL]https://www.ams.org/journals/mcom/2008-77-261/S0025-5718-07-02017-0/home.html[/URL] confirms asymptotic run time O(n (log n)^2 log log n) where n is number of bits of an operand.

 SethTro 2021-10-02 18:26

[QUOTE=kriesel;589205]
(Hmm, where does one find or determine a value for GCD_DC_THRESHOLD?)
[/QUOTE]

I wrote some of that documentation :)

You can find thresholds at [URL]https://gmplib.org/devel/thres/[/URL]
[URL="https://gmplib.org/devel/thres/GCD_DC_THRESHOLD"]GCD_DC_THRESHOLD[/URL] seems to be around 300-400 depending on platform.

 tha 2021-10-02 20:12

[QUOTE=Prime95;589142]Probably already fixed (in 30.7). Looks like the "message too large from server" bug.[/QUOTE]

Can we download and test that version? Or should we wait for more wrinkles to be ironed out?

All times are UTC. The time now is 09:01.