mersenneforum.org  

Old 2019-02-26, 22:04   #12
ewmayer
2ω=0
 
 
Sep 2002
República de California

2×13×443 Posts
Default

One of the beta-code build-and-testers reports build and runtime errors on several varieties of big-endian hardware ... a code review confirms that some byte-array-based bitwise-utilities functionality I added in the last few years for the sake of efficiency breaks endian-independence. Easy enough to fix the issue - just need to wrap the handful of byte-array-based utils in an endianness preprocessor clause and run the byte-processing in reverse order in the big-endian case. But - what compiler predefine to use for said preprocessor clauses? On my Mac, 'gcc -dM -E [random source file] < /dev/null | grep ENDIAN' gives this:

#define __LITTLE_ENDIAN__ 1

My hopes that this would be a gcc-standard predef were quickly dashed - on my ARMv8/linux, things are far less straightforward:

#define __ORDER_LITTLE_ENDIAN__ 1234
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __ORDER_PDP_ENDIAN__ 3412
#define __ORDER_BIG_ENDIAN__ 4321
#define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__

As long as the range of supported predefs across Posixworld is decently small that's OK - can folks reading this try the above gcc predef-dump command on their systems and let me know if they spot anything that would not be covered by the following?
Code:
#if (__LITTLE_ENDIAN__ == 0) || (__BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__)
	#define USE_BIG_ENDIAN
#endif
Edit: Another option would be to key off the relatively limited set of CPU families using big-endian in the platform.h file and set an internal USE_BIG_ENDIAN preprocessor flag based on CPU-family. Since most major CPU families on which the code has been built already have their own little predef-sections in the header file, it would simply be another predef that gets set-or-not there. Thoughts welcome!
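One caveat on the proposed clause: inside `#if`, an undefined macro evaluates to 0, so on the ARMv8/linux dump above (which has no `__LITTLE_ENDIAN__`) the test `(__LITTLE_ENDIAN__ == 0)` would come out true and set USE_BIG_ENDIAN on a little-endian machine. A minimal sketch of a guarded variant follows. It assumes that `__BIG_ENDIAN__` is what a big-endian Apple-style toolchain would predefine, and the helper names `host_is_big_endian` and `check_endian_flag` are hypothetical, not part of Mlucas.

```c
#include <assert.h>
#include <stdint.h>

/* Compile-time guess. Each predefine is tested with defined() before its
   value is used, because an undefined macro silently evaluates to 0 in #if. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__)
	#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
		#define USE_BIG_ENDIAN
	#endif
#elif defined(__BIG_ENDIAN__)	/* assumption: set by big-endian Apple-style compilers */
	#define USE_BIG_ENDIAN
#endif

/* Runtime probe: returns 1 if the most significant byte is stored first. */
static int host_is_big_endian(void)
{
	const uint32_t probe = 0x01020304u;
	const unsigned char *bytes = (const unsigned char *)&probe;
	return bytes[0] == 0x01;
}

/* Abort early if the preprocessor decision disagrees with the hardware. */
static void check_endian_flag(void)
{
#ifdef USE_BIG_ENDIAN
	assert(host_is_big_endian());
#else
	assert(!host_is_big_endian());
#endif
}
```

Calling `check_endian_flag()` once at program start would catch a platform whose predefines are not covered by the clause, which is cheaper than discovering it through corrupted savefiles.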

Last fiddled with by ewmayer on 2019-02-26 at 22:17
Old 2019-03-06, 21:43   #13
ewmayer
2ω=0
 
 
Sep 2002
República de California

11518₁₀ Posts
Default

Mlucas v18 has gone live - I've updated the OP in this thread to note that and to remove the beta source-tarball. Thanks to all who built and provided feedback. Would someone with access to an ARMv8 CPU please try the prebuilt-with-SIMD binary I posted? I built it on my Odroid C2, with non-static linkage (which is how the v17.1 binaries were done, IIRC), and need to see how portable that is.

Summary of changes-since-beta-source posted:

I found and fixed several bugs in Mlucas.c, the first 2 of which were exposed by the same testing circumstance, where I had a 1st-time LL-test with p in the 80M range complete and then the code started in on the next assignment, which was a partially-complete (migrated from another machine) DC in the 50M range. So those 2 bugs can be considered corner(ish)-case scenarios, but it was still important to get them fixed.

Bug 1: During processing of the savefiles for the 50M-exponent, the code spat this out:

read_ppm1_savefiles: On restart: Res35m1 checksum error!ERROR: read_ppm1_savefiles Failed on savefile p55******!

After inserting the obviously-missing newline that should follow the !, I dug into the source of the error, which refers to the Selfridge-Hurwitz residues (LL-test residue mod 2³⁵-1 and 2³⁶-1) which I compute as an integrity check for Mlucas savefiles. (The S-H residues were pioneered by those 2 luminaries in their Fermat-number-testing work during the 1950s on hardware of the day which supported 36-bit integers in addition to floating-point numbers.) When writing an interim savefile I compute those during the conversion of the residue from floating-point to packed-bit form and tack them onto the full-length residue written to the 2 redundant savefiles. When I restart from a savefile, after reading the full-length residue R and the 2 checksums, I use a different method to compute R mod 2³⁵-1 and 2³⁶-1 on the fly, namely the fast Montgomery-mod remaindering I described in this manuscript a few years back. I then compare those 2 just-computed remainders to the ones stored in the savefile, to make sure the savefile data were not corrupted in some fashion.

It was that check which was failing, and doing so on both the primary and (normally identical) secondary savefiles. The problem turned out to be this: I use a bytewise array to store R, and in calling the aforementioned remaindering function, which is part of my mi64 (personal GMP-style library) function suite, I cast said array from (uint8*) to (uint64*). (If you're about to ask "but won't that break endian-portability?", indeed it does - more on that in Bug 5 below.) Problem was, in the above finish-big-exponent-then-proceed-to-DC scenario, I was failing to clear any high bytes in the topmost 64-bit limb of the resulting treated-as-64-bit-integer array, i.e. bytes above those needed for the current p-bit residue which had previously held bytes of the larger previous-test residue. Adding the needed short (1-7 passes) clear-bytes loop, all is well, but then I hit...
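The two pieces described above, clearing the stale pad bytes in the top limb and recomputing a checksum residue, can be sketched roughly as follows. This is an illustration only: it uses plain byte-wise Horner reduction rather than the Montgomery-mod remaindering mentioned in the post, it assumes the residue is stored least-significant byte first, and the function names `clear_pad_bytes` and `bytes_mod` are hypothetical rather than taken from Mlucas or mi64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Zero the bytes between the end of a p-bit residue and the end of its topmost
   64-bit limb, so leftovers from a longer earlier residue cannot leak into a
   limb-wise computation. Returns the number of bytes cleared (0..7). */
static size_t clear_pad_bytes(uint8_t *buf, uint64_t p)
{
	size_t nbytes = (size_t)((p + 7) / 8);			/* bytes holding the residue */
	size_t limb_bytes = (size_t)((p + 63) / 64) * 8;	/* rounded up to whole limbs */
	size_t i;
	for (i = nbytes; i < limb_bytes; i++)
		buf[i] = 0;
	return limb_bytes - nbytes;
}

/* R mod m for a residue stored least-significant byte first, using Horner's
   rule from the top byte down. Requires m < 2^56 so that (r << 8) | byte
   cannot overflow 64 bits. Indexing by byte keeps this endian-independent. */
static uint64_t bytes_mod(const uint8_t *buf, size_t nbytes, uint64_t m)
{
	uint64_t r = 0;
	size_t i;
	assert(m != 0 && m < ((uint64_t)1 << 56));
	for (i = nbytes; i-- > 0; )
		r = ((r << 8) | buf[i]) % m;
	return r;
}
```

With m set to 2³⁵-1 or 2³⁶-1 this reproduces the kind of check described, at one modular reduction per byte; it is far slower than a limb-wise method but needs no pointer cast.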

Bug 2: The first test in the above one-test-finishes-and-we-proceed-to-the-next-one scenario was for an exponent very close to the 4608K FFT-length upper limit, so much so that at several points during the run the program detected a 0.4375 fractional part during the per-iteration rounding step, causing it to stop execution, reread from the last savefile and restart at FFT length 5120K. Based on the relative rarity of the 0.4375 ROEs I decided running at 4608K was safe except for the occasional 0.4375-containing iteration interval, so whenever I noticed such auto-switching to 5120K had occurred I killed the code and restarted with an explicit '-fftlen 4608' added to the command line, which overrides the last-FFT-length-used stored in the savefile. Problem was, after finishing the first-time LL test using that length, the program also overrode the 3072K default length for the subsequent DC exponent with 4608K. So a bug in the control logic, now fixed.
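The per-iteration rounding check mentioned above amounts to measuring how far each floating-point convolution output lies from the nearest integer. A minimal sketch, assuming the 0.4375 stop threshold quoted in the post; the names `roundoff_error`, `max_roundoff_error` and `roe_is_dangerous` are hypothetical and the real code performs this inside its carry step:

```c
#include <assert.h>
#include <stddef.h>

/* Distance from x to the nearest integer, in [0, 0.5].
   Only valid while x fits comfortably in a long long. */
static double roundoff_error(double x)
{
	double frac;
	assert(x > -9.0e18 && x < 9.0e18);
	frac = x - (double)(long long)x;	/* truncation toward zero */
	if (frac < 0.0)
		frac = -frac;
	if (frac > 0.5)
		frac = 1.0 - frac;
	return frac;
}

/* Largest roundoff error over n outputs. */
static double max_roundoff_error(const double *x, size_t n)
{
	double worst = 0.0;
	size_t i;
	for (i = 0; i < n; i++) {
		double e = roundoff_error(x[i]);
		if (e > worst)
			worst = e;
	}
	return worst;
}

/* Assumed policy: an error of 0.4375 or more means the result is untrustworthy. */
static int roe_is_dangerous(double max_err)
{
	return max_err >= 0.4375;
}
```

An error near 0.5 means the rounded integer could equally well have gone the other way, which is why a run that reaches such levels falls back to a longer FFT length with more headroom per digit.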

Note that Bug 1 is not in play if the next-assignment is the typical from-scratch one.

Alex Vong, who is working to incorporate Mlucas v18 into the Debian freeware suite, also reported a few bugs:

Bug 3: In 'src/radix16_dyadic_square.c', the function 'SSE2_RADI16_CALC_TWIDDLES_1_2_4_8_13(...)' is missing an 'X'; it should be 'SSE2_RADIX16_CALC_TWIDDLES_1_2_4_8_13(...)' instead.
-- This silly typo is in the 32-bit-build preprocessor-flag-wrapped portion of said sourcefile, and the code is used only for Fermat-number testing, not Mersenne, but it's still a showstopper in that build mode because it will prevent object-linkability-into-a-binary. Rather bizarrely, on my Mac both cc and clang fail to flag the no-macro-by-this-name error.

Bug 4: Missing wide-integer-product macros and wide-mul macro syntax errors in PowerPC 32-bit builds. It's been so long since I've built on PPC32 that this is a wayback-machine exercise, but anyhow: The missing __MULL64 and __MULH64 macros have been added, and the macro name-collisions which caused the syntax errors have been fixed.
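For readers unfamiliar with what such wide-integer-product macros provide: on a target with no native 64x64 -> 128-bit multiply, the low and high halves of the product can be assembled from 32-bit pieces. The sketch below shows the standard schoolbook construction as plain functions; `mull64`, `mulh64` and `mul64_selfcheck` are hypothetical names and this is not the actual Mlucas __MULL64/__MULH64 macro code.

```c
#include <assert.h>
#include <stdint.h>

/* Low 64 bits of a*b: unsigned arithmetic already wraps modulo 2^64. */
static uint64_t mull64(uint64_t a, uint64_t b)
{
	return a * b;
}

/* High 64 bits of the full 128-bit product a*b, built from four 32x32 products. */
static uint64_t mulh64(uint64_t a, uint64_t b)
{
	uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
	uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
	uint64_t lo_lo = a_lo * b_lo;
	uint64_t hi_lo = a_hi * b_lo;
	uint64_t lo_hi = a_lo * b_hi;
	uint64_t hi_hi = a_hi * b_hi;
	/* Everything that lands in bits 32..63 of the product; each term is
	   below 2^32, so the sum cannot overflow and its top half is the carry. */
	uint64_t mid = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + (lo_hi & 0xFFFFFFFFu);
	return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (mid >> 32);
}

/* Tiny self-check: (2^64 - 1)^2 = 2^128 - 2^65 + 1. */
static void mul64_selfcheck(void)
{
	assert(mull64(UINT64_MAX, UINT64_MAX) == 1u);
	assert(mulh64(UINT64_MAX, UINT64_MAX) == UINT64_MAX - 1u);
}
```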

Bug 5: Endian-portability broken due to several byte-array-based functions I added to Mlucas v17. This has been fixed. At least I believe it's been fixed - I don't have access to any big-endian hardware.

Bug 6: Fixed one-dereference-too-few error "_cy_r[0] = -2" instead of "_cy_r[0][0] = -2" in non-SIMD code in radix[1008|1024|4032]*c. These were not present in v17, rather were introduced by some careless search-and-replace-across-multiple-files editing I did in my v18 development work. I only noticed the errors when I did a non-SIMD v18 build on ARM just prior to release and hit segfaults for those carry-step-wrapping radices during self-testing.

Last fiddled with by ewmayer on 2019-03-06 at 21:49
Old 2019-03-07, 08:06   #14
nomead
 
 
"Sam Laur"
Dec 2018
Turku, Finland

2³·41 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Would someone with access to an ARMv8 CPU please try the prebuilt-with-SIMD binary I posted? I built it on my Odroid C2, with non-static linkage (which is how the v17.1 binaries were done, IIRC), and need to see how portable that is.
Works on Raspberry Pi 3B+ and 3A+... No changes in performance though, at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. Works at 2560K though, but there it's about the same speed as before (but it still chose 160 32 16 16 before on version 17.1, and 320 16 16 16 now on version 18.0, so maybe there's some difference anyway).
Old 2019-03-07, 20:15   #15
ewmayer
2ω=0
 
 
Sep 2002
República de California

26376₈ Posts
Default

Quote:
Originally Posted by nomead View Post
Works on Raspberry Pi 3B+ and 3A+... No changes in performance though, at 5120K FFT size it can't use the radix-320 front end due to excessive roundoff. Works at 2560K though, but there it's about the same speed as before (but it still chose 160 32 16 16 before on version 17.1, and 320 16 16 16 now on version 18.0, so maybe there's some difference anyway).
Hmm, just did a quick single-FFT-length self-test @2560 and 5120K on my Odroid C2 using the SIMD binary, '-cpu 0:3 -iters 1000' - here is the summary:

2560K: Radices 320,16x3 run @123.3 ms/iter, maxROE = 0.3125; 160,32,16,16 @124.1 ms/iter, maxROE = 0.34375, so radix 320 gives a tiny speedup here, and both top-candidate radix sets give acceptable ROE levels.

5120K: 320,32,16,16 gives 286.9 ms/iter but ROE = 0.4375 on iters 80 and 752 (thus deemed ineligible as the cfg-file entry for this FFT length); 160,32,32,16 gives 281.9 ms/iter and maxROE = 0.3125, thus is both fastest and has acceptably low ROE, thus gets the nod. Now were those 2 timings reversed and were I planning to do some first-time-tests @5120K on the hardware in question, I would consider manually hacking the mlucas.cfg file to force radix set 320,32,16,16 at this length. Do you still have your self-test screenlog so you can check the timings in this manner?

The timing deterioration on the C2 between 2560 and 5120K is marked - this hardware is thus ill-suited for first-time-tests even ignoring the long runtime and risk of assignment-expiry such a run would incur.
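The selection rule used in the comparison above (take the fastest radix set among those whose maxROE stays acceptable) can be sketched like this. The struct, the function name `pick_radix_set` and the exact cutoff are assumptions for illustration; the post only states that a set hitting ROE = 0.4375 is deemed ineligible.

```c
#include <assert.h>
#include <stddef.h>

/* One self-test result: a radix set, its timing and its worst roundoff error. */
struct radix_timing {
	const char *radices;
	double ms_per_iter;
	double max_roe;
};

/* Index of the fastest entry whose max_roe is strictly below roe_limit,
   or -1 if no entry qualifies. */
static int pick_radix_set(const struct radix_timing *t, size_t n, double roe_limit)
{
	int best = -1;
	size_t i;
	assert(t != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (t[i].max_roe >= roe_limit)
			continue;	/* too much roundoff: not eligible */
		if (best < 0 || t[i].ms_per_iter < t[(size_t)best].ms_per_iter)
			best = (int)i;
	}
	return best;
}
```

Fed the two 5120K results quoted above with a limit of 0.4375, this picks 160,32,32,16, and it would still do so if the two timings were swapped, which is the situation where hand-editing mlucas.cfg comes into consideration.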
Old 2019-03-08, 22:44   #16
nomead
 
 
"Sam Laur"
Dec 2018
Turku, Finland

148₁₆ Posts
Default

Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.

2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125

5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)

So the same here, 5120K with radix-320 is slower for some reason.

I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?

Oh, and one more thing. When I stop the program with Control-C as before, there is this error message:
received SIGINT signal.
ERROR: at line 2146 of file ../src/mers_mod_square.c
Assertion failed: nanosleep fail!
Old 2019-03-08, 23:10   #17
ewmayer
2ω=0
 
 
Sep 2002
República de California

2×13×443 Posts
Default

Quote:
Originally Posted by nomead View Post
Well I thought I had saved the output from the self-test but I forgot that it goes to stderr, not stdout. Meh. Ran it again, anyway.

2560K 320 16 16 16: 145.1 ms/iter, MaxErr = 0.34375 (chosen in mlucas.cfg)
2560K 160 16 16 32: 152.2 ms/iter, MaxErr = 0.3125
2560K 160 32 16 16: 146.7 ms/iter, MaxErr = 0.3125

5120K 320 16 16 32: 356.2 ms/iter, MaxErr = 0.46875 (limit exceeded on iters 425, 709, 795 if it makes a difference)
5120K 160 16 32 32: 349.3 ms/iter, MaxErr = 0.3125
5120K 160 32 32 16: 336.7 ms/iter, MaxErr = 0.28125 (chosen in mlucas.cfg)

So the same here, 5120K with radix-320 is slower for some reason.
Thanks - eagle-eyed readers may note that while your overall results are essentially the same, some of the details - specifically the precise maxROE value and iterations-with-ROE-warning - differ from those I posted. Same binary, same-ARMv8-compliant-hardware, so shouldn't the numbers be *exactly* the same? The reason for subtle differences lies in v18's usage of random residue shift - if your initial shift count differs from mine, the ROE numbers will as well.

Quote:
I'm only running double checks, and the exponents I'm getting from Primenet seem to run on a 2816K FFT. I really don't know how the internals work, so what I'm asking may be totally silly: would a doubling of radix-176 (352?) be beneficial, or even work at all?
Not silly at all - but the larger initial radices appear quite hit-or-miss in terms of speedups: 288 is better than 144 across most platforms, especially at 4608K (2304K is more precise-platform dependent). Radix-320 was rather more disappointing in that regard. I plan to implement Radix-352 in v19, but it's more likely to have an impact at 5632K than at 2816K, i.e. once the GIMPS first-time-testing wavefront passes p ~106M, thus no huge rush.

Quote:
Oh, and one more thing. When I stop the program with Control-C as before, there is this error message:
received SIGINT signal.
ERROR: at line 2146 of file ../src/mers_mod_square.c
Assertion failed: nanosleep fail!
I get those errors sometimes, typically in the context of running under the debugger - they basically mean some signal has interacted badly with the nanosleep() command I use as part of my wait-for-all-threads-to-finish-current-task management in multithreaded execution mode. Future enhancements of the new signal-catching code using the supposedly more robust sigaction() may help here; for now, YMMV as to whether the signal code works as intended. Worst case you lose the iterations done since the last normally scheduled checkpoint. I assume you restarted the above run - can you post the snip of code from the p*.stat file bracketing the interrupt?
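For context on the interaction described above: POSIX nanosleep() returns -1 with errno set to EINTR when a caught signal cuts the sleep short, and fills in the unslept remainder. Whether that is what trips the Mlucas assertion is an assumption here, but a sleep loop that tolerates interruption, paired with a sigaction()-installed handler that only sets a flag, might look like the following sketch. The names `stop_requested`, `install_stop_handlers` and `sleep_fully` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Set by the handler; the main loop would poll it and write savefiles. */
static volatile sig_atomic_t stop_requested = 0;

static void on_stop_signal(int signo)
{
	(void)signo;
	stop_requested = 1;	/* only async-signal-safe work in a handler */
}

/* Install the handler for SIGINT and SIGTERM. Returns 0 on success, -1 on failure. */
static int install_stop_handlers(void)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_stop_signal;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGINT, &sa, NULL) != 0)
		return -1;
	if (sigaction(SIGTERM, &sa, NULL) != 0)
		return -1;
	return 0;
}

/* Sleep for the full interval, resuming with the remaining time whenever a
   signal interrupts nanosleep. Returns 0 on success, -1 on any other error. */
static int sleep_fully(long nanoseconds)
{
	struct timespec req, rem;
	assert(nanoseconds >= 0);
	req.tv_sec = nanoseconds / 1000000000L;
	req.tv_nsec = nanoseconds % 1000000000L;
	while (nanosleep(&req, &rem) != 0) {
		if (errno != EINTR)
			return -1;
		req = rem;	/* interrupted: sleep only the unslept part */
	}
	return 0;
}
```

Treating EINTR as "try again" rather than as a failure keeps a Control-C from turning a thread-wait pause into an abort, leaving the decision to stop to the code that polls the flag.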
Old 2019-03-09, 03:52   #18
nomead
 
 
"Sam Laur"
Dec 2018
Turku, Finland

2³×41 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Worst case you lose the iterations done since the last normally scheduled checkpoint. I assume you restarted the above run - can you post the snip of code from the p*.stat file bracketing the interrupt?
And indeed, that seems to happen. The program doesn't manage to save progress when interrupted and restarts from the last save file. I started this test with 17.1 way back in December so the residue shift is 0. There is also some very small random variance in the execution speed, but that doesn't seem to change while the program is running. This behaviour was the same with version 17.1. It was 168.4 ms for some time before this restart (and this was also spot on the same iteration speed as on 17.1, for a couple of months) , and has stayed at 167.9 ms now for the time it's been running since the restart.
Code:
[Mar 07 20:29:18] M5132xxxx Iter# = 36790000 [71.67% complete] clocks = 00:28:04.083 [168.4083 msec/iter] Res64: 74873862A50BB57E. AvgMaxErr = 0.071576885. MaxErr = 0.109375000. Residue shift count = 0.
Restarting M5132xxxx at iteration = 36790000. Res64: 74873862A50BB57E, residue shift count = 0
M5132xxxx: using FFT length 2816K = 2883584 8-byte floats, initial residue shift count = 0
 this gives an average   17.800407062877309 bits per digit
Using complex FFT radices       176        32        16        16
[Mar 08 17:10:27] M5132xxxx Iter# = 36800000 [71.69% complete] clocks = 00:27:58.834 [167.8834 msec/iter] Res64: C6533BF704CDF1F1. AvgMaxErr = 0.071412456. MaxErr = 0.101562500. Residue shift count = 0.
Old 2019-03-09, 19:54   #19
ewmayer
2ω=0
 
 
Sep 2002
República de California

10110011111110₂ Posts
Default

Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)

Also, a corrigendum to my note re. leading-radix 352 and FFT length 5632K:

Quote:
Originally Posted by ewmayer View Post
I plan to implement Radix-352 in v19, but it's more likely to have an impact at 5632K than at 2816K, i.e. once the GIMPS first-time-testing wavefront passes p ~106M, thus no huge rush.
Actually p ~106M is the *upper* limit for 5632, lower limit (i.e. upper limit for 5120K) is ~96M. So I guess I better get on that!
Old 2019-03-09, 20:52   #20
nomead
 
 
"Sam Laur"
Dec 2018
Turku, Finland

328₁₀ Posts
Default

Quote:
Originally Posted by ewmayer View Post
Sorry to see signal-handling not working for you - that was a *very* late-breaking addition to v18. In any event you're no worse off than with the previous version, you just need to stop it with the constant interruptions! How do you expect us to get any work done... :)
Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago... And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.
Old 2019-03-09, 21:11   #21
ewmayer
2ω=0
 
 
Sep 2002
República de California

10110011111110₂ Posts
Default

Quote:
Originally Posted by nomead View Post
Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago... And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.
In order to aid this "army ant" computing model - I'm taking delivery of a couple of for-parts cellphones for my part - I'm currently working with Aaron (MadPoo) on enhancing the primenet.py script to do a couple of v5-server things to support assignment progress updates. That should allow ARM users to run longer 1st-time tests, should they desire to, without having said assignments expire once they hit the 180-day mark.
Old 2019-03-12, 19:21   #22
ewmayer
2ω=0
 
 
Sep 2002
República de California

10110011111110₂ Posts
Default

Quote:
Originally Posted by nomead View Post
Yeah it's not a problem at all in real use, the last time I interrupted the run (before upgrading to 18.0) was a couple months ago...
Here's the signal stuff working on my Debian-running Intel Haswell quad, on a first-time LL test running on all 4 cores ... yesterday was the first really springlike day in my neck of the woods, my BR where the box sits in a corner has southern exposure and gets pretty warm on days like that. The Haswell uses just stock cooling and even with the case side panel on the CPU side removed, I find the system starts getting flaky when ambient goes above 75F. So late morning yesterday I clicked the on/off switch on the case to turn the system off, then back on late evening once things had cooled off. Note the times-of-day in the following p*.stat snip are ~8 hours behind; this is a headless system and I've just let the internal clock drift in the years I've owned it:
Code:
[Mar 11 06:22:34] M86687009 Iter# = 8030000 [ 9.26% complete] clocks = 00:01:58.909 [ 11.8909 msec/iter] Res64: 5C4BB4BE6AE5BBB0. AvgMaxErr = 0.214745667. MaxErr = 0.312500000. Residue shift count = 12479875.
[Mar 11 06:24:32] M86687009 Iter# = 8040000 [ 9.27% complete] clocks = 00:01:58.691 [ 11.8691 msec/iter] Res64: 6757643F59CD637A. AvgMaxErr = 0.214732219. MaxErr = 0.312500000. Residue shift count = 15291034.
received SIGTERM signal.
Iter = 8041419: Writing savefiles and exiting.
[Mar 11 06:24:50] M86687009 Iter# = 8041419 [ 9.28% complete] clocks = 00:00:16.910 [ 11.9174 msec/iter] Res64: A15F129B39F5F7AD. AvgMaxErr = 0.214937561. MaxErr = 0.281250000. Residue shift count = 24137129.
...
Restarting M86687009 at iteration = 8041419. Res64: A15F129B39F5F7AD, residue shift count = 24137129
M86687009: using FFT length 4608K = 4718592 8-byte floats, initial residue shift count = 24137129
 this gives an average   18.371372010972763 bits per digit
Using complex FFT radices       288        16        16        32
[Mar 11 13:15:57] M86687009 Iter# = 8050000 [ 9.29% complete] clocks = 00:01:40.629 [ 11.7271 msec/iter] Res64: FA296EE64B5710E2. AvgMaxErr = 0.214812786. MaxErr = 0.281250000. Residue shift count = 20164181.
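The "average bits per digit" figure in the restart log is simply the exponent divided by the number of floating-point words in the FFT, with an FFT length of N "K" meaning N x 1024 words. A minimal sketch, with the hypothetical helper name `avg_bits_per_digit`:

```c
#include <assert.h>
#include <stdint.h>

/* Average number of residue bits carried by each 8-byte float of the FFT.
   fft_len_k is the FFT length in units of 1024 doubles, e.g. 4608 for "4608K". */
static double avg_bits_per_digit(uint64_t p, uint32_t fft_len_k)
{
	assert(fft_len_k > 0);
	return (double)p / ((double)fft_len_k * 1024.0);
}
```

For M86687009 at 4608K (4718592 floats) this gives roughly 18.37 bits per digit, in line with the value printed in the log above; pushing that figure higher at a fixed FFT length is what drives the roundoff error up.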
Quote:
And I'm building a small cluster of RPi 3A+ boards in a "set and forget" configuration double checking LL. I actually got the first five up and almost running, but one of the boards has almost deaf WiFi for some reason, and I haven't had the time yet to check what's wrong with it. Maybe I'll just have to run that one board completely offline if it requires too much effort to fix. But once they're running, there will be no real need to interrupt them, ever.
I just took delivery of two sold-for-parts-on-ebay Samsung Galaxy S7s in the past couple of days, project for the coming week is to get them rooted and running Mlucas, also awaiting delivery of a USB charging station (which should have enough juice to power 4 such phones running Mlucas on all cores) and USB fan (which the Q&A section on the product page says draws just 0.8W at top speed in USB mode) ... the fan should be sufficient to cool a pair of such 4-phone mini farms, which should give me a total compute throughput comparable to the above-mentioned Haswell quad, in a rather smaller footprint.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.