mersenneforum.org v18 pre-release discussion

2019-03-01, 04:10   #12
ewmayer
2ω=0

Sep 2002
República de California

2×13×443 Posts

Thought I'd share key parts of a PM-exchange GordP and I had this week by way of a followup to the foregoing posts in this thread. Don't think of it as tl;dr, think of it as the scandalous, sordid details of real-world code wrangling laid bare for the world to see! :)

Quote:
Originally Posted by ewmayer
Quote:
 Originally Posted by GP2 The new signal-catching functionality doesn't always work. On Skylake X on Google Cloud, it works most of the time. I search for "Using complex FFT radices" lines in the stat file, and see if there was a "received SIGTERM" six lines earlier. It's usually there. I also try stopping manually with a kill -s SIGTERM command. It didn't work one time, and then it did work another time. But on AWS on the ARM architecture, it seems like it doesn't work at all. Maybe the program writes to the savefile and the stat file asynchronously and doesn't wait for the writes to complete before exiting?
I've tested it on my Intel Haswell/linux, Macbook/osx and ARMv8/linux, on all 3 of those systems it works fine ... Let's review the associated code - at Mlucas.c:176 we have
Code:
void sig_handler(int signo)
{
	if (signo == SIGINT) {
	} else if (signo == SIGTERM) {
	} else if (signo == SIGHUP) {
	}
	// Toggle a global to allow desired code sections to detect signal-received and take appropriate action:
	MLUCAS_KEEP_RUNNING = 0;
}
The global MLUCAS_KEEP_RUNNING is used by the code to allow any function that needs to be informed of such an interrupt signal to do so. Open mers_mod_square.c in an edit window and search for the above global - you'll see in the main processing loop which does one LL-test iteration per loop, said loop now checks not only the iteration value but also the above global to see whether to break or not. That's because we can't simply exit willy-nilly on a signal, we need to cleanly finish the current iteration and do a few further things first. Keep grepping for MLUCAS_KEEP_RUNNING in mers_mod_square.c and you see
Code:
	// On early-exit-due-to-interrupt, decrement iter since we didn't actually do the (iter)th iteration
	if(!MLUCAS_KEEP_RUNNING) iter--;
	if(iter < ihi) {
		ASSERT(HERE, !MLUCAS_KEEP_RUNNING, "Premature iteration-loop exit due to unexpected condition!");
		ierr = ERR_INTERRUPT;
		ROE_ITER = iter;	// Function return value used for error code, so save number of last-iteration-completed-before-interrupt here
	}
That catches early-loop-exit-due-to-signal, decrements the loop counter (since in such cases we didn't do the (iter)th iteration, rather we broke out of the loop at the start of it), sets a newly-added special error code, and saves the iteration-of-interrupt value in another global, ROE_ITER. The above function then proceeds to do just what it does on normal (iter == ihi) loop-exit and returns ERR_INTERRUPT. Now go back to Mlucas.c and grep for ERR_INTERRUPT ... right below the usual function call which takes the DP-float residue at the end of each iteration cycle and converts it to packed-bytewise form we have
Code:
		if(INTERACT) {
			if(ierr == ERR_INTERRUPT)
				exit(0);
			else
				break;
		}
Ah, I think I see the problem you may be hitting - what in !%$@ is that else-break doing inside the if()? The if() is supposed to cause immediate-exit-sans-savefile-write in interactive-timing-test (e.g. self-tests) mode, otherwise proceed to the following section of code, which writes the savefiles and is now followed by the signal-triggered exit:
Code:
		if(ierr == ERR_INTERRUPT) exit(0);
But the stray 'break' - I think I had a diagnostic-print there during my debugging of the new functionality, but why I replaced said print with a break instead of just deleting the whole else-portion of the conditional after my debug step-thru was complete is a mystery to me - would cause exit from the nearest enclosing for/while/switch instead, which in this case is the main for(;;) in Mlucas.c which simply processes LL-test assignments until it runs out of them.

So try modifying the above if() to
Code:
		if(INTERACT && (ierr == ERR_INTERRUPT))
			exit(0);
rebuilding Mlucas.c, relinking and see if that cures the issue for, say, your AWS ARM build, since that one seems to be reliably failing to catch-signals. I rebuilt on my ARMv8 with the above code change, no change in behavior for me as expected since the signal-catching was already (and given the break-bug, incorrectly!) working there.
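For readers without the source open, the control flow walked through above can be condensed into a small standalone sketch. This is not the Mlucas code: `keep_running`, `last_iter_done`, `iterate` and `run_demo` are hypothetical stand-ins for MLUCAS_KEEP_RUNNING, ROE_ITER and the iteration loop, the error-code values are made up, and the exact form of the loop condition is an assumption based on the description above. raise() stands in for an external kill -s SIGTERM.

```c
#include <assert.h>
#include <signal.h>

/* Illustrative values only; the real Mlucas error codes are not shown in the post. */
enum { ERR_NONE = 0, ERR_INTERRUPT = 1 };

static volatile sig_atomic_t keep_running = 1; /* stand-in for MLUCAS_KEEP_RUNNING */
static int last_iter_done = 0;                 /* stand-in for ROE_ITER */

void sig_handler(int signo)
{
	(void)signo;
	keep_running = 0; /* only set a flag; the iteration loop reacts to it */
}

/* Do iterations ilo+1 .. ihi. A SIGTERM is raised during iteration
   interrupt_after to simulate a signal arriving mid-run (pass 0 for none). */
int iterate(int ilo, int ihi, int interrupt_after)
{
	int iter;
	/* Assumed loop shape: test both the iteration bound and the flag. */
	for (iter = ilo + 1; iter <= ihi && keep_running; iter++) {
		/* ... one unit of work would go here ... */
		if (iter == interrupt_after)
			raise(SIGTERM); /* handler runs before raise() returns */
	}
	/* On early exit the loop stopped at the start of iteration iter,
	   so that iteration was never done: step the counter back by one. */
	if (!keep_running) iter--;
	if (iter < ihi) {
		assert(!keep_running); /* the only expected cause of early exit */
		last_iter_done = iter;
		return ERR_INTERRUPT;
	}
	return ERR_NONE;
}

/* Reset state, install the handler, and run 100 iterations. */
int run_demo(int interrupt_after)
{
	keep_running = 1;
	last_iter_done = 0;
	signal(SIGTERM, sig_handler);
	return iterate(0, 100, interrupt_after);
}
```

Calling run_demo(37) returns ERR_INTERRUPT with last_iter_done set to 37, and run_demo(0) runs all 100 iterations and returns ERR_NONE. In the real program the caller would write the savefiles before acting on the interrupt code.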

The thing that puzzles me is, why does the signal-catching work at all given the bug, much less work on every platform I tried it on?
That change, along with several other bugfixes, is on deck in a patched tarball; am waiting to hear from several builders who reported issues addressed in the patch for their feedback re. solution. It remains a mystery to me why the unpatched code (as posted in the OP of the "v18 available" thread) containing the "bad break" still seems to work - in the sense that signals are caught, at least in my tests - as though the else-break were not there at all. (And GP2 confirms that removing the else-break does not change anything for him on AWS.) Can anyone possibly shed some light on this? (BTW, in case you were thinking that the AT&T code developer mentioned here was me, it wasn't, thank goodness.)
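One possible reading, based only on the snippets as quoted (the surrounding Mlucas.c code is not shown, so treat this as an assumption rather than a diagnosis): the else-break sits inside if(INTERACT), so it can only execute when INTERACT is nonzero. If ordinary LL-test runs have INTERACT == 0, the whole block is skipped in both versions and control falls through to the savefile write and the later exit(0). The sketch below models the two versions with exit(0) and break replaced by return codes; all function names and enum values are hypothetical.

```c
#include <assert.h>

/* Illustrative values only; not the real Mlucas error codes. */
enum { ERR_NONE = 0, ERR_INTERRUPT = 1, ERR_OTHER = 2 };
enum outcome { FALL_THROUGH, EXIT_NOW, BREAK_LOOP };

/* Control flow of the unpatched snippet. */
enum outcome original_check(int interact, int ierr)
{
	if (interact) {
		if (ierr == ERR_INTERRUPT)
			return EXIT_NOW;
		else
			return BREAK_LOOP; /* the stray else-break */
	}
	return FALL_THROUGH; /* on to the savefile write and later exit(0) */
}

/* Control flow of the suggested fix. */
enum outcome patched_check(int interact, int ierr)
{
	if (interact && (ierr == ERR_INTERRUPT))
		return EXIT_NOW;
	return FALL_THROUGH;
}

/* Returns 1 if both versions behave identically for every modeled
   error code at the given value of interact, else 0. */
int versions_agree(int interact)
{
	int ierr;
	for (ierr = ERR_NONE; ierr <= ERR_OTHER; ierr++)
		if (original_check(interact, ierr) != patched_check(interact, ierr))
			return 0;
	return 1;
}
```

In this model versions_agree(0) returns 1 and versions_agree(1) returns 0: the two versions differ only in interactive mode when no interrupt occurred, which would be consistent with the patch making no visible difference to signal handling in non-interactive runs.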

Getting back to signal-handling, GP2 did some further digging and suggests that POSIX sigaction() is the more-robust way to do things, but having looked at the much-more-involved interface for that, that's going to go into the to-do list below some higher-priority items. If, like GP2 on AWS, the signal catching doesn't work for you, you are no worse off than before, i.e. it's what is known as a "nice to have". One other concern GP2 had after reading the above-linked stackoverflow exchange re. sigaction was with regard to multithreaded code like Mlucas, since there it is noted that signal() may be unsuitable for multithreaded applications.
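For reference, a minimal sketch of what a sigaction()-based installation might look like. This is not code from Mlucas: `stop_requested`, `on_signal`, `install_handlers` and `still_running` are hypothetical names, and the choice of SA_RESTART is an assumption about the desired behavior. It does not address the multithreading concern, where blocking the signals in worker threads and handling them in one dedicated thread is a common alternative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int signo)
{
	(void)signo;
	stop_requested = 1; /* async-signal-safe: just set a flag */
}

/* Install on_signal for SIGINT, SIGTERM and SIGHUP.
   Returns 0 on success, -1 if any sigaction() call fails. */
int install_handlers(void)
{
	static const int signals[] = { SIGINT, SIGTERM, SIGHUP };
	struct sigaction sa;
	size_t i;

	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);  /* block no extra signals while the handler runs */
	sa.sa_flags = SA_RESTART;  /* assumption: restart interrupted system calls */

	for (i = 0; i < sizeof signals / sizeof signals[0]; i++)
		if (sigaction(signals[i], &sa, NULL) != 0)
			return -1;
	return 0;
}

/* Nonzero until one of the handled signals has been received. */
int still_running(void)
{
	return !stop_requested;
}
```

Unlike signal(), whose handler may be reset to the default after the first delivery on some systems, a handler installed with sigaction() stays in place unless SA_RESETHAND is requested. A main loop would call install_handlers() once at startup and then test still_running() at the top of each iteration.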
