2019-03-01, 04:10   #12
ewmayer
Sep 2002
República de California

2×13×443 Posts

Thought I'd share key parts of a PM exchange GP2 and I had this week, by way of a followup to the foregoing posts in this thread. Don't think of it as tl;dr, think of it as the scandalous, sordid details of real-world code wrangling laid bare for the world to see! :)

Originally Posted by ewmayer
Originally Posted by GP2
The new signal-catching functionality doesn't always work.

On Skylake X on Google Cloud, it works most of the time. I search for "Using complex FFT radices" lines in the stat file, and see if there was a "received SIGTERM" six lines earlier. It's usually there. I also try stopping manually with a kill -s SIGTERM command. It didn't work one time, and then it did work another time.

But on AWS on the ARM architecture, it seems like it doesn't work at all.

Maybe the program writes to the savefile and the stat file asynchronously and doesn't wait for the writes to complete before exiting?
I've tested it on my Intel Haswell/linux, Macbook/osx and ARMv8/linux, on all 3 of those systems it works fine ... Let's review the associated code - at Mlucas.c:176 we have
void sig_handler(int signo)
{
	if (signo == SIGINT) {
		fprintf(stderr,"received SIGINT signal.\n");	sprintf(cbuf,"received SIGINT signal.\n");
	} else if(signo == SIGTERM) {
		fprintf(stderr,"received SIGTERM signal.\n");	sprintf(cbuf,"received SIGTERM signal.\n");
	} else if(signo == SIGHUP) {
		fprintf(stderr,"received SIGHUP signal.\n");	sprintf(cbuf,"received SIGHUP signal.\n");
	}
	// Toggle a global to allow desired code sections to detect signal-received and take appropriate action:
	MLUCAS_KEEP_RUNNING = 0;
}
The global MLUCAS_KEEP_RUNNING is used by the code to allow any function that needs to be informed of such an interrupt signal to do so. Open mers_mod_square.c in an edit window and search for the above global - you'll see in the main processing loop which does one LL-test iteration per loop, said loop now checks not only the iteration value but also the above global to see whether to break or not. That's because we can't simply exit willy-nilly on a signal, we need to cleanly finish the current iteration and do a few further things first. Keep grepping for MLUCAS_KEEP_RUNNING in mers_mod_square.c and you see
// On early-exit-due-to-interrupt, decrement iter since we didn't actually do the (iter)th iteration
if(iter < ihi) {
	ASSERT(HERE, !MLUCAS_KEEP_RUNNING, "Premature iteration-loop exit due to unexpected condition!");
	ROE_ITER = iter;	// Function return value used for error code, so save number of last-iteration-completed-before-interrupt here
That catches early-loop-exit-due-to-signal, decrements the loop counter (since in such cases we didn't do the (iter)th iteration, rather we broke out of the loop at the start of it), sets a newly-added special error code, and saves the iteration-of-interrupt value in another global, ROE_ITER. The above function then proceeds to do just what it does on normal (iter == ihi) loop-exit and returns ERR_INTERRUPT. Now go back to Mlucas.c and grep for ERR_INTERRUPT ... right below the usual function call which takes the DP-float residue at the end of each iteration cycle and converts it to packed-bytewise form we have
		if(INTERACT) {
			if(ierr == ERR_INTERRUPT)
Ah, I think I see the problem you may be hitting - what in !%$@ is that else-break doing inside the if()? The if() is supposed to cause immediate-exit-sans-savefile-write in interactive-timing-test (e.g. self-tests) mode, otherwise proceed to the following section of code, which writes the savefiles and is now followed by the signal-triggered exit:
		if(ierr == ERR_INTERRUPT) exit(0);
But the stray 'break' - I think I had a diagnostic-print there during my debugging of the new functionality, but why I replaced said print with a break instead of just deleting the whole else-portion of the conditional after my debug step-thru was complete is a mystery to me - would cause exit from the nearest enclosing for/while/switch instead, which in this case is the main for(;;) in Mlucas.c which simply processes LL-test assignments until it runs out of them.

So try modifying the above if() to
		if(INTERACT && (ierr == ERR_INTERRUPT))
rebuilding Mlucas.c, relinking and see if that cures the issue for, say, your AWS ARM build, since that one seems to be reliably failing to catch-signals. I rebuilt on my ARMv8 with the above code change, no change in behavior for me as expected since the signal-catching was already (and given the break-bug, incorrectly!) working there.
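
The break behavior described above can be reproduced in isolation. The following is only a minimal sketch with hypothetical names (passes_with_savefile, stop_at), not the Mlucas code itself: it shows that a break sitting in the else-branch of a nested if() leaves the nearest enclosing loop, so the statement after the if() never runs on that pass.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for an assignment loop (not Mlucas code).
   Returns how many passes reached the "savefile write" step.
   On pass number stop_at, the inner else-branch executes a break. */
int passes_with_savefile(int n_passes, int stop_at)
{
	int written = 0;
	assert(n_passes >= 0);
	for (int pass = 0; pass < n_passes; pass++) {	/* plays the role of the outer for(;;) */
		if (pass <= stop_at) {
			if (pass < stop_at)
				printf("pass %d: carrying on\n", pass);
			else
				break;	/* leaves the for loop, not merely the if/else */
		}
		written++;	/* stands in for the savefile write that follows the if() */
	}
	return written;
}
```

Calling passes_with_savefile(5, 2) returns 2: passes 0 and 1 reach the write step, while pass 2 breaks out of the loop before getting there.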

The thing that puzzles me is, why does the signal-catching work at all given the bug, much less work on every platform I tried it on?
That change, along with several other bugfixes, is on deck in a patched tarball; am waiting to hear from several builders who reported issues addressed in the patch for their feedback re. solution. It remains a mystery to me why the unpatched code (as posted in the OP of the "v18 available" thread) containing the "bad break" still seems to work - in the sense that signals are caught, at least in my tests - as though the else-break were not there at all. (And GP2 confirms that removing the else-break does not change anything for him on AWS.) Can anyone possibly shed some light on this? (BTW, in case you were thinking that the AT&T code developer mentioned here was me, it wasn't, thank goodness.)

Getting back to signal-handling, GP2 did some further digging and suggests that POSIX sigaction() is the more-robust way to do things, but having looked at the much-more-involved interface for that, that's going to go into the to-do list below some higher-priority items. If, like GP2 on AWS, the signal catching doesn't work for you, you are no worse off than before, i.e. it's what is known as a "nice to have". One other concern GP2 had after reading the above-linked stackoverflow exchange re. sigaction was with regard to multithreaded code like Mlucas, since there it is noted that signal() may be unsuitable for multithreaded applications.
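
For orientation, here is a rough sketch of what a sigaction()-based version of the flag-toggling scheme could look like. It is not taken from Mlucas: keep_running, on_signal, install_handlers and run_iterations are hypothetical names, the handler only sets a flag, and the raise_at parameter is a test hook that simulates a kill -s SIGTERM arriving mid-run. The loop finishes the iteration in progress before exiting, and the return value is the last iteration actually completed.

```c
#define _POSIX_C_SOURCE 200809L	/* expose sigaction() under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for a keep-running global; 1 = carry on, 0 = stop. */
static volatile sig_atomic_t keep_running = 1;

/* The handler does nothing but clear the flag. */
static void on_signal(int signo) { (void)signo; keep_running = 0; }

/* Install the same handler for the three signals named in the post.
   Returns 0 on success, -1 if any sigaction() call fails. */
int install_handlers(void)
{
	const int sigs[] = { SIGINT, SIGTERM, SIGHUP };
	struct sigaction sa;
	memset(&sa, 0, sizeof sa);
	sa.sa_handler = on_signal;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;	/* restart interrupted system calls where possible */
	for (size_t i = 0; i < sizeof sigs / sizeof sigs[0]; i++)
		if (sigaction(sigs[i], &sa, NULL) != 0)
			return -1;
	return 0;
}

/* Do iterations ilo+1 .. ihi, checking the flag once per iteration.
   Returns the number of the last iteration actually completed. */
int run_iterations(int ilo, int ihi, int raise_at)
{
	int iter;
	assert(ilo <= ihi);
	for (iter = ilo + 1; iter <= ihi && keep_running; iter++) {
		/* ... one iteration's worth of work would go here ... */
		if (iter == raise_at)
			raise(SIGTERM);	/* test hook: simulate an incoming SIGTERM */
	}
	/* iter is now one past the last completed iteration, on both the
	   normal and the interrupted exit path, hence the decrement. */
	return iter - 1;
}
```

With the handlers installed, run_iterations(0, 10, 4) stops after iteration 4 and returns 4; a caller could then write its savefile for that iteration before exiting.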

After reading the thread, I'm actually somewhat reassured on that point, here's why: In my implementation of signal-catching I added a global MLUCAS_KEEP_RUNNING to encode whether an interrupt has been received in a manner such that any thread can query it. According to the above link, we don't know which thread gets the signal, but we do know that only one of them does. That's good, because multiple threads getting a signal and trying to toggle MLUCAS_KEEP_RUNNING as a result would be bad. Anyhow, the part of the code that checks whether-to-keep-running is single-threaded; that check only happens after all threads have finished their work on the current iteration. So that should be OK, and the 3 systems I mentioned - x86/Linux (Intel Haswell), x86/osx (Macbook) and ARMv8/linux (Odroid C2) - on which I successfully tested the signal-catching functionality were all running the code multithreaded.
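
The arrangement described above can be sketched with POSIX threads. Everything here is an illustrative assumption rather than Mlucas code: NTHREADS, keep_running_mt, do_chunk and run_threaded are made-up names, and the raise() call inside one worker is a test hook that delivers SIGTERM to that single thread. The point of the sketch is that only the main thread reads the flag, and only after every worker for the current iteration has been joined.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>

#define NTHREADS 4	/* hypothetical worker count */

/* Hypothetical keep-running flag, cleared by whichever thread takes the signal. */
static volatile sig_atomic_t keep_running_mt = 1;

static void on_signal_mt(int signo) { (void)signo; keep_running_mt = 0; }

struct chunk { int tid, iter, raise_at; };

static void *do_chunk(void *arg)
{
	const struct chunk *c = arg;
	/* ... this thread's share of the iteration's work would go here ... */
	if (c->tid == NTHREADS - 1 && c->iter == c->raise_at)
		raise(SIGTERM);	/* test hook: the signal lands on this one thread */
	return NULL;
}

/* Run up to ihi iterations; returns the number completed, or -1 on error. */
int run_threaded(int ihi, int raise_at)
{
	int iter = 0;
	assert(ihi >= 0);
	if (signal(SIGTERM, on_signal_mt) == SIG_ERR)
		return -1;
	while (iter < ihi && keep_running_mt) {	/* flag is read by the main thread only */
		pthread_t t[NTHREADS];
		struct chunk c[NTHREADS];
		int started = 0, failed = 0;
		for (int i = 0; i < NTHREADS; i++) {
			c[i] = (struct chunk){ i, iter + 1, raise_at };
			if (pthread_create(&t[i], NULL, do_chunk, &c[i]) != 0) {
				failed = 1;
				break;
			}
			started++;
		}
		for (int i = 0; i < started; i++)
			pthread_join(t[i], NULL);	/* all workers finish the iteration first */
		if (failed)
			return -1;
		iter++;	/* iteration complete; the flag is consulted again at the loop top */
	}
	return iter;
}
```

Here run_threaded(8, 3) returns 3: the signal arrives during iteration 3, the remaining workers still finish that iteration, and the loop exits at the next check. Some toolchains need the -pthread flag when compiling and linking.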

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.