mersenneforum.org > Great Internet Mersenne Prime Search > News

2019-02-24, 01:18   #23
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by dcheuk View Post
I'm curious, what is the probability that a residue from an LL test is bad? Is it dependent on the software/hardware/exponent/FFT size/etc?

If the residue is not guaranteed to be 100% correct (well, I guess that's why we have double checks), what can we do overall to improve the accuracy (besides getting ECC memory and Quadro/Tesla graphics cards), and should we be expecting bad results with some frequency? And how unlikely is it for both the original LL test and the double-check test of an exponent to return identical bad residues?

I should probably read the Wikipedia page on FFTs or something to understand it, but I keep procrastinating. What prerequisites do I need to understand the mathematical concepts and proofs behind the FFT, assuming I have the standard algebra/analysis background of a grad student? Any recommendations for resources that introduce and prove this topic?

Thanks guys. I appreciate it.
The usual figure is about 2% per LL test. That might be from before the addition of the Jacobi check to prime95. It will go up somewhat as running larger exponents takes longer, roughly in proportion to p^2.12, for the same hourly hardware error rate. The probability of an LL test being in error goes up considerably if the error counts accumulated during a prime95 run are nonzero. Even a single illegal sumout error recorded raises the probability of an erroneous final residue to around 40%, if I recall Madpoo's recent post about that correctly. Hardware tends to get less reliable with age. PRP with the Gerbicz check is much more reliable, so consider running that instead. It was bulletproof on a very unreliable system I tested it on. It is still possible to have errors in the final residue with PRP and the Gerbicz check, but it is unlikely, and it is the best we can do for now.
In any event, run self-tests such as double checks regularly, at least annually, to check system reliability on these very unforgiving calculations.
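
As a back-of-the-envelope example of that scaling: going from a ~50M exponent to a ~100M exponent multiplies run time by about 2^2.12 ≈ 4.3, so a ~2% per-test error probability becomes very roughly 8-9% on the same hardware, all else equal. Treat the numbers as rough; the point is that a longer run gives the hardware more opportunities to produce an error.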

It does depend on the software and hardware used. CUDALucas and cllucas do LL and do not have the Jacobi check. The Jacobi check has a 50% chance of detecting an error if one occurs. Hardware with unreliable memory is more error-prone. Overclocking too far or overheating increases error rates.
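
For the curious, here is a toy Python sketch of the idea behind the Jacobi check (my illustration, not prime95's code): for a correct LL sequence, the Jacobi symbol (s_i - 2 | Mp) is -1 at every iteration, so testing a stored residue against that invariant catches a random corruption about half the time.
Code:
def jacobi(a, n):
    """Jacobi symbol (a|n) for odd n > 0."""
    a %= n
    result = 1
    while a:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

p = 11                      # tiny example; M11 = 2047 is composite
Mp = (1 << p) - 1
s = 4
for _ in range(p - 2):      # the LL iterations
    s = (s * s - 2) % Mp
    assert jacobi(s - 2, Mp) == -1   # holds for every correct interim residue

bad = s ^ (1 << 5)          # simulate a flipped bit in memory
print(jacobi(bad - 2, Mp))  # -1 or +1; a result of +1 would expose the corruption
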
In CUDALucas, there are CUDA levels and GPU models that interact badly, even on highly reliable hardware. These produce errors such as: instead of the usual LL sequence, at some point an all-zero value gets returned. If that happens before the subtraction of 2, then FFF...FFD is the result (the equivalent of -2). It gets squared and 2 subtracted, and voila, now you have 000...02, since (-2)^2 - 2 = 2. Then it iterates at 2 until the end. These sorts of errors can be triggered at will. Some of them, under certain circumstances, have the side effect of making the iterations go much faster than expected. If something seems too good to be true, it probably is. (CUDA 4.0 or 4.1 with 1024 threads is typically trouble in CUDALucas, if I recall correctly.) That is an example where the probability of the first and second tests producing a false positive match may be 100%. More typical would be of order 10^-6 to 10^-12. The CUDALucas 2.06 (May 5, 2017) version has software traps for these error residues built in. There are other modes of error. The recent false positive by CUDALucas 2.05.1 resulted in the interim residue having value zero. I'm guessing that's some failure to copy an array of values. Don't run CUDALucas versions earlier than 2.06, and don't let your friends either.
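
A toy demonstration of that failure mode (Python, tiny exponent, just to show the arithmetic; not CUDALucas code):
Code:
p = 13
Mp = (1 << p) - 1
s = 4
for i in range(p - 2):
    s = s * s % Mp
    if i == 4:        # pretend the GPU returned all zeros here,
        s = 0         # before the "- 2" step of the iteration
    s = (s - 2) % Mp
    print(i, hex(s))
# Once the zero appears the residue becomes Mp - 2 (hex ...ffd); the next
# iteration gives (-2)^2 - 2 = 2, and every iteration after that stays at 2.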

Other applications have other characteristic error residues. Cllucas (now, I believe, largely supplanted by the faster, more reliable, and rapidly developing gpuowl) will sometimes produce ffffffff8000000, for example. Someone who wanted to use a bug such as the CUDALucas early-zero bug to fake finding a prime would be disappointed, as the error would be discovered early in the verification process.
In the blog area I was given, I've created application-specific reference threads for several of the popular GIMPS applications. Most of them have a post with a bug and wishlist tabulation attached, specific to that application.

https://www.mersenneforum.org/forumdisplay.php?f=154
It helps to know what to avoid and how to avoid it.
If you identify any issues that are not listed there yet, please PM me with details.
As such issues are identified, they might be fixable, or code to detect and guard against them could be added, if still of sufficient interest. (Fixing or trapping for CUDA 4.0 or 4.1 issues is not of much interest now, since many GPUs are running at CUDA 8 or above.)

It's common practice for the applications to keep more than one save file, and to be able to restart from one or the other if something is detected to have gone seriously wrong in the most recent stretch of a lengthy run, thereby perhaps saving most of the time already expended. Some users will run duplicate tests side by side on two sets of hardware, runs that take months, periodically comparing interim 64-bit residues that should match along the way.
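
For reference, the interim "64-bit residue" being compared is just the low 64 bits of the full residue, conventionally printed as 16 hex digits. A trivial sketch (not any particular program's code):
Code:
def res64(full_residue: int) -> str:
    """Low 64 bits of a residue, GIMPS-style, as 16 hex digits."""
    return format(full_residue & 0xFFFFFFFFFFFFFFFF, '016X')

# Two independent runs of the same exponent should report identical res64
# values at the same iteration counts; the first mismatch brackets the error.
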
I wrote about odds of matching wrong residues at https://www.mersenneforum.org/showpo...&postcount=147
The problem is that bad residues from software or hardware issues are not randomly distributed. If they were, we would not be patching, trapping, and searching databases for known application-specific bad residues as markers of which exponents to double- or triple-check. https://www.mersenneforum.org/showpo...&postcount=142 https://www.mersenneforum.org/showpo...&postcount=150
You might also find the strategic double check thread and the triple check thread interesting background.

2019-02-24, 02:20   #24
dcheuk (Jan 2019, Pittsburgh, PA)

Quote:
Originally Posted by kriesel View Post
The usual figure is about 2% per LL test. ...
I will be trying PRP on a machine (or machines) after the current LL tests are completed. All my machines, including GPUs, have passed at least one DC with matching residues ... more reassuring now. Sigh.

I noticed while running CUDALucas on my 2070/2080 that they were stuck in zero-residue loops; fortunately, using the 2.06 beta made the problem go away. I tried to talk my friends into contributing idle time to this project, but it seems I am not very good at persuading people. Well, at least the university pays for my electricity at the moment.

I guess that means I should be expecting some bad GPU LL test results, but I just won't know until someone double-checks them a couple of years later and finds out my machine(s) screwed up... I pretty frequently see messages like "error greater than 0.4xxx, trying again with the previous save file" and then "it seems like the problem went away, moving on." Maybe that has something to do with the FFT size and my current assignments being around 83-84 million? Hopefully that doesn't screw things up; I should probably run more double checks on my GPUs than on my CPUs.

I read your post explaining the possibility of matching wrong residues. Good thing we don’t have to do triple checks lol

Thanks for the detailed explanation and your time. It was very helpful.

2019-02-24, 15:42   #25
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by dcheuk View Post
I will be trying PRP on a machine (or machines) after the current LL tests are completed. ...
I'm glad to read that you are being a responsible participant and DC-qualifying your hardware, and checking for known-bad interim residues. Good!

Your RTX 20x0s would contribute more to the project through TF than through LL. Your kit, your call, of course. I rotate my GPUs through different assignment types. (Except for one that crashes the system it's in if running TF, probably due to power draw; it shuts the system down so that it stays down until I intervene.)

If the possible fame, fortune, contributing to new knowledge, and hardware reliability checking don't provide sufficient incentives for your friends, you could mention that some stolen laptops have been recovered due to GIMPS clients reporting in, and you'd be happy to guide them through setting it up.

If you have particular exponents' first tests that are questionable to you for some reason, feel free to post them in the double check thread. That would move up the DC schedule.
The "roundoff >0.40, trying again" is nothing to be very concerned about. It's the built-in error checking protecting the reliability of your run.

Don't sweat scattered bad LL residues. But take note if one piece of your hardware is producing more than one, or noticeably more than the usual ~2%, i.e., more than its share. Run memory tests at least annually. I've removed a couple of GPUs and non-GPU systems from service for unreliability.

We do sometimes do triple checks (and occasionally quadruple checks too), but the number needed is much reduced.

Re the information sharing, you're welcome. Two years ago I was a GPU beginner. I found lots of discussion threads, and less organized documentation than I expected to encounter. Periodically, when I explore something for myself or write in a thread, it turns into something lengthy and takes me to a greater depth of understanding than before. (An example of that is the trial factoring concepts post, which took days of steady effort to create.) I write it down and save it for my own use, as a defense against forgetting, and then, why not share? And in the sharing, I sometimes get back responses that identify misconceptions, omissions, nuances, etc. So I benefit too, rather directly. Indirectly, if my doing this unloads the primary code authors from questions or other distractions to any degree, it may also help bring software enhancements out sooner.
Have fun! And maybe someday you'll choose a way to give back too.

2019-02-24, 20:24   #26
ATH (Einyen, Dec 2003, Denmark)

The exponent in this thread is almost certainly not prime (>99.9% probability). An LL test has now been completed with no error codes, on EC2 hardware with ECC RAM:

https://mersenne.org/M90094031

2019-02-24, 22:56   #27
preda ("Mihai Preda", Apr 2015)

Quote:
Originally Posted by dcheuk View Post
I'm curious, what is the probability that a residue from an LL test is bad? Is it dependent on the software/hardware/exponent/FFT size/etc?
This has been discussed at some length in various places on this forum. The probability of error is roughly "up to 4%", and yes, it depends on all the factors listed, plus a few others such as temperature, power supply, etc.

Quote:
If the residue is not guaranteed to be 100% correct (well, I guess that's why we have double checks), what can we do overall to improve the accuracy (besides getting ECC memory and Quadro/Tesla graphics cards), and should we be expecting bad results with some frequency?
The solution is PRP with the Gerbicz error check, which practically guarantees no such errors. (The main remaining source of errors is "programmer error", i.e. software bugs.)
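
To make that concrete, here is a toy Python sketch of a base-3 PRP test on Mp with a Gerbicz-style check (an illustration of the principle only; gpuowl and prime95 differ in shifts, residue conventions, rollback handling, and of course the FFT arithmetic):
Code:
def prp3_gerbicz(p, B=20, check_every_blocks=5):
    """Base-3 PRP test of Mp = 2^p - 1 with a Gerbicz-style consistency check."""
    Mp = (1 << p) - 1
    u = 3                    # PRP chain: u <- u^2 (mod Mp), p times in total
    d = 3                    # checksum: product of u at the block boundaries
    full_blocks = p // B     # toy simplification: the ragged tail goes unchecked

    for block in range(1, full_blocks + 1):
        d_prev = d
        for _ in range(B):
            u = u * u % Mp                  # the real work
        d = d * u % Mp                      # one extra mulmod per block

        if block % check_every_blocks == 0 or block == full_blocks:
            # Verify 3 * d_prev^(2^B) == d (mod Mp). This single comparison
            # covers every squaring and checksum update made so far, because
            # any earlier corruption of u or d breaks the identity (except
            # with vanishing probability).
            check = d_prev
            for _ in range(B):
                check = check * check % Mp  # B extra squarings per verification
            if check * 3 % Mp != d:
                raise RuntimeError("Gerbicz check failed near iteration %d" % (block * B))

    for _ in range(p % B):                  # unchecked tail (real code covers it too)
        u = u * u % Mp

    # After p squarings u = 3^(2^p) mod Mp; Mp is a base-3 probable prime
    # exactly when u == 9, i.e. 3^(Mp+1) == 3^2 (mod Mp).
    return u == 9

print(prp3_gerbicz(521))    # M521 is a known Mersenne prime -> True
print(prp3_gerbicz(523))    # M523 is composite              -> False

With real block sizes (B on the order of a thousand, and a verification every million iterations or so) the extra work is a fraction of a percent, which is why the check is essentially free.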

Quote:
And how unlikely is it for both the original LL test and the double-check test of an exponent to return identical bad residues?
Extremely unlikely, unless the residue is one of a few special values, such as 0, where a coincidence of values is more likely. (But those special values are suspicious anyway.)

Quote:
I should probably read the Wikipedia page on FFTs or something to understand it, but I keep procrastinating. What prerequisites do I need to understand the mathematical concepts and proofs behind the FFT, assuming I have the standard algebra/analysis background of a grad student? Any recommendations for resources that introduce and prove this topic?

Thanks guys. I appreciate it.
The starting point is Crandall's IBDWT article ("Irrational Base Discrete Weighted Transform"). For FFTs, you may read "Matters Computational", https://www.jjj.de/fxt/fxtbook.pdf ; highly recommended.
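
To give a flavor of why FFTs show up at all: squaring a big number is a convolution of its digit vector, and a cyclic convolution of length n computes the product modulo base^n - 1 for free, because the wrapped-around terms are exactly the reduction. The IBDWT's weighted, variable-width digits extend that trick to 2^p - 1 with prime p. A toy Python/numpy sketch of the unweighted version (not the IBDWT itself):
Code:
import numpy as np

def square_mod(x, n=64, b=8):
    """Return x^2 mod 2^(n*b) - 1 via an FFT-based cyclic convolution."""
    N = (1 << (n * b)) - 1
    digits = [(x >> (b * i)) & ((1 << b) - 1) for i in range(n)]  # base-2^b digits
    f = np.fft.rfft(np.array(digits, dtype=float))
    conv = np.fft.irfft(f * f, n)          # cyclic convolution = reduction mod 2^(n*b)-1
    conv = np.rint(conv).astype(np.int64)  # round back to integers (roundoff must stay < 0.5)
    out, carry = 0, 0
    for i, c in enumerate(conv):           # propagate carries
        t = int(c) + carry
        out += (t & ((1 << b) - 1)) << (b * i)
        carry = t >> b
    return (out + carry) % N               # leftover carry wraps around, since 2^(n*b) = 1 (mod N)

x = 123456789123456789123456789
print(square_mod(x) == x * x % ((1 << 512) - 1))   # True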

2019-02-25, 00:51   #28
dcheuk (Jan 2019, Pittsburgh, PA)

Quote:
Originally Posted by kriesel View Post
If the possible fame, fortune, contributing to new knowledge, and hardware reliability checking don't provide sufficient incentives for your friends, you could mention that some stolen laptops have been recovered due to GIMPS clients reporting in, and you'd be happy to guide them through setting it up.
I find that kind of interesting... I assume prime95 reports back with an IP address because it was set to run on startup, but it needs an internet connection. Assuming there is no user/password (I think you need to log in before being able to connect to the internet, unless it's wired and unprotected), the user has to connect to WiFi first. Won't potential thieves notice that the computer is running at 100% (or insert your percentage here)?

Also, I am still trying to configure my computer so that it runs CUDALucas on startup BEFORE login, since Windows 10 now forces me to restart for updates except on Pro/Education. Someone is probably suing Microsoft right now for forcing users to update and restart.

Quote:
Originally Posted by kriesel View Post
(An example of that is the trial factoring concepts post, which took days of steady effort to create.)
I am going to find it and read it; I have always been curious about the math/algorithms behind testing whether a prime divides a super-large number. I was amazed at how someone was able to create a sequence and an algorithm to test a Mersenne number when I read the proof of the LL test. Now I am eager to learn why some primes divide 2^p - 1.
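
From what I have picked up so far, the core idea seems small enough to sketch: any factor q of M(p), with p prime, must have the form q = 2kp + 1 and satisfy q ≡ 1 or 7 (mod 8), so trial factoring only has to try a thin slice of candidates up to the chosen bit level, testing each with one modular exponentiation. A rough Python sketch of my understanding (nothing like what mfaktc actually does on a GPU, which also sieves the candidates):
Code:
def tf(p, max_bits):
    """Return a factor of M(p) = 2^p - 1 below 2^max_bits, or None."""
    k = 1
    while True:
        q = 2 * k * p + 1
        if q.bit_length() > max_bits:
            return None
        if q % 8 in (1, 7) and pow(2, p, q) == 1:   # q divides 2^p - 1
            return q
        k += 1

print(tf(11, 16))   # -> 23, the smallest factor of M11
print(tf(29, 16))   # -> 233
print(tf(31, 20))   # -> None; M31 is prime, so there is nothing to find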

Thanks again!

2019-02-25, 00:57   #29
dcheuk (Jan 2019, Pittsburgh, PA)

Quote:
Originally Posted by preda View Post
This has been discussed at some length in various places on this forum. The probability of error is roughly "up to 4%", and yes, it depends on all the factors listed, plus a few others such as temperature, power supply, etc.
Sorry, I guess I should have used the search function more extensively.

I have two machines, an i7-8700 and an i7-9700, both with a 2080 and a 650 W PSU. It seems my computers don't like it when I run both p95 and mfaktc at the same time (the GPU immediately spikes up to 80-85 °C). However, it was completely fine when I ran p95 with fewer cores alongside CUDALucas. Ironically, my computer's GUI lags like crazy when I run CUDALucas, but doesn't lag at all while running mfaktc.

Quote:
Originally Posted by preda View Post
The solution is PRP with the Gerbicz error check, which practically guarantees no such errors. (The main remaining source of errors is "programmer error", i.e. software bugs.)
I'm wondering if it takes more time to run compared to LL. I guess that if it practically guarantees no errors other than bugs, it is worth it, as long as it doesn't take something like 100x longer.

Thanks for your answer. Much appreciated!

2019-02-25, 09:05   #30
retina ("The unspeakable one", Jun 2006, My evil lair)

Quote:
Originally Posted by dcheuk View Post
I'm wondering if it takes more time to run compared to LL. I guess that if it practically guarantees no errors other than bugs, it is worth it, as long as it doesn't take something like 100x longer.
The runtime is slightly more, on the order of about 1-2% or so. But another consideration is that the PRP test is not conclusive: it can give only high confidence, not proof, of primality. Overall, though, it is worth running PRP instead of LL, since more tests will complete with a correct result, leading to fewer triple checks and saving time in the long term. And the slight chance of a false positive is really such a tiny chance that I would not expect it ever to happen, even if this project continues for millions of years.
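
For a concrete sense of "high confidence, not proof": a plain base-3 Fermat test can be fooled by rare composites, for example:
Code:
print(pow(3, 90, 91))   # 1, even though 91 = 7 * 13 is composite

No composite Mersenne number is known to pass the PRP test, but the test is inherently probabilistic, which is why a prime report still gets independently verified.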

2019-02-25, 14:31   #31
Prime95 (Aug 2002, Yeehaw, FL)

Quote:
Originally Posted by retina View Post
The runtime is slightly more, on the order of about 1-2% or so.
Doing a Gerbicz check every 1000000 iterations (the default) is an overhead of 0.2%. Plus you save on running the Jacobi check every 12 hours.
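
For reference, one way numbers like that come about (the exact block sizes are implementation details): with a checksum-product update every B iterations and a full verification costing about B extra squarings every V iterations, the overhead is roughly 1/B + B/V; with B = 1000 and V = 1,000,000 that is 0.1% + 0.1% = 0.2%.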

2019-02-25, 14:34   #32
retina ("The unspeakable one", Jun 2006, My evil lair)

Quote:
Originally Posted by Prime95 View Post
Doing a Gerbicz check every 1000000 iterations (the default) is an overhead of 0.2%. Plus you save on running the Jacobi check every 12 hours.
Good. So even better than I stated.

2019-02-26, 03:25   #33
kladner ("Kieren", Jul 2011, In My Own Galaxy!)

Quote:
Originally Posted by dcheuk View Post
Sorry, I guess I should have used the search function more extensively.

.....
The forum search function leaves a bit to be desired. (Sorry xyzzy!)

On Google, at least, try 'site:mersenneforum.org [search terms]'.
Other search engines may also support this; I don't know.