mersenneforum.org  

Old 2017-07-26, 21:11   #617
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,733 Posts
CUDALucas 2.05.1 error zero residues

Quote:
Originally Posted by Madpoo View Post
Yeah... and normally I'd look at these prime reports and see if it could be legit (they never are, well, except that one from Curtis). But when someone reports 3 x 100M digit primes all at once, you do have to stop and think maybe they have a bug.

What amazes me is when people paste these results into the manual report form and they don't stop to think "wow, my exponent in the 332M range completed in just a few days and said it's prime". I guess they don't really understand how long it should take so when it finishes mysteriously fast they don't question it.

By the way, those 3 (and a bonus 4th one) were all done by CUDAlucas 2.05.1 which has been turning in buggy false prime reports for a while. Not to say that version is itself buggy, but something is going on where it gets into a funky state and breezes through the rest of the iterations super fast and spits out a residue of zero.
Residues of zero, or of some other value repeating all the way through an LL test, are something I could cause at will in 2.05.1. If memory serves: take a Compute 2.0 or 2.1 card and set square threads=1024. The result is 0xfffffffffffffffd, repeatedly and absurdly fast. Note the iterations are also inexplicably fast compared with fft benchmarking done with a different square threads value in the ini file, before thread benchmarking. That burned me early on and cost a few days, but I was doing manual reporting and saw it.

The May 5 2.06 beta build incorporates the bad-residue checks and halts the program. If you can identify and contact the users, I suggest you recommend they switch versions. Perhaps also refer them to this post for background.

FFT benchmarking and threads benchmarking screen output should be redirected to a file. Then it can be examined for signs of trouble like this:
fft = 4K, ave time = 0.0760 ms, square: 32, splice: 128
fft = 4K, ave time = 0.0759 ms, square: 64, splice: 128
fft = 4K, ave time = 0.0760 ms, square: 128, splice: 128
fft = 4K, ave time = 0.0763 ms, square: 256, splice: 128
fft = 4K, ave time = 0.0788 ms, square: 512, splice: 128
fft = 4K, ave time = 0.0273 ms, square: 1024, splice: 128
fft = 4K, ave time = 0.0279 ms, square: 1024, splice: 32
fft = 4K, ave time = 0.0270 ms, square: 1024, splice: 64
fft = 4K, ave time = 0.0270 ms, square: 1024, splice: 128
fft = 4K, ave time = 0.0273 ms, square: 1024, splice: 256
fft = 4K, ave time = 0.0271 ms, square: 1024, splice: 512
fft = 4K, ave time = 0.0274 ms, square: 1024, splice: 1024
fft = 4K, min time = 0.0270 ms, square: 1024, splice: 128

The square 1024 timings above, barely a third of the others, are signs of trouble. Normal variation is a few percent, not two- or three-fold. Some cards need to be benchmarked without trying 1024 threads, or without 32 threads, for the resulting fft files and threads files to be valid and reliable. That's why the m argument of the -threadbench s e i m option provides bit masks to skip them.
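Once the benchmark output is in a file, a check like the one described above can be scripted. What follows is a minimal Python sketch, not part of CUDALucas; the function name and the 1.5x tolerance are assumptions picked for illustration. It takes the best average time for each square threads value and flags any value that is much faster than the median across square values at the same fft length.

```python
import re
from statistics import median

# Matches benchmark lines such as:
# fft = 4K, ave time = 0.0760 ms, square: 32, splice: 128
LINE_RE = re.compile(
    r"fft = (\d+)K, ave time = ([\d.]+) ms, square: (\d+), splice: (\d+)"
)

def flag_fast_square_values(log_text, max_speedup=1.5):
    """Return (fft_k, square, ms) for square values that look too fast.

    max_speedup is an assumed tolerance, not a CUDALucas setting.
    """
    best = {}  # (fft_k, square) -> lowest average time seen, in ms
    for m in LINE_RE.finditer(log_text):
        key = (int(m.group(1)), int(m.group(3)))
        ms = float(m.group(2))
        best[key] = min(ms, best.get(key, ms))

    suspects = []
    for fft_k in sorted({k for k, _ in best}):
        per_square = {sq: ms for (k, sq), ms in best.items() if k == fft_k}
        typical = median(per_square.values())
        for sq, ms in sorted(per_square.items()):
            # Flag a square value when the typical time is far above it.
            if typical > max_speedup * ms:
                suspects.append((fft_k, sq, ms))
    return suspects
```

Run on the 4K excerpt above, this flags only square 1024. Grouping by square value first matters here: seven of the twelve raw lines are the bad 1024 timings, so a median over raw lines would itself be the bad value.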

I've also noted some CUDALucas quirks with a recently acquired GTX1060. Mostly it won't run the 32-bit versions of CUDALucas 2.06beta May 5 build, crashing instead of performing an fft benchmark, threads benchmark, residue test, or ordinary LL test. But when it does run win32 versions, it can produce all zeros for the residue check test and so fail every test. I need to look at this some more.

Regardless of card, CUDALucas users should take care to avoid low CUDA levels. I've seen very bad benchmark results with older CUDA versions such as 4.0 and 4.1: fft benchmark or threads benchmark iteration times that are a small multiple lower, or in some cases orders of magnitude lower, than is plausible. If it looks too good to be true, it probably isn't true.

I have proposed, but no one has yet attempted, modifying CUDALucas so that anomalously low fft or threads benchmark values are detected and suppressed, or at least warned about. There would need to be some tolerance bounds around a fitted function of roughly k l^1.03, where l is the fft length. Such anomalies are particularly pernicious in threads benchmarking, since the minimum value per fft length is sought: if any of the twelve or so tested combinations goes bad, so that its iteration time is too low, it supersedes the other choices for that fft length. Then, when the threads file is being output, it also suppresses output of values at shorter, non-power-of-two fft lengths that may be valid but exceed the bad low iteration time.
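The proposed tolerance check could look something like the following Python sketch. It is only an illustration of the idea, not CUDALucas code: the function name, the 0.5 tolerance, and the use of a median to estimate k are all assumptions. Each timing is divided by l^1.03, the median of those scaled values stands in for k, and anything well below k is flagged.

```python
from statistics import median

def flag_low_timings(timings, exponent=1.03, tolerance=0.5):
    """timings: dict mapping fft length (in K) to iteration time (ms).

    Returns the fft lengths whose time falls below tolerance * k * l**exponent,
    where k is estimated as the median of t / l**exponent.
    exponent follows the k l^1.03 fit mentioned above; tolerance is assumed.
    """
    scaled = {l: t / l ** exponent for l, t in timings.items()}
    k = median(scaled.values())
    return sorted(l for l, s in scaled.items() if s < tolerance * k)
```

A median-based estimate of k only works while most of the timings are good. In a file where nearly every line is anomalously low, k itself would be wrong, so a real implementation would need a better anchor.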

Here's an example that occurred on a GTX1070 with CUDA 4.1. Only the 4096k line is believed to come from correct operation.
4.1 Win32: all timings above 4096k are anomalously low, by large multiples.
Contiguous excerpt from threads file:
4096 32 128 4.48234
8192 32 256 0.73438
16384 32 32 1.46306
32768 128 32 2.93054
65536 128 32 0.04897

People running 100-million-digit exponents will need around 18816k or longer fft lengths and so are more likely to run benchmarks all the way to 65536k. I've seen frequent trouble in exhaustive benchmarking all the way from min to max (1k-65536k). It can really fall off a cliff at 65536k, or be much more subtle.

4.0win32 on gtx1060
16384 294471259 35.2237 too small
32768 580225813 70.4184 big enough
65536 1143276383 15.8164 faster, might use this if they didn't know it was trouble.

Iteration times are supposed to increase from the short fft lengths to the long fft lengths, with some fluctuation above the approximately linear trend line between the power-of-two lengths. If they don't, something's wrong, and the fft file, threads file, and program output are highly suspect. The exception is at very short fft lengths, under ~100k, which most people won't bother to run; there, what appears to be variation in startup delay or other overhead shows up. The cliff at the high end is clearly bad, and is responsible for removal of many valid useful non-power-of-two fft values from the resulting fft file.
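The "times should increase with fft length" rule is easy to test mechanically. Here is a small Python sketch under stated assumptions: the caller has already extracted (fft length in K, iteration time in ms) pairs from an fft or threads file, the function name is made up, and only power-of-two lengths are compared because lengths in between are expected to fluctuate upward.

```python
def non_monotonic_powers_of_two(rows):
    """rows: iterable of (fft_k, ms) pairs.

    Returns the power-of-two fft lengths whose time is lower than the
    slowest time already seen at a shorter power-of-two length.
    """
    suspects = []
    slowest_so_far = 0.0
    for fft_k, ms in sorted(rows):
        if fft_k & (fft_k - 1):  # not a power of two, skip
            continue
        if ms < slowest_so_far:
            suspects.append(fft_k)
        else:
            slowest_so_far = ms
    return suspects
```

Applied to the GTX1070 CUDA 4.1 threads excerpt above, it reports every length after 4096k, and on the GTX1060 4.0 win32 lines it reports 65536k. It trusts the shortest length it is given, so it should be started above the ~100k region where overhead dominates.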
Attached image: fft length iteration time example gtx1070 CUDA4.1.png

Last fiddled with by kriesel on 2017-07-26 at 21:18
Old 2017-07-26, 21:31   #618
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

2·4,663 Posts

Quote:
Originally Posted by kriesel View Post
The cliff at the high end is clearly bad, and is responsible for removal of many valid useful non-power-of-two fft values from the resulting fft file.
And you raise a fire alarm. Claim wolf, and then you are found asleep.

This becomes tiring. Are you familiar with the concept of temporal sampling?

Last fiddled with by chalsall on 2017-07-26 at 21:39
Old 2017-07-26, 21:51   #619
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4733₁₀ Posts
credit for bad fast runs

Quote:
Originally Posted by Madpoo View Post
Yeah... and normally I'd look at these prime reports and see if it could be legit (they never are, well, except that one from Curtis). But when someone reports 3 x 100M digit primes all at once, you do have to stop and think maybe they have a bug.

What amazes me is when people paste these results into the manual report form and they don't stop to think "wow, my exponent in the 332M range completed in just a few days and said it's prime". I guess they don't really understand how long it should take so when it finishes mysteriously fast they don't question it.

By the way, those 3 (and a bonus 4th one) were all done by CUDAlucas 2.05.1 which has been turning in buggy false prime reports for a while. Not to say that version is itself buggy, but something is going on where it gets into a funky state and breezes through the rest of the iterations super fast and spits out a residue of zero.
It's possibly a result of a misconfiguration issue that's not hard to stumble into. (Benchmark without masking out the parameter choices that are bad for the specific GPU, don't look too closely at the results, and be in a hurry or too optimistic rather than skeptical. The 1024-threads issue with some GPU types has been known for several years, and a partial rollout of a code change to catch the error occurred in 2016.)

Are such erroneous and anomalously fast runs credited as the number of GHzDays it ought to have taken? Perhaps if it isn't too much trouble, certain known-bad residues from causes known for years ought earn less or no credit. Or something.
Old 2017-07-26, 22:13   #620
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

246E₁₆ Posts

Quote:
Originally Posted by kriesel View Post
Perhaps if it isn't too much trouble, certain known-bad residues from causes known for years ought earn less or no credit. Or something.
I'm going to presume that was meant to be funny (it was)...

Do we flash you so you forget everything?
Old 2017-07-27, 00:40   #621
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,733 Posts
65536k iteration time cliff sometimes

Quote:
Originally Posted by chalsall View Post
And you raise a fire alarm. Claim wolf, and then you are found asleep.

This becomes tiring. Are you familiar with the concept of temporal sampling?
I must confess I am baffled by your reply.

Maybe I needed to be clearer. The cliff I referred to is the 65536k timing more than a factor of 100 faster than the next shorter fft length. (No objections to natural variation in combination with timing inaccuracy putting some fuzz on the plot, or to larger percentage differences at the low lengths. Granted, the measurement of the fft length iteration time is subject to certain experimental errors. The duration of a specific fft length iteration is a scalar though, not a time- or space-varying signal. Its sensitivity to digitization error, and perhaps to different operand inputs, is controllable by making multiple runs and averaging the duration, which CUDALucas has built in.)

Maybe I don't understand the point you're trying to make above. Try again?

Several GPU/CUDA library/bitness combinations appear to produce a correct fft length iteration time at 65536k fft length. Some do not. I suspect that is a big red flag of a bug and incorrect results for that too-fast 65536k computation.

Based in part on nearly 50 years of experience with software, and nearly 40 years of designing, building and using scientific instrumentation and robotics, including realtime control programming in assembler, this looks to me like a software issue, not a sampling issue. It is specific to some software combinations, which were run the same way on the same hardware and timed the same way too.

One of the things an engineer learns early on is when to distrust experimental data. Outliers are suspicious and get extra attention. Durations for 65536k that are (much) less than 134 msec seem like outliers to me. From an assortment of fft benchmarks tagged by GPU, CUDALucas version, CUDA library, & driver:
---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 4.0 X64 N378.66.TXT
65536 1143276383 15.8225

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 4.1 X64 N378.66.TXT
65536 1143276383 0.2968

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 4.2 X64 N378.66.TXT
65536 1143276383 135.7008

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 5.0 X64 N378.66.TXT
65536 1143276383 135.6211

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 5.5 X64 N378.66.TXT
65536 1143276383 135.0425

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 6.0 X64 N378.66.TXT
65536 1143276383 134.6015

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 6.5 X64 N378.66.TXT
65536 1143276383 135.7659

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 7.0 X64 N378.66.TXT
65536 1143276383 134.4818

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 7.5 X64 N378.66.TXT
65536 1143276383 137.3554

---------- GEFORCE GTX 1060 3GB FFT 2.06BETA 8.0 X64 N378.66.TXT
65536 1143276383 136.9777
Old 2017-07-27, 01:07   #622
ewmayer
2ω=0
 
 
Sep 2002
República de California

3²·1,093 Posts

Quote:
Originally Posted by kriesel View Post
Maybe I don't understand the point you're trying to make above. Try again?
"The point you're trying to make" assumes facts not in evidence. When Chris is bored at his job or whatever, he likes to troll his fellow forumites and beat up on n00bs and known easy targets just for sport. This leads to fairly regular intervals of time-outs, a.k.a. "toadification" in the parlance of the mods. If you encounter such a post - usually of a fairly well-established 1 or 2-line length and conveying no coherent point, best just to ignore it.
Old 2017-07-27, 01:13   #623
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

2·4,663 Posts

Quote:
Originally Posted by kriesel View Post
I must confess I am baffled by your reply.
Yeah... Sorry, most don't get my humour...

Quote:
Originally Posted by kriesel View Post
Maybe I needed to be clearer. The cliff I referred to is the 65536k timing more than a factor of 100 faster than the next shorter fft length.
I got that. And it's not really that much of a cliff, but rather a dip...

You didn't get the associated joke (it involves powers of two)....
Old 2017-07-27, 01:19   #624
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,733 Posts
no joke

Quote:
Originally Posted by chalsall View Post
I'm going to presume that was meant to be funny (it was)...

Do we flash you so you forget everything?
Erasure has been tried repeatedly, and didn't take.

Suppose a large-exponent result, one that would represent a large amount of computation if performed correctly, has already attracted attention because it was reported back too fast to be correct and signaled a new prime. If one cares about the quality of the stats and database, I think it would be up to the mersenne.org board and the PrimeNet administrator whether the excessive computing credit gets adjusted down, or an obviously bad and very recent special residue gets cleared. Perhaps the policy is that everything is added to the database and never deleted. I don't know. (I know that my lifetime stats are seriously underrepresented already. That has nothing to do with "Erasure" above.) Falsely reported primes, no matter how innocently they occur, seem like a special case meriting more attention than the ordinary composite residue in this project.
Old 2017-07-27, 15:52   #625
chris2be8
 
 
Sep 2009

2·7·139 Posts

Presumably all the benchmarks for a given exponent are doing the same calculation with different FFT lengths etc., so they should all get the same result. So rejecting any that don't get the correct result should fix the issue.

All you would need is a list of residues generated by each benchmark on known good hardware to compare the results with.
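To make the idea concrete, here is a rough Python sketch of a reference check. It is not how CUDALucas does it: the function names are invented, and the set of suspect values is an assumption built from the two values reported earlier in this thread (zero and 0xfffffffffffffffd) plus 2. The reference is a plain big-integer Lucas-Lehmer loop, which gives the same residue regardless of FFT length, so any benchmark run on the same exponent and iteration count can be compared against it. For p of 64 or more, 0xfffffffffffffffd is what M_p - 2 looks like in its low 64 bits.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

# Assumed list of interim values worth rejecting outright.
SUSPECT_INTERIM = {0x0, 0x2, 0xFFFFFFFFFFFFFFFD}

def ll_res64(p, iterations):
    """Low 64 bits of the Lucas-Lehmer value for M_p after `iterations` steps.

    Uses s = 4 as the start and s -> s*s - 2 mod (2**p - 1).
    A full test runs p - 2 iterations.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(iterations):
        s = (s * s - 2) % m
    return s & MASK64

def benchmark_residue_ok(p, iterations, reported_res64):
    """True if a reported interim residue is plausible and matches the reference."""
    if iterations < p - 2 and reported_res64 in SUSPECT_INTERIM:
        return False
    return reported_res64 == ll_res64(p, iterations)
```

The pure-Python reference is slow for large exponents, so in practice the known-good residues would be computed once and stored in a table.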

Or have I misunderstood how it works?

Chris
Old 2017-07-27, 20:35   #626
chalsall
If I May
 
 
"Chris Halsall"
Sep 2002
Barbados

2×4,663 Posts

Quote:
Originally Posted by chris2be8 View Post
Or have I misunderstood how it works?
Quite possibly.

In the optimal case, different candidates would run on different hardware and software, and converge on a very similar value. Ideally 0.

Last fiddled with by chalsall on 2017-07-27 at 20:37
Old 2017-07-27, 22:51   #627
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4,733 Posts
CUDALucas 2.05.1

Quote:
Originally Posted by Madpoo View Post
Yeah... and normally I'd look at these prime reports and see if it could be legit (they never are, well, except that one from Curtis). But when someone reports 3 x 100M digit primes all at once, you do have to stop and think maybe they have a bug.

What amazes me is when people paste these results into the manual report form and they don't stop to think "wow, my exponent in the 332M range completed in just a few days and said it's prime". I guess they don't really understand how long it should take so when it finishes mysteriously fast they don't question it.

By the way, those 3 (and a bonus 4th one) were all done by CUDAlucas 2.05.1 which has been turning in buggy false prime reports for a while. Not to say that version is itself buggy, but something is going on where it gets into a funky state and breezes through the rest of the iterations super fast and spits out a residue of zero.
CUDALucas 2.05.1 has multiple reasons to be a bit suspect on large exponents. So does 2.06beta, to a lesser extent. This is not intended to be critical or unappreciative of anyone's hard work as a volunteer code developer or tester. It's hard to get right or even sorta nearly right usually.

The Windows V2.05.1 executables currently available for download still don't have the illegal interim residue checks built in (except perhaps the CUDA 8 x64 executable), although the 2.06beta set does. Also, I've verified by testing that it's easy to hit various issues that quickly generate all-zero residues.

Additionally, when the extended residue check self test is run in CUDALucas 2.06beta, the largest fft length used is 8192k. (I ran a set for every available x64 CUDA version of the May 2.06beta on Windows, and none exceeded 8192k in the residue test.) That's much shorter than the ~18816k needed for 100M digits. There's also a behavior where it picks a seemingly too-short fft length to run large exponents on. I vaguely recall reading of someone else seeing that happen too, on 100M exponents. It occurs at the very top end; an exponent that ought to get 65536k drops to 57600k, then generates all-zero residues too fast.
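One way to see why 57600k looks too short for that exponent is to compare bits per FFT word. The Python sketch below rests on two assumptions that should be checked against the program itself: that the middle column of the fft file lines quoted earlier in the thread (for example 65536 1143276383) is the largest exponent allowed at that length, and that the allowed bits per word do not rise as the fft length grows. The function names are made up.

```python
def bits_per_word(exponent, fft_k):
    """Average exponent bits carried per FFT element; fft_k is in units of 1024."""
    return exponent / (fft_k * 1024)

def fft_length_too_short(exponent, fft_k, fft_rows):
    """fft_rows: list of (fft_k, max_exponent) pairs taken from an fft file.

    Uses the nearest listed length at or below fft_k as an upper bound on
    the bits per word that fft_k could safely carry (an assumption).
    """
    shorter = [(l, e) for l, e in fft_rows if l <= fft_k]
    if not shorter:
        return False  # nothing to compare against
    ref_l, ref_e = max(shorter)
    return bits_per_word(exponent, fft_k) > bits_per_word(ref_e, ref_l)
```

With the GTX1060 rows from earlier in the thread, 1143276383 at 65536k works out to about 17.0 bits per word, while the same exponent at 57600k needs about 19.4, well above the roughly 17.3 listed for 32768k.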

My impression is that CUDALucas requires more user attention to detail than prime95. I'm taking the details of my months of testing over to the CUDALucas thread.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.