mersenneforum.org  

2020-04-23, 16:59   #1
LaurV (Romulan Interpreter)

Bitching about the owl (was: 332M to 83 TF Request)

We can't do any TF right now (no resources) and we somehow just managed to get assigned M332381057 for a PRP First Test. We didn't realize that this exponent is under-TF-ed (currently, to 77) until gpuOwl started to do P-1 instead of the expected PRP. So we are doing the P-1 right now, which will take another hour or two, then we will stop until we are able to TF it to at least 83 bits, which will be in a week or two.

The alternative is to unreserve it and get an assignment that is at least properly TF-ed.

Of course, unless somebody is willing to do the TF job for us (77 to 83).
Any takers?
Thanks in advance.

2020-04-23, 17:41   #2
kriesel

Quote:
Originally Posted by LaurV
We can't do any TF right now (no resources) and we somehow just managed to get assigned M332381057 for a PRP First Test. We didn't realize that this exponent is under-TF-ed (currently, to 77) until gpuOwl started to do P-1 instead of the expected PRP. So we are doing the P-1 right now, which will take another hour or two, then we will stop until we are able to TF it to at least 83 bits, which will be in a week or two.
https://www.mersenne.ca/exponent/332381057 says GPU72 TF limit is 81 bits. That I could do in half a day. PM me whether your P-1 of it finds a factor or not. Hopefully it is doing P-1 to full PrimeNet B1 and B2.
2020-04-23, 18:03   #3
LaurV

Quote:
Originally Posted by kriesel
Hopefully it is doing P-1 to full PrimeNet B1 and B2.
Aaaa... nope. Good point. I just restarted with proper limits. The TF limit is card-dependent: a 2080 Ti should go to 83, lower cards even to 84, Titans should stop at 81, and a Radeon VII should stop at 80. I said 83, but any bit over 77 should help. But don't start yet; I will let you know if the TF is still needed. Thanks.

Edit: I mean tomorrow, or on the weekend; it is 1:20 AM here now, I am going to sleep...

2020-04-25, 05:14   #4
LaurV

Timeout.

Break.

gpuOwl is not usable for P-1, with the cards we have and/or the settings we have... Full stop.

More research is needed.

We also believe we uncovered a bug in it, but that may be coincidental.

This job is indefinitely postponed. Thread moved to personal blog.

TL;DR:
(to be written in the following 30 minutes or so)

2020-04-25, 06:37   #5
LaurV

So, last weekend we decided to switch one wheelbarrow from "a lot of TF" to "a lot of LL/PRP".

These went out:
[Attachment: forzes.jpg]

(the foot is intentional, we can prove it is ours )

This is the bare-barrow:
[Attachment: mobo.jpg]

These went in:
[Attachment: sevens.jpg]

The "sevens" were our "self inflicted", "Christmas present", we got them end of November, from B&H, that is also where we got the mobo from, many years ago. This is a wonderful mobo, about which I posted in the past. But due to our "Australian detour" (from beginning of December to end of January), about which we also posted in the past, we had a lot of work stacked on, so we didn't really have time for playing God with our computers. Last week, after a lot of TF work with GPU72, and other adventures, we desperately needed a cleaning of the water fins of the CPU, as it was starting getting to the junction limit when both mfaktc and P95 were munching. Therefore, we took the toy apart and decided it's a good time to try the "sevens", which were still unboxed (well, we had a look to what's in the box when we got them, but that's all).

They went successfully, and amazingly fast, through 15 or 20 LLDCs each with gpuOwl (we could not report all of them: they were all our own former work which had not yet been verified, so while we were "DC-ing our own work", some exponents got reserved by "reliable" third parties, and we didn't want to poach, or they had already been DC-ed, so we only reported the results for those that were unreserved or reserved by "unreliable" anonymous users; but that is another story, and we know Madpoo will not resist the temptation to TC those exponents). Of course, we were not clever enough to handle the installation of the new owl smoothly, but after some help from Mihai and the forum we succeeded in installing and running it; we also had to erase all the Nvidia drivers, install the AMD ones, etc. The whole ordeal.

There was only a single mismatch in all those DCs, for which we did a TC rerun, and everything turned out fine (the initial residue was right, our current DC was wrong). All the other DCs matched. These cards are monsters for LL/PRP/DC. With proper cooling, they can go through one 55M DC in about six to seven hours. One error in 20 tests or so is what I would call reliable for a gaming card, and "very VERY fast" for an FFT implementation. Well done!

But this is where the praise stops.

We didn't like the fact that the checkpoint history is not retained, and there seems to be no way to tell the program to keep the history. We have to make a batch that checks every 30 minutes or so and, if there is a gpuowl.ll file in the folder, renames it to gpuowl.001.ll, then 002, 003, etc. Then, in case of mismatches (which are properly recorded in the logs!), we can resume from the proper checkpoint and avoid wasting the time of rerunning both tests from scratch. Otherwise a lot of resources are lost, and the toy will not be suitable for "long jobs", EVER.
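
A minimal sketch of such a rotation script, written here in Python rather than as a Windows batch file; the checkpoint name gpuowl.ll and the work folder are taken from the description above, and copying (rather than renaming) is used so the running instance keeps its own file untouched:
[CODE]
# Rotate gpuOwl checkpoints so a history is kept (a sketch, not a gpuOwl feature).
# Assumes the checkpoint shows up as "gpuowl.ll" in the work folder, as described above.
import shutil
import time
from pathlib import Path

WORK_DIR = Path(".")        # gpuOwl work folder; adjust as needed
CKPT_NAME = "gpuowl.ll"     # checkpoint file name used in this post
INTERVAL = 30 * 60          # seconds between checks (~30 minutes)

def next_index(work_dir: Path) -> int:
    """Return the next free NNN for gpuowl.NNN.ll."""
    used = [int(p.name.split(".")[1]) for p in work_dir.glob("gpuowl.[0-9][0-9][0-9].ll")]
    return max(used, default=0) + 1

while True:
    ckpt = WORK_DIR / CKPT_NAME
    if ckpt.exists():
        dest = WORK_DIR / f"gpuowl.{next_index(WORK_DIR):03d}.ll"
        shutil.copy2(ckpt, dest)   # copy, don't move, so the run can continue undisturbed
        print(f"archived {ckpt} -> {dest}")
    time.sleep(INTERVAL)
[/CODE]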

So, back to the story: drunk with the success we had with the DCs, we decided to "go big" and reserved the 332M exponent in question. We switched to PRP, which we expected to save us from a full re-run (due to the Gerbicz Check - note the title case). But we still kept running the same test on both cards, in parallel, with intermediate logs that we can check to see whether things are still on track.

This was the plan. Finding that the runs differ somewhere in the middle, with no Gerbicz error signaled, would have made the news.

However, we didn't get that far. We didn't notice that the reserved exponent was not TF-ed enough until gpuOwl started to do P-1 on it instead of the PRP that we expected. We said WTF?

Well, actually, wait a moment: this is a good point for The Owl, that it didn't let us spend ages PRP-ing an exponent for which a factor might have been found much faster. White ball for the owl.

But on the other hand, we are now STUCK with the P-1, and we are quite unsatisfied with the current version of gpuOwl when it comes to P-1. Sorry, Mihai. I commend you for the work you invested in this toy, to make it fast, etc., and I know you have that genius spark and you are a hardworking guy, but the owl is far away from being robust or reliable, or even useful, for long jobs.

Short runs, which you can repeat in case of failure, yeah, we are good. Maybe that was the goal. But long runs, no.

We repeated the P-1 three times, each time on both cards, and every time the results differed. We stopped each run as soon as we saw the difference, but of course that was sometimes hours after it happened, or even when the run was already in stage 2 (where, by the way, THERE IS NO RESIDUE OUTPUT!), and the "correct" checkpoints were nowhere to be found (they had long been overwritten by the new, incorrect ones).

Now, we ended up with a total of 6 (partial) runs (on the two cards together), with 6 (partial) log files, all 6 different. One differs from the others at iteration ~680k, and another differs from the remaining 4 at about iteration 1.7M (out of a total of about 6M iterations for a P-1 with a ~5% chance of finding a factor). These two runs are, no doubt, wrong, because the other 4 agree on the residues up to a higher iteration count. Moreover, the two wrong runs came from the same card, so that card seems to be... not as reliable as the other. Or it was just unlucky this time and hit the error sooner. All video cards are prone to errors unless you buy specially dedicated GPGPUs (we had this discussion in the past, and I explained why, from the point of view of a guy working in electronic manufacturing/design).
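
A sketch of the kind of cross-check described above - finding the first iteration at which several parallel runs stop agreeing - assuming each log line contains an iteration count and a 16-hex-digit interim residue (the exact gpuOwl log format may differ, so the regex is an assumption to adapt):
[CODE]
# Compare interim residues from several run logs and report the first divergence.
# The "<iteration> ... <res64>" line format is assumed, not taken from gpuOwl's docs.
import re
import sys

LINE = re.compile(r"\b(\d{4,})\b.*\b([0-9a-fA-F]{16})\b")

def residues(path):
    """Map iteration -> res64 for one log file."""
    out = {}
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m:
                out[int(m.group(1))] = m.group(2).lower()
    return out

def first_divergence(paths):
    runs = [residues(p) for p in paths]
    common = sorted(set.intersection(*(set(r) for r in runs)))
    for it in common:
        if len({r[it] for r in runs}) > 1:
            return it, [r[it] for r in runs]
    return None, None

if __name__ == "__main__":
    it, vals = first_divergence(sys.argv[1:])
    print("all logs agree on the common iterations" if it is None
          else f"first divergence at iteration {it}: {vals}")
[/CODE]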

So what, you would say: maybe the other card is also wrong, or you pushed it too hard.

Well... that MAY BE. But my concern is that I had to run three times, on two cards, FROM SCRATCH. Because there is no history and there are no checkpoints.

But NO, that CANNOT BE. Because the other 4 files, of which 3 were run on the same card at different times, and the fourth came from a different card - and here is why I assume there is a bug in the P-1 code - all start differing at iteration 2.71M. That would be too much of a coincidence...

So, future plans, all three to be pursued in parallel, at the same time:

1. Make a clever batch file to grab the checkpoint files as soon as they are saved by the Owl, and store them in a better-organized fashion, to be able to resume in case sh!t happens. Possibly, read the iteration number from inside the file (well... would it have been so difficult for gpuOwl to write it in the name of the file?).
2. Test how PRP really behaves - maybe due to GC this is not needed anymore, and GC can indeed save us all the trouble. Note that we haven't got that far yet, i.e. running a long PRP test with GC active, and the LL/P-1 that we played with have no GC-like check.
3. Continue to bother Mihai on all fronts, until he is totally pissed off at us and makes the Owl to our liking (similar to what we did during cudaLucas development).

2020-04-25, 13:33   #6
kriesel

I wonder if some of the mismatch you may be seeing could be differences in sieving results between runs. As I recall, I've seen P-1 be non-reproducible, also. Different bounds selected by CUDAPm1 from one run to the next, or different sieving of the same bounds, may occur. Sieving differences below B1 would affect res64 matches in stage 1. Sieving differences in B1 to B2 would affect res64 matches in stage 2, if there were any res64 there to look at in gpuowl.

2020-04-25, 15:40   #7
kriesel

Maybe try one or more test runs with known-factor exponents.
P-1 selftest candidates https://www.mersenneforum.org/showpo...8&postcount=31
General background re P-1 errors (work in progress) https://www.mersenneforum.org/showth...937#post509937
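
As a quick illustration of that sanity check (a toy sketch, not from the linked posts): given an exponent with a known factor, you can confirm the factor divides M_p and see how smooth f-1 is, i.e. what bounds a healthy P-1 run would need to find it. The p = 11, f = 23 pair below is just a tiny example; real self-test candidates are in the first link.
[CODE]
# Toy sanity check for known-factor P-1 self-tests (illustrative values only).
from sympy import factorint   # used only to show the factorization of f - 1

def divides_mersenne(p: int, f: int) -> bool:
    """True if f divides the Mersenne number 2^p - 1."""
    return pow(2, p, f) == 1

p, f = 11, 23                 # 2^11 - 1 = 2047 = 23 * 89
assert divides_mersenne(p, f)
print(f"{f} divides M{p}; f-1 = {f - 1} = {factorint(f - 1)}")
# f-1 = 22 = 2 * 11, so a healthy P-1 run with even tiny bounds should report it;
# if a card's run on such a candidate misses the factor, that card/app combo is suspect.
[/CODE]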

2020-04-25, 16:58   #8
LaurV

Quote:
Originally Posted by kriesel
I wonder if some of the mismatch you may be seeing could be differences in sieving results between runs. As I recall, I've seen P-1 be non-reproducible, also. Different bounds selected by CUDAPm1 from one run to the next, or different sieving of the same bounds, may occur. Sieving differences below B1 would affect res64 matches in stage 1. Sieving differences in B1 to B2 would affect res64 matches in stage 2, if there were any res64 there to look at in gpuowl.
For sure, if B1 changes, the residues will be very different, but they will start to differ much earlier. That is because you compute E as the product of all prime powers up to B1, and then when you raise b^E there is a different order of the bits there (square, multiply by b). For example, if you change B1 from 20 to 30 (a trivial example), they will give you the same result when you put the powers of 2 into E (2^4 = 16 fits both limits), but once you put the powers of 3 in, well, you have 3^2 under 20 but 3^3 under 30, so they will already be different. Also, if you sieve on the CPU and/or you sieve with a time limit (regardless of CPU or GPU), then your sieve will get different results depending on how busy the CPU/GPU is. But if you sieve with hard limits, like "with primes lower than 40k" (as P95 does), then there is no reason why P-1 shouldn't be reproducible.

It may be made non-reproducible intentionally, for example by introducing a random prime into E at the very beginning. This is actually an interesting idea (I will put a copyright on it) because, besides the fact that it increases (infinitesimally) the chance of finding a factor, it also does the square-and-multiply-by-base (exponentiation) with completely new data every time, similar to the "shifts" of the LL/PRP tests. However, there are two arguments against it: first, the residues will differ from the start; second, I don't know whether it is worth the trouble, because aside from paranoid guys like me, nobody is going to double-check P-1 tests.

But it is an interesting idea...
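
A small sketch of that construction, assuming the usual stage 1 recipe (E = product of all prime powers up to B1, stage 1 residue = 3^(2·p·E) mod M_p); the function names are illustrative and not gpuOwl's internals:
[CODE]
# Build the stage 1 exponent E for a bound B1 and compute a toy stage 1 residue.
from sympy import primerange

def stage1_exponent(B1: int) -> int:
    """E = product over primes q <= B1 of the largest power q^k <= B1 (i.e. lcm(1..B1))."""
    E = 1
    for q in primerange(2, B1 + 1):
        qk = q
        while qk * q <= B1:
            qk *= q
        E *= qk
    return E

def stage1_residue(p: int, B1: int, base: int = 3) -> int:
    Mp = (1 << p) - 1
    # 2*p is folded in because any factor of M_p has the form 2*k*p + 1
    return pow(base, 2 * p * stage1_exponent(B1), Mp)

print(stage1_exponent(20))     # 2^4 * 3^2 * 5 * 7 * 11 * 13 * 17 * 19 = 232792560
print(stage1_exponent(30))     # picks up 3^3, 5^2, 23, 29 instead -> a different E
print(stage1_residue(11, 20))  # toy run on M11; real runs use huge p and B1
[/CODE]
Since the bits of E drive the square-and-multiply schedule, any change in B1 (or an injected random prime) changes the interim residues from very early on, exactly as argued above.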

From what I have seen of the Owl's work (talking only about P-1), if you have also witnessed "non-reproducibility" in the past, then my confidence in the P-1 results reported to PrimeNet by all other users, assuming the work was done with gpuOwl, is zero divided by four.

And that is where more work has to be done.

Finally, I managed to finish one P-1 run giving the same Stage 1 residues as one of the 6 existing previous runs, on the same card. The other card (the same card that produced the other 3 wrong residues!) started differing at iteration ~3.5 million, and I scrapped that run, together with the other 5. So, out of 8 runs on 2 cards, one card produced 4 bad runs, and the other produced 2 good and 2 bad. I let Stage 2 finish (on both cards, by copying the end-of-Stage-1 checkpoints) and reported the result to PrimeNet.

No factors.

But the Stage 2 result is still "not sure", because there is no residue output. What if both cards just went nuts?

I started PRP, which, up to now, matches.

I also tried mfakto and these cards output about 1100-1500 GHzD/day, depending on the bit level (the higher output is for the higher levels). So the "break-even point" is somewhere around 80 bits, and James' calculus is correct. Going higher with the "sevens" is a waste of time; you could clear exponents faster by PRP and PRPDC than you can clear them by TF to 81 bits (one PRP run takes about 17 days, assuming everything goes perfectly, with no very long restarts or a lot of "resumes" - mind that I don't know yet how well PRP runs, or how efficient GC is in catching errors and resuming in a timely manner).
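
For what it's worth, here is a rough sketch of that break-even calculus, under two standard GIMPS heuristics that are my assumptions rather than numbers from this thread: TF effort roughly doubles per bit level, and the chance of a factor between 2^b and 2^(b+1) is roughly 1/b. The concrete day counts below are placeholders to be replaced by measured values (mersenne.ca / James' pages have the proper figures):
[CODE]
# Rough TF-vs-PRP break-even sketch; all concrete numbers are placeholders.
def tf_worthwhile(bit: int, tf_days_this_bit: float, prp_days: float,
                  tests_saved: float = 2.0) -> bool:
    """One more TF bit pays off if its cost is below the expected primality-test time saved.

    Expected saving = P(factor in this bit level) * tests_saved * prp_days,
    with P(factor between 2^bit and 2^(bit+1)) ~ 1/bit (GIMPS heuristic).
    tests_saved = 2 assumes a found factor spares both a PRP and a PRP-DC.
    """
    return tf_days_this_bit < (1.0 / bit) * tests_saved * prp_days

prp_days = 17.0     # from the post: one PRP of this exponent takes ~17 days
tf_days = 0.3       # placeholder: days to take 80 -> 81 bits on the same card
for bit in range(80, 85):
    verdict = "worth it" if tf_worthwhile(bit, tf_days, prp_days) else "not worth it"
    print(f"TF {bit} -> {bit + 1}: {verdict}")
    tf_days *= 2.0  # assumption: each extra bit costs about twice the previous one
[/CODE]
With these placeholder numbers the crossover lands around 80-81 bits, consistent with the observation above, but the verdict shifts with the actual per-card throughput.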

Therefore, if you want to run some TF, now is the time. Thanks in advance. It is not urgent - you can do it or not, or do it after a while, like next month or so - but please tell me if you start, and if you find a factor, because sooner or later I will put together a new box for the "forces" and then I can resume TF too.

2020-04-26, 09:55   #9
kriesel

Quote:
Originally Posted by LaurV
But the Stage 2 result is still "not sure", because there is no residue output. What if both cards just went nuts?
That's what test candidates with known factors are for, determining whether a given gpu and app are sane or not.
Quote:
I also tried mfakto and these cards output about 1100-1500 GHzD/day, depending on the bit level (the higher output is for the higher levels). So the "break-even point" is somewhere around 80 bits, and James' calculus is correct. Going higher with the "sevens" is a waste of time; you could clear exponents faster by PRP and PRPDC than you can clear them by TF to 81 bits (one PRP run takes about 17 days, assuming everything goes perfectly, with no very long restarts or a lot of "resumes" - mind that I don't know yet how well PRP runs, or how efficient GC is in catching errors and resuming in a timely manner).
TF belongs on GTX or RTX; primality testing on CPU or Radeon VII. Heinrich's charts are for TF and primality optimization ON THE SAME GPU. But that is a suboptimization, and increasingly a bad one. Using each card type for what it is relatively better at is a better optimization. No one knows how to optimize that combination of disparate resources, but it seems to me an intermediate TF level like 81 or 82 makes sense there.
Quote:
Therefore, if you want to run some TF, now is the time.
When I saw your battle with 6 P-1 runs, time out, break, I preemptively launched. Should be done to 81 later today.

2020-04-26, 17:20   #10
kriesel

TF to 81 https://www.mersenne.org/report_expo...2381057&full=1
2020-04-27, 05:34   #11
LaurV

Quote:
Originally Posted by kriesel
That's what test candidates with known factors are for, determining whether a given gpu and app are sane or not.
Thanks for the TF.

So PRP (from start to finish) would end in 16 days (and a bit more), which works out to about 6% (and a bit more) per day. Meanwhile, a day and a half (of the 16) has already gone past, after which the PRP has progressed to the expected ~10%, without any error and without any GC resume. Both cards output exactly the same residues, all the way. Which says the cards are sane, the app is sane for PRP, and most probably, given the LLDCs I did before (see earlier posts), the app is also sane for LL (except that it is very hard to use, due to missing worktodo facilities and checkpoint history). However, the app sucks for P-1. Unless it is not supposed to be reproducible, as you said, in which case I am wrong and have wasted my time. Mihai's input here would be valuable...

