mersenneforum.org Testing....

2010-02-24, 23:10   #100
gd_barnes

"Gary"
May 2007
Overland Park, KS

3×5×7×113 Posts

I'm having a serious concern about the most recent stress test now. With the aforementioned 5 errors fixed, there should be no problems. But something happened that has now happened 3 times in a row:

1. The server "loses" several pairs right at the very beginning. In this case, it is pairs numbered 6 thru 10. The first 5 went through OK. (BTW, my cache was set to 10 for this test.)

2. The server "loses" a large # of pairs at the very end. In this case, 42 of them, which is the fewest that it's lost in any of the 3 stress tests I've run. (Likely because I was only running 4 cores vs. 31 cores.) Checking confirmed that it was the final 42 pairs.

They are just sitting in knpairs.txt and joblist.txt as though they were handed out and never processed. Yet checking my clients confirmed that they were.

I don't know if this is stress-related or related to problems in the Linux client/script. Since this occurred when running just one quad, which is effectively like 1000+ clients at n=~400K, which makes it a pretty decent stress test, I may need to run the Windows client to see if it has the same problem. I can simulate a similar load with 4 cores of my I7 with the same knpairs loaded in the server.

I initially thought that it might be related to the fact that all of the first few pairs are prime except that the same issue seems to be happening at the beginning of the file as at the end.

For reference, I'm attaching the final knpairs that didn't process and the joblist. See a few posts back where I posted the entire knpairs file. The prune period was set to 15 mins and the server dried some 9 hours ago so these are not just some straggling pairs that still need to be received by the server.
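One way to pin down exactly which pairs went missing is to diff what the server handed out against what it recorded as results. The following is only a minimal Python sketch: it assumes (hypothetically) that each file has been reduced to lines starting with "k n", which may not match the real knpairs.txt, joblist.txt or results formats, and the function names are made up for illustration.

```python
def parse_pairs(lines):
    """Collect (k, n) integer pairs from lines of the assumed form 'k n ...'."""
    pairs = set()
    for line in lines:
        fields = line.split()
        # Skip blank lines, headers, and anything not starting with two integers.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            pairs.add((int(fields[0]), int(fields[1])))
    return pairs


def lost_pairs(handed_out_lines, result_lines):
    """Pairs that were handed out but have no recorded result, in sorted order."""
    return sorted(parse_pairs(handed_out_lines) - parse_pairs(result_lines))
```

Running the same comparison against the client-side logs would show whether the client really did report the pairs that the server never recorded.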

Gary
Attached Files
 joblist-knpairs.tar.gz (909 Bytes, 113 views)

Last fiddled with by gd_barnes on 2010-02-24 at 23:11

2010-02-24, 23:15   #101
gd_barnes

"Gary"
May 2007
Overland Park, KS

3·5·7·113 Posts

Quote:
 Originally Posted by kar_bon
see post #97 in the first code-block: it's llrnet.lua. i'll use the same link as in post #1 for any new version.
i'll try to implement the other options the next time, not sure if today all of them.
BTW: i thought about another helpful output: when starting the script, prompt the most important setting from llr-clientconfig.txt at first!
Code:
+-------------------------------------+
| LLRnet client V0.9b7 with cLLR V3.8 |
| K.Bonath, 2010-02-10, Version 0.61  |
+-------------------------------------+
Current configuration:
server = "nplb-gb1.no-ip.org"
port = 9950
username = "kar_bon"
WUCacheSize=1
that's what would have saved some time on running and checking for errors at the first tests with the script (you know: forgot to change my username in your settings). suggestions?

That's a good idea on displaying that info. But if you do it, we need to make sure Max agrees and that he can change the Linux client.
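For whoever ends up porting this to the other client, here is a rough Python sketch of the idea, not the actual script code. It assumes the config file consists of simple `key = value` lines like the ones in the sample output, with lines starting with `--` treated as comments; the function names are hypothetical.

```python
import re

# Settings worth echoing at startup; names taken from the sample output above.
KEYS_TO_SHOW = ("server", "port", "username", "WUCacheSize")


def read_settings(text):
    """Parse 'key = value' lines, skipping blank lines and '--' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("--"):
            continue
        match = re.match(r"^(\w+)\s*=\s*(.+)$", line)
        if match:
            # Drop surrounding double quotes from string values.
            settings[match.group(1)] = match.group(2).strip().strip('"')
    return settings


def banner(settings):
    """Build the startup summary, flagging any setting that is missing."""
    lines = ["Current configuration:"]
    for key in KEYS_TO_SHOW:
        lines.append(f"  {key} = {settings.get(key, '<not set>')}")
    return "\n".join(lines)
```

Printing `<not set>` for a missing key makes a forgotten username stand out right at startup.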

Gary

2010-02-25, 01:02   #102
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

1100001101010₂ Posts

Quote:
 Originally Posted by gd_barnes
I'm having a serious concern about the most recent stress test now. With the aforementioned 5 errors fixed, there should be no problems. But something happened that has now happened 3 times in a row:

1. The server "loses" several pairs right at the very beginning. In this case, it is pairs numbered 6 thru 10. The first 5 went through OK. (BTW, my cache was set to 10 for this test.)

2. The server "loses" a large # of pairs at the very end. In this case, 42 of them, which is the fewest that it's lost in any of the 3 stress tests I've run. (Likely because I was only running 4 cores vs. 31 cores.) Checking confirmed that it was the final 42 pairs.

They are just sitting in knpairs.txt and joblist.txt as though they were handed out and never processed. Yet checking my clients confirmed that they were.

I don't know if this is stress-related or related to problems in the Linux client/script. Since this occurred when running just one quad, which is effectively like 1000+ clients at n=~400K, which makes it a pretty decent stress test, I may need to run the Windows client to see if it has the same problem. I can simulate a similar load with 4 cores of my I7 with the same knpairs loaded in the server.

I initially thought that it might be related to the fact that all of the first few pairs are prime except that the same issue seems to be happening at the beginning of the file as at the end.

For reference, I'm attaching the final knpairs that didn't process and the joblist. See a few posts back where I posted the entire knpairs file. The prune period was set to 15 mins and the server dried some 9 hours ago so these are not just some straggling pairs that still need to be received by the server.

Gary
I have to wonder if this has something to do with what I suggested before, that the server might not "know" it's time to prune unless there's actually activity happening. I doubt it has a separate thread devoted to monitoring such things, so that kind of behavior would indeed be expected. In fact, come to think of it, in the past I was able to "trigger" an overdue prune by sending in a completely bogus result with the intent that it would be rejected--that's enough to "wake it up".
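To illustrate the kind of behavior I mean, here is a toy Python model. It is not the LLRnet server code, just a sketch under the assumption that pruning is only ever run from inside a request handler; the class and method names are made up.

```python
class LazyPruneServer:
    """Toy model: expired jobs are only pruned when some request arrives."""

    def __init__(self, prune_period):
        self.prune_period = prune_period  # seconds before a job counts as stale
        self.jobs = {}                    # pair -> time it was handed out

    def hand_out(self, pair, now):
        self._prune(now)
        self.jobs[pair] = now

    def receive_result(self, pair, now):
        """Handle any incoming result; returns False if it is rejected."""
        self._prune(now)  # pruning piggybacks on activity
        return self.jobs.pop(pair, None) is not None

    def _prune(self, now):
        # A real server would return expired pairs to the pool for re-issue;
        # this model simply drops them from the job list.
        expired = [p for p, t in self.jobs.items() if now - t >= self.prune_period]
        for p in expired:
            del self.jobs[p]
```

With no timer thread, stale jobs sit in `jobs` indefinitely once the clients go quiet, and a single bogus `receive_result` call is enough to clear them.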

2010-02-25, 01:13   #103
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

2·5⁵ Posts

Quote:
 Originally Posted by gd_barnes Max, It took me a while but I finally concluded what you did: It's much better to "kill" the Linux client with the system manager than it is to do Ctl-C. There are several times I noticed when it took 3-4 Ctl-C's to kill it; usually on small tests -or- when the server had dried. (Don't quote me on the exact scenarios but I do know that sometimes it didn't want to "die" on the first Ctl-C.) Can you please put something in the documentation about it being best to kill the clients when stopping them?
I wouldn't recommend using kill as a matter of course, since that won't give LLR a chance to save its checkpoint file. For small tests like the ones we're testing with it's not a terribly big deal, but it would not be a good thing to recommend to users in general. It would probably be better to just say in the readme that you sometimes have to hit Ctrl-C a few times to stop it (especially with small tests, or when it's getting connection errors). For connection errors, though, it shouldn't hurt to just kill it the "hard" way.
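The difference comes down to which signals a program can react to. A minimal Python sketch of the general pattern (not how LLR itself is written; the `Worker` class and `install_handler` helper are hypothetical):

```python
import signal


class Worker:
    """Toy worker that notices a stop request and checkpoints before exiting."""

    def __init__(self):
        self.stop_requested = False
        self.checkpoint = None

    def request_stop(self, signum=None, frame=None):
        # Signal handlers should do as little as possible: just set a flag.
        self.stop_requested = True

    def run(self, total_steps, interrupt_at=None):
        """Return True if all steps finished, False if stopped early."""
        for step in range(total_steps):
            if interrupt_at == step:
                self.request_stop()  # stands in for a real Ctrl-C
            if self.stop_requested:
                self.checkpoint = step  # save progress before exiting
                return False
        return True


def install_handler(worker):
    # SIGINT (Ctrl-C) can be caught like this; SIGKILL (kill -9) cannot,
    # so a killed process never reaches its checkpoint code.
    signal.signal(signal.SIGINT, worker.request_stop)
```

Because the flag is only checked between steps, a program busy in a long step may not react until that step ends, which is one plausible reason a single Ctrl-C sometimes appears to be ignored.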

2010-02-25, 03:42   #104
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

2·5⁵ Posts

Quote:
 Originally Posted by gd_barnes That's a good idea on displaying that info. But if you do it, we need to make sure Max agrees and that he can change the Linux client.
Yeah, I suppose I could do that--shouldn't be too hard.

2010-02-25, 06:04   #105
gd_barnes

"Gary"
May 2007
Overland Park, KS

27131₈ Posts

Quote:
 Originally Posted by mdettweiler I have to wonder if this has something to do with what I suggested before, that the server might not "know" it's time to prune unless there's actually activity happening. I doubt it has a separate thread devoted to monitoring such things, so that kind of behavior would indeed be expected. In fact, come to think of it, in the past I was able to "trigger" an overdue prune by sending in a completely bogus result with the intent that it would be rejected--that's enough to "wake it up".
Hum. That doesn't quite hold water here. First, the first few pairs in the knpairs would have been processed long before. Second, the pairs are being shown as processed and sent to the server by the client, yet they are not showing up in the results. For the prune to work, wouldn't the results have to be there? You might take a peek at port 9985. The files from yesterday's run, including joblist, knpairs, results, and stdout, all have a "-1st" extension on them. Shortly I'm going to run it again but I'll make the file much smaller this go around.

Unfortunately I've already stopped the server; saved off the applicable files and reloaded it. I'll have to try a smaller file to retest it in < 1 hour or so instead of waiting 6-7 hours for it to dry.

Quote:
 Originally Posted by mdettweiler I wouldn't recommend using kill as a matter of course, since that won't give LLR a chance to save its checkpoint file. For small tests like the ones we're testing with it's not a terribly big deal, but it would not be a good thing to recommend to users in general. It would probably be better to just say in the readme that you sometimes have to hit Ctrl-C a few times to stop it (especially with small tests, or when it's getting connection errors). For connection errors, though, it shouldn't hurt to just kill it the "hard" way.
OK, agreed. Makes sense. For the most part, I've been able to make it stop by the 2nd Ctl-C and sometimes on the 1st. So yeah, just commenting that you might have to hit Ctl-C something like 2-4 times to get it to stop should be OK.

Gary

Last fiddled with by gd_barnes on 2010-02-25 at 06:05

2010-02-25, 06:05   #106
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

2·5⁵ Posts

I've now tested the do.pl script on Windows for most of today (since around 10 AM EST), and have encountered no problems except the small factor issue. BTW, I did a bit of investigating on that and found a couple things:

-The first of the four results (which came before the small factor in the batch) was accepted fine.
-On the server end, the small-factor result was received and accepted, though with the NewPGen header (!) put in place of the residual.
-The remaining 3 results in the batch, all of which came after the small-factor one, were rejected and subsequently thrown out by the client.

I'm not positive, but I think normal LLRnet is designed to be able to handle small factors correctly (though I haven't actually tested it). At any rate, though, no properly sieved file should ever have factors in it small enough for LLR to turn up; I imagine it wouldn't be a big deal if we didn't bother to fix it, since if there's small factors in the server, then there's a much bigger problem than just a few abandoned tests. Not to mention that if LLRnet doesn't have a precedent for handling these (as I said, I'm not sure if it does), then we wouldn't be able to fix it at all without adding code for it on the server end (which we probably don't want to get into).

Other than that, though, do.pl seems to be working perfectly. Gary, have you gotten the chance to test the latest version yet on Linux and do the stress test you were planning?

2010-02-25, 06:14   #107
gd_barnes

"Gary"
May 2007
Overland Park, KS

3·5·7·113 Posts

Quote:
 Originally Posted by mdettweiler
I've now tested the do.pl script on Windows for most of today (since around 10 AM EST), and have encountered no problems except the small factor issue. BTW, I did a bit of investigating on that and found a couple things:

-The first of the four results (which came before the small factor in the batch) was accepted fine.
-On the server end, the small-factor result was received and accepted, though with the NewPGen header (!) put in place of the residual.
-The remaining 3 results in the batch, all of which came after the small-factor one, were rejected and subsequently thrown out by the client.

I'm not positive, but I think normal LLRnet is designed to be able to handle small factors correctly (though I haven't actually tested it). At any rate, though, no properly sieved file should ever have factors in it small enough for LLR to turn up; I imagine it wouldn't be a big deal if we didn't bother to fix it, since if there's small factors in the server, then there's a much bigger problem than just a few abandoned tests. Not to mention that if LLRnet doesn't have a precedent for handling these (as I said, I'm not sure if it does), then we wouldn't be able to fix it at all without adding code for it on the server end (which we probably don't want to get into).

Other than that, though, do.pl seems to be working perfectly. Gary, have you gotten the chance to test the latest version yet on Linux and do the stress test you were planning?

The version that I posted yesterday is the latest version. Correct? lol It is that latest version that I ran my big stress test on yesterday. It was about halfway through the stress test that I changed a prime residue from 16 x's to a single digit of "0" like the Windows client. What I want to test today is the same script for the cancellation of pairs and the problem with the pairs not processed at the beginning and end of the file by the server.

I just got back in after a long day and need to do a couple of things yet. But I plan to test in the wee hours here for 2-4 hours.

BTW, I also observed what you did on a pair that had a factor of 5. It put the file header in the residue. You know what? I think that might explain why the 4-5 pairs right after it were not accepted by the server even though the client processed them. Bingo! And...if what you said about the final pruning is causing them not to be processed at the end, well...that might explain completely what happened yesterday with the pairs that weren't processed by the server. That said, the server never showed the results for the missing pairs at the end so I'm questioning how a final prune would actually be able to work.

Agreed that a small factor should never happen on a reasonably sieved file. As a programmer though, it would be nice to code around it but not at the expense of a lot of extra time/testing. I'll see what the code looks like.
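If it does turn out to be worth coding around, one cheap option would be for the client to sanity-check each residue before sending the batch, so a single malformed result can't take the ones after it down with it. A rough Python sketch, assuming a valid residue is either 16 hex digits or the single digit "0" used for a prime; the helper name is hypothetical and this is not the do.pl or llrnet.lua code.

```python
import re

# Assumption: a valid residue is 16 hex digits, or "0" for a prime.
RESIDUE_RE = re.compile(r"0|[0-9A-Fa-f]{16}")


def split_batch(results):
    """Separate well-formed (k, n, residue) results from malformed ones."""
    good, bad = [], []
    for k, n, residue in results:
        if RESIDUE_RE.fullmatch(residue):
            good.append((k, n, residue))
        else:
            bad.append((k, n, residue))
    return good, bad
```

The client could then submit only the `good` list and log the `bad` entries for a manual look, without any change on the server end.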

Gary

Last fiddled with by gd_barnes on 2010-02-25 at 06:17

2010-02-25, 06:24   #108
gd_barnes

"Gary"
May 2007
Overland Park, KS

27131₈ Posts

Quote:
 Originally Posted by kar_bon
it's not so easy as thought, but the following lines will do the trick.
Code:
result, residue = primeTest(t, format("%s %s", k, n))
if result == 0 then
   residue = "0"
end
so, if a prime is found, set the residue to '0' and all is ok!
Note: not needed for the script, only for the 'old' version of the LLRnet-client.

Karsten,

I was looking to make this change to the residue for a prime in llrnet.lua on the Linux side but it appears to already default to a "0". Here is the code:

Code:
         -- perform prime test !
         if not asynchronous then
            Logout() -- logout before performing computation
         end
--       UpdateStatus(format("Working on : %s/%s (%s)", k, n, t))
--       print(format("Working on : %s/%s (%s)", k, n, t))
--       result, residue = primeTest(t, format("%s %s", k, n))
         result, residue = 0, "0"
         -- check user interruption
         if stopCheck() then
            return -- return with no error
         end
      end
      SemaWait(semaphore)

What change is needed to accomplish what you are talking about?

Edit: The code in the Windows client is the same. Please enlighten me.

Last fiddled with by gd_barnes on 2010-02-25 at 06:26

2010-02-25, 06:31   #109
kar_bon

Mar 2006
Germany

5·601 Posts

Quote:
 Originally Posted by gd_barnes BTW, I also observed what you did on a pair that had a factor of 5. It put the file header in the residue. You know what? I think that might explain why the 4-5 pairs right after it were not accepted by the server even though the client processed them. Bingo! And...if what you said about the final pruning is causing them not to be processed at the end, well...that might explain completely what happened yesterday with the pairs that weren't processed by the server. That said, the server never showed the results for the missing pairs at the end so I'm questioning how a final prune would actually be able to work. Agreed that a small factor should never happen on a reasonably sieved file. As a programmer though, it would be nice to code around it but not at the expense of a lot of extra time/testing. I'll see what the code looks like.
i had a look at the llrserver.lua:
try to do this: there are functions called PrunePairs() and PruneJoblist() (called in function ProxyUpdate). add an output on the server such as "print("PrunePairs Call 1")" just before every place that function is called (same for the other one), and give every call its own number, so you can tell which call invoked the function.
even better: put the date/time into it:
Code:
print(format("PrunePairs Call #1: [%s] ", date("%Y-%m-%d\ %H:%M:%S")))
and test a small amount of pairs and a prune time of 15 mins.

this lets you see where and when the server pruned; perhaps that's the issue: it only prunes when results are received.
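the same numbering idea as a small stand-alone Python sketch, just to show the pattern (not the llrserver.lua code; `trace` and `prune_pairs` are made-up names, and the real prune work is left out):

```python
from datetime import datetime


def trace(call_site, now=None):
    """Return a log line naming the call site and the time of the call."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return f"PrunePairs Call #{call_site}: [{stamp}]"


def prune_pairs(call_site, log, now=None):
    """Stand-in for the real prune routine; it only records who called it."""
    log.append(trace(call_site, now))
```

each place that calls the prune routine passes its own number, so the log shows both which call site fired and when, and gaps longer than the prune period become easy to spot.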

Last fiddled with by kar_bon on 2010-02-25 at 06:36

2010-02-25, 06:34   #110
gd_barnes

"Gary"
May 2007
Overland Park, KS

3·5·7·113 Posts

Quote:
 Originally Posted by kar_bon
i had a look at the llrserver.lua: try to do this: there are functions called PrunePairs() and PruneJoblist() (called in function ProxyUpdate). add an output on the server such as "print("PrunePairs Call 1")" just before every place that function is called (same for the other one), and give every call its own number, so you can tell which call invoked the function. even better: put the date/time into it:
Code:
print(format("PrunePairs Call #1: [%s] ", date("%Y-%m-%d\ %H:%M:%S")))
and test a small amount of pairs and a prune time of 15 mins.

Don't you ever sleep? lol

I'm going to need some help with this. I'm not clear on where in llrserver.lua that it goes. Can you post an updated llrserver.lua file with this change in it?


All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
