mersenneforum.org  

2021-11-05, 06:36   #12
axn
Jun 2003
3^4×5×13 Posts

Quote:
Originally Posted by Prime95
It is not in the works. There are two reasons:
1) Prime95 would require a lot of memory during the PRP+Stage1 part of the PRP test. Thus, not an option that can be turned on in a default install. A prime95 default install is not supposed to interfere with normal work.
2) There is some overhead involved in PRP+Stage1. I need to revisit the process to quantify -- it's been a long while since I last looked at it. IIRC, if you have 128 temporaries you get a theoretical 7x increase in stage 1 performance -- I don't recall what I expected the overhead to knock that 7x down to.

128 temps = all 8-bit patterns (only odd numbers are considered) = on average 9 bits per multiplication (after each chunk of 8 bits, we expect a run of 0 bits with average length one, which we can skip over), so the theoretical increase will be 9x.

The option is viable with as few as 8 temps = 5 bits per multiplication.
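
The counting argument above can be checked with a minimal sketch, assuming a plain left-to-right sliding-window exponentiation. The function names, the base 3, and the small modulus are illustrative assumptions; this is not how Prime95 or gpuowl implement PRP+Stage1, only a way to see that a w-bit window needs 2^(w-1) temporaries and covers about w+1 exponent bits per multiplication.

```python
import random


def temporaries(window_bits):
    """Number of precomputed odd powers needed for a window of this width."""
    return 1 << (window_bits - 1)


def sliding_window_pow(base, exponent, modulus, window_bits):
    """Left-to-right sliding-window exponentiation.

    Returns (result, multiplications), where multiplications counts only
    the multiplies by a precomputed temporary, not the squarings.
    """
    base %= modulus
    base_sq = base * base % modulus
    # Odd powers base^1, base^3, ..., base^(2^w - 1): the "temporaries".
    odd_powers = {1: base}
    for k in range(3, 1 << window_bits, 2):
        odd_powers[k] = odd_powers[k - 2] * base_sq % modulus

    bits = bin(exponent)[2:]
    result = 1
    multiplications = 0
    i = 0
    while i < len(bits):
        if bits[i] == "0":
            # A zero bit outside a window costs one squaring and no multiply.
            result = result * result % modulus
            i += 1
            continue
        # Take up to window_bits bits, then trim so the window ends in a 1.
        j = min(i + window_bits, len(bits))
        while bits[j - 1] == "0":
            j -= 1
        for _ in range(j - i):
            result = result * result % modulus
        result = result * odd_powers[int(bits[i:j], 2)] % modulus
        multiplications += 1
        i = j
    return result, multiplications


if __name__ == "__main__":
    random.seed(0)
    n_bits = 20_000
    exponent = random.getrandbits(n_bits) | (1 << (n_bits - 1))
    modulus = (1 << 61) - 1  # small stand-in modulus, not a real GIMPS candidate
    for w in (4, 8):
        value, mults = sliding_window_pow(3, exponent, modulus, w)
        assert value == pow(3, exponent, modulus)
        print(w, temporaries(w), round(n_bits / mults, 2))
```

For a random exponent the printed ratio should land near 5 for w = 4 and near 9 for w = 8, matching the "8 temps = 5 bits" and "128 temps = 9 bits" figures.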


Quote:
Originally Posted by Prime95
That said, it is worth implementing. The last year I've been working on needed gwnum library improvements as well as improving P-1 (and P+1/ECM) stage 2 in preparation for implementing this. 30.7 is close to a finished product, but Pavel and I are working on yet another stage 2 improvement!

Nice! Any and all improvements to P-1 are welcome, especially for the 2K project, where there is no scope for PRP+stage 1. If I may, what kind of potential improvement are we looking at? < 5%? > 10%?
2021-11-05, 14:41   #13
lisanderke
"Lisander Viaene"
Oct 2020
Belgium
89 Posts

Thank you for bringing this up, techn1ciaN! And thank you, kriesel, for your insightful calculations. Although I'm having difficulty fully wrapping my head around your second post in this thread, I'll simply accept your title
Quote:
Originally Posted by kriesel
Approx 1 test saved bounds appears optimal

as another reason for changing my own tests_saved from 2 to 1. (I've tried 1.048, but as Uncwilly noted, that does not register.) I've been meaning to do P-1 as a way to make more meaningful contributions to the FTC-wavefront with my 32 GB RAM system. I tend to use CPU-heavy applications during the day, which essentially put PRP tests on hold, making the assignments last /even/ longer on my i5-8400 (6 cores, 6 threads). Changing primality_tests_saved from 2 to 1 on the P-1 assignments I've queued will probably increase my throughput quite a bit! Currently every P-1 test takes approximately 39 hours to complete. Perhaps this time I'll stick to P-1 in the FTC-wavefront for a while longer.


On the subject of P-1 work, I've been learning a lot about best practices with this workload and I'm still trying to find the best configuration for my system:
OS, software: Win64/Win11 and Prime95 v30.3b6
Processor: Intel Core i5-8400 (6 cores, 6 threads) currently set to 6 workers
Memory: 4x8GB 2400MHz memory (32 GB)
I have daytime and nighttime Stage2/ECM limits of 13.2 GB and 18 GB respectively. (My OS + usual programs tend to use up to 10 GB, which leaves 8.8 GB of "games-reserved" memory (or memory for other memory-intensive programs) during the day, with the nighttime Stage2/ECM limit taking up 4.8 GB out of that "spare" memory.)
I have run 5760K FFT timing benchmarks with all possible worker variations of 6 cores and came up with the following values:
6 cores, 6 workers: anywhere between 121 and 132 iter/sec
6 cores, 3 workers: anywhere between 122 and 131 iter/sec
6 cores, 2 workers: anywhere between 121 and 132 iter/sec
6 cores, 1 worker: anywhere between 118 and 131 iter/sec


TL;DR: I have six workers, one core/thread each, doing P-1 with at least 2.2 GB of memory per worker and a maximum of 18 GB allocated in total. Every P-1 assignment takes approx. 39 hrs to complete. Phew. That was a mouthful.
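
The memory figures above can be reproduced with a short sketch. It assumes the stage 2 allowance is split evenly across all workers when they are in stage 2 at the same time; that even split is an assumption made for illustration, and the function and key names are hypothetical.

```python
def memory_budget(total_gb, os_and_apps_gb, day_limit_gb, night_limit_gb, workers):
    """Derive the figures quoted in the post from the four configured values."""
    return {
        # RAM left free during the day for games or other heavy programs.
        "reserved_day_gb": round(total_gb - os_and_apps_gb - day_limit_gb, 2),
        # Extra RAM the nighttime limit takes out of that daytime reserve.
        "extra_night_gb": round(night_limit_gb - day_limit_gb, 2),
        # Assumed even split of the allowance across simultaneous stage 2 runs.
        "per_worker_day_gb": round(day_limit_gb / workers, 2),
        "per_worker_night_gb": round(night_limit_gb / workers, 2),
    }


if __name__ == "__main__":
    for name, value in memory_budget(32.0, 10.0, 13.2, 18.0, workers=6).items():
        print(f"{name}: {value}")
```

With the numbers from the post, the daytime reserve comes out at 8.8 GB, the nighttime increase at 4.8 GB, and the daytime share per worker at 2.2 GB.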

In readme.txt I found these values for RAM usage with P-1 under "Daytime and nighttime P-1/ECM stage 2 memory". (Perhaps these are deprecated, or at least no longer correct for a different tests_saved value, given that a lower tests_saved means lower bounds and thus lower RAM usage(?) with 1 test saved versus 2 tests saved.)
Exponent    Minimum   Reasonable   Desirable
---------   -------   ----------   ---------
100000000   0.2GB     0.7GB        1.1GB
333000000   0.7GB     2.1GB        3.5GB

Based on this table I decided that a 107M exponent would certainly not require more than 2.2 GB of memory for a "desirable" P-1 test (this was before accounting for the tests_saved value of 1 instead of 2). I have read before that higher RAM allocation speeds up the P-1 process, but I do not know the relation (or rather, the difference in speed) between a higher core count per worker and higher RAM usage.
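
As a rough cross-check of the 2.2 GB estimate, the two table rows can be interpolated for a 107M exponent. Linear interpolation between the rows is purely an assumption here (readme.txt does not say how memory scales between them), and the names below are hypothetical.

```python
# Rows copied from the readme.txt table: exponent -> (minimum, reasonable, desirable) in GB.
README_TABLE = {
    100_000_000: (0.2, 0.7, 1.1),
    333_000_000: (0.7, 2.1, 3.5),
}


def interpolate_memory(exponent):
    """Linearly interpolate the three memory columns between the two table rows."""
    (p0, row0), (p1, row1) = sorted(README_TABLE.items())
    t = (exponent - p0) / (p1 - p0)
    return tuple(round(a + t * (b - a), 2) for a, b in zip(row0, row1))


if __name__ == "__main__":
    minimum, reasonable, desirable = interpolate_memory(107_000_000)
    print(minimum, reasonable, desirable)
```

Under that assumption the "desirable" figure for 107M stays well below 2.2 GB, which is consistent with the conclusion drawn from the table.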


My questions are the following:
1: Is my understanding of the P-1 workload wrong in any of the above statements? (Or are there any other mistakes present?)
2: Did I miss any important factors (pun intended) of doing P-1? Are there other things I haven't accounted for that would impact my system's ability to do 'optimal' P-1?
3: Do I have too little minimum stage 2 memory allocated (2.2 GB) for FTC-wavefront P-1?
4: Would it be better for me to have 1 worker, 6 cores take up all of the stage 2 memory? And also, why? (I've been stuck on the reasoning for choosing 1, 2, 3 or 6 workers)

Last fiddled with by lisanderke on 2021-11-05 at 14:54 Reason: added OS and Prime95 build info.
2021-11-05, 14:51   #14
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×3,049 Posts

Quote:
Originally Posted by axn
128 temps = all 8-bit patterns (only odd numbers are considered) = on average 9 bits per multiplication (after each chunk of 8 bits, we expect a run of 0 bits with average length one, which we can skip over), so the theoretical increase will be 9x.

The option is viable with as few as 8 temps = 5 bits per multiplication.

I don't follow. The temporaries are IIUC 3^x mod Mp, so p bits large each.
128 or 256 temporaries are multiple gigabytes. Gpuowl v7.x would fill a GPU's 8 or 16 GiB memory about as full as we let it with them. https://mersenneforum.org/showpost.p...30&postcount=2
Actually it looks like they are larger than p bits:

https://mersenneforum.org/showpost.p...52&postcount=7 gives an example of 22 MB/buffer (176 Mbits) for a 100M exponent.

Mihai on stage 2 efficiency vs low or high memory amount: https://mersenneforum.org/showpost.p...6&postcount=31 That post references "big" buffers of 44 MB each at the wavefront at that time.


...oh, you're perhaps referring to how many bits of the P-1 stage 1 power can be accomplished per very-wide multiplication of saved big temporaries.
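
To put numbers on "multiple gigabytes", here is a small sketch using the 22 MB/buffer figure quoted above for a 100M exponent. The helper names are hypothetical, and decimal megabytes (1 MB = 10^6 bytes) are assumed.

```python
MB_PER_BUFFER = 22  # figure quoted from the linked post for a 100M exponent


def buffers_mb(count, mb_per_buffer=MB_PER_BUFFER):
    """Total memory in MB for the given number of saved temporaries."""
    return count * mb_per_buffer


def overhead_vs_p_bits(exponent, mb_per_buffer=MB_PER_BUFFER):
    """Ratio of the quoted buffer size to a bare p-bit residue."""
    raw_mb = exponent / 8 / 1e6  # p bits expressed in decimal MB
    return mb_per_buffer / raw_mb


if __name__ == "__main__":
    for count in (8, 128, 256):
        print(count, "temporaries:", buffers_mb(count), "MB")
    print("buffer size / p-bit residue:", round(overhead_vs_p_bits(100_000_000), 2))
```

At 22 MB each, 128 temporaries come to 2816 MB and 256 to 5632 MB, while 8 temporaries need only 176 MB.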

Last fiddled with by kriesel on 2021-11-05 at 15:37
2021-11-05, 15:34   #15
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×3,049 Posts

Quote:
Originally Posted by lisanderke
On the subject of P-1 work, I've been learning a lot about best practices with this workload and I'm still trying to find the best configuration for my system:
OS, software: Win64/Win11 and Prime95 v30.3b6
Processor: Intel Core i5-8400 (6 cores, 6 threads) currently set to 6 workers
Memory: 4x8GB 2400MHz memory (32 GB)
I have daytime and nighttime Stage2/ECM limits of 13.2 GB and 18 GB respectively. (My OS + usual programs tend to use up to 10 GB, which leaves 8.8 GB of "games-reserved" memory (or memory for other memory-intensive programs) during the day, with the nighttime Stage2/ECM limit taking up 4.8 GB out of that "spare" memory.)
I have run 5760K FFT timing benchmarks with all possible worker variations of 6 cores and came up with the following values:
6 cores, 6 workers: anywhere between 121 and 132 iter/sec
6 cores, 3 workers: anywhere between 122 and 131 iter/sec
6 cores, 2 workers: anywhere between 121 and 132 iter/sec
6 cores, 1 worker: anywhere between 118 and 131 iter/sec
...

My questions are the following:
1: Is my understanding of the P-1 workload wrong in any of the above statements? (Or are there any other mistakes present?)
2: Did I miss any important factors (pun intended) of doing P-1? Are there other things I haven't accounted for that would impact my system's ability to do 'optimal' P-1?
3: Do I have too little minimum stage 2 memory allocated (2.2 GB) for FTC-wavefront P-1?
4: Would it be better for me to have 1 worker, 6 cores take up all of the stage 2 memory? And also, why? (I've been stuck on the reasoning for choosing 1, 2, 3 or 6 workers)

Consider updating prime95 to v30.6b4 or v30.7b7.

Given your benchmark summary table, I'd be inclined to run 2 workers. That apparently costs little or no throughput, gives much lower latency than 6 workers, and reduces the disk space required for work in progress. Also, for P-1 work it can allow more RAM per worker for the same total allowed when memory-hungry stage 2s overlap in time, which will help P-1 effectiveness.
Next, if the daytime and nighttime allowances were equalized, that would avoid restarts on allowance changes twice daily, for a bit more efficiency. (And in some versions, a restart does not continue from the already completed point but restarts stage 2 from the beginning, which can be a big net throughput loss.)
If your system is not fully populated with RAM, you may gain some more performance by adding more; prime95 is typically memory-bandwidth bound, and using all the available memory channels helps.
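
The latency point can be made concrete with a back-of-the-envelope sketch. It assumes total throughput is the same for every worker count (as the benchmark table roughly suggests) and ignores any stage 2 gain from extra RAM per worker, so the results are only indicative; the function name is hypothetical.

```python
def latency_hours(hours_with_old_workers, old_workers, new_workers):
    """Time per assignment after changing worker count at constant total throughput."""
    return hours_with_old_workers * new_workers / old_workers


if __name__ == "__main__":
    # 39 hours per P-1 assignment with 6 workers is the figure from the quoted post.
    for workers in (6, 3, 2, 1):
        print(workers, "workers:", latency_hours(39.0, 6, workers), "hours each")
```

Under those assumptions, going from 6 workers to 2 cuts the time per assignment from 39 hours to 13 hours while the number of assignments finished per day stays the same.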

Use the reference info for additional background.


Don't let concern about optimization get in the way of having fun.
(When near the optimum, modest deltas on the independent variables have little or no effect on the dependent variable being optimized. That is the layman's version of "at the optimal point, the partial derivative is zero, and near there, it's still small.")

Last fiddled with by kriesel on 2021-11-05 at 15:45
2021-11-05, 16:53   #16
axn
Jun 2003
12221_8 Posts

Quote:
Originally Posted by kriesel
...oh, you're perhaps referring to how many bits of the P-1 stage 1 power can be accomplished per very-wide multiplication of saved big temporaries.

Yep. Sorry about the confusion.
2021-11-05, 18:35   #17
Chuck
May 2011
Orange Park, FL
29·31 Posts

Quote:
Originally Posted by Prime95
I agree that setting tests_saved to 1 would be optimal for GIMPS throughput.

Why not just have a parameter in local.txt like

P1TestsSaved=n.n

with a default value of 2.0?
2021-11-05, 19:14   #18
lisanderke
"Lisander Viaene"
Oct 2020
Belgium
89 Posts

Quote:
Originally Posted by Chuck
Why not just have a parameter in local.txt like

P1TestsSaved=n.n

with a default value of 2.0?

Seconded. I spent about 15 minutes trying to find this line, only to realize that tests_saved can be adjusted in the assignment line.
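
For anyone wanting to script that adjustment, here is a minimal sketch. It assumes that P-1 assignments appear as lines starting with "Pfactor=" and that tests_saved is the last comma-separated field; both points should be checked against the documentation for your Prime95 version. The sample line and the function name are hypothetical.

```python
def set_tests_saved(line, tests_saved):
    """Return the line with its last field replaced, if it is a Pfactor line."""
    stripped = line.rstrip("\n")
    if not stripped.startswith("Pfactor="):
        return stripped  # leave every other work type untouched
    head, _, _old_value = stripped.rpartition(",")
    return f"{head},{tests_saved:g}"


if __name__ == "__main__":
    # Hypothetical assignment line, for illustration only.
    sample = "Pfactor=N/A,1,2,107000003,-1,76,2"
    print(set_tests_saved(sample, 1))
```

Stop Prime95 before editing the work file by hand or by script, so that the program does not overwrite the change.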
2021-11-05, 19:26   #19
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
17·599 Posts

Quote:
Originally Posted by Chuck
...with a default value of 2.0?

I would like to support this idea. Also, could we please be able to set the value beyond 9.0?

/Some/ of us are doing things with mprime which "don't make sense". But, it's our compute.

This wouldn't be as much of an issue if (for example) Pminus1=1,2,15458087,-1,2000000,28800000,72 lines worked with the Primenet API. But, they don't...
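
For readers unfamiliar with such lines, here is a sketch that splits the example above into named fields. The field order k, b, n, c, B1, B2 and an optional trial-factoring depth is an assumption based on how the example reads (the number being k*b^n+c); the class and function names are hypothetical, and lines carrying an assignment ID are not handled.

```python
from typing import NamedTuple, Optional


class Pminus1Work(NamedTuple):
    k: int
    b: int
    n: int
    c: int
    b1: int
    b2: int
    how_far_factored: Optional[int]


def parse_pminus1(line):
    """Split a Pminus1= line into its assumed fields."""
    key, _, rest = line.strip().partition("=")
    if key != "Pminus1":
        raise ValueError("not a Pminus1 line")
    fields = rest.split(",")
    if len(fields) < 6:
        raise ValueError("expected at least k,b,n,c,B1,B2")
    k, b, n, c, b1, b2 = (int(field) for field in fields[:6])
    depth = int(fields[6]) if len(fields) > 6 else None
    return Pminus1Work(k, b, n, c, b1, b2, depth)


if __name__ == "__main__":
    work = parse_pminus1("Pminus1=1,2,15458087,-1,2000000,28800000,72")
    print(f"{work.k}*{work.b}^{work.n}{work.c:+d}", "B1 =", work.b1, "B2 =", work.b2)
```

Read this way, the example asks for P-1 on 2^15458087-1 with B1 = 2,000,000 and B2 = 28,800,000.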
2021-11-05, 19:29   #20
lisanderke
"Lisander Viaene"
Oct 2020
Belgium
89 Posts

Quote:
Originally Posted by kriesel
Consider updating prime95 to v30.6b4 or v30.7b7.

Given your benchmark summary table, I'd be inclined to run 2 workers. That apparently costs little or no throughput, gives much lower latency than 6 workers, and reduces the disk space required for work in progress. Also, for P-1 work it can allow more RAM per worker for the same total allowed when memory-hungry stage 2s overlap in time, which will help P-1 effectiveness.
Next, if the daytime and nighttime allowances were equalized, that would avoid restarts on allowance changes twice daily, for a bit more efficiency. (And in some versions, a restart does not continue from the already completed point but restarts stage 2 from the beginning, which can be a big net throughput loss.)
If your system is not fully populated with RAM, you may gain some more performance by adding more; prime95 is typically memory-bandwidth bound, and using all the available memory channels helps.

Thank you! I've updated to Prime95 v30.7 b7 and am currently using 2 workers, 3 cores each. I'll be re-doing some P-1 for low exponents (18-20M) and monitoring for any bugs other than the ones mentioned in the v30.7 b7 thread. I've set the nighttime and daytime stage 2 allowances to 14 GB each. I'm running a motherboard with only 4 memory slots and I'm using 4 RAM sticks, so it's "double" dual channel and fully populated.
2021-11-05, 20:34   #21
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
13722_8 Posts

Quote:
Originally Posted by Chuck
Why not just have a parameter in local.txt like

P1TestsSaved=n.n

Because historically, it varied by exponent. And sometimes does still.
2021-11-05, 20:48   #22
LOBES
Mar 2019
USA
1001001_2 Posts

This has been a very interesting thread, even though the majority of it is out of my realm of comprehension. I've read the description of the tests_saved parameter, but I still don't quite understand how setting the number of future primality tests saved if a factor is found corresponds with the P-1 test completing faster. Is there a layman's answer to that?
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
