
2021-01-18, 21:58   #837
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006
Saskatchewan, Canada

29×157 Posts

Am I reading this wrong....it doesn't seem right
Code:
Timings for 2304K FFT length (8 cores, 1 worker): 1.13 ms. Throughput: 881.23 iter/sec.
Timings for 2304K FFT length (8 cores, 8 workers): 14.88, 13.80, 14.65, 15.13, 14.54, 14.41, 14.49, 13.89 ms. Throughput: 553.21 iter/sec.
This seems to be telling me that if I ran only 1 worker and put all 8 cores on it (i7-7820x, 32GB DDR4, 8x4 Quad Channel) it would produce quite a bit more total throughput than 8 workers of 1 core each.
This seems contrary to what I've seen for every past computer (all 4 cores).

Related to that, I tried a separate test where I ran P-1s on 2 cores only, leaving the other 6 idle. The P-1s completed in 4:41.
If I run all 8 cores on P-1 (MaxHighMemWorkers=6), each P-1 takes about 9 hours.
Or are these benchmarks only legit for LL/PRP and not for P-1?

Here is a snippet from a benchmark from today
Code:
[Jan 18 16:01] Your timings will be written to the results.bench.txt file.
[Jan 18 16:01] Compare your results to other computers at http://www.mersenne.org/report_benchmarks
[Jan 18 16:01] Benchmarking multiple workers to measure the impact of memory bandwidth
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 1 worker. Average times: 0.83 ms. Total throughput: 1208.05 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 2 workers. Average times: 2.12, 2.22 ms. Total throughput: 921.34 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 4 workers. Average times: 5.31, 5.12, 5.26, 4.85 ms. Total throughput: 779.64 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 8 workers. Average times: 10.36, 10.62, 10.29, 9.98, 10.65, 10.65, 10.33, 10.32 ms. Total throughput: 769.56 iter/sec.
I benchmarked an i5-3570 with 16GB DDR3. The total throughput for 4 cores with 1, 2 or 4 workers was similar; slightly better overall with 4 workers....as expected.
Last fiddled with by petrw1 on 2021-01-18 at 22:16
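As a rough cross-check of the benchmark lines above, the "Throughput" figure appears to be the sum of each worker's iteration rate, that is 1000 divided by its per-iteration time in milliseconds. The sketch below assumes exactly that relationship; the function name is made up for illustration and is not part of Prime95. Because the printed times are rounded, the result only approximately matches the reported numbers.

```python
def total_throughput(iteration_times_ms):
    """Sum of per-worker rates in iter/sec, given per-iteration times in ms.

    Assumption: the benchmark's throughput is simply the sum of 1000/t
    over all workers. This is inferred from the posted numbers.
    """
    return sum(1000.0 / t for t in iteration_times_ms)


# Times copied from the 2304K FFT benchmark lines in the post.
one_worker = [1.13]
eight_workers = [14.88, 13.80, 14.65, 15.13, 14.54, 14.41, 14.49, 13.89]

for label, times in (("1 worker", one_worker), ("8 workers", eight_workers)):
    print(f"{label}: {total_throughput(times):.2f} iter/sec")
```

Under that assumption, the single 8-core worker comes out well ahead of eight 1-core workers, which is the same reading of the benchmark as in the post.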
2021-01-19, 03:54   #838
LaurV
Romulan Interpreter

Jun 2011
Thailand

5²×7×53 Posts

The benchmarks apply to P-1 too. Except for the GCD phase, P-1 is computationally the same as LL (needs FFT multiplications and squarings). Your system seems to be memory bound, so you will get better throughput if you run fewer workers. This has been the "norm" for a few years, with new CPUs with many cores, since P95 v28 or so. What you see is normal. I have a quad-channel X99 mobo with a 10-core i7-6950X on it, 64GB RAM (edit: in 8 sticks of 8GB)**, and I witness the same behavior as you, except the timing is a bit different. You get 4 hours wall-clock running one worker on 2 cores, but you will NOT get the same 4 hours running 2 workers, each on 2 cores (4 cores used in total), because they access the same memory channels, waiting for each other. Your best bet may be to run 2 workers, each on 4 cores, or so. Try different versions.
** Edit: does your 8x4 mean "8 sticks" or "4 sticks"? (and don't answer "yes")
Last fiddled with by LaurV on 2021-01-19 at 04:00
2021-01-19, 04:15   #839
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

10711₈ Posts

Quote:
 Originally Posted by LaurV The benchmarks apply to P-1 too. Except for the GCD phase, P-1 is computationally the same as LL (needs FFT multiplications and squarings). Your system seems to be memory bound, so you will get better throughput if you run fewer workers. This has been the "norm" for a few years, with new CPUs with many cores, since P95 v28 or so. What you see is normal. I have a quad-channel X99 mobo with a 10-core i7-6950X on it, 64GB RAM (edit: in 8 sticks of 8GB)**, and I witness the same behavior as you, except the timing is a bit different. You get 4 hours wall-clock running one worker on 2 cores, but you will NOT get the same 4 hours running 2 workers, each on 2 cores (4 cores used in total), because they access the same memory channels, waiting for each other. Your best bet may be to run 2 workers, each on 4 cores, or so. Try different versions. ** Edit: does your 8x4 mean "8 sticks" or "4 sticks"? (and don't answer "yes")
Thanks.
4x8GB per attachment.
I was hoping getting very fast (?) 3600 RAM would minimize the bottleneck.
I can try different cores/workers setups but the benchmark seems to indicate 1 worker x 8 cores will be the fastest by far. Or am I reading it wrong?
Attached Thumbnails

2021-01-19, 05:11   #840
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

29·157 Posts

Quote:
 Originally Posted by petrw1 Thanks. 4x8GB per attachment. I was hoping getting very fast (?) 3600 RAM would minimize the bottleneck. I can try different cores/workers setups but the benchmark seems to indicate 1 worker x 8 cores will be the fastest by far. Or am I reading it wrong?
8 workers x 1 core each takes over 9 hours to complete a P-1 (hard to be exact with all the stop/start from memory sharing between workers).
Total about 22 completions per day.

A short test suggests 2 workers x 4 cores will take almost exactly 2 hours to complete.
With 2 workers that is 24 completions per day.

A short test suggests 1 worker x 8 cores takes 51 minutes per completion.
That is 28 completions per day.
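The three estimates above follow from one formula: completions per day = workers × (minutes in a day) / (minutes per completion). A minimal sketch of that arithmetic follows, using the times quoted in this post. The 8-worker case uses exactly 9 hours, although the post says "over 9 hours", so it lands a little over 21, near the rough figure given above.

```python
MINUTES_PER_DAY = 24 * 60


def completions_per_day(workers, minutes_per_completion):
    """P-1 completions per day if every worker finishes one test
    every `minutes_per_completion` minutes."""
    return workers * MINUTES_PER_DAY / minutes_per_completion


# (workers, minutes per completion) as quoted in the post.
configs = {
    "8 workers x 1 core": (8, 9 * 60),   # "over 9 hours", taken as 9 h
    "2 workers x 4 cores": (2, 2 * 60),  # "almost exactly 2 hours"
    "1 worker x 8 cores": (1, 51),       # "51 minutes"
}

for name, (workers, minutes) in configs.items():
    print(f"{name}: {completions_per_day(workers, minutes):.1f} per day")
```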

I find it hard to believe that with 4x8GB DDR4 DRAM at 3600MHz the best throughput is 1 worker sharing 8 cores.

Could something be set up wrong?
Where do I start?

Thanks

2021-01-19, 05:16   #841
LaurV
Romulan Interpreter

Jun 2011
Thailand

5²·7·53 Posts

You are not reading it wrong. Try it and see if it is indeed so much faster. If it is not, you have an argument with George. P-1 does some other things too, besides squarings, and each system is different. When you said quad channel, I assumed 8 sticks. Also, make sure your board uses all 4 channels when populated with only 4 sticks; it may run in only 2 channels, and it may need all 8 sticks to run quad, or you may need to move the 4 sticks to other slots. (Stupid question: do you have 8 slots? I have no idea if there are any quad-channel boards with only 4 slots; you usually have 2 slots per channel, unless the mobo is crap.)
Edit: crosspost
Last fiddled with by LaurV on 2021-01-19 at 05:19
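To see why the channel count matters, here is a back-of-the-envelope sketch of theoretical peak memory bandwidth. It assumes the "3600" RAM mentioned earlier is DDR4-3600 actually running at its rated 3600 MT/s, and that each channel is 64 bits (8 bytes) wide. Real sustained bandwidth is lower, and the function name is only illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, channels, bus_width_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes a 64-bit (8-byte) data bus per channel, the standard
    width for DDR4 DIMMs.
    """
    return mega_transfers_per_sec * 1e6 * bus_width_bytes * channels / 1e9


# Same sticks, different channel modes (assumed DDR4-3600).
for channels in (2, 4):
    print(f"{channels} channels: {peak_bandwidth_gb_s(3600, channels):.1f} GB/s peak")
```

If the board were only running the four sticks in dual-channel mode, the ceiling would be half of the quad-channel figure, which is why checking the channel mode in a tool such as CPU-Z is a sensible first step.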
2021-01-19, 05:32   #842
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3²·19·43 Posts

Quote:
 Originally Posted by LaurV The benchmarks apply to P-1 too.
Yes and no. If the speed increase is due to all or a significant portion of the FFT data being cached rather than stored in RAM, then stage 2 of P-1 will not see the same speedup. Stage 1 should see the same benefit.
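The caching point can be made concrete with a rough footprint estimate. The sketch below assumes the FFT data is one 8-byte double per FFT element and ignores any extra tables or buffers the program keeps, so it is a lower bound and not a description of Prime95's actual memory layout. The cache size is a placeholder parameter: the 19 MB used here assumes about 11 MB of L3 plus 8 × 1 MB of L2 for the i7-7820X, which should be checked against the CPU's spec sheet.

```python
BYTES_PER_DOUBLE = 8
MIB = 1024 * 1024


def fft_footprint_mib(fft_length_k, workers=1):
    """Approximate FFT data size in MiB: one double per element,
    one independent FFT per worker (assumption, see lead-in)."""
    return fft_length_k * 1024 * BYTES_PER_DOUBLE * workers / MIB


ASSUMED_CACHE_MIB = 19  # hypothetical total L2 + L3, not taken from the thread

for workers in (1, 8):
    size = fft_footprint_mib(2304, workers)
    verdict = "roughly fits" if size <= ASSUMED_CACHE_MIB else "spills to RAM"
    print(f"2304K FFT, {workers} worker(s): {size:.0f} MiB ({verdict})")
```

With one worker the working set is on the order of the assumed cache size, while eight independent workers need several times more than any cache can hold, so they compete for the memory channels instead.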

2021-01-19, 14:51   #843
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

29×157 Posts

Quote:
 Originally Posted by LaurV You are not reading it wrong. Try it and see if it is indeed so much faster. If it is not, you have an argument with George. P-1 does some other things too, besides squarings, and each system is different. When you said quad channel, I assumed 8 sticks. Also, make sure your board uses all 4 channels when populated with only 4 sticks; it may run in only 2 channels, and it may need all 8 sticks to run quad, or you may need to move the 4 sticks to other slots. (Stupid question: do you have 8 slots? I have no idea if there are any quad-channel boards with only 4 slots; you usually have 2 slots per channel, unless the mobo is crap.) Edit: crosspost
MB: 2066 : ATX : DDR4 : X299 Extreme 4 : ASRock
See CPU-Z screen shots.
RAM is in slots 1, 3, 5, 7.

1 Worker x 8 Cores overnight.
Anywhere from 47 to 60 minutes per.
Code:
Magic_8_Ball	43313173	NF-PM1	2021-01-19 13:55	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43310401	NF-PM1	2021-01-19 13:07	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311067	NF-PM1	2021-01-19 12:20	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311071	NF-PM1	2021-01-19 11:21	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311077	NF-PM1	2021-01-19 10:29	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311101	NF-PM1	2021-01-19 09:29	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311131	NF-PM1	2021-01-19 08:37	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311269	NF-PM1	2021-01-19 07:37	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311287	NF-PM1	2021-01-19 06:45	0.0	B1=1000000, B2=20000000	4.5044
Attached Thumbnails

2021-01-21, 04:17   #844
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006
Saskatchewan, Canada

29·157 Posts

It's getting smarter and faster
I'm now completing these same P-1s in 45 to 47 minutes.
2021-01-21, 06:19   #845
LaurV
Romulan Interpreter

Jun 2011
Thailand

5²·7·53 Posts

Quote:
 Originally Posted by petrw1 I'm now completing these same P1 in 45 to 47 minutes.
Well done. So, no argument. What a pity...
Just to make it clear, my wheelbarrow is faster when I run 2 workers in 5 cores each, compared with a single worker in 10, especially for small FFT (like PRP-CF and CF-DC ranges). For larger FFTs, the difference is not significant, or is arguable. That's why I suggested you try both versions.

2021-01-28, 14:40   #846
MisterBitcoin

"Nuri, the dragon :P"
Jul 2016
Good old Germany

3²×89 Posts

Quick run without water cooling. After about 10 minutes the temperature rose to around 95°C-100°C, so I have to stop running BOINC work until I get the water cooling running.
Attached Thumbnails
2021-01-28, 14:52   #847
Viliam Furik

"Viliam Furík"
Jul 2018
Martin, Slovakia

2×193 Posts

Quote:
 Originally Posted by MisterBitcoin Quick run without water cooling. After about 10 minutes the temperature rose to around 95°C-100°C, so I have to stop running BOINC work until I get the water cooling running.
And what is the CPU? I guess it's Intel 10900 something.
