mersenneforum.org  

2021-01-18, 21:58   #837
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

29×157 Posts
Am I reading this wrong....it doesn't seem right

Code:
Timings for 2304K FFT length (8 cores, 1 worker):  1.13 ms.  Throughput: 881.23 iter/sec.
Timings for 2304K FFT length (8 cores, 8 workers): 14.88, 13.80, 14.65, 15.13, 14.54, 14.41, 14.49, 13.89 ms.  Throughput: 553.21 iter/sec.
This seems to be telling me that if I ran only 1 worker and put all 8 cores on it (i7-7820X, 32GB DDR4, 8x4 Quad Channel) it would produce quite a bit more total throughput than 8 workers of 1 core each.
This seems contrary to what I've seen on every past computer (all 4 cores).
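
For anyone checking the arithmetic: the total-throughput figure appears to be the sum, over workers, of 1000 / (ms per iteration). A minimal Python sketch, assuming that is how the benchmark derives it (the function name is only for illustration, not anything from Prime95):

```python
def total_throughput(ms_per_iter):
    """Sum of per-worker iteration rates, in iterations per second."""
    return sum(1000.0 / ms for ms in ms_per_iter)

# Timings copied from the two 2304K benchmark lines above.
one_worker = total_throughput([1.13])
eight_workers = total_throughput(
    [14.88, 13.80, 14.65, 15.13, 14.54, 14.41, 14.49, 13.89]
)

print(f"1 worker x 8 cores: {one_worker:.2f} iter/sec")
print(f"8 workers x 1 core: {eight_workers:.2f} iter/sec")
```

The 8-worker sum lands within rounding of the reported 553.21 iter/sec. The 1-worker value comes out slightly above the reported 881.23, presumably because the displayed 1.13 ms is itself rounded.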

Related to that, I ran a separate test with P-1s on 2 cores only, leaving the other 6 idle. The P-1s completed in 4:41.
If I run all 8 cores on P-1 (MaxHighMemWorkers=6) each P-1 takes about 9 hours.

Or are these benchmarks only legit for LL/PRP and not for P-1?

Here is a snippet from today's benchmark:
Code:
[Jan 18 16:01] Your timings will be written to the results.bench.txt file.
[Jan 18 16:01] Compare your results to other computers at http://www.mersenne.org/report_benchmarks
[Jan 18 16:01] Benchmarking multiple workers to measure the impact of memory bandwidth
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 1 worker.  Average times:  0.83 ms.  Total throughput: 1208.05 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 2 workers.  Average times:  2.12,  2.22 ms.  Total throughput: 921.34 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 4 workers.  Average times:  5.31,  5.12,  5.26,  4.85 ms.  Total throughput: 779.64 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 8 workers.  Average times: 10.36, 10.62, 10.29,  9.98, 10.65, 10.65, 10.33, 10.32 ms.  Total throughput: 769.56 iter/sec.
I benchmarked an i5-3570 with 16GB DDR3.
The total throughput for 4 cores with 1, 2 or 4 workers was similar; slightly better overall with 4 workers....as expected.
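
To compare worker layouts without eyeballing the log, the "Total throughput" figures can be pulled out with a regular expression. This is only a sketch: the pattern is written against the four log lines quoted above and is an assumption about the format, not a documented one.

```python
import re

# Lines copied from the 2048K benchmark output above.
LOG = """\
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 1 worker.  Average times:  0.83 ms.  Total throughput: 1208.05 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 2 workers.  Average times:  2.12,  2.22 ms.  Total throughput: 921.34 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 4 workers.  Average times:  5.31,  5.12,  5.26,  4.85 ms.  Total throughput: 779.64 iter/sec.
[Jan 18 16:01] Timing 2048K FFT, 8 cores, 8 workers.  Average times: 10.36, 10.62, 10.29,  9.98, 10.65, 10.65, 10.33, 10.32 ms.  Total throughput: 769.56 iter/sec.
"""

# ASSUMPTION: every timing line follows the shape seen in the lines above.
PATTERN = re.compile(
    r"Timing (\d+)K FFT, (\d+) cores?, (\d+) workers?\."
    r".*Total throughput: ([\d.]+) iter/sec"
)

def parse_throughput(log):
    """Return a list of (workers, iterations_per_second) tuples."""
    rows = []
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            rows.append((int(match.group(3)), float(match.group(4))))
    return rows

rows = parse_throughput(LOG)
best = max(rows, key=lambda row: row[1])
for workers, rate in rows:
    print(f"{workers} worker(s): {rate:8.2f} iter/sec ({rate / best[1]:.0%} of best)")
```

On these four lines the single-worker layout comes out on top, which is the same reading as in the post.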

Last fiddled with by petrw1 on 2021-01-18 at 22:16
2021-01-19, 03:54   #838
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

5²×7×53 Posts

The benchmarks apply to P-1 too. Except for the GCD phase, P-1 is computationally the same as LL (it needs FFT multiplications and squarings). Your system seems to be memory bound, so you will get better throughput if you run fewer workers. This has been the "norm" for a while (a few years) with new CPUs with many cores, since P95 v28 or so. What you see is normal. I have a quad-channel X99 mobo with a 10-core i7-6950X on it and 64GB RAM (edit: in 8 sticks of 8GB)**, and I see the same behavior as you, except the timings are a bit different. You get 4 hours wall-clock running one worker on 2 cores, but you will NOT get the same 4 hours running 2 workers, each on 2 cores (4 cores used in total), because they access the same memory channels and wait for each other. Your best bet may be to run 2 workers, each on 4 cores, or so. Try different versions.

** Edit: does your 8x4 mean "8 sticks" or "4 sticks"? (and don't answer "yes")

Last fiddled with by LaurV on 2021-01-19 at 04:00
2021-01-19, 04:15   #839
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

10711₈ Posts

Quote:
Originally Posted by LaurV
The benchmarks apply to P-1 too. Except for the GCD phase, P-1 is computationally the same as LL (it needs FFT multiplications and squarings). Your system seems to be memory bound, so you will get better throughput if you run fewer workers. This has been the "norm" for a while (a few years) with new CPUs with many cores, since P95 v28 or so. What you see is normal. I have a quad-channel X99 mobo with a 10-core i7-6950X on it and 64GB RAM (edit: in 8 sticks of 8GB)**, and I see the same behavior as you, except the timings are a bit different. You get 4 hours wall-clock running one worker on 2 cores, but you will NOT get the same 4 hours running 2 workers, each on 2 cores (4 cores used in total), because they access the same memory channels and wait for each other. Your best bet may be to run 2 workers, each on 4 cores, or so. Try different versions.

** Edit: does your 8x4 mean "8 sticks" or "4 sticks"? (and don't answer "yes")

Thanks.
4x8GB per attachment.
I was hoping getting very fast (?) 3600 RAM would minimize the bottleneck.
I can try different cores/workers setups but the benchmark seems to indicate 1 worker x 8 cores will be the fastest by far. Or am I reading it wrong?
[Attachment: RAM.jpg]
2021-01-19, 05:11   #840
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

29·157 Posts

Quote:
Originally Posted by petrw1
Thanks.
4x8GB per attachment.
I was hoping getting very fast (?) 3600 RAM would minimize the bottleneck.
I can try different cores/workers setups but the benchmark seems to indicate 1 worker x 8 cores will be the fastest by far. Or am I reading it wrong?

8 workers at 1 core each take over 9 hours to complete a P-1 (hard to be exact with all the stop/start from memory sharing between workers).
Total about 22 completions per day.

A short test suggests 2 workers x 4 cores will take almost exactly 2 hours to complete.
With 2 workers that is 24 completions per day.

A short test suggests 1 worker x 8 cores takes 51 minutes per completion.
That is 28 completions per day.
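
The completions-per-day figures above are just workers × (1440 / minutes per P-1). A quick Python sketch to redo the arithmetic; the function name is only for illustration, and 9 hours is used as a round number for the 8-worker case, which is only given as "over 9 hours":

```python
MINUTES_PER_DAY = 24 * 60

def completions_per_day(workers, minutes_per_completion):
    """P-1 completions per day if every worker runs back to back."""
    return workers * MINUTES_PER_DAY / minutes_per_completion

# (workers, minutes per P-1) taken from the timings quoted above.
configs = {
    "8 workers x 1 core": (8, 9 * 60),   # ASSUMPTION: exactly 9 hours
    "2 workers x 4 cores": (2, 2 * 60),
    "1 worker x 8 cores": (1, 51),
}

for name, (workers, minutes) in configs.items():
    print(f"{name}: {completions_per_day(workers, minutes):.1f} per day")
```

With exactly 9 hours the 8-worker case works out to a little over 21 per day, in the same ballpark as the roughly 22 quoted; the other two match the 24 and 28 above.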

I find it hard to believe that with 4x8GB DDR4 DRAM at 3600 MHz the best throughput is 1 worker sharing 8 cores.

Could something be set up wrong?
Where do I start?

Thanks
2021-01-19, 05:16   #841
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

5²·7·53 Posts

You are not reading it wrong. Try it and see if it is indeed so much faster. If it is not, you have an argument with George. P-1 does some other things too, besides squarings, and each system is different. When you said quad channel, I assumed 8 sticks. Also, make sure your board uses all 4 channels when populated with only 4 sticks; it may run in only 2, and it may need all 8 sticks to run quad, or you may need to move the 4 sticks to other slots. (Stupid question: do you have 8 slots? I have no idea if there are any quad-channel boards with only 4 slots; you usually have 2 slots per channel, unless the mobo is crap.)

Edit: crosspost

Last fiddled with by LaurV on 2021-01-19 at 05:19
2021-01-19, 05:32   #842
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

3²·19·43 Posts

Quote:
Originally Posted by LaurV
The benchmarks apply to P-1 too.

Yes and no. If the speed increase is due to all or a significant portion of the FFT data being cached rather than stored in RAM, then stage 2 of P-1 will not see the same speedup. Stage 1 should see the same benefit.
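
To put rough numbers on the caching point: the FFT works on double-precision values, so a length of N K means N × 1024 values at 8 bytes each, or about 16 MiB for a 2048K FFT. The Python sketch below compares that with a cache size. The 11 MB L3 figure for the i7-7820X is an assumption taken from the published specs, and the estimate deliberately ignores sin/cos tables and any other working memory.

```python
def fft_data_bytes(fft_len_k):
    """Bytes for the FFT data alone: fft_len_k * 1024 doubles, 8 bytes each."""
    return fft_len_k * 1024 * 8

# ASSUMPTION: 11 MB of shared L3 on the i7-7820X (per-core L2 not counted).
L3_BYTES = 11 * 1024 * 1024

for workers in (1, 2, 4, 8):
    # Each worker tests its own exponent, so each has its own FFT data.
    total = workers * fft_data_bytes(2048)
    print(f"{workers} worker(s): {total / 2**20:.0f} MiB of FFT data, "
          f"{total / L3_BYTES:.1f}x the assumed L3")
```

With one worker the data is on the order of the cache size, so a good share of it may stay on chip; with eight independent workers the combined data is many times larger and has to stream from RAM.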
2021-01-19, 14:51   #843
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

29×157 Posts

Quote:
Originally Posted by LaurV
You are not reading it wrong. Try it and see if it is indeed so much faster. If it is not, you have an argument with George. P-1 does some other things too, besides squarings, and each system is different. When you said quad channel, I assumed 8 sticks. Also, make sure your board uses all 4 channels when populated with only 4 sticks; it may run in only 2, and it may need all 8 sticks to run quad, or you may need to move the 4 sticks to other slots. (Stupid question: do you have 8 slots? I have no idea if there are any quad-channel boards with only 4 slots; you usually have 2 slots per channel, unless the mobo is crap.)

Edit: crosspost

MB: 2066 : ATX : DDR4 : X299 Extreme 4 : ASRock
See CPU-Z screen shots.
RAM is in slots 1, 3, 5, 7.

1 Worker x 8 Cores overnight.
Anywhere from 47 to 60 minutes per P-1.
Code:
Magic_8_Ball	43313173	NF-PM1	2021-01-19 13:55	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43310401	NF-PM1	2021-01-19 13:07	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311067	NF-PM1	2021-01-19 12:20	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311071	NF-PM1	2021-01-19 11:21	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311077	NF-PM1	2021-01-19 10:29	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311101	NF-PM1	2021-01-19 09:29	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311131	NF-PM1	2021-01-19 08:37	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311269	NF-PM1	2021-01-19 07:37	0.0	B1=1000000, B2=20000000	4.5044
Magic_8_Ball	43311287	NF-PM1	2021-01-19 06:45	0.0	B1=1000000, B2=20000000	4.5044
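
The 47-to-60-minute range can be read straight off the timestamps in the listing: each gap between consecutive completions is one P-1 run, assuming the single worker ran back to back with no idle time. A small Python sketch:

```python
from datetime import datetime

# Completion times copied from the results listing above (newest first).
STAMPS = [
    "2021-01-19 13:55", "2021-01-19 13:07", "2021-01-19 12:20",
    "2021-01-19 11:21", "2021-01-19 10:29", "2021-01-19 09:29",
    "2021-01-19 08:37", "2021-01-19 07:37", "2021-01-19 06:45",
]

times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in STAMPS]

# Minutes between each completion and the one before it.
gaps = [
    int((newer - older).total_seconds() // 60)
    for newer, older in zip(times, times[1:])
]

print("gaps (min):", gaps)
print(f"min {min(gaps)}, max {max(gaps)}, mean {sum(gaps) / len(gaps):.1f}")
```

The timestamps only have minute resolution, so each gap is good to about a minute either way.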
[Attachments: MB.png, CPU.png, RAM.png, SPD.png]
2021-01-21, 04:17   #844
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

29·157 Posts
It's getting smarter and faster

I'm now completing these same P-1s in 45 to 47 minutes.
2021-01-21, 06:19   #845
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

5²·7·53 Posts

Quote:
Originally Posted by petrw1
I'm now completing these same P-1s in 45 to 47 minutes.

Well done. So, no argument. What a pity...
Just to make it clear, my wheelbarrow is faster when I run 2 workers on 5 cores each, compared with a single worker on 10, especially for small FFTs (like the PRP-CF and CF-DC ranges). For larger FFTs, the difference is not significant, or is arguable. That's why I suggested you try both versions.
2021-01-28, 14:40   #846
MisterBitcoin
 
"Nuri, the dragon :P"
Jul 2016
Good old Germany

3²×89 Posts

Quick run without water cooling.


After about 10 minutes the temperature rose to around 95°C-100°C, so I have to stop running BOINC work until I get the water cooling running.
[Attachment: 2021-01-28 15_09_10-Start.png]
2021-01-28, 14:52   #847
Viliam Furik
 
"Viliam Furík"
Jul 2018
Martin, Slovakia

2×193 Posts

Quote:
Originally Posted by MisterBitcoin
Quick run without water cooling.

After about 10 minutes the temperature rose to around 95°C-100°C, so I have to stop running BOINC work until I get the water cooling running.

And what is the CPU? I guess it's an Intel 10900-something.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.