mersenneforum.org  

Old 2022-06-28, 05:49   #12
Uncwilly
6809 > 6502
 
Aug 2003
101×103 Posts

Quote:
Originally Posted by timbit View Post
Also I still cannot get more than 8 threads on a worker.
You want more workers, not more threads per worker. Basically, more workers, each doing its own assignment.
Uncwilly is online now   Reply With Quote
Old 2022-06-28, 16:18   #13
timbit
 
Mar 2009

10100₂ Posts

Hi, thanks for all the replies.

My slow system has 32 GB of DDR4-2400 ECC RDIMM (4 × 8 GB) running in quad channel (Intel Xeon E5-2680 v4, 14 cores, 28 threads). In local.txt I have:

Memory=28672 during 7:30-23:30 else 28672

So essentially 28 GB of RAM is available to mprime.

I've deleted results.bench.txt and gwnum.txt. I then invoked mprime -m and ran the throughput benchmark with 1 worker, 4 cores, 48K FFT size. Yes, I've seen the replies saying "it doesn't multithread well with small FFTs" and "use many workers, 1 core each", and I will lean towards that from now on. But for this benchmark I am using 1 worker, 4 cores.

I've attached snippets of results.bench.txt, and gwnum.txt.

Now according to the results.bench.txt, the fastest is:

FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (4 cores, 1 worker): 0.21 ms. Throughput: 4848.21 iter/sec.

Pass1=768, Pass2=64, clm=1. Right?
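
As a quick sanity check, the benchmark lines can be ranked by throughput with a few lines of Python. This is only a sketch: it assumes the line format shown above (mprime 30.7) and a results.bench.txt in the working directory, so adjust the path and pattern as needed.
Code:
# Sketch: rank the FFT implementations in results.bench.txt by throughput.
# Assumes lines like:
#   FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (4 cores, 1 worker): 0.21 ms. Throughput: 4848.21 iter/sec.
import re

LINE = re.compile(
    r"FFTlen=(?P<fft>\S+), Type=\d+, Arch=\d+, Pass1=\d+, Pass2=\d+, clm=\d+ "
    r"\([^)]*\):.*Throughput: (?P<thr>[\d.]+) iter/sec"
)

rows = []
with open("results.bench.txt") as f:
    for line in f:
        m = LINE.search(line)
        if m:
            rows.append((float(m.group("thr")), m.group(0)))

for thr, text in sorted(rows, reverse=True):   # fastest first
    print(f"{thr:10.2f}  {text}")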

However, when I invoke mprime -d on my ECM assignment for exponent 999181, I see:

[Work thread Jun 27 22:30] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2, 4 threads

This isn't the fastest FFT selection given the results from the self-benchmark. Is this normal? Also, I still don't see any auto-benchmark happening in my worker output; can anyone explain what triggers the autobench? I see other posts complaining about it and asking how to prevent it; I want to do the opposite and invoke it!

The reason I am asking about the autobench is that one of my other machines was also running slowly for a day or two on a new exponent; then the autobench kicked in, and it was much, much faster after that. Evidently a different FFT implementation was chosen, which made throughput higher.

In the attached log_snippet.txt, ECM curve 50 phase 1 is taking almost 3000 seconds. On another machine with DDR4-2133, phase 1 takes < 2000 seconds. Unfortunately that machine has been shut down for the summer, so I cannot access it again until the fall.


Also, when I run htop there are no other processes taking up CPU cycles (mprime is only using 4 of the 14 cores anyway).
Attached Files
File Type: txt gwnum.txt (681 Bytes, 39 views)
File Type: txt results.bench.txt (10.2 KB, 35 views)
File Type: txt log_snippet.txt (2.4 KB, 42 views)

Last fiddled with by timbit on 2022-06-28 at 16:22 Reason: formatting
timbit is offline   Reply With Quote
Old 2022-06-28, 19:36   #14
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3×7×317 Posts

The memory situation looks pretty good to me. The log file shows up to 8 GB used in stage 2, so with 28 GB allowed (28/8 ≈ 3.5) three workers at a time could get all they want, and a fourth can get by on somewhat less. Additional workers could be running stage 1 at the same time.

It's not normally necessary to delete results.bench.txt and gwnum.txt.

Mprime defaults to 4 cores per worker, because the expectation is that most users will be running DC (~61M exponents) or higher, and 4 cores/worker is reasonably close to maximum total system throughput for that exponent and FFT size and somewhat above.
Deviating considerably from typical usage, as you do (very small exponents) or I do (very large exponents), means the usual near-optimal settings no longer apply.
And systems/processors do vary in what is optimal for their specific design.

Please bite the bullet and benchmark with multiple workers. It's the only way you will get close to the full capability of your system on such small exponents.
I suggest benchmarking 1 and 2 cores/worker, on 14 and 7 workers respectively. Latency of an individual assignment will go up, but throughput (assignments completed per day) should also go up.
After that you could try whichever cores/worker count seems faster for your chosen work, and vary the number of workers downward from all-cores-occupied. It's possible that cache efficiency is higher at fewer than the maximum possible number of workers, enough to give higher throughput with, say, 12 rather than 14 cores working.
You could also consider which configuration maximizes throughput per unit of system power consumption.

Last fiddled with by kriesel on 2022-06-28 at 19:39
kriesel is offline   Reply With Quote
Old 2022-06-29, 16:35   #15
timbit
 
Mar 2009

10100₂ Posts

Ok, I managed to see the autobench last night (1 worker, 4 cores). The fastest FFT implementation did not get selected.


I then decided to go back to basics: 1 worker, 1 core, that's it. I deleted the existing gwnum.txt and results.bench.txt.


I ran ./mprime -m and chose item 17, benchmark: 48K FFT, 1 worker, 1 core.


Attached are the results.bench.txt and gwnum.txt. When I started ECM on the 999xxx exponent with B1=1000000, I could see that a non-optimal FFT was chosen.


How can I get mprime to choose the optimal FFT? Is there anything in prime.txt or local.txt that can manually select an FFT implementation? I've run 1 worker, 1 core, so no excuses now. Also, nothing else is running on my Ubuntu 22.04 x64 system (Intel Xeon E5-2680 v4).


From results.bench.txt (the fastest is the last line, Pass1=768, Pass2=64, clm=1):


Prime95 64-bit version 30.7, RdtscTiming=1
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=4 (1 core, 1 worker): 0.47 ms. Throughput: 2138.80 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=2 (1 core, 1 worker): 0.45 ms. Throughput: 2238.99 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=256, Pass2=192, clm=1 (1 core, 1 worker): 0.26 ms. Throughput: 3908.64 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=4 (1 core, 1 worker): 0.21 ms. Throughput: 4709.43 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=2 (1 core, 1 worker): 0.26 ms. Throughput: 3907.09 iter/sec.
FFTlen=48K, Type=3, Arch=4, Pass1=768, Pass2=64, clm=1 (1 core, 1 worker): 0.21 ms. Throughput: 4874.29 iter/sec.


When I start ./mprime -d, I see:


[Main thread Jun 29 09:26] Mersenne number primality test program version 30.7
[Main thread Jun 29 09:26] Optimizing for CPU architecture: Core i3/i5/i7, L2 cache size: 14x256 KB, L3 cache size: 35 MB
[Main thread Jun 29 09:26] Starting worker.
[Work thread Jun 29 09:26] Worker starting
[Work thread Jun 29 09:26] Setting affinity to run worker on CPU core #2
[Work thread Jun 29 09:26]
[Work thread Jun 29 09:26] Using FMA3 FFT length 48K, Pass1=768, Pass2=64, clm=2
[Work thread Jun 29 09:26] 0.052 bits-per-word below FFT limit (more than 0.509 allows extra optimizations)
[Work thread Jun 29 09:26] ECM on M999217: curve #1 with s=7014263894342847, B1=1000000, B2=TBD



Non-optimal FFT chosen. It's truly bizarre.


Any thoughts on what the root cause may be?
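
For reference, here is a small sketch of how the runtime choice could be compared against the benchmark winner mechanically. It assumes the "Using ... FFT length ..." log line and the results.bench.txt format quoted above, and that the worker output was captured to a file (log_snippet.txt here); adjust names as needed.
Code:
# Sketch: does the FFT implementation chosen at run time match the fastest
# 48K entry in results.bench.txt?  Line formats assumed as quoted in this thread.
import re

BENCH  = re.compile(r"FFTlen=48K, .*Pass1=(\d+), Pass2=(\d+), clm=(\d+)"
                    r".*Throughput: ([\d.]+) iter/sec")
CHOSEN = re.compile(r"Using \S+ FFT length 48K, Pass1=(\d+), Pass2=(\d+), clm=(\d+)")

best = None                                    # ((Pass1, Pass2, clm), iter/sec)
with open("results.bench.txt") as f:
    for line in f:
        m = BENCH.search(line)
        if m and (best is None or float(m.group(4)) > best[1]):
            best = (m.group(1, 2, 3), float(m.group(4)))

chosen = None
with open("log_snippet.txt") as f:             # wherever the worker output was saved
    for line in f:
        m = CHOSEN.search(line)
        if m:
            chosen = m.group(1, 2, 3)

print("benchmark winner :", best)
print("chosen at runtime:", chosen)
print("match" if best and chosen == best[0] else "MISMATCH")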
Attached Files
File Type: txt results.bench.txt (5.1 KB, 8 views)
File Type: txt gwnum.txt (375 Bytes, 7 views)
timbit is offline   Reply With Quote
Old 2022-06-29, 19:00   #16
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3×7×317 Posts

Quote:
Originally Posted by timbit View Post
I deleted existing gwnum.txt and results.bench.txt.

Non-optimal FFT chosen. It's truly bizarre.

Any thoughts on what a root cause may be?
Please stop routinely giving mprime "benchmark amnesia". Files results.bench.txt and gwnum.txt are for ACCUMULATING performance data. Over time. I have systems that have run prime95 for years for which these files have NEVER been cleared. Mprime/prime95 will repeatedly auto benchmark to handle the case of relative performance being affected by fluctuations in other loadings, whether caused by the user or system activity, until it builds up a sufficient count of redundant benchmark values. You're repeatedly intentionally erasing the history instead.

Perhaps there is a bug in v30.7 regarding benchmark handling. Have you considered trying v30.8b15?

For comparison, here are 48K FFT benchmark results done two different ways on a 4-core/8-HT i5-1035G1 (2 of 2 DDR4-2666 SODIMM slots populated, 8 GiB each), Win10 with prime95 v30.7b9, and the relevant portion of results.bench.txt produced from them both.
There is a considerable effect of what I think is a memory bandwidth constraint visible, given that 4 cores/4 workers produce only two to three times the total iteration throughput of a single core/worker. Note that this benchmark was run while a multi-tab web browser session and multiple remote desktop clients were also running. Prime95 gets ~59% of the CPU while these and other tasks (Windows Explorer, AV, system services, etc.) keep hyperthreading busy on up to 90% of the 8 logical cores. That's my normal way of operating this particular system, so that's what I benchmark for.
Quote:
Originally Posted by kriesel View Post
Please bite the bullet and benchmark with multiple workers.

edit: for a ~77M PRP as DC (4M FFT, all 4 cores in use), checking the above system's results.bench.txt against the running FFT, I find it is using the fastest FFT benchmarked, which happens to be 4M, Pass1=1K, Pass2=4K, clm=1, 4 threads.
Its results.bench.txt is ~0.5 MB, gwnum.txt ~0.18 MB. There have been rare cases where one grew too large and caused problems, but these sizes seem OK in this version.
Attached Thumbnails: martin48kbenchmark.png, martin48kbenchmark2.png
Attached Files
File Type: txt martin48kbenchmark.txt (10.2 KB, 6 views)

Last fiddled with by kriesel on 2022-06-29 at 19:55
kriesel is offline   Reply With Quote
Old 2022-06-29, 19:18   #17
timbit
 
Mar 2009

2²·5 Posts

OK, thanks for the reply. I will move the files out of the "trash bin". I was unaware the program uses all the results from the past; I had assumed it only takes the latest one (and if there's only one benchmark available, what else is it supposed to use?).

Ahhh... there's a 30.8 build 15. Okey dokey.

Let me give that a shot. I'll let it run for a week or two, double-check it for the autobench, and hopefully it's smart enough to run faster. Or perhaps a bug was indeed fixed.
timbit is offline   Reply With Quote
Old 2022-06-29, 19:30   #18
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3·7·317 Posts

Found in prime95's undoc.txt (emphasis mine):
Code:
Most FFT sizes have several implementations.  The program uses throughput benchmark
data to select the fastest FFT implementation.  The program assumes all CPU cores
will be used and all workers will be running FFTs.  This can be overridden
in gwnum.txt:
    BenchCores=x
    BenchWorkers=y
If the program is selecting what would be fastest with all cores busy, as its documentation states, and your expectation is it would select what would be fastest with a few of the 14 cores busy, that may account for some discrepancies between expectation and observed operation.
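For illustration only (an assumption based on the undoc.txt excerpt above, not something I've verified): if the machine will really be running one 4-core worker rather than keeping all 14 cores busy, a gwnum.txt override along these lines should tell the selection logic to weight the benchmark data for that configuration instead. Adjust the values to whatever you actually run.
Code:
BenchCores=4
BenchWorkers=1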
I'd be interested to see what George's take is on the clm=2 vs. 1 performance and selection.
kriesel is offline   Reply With Quote
Old 2022-06-29, 19:57   #19
timbit
 
Mar 2009

24₈ Posts

Quote:
Originally Posted by kriesel View Post
Found in prime95's undoc.txt (emphasis mine):
Code:
Most FFT sizes have several implementations.  The program uses throughput benchmark
data to select the fastest FFT implementation.  The program assumes all CPU cores
will be used and all workers will be running FFTs.  This can be overridden
in gwnum.txt:
    BenchCores=x
    BenchWorkers=y
If the program is selecting what would be fastest with all cores busy, as its documentation states, and your expectation is it would select what would be fastest with a few of the 14 cores busy, that may account for some discrepancies between expectation and observed operation.
I'd be interested to see what George's take is on the clm=2 vs. 1 performance and selection.
Hi, yeah, I saw that too. I didn't think much of it because the autobench would know your current configuration (and in the manual throughput test, the user inputs the number of cores and workers).
I'll stop mprime, save the entire directory, and give the new version a try in a fresh directory (it's a beta, and I'll keep that in mind).
timbit is offline   Reply With Quote
Old 2022-06-29, 20:47   #20
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2³·3·331 Posts

Quote:
Originally Posted by timbit View Post
Any thoughts on what a root cause may be?
A bug is certainly possible. Version 30.8 will not behave any differently.

For the throughput benchmark data to be used in FFT selection, the throughput benchmark inputs (#workers, #cores) must match the current mprime #workers/#cores configuration. As you've noted, the auto-bench should make this happen.

There could be an issue/bug in that you aren't using all cores. I did the majority of my testing assuming all cores would be used. Try running a throughput benchmark on 14 workers/14 cores, then set up mprime to run 14 workers (obviously 1 core per worker). Is the fastest implementation selected?

Last fiddled with by Prime95 on 2022-06-29 at 20:49
Prime95 is offline   Reply With Quote
Old 2022-06-29, 20:51   #21
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2³×3×331 Posts

Keep an eye out for version 30.9 (won't address your problem, but will run ECM better)
Prime95 is offline   Reply With Quote
Old 2022-06-29, 20:53   #22
timbit
 
Mar 2009

10100₂ Posts

Quote:
Originally Posted by Prime95 View Post
There could be an issue/bug in that you aren't using all cores. I did the majority of my testing assuming all cores would be used. Try running a throughput benchmark on 14 workers/14 cores, then set up mprime to run 14 workers (obviously 1 core per worker). Is the fastest implementation selected?
That's definitely possible. I am only using 4 cores per worker at the most.

Let me run the 30.7 version throughput test, 14 workers, 1 core per worker. I can have some results within 24 hours.

I'll also try with 30.8 b15.

Last fiddled with by timbit on 2022-06-29 at 20:58 Reason: Version 30.8
timbit is offline   Reply With Quote