mersenneforum.org Prime95 30.8 (big P-1 changes, see post #551)

2022-09-25, 18:27   #727
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

41×199 Posts

Quote:
 Originally Posted by kriesel undoc.txt says this about memory in P-1 stage 2: (There's nothing there about the upper limit or modifying it.) 1. Is there a way to allow up to ~60 GiB on a 64 GiB system?
The 90% limit is for the GUI. Editing local.txt manually can work around the 90% limit.

Quote:
 2. Is there a way to ensure a worker's memory access & allocation remains entirely or mostly on the same side of the NUMA boundary as the worker's CPU cores on a multi-Xeon system?
Prime95 has no understanding of NUMA. Your only option is to run two instances of prime95. Use a Windows tool to force each instance to run on a different NUMA node. In local.txt set NumCPUs=8. Let us know if you find a method that works well.

2022-09-26, 06:12   #728
preda

"Mihai Preda"
Apr 2015

2²·19² Posts

Quote:
 Originally Posted by Prime95 Some debugging reveals prime95 is looking for a benchmark with all 16 cores used. Thus, run a throughput benchmark for 16 cores, 1 worker, all FFT implementations, 6M to 7M fft sizes. Let me know if that does the trick. Auto bench done every 21(?) hours until there are several data points. I'm looking into why it is running 13 core benchmarks when it only uses 16 core bench results (a bug). Benchmarks are not uploaded. They are not particularly useful to others given all the combinations of overclocking, memory speeds, etc.
I ran a benchmark with all 14 cores, and indeed it seems to pick up the FFT bench timings afterwards, although the run configuration is 1 worker/12 cores.

The benchmark asks for the number of cores to use, and before I was giving it a list of what I was actually using (i.e. 12 cores, 13 cores), not the all-cores count (14).

2022-09-26, 08:47   #729
kruoli

"Oliver"
Sep 2017
Porta Westfalica, DE

1,321 Posts

What does the heading in results.bench.txt say for you? E.g.:

Code:
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
AMD Ryzen 7 3800X 8-Core Processor
CPU speed: 4350.39 MHz, 8 hyperthreaded cores
Especially the last line.

2022-09-26, 15:19   #730
preda

"Mihai Preda"
Apr 2015

2644₈ Posts

Quote:
 Originally Posted by kruoli What does the heading in results.bench.txt say for you? E.g.:
Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz
CPU speed: 3805.32 MHz, 14 hyperthreaded cores
CPU features: Prefetchw, SSE, SSE2, SSE4, AVX, AVX2, FMA, AVX512F
L1 cache size: 14x32 KB, L2 cache size: 14x1 MB, L3 cache size: 19712 KB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes

2022-09-28, 05:16   #731
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×29×127 Posts

Dual-xeon dual instance experiment

Multiple instances' worker windows give repeated occurrences of the error message "error setting affinity: no error". See attachment.

I think George's recent comments about prime95 being NUMA-unaware mean: split up worktodo and files in progress into two folders, one per Xeon; copy the program code, then alter prime.txt to limit the number of cores per prime95 instance to the number of physical cores per Xeon; copy prime.txt and local.txt; and specify using one thread per physical core on a CPU package using affinity bitmasks. I chose the even numbered logical cores. In a batch file, or separately:

Code:
start /D (folder0) /NODE 0 /Affinity 0x5555 prime95.exe
start /D (folder1) /NODE 1 /Affinity 0x5555 prime95.exe
seemed to do the trick. (See cmd /k start /? for detailed help. AFAIK the start command is the only NUMA-aware control available at the Windows command line. PowerShell is a whole other kettle of fish I won't go into here.)

(Omitting /Affinity (bitmask) filled up all the hyperthreads on NUMA node 0 and doubled iteration times, leaving the other Xeon idle.) Without setting both /node and /affinity values, everything fell on NUMA node 0.

The two instances each have two workers with four cores each. Each instance has 48 GiB allowed for stage 2 P-1/P+1/ECM and 32 GiB as emergency memory, which leaves the 128 GiB of RAM potentially oversubscribed. Since the system is being transitioned from DC to P-1, emergency memory for saving proof residues will become moot and can be pared back.

Alternate hyperthreads on the same physical core are consecutive logical processors on Windows, so use either the odd or the even bit, but not both, for a given two-bit field in the affinity mask; 0x5555 = binary 0101 0101 0101 0101, corresponding to HT0 of each of 8 cores on a Xeon E5-2670 8-core x2 HT. So in HWMonitor, even numbered logical cores are fully occupied with prime95; odd ones are available for the OS etc. 0xAAAA would select the 8 odd numbered logical cores. Observed Windows 7 worker timings are consistent with that interpretation.
https://linustechtips.com/topic/5919...ined-for-real/

Task Manager displays cores as follows: NUMA node 0 is the top row, leftmost first: core 0 hyperthread 0, then core 0 hyperthread 1, core 1 hyperthread 0 ... core 7 hyperthread 1. The second row is NUMA node 1. On Xeon Phis, in Windows 10, CPU rows wrap according to window width, but upper left is core 0, HT 0 1 2 3, and bottom right ends with core N-1, HT 0 1 2 3.

A forced server update from each of the instances, then a check of my CPUs page, shows one occurrence of the nodename common to the two instances. (No duplication seen.)

Last fiddled with by kriesel on 2022-09-28 at 05:27
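The mask arithmetic in the post above can be sketched in a few lines of Python. This is only an illustration, assuming the Windows-style numbering described in the post (hyperthreads of one physical core are consecutive logical processors); the function name `affinity_mask` and its parameters are made up for this example and are not part of prime95 or Windows.

```python
def affinity_mask(physical_cores: int, threads_per_core: int = 2, thread: int = 0) -> int:
    """Return a bitmask selecting one hyperthread of each physical core.

    Assumption: logical processor number = core * threads_per_core + thread,
    i.e. hyperthreads of the same core are adjacent (as described for Windows).
    """
    if not 0 <= thread < threads_per_core:
        raise ValueError("thread must be in range(threads_per_core)")
    mask = 0
    for core in range(physical_cores):
        mask |= 1 << (core * threads_per_core + thread)
    return mask


if __name__ == "__main__":
    print(hex(affinity_mask(8)))            # even logical processors, 8 cores
    print(hex(affinity_mask(8, thread=1)))  # odd logical processors, 8 cores
    print(hex(affinity_mask(12)))           # even logical processors, 12 cores
```

For 8 cores this gives 0x5555 for the even hyperthreads and 0xAAAA for the odd ones, matching the values used with start /Affinity above; the resulting hex string would be passed per NUMA node.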
2022-09-28, 15:50   #732
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7366₁₀ Posts

SPR

Two outstanding issues as far as I know (haven't tested v30.8b17 yet)

1) Observed on Windows 7 Pro x64, dual Xeon E5-2670, prime95 V30.8b15, using start /Node 0 or 1, /affinity 0x5555, running two instances, intended as one each side of the QPI: when a worker window assigns cores, the following message is produced repeatedly, with a variety of hex values consisting of 3 or c at various offsets:
Error setting affinity to cpuset 0x000000c0: No error
(refer to attachment of https://mersenneforum.org/showpost.p...&postcount=731)

3 or c is 0011 or 1100. Windows' numbering of the two logical cores of x2 hyperthreaded physical core #0 is 0,1, while Linux's is 0,n, where n is the number of physical hyperthreaded cores present in the system. So it appears to me that prime95, a Windows application, may be using an inappropriate affinity mask for Windows. We don't usually want two prime95 compute threads running on the same physical core. Or, prime95 is setting a bit map for using either hyperthread of the core involved. (Which would be less constraining than what was already done in the start command's affinity mask.)
https://www.systutorials.com/docs/li...wloc_cpuset_t/

I speculate that locking prime95 activity to a specific hyperthread of a core may reduce activity in Windows' task scheduling on multiple cores. I've seen indications in Task Manager's CPU display of a fair amount of logical-core-hopping at times, as if Windows is trying to balance load between hyperthreads. That is probably a futile exercise for code as memory-bound as prime95's FFT crunching, with performance generally hurt, not helped, by multiple hyperthreads on the same core.

2) Observed repeatedly on Windows 10, i5-1035G1, prime95 V30.8b14: the indicated P-1 stage 1 total time for a P-1 interrupted by autobenchmarking is too low, reflecting only the time from the end of the interruption to the end of the stage, omitting all the stage time before the benchmarking interruption. See for example the log content at https://mersenneforum.org/showpost.p...4&postcount=40 which shows "[Sep 2 15:41] M100204259 stage 1 complete. 347584 transforms. Total time: 2487.884 sec." but the stage 1 ran from Sep 2 12:44 to Sep 2 15:41, 12:44 to 18:21 = 5:37 - 3 minutes for benchmarking ~20040. seconds, about 8.055 times as long as the total time indicated to millisecond precision.

Last fiddled with by kriesel on 2022-09-28 at 16:29
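To make the reasoning in issue 1 concrete, here is a small Python sketch that decodes a cpuset value such as 0x000000c0 under the two numbering schemes described in the post. It is an illustration only: the helper names are hypothetical, the "Linux-style" mapping is simply the 0,n scheme as stated above (actual numbering can differ between systems), and 16 physical cores is assumed for the dual 8-core machine.

```python
def set_bits(mask: int) -> list[int]:
    """Logical processor numbers whose bits are set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def windows_location(logical: int, threads_per_core: int = 2) -> tuple[int, int]:
    """(core, hyperthread), assuming hyperthreads of a core are consecutive."""
    return divmod(logical, threads_per_core)


def linux_location(logical: int, physical_cores: int) -> tuple[int, int]:
    """(core, hyperthread), assuming hyperthread k of core c is numbered
    c + k * physical_cores (the 0,n scheme described in the post)."""
    hyperthread, core = divmod(logical, physical_cores)
    return core, hyperthread


if __name__ == "__main__":
    cpuset = 0x000000C0
    for bit in set_bits(cpuset):
        print(bit, windows_location(bit), linux_location(bit, 16))
```

Under the consecutive (Windows) numbering, bits 6 and 7 are both hyperthreads of core 3, which fits the second reading in the post (a mask allowing either hyperthread of one core). Under the 0,n numbering the same bits would be hyperthread 0 of cores 6 and 7.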
2022-09-28, 16:22   #733
storm5510
Random Account

Aug 2009
Not U. + S.A.

3²·281 Posts

Quote:
 Originally Posted by kriesel Two outstanding issues as far as I know (haven't tested v30.8b17 yet) 1) Observed on Windows 7 Pro x64, dual Xeon E5-2670, prime95 V30.8b15, using start /Node 0 or 1, /affinity 0x5555, when a worker window assigns cores, the following message is produced repeatedly, with variety of hex values consisting of 3 or c at various offsets: Error setting affinity to cpuset 0x000000c0: No error (refer to attachment of https://mersenneforum.org/showpost.p...&postcount=731) 2) Observed repeatedly on Windows 10, i5-1035G1, prime95 V30.8b14, Indicated P-1 stage 1 total time for a P-1 interrupted by autobenchmarking is too low, reflecting only the time from the end of the interruption to the end of the stage, omitting all the stage time before the benchmarking interruption. See for example the log content at https://mersenneforum.org/showpost.p...4&postcount=40 which shows "[Sep 2 15:41] M100204259 stage 1 complete. 347584 transforms. Total time: 2487.884 sec." but the stage 1 ran from Sep 2 12:44 to Sep 2 15:41, 12:44 to 18:21 = 5:37 - 3 minutes for benchmarking ~20040. seconds, about 8.055 times as long as the total time indicated to millisecond precision.
It seems like you may be taking a more difficult road setting affinity. I use the below in local.txt:

Code:
[Worker #1]
Affinity=(0,4),(2,6)
The bold section in your quote above makes no sense. 20,040 seconds is 5.57 hours...

2022-09-28, 16:36   #734
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1110011000110₂ Posts

Quote:
 Originally Posted by storm5510 It seems like you may be taking a more difficult road setting affinity. I use the below in local.txt: Code: [Worker #1] Affinity=(0,4),(2,6) The bold section in your quote above makes no sense. 20,040 seconds is 5.57 hours...
Are you running a dual-Xeon system with QPI bottleneck between halves of total installed ram? Two separate instances, one per Xeon as directed by George? I'm setting affinity to 8 hyperthreads of 32, in each of two prime95 instances. I think what I'm doing is simpler than setting four different affinity lists for four different workers in two different folders. And more likely to get the memory locality right. It's also easily extensible to a dual-12-core&HT system later; masks become 0x555555. Done.

5:37: 5 hours 37 minutes from start to finish of stage 1, minus 3 minutes benchmarking interruption:
5 * 3600 +37 * 60 -3 * 60 = 20040. seconds. Perhaps a case of vigorous agreement?
But the program reported only 2487. seconds & change, less than 1/8 the actual stage 1 compute time, is the point I was making.
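The arithmetic above can be checked with a short Python sketch. The figures are the ones quoted in this thread (5 h 37 min elapsed, 3 min of benchmarking, 2487.884 s reported); the variable and function names are made up for this illustration.

```python
def stage1_seconds(hours: int, minutes: int, benchmark_minutes: int) -> int:
    """Elapsed stage 1 compute time in seconds, excluding the benchmark pause."""
    return hours * 3600 + minutes * 60 - benchmark_minutes * 60


REPORTED_SECONDS = 2487.884  # "Total time" printed by prime95 in the quoted log line

actual = stage1_seconds(5, 37, 3)
ratio = actual / REPORTED_SECONDS

if __name__ == "__main__":
    print(actual)                  # seconds of actual stage 1 compute
    print(round(actual / 3600, 2)) # the same figure in hours
    print(round(ratio, 3))         # actual time relative to reported time
```

Both ways of stating the elapsed time agree: 20040 seconds is about 5.57 hours, and it is roughly 8 times the reported total.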

Last fiddled with by kriesel on 2022-09-28 at 17:24

2022-09-28, 17:40   #735
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·29·127 Posts

Same hardware: the "Error setting affinity ... No error" message persists in prime95 v30.8b17. There are no directives involving affinity in local.txt or prime.txt, so it is prime95's default response in the context of the start commands used. IIRC George runs Linux, not Windows, and may have no dual-CPU-package systems to test mprime / prime95 on.

Last fiddled with by kriesel on 2022-09-28 at 18:17

2022-09-29, 23:16   #736
storm5510
Random Account

Aug 2009
Not U. + S.A.

3²×281 Posts

Quote:
 Originally Posted by kriesel Are you running a dual-Xeon system with QPI bottleneck between halves of total installed ram? Two separate instances, one per Xeon as directed by George? But the program reported only 2487. seconds & change, less than 1/8 the actual stage 1 compute time, is the point I was making.
No, and I probably would not try. Multiple workers in a single instance, maybe. If the OS can see both CPU's then I would think any running process could as well.

2022-09-30, 03:06   #737
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7366₁₀ Posts

Quote:
 Originally Posted by storm5510 If the OS can see both CPU's then I would think any running process could as well.
Yes, and there's empirical evidence that using almost all the dual-CPU-system's ram in prime95 is suboptimal compared to using almost all the near ram on each CPU, in P-1 stage 2, and leaving the rest to be used for the other CPU's processes. Mprime / prime95 needs a little user assistance apparently to ensure it's the near ram in the multi-CPU (not merely multi-core) case.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
