2021-07-25, 02:58   #221
paulunderwood

Quote:
Originally Posted by paulunderwood
Debian 10 (Buster) installed. The chip is running at 1.4GHz even though it says powersave. HyperThreading is off. 32 workers, with an ETA of 56 days for 113M-bit candidates. With some runtime tuning I expect this to come down.

Edit: After some runtime tuning by mprime, at 2% done, the ETA is now 55 days.
After another round of auto-benchmarking by mprime, the ETA is now 47 days (at 4% done).

This is about 66% of the speed of a Radeon VII. Not bad for a cheap low-end chip.
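
For concreteness, the implied throughput works out as follows; the Radeon VII figure here is just back-calculated from the 66% ratio above, not an independent benchmark:
Code:
# Rough throughput implied by the figures quoted above; the Radeon VII number
# is simply back-calculated from the "about 66%" comparison, not measured.
workers, eta_days = 32, 47
knl_tests_per_day = workers / eta_days                 # ~0.68 PRP tests/day at 113M bits
radeon_vii_tests_per_day = knl_tests_per_day / 0.66    # ~1.0 tests/day implied
print(f"KNL 7210: {knl_tests_per_day:.2f} tests/day, "
      f"implied Radeon VII: {radeon_vii_tests_per_day:.2f} tests/day")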

Last fiddled with by paulunderwood on 2021-07-25 at 03:08

2021-07-25, 16:06   #222
kriesel

Quote:
Originally Posted by paulunderwood
After another round of auto-benchmarking by mprime, the ETA is now 47 days (at 4% done).

This is about 66% of the speed of a Radeon VII. Not bad for a cheap low-end chip.
The Radeon VII listed at $699 new. Xeon Phi systems were considerably more expensive, ~$5k new as I recall.
The 7210 CPU alone was ~$2036. https://ark.intel.com/content/www/us...z-64-core.html
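
Putting the quoted prices and the ~66% speed figure together, a back-of-envelope sketch (taking both numbers at face value):
Code:
# Back-of-envelope price/performance comparison using the numbers quoted above.
radeon_price, knl_system_price = 699, 5000        # USD, new, as quoted
knl_relative_speed = 0.66                         # ~66% of a Radeon VII
knl_price_per_radeon_equiv = knl_system_price / knl_relative_speed
print(f"Radeon VII: ${radeon_price} per Radeon-VII-equivalent of throughput")
print(f"KNL system: ${knl_price_per_radeon_equiv:.0f} per Radeon-VII-equivalent "
      f"(~{knl_price_per_radeon_equiv / radeon_price:.0f}x the GPU, at new prices)")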

2021-11-02, 21:12   #223
ewmayer

I have some very interesting p-1 stage 2 timings to report, related to the issue of running big-memory-footprint tasks in a mix of the 16GB of fast onboard MCDRAM and whatever regular DIMM-based RAM the owner has installed.

I just posted over in Software re. the release of Mlucas v20.1.1. One of the bugfixes in that patch release is to permit p-1 stage 2 runs for moduli which need the currently-largest FFT length supported by the program, 512M. (I tested said fix on M(p) with p = 8589934693, which has 2 prime factors q = 2·k·p+1 < 2^68, with k1 = 1866092580 = 2^2·3^4·5·11·23·29·157 and k2 = 2291914028 = 2^2·23·727·34267. So B1 = 1000 and B2 = 35000 should catch them both.)
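
As a sanity check on those numbers, a quick sketch (taking the quoted factors and k-factorizations at face value; illustrative Python, not anything from the Mlucas test suite):
Code:
# Sanity-check the two quoted factors of M(8589934693), and why B1 = 1000
# and B2 = 35000 suffice for p-1 to find them. Illustrative only.
p = 8589934693
for k in (1866092580, 2291914028):
    q = 2 * k * p + 1
    assert pow(2, p, q) == 1                     # q divides 2^p - 1
    # trial-factor k
    fac, n, d = [], k, 2
    while d * d <= n:
        while n % d == 0:
            fac.append(d)
            n //= d
        d += 1
    if n > 1:
        fac.append(n)
    print(k, "=", "*".join(map(str, fac)))
    # stage 1 finds q if (roughly) every prime factor of k is <= B1;
    # stage 2 additionally allows a single prime factor in (B1, B2]
    print("  caught by stage 1 (B1=1000):          ", max(fac) <= 1000)
    print("  caught by stage 2 (B1=1000, B2=35000):", sorted(fac)[-2] <= 1000 and max(fac) <= 35000)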

Once I had tested the 512M fix on the above, I started a run of a first p-1 stage 2 interval (from B1 = 10^7 to B2 = 10^8) on F33, using the stage 1 residue I spent August to mid-October crunching, a computation which used only the MCDRAM. I will announce the distributed stage 2 initiative and upload the Stage 1 residue file needed for that soon. Now to the timings: Stage 1 was pinned to the MCDRAM using "numactl -m 1". In low-memory build mode (which disallows both PRP+Gerbicz-check and p-1 stage 2 work), the resulting memory footprint fits easily into 16GB, with the actual main-residue doubles array occupying 4GB. Running on 64 of the 68 physical cores, 2 threads per core, gives a stage 1 timing of around 470 ms/iter; stage 1 to B1 = 10^7 needed ~14.5 million of those.
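
Those figures hang together; a quick back-of-envelope check (assuming one FFT squaring per bit of the stage 1 powering exponent):
Code:
# Back-of-envelope check of the stage 1 figures quoted above.
import math

fft_len = 512 * 2**20                    # 512M doubles
print(f"main residue array: {fft_len * 8 / 2**30:.0f} GiB")         # ~4 GiB

B1 = 10**7
iters = B1 / math.log(2)                 # ~log2 of the stage 1 powering exponent
print(f"stage 1 squarings : ~{iters / 1e6:.1f} million")            # ~14.4 million

print(f"stage 1 wallclock : ~{iters * 0.470 / 86400:.0f} days at 470 ms/iter")  # ~79 days, i.e. ~2.5 months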

Running exclusively in the MCDRAM is not an option for stage 2: the minimum stage 2 prime-pairing-related buffer count supported by the code is 24, which along with 4 auxiliary residue arrays and other stuff means a memory footprint of ~120GB. So, using the new bugfixed v20.1.1 code, I switched off the above numactl run-prefix. Note that the key stage 2 FFT-mul operation is of the form A *= (B - C[i]), where A is the stage 2 accumulator, B is recomputed just once for each bigstep-sized loop (a frequency of perhaps once for every 10 of the above SUBMULs, or less for larger bigstep values), and C[i] is one of our precomputed stage 2 prime-pairing buffers. When, on a typical PC, both stages are running out of the same pool of DIMM RAM, each stage 2 iteration runs 10-15% slower than a stage 1 one, since the former is doing a 3-input SUBMUL whereas the latter is just doing LL/PRP-style FFT autosquarings. So if the KNL had enough MCDRAM to run stage 2 in it, I would expect a timing of around 540 ms/iter.
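
A rough sketch of the footprint and timing arithmetic behind those statements; the aux-array count and overhead term are read off the description here, not constants from the Mlucas source, and the 40- and 48-buffer cases come up just below:
Code:
# Rough stage 2 footprint/timing estimates for a 512M-double FFT (4 GiB per
# residue-sized array). Overhead is a guess; compare with the ~120GB, ~180GB
# and ~206GB figures quoted in this post.
def stage2_footprint_gib(nbuf, residue_gib=4, aux_arrays=4, overhead_gib=4):
    return (nbuf + aux_arrays) * residue_gib + overhead_gib

for nbuf in (24, 40, 48):
    print(f"{nbuf} buffers -> ~{stage2_footprint_gib(nbuf)} GiB")

# expected all-MCDRAM stage 2 timing: 10-15% over the 470 ms/iter stage 1 rate
print(f"expected stage 2: ~{470 * 1.15:.0f} ms/iter")   # ~540 ms/iter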

I figured running out of DIMM RAM might be around 2x slower, but hoped that the OS would still be smart enough to keep the A and B arrays (4GB each) in the MCDRAM, treating it as a huge L3 cache, and only stream the various C[i] in from the RAM. I specified 40 stage 2 buffers (total footprint ~180GB), the largest supported value which fits into the 208GB = (16GB MCDRAM + 192GB DIMM RAM) of the machine; the next-larger value of 48 buffers needs ~206GB, but in practice system tasks and such reduce the available RAM to ~200GB. My initial timings did not appear to reflect good use of the MCDRAM; they are around 2.5x slower than the above MCDRAM-only estimate:
Code:
Using complex FFT radices       256        32        32        32        32
Using Bigstep 330, pairing-window multiplicity M = 1: Init M*40 = 40 [base^(A^(b^2)) % n] buffers for Stage 2...
Buffer-init done; clocks = 00:08:47.608, MaxErr = 0.066406250.
Small-prime[7] relocation: will start Stage 2 at bound 14285714
Stage 2 q0 = 14285700, k0 = 43290
Computing Stage 2 loop-multipliers...
Stage 2 loop-multipliers: clocks = 00:03:12.325, MaxErr = 0.070312500.
[2021-10-31 01:14:45] F33 S2 at q = 14454660 [ 0.20% complete] clocks = 03:43:54.623 [1342.1202 msec/iter] Res64: 375C751C2193C084. AvgMaxErr = 0.060365977. MaxErr = 0.078125000.
[2021-10-31 04:58:40] F33 S2 at q = 14624610 [ 0.39% complete] clocks = 03:43:54.280 [1341.1481 msec/iter] Res64: 8FA26CABD8066355. AvgMaxErr = 0.060372666. MaxErr = 0.078125000.
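One way to check where a run's pages actually landed, rather than inferring it from timings, would be to parse /proc/<pid>/numa_maps; a hypothetical helper along those lines (on this box, node 1 is the MCDRAM):
Code:
# Hypothetical helper (not from this post): tally resident pages per NUMA node
# for a process by parsing /proc/<pid>/numa_maps.
import re, sys
from collections import Counter

def pages_per_node(pid="self"):
    counts = Counter()
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
                counts[int(node)] += int(pages)
    return counts

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    for node, pages in sorted(pages_per_node(pid).items()):
        print(f"node {node}: {pages * 4096 / 2**30:.2f} GiB")  # assumes 4 KiB pages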
Now, since stage 2 seemed to be using only or mostly the DIMM RAM, I figured I might as well restart my long-running, on-again/off-again DC run of F30 at 64M FFT, using the same 64 physical cores and pinning it to the MCDRAM, to see what happens. The above F33 stage 2 timings barely budged, perhaps 1% slower than before; the F30 timing (a Pépin test, i.e. a base-3 PRP-style iteration sequence) was ~10x slower than the 60 ms/iter it gets when it has exclusive use of those physical cores and the MCDRAM. I've converted the F30 timings to sec/iter here to make it easier to keep them and the F33 timings straight:
Code:
[Oct 31 14:53:29] F30 Iter# = 1014050000 [94.44% complete] clocks = 01:52:37.270 [  0.6757 sec/iter] Res64: 61D461B8BE5AA6EC. AvgMaxErr = 0.020537182. MaxErr = 0.025390625.
[Oct 31 16:28:20] F30 Iter# = 1014060000 [94.44% complete] clocks = 01:34:11.211 [  0.5651 sec/iter] Res64: 0FAB072D0630B7AD. AvgMaxErr = 0.020541896. MaxErr = 0.027343750.
[Oct 31 18:21:34] F30 Iter# = 1014070000 [94.44% complete] clocks = 01:52:35.246 [  0.6755 sec/iter] Res64: 8E2F615B63261E5E. AvgMaxErr = 0.020547897. MaxErr = 0.027343750.
[Oct 31 20:00:07] F30 Iter# = 1014080000 [94.44% complete] clocks = 01:38:08.717 [  0.5889 sec/iter] Res64: 9FE2EFD72D198020. AvgMaxErr = 0.020539502. MaxErr = 0.025390625.
So time to peruse the numactl manpage for possible flags related to this - that turned up:
Code:
--preferred=node
	Preferably allocate memory on node, but if memory cannot be allocated there fall back to other nodes. This option
	takes only a single node number. Relative notation may be used.
So I replaced the stage 1 "numactl -m 1" with "--preferred=1" ... the result was rather dramatic:
Code:
[2021-11-01 19:42:10] F33 S2 at q = 16323450 [ 2.38% complete] clocks = 01:53:30.824 [680.5380 msec/iter] Res64: AC7ADAF3890D3E2B. AvgMaxErr = 0.060386060. MaxErr = 0.078125000.
[2021-11-01 21:35:46] F33 S2 at q = 16494060 [ 2.58% complete] clocks = 01:53:36.272 [680.6743 msec/iter] Res64: A8F72B08FE0ED8ED. AvgMaxErr = 0.060396971. MaxErr = 0.078125000.
[2021-11-01 23:29:21] F33 S2 at q = 16665330 [ 2.78% complete] clocks = 01:53:34.381 [681.0976 msec/iter] Res64: EF5013CB5B044A5F. AvgMaxErr = 0.060309100. MaxErr = 0.078125000.
[2021-11-02 01:22:51] F33 S2 at q = 16835610 [ 2.97% complete] clocks = 01:53:30.290 [680.2808 msec/iter] Res64: AAC4601E2A662B3D. AvgMaxErr = 0.060352525. MaxErr = 0.078125000.
[2021-11-02 03:16:19] F33 S2 at q = 17005230 [ 3.17% complete] clocks = 01:53:28.145 [679.6591 msec/iter] Res64: BDCBEF62E1DCBA41. AvgMaxErr = 0.060377368. MaxErr = 0.078125000.
That cut the per-iter time in half. That was great, but I figured the F33 run now properly using the MCDRAM as an L3 cache would hammer the timings of the F30 run. Another surprise:
Code:
[Nov 02 09:23:01] F30 Iter# = 1014450000 [94.48% complete] clocks = 00:47:54.481 [  0.2874 sec/iter] Res64: DB4B350821A5186C. AvgMaxErr = 0.020541790. MaxErr = 0.027343750.
[Nov 02 10:19:12] F30 Iter# = 1014460000 [94.48% complete] clocks = 00:55:39.301 [  0.3339 sec/iter] Res64: C5BA6F080DB6978A. AvgMaxErr = 0.020556644. MaxErr = 0.027343750.
[Nov 02 11:07:43] F30 Iter# = 1014470000 [94.48% complete] clocks = 00:47:58.540 [  0.2879 sec/iter] Res64: 79736A5FEF2A69B3. AvgMaxErr = 0.020545662. MaxErr = 0.027343750.
[Nov 02 12:03:56] F30 Iter# = 1014480000 [94.48% complete] clocks = 00:55:42.774 [  0.3343 sec/iter] Res64: 6C83CE991294FBD3. AvgMaxErr = 0.020542251. MaxErr = 0.027343750.
[Nov 02 12:52:24] F30 Iter# = 1014490000 [94.48% complete] clocks = 00:47:56.019 [  0.2876 sec/iter] Res64: D6C2F91C87683142. AvgMaxErr = 0.020544057. MaxErr = 0.027343750.
So the timings for that secondary run also dropped by around 2x.
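
Reading approximate per-iteration times off the log excerpts above, both speedups come out right around 2x:
Code:
# Speedups read off the log excerpts above (approximate midpoints of the ranges).
f33_before, f33_after = 1342.0, 680.0     # F33 stage 2, msec/iter
f30_before, f30_after = 0.62, 0.31        # F30 Pepin test, sec/iter
print(f"F33 stage 2 speedup: {f33_before / f33_after:.1f}x")
print(f"F30 co-run speedup : {f30_before / f30_after:.1f}x")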

2021-11-27, 18:52   #224
ewmayer

Mike/Xyzzy forwarded the following link to a used-gear-for-sale Reddit page and asked if the Xeon Phi gear there was a good deal, presumably w.r.t. GIMPS crunching:

https://www.reddit.com/r/hardwareswa...us_256_thread/

I went to the Intel page for the HNS7200AP compute module; it looks like that is a host for the 7200-series CPUs, i.e. KNL. Six DIMM slots supporting up to 384 GB RAM sounds just like my barebones workstation system, but the mobo seems different. I see he has 2 with CPU and 2 without. No idea how much stuff/work beyond the obvious (chassis, PSU, water cooler) this needs to build into a working system; discussion welcome.

2021-11-27, 21:34   #225
paulunderwood

Quote:
Originally Posted by ewmayer
Quote:
complete running server with 4 cpus| 256 cores, 1024 threads, 48gb mcdram, dual 2100w psu (2 es 8gb chips and 2 es 16gb chips) |$2000 + shipping
Formidable power. I had toyed with getting such a system, but the potential noise put me off. It would draw about 1400W running flat out; one would need a garage or basement to house it. I think it is like 4 machines in one, i.e. it needs 4 OS kernels running. Still, not a bad offer by the seller.

I guess one could run 16 mprime workers (4 cores each) on each of the two 8GB ES chips, and likewise 32 workers (2 cores each) on each of the two 16GB chips.
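
A quick sketch of the core/MCDRAM arithmetic behind that suggestion (64 cores per chip is implied by the listing's 256 cores across 4 CPUs; presented only as arithmetic, not as a tuning recommendation):
Code:
# Core/MCDRAM arithmetic for the suggested worker layouts.
for mcdram_gb, workers, cores_per_worker in ((8, 16, 4), (16, 32, 2)):
    print(f"{mcdram_gb} GB chip: {workers} workers x {cores_per_worker} cores "
          f"= {workers * cores_per_worker} cores, "
          f"{mcdram_gb / workers:.1f} GB MCDRAM per worker")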

Last fiddled with by paulunderwood on 2021-11-27 at 22:39