mersenneforum.org Intel Xeon PHI?

2020-11-29, 23:22   #133
ewmayer
∂2ω=0

Sep 2002
República de California

2×7×829 Posts

Quote:
 Originally Posted by ewmayer So if the 53.0C for the KNL is to be believed - and the fact that a similar run using 'only' 32 cores gives a cooler 44.0C indicates so - that water cooling is working very well indeed.
Spoke too soon - I neglected to mention that temperature was with the case side panel on the CPU side of the mobo removed - I put the panel back on last night and the temp quickly rose by over 10C into the 65-70C range. When I rechecked just now I saw it at 70C, but with an added ALARM (CRIT) at the end of the 'sensors' output line - not sure precisely what temp triggers that, because it was still at 70C, which I first saw last night without said alarm message. It probably rose a few degrees higher at some point in the last 15 hours and tripped the alarm. It seems to be a "once tripped, the alarm message persists" deal, because I took the side panel back off and the temp quickly dropped back to ~60C, but the message still shows. I looked at the manpage to see if the 'sensors' command has a 'clear alarm' option; didn't find one.

Will look into replacing the side panel in question with a fine-perforated metal-mesh one, similar to the one on top of the case, covering the 2 water-cooler vent fans.

Here are some Mlucas avx-512 build timings at 64M-FFT - more below on why that large FFT length is of special interest ATM - on the KNL, all at the same FFT length, 1-thread-per-core (I found no benefit from any combination of hyperthreading I tried), #threads from 1-64. Parallel scaling is good through 16-threads but then falls off a cliff beyond that:
Code:
64M FFT, 1-thread-per-core, #threads from 1-64:              #thread:	|| scaling (vs 1-thr):
65536  msec/iter = 1765.36  radices =  16 16 16 16 16 32	 1	1.00
65536  msec/iter =  943.43  radices =  16 16 16 16 16 32	 2	.936
65536  msec/iter =  496.24  radices =  16 16 16 16 16 32	 4	.889
65536  msec/iter =  259.18  radices =  16 16 16 16 16 32	 8	.851
65536  msec/iter =  125.93  radices =  16 16 16 16 16 32	16	.876
65536  msec/iter =   85.70  radices = 256 16 16 16 32  	32	.644
65536  msec/iter =   69.06  radices = 256 16 16 16 32  	64	.399
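The '|| scaling' column above is just the 1-thread time divided by (#threads × per-thread time); a quick sketch reproducing it from the msec/iter figures in the table:

```python
# Parallel-scaling efficiency: eff(n) = t(1) / (n * t(n)),
# using the msec/iter timings from the table above.
timings = {1: 1765.36, 2: 943.43, 4: 496.24, 8: 259.18,
           16: 125.93, 32: 85.70, 64: 69.06}

t1 = timings[1]
for nthr, t in timings.items():
    eff = t1 / (nthr * t)
    print(f"{nthr:2d} threads: {eff:.3f}")
```

This reproduces the posted column, including the fall-off from 0.876 at 16 threads to 0.399 at 64.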
The actual runtimes for a production run, once things settle down after a few minutes, are 5-10% faster - getting ~64ms/iter at 64-threads for the 64M-FFT run described below. Here are results - these are just representative examples; I did many more experiments - of several supplemental timing tests, illustrating the ineffectiveness of hyperthreading and the total-throughput boost from running multiple jobs, each using 16 or 32 threads on nonoverlapping sets of cores:

[B] 2 side-by-side runs, each using 16-thr: Each nets 136 ms/iter, 1.85x total throughput of one 16-thr job, 1.26x total throughput of one 32-thr job;

[C] 4 side-by-side runs, each using 16-thr: Each nets 170 ms/iter, 2.96x total throughput of one 16-thr job, 1.62x total throughput of one 64-thr job.
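The [B] and [C] multipliers can be checked directly against the single-job timings from the table; a quick sketch:

```python
# Check the side-by-side total-throughput multipliers against the
# single-job timings (msec/iter: 16-thr = 125.93, 32-thr = 85.70, 64-thr = 69.06).
def throughput(ms_per_iter, njobs=1):
    """Total iterations/sec for njobs concurrent jobs at the given ms/iter."""
    return njobs * 1000.0 / ms_per_iter

b = throughput(136, njobs=2)   # [B]: 2 x 16-thr jobs, 136 ms/iter each
c = throughput(170, njobs=4)   # [C]: 4 x 16-thr jobs, 170 ms/iter each
print(f"[B]: {b / throughput(125.93):.2f}x one 16-thr, {b / throughput(85.70):.2f}x one 32-thr")
print(f"[C]: {c / throughput(125.93):.2f}x one 16-thr, {c / throughput(69.06):.2f}x one 64-thr")
```

This recovers the 1.85x/1.26x and 2.96x/1.62x figures above.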

First task I set the KNL on is to complete the 64M-FFT one of the pair of primality tests of F30 I started several years ago. I did ~2/3 of the needed 2^30-1 = 1073741823 iterations of said test on a pair of machines: one at 60M on a 32-core AVX2 Xeon server, the other at 64M on the GIMPS KNL. Both machines were physically hosted by David Stanfill, who went AWOL early this year. Ryan Propper was kind enough to pick up the 60M run and complete that on a manycore virtual machine he had access to, but the 64M one remained in need of completion. Picked that up at iteration 730M last night; based on timings so far, the ETA for completion is a little over 8 months. Again, per the above table, this is getting less than half the total throughput the CPU is capable of.
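The ~8-month ETA follows directly from the iteration count and the ~64 ms/iter production timing quoted above; a quick sanity check:

```python
# Sanity-check the F30 completion ETA: iterations remaining at the quoted
# production rate of ~64 ms/iter, restarting from iteration 730M.
total_iters = 2**30 - 1          # 1073741823 iterations for the F30 test
done = 730_000_000
ms_per_iter = 64

remaining = total_iters - done
seconds = remaining * ms_per_iter / 1000
months = seconds / (30 * 86400)
print(f"{remaining} iters remaining -> {seconds / 86400:.0f} days ~ {months:.1f} months")
```

That works out to roughly 255 days, i.e. a little over 8 months, matching the estimate above.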

The above multiple-job results [B] and [C] indicate a much better total-throughput strategy for me: as soon as the in-development Mlucas v20 has a working p-1 Stage 1 with restart-from-savefile capability, switch the above F30-test completion to 32-threaded and fire up a second 32-threaded job, in the form of a deep p-1 Stage 1 on F33. By deep I mean something on the order of a year's runtime. At that point - assuming none of the occasional GCDs during Stage 1 turns up a factor, which is the expected result based on the TF depth to date - the Stage 1 residue can be distributed to volunteers in possession of bigmem systems - any kind of halfway-fast Stage 2 will need at least 128GB of RAM - to run various Stage 2 subintervals in hopes one finds a factor. Said Stage 1 will of course slow the finishing-off of the F30@64M job, but we already know that number to be composite via a small TF-found factor; the primality test is to generate a residue for cofactor PRP-checking.
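For readers unfamiliar with the method: p-1 Stage 1 computes b^E (mod N) with E the product of all prime powers up to a bound B1, then takes a GCD with N. A toy sketch - this `pm1_stage1` is purely illustrative, not Mlucas code, which does the same exponentiation via FFT-based modmul on multimillion-digit N:

```python
from math import gcd

def pm1_stage1(N, B1, base=3):
    """Toy p-1 Stage 1: compute base^E (mod N), with E the product of all
    prime powers <= B1, then look for a factor via gcd(base^E - 1, N).
    This reveals any prime factor q of N for which q-1 is B1-smooth."""
    # Sieve of Eratosthenes for primes up to B1.
    sieve = [True] * (B1 + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(B1**0.5) + 1):
        if sieve[i]:
            for j in range(i * i, B1 + 1, i):
                sieve[j] = False
    E = 1
    for p in range(2, B1 + 1):
        if sieve[p]:
            pk = p
            while pk * p <= B1:   # largest power of p not exceeding B1
                pk *= p
            E *= pk
    return gcd(pow(base, E, N) - 1, N)

# Example: F5 = 2^32+1 has the factor 641, and 641-1 = 2^7 * 5 is
# 128-smooth, so Stage 1 with B1 = 128 captures it in the gcd:
F5 = 2**32 + 1
print(pm1_stage1(F5, 128))
```

Stage 2 extends this to factors q where q-1 is B1-smooth except for one larger prime below a second bound B2 - that is the memory-hungry part to be farmed out.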

2020-11-30, 00:29   #134
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

2·3³·67 Posts

Thanks ewmayer! I followed the "Frankencable" thread as well as this one. These are part of why I'm interested. I'm just usually shy of spending so much at once. As you probably know, I dabble with discards rather than upgrading or jumping into something with potential. My trouble is that even if I bought a brand new system that could replace all the machines I currently run for less running cost, I'd probably just add it in with the others instead of replacing them. The same seems to be what I do with most of my additions, although I actually did retire all my Pentium 4 machines.
2020-11-30, 04:23   #135
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·701 Posts

My temp & turbo reports were also with the motherboard-side cover off. Two fans draw air through the big radiator above them. The lines from the block on the cpu to the radiator cross through the space where gpus would reside, so any gpu squeezed in there would need to be a low-profile type, like the 2GB RX550 I have, or there would be mechanical interference. Mine also has the unusual power supply dimensions Ernst describes, and the physical mounting is unimpressive, involving 2 screws at one end and a zip tie at the other that still leaves it a bit wiggly.

Does the new Mlucas P-1 code support only Fermats, or also Mersennes? When will the new code be available for others to use?

Mlucas on CentOS: 170ms/iter x four 16-thread instances of 64M fft length corresponds to 4 x 1000 / 170 = 23.53 iters/sec throughput on 64 of the 68 cores. (At what average clock rate?) Prime95 on Windows 10 at the same 64M fft length benchmarked as 25.37 iters/sec throughput on all 68 cores. Straight-line interpolating down to 64 cores would give 22.15 iters/sec, which is probably a bit pessimistic. The indicated throughput, Mlucas vs. prime95, is within 8%, one way or the other.

Last fiddled with by kriesel on 2020-11-30 at 04:47
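The Mlucas-vs-prime95 cross-check above, spelled out in a few lines (25.37 and 22.15 iters/sec are the prime95 figures quoted; 170 ms/iter the four-job Mlucas timing):

```python
# Compare total 64M-FFT throughput: Mlucas (4 x 16-thread jobs) vs prime95.
mlucas = 4 * 1000 / 170          # four 16-thread instances at 170 ms/iter
p95_68 = 25.37                   # prime95, benchmarked on all 68 cores
p95_64 = 22.15                   # prime95, interpolated down to 64 cores

print(f"Mlucas: {mlucas:.2f} iters/sec")
print(f"vs prime95 @ 68 cores: {100 * (mlucas / p95_68 - 1):+.1f}%")
print(f"vs prime95 @ 64 cores: {100 * (mlucas / p95_64 - 1):+.1f}%")
```

Mlucas comes in about 7% below the 68-core prime95 figure and about 6% above the 64-core interpolation - within 8% either way, as stated.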
2020-11-30, 20:00   #136
ewmayer
∂2ω=0

Sep 2002
República de California

10110101010110₂ Posts

Quote:
 Originally Posted by kriesel Mine also has the unusual power supply dimensions Ernst describes, and the physical mounting is unimpressive, involving 2 screws at one end and a zip tie at the other that still leaves it a bit wiggly.
Yeah, that was a rather shoddy job w.r.to PSU securement. Does your system also have the power plug which likes to disconnect at the slightest jiggling of the case?

Quote:
 Does the new Mlucas P-1 code support only Fermats, or also Mersennes? When will the new code be available for others to use?
It will support both - as I noted before, there's very little difference between p-1 for the two kinds, assuming one has already got the different specialized FFT-modmul routines for the 2 different moduli in place. I got basic p-1 Stage 1 code working last week; this week I'm working on properly integrating that code into the production-mode front end and modifying the savefile-restart mechanism for p-1. Once that's in place, firing up a p-1 Stage 1 for F33 on the KNL should be a relatively trivial matter, then let that run and run while I work on Stage 2 code. No specific timeframe I can give at present - hope to have all the p-1 work done by EOY, then it's on to the other major new feature for v20, PRP-proof support. I will likely make the v20-with-p-1-only-added code available for build & test while I work on PRP-proof support.

Quote:
 Mlucas on CentOS: 170ms/iter x four 16-thread instances of 64M fft length corresponds to 4 x 1000 / 170 = 23.53 iters/sec throughput on 64 of the 68 cores. (At what average clock rate?)
Here is the first entry in my /proc/cpuinfo system file:
Code:
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 87
model name	: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
stepping	: 1
microcode	: 0x1b0
cpu MHz		: 1501.193
cache size	: 1024 KB
Or were you perhaps referring to possible auto-downclocking-under-load? If that is a possibility, I'll have to dig out how to get the "live GHz" numbers under Linux.
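On Linux the "cpu MHz" field in /proc/cpuinfo is refreshed on each read, so polling it does show the live per-core clock. A small sketch parsing that field - the sample string below just mimics the cpuinfo entry shown above; reading the real file works the same way:

```python
import re

def cpu_mhz(cpuinfo_text):
    """Extract the per-core 'cpu MHz' values from /proc/cpuinfo-format text."""
    return [float(m) for m in
            re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.M)]

# On a live system: cpu_mhz(open("/proc/cpuinfo").read())
sample = """\
processor\t: 0
model name\t: Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz
cpu MHz\t\t: 1501.193
cache size\t: 1024 KB
"""
print(cpu_mhz(sample))   # [1501.193]
```

Polling this in a loop (or simply `watch grep MHz /proc/cpuinfo`) while Mlucas runs would reveal any downclocking-under-load.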

Quote:
 Prime95 on Windows 10 at the same 64M fft length benchmarked as 25.37 iters/sec throughput on all 68 cores. Straight line interpolating down to 64 cores would give 22.15 iters/sec, which is probably a bit pessimistic.The indicated throughput Mlucas vs. prime95 is within 8%, one way or the other.
Thanks, closer than I'd hoped. My box is still a ways away from that total throughput, as I'm currently running F30@64M on 64 cores, wasting perhaps half the max. achievable FLOPS. But I expect to better that soon.

Do you have any watts-at-wall numbers for your system, idle and under load? All my wattmeters are currently hooked up to GPU-hosting systems which I don't want to unplug.

Last fiddled with by ewmayer on 2020-11-30 at 20:01

2020-11-30, 21:28   #137
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·701 Posts

Quote:
 Originally Posted by ewmayer Yeah, that was a rather shoddy job w.r.to PSU securement. Does your system also have the power plug which likes to disconnect at the slightest jiggling of the case?
Haven't detected that. Mine arrived without a power cord, so I repurposed one left over from a recycled case.
Quote:
 No specific timeframe I can give at present - hope to have all the p-1 work done by EOY, then on the other major new feature for v20, PRP-proof support. I will likely make the v20-with-p-1-only-added code available for build&test while I work on PRP-proof support.
Sounds good.

Quote:
 Or were you perhaps referring to possible auto-downclocking-under-load? If that is a possibility, I'll have to dig out how to get the "live GHz" numbers under Linux.
The 7250 is nominal 1.4 GHz, turbo 1.5, and I've seen 1.42 or 1.44 indicated in Windows Task Manager while under full all-cores prime95 load. And it may drop after reinstalling the 12.44 x 14" solid side panel. Have you found some hole-y sheet metal yet?
Quote:
 Do you have any watts-at-wall numbers for your system, idle and under load? All my wattmeters are currently hooked up to GPU-hosting systems which I don't want to unplug.
Ditto, and I think a storm may have decalibrated one of them; same system & operating mode that would indicate ~1100W now indicates ~700. I have another on order, may shuffle/liberate an existing one.

2020-11-30, 22:11   #138
ewmayer
∂2ω=0

Sep 2002
República de California

2×7×829 Posts

I found some useful perforated-metal-sheet product listings here, but those are building-supply-oriented - few fine-mesh ones, though possibly a few usable. Per my measurement the precise side-panel WxH = 14" x 12 7/16" (35.6 x 31.6 cm); can you confirm/deny those dimensions? Some fine steel woven filter mesh, the kind one puts over a vent to keep bugs out, might also suit, but most I've seen via cursory search has max dimension <= 12". Perhaps something like this, or even a perforated baking mat cut to size and stretched over the side opening.

================

One last theme of interest re. the KNL setup: exploring overclocking the CPU and/or onboard memory, and disabling the power-saving/auto-throttling modes, to boost performance. I've made the mobo manual for the Hydra workstation available here (10.6MB). BIOS setup is Chapter 7, very long and detailed. The key sections and settings for performance-tweaking appear to be (default settings noted with **):

[pg 7-6] CPU Configuration: Lots of stuff there; main items of interest appear to be frequency settings and power-saving/auto-throttling modes *enable*/disable.

[pg 7-13] Memory Configuration

o Enforce POR: Select Enforce POR to enforce the onboard memory DIMM modules to operate and run at the frequency and voltage as specified by the Intel POR specifications. The options are *Enforce POR*, Disabled and Enforce Stretch Goals.

o Memory Frequency: Use this feature to set the maximum memory frequency for onboard memory modules. The options are *Auto*, 1600, 1867, 2133, and 2400.
2020-11-30, 22:27   #139
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

2×3³×67 Posts

Would window screening be too coarse?

[EWM: oops, I hit 'edit' instead of 'reply', but will just leave my reply here - this is me hijacking EdH's post below :] That would be similar in fineness to the latter 2 items I linked, but you want a bit of stiffness, and typical window-screen mesh is too supple and thus needs a frame. Using the setscrews to stretch the mesh obviates the too-pliant issue, but w/o a frame you'd tear holes in the window mesh. Not sure if the baking-mat mesh is any sturdier, but it's cheap enough to just order one; if it proves unsuitable for the intended use you can still use it as, well, a baking mat. :)

Last fiddled with by ewmayer on 2020-11-30 at 23:06
2020-11-30, 23:35   #140
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

132B₁₆ Posts

Quote:
 Originally Posted by ewmayer Per my measurement the precise side-panel WxH = 14" x 12 7/16" (35.6 x 31.6 cm), can you confirm/deny those dimensions?
Anticipated that; take another look at post 137.

Quote:
 I've made the mobo manual for the Hydra workstation available here (10.6MB). BIOS setup is Chapter 7, very long and detailed.
I note that what I see in BIOS setup screens does not always correspond with the manual.
Quote:
 The key sections and settings for performance-tweaking appear to be (default settings noted with **): [pg 7-6] CPU Configuration: Lots of stuff there, main items of interest appear to be frequency settings and power-saving/auto-throttling modes *enable*/disable. [pg 7-13] Memory Configuration o Enforce POR Select Enforce POR to enforce the onboard memory DIMM modules to operate and run at the frequency and voltage as specified by the Intel POR specifications. The options are *Enforce POR*, Disabled and Enforce Stretch Goals. o Memory Frequency Use this feature to set the maximum memory frequency for onboard memory modules. The options are *Auto*, 1600, 1867, 2133, and 2400.
I'd say the most key section is about setting power restoration behavior to "Always Start".
Will be interested to see what you come up with for BIOS setting changes.

2020-12-01, 20:20   #141
ewmayer
∂2ω=0

Sep 2002
República de California

2×7×829 Posts

First performance-tweak I tried was fiddling with the onboard-memory frequency. The mobo User Manual uses a quite stupid scheme for indicating submenu levels - right arrows of slightly differing sizes rather than section/subsection numbering. So, to get to the relevant one: Advanced Setup Configurations -> Chipset Configuration -> North Bridge -> Memory Configuration. Inside that rightmost submenu, I set the Enforce POR option to Disable (the only options were "Enforce POR" and "Disable"; the "Enforce Stretch Goals" option cited in the manual was not listed.) Next I set Memory Frequency from its default "Auto" - nowhere did I see what actual clock setting that yielded - to the highest available, 2400, then rebooted. There is a related MemTest (Memory Test) option, for which we read "Select Enabled to enable memory testing during system boot. The options are *Enabled* and Disabled", so that is on by default and presumably we got at least a basic mem-test during boot.

Fired up Mlucas to resume the 64-threaded F30 continuation run, waited for the next checkpoint ... no change in timings. So it seems "Auto" was already setting the onboard mem to the max displayed value of 2400.
2020-12-01, 21:33   #142
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×701 Posts

Quote:
 Originally Posted by ewmayer First performance-tweak I tried was fiddling with the onboard-memory frequency. The mobo User Manual uses a quite stupid scheme for indicating submenu levels - right arrows of slightly differing sizes rather than section/subsection numbering. So, to get to the relevant one: Advanced Setup Configurations -> Chipset Configuration -> North Bridge -> Memory Configuration. Inside that rightmost submenu, I set the Enforce POR option to Disable (the only options were "Enforce POR" and "Disable"; the "Enforce Stretch Goals" option cited in the manual was not listed.) Next I set Memory Frequency from its default "Auto" - nowhere did I see what actual clock setting that yielded - to the highest available, 2400, then rebooted. There is a related MemTest (Memory Test) option, for which we read "Select Enabled to enable memory testing during system boot. The options are *Enabled* and Disabled", so that is on by default and presumably we got at least a basic mem-test during boot. Fired up Mlucas to resume the 64-threaded F30 continuation run, waited for the next checkpoint ... no change in timings. So it seems "Auto" was already setting the onboard mem to the max displayed value of 2400.
Have you added DDR4 DIMMs to your system? These systems shipped from the seller with only the 16 GB MCDRAM in the processor package; 6 empty DDR4 slots. I'd expect changing Northbridge settings to have an effect on any DDR4 present, and no effect on the much faster integral MCDRAM speed. https://www.anandtech.com/show/9794/...odes-from-sc15 https://en.wikipedia.org/wiki/Northbridge_(computing) If you mentioned adding DDR4, I missed it.

Last fiddled with by kriesel on 2020-12-01 at 21:49

2020-12-01, 22:29   #143
ewmayer
∂2ω=0

Sep 2002
República de California

2·7·829 Posts

Quote:
 Originally Posted by kriesel Have you added DDR4 DIMMs to your system? These systems shipped from the seller with only the 16 GB MCDRAM in the processor package; 6 empty DDR4 slots. I'd expect changing Northbridge settings to have effect on any DDR4 present, and no effect on the much faster integral MCDRAM speed. https://www.anandtech.com/show/9794/...odes-from-sc15 https://en.wikipedia.org/wiki/Northbridge_(computing) If you mentioned adding DDR4, I missed it.
From the mobo manual, I've bolded the relevant snip:

"Memory Frequency
Use this feature to set the maximum memory frequency for onboard memory modules. The options are Auto, 1600, 1867, 2133, and 2400."

Does 'onboard' not refer to the 16GB, um, onboard memory? OTOH the above clock settings are def. DDR-range, not MCDRAM range. From the Anandtech article you linked:
Quote:
 As the diagram stands, the MCDRAM and the regular DDR4 (up to six channels of 386GB of DDR4-2400) are wholly separate, indicating a bi-memory model. This stands at the heart at which developers will have to contend with, should they wish to extract performance from the part. The KNL memory can work in three modes, which are determined by the BIOS at POST time and thus require a reboot to switch between them. The first mode is a cache mode, where nothing is needed to be changed in the code. The OS will organize the data to use the MCDRAM first similar to an L3 cache, then the DDR4 as another level of memory. Intel was coy onto the nature of the cache (victim cache, writeback, cache coherency), but as it is used by default it might offer some performance benefit up to 16GB data sizes. The downside here is when the MCDRAM experiences a cache miss – because of the memory controllers the cache miss has to travel back into the die and then go search out into DDR for the relevant memory. This means that an MCDRAM cache miss is more expensive than a simple read out to DDR. The second mode is ‘Flat Mode’, allowing the MCDRAM to have a physical addressable space which allows the programmer to migrate data structures in and out of the MCDRAM. This can be useful to keep large structures in DDR4 and smaller structures in MCDRAM. We were told that this mode can also be simulated by developers who do not have hardware in hand yet in a dual CPU Xeon system if each CPU is classified as a NUMA node, and Node 0 is pure CPU and Node 1 is for memory only. The downside of the flat mode means that the developer has to maintain and keep track of what data goes where, increasing software design and maintenance costs. The final mode is a hybrid mode, giving a mix of the two. 
In flat mode, there are separate ways to access the high performance memory – either as a pure NUMA node (only applicable if the whole program can fit in MCDRAM), using direct OS system calls (not recommended) or through the Memkind libraries which implements a series of library calls. There is also an interposer library over Memkind available called AutoHBW which simplifies some of the commands at the expense of fine control. Under Memkind/AutoHBW, data structures aimed at MCDRAM have their own commands in order to be generated in MCDRAM.
The "Memkind libraries" ref. sounds like it refers to Intel’s VTune utilities, no idea if GCC supports any of that stuff, but doubt it.

I searched for 'mcdram' and 'flat' in the mobo manual, nothing for either.

Last fiddled with by ewmayer on 2020-12-01 at 22:30
