mersenneforum.org  

2022-03-01, 20:04   #1
ewmayer
2ω=0
 
Sep 2002
República de California

2^3·3^2·163 Posts
Best AVX-512 CPUs for large-footprint FFT-mul

Starting late last summer I ran a p-1 stage 1 to b1 = 10^7 on F33 on my Knights Landing cheapie-refurb mini-workstation. After installing a big wad of server dimm-RAM I've been running 10^8-sized stage 2 intervals using the stage 1 residue, with a view to starting a distributed such effort among interested forumites with suitable hardware.

But before risking wasting others' runtime, it's important to be sure the stage 1 result is correct. Our own Mike/Xyzzy has been kindly running a separate stage 1 computation on his Intel 18c36t i9, which was roughly 70% done (~10M stage 1 iterations of the needed 14427494) when he recently shut said machine down and sold it off. The problem is that, like most Intel manycore offerings, his machine had woefully inadequate memory bandwidth to keep those cores fed on a big-footprint (~4GB) FFT-modmul running data-hungry avx-512 8-fold-double code - using 16c32t he was getting 900-1000 ms/iter at 512M FFT, roughly half the speed of my KNL running out of the onboard 16GB HBM.
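
For anyone wondering where the ~4GB figure comes from, here is a rough back-of-envelope sketch. It assumes that "512M" counts 8-byte doubles with M = 2^20, and it covers only the main residue array, not any auxiliary tables. The function names are made up for illustration and are not part of Mlucas.

```python
def fft_footprint_gib(fft_len_m, bytes_per_word=8):
    """Size of the main FFT array in GiB, assuming fft_len_m * 2^20 doubles."""
    words = fft_len_m * 2**20
    return words * bytes_per_word / 2**30


def bits_per_word(fermat_index, fft_len_m):
    """Average number of bits of F_m = 2^(2^m) + 1 carried per FFT word."""
    return 2**fermat_index / (fft_len_m * 2**20)


print(fft_footprint_gib(512))   # 4.0
print(bits_per_word(33, 512))   # 16.0
```

Under those assumptions the 512M FFT for F33 holds 16 bits per double and occupies 4 GiB, which is far larger than any L3 cache and explains why memory bandwidth dominates.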

If one were targeting F33 stage 1 work, I wonder what the most bang-for-buck-ish non-KNL avx-512 option would be. One would want at least 4 cores but no more than 8 due to memory-bandwidth constraints, as large an L3 cache as possible (on the KNL the MCDRAM acts as such) and - lacking any kind of HBM - a mobo which supports fast high-bandwidth RAM, with DIMM slots filled with low-capacity but very-fast sticks, say 16-32GB total. Maybe a 1-2-year-old used CPU, if the newer ones don't really offer much max-throughput for the above type of big-footprint workloads?

If you have such a machine and are willing to do some timings, here's how:

o Get and build the current version of Mlucas, using instructions here. If your system has < 24GB RAM, you'll need a couple of post-build tweaks to reduce the memory footprint; PM me for those once your automated 'bash makemake.sh' parallel build completes.

o 512M FFT will have a strong preference for power-of-2 threadcounts, and 2-threads-per-core assuming it's a hyperthreaded Intel CPU (AFAIK no AMD chips have avx-512 support at present). Assuming your machine has N physical cores and P = largest power of 2 <= N, you want to pin 2*P threads to the same subset of P physical cores. Using the Intel core numbering convention:

./Mlucas -iters 100 -fft 512M -f 33 -shift 0 -cpu 0:P-1,N:N+P-1

Thus e.g. on a 6c12t CPU, the args to the latter flag would be '-cpu 0:3,6:9'. The resulting timing captured in the fermat.cfg file will be ~10% pessimistic due to data-and-thread-init overhead.
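
The core-pinning rule above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and it assumes the Intel numbering convention described in the post, where logical CPUs N..2N-1 are the hyperthread siblings of physical cores 0..N-1.

```python
def mlucas_cpu_arg(n_cores):
    """Build the -cpu argument that pins 2*P threads to P physical cores,
    where P is the largest power of 2 not exceeding n_cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be a positive integer")
    p = 1 << (n_cores.bit_length() - 1)   # largest power of 2 <= n_cores
    return f"0:{p - 1},{n_cores}:{n_cores + p - 1}"


print(mlucas_cpu_arg(6))    # 0:3,6:9
print(mlucas_cpu_arg(18))   # 0:15,18:33
```

The 6-core case reproduces the '-cpu 0:3,6:9' example, and the 18-core case corresponds to a 16c32t run on an 18c36t part.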
2022-03-01, 21:24   #2
paulunderwood
 
Sep 2002
Database er0rr

3^2·467 Posts

Quote:
Originally Posted by ewmayer View Post
I wonder what the most bang-for-buck-ish non-KNL avx-512 option would be.
I have not seen any desktop chips, but this workstation looks good: https://www.ebay.com/itm/35282856756...r=622531495174

Newer chips run cooler, not like Skylake.
2022-03-01, 23:24   #3
ewmayer
2ω=0
 
Sep 2002
República de California

2DD8_16 Posts

Quote:
Originally Posted by paulunderwood View Post
I have not seen any desktop chips, but this workstation looks good: https://www.ebay.com/itm/35282856756...r=622531495174

Newer chips run cooler, not like Skylake.
Thanks for the link - but in "bang for the buck" terms, it comes with no RAM, and is unlikely to outperform my refurb-KNL, which cost just $500, plus - once I had completed my stage 1 run - another $1400 (would be < $1000 currently) to upgrade with 192GB RAM for stage 2 work.

Admittedly, it's a niche sort of optimization problem, and quite possibly a cheap used RAM-less KNL will prove the best option; I mainly wanted a sense of whether there were any consumer-grade Intel offerings which could provide similar total memory bandwidth at comparable cost.
2022-03-02, 05:53   #4
paulunderwood
 
Sep 2002
Database er0rr

3^2×467 Posts

Quote:
Originally Posted by ewmayer View Post
Thanks for the link - but in "bang for the buck" terms, it comes with no RAM, and is unlikely to outperform my refurb-KNL, which cost just $500, plus - once I had completed my stage 1 run - another $1400 (would be < $1000 currently) to upgrade with 192GB RAM for stage 2 work
The base price is for 96GB RAM. Another $500 gets you 192GB. It also comes with 4x 250GB NVMe. I thought 38.5MB Level3 cache would be attractive, despite 1/3 memory bandwidth. Running 4-8 cores on one of its chips would probably trigger turbo bumping it up to nearly 3.8GHz.

Quote:
Admittedly, it's a niche sort of optimization problem, and quite possibly a cheap used RAM-less KNL will prove the best option, I mainly wanted a sense of whether there were any consumer-grade Intel offerings which could provide similar total memory-bandwidth at comparable cost.
Hopefully someone will run some benchmarks for you, and a cheap desktop chip will give you what you want.

I looked at NewEgg: an Intel 12700K ($400) plus an Asus Strix motherboard ($500) and 64GB of dual-channel DDR5 (~$600). The chip will run AVX-512 if the motherboard allows the E-cores to be disabled, leaving 8 cores.

Last fiddled with by paulunderwood on 2022-03-02 at 07:24
2022-03-02, 19:56   #5
sdbardwick
 
Aug 2002
North San Diego County

1011011101_2 Posts

Just a note on AVX-512 vs FMA3 on a dual channel board. Used CpuSupportsAVX512F=0 or 1 to toggle AVX-512.
Code:
3200K FFT DCLL on 60198527 @ 4698-4700 MHz on all cores, as reported by CPU-Z, for both FMA3 and AVX-512 runs.

    AVX-512          FMA3
2.88 ms/iter     3.025 ms/iter
2.87 ms/iter     3.022 ms/iter
2.91 ms/iter     3.018 ms/iter
Side note: my Ryzen 7 3800X (dual channel, FMA3) is quicker than both at, IIRC, around 2.00 ms/iter. Will confirm the numbers when remote access to the box is possible.
1 worker, 8 cores on all.

Last fiddled with by sdbardwick on 2022-03-02 at 20:10
2022-03-02, 23:09   #6
Xyzzy
 
Aug 2002

2·19·223 Posts

Quote:
Originally Posted by paulunderwood View Post
…64GB DDR5 (~$600) dual channel.
We are currently running this job on a 32GB NUC.
Code:
top - 17:08:22 up 5 days, 11 min,  1 user,  load average: 5.10, 5.27, 4.73
Tasks: 281 total,   1 running, 278 sleeping,   0 stopped,   2 zombie
%Cpu(s):  1.0 us,  0.3 sy, 50.0 ni, 48.2 id,  0.0 wa,  0.4 hi,  0.1 si,  0.0 st
GiB Mem :     31.1 total,     12.8 free,      8.1 used,     10.1 buff/cache
GiB Swap:      0.0 total,      0.0 free,      0.0 used.     22.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 121965 m         30  10   22.3g   5.0g   4.0m S 400.0  16.2   5230:10 ./mlucas -cpu 0:3
2022-03-03, 01:32   #7
sdbardwick
 
Aug 2002
North San Diego County

1011011101_2 Posts

Yes, average ms/iter on the R7 3800X is 2.022. Guess the 2x16MB cache helps. The upcoming 5800X3D with 96MB cache will be interesting for larger FFTs.
2022-03-04, 08:35   #8
henryzz
Just call me Henry
 
"David"
Sep 2007
Liverpool (GMT/BST)

2×5×599 Posts

Quote:
Originally Posted by sdbardwick View Post
Just a note on AVX-512 vs FMA3 on a dual channel board. Used CpuSupportsAVX512F=0 or 1 to toggle AVX-512.
Code:
3200K FFT DCLL on 60198527 @ 4698-4700 MHz on all cores, as reported by CPU-Z, for both FMA3 and AVX-512 runs.

    AVX-512          FMA3
2.88 ms/iter     3.025 ms/iter
2.87 ms/iter     3.022 ms/iter
2.91 ms/iter     3.018 ms/iter
Side note: my Ryzen 7 3800X (dual channel FMA3) is quicker than both at IIRC around 2.00 ms/iter. Will confirm #s when remote access to box possible.
1 worker 8 cores on all.
What was the power usage like for AVX-512 vs FMA3? I suspect that FMA3 may be more efficient on your system (and many others). Was the same frequency held for the AVX-512 benchmark?
2022-03-04, 15:43   #9
sdbardwick
 
Aug 2002
North San Diego County

2DD_16 Posts

Quote:
Originally Posted by henryzz View Post
What was the power usage like for AVX-512 vs FMA3? I suspect that FMA3 may be more efficient on your system(and many others). Was the same frequency held for the AVX-512 benchmark?
According to Intel Extreme Tuning Utility, Package TDP for AVX-512 is 200W, FMA3 is 176W. Both stabilize at 4.7GHz, with the -512 running with an extra 0.1 V for core voltage.
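
Putting those power figures together with the ms/iter timings from earlier in the thread gives a rough energy-per-iteration comparison. This is only a sketch: it assumes the reported package power is a steady average over the run, and the helper name is made up for illustration.

```python
from statistics import mean

# ms/iter timings from the 3200K FFT runs posted earlier in the thread
avx512_ms = [2.88, 2.87, 2.91]
fma3_ms = [3.025, 3.022, 3.018]


def joules_per_iter(watts, ms_per_iter):
    """Energy per iteration, assuming constant package power over the run."""
    return watts * ms_per_iter / 1000.0


avx512_j = joules_per_iter(200, mean(avx512_ms))
fma3_j = joules_per_iter(176, mean(fma3_ms))

print(f"AVX-512: {mean(avx512_ms):.3f} ms/iter, {avx512_j:.3f} J/iter")
print(f"FMA3:    {mean(fma3_ms):.3f} ms/iter, {fma3_j:.3f} J/iter")
```

On these numbers AVX-512 is roughly 4-5% faster per iteration but uses roughly 8-9% more energy per iteration, so FMA3 does look like the more efficient mode on this box.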

Last fiddled with by sdbardwick on 2022-03-04 at 15:44
2022-03-04, 19:40   #10
ewmayer
2ω=0
 
Sep 2002
República de California

2^3×3^2×163 Posts

Quote:
Originally Posted by sdbardwick View Post
According to Intel Extreme Tuning Utility, Package TDP for AVX-512 is 200W, FMA3 is 176W. Both stabilize at 4.7GHz, with the -512 running with an extra 0.1 V for core voltage.
I'm not familiar with the details of the apparent difference in running mprime in avx-512 vs fma3 mode. Could you enlighten? My own code makes no such distinction, because AFAIK no Intel CPUs have the former without supporting the latter.

Update: Paul Underwood has kindly agreed to run the stage 1 DC to completion, taking over from Mike/Xyzzy around iteration 9.4M. He's getting 502 ms/iter @512M FFT running 64c128t on his KNL, right around what I expected based on the 470 ms/iter I got on my KNL, which at 1.4 GHz clocks 0.1 GHz higher than his. At that rate, with ~5M iterations left to go, ETA for the DC is 29 days from now, assuming uninterrupted 24/7 running.
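
The ETA arithmetic can be checked with a few lines. This is a minimal sketch that takes the handover point as exactly 9.4M iterations, which is only approximate; the function name is hypothetical.

```python
TOTAL_ITERS = 14_427_494   # stage 1 iterations needed
DONE_ITERS = 9_400_000     # approximate handover point
MS_PER_ITER = 502          # timing at 512M FFT on the KNL


def eta_days(total, done, ms_per_iter):
    """Days remaining at a fixed per-iteration time, running 24/7."""
    remaining = total - done
    return remaining * ms_per_iter / 1000 / 86_400


print(f"{eta_days(TOTAL_ITERS, DONE_ITERS, MS_PER_ITER):.1f} days")   # 29.2 days
```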

Last fiddled with by ewmayer on 2022-03-04 at 22:25
2022-03-05, 15:08   #11
tServo
 
"Marv"
May 2009
near the Tannhäuser Gate

2^3×3^2×11 Posts

Intel plans to fuse off AVX-512 support in Alder Lake CPUs, even though it is on the chip. Previously they were kinda, possibly, maybe going to support it, but have changed their minds.

The link below is to my favorite leaks and rumors page, Gamer Meld.
I have found them to be brand agnostic and very accurate.

The Alder Lake section starts at 1:59


https://www.youtube.com/watch?v=LNQVX1YP7m4&t=207s


Up yours Intel !!

All times are UTC.



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
