mersenneforum.org  

Old 2021-11-10, 00:41   #1
bentonsar
 
"Sarah B"
Aug 2021
Washington, Seattle

How much time is spent on memory access in PRPs?

Hi All,

Is there any rough estimate on what percentage of time of a PRP iteration is spent on memory reading/writing? Is it significant?

I would assume that one of the reasons GPUs are much faster at PRPs than CPUs is that they have a much larger cache, correct?

-Sarah
Old 2021-11-10, 01:40   #2
JWNoctis
 
"J. W."
Aug 2021


It depends on your frequency and memory bandwidth.

Note how the left part of the leftmost chart (GHz-d/d vs. frequency) is almost linear and projects to pass close to the zero point. I presume there are no extra wait states and the like, thanks to prefetching, until the frequency goes above the knee point at ~2 GHz and hits the memory bandwidth bottleneck for this CPU.

I'm under the impression that most GPUs with stunted/crippled FP64 capability are not much faster than CPUs for PRP, at least compared to how much faster they are at trial factoring. They don't have a much larger cache either, but they do have far more memory bandwidth.
[Attached chart: 5800H.png (GHz-d/d vs. frequency benchmark for the 5800H)]

Last fiddled with by JWNoctis on 2021-11-10 at 01:47
Old 2021-11-10, 02:06   #3
techn1ciaN
 
Oct 2021
U. S. / Maine


It is my understanding that Prime95 and GPUOwl are parallelized enough that your hardware is not just "waiting" / doing nothing while a memory operation is in progress if it can help it, so the direct answer to your question is theoretically "none at all." (VBCurtis recently asserted that memory latency has effectively no impact on throughput.) A wavefront PRP test also does not take up that much actual space in memory (less than 300 MB on my system). LL or PRP testing is, however, extremely demanding of memory bandwidth. There seems to be a consensus that Prime95's bottleneck is memory speed in most real-life configurations.

Mr. Woltman says in this post from 2019 that one Prime95 LL or PRP iteration requires ((FFT size * 8 * 4) + 5) MB of memory operations. At the current FTC wavefront, that should translate to around 190 MB/iter. My particular machine runs a wavefront test at about 6.7 ms/iter. If we put these figures together, we get my total (theoretical) memory load as approximately 28.5 GB/s. My CPU is from AMD's Ryzen series, whose speed with the AVX instructions that Prime95 uses is handicapped, so my throughput would not be amazing even in the case of unlimited memory bandwidth. Most recent or semi-recent Intel CPUs can run full-speed AVX, so a system with such a one might be pushing a theoretical bandwidth demand of 60 or 70 GB/s or even higher.
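
As a quick sanity check on those figures, here is a minimal back-of-the-envelope sketch (plain host code); the ~5.75M-word wavefront FFT size and my 6.7 ms/iter timing are assumptions taken from this post, not authoritative numbers:

Code:
// Rough check of the memory-traffic figures above. The FFT length (in
// millions of words) and the iteration time are assumptions from the post.
#include <cstdio>

int main() {
    double fft_size_m  = 5.75;                     // approx. wavefront FFT length, millions of words
    double mb_per_iter = fft_size_m * 8 * 4 + 5;   // (FFT size * 8 * 4) + 5 MB per iteration
    double ms_per_iter = 6.7;                      // measured iteration time
    double gb_per_s    = (mb_per_iter / 1000.0) / (ms_per_iter / 1000.0);
    printf("%.0f MB/iter -> ~%.1f GB/s of sustained memory traffic\n",
           mb_per_iter, gb_per_s);
    return 0;
}

This prints roughly 189 MB/iter and ~28 GB/s, in line with the numbers quoted above.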

One curiosity is that with sufficiently small FFTs, or a sufficiently large CPU cache, one can fit an entire FFT (and thus all of Prime95's memory operations) inside the CPU's L3 cache and sidestep the need for fast memory altogether. This is obvious on my laptop. Its CPU is very similar to my desktop's (same number of cores, same architecture and instruction support, approximately the same speed under sustained load), but its wavefront PRP or wavefront DC speed is barely a third of what my desktop can achieve, because its RAM is only single-channel and not clocked very fast. However, if I run PRP-CF testing, whose wavefront still uses comparatively very small FFTs (maybe 640K or 960K), the throughput difference almost entirely disappears.

It is true that GPUs tend to have more memory bandwidth than DDR4 DRAM can manage. However, manufacturers still might provide a wide memory bus on one GPU and a narrow one on another, or set one GPU's memory clock speed high and another's low. A good example of this is my GPU, the Radeon 5700 XT, vs. the newer Radeon 6600 XT. Their performance is very similar in, for example, gaming, and the 6600 XT tends to be available more cheaply, but the 5700 XT pulls ahead in Mersenne.ca's GPU LL benchmarks because* it uses a 256-bit memory bus vs. the 6600 XT's bus of only 128 bits (with similar VRAM clock speeds).

* Presumably. It could also be due at least partially to some other reason. My understanding of the practical impact of GPU specifications is not excellent.
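
For what it's worth, the raw numbers behind that bus-width argument can be estimated with a simple formula: peak bandwidth ≈ (bus width in bits / 8) × per-pin data rate. A small sketch, using commonly quoted GDDR6 rates for these two cards as assumptions rather than verified specs:

Code:
// Rough peak memory bandwidth from bus width and per-pin data rate.
// The 14 and 16 Gbps GDDR6 rates are commonly quoted figures for these
// cards, used here as assumptions.
#include <cstdio>

double peak_gb_per_s(int bus_width_bits, double gbps_per_pin) {
    return bus_width_bits / 8.0 * gbps_per_pin;
}

int main() {
    printf("RX 5700 XT (256-bit): ~%.0f GB/s\n", peak_gb_per_s(256, 14.0));  // ~448 GB/s
    printf("RX 6600 XT (128-bit): ~%.0f GB/s\n", peak_gb_per_s(128, 16.0));  // ~256 GB/s
    return 0;
}
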
Old 2021-11-10, 05:36   #4
JWNoctis
 
"J. W."
Aug 2021


Going a bit off topic, but to be fair, newer AMD processors from the last three years (Zen 2 and Zen 3) also have similar throughput (2 per clock cycle) to their Intel counterparts for many AVX2 arithmetic and FMA3 instructions. Intel processors from the same era, on the other hand, have AVX-512, which is twice as wide but executes at half the rate of the 256-bit instructions, for the same overall throughput, if Agner Fog's measurements are correct. Then again, AVX-512 would presumably allow better code density thanks to the wider and more numerous vector registers.

Intel's Alder Lake has 2-per-cycle AVX-512, but it's unclear whether it will stay available in chips economically obtainable by the average consumer, and either way it would still be limited by memory bandwidth except when running at low frequency for power efficiency.

I've observed similar effects from cache and FFT size, and thanks for the pointer to that info on memory operations per iteration. From it, my system's maximum memory throughput under the conditions of the benchmark above would be 40.9 GB/s, which is pretty impressive considering the 47.7 GB/s theoretical maximum for dual-channel DDR4-3200. There is indeed not much latency involved there.

Back on topic: for PRP work, an RTX 3070M would be slower than a 5800H processor running at ~25 W, despite drawing ~115 W and having ~10x the theoretical memory bandwidth (both discounting overhead), if we go by mersenne.ca's benchmarks and mine above... We'd be looking at things like the GP100/GV100/GA100, Radeon VII, or RX 6900 XT for something much faster than current-generation processors. The common thread among these appears to be fully enabled (or at least not-too-crippled) native FP64 performance.

Last fiddled with by JWNoctis on 2021-11-10 at 05:37
Old 2021-11-12, 16:17   #5
diep
 
Sep 2006
The Netherlands


Quote:
Originally Posted by bentonsar View Post
Hi All,

Is there any rough estimate on what percentage of time of a PRP iteration is spent on memory reading/writing? Is it significant?

I would assume that one of the reasons GPUs are much faster at PRPs than CPUs is that they have a much larger cache, correct?

-Sarah
The difference between CPUs and GPUs is not just one thing. It's many.

The first important difference is how the execution units work. If you write on a CPU: a = (x % 2); b = a + 5;
this can be executed at high speed.

On a GPU it cannot.

On a GPU, after x % 2 executes it takes a while before you have the result 'a'.
Only then can it be fed into the execution unit that computes a + 5.

So that's a major difference between CPUs and GPUs.
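
A minimal CUDA sketch of that point (my own illustration, not code from any GIMPS program): the two statements below form a dependency chain, so each lane has to wait for 'a' before the add can issue, and the hardware covers that wait only by switching to other resident warps.

Code:
// Minimal illustration of a dependent instruction chain on a GPU.
// The add on 'b' cannot start until 'a' is ready; a GPU does not reorder
// around this, it hides the latency by running other warps meanwhile.
__global__ void dependent_chain(const int *x, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int a = x[i] % 2;   // result 'a' arrives after a multi-cycle latency
    int b = a + 5;      // depends on 'a': this warp stalls until 'a' is available
    out[i] = b;
}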

This also explains why, on CPUs, you effectively have just a handful of registers. Ignore the theoretical number of registers on CPUs: the hardware has been optimized to use a few registers heavily, and the rest are really slower.

A second major difference between CPUs and GPUs is the instruction set. GPUs are essentially 32-bit processors with a very limited set of instructions, while CPUs have tons of instructions in the x64 instruction set, not to mention AVX or even AVX2.

So on GPUs you really have few instructions,
and essentially nothing 64-bit.

At the architecture level, if you zoom in, multiplication is also a huge difference.
On a CPU, if you multiply two 64-bit registers in x64 ("c = a * b"), the full result of the multiplication is stored in two registers.

On a GPU, if you multiply a 32-bit integer by a 32-bit integer, you get 32 bits of output, so there are separate instructions for obtaining the high bits and the low bits.

In fact, on most GPUs, especially AMD's, getting the high bits used to be really slow: throughput 4x slower than the low bits.

Multiplication has always been a major bottleneck on CPUs; its throughput was always a big issue.
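
In CUDA the split described above is explicit: the ordinary * operator gives the low 32 bits of the product and the __umulhi() intrinsic gives the high 32 bits, so a full 64-bit product of two 32-bit values takes two operations. A minimal sketch (illustration only):

Code:
// 32x32 -> 64-bit product on a GPU: the low and high halves come from
// two separate operations (__umulhi is a standard CUDA intrinsic).
__global__ void mul_wide(const unsigned *a, const unsigned *b,
                         unsigned long long *prod, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned lo = a[i] * b[i];            // low 32 bits of the product
    unsigned hi = __umulhi(a[i], b[i]);   // high 32 bits, a separate instruction
    prod[i] = ((unsigned long long)hi << 32) | lo;
}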

Now we move to vectorisation. What we call a 'core' on a CPU has historically gone by different names on a GPU, but let's call them the SIMDs; some years ago Nvidia called them SMX.

On the CPU side, Intel has hyperthreading, which by the way I have turned off on my 44-core box, so only one thread runs on a core at a time and can use all of its execution units.

Yet that one thread can also execute an AVX instruction, so in reality multiple integers or fp64 (double-precision floating-point) values get processed at the same time in a vectorized manner, and a CPU core can also issue to several execution units at the same time.

Here comes the catch with GPUs: they work totally differently!

Let's take Nvidia as the example, since I don't have much info on AMD, but you can assume it works similarly:

A single 'core' there is what we'll call a SIMD. In reality it contains 128 'CUDA cores' on more modern Nvidia gamer GPUs. The minimum vector length of an 'instruction' is 32 lanes wide; we call that a warp.

So in my CUDA code I of course use the minimum length of 32 CUDA cores as a warp. In reality this is one instruction operating on a vector of 32 integers (if we execute integer code; doubles work the same). Yet in total, a single SIMD can 'multitask' up to 20-30 warps at the same time.

Of course there are good reasons to do it like this, as I already explained how dependencies work differently on GPUs than on CPUs.

A 'thread' on a GPU is something totally different from what it is on a CPU. Some marketeer thought it a good idea to make things look bigger than they are.

So what I call a warp, you would see presented as 32 'threads'; each lane of the vector is a so-called thread.

Strictly speaking this definition is impossible to sustain, as the 32 'threads' are not independent. They must execute the same instruction at the same time, simply because it's a vector. In CUDA and OpenCL each 'thread' doesn't get its own instruction; it works differently. Each instruction operates on the entire warp.

So seeing each warp as a bunch of independent threads is dead-wrong logic.

So strictly speaking you cannot define these as independent 'threads', yet that's how the marketing has presented it for quite some years now; OpenCL does this worst.

So there is a big gap between how you actually program a GPU and how the marketing material describes it as 'working'. It must be very confusing for those who want to write some GPU code...
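
To make the "one instruction drives the whole warp" point concrete, here is a minimal CUDA sketch (my own illustration): every statement in the kernel body is issued once per warp and executed by all 32 lanes in lockstep; the lanes differ only in the index they derive from threadIdx.

Code:
// SIMT in practice: the kernel body is a single instruction stream that is
// issued per warp; the 32 lanes of a warp execute each instruction together,
// differing only in the index each lane computes.
__global__ void scale(const double *in, double *out, double c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lane-specific index
    if (i < n)                      // lanes with i >= n are simply masked off
        out[i] = c * in[i];         // the same multiply, done by 32 lanes at once
}

This is also why branching hurts: if the lanes of one warp take different sides of a branch, the warp executes both paths with some lanes masked off, which is exactly the sense in which these are not 32 independent threads.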

In computing everything is always about bandwidth, though that's not necessarily memory bandwidth, nor cache bandwidth.

Basically, GPUs have very small caches and use them even less.

Each SIMD is more or less completely independent from the other SIMDs (if we ignore the distribution of instructions, which is of course a big thing in hardware).

And each SIMD reads directly from RAM.

Here again there is a huge difference between how caches work on CPUs and how GPUs function.

GPUs do everything vectorized, so a load triggered by an instruction preferably reads 32 integers (or 32 doubles) as one block.

Which CUDA core reads which index within the block doesn't matter, as long as within that block of 32 integers each CUDA core reads a different one from the other 31.

If each CUDA core instead reads from a totally different location far outside that block, say each CUDA core's address sits roughly 128 bytes away from every other's, then the GPU isn't going to fetch just 4 bytes: in reality, each CUDA core's 'read' pulls in an entire 128-byte line, bandwidth-wise.

So a single instruction will then generate 32 × 128 bytes = 4 KB of memory traffic, which is obviously 32 times slower than a single coalesced read.
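
The coalescing behaviour described above looks like this in code (again a sketch of my own, not from any GIMPS program): in the first kernel the 32 lanes of a warp touch 32 consecutive words and the hardware serves them with a handful of wide transactions, while in the second each lane lands in a different 128-byte segment and the warp drags in roughly 32 × 128 bytes for 128 bytes of useful data.

Code:
// Coalesced vs. scattered access by one warp (illustration only; assumes
// 'src' is large enough for the strided version).
__global__ void coalesced(const int *src, int *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];        // lanes 0..31 read adjacent words:
                                       // served by one or a few 128-byte transactions
}

__global__ void scattered(const int *src, int *dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i * 32];   // each lane lands 128 bytes from its neighbour:
                                       // ~32 separate 128-byte transactions per warp
}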

A difference between AMD and Nvidia is that the L1 cache is not accessible to programmers on AMD, yet it is on Nvidia.

Still, using it has a very limited effect. On paper it gives a lot of freedom, yet you always have this bandwidth consideration problem. 64 KB of L1 data cache might seem like a lot, but if you realize there can be 20 warps at the same time on a single SIMD, then you really must limit how much L1 data cache each warp uses, as otherwise you can run fewer warps at the same time.

The rule of thumb I use is to make sure at least 8 warps (each a vector of 32 CUDA cores) can run simultaneously on a single SIMD; that loads each SIMD more than enough.

Well, that's if I write the code: I of course avoid, as much as possible, writing stuff like a = x % 2; b = a + 2;

If you do not avoid writing code like that, you really might need 20 or more warps.
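
For what it's worth, the CUDA runtime can report how many blocks (and hence warps) of a given kernel fit on one SIMD/SM at a time, which is a direct way to check that rule of thumb. A minimal sketch using a trivial stand-in kernel (the real numbers depend on the kernel's register and shared-memory use):

Code:
// Query how many warps can be resident per SM for a given kernel and block
// size. cudaOccupancyMaxActiveBlocksPerMultiprocessor is a standard CUDA
// runtime call; 'dummy' is just a placeholder kernel for illustration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(const double *in, double *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0;
}

int main() {
    int threads_per_block = 256;   // 8 warps of 32 lanes per block
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, dummy,
                                                  threads_per_block, 0);
    printf("resident warps per SM: %d\n", blocks_per_sm * threads_per_block / 32);
    return 0;
}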

So with GPUs you must already take all sorts of things into account because of how the vectorisation works, and there are huge limitations on what is possible, because every instruction is in reality a vector of 32 or more.

With GPUs you read directly from the device RAM, whether that's GDDR5 or HBM2 or something newer.

You can safely ignore the fact that there are caches; they don't do nearly as much useful work on GPUs as they do on CPUs.

Also, the L2 cache of a GPU, which sits on the memory controller, is in most cases way too tiny, and read-only.

In general, a major difference between CPUs and GPUs is the coherence between the SIMDs: there hardly is any!

Whereas on CPUs you read from the same L3 cache, which you can see more as a fast shared SRAM than as a real L3 cache.

So one of the really huge differences between GPUs and CPUs is that on CPUs there are all sorts of complicated cache coherency protocols and other protocols to let multiple CPU cores cooperate and share data in a common caching system, whereas on GPUs it is much more a case of every SIMD for itself.

In short, it doesn't really matter how many SIMDs there are on a GPU. It's easy to have a tad more or a tad less; no problem at all. If one turns out defective during production, no problem, just turn it off, whereas on a CPU all cores must function properly because they are all needed for the cache coherency to work correctly.

So adding more cores is a big issue in the CPU world and no problem at all in GPU land.
GPUs are manycore designs and CPUs are not.

Note that AMD found a trick around all this (not a new trick, but very well executed in the Threadrippers). Basically, the 64-core Threadripper would be impossible to produce as one monolithic CPU.

Instead it's a layered concept: it consists of eight 8-core chiplets (CCDs) plus a central bridge part, a ninth die, all put into one package and sold as a 64-core Threadripper. A brilliant concept that works around the problem of massive cache coherency. Basically it is an 8-socket system on a single package.

So there are different layers of coherency inside that package.

(Edit: the underlying reason this is a clever concept has to do with manufacturing. If you produced a true monolithic 64-core CPU, maybe 1 in 100 dies would be OK, which is what we call a low yield percentage, as the other 99 you'd have to throw away during production. You want a yield far, far above 80% to make good profits as a company (in this case TSMC producing for AMD, and for Nvidia as well), so if you produce 8-core CCDs you can achieve good yields, which is the brilliance of this concept. With GPUs, meanwhile, it's easy to produce a giant chip, because every SIMD that's defective you simply turn off; so GPUs no longer have a yield advantage over CPUs now.)

The central bridge has its own cache coherency. A brilliant concept, yet all totally different from how GPUs work. It basically shrinks the gap in crunching power between GPUs and CPUs to the point where I would buy CPUs rather than GPUs, and where I would definitely argue that the dominance of GPUs over CPUs in crunching power is now limited to just a very few areas. The difference is so small now, and price-wise CPUs are faster in performance per dollar than GPUs.

There are just a few very specific areas where GPUs have a big edge. For example, if you are a neural-network researcher and want to run your ANN fast, Nvidia has GPUs with specialized hardware that boosts your neural net a lot, roughly a factor of 5 over their own gaming GPUs for ANNs.

Yet that is a very limited market, of course. In all other markets I'd no longer use GPUs.

To give a good example: the latest Tesla GPU from Nvidia is $10k (if you can get it for that) and delivers 10 TFLOPS of FP64 on paper, versus the latest AMD Threadripper I checked out (maybe there is a newer one), which does 5 TFLOPS of FP64 at $3,000.

And with that Threadripper you can run all sorts of software, whereas for that GPU you will need to write your own software; in short, nearly nothing will run on it.

Last fiddled with by diep on 2021-11-12 at 16:31
Old 2021-11-16, 04:41   #6
bentonsar
 
"Sarah B"
Aug 2021
Washington, Seattle


Thanks everybody for the input. Would an increase in FFT speed lead to an exponential or linear speed increase for the overall PRP test?
Old 2021-11-21, 04:04   #7
VBCurtis
 
"Curtis"
Feb 2005
Riverside, CA


What kind of change in FFT speed? Or, what part of the FFT changes in your hypothetical?

I can't tell if you have something specific in mind (like a software optimization, or FPGA use) but are asking a general question, or are asking unspecific questions about speed rather than asking (or reading) about how the FFT is implemented.
Old 2021-11-21, 12:41   #8
diep
 
Sep 2006
The Netherlands


Quote:
Originally Posted by bentonsar View Post
Thanks everybody for the input. Would an increase in FFT speed lead to an exponential or linear speed increase for the overall PRP test?
If you first run something in 2X seconds, and then you run it in X seconds because you take each step twice as fast, that's a linear speedup.

However, what we've noticed in recent years is an architectural improvement by a factor of 8 in theoretical throughput. Going from my L5420 Xeon, which has a theoretical throughput of 4 flops/clock per core, to Haswell and Broadwell, I see an increase of about a factor of 3 at the same clock on one core. So we're missing roughly a factor of 2 in performance for the gwnum library with AVX, compared to the theoretical improvement over those old Xeons.

Now, I'm sure there are good explanations for all this, but those are the facts.

An increase in FFT size, however, gives a more-than-quadratic slowdown in FFT or PRP timings. Note that PRP timings are about 5% slower than DWT (FFT) timings; every fraction of a percent of performance has been optimized to the limit.
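
If we read that as the FFT length growing together with the exponent being tested, a rough model shows why: the per-iteration FFT cost grows about as n·log n in the transform length n, and the number of squarings grows linearly with the exponent, so the total test time grows a bit faster than quadratically. A sketch with made-up but plausible numbers (the exponents and FFT lengths below are assumptions for illustration only):

Code:
// Rough scaling model, not a benchmark: total PRP time ~ iterations * cost
// per iteration ~ p * n*log2(n), with n roughly proportional to p.
#include <cstdio>
#include <cmath>

double relative_test_time(double exponent, double fft_len) {
    return exponent * fft_len * log2(fft_len);
}

int main() {
    double t1 = relative_test_time(60e6,  3.2e6);   // hypothetical smaller test
    double t2 = relative_test_time(120e6, 6.4e6);   // exponent and FFT both doubled
    printf("doubling the exponent costs ~%.1fx the time\n", t2 / t1);  // ~4.2x
    return 0;
}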

Edit: please note that on Haswell/Broadwell this factor-3 difference gets a lot larger at larger transform (FFT) sizes, as the off-chip DDR2 of the L5420 doesn't scale as well as the registered ECC DDR4 I have on the Xeon E5-2699 v4 ES CPUs over here. So it quickly becomes a factor-5 difference, clock for clock, for that Broadwell core, whereas theoretically it would be a factor of 8.

Now, GPUs hardly have caches, so the bandwidth of their RAM has to be much higher than for CPUs.

Generally speaking, the dependence on RAM grows as the working-set size increases, in short, as the FFT size increases. Today's Mersenne candidates are so huge that I'm sure bandwidth to RAM is everything.

Whereas what I benchmarked above was testing prime candidates of just over 1 megabit each, and that factor of 5 appears at roughly the 6-megabit size level.

Last fiddled with by diep on 2021-11-21 at 12:46