![]() |
![]() |
#1 |
Bemusing Prompter
"Danny"
Dec 2002
California
5×499 Posts |
![]()
I know that in many cases, the speed of a LL test depends on not so much the CPU speed, but rather the memory bandwidth. Assuming that the latter isn't an issue, how much of a speedup can we get with current-generation processors? I recall someone saying that their Tesla GPU was using only 25% of its processing power, but does anyone know if the figures are similar in the case of CPUs?
Suddenly, the prospects of "hybrid memory cubes" and DDR4 are starting to sound attractive... Last fiddled with by ixfd64 on 2011-11-13 at 19:45 |
![]() |
![]() |
![]() |
#2 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts |
![]()
I've also been wondering how much the cache sizes and memory bandwidth affect LL tests. Would it improve performance to get higher bandwidth memory? (And for the same frequency, would a 2600k with 8MB L3 be faster than a 2500k with 6MB L3, for example?)
|
![]() |
![]() |
![]() |
#3 |
Dec 2010
Monticello
5·359 Posts |
![]()
Let's suppose for a minute that memory speed is infinite, and the CPU core multiply is the only limitiation. For a 50M LL test, we need 50M operations of square and subtract 2, modulo 2^50M. Each squaring operation costs 2 FFTs of byte length 50M/8 = 6M. The 6M FFT costs 6M*log2(6M) byte multiplies; we'll assume the accumulations come for free as they are done in parallel with the slower multiplies. Latest CPUs run 128 bit registers (I THINK, could be wrong, certainly 64), so that divides my 6Ms by 16, giving 375K * log2(375K) = 375K * 22 multiplies per FFT = 8.25M 128 bit multiplies per FFT.
8M multiplies per FFT *2 FFTs per LL step *50M LL steps = 800 * 10^12 128 bit multiplies Assume a (theoretical) 1 GHz machine -- it will take it 16 clock steps to do that 128 bit multiply, so we end up with 16 * 800 * 10^12 clocks to do it...12800 seconds, about 3.5 hours. Now, I need someone to come poke holes in my numbers...and remind me where I lost a zero. **************** LL tests are specifically designed to be friendly to caches, so more L3 cache is better, all else being equal. |
![]() |
![]() |
![]() |
#4 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
3×29×83 Posts |
![]()
How much better? Would the extra 2 MB be a 2% or 6% increase?
|
![]() |
![]() |
![]() |
#5 | |
Jun 2003
2·2,719 Posts |
![]() Quote:
Prime95 uses doubles (64 bit floating points) as the FFT "limbs" -- not bytes. It stores about 19 bits of data in a single limb. So that is around 50M/19 ~ 2.6M limbs. Since we can't do arbitrary sized FFT, we use the next higher size FFT - 3M. So, the opcount should be 3M * log2(3M) [But this is an approximation -- to get an exact count, we need to look at the actual FFT operations]. Double that for inverse FFT. Then 3M point-wise multiplications for the actual squaring. So roughly 3M + 2*3M*log2(3M). EDIT:- Doing parallel operations using SSE2 merely reduces the number of clock cycles -- doesn't reduce the op count. EDIT2:- So opcount per iteration for a 3M FFT ~ 139 Mflops. At 4 ops/cyc, a 2GHz processor would take 0.0174 secs. That is the fastest you can theoretically execute if memory were not a bottleneck. An Intel Core 2 Duo @ 2GHz takes 0.084 secs as per the benchmark page. Which is a factor of 5 from the "theoretical" calculation -- this cannot be accounted by memory issues alone. I must be really off in my opcount ![]() Last fiddled with by axn on 2011-11-15 at 05:08 |
|
![]() |
![]() |
![]() |
#6 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
722110 Posts |
![]()
I think it's the approximation part. Correct me if I'm wrong, but aren't you using the complexity there? If so, there's multiplier you're missing which could be 100, 10000000, or whatever.
|
![]() |
![]() |
![]() |
#7 | |
Jun 2003
2·2,719 Posts |
![]() Quote:
FFT algorithm is regular enough that we can count exactly the number of ops. There would be a multiplier, but it'll be a small one (say, 2). But there would also be other factors, and we should be able to count all of them. Anyways, I think I'm off by a factor of 2, at the most. |
|
![]() |
![]() |
![]() |
#8 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
160658 Posts |
![]()
Huh. I thought I remember reading somewhere it was closer to 5... but you at least have calculations, which is better than me.
![]() ![]() |
![]() |
![]() |
![]() |
#9 | |
Jan 2008
France
22·149 Posts |
![]() Quote:
An interesting thing to do would be to run Prime95 through a simulator that can count instructions by category ![]() |
|
![]() |
![]() |
![]() |
#10 | ||
Jun 2003
2·2,719 Posts |
![]() Quote:
Quote:
Like I said, FFT is very regular (no coditional jumps, etc.) You can do a static count based on algorithm. In fact, I think George may already have done this exercise. |
||
![]() |
![]() |
![]() |
#11 | |
Jan 2008
France
22·149 Posts |
![]() Quote:
Anyway I trust you and so will assume that the ld/st count is always less than the rest ![]() |
|
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
A Theoretical (vs. Proficient/Practical) Deterministic Primality Test | a1call | Miscellaneous Math | 194 | 2018-03-19 05:54 |
The Limitations of the Cranial GPU | Flatlander | Science & Technology | 3 | 2013-06-13 13:34 |
Maximum theoretical MPG | TimSorbet | Lounge | 9 | 2008-07-14 22:45 |
Custom test for maximum CPU stress | Torpedo | Information & Answers | 10 | 2007-10-05 17:33 |
LL test speed up? | jebeagles | Miscellaneous Math | 16 | 2006-01-04 02:43 |