#1
Dec 2008
Boycotting the Soapbox
2⁴·3²·5 Posts
I *think* there is a possibility of speeding up LM-tests by a factor of 2-4 using pure integer arithmetic and a hybrid of FFTs and number-theoretic transforms.

To see whether this is possible, I need to know how much time the processor spends in the innermost loops, so I've attached a small program that measures the clocks taken for arithmetic on two different (and incompatible) types of data structures. Using 64-bit general-purpose registers is about 40% faster on an Athlon64, but I have no idea how other processors (Core, Core 2, i2c, Phenom) perform. I suspect that the SSE2 version will be faster on these, which would have the bonus of not needing a 64-bit operating system.

Anybody who can compile and run programs can help me by doing the following:

1. Download the attached "speedtest.cpp.bz2" and bunzip2 it.
2. Compile with "g++ -O3 -o speedtest speedtest.cpp".
3. Run it several times (10x or so), record the fastest results, and post them in this thread like this:

(Processor: Athlon64)
Clocks/Element using 64-bit GPRs: 1.5332
Clocks/Element using SSE2: 2.5415

Thank you!

P.S. Even if it is possible to tweak the assembly routines to get out another 10%, part of the plan is to tap that 10% later on by squeezing in instructions to prefetch other data into the caches.
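The "run it several times and record the fastest result" step can be sketched roughly as follows. This is not the code from the attachment: `best_of` and `ns_per_element` are hypothetical names, the workload is a plain element-wise 64-bit addition, and the portable `std::chrono::steady_clock` is used, so the figure is nanoseconds per element rather than CPU clocks.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not from speedtest.cpp): call `measure` `runs` times
// and keep the smallest value. The fastest run is the one least disturbed
// by interrupts, cold caches and other processes.
template <typename Measure>
double best_of(int runs, Measure measure) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        best = std::min(best, measure());
    }
    return best;
}

// Hypothetical workload: x[i] += y[i] over 64-bit elements.
// Assumes x and y have the same, non-zero length.
// Returns nanoseconds per element, NOT clocks per element.
double ns_per_element(std::vector<std::uint64_t>& x,
                      const std::vector<std::uint64_t>& y) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] += y[i];
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(x.size());
}
```

A call such as `best_of(10, [&] { return ns_per_element(x, y); })` would then mirror the "10x or so" instruction. Multiplying the result by the clock frequency in GHz gives an approximate clocks/element figure.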
#2
Mar 2007
Austria
2×151 Posts
On my Q6600 I get (with -O3 turned on during compilation):
Clocks/Element using 64-bit GPRs: 3.14648
Clocks/Element using SSE2: 2.87842

Is that worse than the Athlon?
#3
Oct 2008
California
2²·59 Posts
Oh, you have to have Linux...
(I tried compiling in Dev-C++, but it gave me errors.)
#4
Dec 2008
Boycotting the Soapbox
1320₈ Posts
Just in case I've been cryptic about how to interpret the results: the lower the "Clocks/Element", the better.

I figured the GPR version would be slower on Intel processors, because the add-with-carry (adc) instruction has a latency of 2 cycles with a throughput of 1 'adc' per clock. On Athlons 'adc' has a latency of only 1 cycle with a throughput of 3 per clock (theoretically at least; in practice every adc depends on the state of the carry flag and the instruction itself modifies the carry flag, so only one 'adc' can actually be done per clock). No surprise that they are at least twice as fast.

The SSE2 performance is a disappointment. Core 2 has 128-bit SSE units, so I had hoped that they would be twice as fast as Athlons, which process SSE instructions in 2 blocks of 64 bits. Probably the latency is again higher here for Core 2. Currently the SSE loop processes four 64-bit integers in parallel (in 2 registers). Let's see what happens when arithmetic is done on eight values (in 4 registers) in parallel. I'll upload a version that does that today or tomorrow.

The bottleneck for SSE on Athlon64s is that they can only do 1.5 SSE instructions per clock. The inner SSE2 loop has 12 instructions to process 4 elements, so the optimum would be 12 / 1.5 / 4 = 2 clocks/element. My prediction is that Phenoms will do the SSE2 version in 1.27 clocks (i.e. faster than the GPR version), because AMD simply doubled everything SSE related.

Last fiddled with by __HRB__ on 2009-01-01 at 00:32 Reason: Tried to improve clarity
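To make the carry-flag dependency concrete, here is a minimal portable sketch of a carry chain over 64-bit limbs. It is only an illustration under assumptions: the attachment's actual loop and data layout are not shown in this thread, and `add_with_carry` is a hypothetical name. The point is that each iteration needs the carry produced by the previous one, which is why 'adc' cannot run faster than one per clock no matter how many execution units exist.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration of what a chain of 'adc' instructions computes.
// Adds y into x limb by limb (least significant limb first) and returns the
// final carry (0 or 1). Assumes x and y have the same length.
std::uint64_t add_with_carry(std::vector<std::uint64_t>& x,
                             const std::vector<std::uint64_t>& y) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::uint64_t sum = x[i] + y[i];         // may wrap around
        const std::uint64_t c1 = sum < x[i];           // carry out of x + y
        const std::uint64_t with_carry = sum + carry;  // needs previous carry
        const std::uint64_t c2 = with_carry < sum;     // carry out of + carry
        x[i] = with_carry;
        carry = c1 | c2;  // at most one of the two can be set
    }
    return carry;
}
```

For example, adding {1, 0} to {2^64-1, 2^64-1} leaves both limbs at 0 and returns a carry of 1, because the carry out of the low limb ripples through the high limb.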
#5
Dec 2008
Boycotting the Soapbox
2D0₁₆ Posts
#6
May 2005
2²·11·37 Posts
Output from Q9450 @ 3.2GHz
Code:
Clocks/Element using 64-bit GPRs: 2.07031
Clocks/Element using 64-bit GPRs: 2.07812
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03906
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using 64-bit GPRs: 2.03125
Clocks/Element using SSE2: 1.98242
Clocks/Element using SSE2: 1.86523
Clocks/Element using SSE2: 1.84961
Clocks/Element using SSE2: 1.84961
Clocks/Element using SSE2: 1.84961
Clocks/Element using SSE2: 1.85156
Clocks/Element using SSE2: 1.86133
Clocks/Element using SSE2: 1.84961

Last fiddled with by Cruelty on 2009-01-01 at 01:20
#7
Dec 2008
LA, CA
5 Posts
Code:
/home/euser/Desktop/speedtest.cpp:47: error: ‘__m128i’ does not name a type
/home/euser/Desktop/speedtest.cpp: In member function ‘void SSE2::operator+=(SSE2&)’:
/home/euser/Desktop/speedtest.cpp:51: error: ‘__m128i’ was not declared in this scope
/home/euser/Desktop/speedtest.cpp:51: error: expected `;' before ‘discard’
/home/euser/Desktop/speedtest.cpp:76: error: ‘discard’ was not declared in this scope
/home/euser/Desktop/speedtest.cpp:77: error: ‘X’ was not declared in this scope
/home/euser/Desktop/speedtest.cpp:77: error: ‘struct SSE2’ has no member named ‘X’
/home/euser/Desktop/speedtest.cpp:78: error: lvalue required in asm statement
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 0
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 1
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 2
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 3
/home/euser/Desktop/speedtest.cpp:78: error: invalid lvalue in asm output 4
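The first error means the compiler never saw a declaration of `__m128i`; the later ones mostly cascade from it. With GCC that type is declared in `<emmintrin.h>`, and on 32-bit x86 targets SSE2 usually has to be enabled explicitly with `-msse2`. Whether either of these is what goes wrong here is a guess, since the source of speedtest.cpp is not shown in the thread. The sketch below only demonstrates the general pattern; `add_two_u64` is a hypothetical function and the `__SSE2__` check relies on the macro that GCC and Clang define when SSE2 is enabled.

```cpp
#include <cstdint>

#if defined(__SSE2__)   // defined by GCC/Clang when SSE2 code generation is on
#include <emmintrin.h>  // declares __m128i and the _mm_* SSE2 intrinsics
#endif

// Hypothetical example: out[i] = a[i] + b[i] for i = 0, 1 (wrapping on
// overflow), using one 128-bit SSE2 addition when available.
void add_two_u64(const std::uint64_t* a, const std::uint64_t* b,
                 std::uint64_t* out) {
#if defined(__SSE2__)
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi64(va, vb));
#else
    // Scalar fallback when the compiler was not told to target SSE2.
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
#endif
}
```

On a 32-bit system, compiling something like this with `g++ -O3 -msse2` would take the SSE2 path; on x86-64 SSE2 is part of the baseline, so no extra flag is needed there.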
#8
Dec 2008
Boycotting the Soapbox
2⁴·3²·5 Posts
While I was modifying the code to do 4-way and 8-way SSE2, I found an embarrassing copy & paste error which might have been responsible for the weak performance on Core 2. Performance is unchanged on Athlon64.

Please try the attached code for the new 'speedtest', which doesn't have the mistake in the 4-way SSE code and includes modified code for 8-way parallelism. Sorry about the goof.

1. Download & bunzip2.
2. g++ -O3 -o speedtest speedtest-v2.cpp
3. Run several times and post the lowest Clocks/Element for GPR, SSE2 (4-way) and SSE2 (8-way).

P.S. What do I have to do to get rid of the original attachment in the first post?

P.P.S. 1.85 clocks on the Q9450 looks promising...

Last fiddled with by __HRB__ on 2009-01-01 at 01:59
#9
May 2005
2²·11·37 Posts
Voila:
Code:
Speedtest 0.2
Clocks/Element using 64-bit GPRs (1-Way): 2.0625
Clocks/Element using 64-bit GPRs (1-Way): 2.07031
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using 64-bit GPRs (1-Way): 2.03906
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using 64-bit GPRs (1-Way): 2.02344
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using 64-bit GPRs (1-Way): 2.03125
Clocks/Element using SSE2 (4-Way): 1.74805
Clocks/Element using SSE2 (4-Way): 1.55859
Clocks/Element using SSE2 (4-Way): 1.5625
Clocks/Element using SSE2 (4-Way): 1.54883
Clocks/Element using SSE2 (4-Way): 1.55078
Clocks/Element using SSE2 (4-Way): 1.55664
Clocks/Element using SSE2 (4-Way): 1.55078
Clocks/Element using SSE2 (4-Way): 1.55664
Clocks/Element using SSE2 (8-Way): 1.69238
Clocks/Element using SSE2 (8-Way): 1.52539
Clocks/Element using SSE2 (8-Way): 1.52246
Clocks/Element using SSE2 (8-Way): 1.54492
Clocks/Element using SSE2 (8-Way): 1.52051
Clocks/Element using SSE2 (8-Way): 1.51953
Clocks/Element using SSE2 (8-Way): 1.54102
Clocks/Element using SSE2 (8-Way): 1.53027
#10
Dec 2008
LA, CA
5 Posts
Still not working on Core 2 Duo... Is it because you designed it for quad cores?
#11
Oct 2008
California
354₈ Posts
Line 47: '__m128i' does not name a type
In member function 'void SSE2_4::operator+=(SSE2_4&)':
Line 51: '__m128i' undeclared (first use this function)
(Each undeclared identifier is reported only once for each function it appears in)
Line 51: expected ';' before "discard"
Line 76: 'discard' undeclared (first use this function)
Line 77: 'X' undeclared (first use this function)
Line 77: 'struct SSE2_4' has no member named 'X'
Line 77: At global scope:
Line 85: '__m128i' does not name a type
In member function 'void SSE2_8::operator+=(SSE2_8&)':
Line 89: '__m128i' undeclared (first use this function)
Line 89: expected ';' before "discard"
Line 127: 'discard' undeclared (first use this function)
Line 129: 'X' undeclared (first use this function)
Line 129: 'struct SSE2_8' has no member named 'X'

Here is the compiler output:
Code:
Compiler: Default compiler
Executing g++.exe...
g++.exe "D:\Document\My Documents\speedtest-v2.cpp" -o "D:\Document\My Documents\speedtest-v2.exe" -O3 -o -I"C:\Dev-Cpp\lib\gcc\mingw32\3.4.2\include" -I"C:\Dev-Cpp\include\c++\3.4.2\backward" -I"C:\Dev-Cpp\include\c++\3.4.2\mingw32" -I"C:\Dev-Cpp\include\c++\3.4.2" -I"C:\Dev-Cpp\include" -L"C:\Dev-Cpp\lib"
D:\Document\My Documents\speedtest-v2.cpp:47: error: `__m128i' does not name a type
D:\Document\My Documents\speedtest-v2.cpp: In member function `void SSE2_4::operator+=(SSE2_4&)':
D:\Document\My Documents\speedtest-v2.cpp:51: error: `__m128i' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:51: error: (Each undeclared identifier is reported only once for each function it appears in.)
D:\Document\My Documents\speedtest-v2.cpp:51: error: expected `;' before "discard"
D:\Document\My Documents\speedtest-v2.cpp:76: error: `discard' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:77: error: `X' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:77: error: 'struct SSE2_4' has no member named 'X'
D:\Document\My Documents\speedtest-v2.cpp: At global scope:
D:\Document\My Documents\speedtest-v2.cpp:85: error: `__m128i' does not name a type
D:\Document\My Documents\speedtest-v2.cpp: In member function `void SSE2_8::operator+=(SSE2_8&)':
D:\Document\My Documents\speedtest-v2.cpp:89: error: `__m128i' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:89: error: expected `;' before "discard"
D:\Document\My Documents\speedtest-v2.cpp:127: error: `discard' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:129: error: `X' undeclared (first use this function)
D:\Document\My Documents\speedtest-v2.cpp:129: error: 'struct SSE2_8' has no member named 'X'
Execution terminated