![]() |
![]() |
#1 |
Mar 2003
New Zealand
13×89 Posts |
![]()
These times are for a 2.66GHz P4 (8Kb L1, 512Kb L2) running Windows XP sieving the current combined SoB.dat file at p=100e12:
Code:
proth_sieve_sse2 0.42 190 kp/s JJsieveSSE2 PRS52 v0.102b 192 kp/s sr2sieve-intel 1.4.39 266 kp/s To run a benchmark, download sr2sieve-1.4.39-windows-x86.zip here and in the same directory as SoB.dat run `sr2sieve -s -p100e12 -P101e12'. My understanding is that proth_sieve and JJsieve have a better algorithm than sr2sieve, and that the speed difference is due to the SSE2 assembler used for modular multiplications. If anyone is interested in this code it is in the asm-i386-gcc.h file in sr5sieve-1.4.39.tar.gz here. (see the VEC4_* macros). I think there is still room for improvement, perhaps by interleaving 8 mulmods per pass instead of 4, or finding some way to use the SSE2 floating point unit instead of the FPU. |
![]() |
![]() |
![]() |
#2 |
Jun 2005
373 Posts |
![]()
I will run the bench on my 1.83GHz Core2Duo as soon as I get home. I can run it with clockspeed throttled as well (both internal and external), if you are interested. H.
|
![]() |
![]() |
![]() |
#3 |
Feb 2007
100001112 Posts |
![]()
Opteron 165 @ 2840 MHz, combined .dat
sr2sieve-amd.exe sr2sieve started: 991 <= n <= 49999997, 100000000000000 <= p <= 101000000000000 p=100000103547551, 504004 p/sec, 0 factors, 0.01% done, ETA 14 May 11:54 sr2sieve stopped: at p=100000132329451 because SIGINT was received. Found factors for 0 terms in 296.922 cpu sec. (expected about 0.10) sr2sieve-intel.exe sr2sieve started: 991 <= n <= 49999997, 100000132329451 <= p <= 101000000000000 p=100000253047711, 504963 p/sec, 0 factors, 0.03% done, ETA 10 May 16:24 sr2sieve stopped: at p=100000272376483 because SIGINT was received. Found factors for 0 terms in 575.375 cpu sec. (expected about 0.20) JJsieveSSE2 pmin=100000024999967 @ 511kp/s pmin=100000049999999 @ 536kp/s pmin=100000074999941 @ 536kp/s pmin=100000099999937 @ 535kp/s pmin=100000124999969 @ 536kp/s I set the priority to realtime in the taskmanager after I started the siever. |
![]() |
![]() |
![]() |
#4 |
Jun 2005
5658 Posts |
![]()
Core2Duo, 1.83 GHz. One bar of 1GB only.
Code:
Intel AMD One core alone 426 416 425 417 425 417 Two cores 418/421 409/413 416/419 408/411 409/413 Competing against 418 408 each other 410 419 Last fiddled with by hhh on 2007-04-17 at 21:51 |
![]() |
![]() |
![]() |
#5 |
Mar 2003
New Zealand
13·89 Posts |
![]()
Thanks KriZp and hhh for the benchmarks :-)
I have also tested on a P4 Celeron D (16Kb L1, 256Kb L2), I find it hard to get repeatable benchmarks for either program but on average the sr2sieve times are no slower than the JJsieve ones. Just the fact that sr2sieve gets within 10% of JJsieve without the benefit of the SPH algorithm suggests that there might be something to gain if the SSE2 assembler from sr2sieve could be added to JJsieve. The SSE2 routines are used in two situations: 1. An array A is filled with elements A[0] = x (mod p), A[1] = x^2 (mod p) ... A[n] = x^n (mod p). 2. Each element in an existing array A is multiplied by a constant b (mod p). |
![]() |
![]() |
![]() |
#6 |
Jul 2005
2×193 Posts |
![]()
Great work Geoff.
I'm beginning to wonder if SPH really is worth the trouble it is. I wrote a whole load of stuff about sieving (Riesel/Sierpinski, DL problem, BSGS and SPH) over at the SoB forum: http://www.free-dc.org/forum/showthread.php?t=10262 The "problem" with SPH is that you have to find, and keep track of factors of p-1 (where p is the current value you are sieving). But in doing this you use reasonable chunks of memory and also start to overuse the cache. I'm sure there's a better balance to be had but it's so tricky working towards it. I need to go back and see if just looking at powers of 2 in p-1 using SPH would help "whack" as many candidate k,p pairs before BSGS is required. |
![]() |
![]() |
![]() |
#7 | ||
Mar 2003
New Zealand
13·89 Posts |
![]() Quote:
I have found that on the P4, computing every element of the array x,x^2,...,x^n (mod p) with SSE2 is faster than computing 40% of the elements using an addition ladder without SSE2, but that will be different for different machines. Quote:
sr2sieve only uses small factors of p-1 implicitly: to check whether k is a cubic residue for p it needs first to check whether p=1 (mod 3). That is equivalent to determining whether 3 is a factor of p-1, but since only 3,4,5,8,9,16-power residues are checked, all the needed factors can be found by a single lookup into a table indexed by p%720. |
||
![]() |
![]() |
![]() |
#8 | |
Jun 2003
3·232 Posts |
![]() Quote:
Do you think this will be faster? For prime=2, I think you are using the quadratic reciprocity law. Note, that these laws exist for higher powers also and can be used to replace the SPH algorithm. Google "Rational reciprocity laws". ![]() http://mathworld.wolfram.com/ReciprocityTheorem.html (A good place to start) Last fiddled with by Citrix on 2007-04-19 at 04:57 |
|
![]() |
![]() |
![]() |
#9 |
Mar 2003
New Zealand
13·89 Posts |
![]()
While these times are accurate for the Northwood P4, it appears that they may not be valid for other P4 models. Sorry for jumping to conclusions. I am going to try to work out exactly what is making the code run so much faster on Northwood processors than others.
|
![]() |
![]() |
![]() |
#10 |
Aug 2002
3·52·7 Posts |
![]()
Could you make the 1.4.34.tar.gz source available?
|
![]() |
![]() |
![]() |
#11 |
Mar 2003
New Zealand
13×89 Posts |
![]() |
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Slow Transfer Speeds with Windows 7 | pinhodecarlos | Soap Box | 2 | 2016-10-19 19:52 |
Core i7 memory speeds | nucleon | Hardware | 9 | 2009-09-17 11:47 |
LL tests running at different speeds | GARYP166 | Information & Answers | 11 | 2009-07-13 19:39 |
sieving speeds for Intels | jasong | Sierpinski/Riesel Base 5 | 11 | 2007-08-09 00:15 |
Factoring Speeds | Khemikal796 | Lone Mersenne Hunters | 5 | 2005-04-26 20:28 |