20070417, 02:03  #1 
Mar 2003
New Zealand
13·89 Posts 
SSE2 sieve speeds
These times are for a 2.66GHz P4 (8Kb L1, 512Kb L2) running Windows XP sieving the current combined SoB.dat file at p=100e12:
Code:
proth_sieve_sse2 0.42 190 kp/s JJsieveSSE2 PRS52 v0.102b 192 kp/s sr2sieveintel 1.4.39 266 kp/s To run a benchmark, download sr2sieve1.4.39windowsx86.zip here and in the same directory as SoB.dat run `sr2sieve s p100e12 P101e12'. My understanding is that proth_sieve and JJsieve have a better algorithm than sr2sieve, and that the speed difference is due to the SSE2 assembler used for modular multiplications. If anyone is interested in this code it is in the asmi386gcc.h file in sr5sieve1.4.39.tar.gz here. (see the VEC4_* macros). I think there is still room for improvement, perhaps by interleaving 8 mulmods per pass instead of 4, or finding some way to use the SSE2 floating point unit instead of the FPU. 
20070417, 09:26  #2 
Jun 2005
373_{10} Posts 
I will run the bench on my 1.83GHz Core2Duo as soon as I get home. I can run it with clockspeed throttled as well (both internal and external), if you are interested. H.

20070417, 14:43  #3 
Feb 2007
207_{8} Posts 
Opteron 165 @ 2840 MHz, combined .dat
sr2sieveamd.exe sr2sieve started: 991 <= n <= 49999997, 100000000000000 <= p <= 101000000000000 p=100000103547551, 504004 p/sec, 0 factors, 0.01% done, ETA 14 May 11:54 sr2sieve stopped: at p=100000132329451 because SIGINT was received. Found factors for 0 terms in 296.922 cpu sec. (expected about 0.10) sr2sieveintel.exe sr2sieve started: 991 <= n <= 49999997, 100000132329451 <= p <= 101000000000000 p=100000253047711, 504963 p/sec, 0 factors, 0.03% done, ETA 10 May 16:24 sr2sieve stopped: at p=100000272376483 because SIGINT was received. Found factors for 0 terms in 575.375 cpu sec. (expected about 0.20) JJsieveSSE2 pmin=100000024999967 @ 511kp/s pmin=100000049999999 @ 536kp/s pmin=100000074999941 @ 536kp/s pmin=100000099999937 @ 535kp/s pmin=100000124999969 @ 536kp/s I set the priority to realtime in the taskmanager after I started the siever. 
20070417, 21:47  #4 
Jun 2005
373 Posts 
Core2Duo, 1.83 GHz. One bar of 1GB only.
Code:
Intel AMD One core alone 426 416 425 417 425 417 Two cores 418/421 409/413 416/419 408/411 409/413 Competing against 418 408 each other 410 419 Last fiddled with by hhh on 20070417 at 21:51 
20070418, 04:11  #5 
Mar 2003
New Zealand
13×89 Posts 
Thanks KriZp and hhh for the benchmarks :)
I have also tested on a P4 Celeron D (16Kb L1, 256Kb L2), I find it hard to get repeatable benchmarks for either program but on average the sr2sieve times are no slower than the JJsieve ones. Just the fact that sr2sieve gets within 10% of JJsieve without the benefit of the SPH algorithm suggests that there might be something to gain if the SSE2 assembler from sr2sieve could be added to JJsieve. The SSE2 routines are used in two situations: 1. An array A is filled with elements A[0] = x (mod p), A[1] = x^2 (mod p) ... A[n] = x^n (mod p). 2. Each element in an existing array A is multiplied by a constant b (mod p). 
20070418, 10:44  #6 
Jul 2005
110000010_{2} Posts 
Great work Geoff.
I'm beginning to wonder if SPH really is worth the trouble it is. I wrote a whole load of stuff about sieving (Riesel/Sierpinski, DL problem, BSGS and SPH) over at the SoB forum: http://www.freedc.org/forum/showthread.php?t=10262 The "problem" with SPH is that you have to find, and keep track of factors of p1 (where p is the current value you are sieving). But in doing this you use reasonable chunks of memory and also start to overuse the cache. I'm sure there's a better balance to be had but it's so tricky working towards it. I need to go back and see if just looking at powers of 2 in p1 using SPH would help "whack" as many candidate k,p pairs before BSGS is required. 
20070419, 03:30  #7  
Mar 2003
New Zealand
2205_{8} Posts 
Quote:
I have found that on the P4, computing every element of the array x,x^2,...,x^n (mod p) with SSE2 is faster than computing 40% of the elements using an addition ladder without SSE2, but that will be different for different machines. Quote:
sr2sieve only uses small factors of p1 implicitly: to check whether k is a cubic residue for p it needs first to check whether p=1 (mod 3). That is equivalent to determining whether 3 is a factor of p1, but since only 3,4,5,8,9,16power residues are checked, all the needed factors can be found by a single lookup into a table indexed by p%720. 

20070419, 04:48  #8  
Jun 2003
3×23^{2} Posts 
Quote:
Do you think this will be faster? For prime=2, I think you are using the quadratic reciprocity law. Note, that these laws exist for higher powers also and can be used to replace the SPH algorithm. Google "Rational reciprocity laws". http://mathworld.wolfram.com/ReciprocityTheorem.html (A good place to start) Last fiddled with by Citrix on 20070419 at 04:57 

20070420, 03:19  #9 
Mar 2003
New Zealand
2205_{8} Posts 
While these times are accurate for the Northwood P4, it appears that they may not be valid for other P4 models. Sorry for jumping to conclusions. I am going to try to work out exactly what is making the code run so much faster on Northwood processors than others.

20070421, 13:59  #10 
Aug 2002
525_{10} Posts 
Could you make the 1.4.34.tar.gz source available?

20070422, 03:03  #11 
Mar 2003
New Zealand
13×89 Posts 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Slow Transfer Speeds with Windows 7  pinhodecarlos  Soap Box  2  20161019 19:52 
Core i7 memory speeds  nucleon  Hardware  9  20090917 11:47 
LL tests running at different speeds  GARYP166  Information & Answers  11  20090713 19:39 
sieving speeds for Intels  jasong  Sierpinski/Riesel Base 5  11  20070809 00:15 
Factoring Speeds  Khemikal796  Lone Mersenne Hunters  5  20050426 20:28 