mersenneforum.org  

Go Back   mersenneforum.org > Prime Search Projects > Prime Sierpinski Project

Reply
 
Thread Tools
Old 2007-04-17, 02:03   #1
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default SSE2 sieve speeds

These times are for a 2.66GHz P4 (8Kb L1, 512Kb L2) running Windows XP sieving the current combined SoB.dat file at p=100e12:
Code:
proth_sieve_sse2 0.42      190 kp/s
JJsieveSSE2 PRS52 v0.102b  192 kp/s
sr2sieve-intel 1.4.39      266 kp/s
I don't have access to any SSE2 machines other than 32-bit P4 and P4/Celerons for testing, so I would be very interested to see comparisons for Core2 or Athlon64 (in 32-bit mode).

To run a benchmark, download sr2sieve-1.4.39-windows-x86.zip here and in the same directory as SoB.dat run `sr2sieve -s -p100e12 -P101e12'.

My understanding is that proth_sieve and JJsieve have a better algorithm than sr2sieve, and that the speed difference is due to the SSE2 assembler used for modular multiplications. If anyone is interested in this code it is in the asm-i386-gcc.h file in sr5sieve-1.4.39.tar.gz here. (see the VEC4_* macros). I think there is still room for improvement, perhaps by interleaving 8 mulmods per pass instead of 4, or finding some way to use the SSE2 floating point unit instead of the FPU.
geoff is offline   Reply With Quote
Old 2007-04-17, 09:26   #2
hhh
 
hhh's Avatar
 
Jun 2005

37310 Posts
Default

I will run the bench on my 1.83GHz Core2Duo as soon as I get home. I can run it with clockspeed throttled as well (both internal and external), if you are interested. H.
hhh is offline   Reply With Quote
Old 2007-04-17, 14:43   #3
KriZp
 
KriZp's Avatar
 
Feb 2007

2078 Posts
Default

Opteron 165 @ 2840 MHz, combined .dat

sr2sieve-amd.exe
sr2sieve started: 991 <= n <= 49999997, 100000000000000 <= p <= 101000000000000
p=100000103547551, 504004 p/sec, 0 factors, 0.01% done, ETA 14 May 11:54
sr2sieve stopped: at p=100000132329451 because SIGINT was received.
Found factors for 0 terms in 296.922 cpu sec. (expected about 0.10)

sr2sieve-intel.exe
sr2sieve started: 991 <= n <= 49999997, 100000132329451 <= p <= 101000000000000
p=100000253047711, 504963 p/sec, 0 factors, 0.03% done, ETA 10 May 16:24
sr2sieve stopped: at p=100000272376483 because SIGINT was received.
Found factors for 0 terms in 575.375 cpu sec. (expected about 0.20)

JJsieveSSE2
pmin=100000024999967 @ 511kp/s
pmin=100000049999999 @ 536kp/s
pmin=100000074999941 @ 536kp/s
pmin=100000099999937 @ 535kp/s
pmin=100000124999969 @ 536kp/s

I set the priority to realtime in the taskmanager after I started the siever.
KriZp is offline   Reply With Quote
Old 2007-04-17, 21:47   #4
hhh
 
hhh's Avatar
 
Jun 2005

373 Posts
Default

Core2Duo, 1.83 GHz. One bar of 1GB only.
Code:
                            Intel                    AMD

One core alone               426                    416
                             425                    417
                             425                    417

Two cores                    418/421              409/413
                             416/419              408/411
                                                  409/413

Competing against            418                    408
each other                   410                    419
If you still have questions, I'll be glad to help.

Last fiddled with by hhh on 2007-04-17 at 21:51
hhh is offline   Reply With Quote
Old 2007-04-18, 04:11   #5
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13×89 Posts
Default

Thanks KriZp and hhh for the benchmarks :-)

I have also tested on a P4 Celeron D (16Kb L1, 256Kb L2), I find it hard to get repeatable benchmarks for either program but on average the sr2sieve times are no slower than the JJsieve ones.

Just the fact that sr2sieve gets within 10% of JJsieve without the benefit of the SPH algorithm suggests that there might be something to gain if the SSE2 assembler from sr2sieve could be added to JJsieve.

The SSE2 routines are used in two situations:

1. An array A is filled with elements A[0] = x (mod p), A[1] = x^2 (mod p) ... A[n] = x^n (mod p).

2. Each element in an existing array A is multiplied by a constant b (mod p).
geoff is offline   Reply With Quote
Old 2007-04-18, 10:44   #6
Greenbank
 
Greenbank's Avatar
 
Jul 2005

1100000102 Posts
Default

Great work Geoff.

I'm beginning to wonder if SPH really is worth the trouble it is. I wrote a whole load of stuff about sieving (Riesel/Sierpinski, DL problem, BSGS and SPH) over at the SoB forum:

http://www.free-dc.org/forum/showthread.php?t=10262

The "problem" with SPH is that you have to find, and keep track of factors of p-1 (where p is the current value you are sieving). But in doing this you use reasonable chunks of memory and also start to overuse the cache. I'm sure there's a better balance to be had but it's so tricky working towards it.

I need to go back and see if just looking at powers of 2 in p-1 using SPH would help "whack" as many candidate k,p pairs before BSGS is required.
Greenbank is offline   Reply With Quote
Old 2007-04-19, 03:30   #7
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

22058 Posts
Default

Quote:
Originally Posted by Greenbank View Post
I'm beginning to wonder if SPH really is worth the trouble it is. I wrote a whole load of stuff about sieving (Riesel/Sierpinski, DL problem, BSGS and SPH) over at the SoB forum:

http://www.free-dc.org/forum/showthread.php?t=10262
That is an interesting thread, thanks for the pointer. I see there that proth_sieve uses a fixed size giant step, and so doesn't need to compute every baby step. I guess this means that the SSE2 routine to fill an array with x,x^2,...,x^n (mod p) may not be useful. sr2sieve uses a variable number of giant and baby steps and so every baby step needs to be computed, which is the main place that routine is used.

I have found that on the P4, computing every element of the array x,x^2,...,x^n (mod p) with SSE2 is faster than computing 40% of the elements using an addition ladder without SSE2, but that will be different for different machines.

Quote:
The "problem" with SPH is that you have to find, and keep track of factors of p-1 (where p is the current value you are sieving). But in doing this you use reasonable chunks of memory and also start to overuse the cache. I'm sure there's a better balance to be had but it's so tricky working towards it.

I need to go back and see if just looking at powers of 2 in p-1 using SPH would help "whack" as many candidate k,p pairs before BSGS is required.
That is one of the problems I had when I was looking at how to implement SPH: I haven't been able to think of any way to find the needed factors of p-1 fast enough to make it feasible.

sr2sieve only uses small factors of p-1 implicitly: to check whether k is a cubic residue for p it needs first to check whether p=1 (mod 3). That is equivalent to determining whether 3 is a factor of p-1, but since only 3,4,5,8,9,16-power residues are checked, all the needed factors can be found by a single lookup into a table indexed by p%720.
geoff is offline   Reply With Quote
Old 2007-04-19, 04:48   #8
Citrix
 
Citrix's Avatar
 
Jun 2003

3×232 Posts
Default

Quote:
Originally Posted by geoff View Post
sr2sieve only uses small factors of p-1 implicitly: to check whether k is a cubic residue for p it needs first to check whether p=1 (mod 3). That is equivalent to determining whether 3 is a factor of p-1, but since only 3,4,5,8,9,16-power residues are checked, all the needed factors can be found by a single lookup into a table indexed by p%720.
One way to get over this is to look for primes that are p=x (mod 720) in the given range and then go back and look at primes p=x+1 (mod 720) and so on...
Do you think this will be faster?

For prime=2, I think you are using the quadratic reciprocity law. Note, that these laws exist for higher powers also and can be used to replace the SPH algorithm. Google "Rational reciprocity laws".

http://mathworld.wolfram.com/ReciprocityTheorem.html (A good place to start)

Last fiddled with by Citrix on 2007-04-19 at 04:57
Citrix is offline   Reply With Quote
Old 2007-04-20, 03:19   #9
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

22058 Posts
Default

Quote:
Originally Posted by geoff View Post
These times are for a 2.66GHz P4 (8Kb L1, 512Kb L2) running Windows XP sieving the current combined SoB.dat file at p=100e12:
Code:
proth_sieve_sse2 0.42      190 kp/s
JJsieveSSE2 PRS52 v0.102b  192 kp/s
sr2sieve-intel 1.4.39      266 kp/s
While these times are accurate for the Northwood P4, it appears that they may not be valid for other P4 models. Sorry for jumping to conclusions. I am going to try to work out exactly what is making the code run so much faster on Northwood processors than others.
geoff is offline   Reply With Quote
Old 2007-04-21, 13:59   #10
Joe O
 
Joe O's Avatar
 
Aug 2002

52510 Posts
Default

Could you make the 1.4.34.tar.gz source available?
Joe O is offline   Reply With Quote
Old 2007-04-22, 03:03   #11
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13×89 Posts
Default

Quote:
Originally Posted by Joe O View Post
Could you make the 1.4.34.tar.gz source available?
Done.
geoff is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Slow Transfer Speeds with Windows 7 pinhodecarlos Soap Box 2 2016-10-19 19:52
Core i7 memory speeds nucleon Hardware 9 2009-09-17 11:47
LL tests running at different speeds GARYP166 Information & Answers 11 2009-07-13 19:39
sieving speeds for Intels jasong Sierpinski/Riesel Base 5 11 2007-08-09 00:15
Factoring Speeds Khemikal796 Lone Mersenne Hunters 5 2005-04-26 20:28

All times are UTC. The time now is 16:05.


Wed May 18 16:05:13 UTC 2022 up 34 days, 14:06, 0 users, load averages: 1.65, 1.84, 1.98

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔