2012-06-30, 05:03  #34 
"Ben"
Feb 2007
2×1,789 Posts 
Here's the code. I'd be interested to hear how it runs on other hardware (or *if* it runs).

2012-06-30, 06:17  #35 
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 89<O<88
3×29×83 Posts 
The mfakt* programs search for factors of Mersenne numbers; a factor of 2^p-1 must be congruent to 1 (mod 2p). Thus they produce a list of multipliers k such that 2p*k+1 is between 2^a and 2^b for some integers a and b. (Currently, depending on the exponent, the final TF bound is 2^72 or 2^73, though mfaktc supports TFing up to 96 bits.) This list of potential factors is then sieved against small primes on the CPU before being sent to the GPU for the actual divisibility test. Currently, in "vanilla" versions, the depth of the sieve is variable; the "SievePrimes" variable is allowed to be between 5000 and 250000. Presumably that's the number of small primes by which a factor candidate is divided before being sent to the GPU (if none of the small primes divide the factor candidate). There has been some experimentation with reducing SievePrimes; around 1500 seemed to be a sweet spot, and someone (rcv?) had experimented with sieving on the GPU. While that wasn't a failure (it did seem to work), nothing has come of it yet.
tl;dr: Arbitrary numbers of up to ~32 digits are sieved for primality against 5000-250000 small primes. I'll test your code on my GTX 460 when I get home tomorrow. Last fiddled with by Dubslow on 2012-06-30 at 06:20 
2012-06-30, 09:27  #36 
Dec 2011
8F_{16} Posts 
I gave a version of my sieving code to TheJudger, Bdot, and Prime95, on or about March 8, 2012. It's not a perfect apples-to-apples comparison, partly for the reasons given by Dubslow. Simplifying my code to just sieve the integers from 1 to 1e9 should not slow it down, but it probably wouldn't speed it up by very much either. (Setup of the sieving loops might be simplified, but the compute-intensive sieving, itself, would be unchanged.)
Looking at the timings I took circa March, on my EVGA GTX 560 Ti, and scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find the primes to 1e9.

My sieving code respects the "class" concept used by mfaktc/mfakto. [Contrast 960 of 4620 classes with 2 of 6 for the code posted above. More classes provide a nice performance advantage for big problems, but add overhead for small problems, such as merely finding the primes to 1e9. (Another non-apples-to-apples issue.)] My sieving code is also larger, more complex, and uses multiple kernels, each of which is optimized a bit differently than the next.

For the mfaktc/mfakto problem, the sieved results do not have to leave the GPU. As bsquared noted, for counting the primes to 1e9, the sieved results do not have to leave the GPU either. Be aware that transfer of the primes themselves would add significant data-transfer overhead to any performance metric.

The code was released under GPLv3. Bdot had my blessing to put the code under source control. [I don't know whether he did.] 
2012-06-30, 12:08  #37  
"Ben"
Feb 2007
2×1,789 Posts 
2012-07-02, 16:43  #38 
"Ben"
Feb 2007
2·1,789 Posts 
More playing around this past weekend:
Code:
5.149000 milliseconds for big sieve
5761455 big primes (< 100000000) found
64.112999 milliseconds for big sieve
50847534 big primes (< 1000000000) found
1161.977051 milliseconds for big sieve
455052511 big primes (< 10000000000) found 
2012-07-03, 02:32  #39 
"Ben"
Feb 2007
3578_{10} Posts 
+ Early abort for primes that don't hit a block at all
+ axn's multiply-by-inverse trick (thanks again!)

Code:
4.929000 milliseconds for big sieve
5761455 big primes (< 100000000) found
57.847000 milliseconds for big sieve
50847534 big primes (< 1000000000) found
891.489014 milliseconds for big sieve
455052511 big primes (< 10000000000) found 
2012-07-03, 10:02  #40  
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2^{5}×5×37 Posts 
2012-07-03, 11:30  #41 
Jun 2003
19×271 Posts 
I see that the code is using plain "malloc" for allocating the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (freed with cudaFreeHost) and use those in cudaMemcpy. However, the gains will probably be negligible (due to the smallish size of the buffers). Nonetheless, it is there for the taking.
EDIT: This may or may not affect the timing, since the code is only timing the kernel itself. Last fiddled with by axn on 2012-07-03 at 11:33 
2012-07-03, 13:39  #42  
"Ben"
Feb 2007
DFA_{16} Posts 
It's maybe worth noting that for the 10e9 sieve interval, the kernel sieves approximately as many bytes as a SIQS job for a C60 (with a "factor base" more than twice as big), and does it 7 times faster. Granted, there are a lot more operations in SIQS vs. erato. And it will be very hard (impossible?) to avoid read-modify-writes, so use of atomics might be necessary. But this suggests it might not be totally insane to implement a decent SIQS on a GPU. 

2012-07-04, 04:50  #43  
Jun 2003
19×271 Posts 
One micro-optimization: where you're using an array s[2], you can change it out to use two scalar variables s0 & s1, followed by this:

Code:
uint32 sxor = s0 ^ s1;
uint32 scur = (k == 0) ? s0 : s1;

for ( ; i < block_size; ) {
    locsieve[i] = 0;
    i += scur;
    scur ^= sxor;
}
Code:
/*0998*/ /*0x0851dc4340000000*/ ISCADD R7, R5, R2, 0x2;
/*09a0*/ /*0x004fdc05c9000000*/ STS.U8 [R4], RZ;
/*09a8*/ /*0x07f15c23328ac000*/ ICMP.NE R5, RZ, 0x1, R5;
/*09b0*/ /*0x0071dc85c0000000*/ LDL R7, [R7];
/*09b8*/ /*0x10711c0348000000*/ IADD R4, R7, R4;
/*09c0*/ /*0x0041dc03188ec080*/ ISETP.LT.U32.AND P0, pt, R4, 0x2000, pt;
/*09c8*/ /*0x200001e74003ffff*/ @P0 BRA 0x998;
Code:
/*09a8*/ /*0x009fdc05c9000000*/ STS.U8 [R9], RZ;
/*09b0*/ /*0x24725c0348000000*/ IADD R9, R7, R9;
/*09b8*/ /*0x1c41dc8368000000*/ LOP.XOR R7, R4, R7;
/*09c0*/ /*0x0091dc03188ec080*/ ISETP.LT.U32.AND P0, pt, R9, 0x2000, pt;
/*09c8*/ /*0x600001e74003ffff*/ @P0 BRA 0x9a8;

EDIT: Now that I've looked at the code closer, you can replace the index k and instead use scur directly wherever k appears. This will actually allow combining the two "unrolled loops".

Another suggestion for a micro-optimization: in the main "pid" loop, instead of two variables j & pid, you can directly loop on pid, like so:

Code:
for (int pid = threadIdx.x + startprime; pid < maxp; pid += threadsPerBlock)

Last fiddled with by axn on 2012-07-04 at 05:08 

2012-07-04, 05:30  #44 
Jun 2003
1010000011101_{2} Posts 
Major (code size) optimization. The first stage reduction can be rewritten as:
Code:
j = threadIdx.x * range;
uint8 sum = 0;
for (k = 0; k < range; k++)
    sum += locsieve[j+k] & (bitsieve[threadIdx.x] >> k);
locsieve[j] = sum;

This alone reduces the code size by well over 25%!

EDIT: Probably you need ~bitsieve[]

EDIT2: In another "duh!" moment, I just realized that if the above transformation is valid, then you can get rid of the entire bitsieve array and replace it with a single uint32 bitsieve, because that is the slice of bitsieve that a thread is dealing with.

Last fiddled with by axn on 2012-07-04 at 05:52 