mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   World's second-dumbest CUDA program (https://www.mersenneforum.org/showthread.php?t=11900)

 bsquared 2012-06-30 05:03

1 Attachment(s)
Here's the code. I'd be interested to hear how it runs on other hardware (or *if* it runs).

 Dubslow 2012-06-30 06:17

Mfakt* search for factors of Mersenne numbers; a factor of 2^p-1 must be congruent to 1 (mod 2p). Thus they produce a list of k's (the multiplier) such that 2p*k+1 is between 2^a and 2^b for some integers a and b. (Currently, depending on the exponent, the final TF bound is 2^72 or 2^73, though mfaktc supports TFing up to 96 bits.) This list of potential factors is then sieved for primality on the CPU before being sent to the GPU for the actual divisibility test. Currently, in "vanilla" versions, the depth of the sieve is variable; the "SievePrimes" variable is allowed to be between 5000 and 250000. Presumably that's the number of small primes by which a factor candidate is divided before being sent to the GPU (if none of the small primes divide the factor candidate). There has been some experimentation with reducing SievePrimes, down to around 1500 which seemed to be a sweet spot, and someone (rcv?) had experimented with sieving on the GPU. While that wasn't a failure (it did seem to work), nothing has come of it yet.

tl;dr: Arbitrary numbers of up to ~32 digits are sieved for primality against 5000-250000 small primes.

I'll test your code on my GTX460 when I get home tomorrow.

 rcv 2012-06-30 09:27

I gave a version of my sieving code to TheJudger, Bdot, and Prime95, on or about March 8, 2012. It's not a perfect apples-to-apples comparison, partly for the reasons given by Dubslow. Simplifying my code to just sieve the integers from 1 to 1e9 should not slow it down, but it probably wouldn't speed it up by very much either. (Setup of the sieving loops might be simplified, but the compute-intensive sieving, itself, would be unchanged.)

Looking at the timings I took circa March, on my EVGA GTX560Ti, and scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find big primes to 1e9.

My sieving code respects the "class" concept used by mfaktc/mfakto. [Contrast 960 of 4620 classes with 2 of 6 for the code posted above. More classes provides a nice performance advantage for big problems, but adds overhead for small problems, such as merely finding the primes to 1e9. (Another non apples-to-apples issue.)]

My sieving code is also larger, more complex, and uses multiple kernels, each of which is optimized a bit differently than the next.

For the mfaktc/mfakto problem, the sieved results do not have to leave the GPU. As bsquared noted, for counting the primes to 1e9, the sieved results do not have to leave the GPU. Be aware that transfer of the primes, themselves, would add significant data-transfer overhead to any performance metric.

The code was released under GPLv3. Bdot had my blessing to put the code under source control. [I don't know whether he did.]

 bsquared 2012-06-30 12:08

[QUOTE=rcv;303722]

Looking at the timings I took circa March, on my EVGA GTX560Ti, and scaling to find primes to 1e9, and assuming I haven't made an arithmetic error, my code should take about 55406 usec, plus a few hundred usec of setup time. This includes sieving with all of the primes below 2^15, which is slightly more sieving than necessary to find big primes to 1e9.
[/QUOTE]

Wow, cool! My apologies for coming around after 2 days of looking at gpu code and presuming to have come up with the fastest thing around. Also good to know that targetting a version of the above for mfakt* isn't necessary. I'd really like to have a look at your code, if its still available.

 bsquared 2012-07-02 16:43

1 Attachment(s)
More playing around this past weekend:

[CODE]
5.149000 milliseconds for big sieve
5761455 big primes (< 100000000) found

64.112999 milliseconds for big sieve
50847534 big primes (< 1000000000) found

1161.977051 milliseconds for big sieve
455052511 big primes (< 10000000000) found
[/CODE]

 bsquared 2012-07-03 02:32

+ Early abort for primes that don't hit a block at all
+ axn's multiply by inverse trick (thanks again!)

[CODE]

4.929000 milliseconds for big sieve
5761455 big primes (< 100000000) found

57.847000 milliseconds for big sieve
50847534 big primes (< 1000000000) found

891.489014 milliseconds for big sieve
455052511 big primes (< 10000000000) found

[/CODE]

 henryzz 2012-07-03 10:02

[QUOTE=bsquared;303911]+ Early abort for primes that don't hit a block at all
+ axn's multiply by inverse trick (thanks again!)

[CODE]

4.929000 milliseconds for big sieve
5761455 big primes (< 100000000) found

57.847000 milliseconds for big sieve
50847534 big primes (< 1000000000) found

891.489014 milliseconds for big sieve
455052511 big primes (< 10000000000) found

[/CODE][/QUOTE]
Looks like you are catching up to rcv's code. Any clue how they would compare on the same hardware? rcv's gpu looks a little more powerful I think but yours is tesla.

 axn 2012-07-03 11:30

I see that the code is using plain "malloc" for allocating the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (and cudaFreeHost) and using them in cudaMemcpy . However, the gains will probably be negligible (due to the small-ish size of the buffers). Nonetheless, it is there for the taking.

EDIT:- This may or may not affect the timing, since the code is only timing the kernel itself.

 bsquared 2012-07-03 13:39

[QUOTE=henryzz;303942]Looks like you are catching up to rcv's code. Any clue how they would compare on the same hardware? rcv's gpu looks a little more powerful I think but yours is tesla.[/QUOTE]

Mine is a Tesla M2050. I have little idea how that compares with his, but the tesla is pretty fast. I would really like to see rcv's code to do a proper comparison and see if any of his ideas are transferrable (assuming he's willing to do that). It's not that I don't trust his timing estimate, but sieving that many classes may cause things scale in unexpected ways. Every experiment I tried with more than 2/6 classes ran slower - a lot slower. But maybe that's because I implemented it badly. *shrug*. I'm pretty happy with where the performance is now.

[QUOTE=axn;303944]I see that the code is using plain "malloc" for allocating the two buffers which are later used in cudaMemcpy. Technically, it is more efficient to allocate "pinned" buffers using cudaHostAlloc (and cudaFreeHost) and using them in cudaMemcpy . However, the gains will probably be negligible (due to the small-ish size of the buffers). Nonetheless, it is there for the taking.

EDIT:- This may or may not affect the timing, since the code is only timing the kernel itself.[/QUOTE]

Didn't seem to make any difference, but it's probably the perferred way to do it, so thanks for the pointer.

It's maybe worth noting that for the 10e9 sieve interval, the kernel sieves approximately as many bytes as a SIQS job for a C60 (with a "factor base" more than twice as big), and does it 7 times faster. Granted, there are a lot more operations in siqs vs. erato. And it will be very hard (impossible?) to avoid read-modify-writes, so use of atomics might be necessary. But this suggests it might not be totally insane to implement a decent siqs on a gpu.

 axn 2012-07-04 04:50

[QUOTE=bsquared;303946]

Looking at the latest code, the compiler is actually doing very good job.

One micro-optimization. Where you're using an array s[2], you can change it out to use two scalar variables s0 & s1. Followed by this:

[CODE] uint32 sxor = s0 ^ s1;
uint32 scur = (k == 0) ? s0 : s1;
for( ;i < block_size; )
{
locsieve[i] = 0;
i += scur;
scur ^= sxor;
}[/CODE]

This changes that loop from
[CODE] /*0998*/ /*0x0851dc4340000000*/ ISCADD R7, R5, R2, 0x2;
/*09a0*/ /*0x004fdc05c9000000*/ STS.U8 [R4], RZ;
/*09a8*/ /*0x07f15c23328ac000*/ ICMP.NE R5, RZ, 0x1, R5;
/*09b0*/ /*0x0071dc85c0000000*/ LDL R7, [R7];
/*09b8*/ /*0x10711c0348000000*/ IADD R4, R7, R4;
/*09c0*/ /*0x0041dc03188ec080*/ ISETP.LT.U32.AND P0, pt, R4, 0x2000, pt;
/*09c8*/ /*0x200001e74003ffff*/ @P0 BRA 0x998;[/CODE]

to
[CODE] /*09a8*/ /*0x009fdc05c9000000*/ STS.U8 [R9], RZ;
/*09b0*/ /*0x24725c0348000000*/ IADD R9, R7, R9;
/*09b8*/ /*0x1c41dc8368000000*/ LOP.XOR R7, R4, R7;
/*09c0*/ /*0x0091dc03188ec080*/ ISETP.LT.U32.AND P0, pt, R9, 0x2000, pt;
/*09c8*/ /*0x600001e74003ffff*/ @P0 BRA 0x9a8;[/CODE]

And totally avoids that usage of two memory locations (which, incidentally are "local" memory and AFAIK is slower(!) than shared memory). As usual, net savings may be iffy :(

EDIT:- Now that I've looked at the code closer, you can replace the index k and instead use scur directly wherever k appears. This will actually allow to combine the two "unrolled loops".

Another suggestion for a micro-optimization. In the main "pid" loop, instead of two variables j & pid, you can directly loop on pid, like so:
[CODE]for (int pid =threadIdx.x + startprime; pid < maxp; pid += threadsPerBlock)[/CODE]
thus removing the inner "if" and the need to send the "nump" parameter to the kernel. This should have almost no impact on runtime :)

 axn 2012-07-04 05:30

Major (code size) optimization. The first stage reduction can be rewritten as:

uint8 sum = 0;
for (k=0; k<range; k++)
sum += locsieve[j+k] & (bitsieve[threadIdx.x] >> k);
locsieve[j] = sum;
[/CODE]

Note that this only works when range = 32 (i.e. block_size / threads_per_block). But then again, the initial logic of setting bitsieve[threadIdx.x] only works when that is the case.

This alone reduces the code size by well over 25%!

EDIT:- Probably you need ~bitsieve[]

EDIT2:- In another "duh!" moment, I just realized that if the above transformation is valid, then you can get rid of the entire bitsieve array and replace it with a single uint32 bitsieve, because that is the slice of bitsieve that a thread is dealing with.

All times are UTC. The time now is 15:21.