mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   Fast Mersenne Testing on the GPU using CUDA (https://www.mersenneforum.org/showthread.php?t=14310)

msft 2012-02-26 18:59

[QUOTE=aaronhaviland;290949]Residue output wasn't available when I posted that. Since then, I had submitted a pull request with basic residue output, and a basic Makefile, which has been merged.
[/QUOTE]
understand.
apologies.

aaronhaviland 2012-02-27 03:56

[QUOTE=msft;290971]understand.
apologies.[/QUOTE]
No need to apologise :)

I'm continuing to work on his code on my fork: [URL]https://github.com/ah42/gpuLucas[/URL]

RichD 2012-02-27 05:17

[QUOTE=aaronhaviland;291003]I'm continuing to work on his code on my fork: [URL]https://github.com/ah42/gpuLucas[/URL][/QUOTE]

Thank you for taking a vested interest in gpuLucas. I am glad that someone with more time than I have can continue the review and development. I felt this was too valuable an asset to be forgotten in the GIMPS community, hence my friendly "prodding" of Prof. Thall.

aaronhaviland 2012-03-02 04:06

Current progress: I'm comfortable calling this an alpha version, and have given it its first version number, 0.9.0 (tagged in git: [URL]https://github.com/ah42/gpuLucas/tags[/URL]).
I have no idea whether the Windows build files still work, and I'm fairly sure this will only run on 64-bit systems.
I have not yet implemented checkpointing; that is my next goal. Without checkpoints, I haven't yet run larger exponents to completion.

Currently verified primes: 1279, 2203, 110503, 859433
Verified non-prime residues: 1061, 10771, 106121, 1061069
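
For readers unfamiliar with what "verified" means here, the following is a minimal sketch of the Lucas-Lehmer test in plain Python, using built-in big integers rather than the GPU FFT arithmetic gpuLucas uses. The function name `lucas_lehmer` is hypothetical, and reporting the low 64 bits of the final term as the residue is an assumption about the usual GIMPS convention, not something taken from the gpuLucas source.

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M(p) = 2**p - 1, where p is an odd prime.

    Returns (is_prime, residue). The residue is the low 64 bits of the
    final term (assumed reporting convention); it is 0 when M(p) is prime.
    """
    m = (1 << p) - 1
    s = 4
    # p - 2 squarings modulo M(p)
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0, s & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    # Small exponents from the lists above; larger ones are slow in pure Python.
    for p in (1061, 1279, 2203):
        is_prime, residue = lucas_lehmer(p)
        status = "prime" if is_prime else "composite"
        print(f"M({p}) is {status}, residue 0x{residue:016x}")
```

This only scales to small exponents; the whole point of an FFT-based tester is to make each squaring fast enough for exponents in the tens of millions.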

ChangeLog:
[SIZE=1]
Aaron Haviland <orion@parsed.net> 2012-03-01

* Update version to 0.9.0. First versioned commit.
* [B][I]Add autodetection of optimal FFT signalSize.[/I][/B] Can test exponents as low as 1000 (verified residue with M(1061)).
* [B][I]Auto-select setSliceAndDice[/I][/B] depending on testPrime and signalSize. May need to be further tuned.
* Reduce T_PER_B to 512 to better fit more blocks on GPUs

Aaron Haviland <orion@parsed.net> 2012-02-29

* Add -d flag to choose CUDA device, default to device 0
* Add some more kernel failure checks
* Clean up compiler warnings
* Makefile: more verbose. Lower registers-per-kernel to 20, fits better.
* Fix __cufftSafeCall: break statements had been accidentally omitted

Aaron Haviland <orion@parsed.net> 2012-02-26

* Add help option -h
* [B][I]Introduce getopt support for passing command-line options.[/I][/B]
* Wrap verbose startup messages inside opt_verbose and add verbosity flag.
* Abort if error goes too high in errorTrial
* Remove generated files
* Remove external dependency on cutil as NVIDIA recommends against using it, and it was the only thing from the SDK we needed.

Aaron Haviland <orion@parsed.net> 2012-02-24

* Add residue-printing support for non-primes (based on rw.c from the mers package). Verified to work on 64-bit Linux.
* Add a basic makefile for building on Linux
[/SIZE]


Attachment: compiled for 64-bit Linux with CUDA 3.2. CUDA libraries not included.

aaronhaviland 2012-03-02 19:27

v0.9.1: Checkpointing added! It took less time than I thought it would (inspired by the checkpointing code in recent versions of CUDALucas).

Checkpoints are incompatible with CUDALucas.

flashjh 2012-03-02 19:58

Just wondering since I haven't used gpuLucas... is this project set to replace CUDALucas or the other way around?

Is it feasible to combine the best of both into one, so as not to spend time on two separate projects with the same goal?

Maybe aaronhaviland and msft can talk about it?

aaronhaviland 2012-03-03 15:00

I don't believe it is intended to replace CUDALucas at all; it is an alternative route to the same goal. As for merging the two codebases, I'm not sure how feasible that is: although at their core both do FFT->Multiplication->IFFT, the supporting maths around them are quite different (which is why I intentionally made the checkpoint files incompatible).

Mr. Thall has done some great work getting the maths to this point, and I think gpuLucas as it stands could see quite a bit of improvement still, and would like to keep working on it independently of CUDALucas.

With the recent improvements in CUDALucas (CL), when both run with the same FFT length, CL is only about 2% slower than gpuLucas (rather than the larger gpuLucas speedups reported before). However, gpuLucas does appear to be capable of running with smaller FFT lengths (where CL bails out due to potential round-off error), which increases its lead again, to an estimated 9% on the 26xxxxxx exponent I'm currently testing.
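
To make the FFT->Multiplication->IFFT step and the round-off concern concrete, here is a small sketch in pure Python. It is an illustration under stated assumptions, not either program's method: it uses a plain zero-padded radix-2 FFT rather than the weighted transform that production Lucas-Lehmer testers typically use, and the names `fft`, `fft_square`, and `bits_per_word` are hypothetical. It squares an integer and reports the largest distance between any output coefficient and the nearest integer.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft_square(x, bits_per_word):
    """Square a non-negative int via FFT; return (x * x, max_roundoff)."""
    base = 1 << bits_per_word
    words = []
    while x:
        words.append(x & (base - 1))
        x >>= bits_per_word
    if not words:
        words = [0]
    # Zero-pad to a power of two at least twice the word count, so the
    # cyclic convolution equals the ordinary (acyclic) one.
    n = 1
    while n < 2 * len(words):
        n *= 2
    spectrum = fft([complex(w) for w in words] + [0j] * (n - len(words)))
    conv = fft([v * v for v in spectrum], invert=True)
    result = 0
    max_err = 0.0
    for i, v in enumerate(conv):
        value = v.real / n  # normalise the inverse transform
        rounded = round(value)
        max_err = max(max_err, abs(value - rounded))
        # Big-integer addition performs the carry propagation for us.
        result += rounded << (i * bits_per_word)
    return result, max_err


if __name__ == "__main__":
    x = 3 ** 200
    for bits in (8, 16):
        square, err = fft_square(x, bits)
        print(bits, square == x * x, err)
```

Packing more bits into each word shortens the transform but leaves less floating-point headroom, so the measured error tends to rise; once it approaches 0.5, rounding can pick the wrong integer. That is the trade-off behind choosing a shorter FFT length for a given exponent.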

All of that said, I think there are things that can be learned from both of these endeavours (as well as some other GPGPU applications I've dug into), as far as portability and best practices for optimising for multiple GPU architectures go, and I plan to keep working on gpuLucas (or possibly renaming it as a third, derivative work, since gpuLucas is BSD-licensed but other code is GPL'd).

Prime95 2012-03-04 19:54

Minor bug: You are checkpointing every 1000 iterations, not 10000.


Note: every 10000 iters is every 40 seconds or so? That seems excessive. Prime95 writes one every half hour.

aaronhaviland 2012-03-05 02:58

[QUOTE=Prime95;291890]Minor bug: You are checkpointing every 1000 iterations, not 10000.


Note: every 10000 iters is every 40 seconds or so? That seems excessive. Prime95 writes one every half hour.[/QUOTE]

I agree, I think it is excessive (even at 10,000), but thanks for pointing out my typo. I had left it low for debugging purposes, and I hadn't yet looked at what a sane frequency should be.

I've just committed configurable checkpointing, defaulting to 10,000 iterations, and changed some of the output formats so it matches CL a bit more: it now prints residue output at checkpointing time (unless running in quiet mode with -q).
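
As an illustration of what configurable checkpointing looks like, here is a minimal Python sketch of a Lucas-Lehmer loop that saves its state every `checkpoint_every` iterations and resumes from the file on restart. Everything in it is an assumption made for the example: the JSON file layout, the function names `run_ll` and `save_checkpoint`, and the `stop_after` parameter (which only simulates an interrupted run). It does not reflect the actual gpuLucas or CUDALucas checkpoint formats.

```python
import json
import os
import tempfile

MASK64 = 0xFFFFFFFFFFFFFFFF


def save_checkpoint(path, p, iteration, s):
    """Write state to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"p": p, "iteration": iteration, "s": format(s, "x")}, f)
    os.replace(tmp, path)  # a crash mid-write cannot corrupt the old file


def run_ll(p, path, checkpoint_every=10_000, stop_after=None, quiet=True):
    """Lucas-Lehmer loop with periodic checkpoints.

    Returns the final term (0 means M(p) is prime), or None if the run
    was cut short by stop_after (hypothetical, simulates a kill).
    """
    m = (1 << p) - 1
    iteration, s = 0, 4
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        if state["p"] == p:  # ignore checkpoints for other exponents
            iteration, s = state["iteration"], int(state["s"], 16)
    while iteration < p - 2:
        s = (s * s - 2) % m
        iteration += 1
        if iteration % checkpoint_every == 0:
            save_checkpoint(path, p, iteration, s)
            if not quiet:
                print(f"Iteration {iteration}, residue 0x{s & MASK64:016x}")
        if stop_after is not None and iteration >= stop_after:
            return None
    return s


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "c1279.json")
        run_ll(1279, path, checkpoint_every=500, stop_after=800, quiet=False)
        final = run_ll(1279, path, checkpoint_every=500, quiet=False)
        print("M(1279) is prime" if final == 0 else "M(1279) is composite")
```

Taking the estimate above of roughly 40 seconds per 10,000 iterations at face value, a half-hour interval like Prime95's would correspond to about 450,000 iterations between checkpoints.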

I'm a little concerned about the residue mismatch I just got on M(26171441), but since I had restarted it several times, and changed a few things, including the checkpoint format itself, it was most likely my fault. I'm re-starting the test... it's using an FFT length with a high round-off error (around 0.37) so I can test out what an acceptable round-off error should be with this method. (So far, residues are matching CUDALucas through around 250,000 iterations.)

flashjh 2012-03-05 03:01

[QUOTE=aaronhaviland;291923]I agree, I think it is excessive (even at 10,000), but thanks for pointing out my typo. I had left it low for debugging purposes, and I hadn't yet looked at what a sane frequency should be.

I've just committed configurable checkpointing, defaulting to 10,000 iterations, and changed some of the output formats so it matches CL a bit more: it now prints residue output at checkpointing time (unless running in quiet mode with -q).

I'm a little concerned about the residue mismatch I just got on M(26171441), but since I had restarted it several times, and changed a few things, including the checkpoint format itself, it was most likely my fault. I'm re-starting the test... it's using an FFT length with a high round-off error (around 0.37) so I can test out what an acceptable round-off error should be with this method. (So far, residues are matching CUDALucas through around 250,000 iterations.)[/QUOTE]

Which version of CUDALucas are you using?

aaronhaviland 2012-03-05 03:24

[QUOTE=flashjh;291924]Which version of CUDALucas are you using?[/QUOTE]
1.48 and 1.64, why?


All times are UTC.