mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   Fast Mersenne Testing on the GPU using CUDA (https://www.mersenneforum.org/showthread.php?t=14310)

Karl M Johnson 2011-11-30 15:49

Any updates?

RichD 2012-02-23 19:25

Just received a note from Andrew Thall and he is releasing his gpuLucas program at [url]https://github.com/Almajester/gpuLucas[/url].

He says it is still pretty ugly research code, but the ReadMe file, the internal documentation and his [URL="http://andrewthall.org/papers/gpuMersenne2011MKII.pdf"]paper[/URL] should be enough to build a working copy.

It appears the program was developed under Windows 7 using Visual C++ in Visual Studio 2008. I may play with it (time permitting) to see if I can get a working version under Linux.

aaronhaviland 2012-02-24 03:49

[QUOTE=RichD;290603]Just received a note from Andrew Thall and he is releasing his gpuLucas program at [URL]https://github.com/Almajester/gpuLucas[/URL].

I may play with it (time permitting) to see if I can get a working version under Linux.[/QUOTE]

Very interesting!

I've managed to get a Linux version working myself. (I had a bit of trouble with #include <qd/dd_real.h> being included under nvcc compilation.)

Observations: currently the number to test and the FFT length (FFTlen) are hard-coded; there is no checkpoint file; it does not bail out, restart or change FFTlen if the round-off error is too great; and there is no residue output for non-primes.
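For readers unfamiliar with the round-off check mentioned above, here is a minimal sketch of the idea in Python. It is not taken from gpuLucas or CUDALucas: the function names and the 0.35 limit are assumptions chosen for illustration. After the inverse FFT every value should be close to an integer, so the largest distance to the nearest integer serves as the "max abs error" figure.

```python
# Hypothetical illustration only; names and the limit are not from gpuLucas.
ERROR_LIMIT = 0.35  # assumed threshold, real programs pick their own


def max_roundoff_error(values):
    """Largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check_iteration(values, limit=ERROR_LIMIT):
    """Return (error, ok); ok is False when the error exceeds the limit."""
    err = max_roundoff_error(values)
    return err, err <= limit


# Example: three values as they might come back from an inverse FFT.
err, ok = check_iteration([3.01, -7.98, 12.226562])
```

A program that handles this case would typically stop when `ok` is False and retry with a larger FFT length, which is the behaviour described as missing here.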

However, after a couple tests, it does seem to be a fair bit faster than CUDALucas: estimated runtime for M(26xxxxxx) using the same FFT size (1572864) is about 47 hrs in CUDALucas, and 40 hrs in gpuLucas (I've actually gotten it down to 36 hrs by fine tuning FFT size, and T_PER_B), but that's just [I]estimated[/I] run-time...

RichD 2012-02-24 04:08

Hey, that's great!!!

I found the QD package at [url]http://crd-legacy.lbl.gov/~dhbailey/mpdist/[/url] but then I ran into another problem before getting sidetracked.

Your observations are what I was expecting (unfortunately).

I think [B]TheJudger[/B] has done a lot of work on threads per block (T_PER_B) in his mfaktc program. It might need to be tuned for each card.

There is a lot of work that still needs to be done before it can be accepted by the community. Or maybe just the ideas in the code could be used in existing programs?

Dubslow 2012-02-24 04:14

A hybrid of GL and CL? (Oh, those are such unfortunate acronyms.)

frmky 2012-02-25 00:29

Yes, gpuLucas appears considerably faster. On a GTX 480, for 43122609 using a 2304K FFT, gpuLucas claims to require 51.2 hours and CUDALucas 1.58 claims to require 63.7 hours. Of course both of these are ETAs and not actual runtimes, but that's a nearly 20% difference.
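The "nearly 20%" figure can be checked with a few lines of Python. This is only arithmetic on the two ETAs quoted above; the function name is made up for illustration.

```python
def relative_saving(baseline_hours, candidate_hours):
    """Fraction of the baseline runtime saved by the candidate."""
    return (baseline_hours - candidate_hours) / baseline_hours


# ETAs reported in the post above (estimates, not measured runtimes).
saving = relative_saving(63.7, 51.2)
print(f"{saving:.1%}")
```

The result is measured against the CUDALucas ETA, so it says how much shorter the gpuLucas estimate is, not how much longer the CUDALucas one is.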

TheJudger 2012-02-25 00:50

Hi,

[QUOTE=RichD;290663]I think [B]TheJudger[/B] has done a lot of work on threads per block (T_PER_B) in his mfaktc program. Might need to be tuned for each card.[/QUOTE]

*hmm* Not really. Actually, "threads per block" is currently fixed at 256 in mfaktc. When I chose this number I did some tests with other values: 512 runs out of registers on CC 1.1 GPUs, and on other GPUs it makes no real difference whether it is 128, 256 or 512. The more important number for mfaktc is the number of threads per grid, but this might be specific to mfaktc rather than true of all CUDA applications.

Oliver
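As a rough illustration of how threads per block relates to the grid size, the sketch below computes how many blocks are needed to cover a given number of elements. It assumes one thread per element, which is a simplification and not necessarily how gpuLucas or mfaktc launch their kernels; the helper name is hypothetical and the length is the FFT size mentioned earlier in the thread.

```python
def blocks_needed(num_elements, threads_per_block):
    """Ceiling division: blocks required so every element gets a thread."""
    return (num_elements + threads_per_block - 1) // threads_per_block


FFT_LENGTH = 1572864  # FFT size quoted earlier in the thread

for threads_per_block in (128, 256, 512):
    blocks = blocks_needed(FFT_LENGTH, threads_per_block)
    # The total number of threads in the grid stays the same here,
    # only its split into blocks changes.
    print(threads_per_block, blocks, blocks * threads_per_block)
```

Under this assumption the total thread count is unchanged by the block size, which fits the observation that 128, 256 and 512 perform about the same when registers are not the limit.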

msft 2012-02-25 23:05

Hi,
It works on Linux.
I think the compile options are important.
Makefile
[code]
NVIDIA_SDK = $(HOME)/NVIDIA_GPU_Computing_SDK
# Recipe lines below must start with a tab character.
gpuLucas: gpuLucas.o
	g++ -fPIC -o gpuLucas gpuLucas.o -L/usr/local/cuda/lib64 $(NVIDIA_SDK)/C/lib/libcutil_x86_64.a -lqd -lcufft -lm
gpuLucas.o: gpuLucas.cu
	/usr/local/cuda/bin/nvcc -O3 -use_fast_math -gencode arch=compute_20,code=sm_20 --compiler-options="-fno-strict-aliasing" -w -I. -I/usr/local/include -I$(NVIDIA_SDK)/C/common/inc gpuLucas.cu -arch=sm_13 -c
clean:
	-rm *.o gpuLucas
[/code]
GTX-550Ti
[code]
[0/50]: iteration 4300: max abs error = 0.226562
[0/50]: iteration 4300: max Bit Vector = 39.000000
Time to rebalance llint: 1.936 ms

Time to rebalance and write-back: 821.3 ms

Timing: To test M43112609
elapsed time : 75901 msec = 75.9 sec
dev. elapsed time: 143860 msec = 143.9 sec
est. total time: 620216064 msec = 620216.1 sec

Beginning full test of M43112609
[/code]
CUDALucas
[code]
$ ./CUDALucas 43112609
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.58 (2:35 real, 15.4797 ms/iter, ETA 185:19:36)
[/code]

science_man_88 2012-02-25 23:25

[QUOTE=msft;290911]GTX-550Ti
[code]
est. total time: 620216064 msec = 620216.1 sec
[/code]
CUDALucas
[code]
$ ./CUDALucas 43112609
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, CUDALucas v1.58 (2:35 real, 15.4797 ms/iter, ETA 185:19:36)
[/code][/QUOTE]
Looks like the difference is about 13:02:40 (gpuLucas's estimated 620216.1 sec is about 172:16:56, against the CUDALucas ETA of 185:19:36).
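The difference can be reproduced by converting both estimates to seconds. The sketch below uses only the two figures from the quoted output; the helper names are made up for illustration.

```python
def from_hms(text):
    """Convert 'H:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert seconds (rounded to the nearest second) to 'H:MM:SS'."""
    hours, rem = divmod(int(round(total_seconds)), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


gpulucas_seconds = 620216.1                # "est. total time" from gpuLucas
cudalucas_seconds = from_hms("185:19:36")  # ETA printed by CUDALucas

print(to_hms(gpulucas_seconds))
print(to_hms(cudalucas_seconds - gpulucas_seconds))
```

Note that the CUDALucas ETA was printed at iteration 10000, so it covers the remaining iterations only; the comparison is approximate either way.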

msft 2012-02-26 01:50

[QUOTE=science_man_88;290913]looks like the difference is about 13:02:40[/QUOTE]
Indeed.
[QUOTE=aaronhaviland;290657] and there is no residue output for non-primes.
[/QUOTE]
Residue output is available,
but it does not match mprime's.
[code]
M_1215421 tests as non-prime.
M_1215421, 0xfd93939b00a071bf, n = 65536, gpuLucas
[/code]
mprime:
[code]
[Work thread Feb 25 18:53] M1215421 is not prime. Res64: FE93935B009871C0. We8: 5EAF771A,140242,00000000
[/code]
each h_signalOUT[] value -1 or +1.

aaronhaviland 2012-02-26 13:47

[QUOTE=msft;290918]residue is available.
but not same mprime.
[code]
M_1215421 tests as non-prime.
M_1215421, 0xfd93939b00a071bf, n = 65536, gpuLucas
[/code]mprime:
[code]
[Work thread Feb 25 18:53] M1215421 is not prime. Res64: FE93935B009871C0. We8: 5EAF771A,140242,00000000
[/code]each h_signalOUT[] value -1 or +1.[/QUOTE]

Residue output wasn't available when I posted that. Since then, I have submitted a pull request with basic residue output and a basic Makefile, and it has been merged.

As for the residue output, it actually works fine on my end (Linux, 64-bit, GTX 460):
[CODE]M_1215421, 0xfe93935b009871c0, n = 65536, gpuLucas
M_1215421, 0xfe93935b009871c0, n = 61440, gpuLucas
[/CODE]I've tested it now with several different testIntegers... and different FFT lengths.
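For reference, the residue being compared is the low 64 bits of the final Lucas-Lehmer value. The sketch below is a plain big-integer version in Python, not the FFT-based method gpuLucas uses, and it is only practical for small exponents; the function names are hypothetical.

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M_p = 2**p - 1, with p an odd prime.

    Returns (is_prime, res64), where res64 is the low 64 bits of the
    final residue. A residue of zero means M_p is prime.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0, s & 0xFFFFFFFFFFFFFFFF


def format_res64(residue):
    """Format a residue as 16 lowercase hex digits with a 0x prefix."""
    return f"0x{residue:016x}"


for p in (11, 13, 67, 127):
    is_prime, res64 = lucas_lehmer(p)
    print(p, is_prime, format_res64(res64))
```

Two programs that agree on this value for the same exponent, as in the output above, have almost certainly computed the same final residue even if they used different FFT lengths.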


All times are UTC.