mersenneforum.org  

Old 2009-12-15, 17:04   #12
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

Quote:
Originally Posted by TheJudger
Hi,

my first attempt to reduce the number of sieve initialisations to one per group failed miserably... :(

The good news: the 2nd attempt works fine and took only ~10 lines of code :)

sieving the first 2000 primes:
TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...)

sieving the first 4000 primes:
TF M66xxxxxx to 2^64: 220 seconds :)
Just curious, how does this compare to a modern CPU doing the same task? Based on what I recall from the last time I did TF work, this looks pretty fast.
Old 2009-12-16, 00:37   #13
ATH
Einyen

Dec 2003
Denmark

Quote:
Originally Posted by mdettweiler
Just curious, how does this compare to a modern CPU doing the same task? Based on what I recall from the last time I did TF work, this looks pretty fast.
It is very fast. I tested on one core (the 3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec

So one core of a 1.5-year-old CPU is over 6 times slower.
Old 2009-12-16, 01:55   #14
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

Quote:
Originally Posted by ATH
It is very fast. I tested on one core (the 3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec

So one core of a 1.5-year-old CPU is over 6 times slower.
Wow! Now I really wish I had a real GPU in my computer...
Old 2009-12-16, 02:00   #15
Uncwilly
6809 > 6502

Aug 2003

Can this handle exponents in the 100M-digit range (332,190,000+)? Can they be taken up to the 77-bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core, TF on two, and an LL test on the fourth, while doing other TF work on the GPU.

Old 2009-12-18, 10:51   #16
TheJudger
"Oliver"

Mar 2005
Germany

Hi ATH,

Quote:
Originally Posted by ATH
It is very fast. I tested on one core (the 3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec

So one core of a 1.5-year-old CPU is over 6 times slower.
I'm wondering why M669xxxxx is slower than M660xxxxx; M669xxxxx should have FEWER factor candidates.

Can you run a benchmark from 2^64 to 2^65 as well?
There might be a slowdown in Prime95 for factors above 2^64.
My CUDA code has only one code path, for factors up to 2^70 (maybe 2^71; I have to check, but I'd say there's a 90% chance that 2^71 will work without modifications), so there should be no slowdown when moving to factors > 2^64. :)
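
For reference, the test behind all these timings: a candidate q divides M_p exactly when 2^p == 1 (mod q). Below is a minimal host-side sketch of that check (my illustration, not TheJudger's code) in plain C; GCC's unsigned __int128 limits it to q < 2^63, which is why multi-word arithmetic is needed for the 2^64 to 2^70 range discussed here.

Code:
#include <stdint.h>
#include <stdio.h>

/* Does q divide M_p = 2^p - 1?  True iff 2^p mod q == 1.
 * Left-to-right binary exponentiation; unsigned __int128 holds the
 * squarings, so this version only works for q < 2^63.  The GPU code
 * replaces this with 72-bit limb arithmetic to reach q near 2^70. */
static int divides_mersenne(uint64_t p, uint64_t q)
{
    unsigned __int128 r = 1;
    for (int bit = 63; bit >= 0; --bit) {
        r = (r * r) % q;             /* square */
        if ((p >> bit) & 1)
            r = (r << 1) % q;        /* multiply in the base 2 */
    }
    return r == 1;
}

int main(void)
{
    /* Classic example: 23 divides M11 = 2047 = 23 * 89. */
    printf("%d\n", divides_mersenne(11, 23));  /* prints 1 */
    return 0;
}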
Old 2009-12-18, 10:58   #17
TheJudger
"Oliver"

Mar 2005
Germany

Hi Uncwilly,

Quote:
Originally Posted by Uncwilly
Can this handle exponents in the 100M-digit range (332,190,000+)? Can they be taken up to the 77-bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core, TF on two, and an LL test on the fourth, while doing other TF work on the GPU.
Right now I have a soft limit of 2^31 - 1 for the exponent; I just haven't thought about exponents above that limit yet.

77 bits? No. Right now I'm using three 32-bit integers (24 data bits + 8 carry bits each), which gives me 72 bits overall. I'm using 24 bits per int because current GPUs can do a 24x24-bit multiply 4 times faster than a 32x32-bit one...

I could use 4 x 24 bits, too, but at the moment I just want a working version that runs effectively from 2^64 to 2^70 (or 2^71).

The CUDA documentation notes that upcoming GPUs will do the 32x32-bit multiply much faster than current ones...
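
To make the limb layout concrete, here is a sketch of the representation described above (my illustration; the names and the helper are assumptions, not TheJudger's actual structures):

Code:
/* A 72-bit unsigned value: three limbs of 24 data bits, each stored
 * in a 32-bit word.  The top 8 bits of every word stay free, so a few
 * additions can be accumulated before carries must be propagated. */
typedef struct {
    unsigned int d0, d1, d2;       /* d0 = least significant 24 bits */
} int72;

__device__ void add_72(int72 *res, const int72 *a, const int72 *b)
{
    res->d0 = a->d0 + b->d0;                   /* may exceed 24 bits */
    res->d1 = a->d1 + b->d1 + (res->d0 >> 24); /* fold the carry in  */
    res->d2 = a->d2 + b->d2 + (res->d1 >> 24);
    res->d0 &= 0xFFFFFF;                       /* renormalise        */
    res->d1 &= 0xFFFFFF;
    res->d2 &= 0xFFFFFF;
}

/* For multiplication, each 24x24-bit partial product is formed with
 * __umul24(), which returns the LOW 32 bits of the 48-bit result on
 * the fast G80/GT200 path; the high 16 bits need mul24.hi, which has
 * no intrinsic (see the "PTX hack" a few posts below). */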
Old 2009-12-19, 22:30   #18
TheJudger
"Oliver"

Mar 2005
Germany

No code improvement, just some more benchmarks.

M66362159
sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds

One factor exists in the range from 2^65 to 2^66, so Prime95 might stop there early?
63205291599831307391 is a factor of M66362159

TF to 2^64 does one precalculation step less, so one extra long division is needed per factor candidate. That is why it is slower than TF from 2^64 to 2^65.
Old 2009-12-20, 22:22   #19
ATH
Einyen

Dec 2003
Denmark

Quote:
Originally Posted by TheJudger
M66362159
sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds
One core (the 3 other cores idle) of a Core 2 Quad (Yorkfield) Q9450 at 2.66 GHz with Windows XP x64 and Prime95 25.11.2 64-bit:

M66362189:
TF to 2^64: 1373 sec
TF 2^64-2^65: 1741 sec
TF 2^65-2^66: 3889 sec
TF 2^66-2^67: 7733 sec

Old 2009-12-22, 10:20   #20
TheJudger
"Oliver"

Mar 2005
Germany

Hi!

Thank you for the benchmarks, ATH :)

A small speed increase from several minor improvements. :)
M66362159
sieving the first 4000 primes:
TF to 2^64: 205 seconds (needs further testing; quick tests showed no problems)
TF from 2^64 to 2^65: 205 seconds
TF from 2^65 to 2^66: 406 seconds

Some code cleanups as well.
And finally a user interface (command line).

Next steps are:
- some more cleanups (remove unused code, ...)
- more documentation of the compile-time options
- testing the code on different platforms against Mersenne numbers with known factors (checking that my code finds those factors as well). This is where you all can help me; preferred are people who know how to compile and run CUDA code.
Old 2009-12-24, 00:35   #21
TheJudger
"Oliver"

Mar 2005
Germany

Hi,

Good news and bad news! (Don't worry, it's not really bad.)

Good news #1:
A quick test on my 8800 GTX (G80) ran fine; it produced the same results as my GTX 275 (GT200b). The G80 is the oldest CUDA-capable chip, with compute capability 1.0, so if the code works on the G80 it should work on all CUDA-enabled products. :)

Good news #2:
I've increased the raw speed of the GPU code.
My GTX 275 can now test ~53.1M factor candidates per second (M66362159, tested with 65-bit factors). Performance before the latest optimisation was ~36.6M factor candidates per second. The "trick" is to hack the PTX code (think of it as the GPU's assembly language) and replace one instruction: the NVIDIA compiler has no intrinsic for [u]mul24hi even though the instruction exists at PTX level. (The 24x24-bit multiply is faster, as mentioned before.)

Bad news #1:
The "ptx hack" is ugly!!!
I have to check some compilers...
There is a patch to enable some more intrinsics but I was not able to build the compiler. :(
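
Aside: later CUDA toolkits document inline PTX, which yields the same instruction without hand-editing the compiler's .ptx output. A sketch of that alternative (assuming a toolkit with inline-PTX support; this is not the hack actually used here):

Code:
/* mul24.hi.u32 returns bits 16..47 of the 48-bit product of the low
 * 24 bits of a and b; combined with __umul24() (bits 0..31) it gives
 * the full 24x24-bit product. */
__device__ unsigned int umul24hi(unsigned int a, unsigned int b)
{
    unsigned int r;
    asm("mul24.hi.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));
    return r;
}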

Bad news #2:
My siever is too slow. Before the latest optimisation, a single core of a Core 2 running at 3 GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test; now that the GPU code is faster, the siever no longer keeps up.
I see several possibilities:
(1) Speed up the siever by writing better code (I'm not sure I can). If "Fermi" is only twice as fast as the GT200 chip (it has roughly double the number of shaders) and brings no other improvements, I'd need to speed up the siever by another factor of 2.
(2) Write a multithreaded siever. I think I can do this, but I'm not really happy with that solution.
(3) Put the siever on the GPU. I'm not sure whether this would work...
(4) Newer GPUs can run several "kernels" at the same time. With some modifications to the code it should be possible to run several instances of the application simultaneously: if the GPU is too fast for one CPU core, just start another test on a different exponent on a second core, ...

Personally I prefer (4).
Any comments?

-----
New benchmark numbers (still on M66362159)
8800 GTX, 3.00 GHz Core 2 Duo, sieving up to 37831 (4000th odd prime):
TF to 2^64: 283s

GTX 275, Core 2 Duo overclocked to 4 GHz, sieving up to 3581 (500th odd prime)
In this configuration the siever already causes some stalls on the GPU.
TF to 2^64: 190s
TF from 2^64 to 2^65: 190s
TF from 2^65 to 2^66: 379s

Raw "factor testing speed" of the 8800 GTX is exactly half of the GTX 275 so imagin the time if the siever would be fast enough...
Old 2009-12-24, 01:41   #22
axn

Jun 2003

Quote:
Originally Posted by TheJudger
Bad news #2:
My siever is too slow. Before the latest optimisation, a single core of a Core 2 running at 3 GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test; now that the GPU code is faster, the siever no longer keeps up.
I assume you have a preliminary sieve to generate the TF candidates. Why not just lower the sieve limit for that?

In fact, the ideal scenario would have the program benchmark at runtime to pick the optimal sieve limit.
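
That calibration could be as simple as timing one block of work at a few sieve depths and keeping the fastest, as in this sketch (set_sieve_limit() and time_one_block() are invented placeholders for whatever hooks the real program has):

Code:
#include <time.h>

static int g_sieve_limit = 4000;
static void set_sieve_limit(int n) { g_sieve_limit = n; }

/* Placeholder: in the real program this would sieve one block of
 * candidates at g_sieve_limit, launch the GPU kernel on the
 * survivors, and return the wall-clock time taken. */
static double time_one_block(void)
{
    clock_t t0 = clock();
    /* ... sieve + kernel launch ... */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

/* Try several sieve depths on identical work and keep the fastest. */
static int pick_sieve_limit(void)
{
    const int limits[] = { 500, 1000, 2000, 4000, 8000 };
    int best = limits[0];
    double best_t = 1e30;
    for (int i = 0; i < 5; i++) {
        set_sieve_limit(limits[i]);
        double t = time_one_block();
        if (t < best_t) { best_t = t; best = limits[i]; }
    }
    return best;
}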