#12
A Sunny Moo
Aug 2007
USA (GMT-5)
6249₁₀ Posts
#13
Einyen
Dec 2003
Denmark
5734₈ Posts
Quote:

M660xxxxx: 1 to 2^64: 1369 sec
M669xxxxx: 1 to 2^64: 1406 sec

So one core of a 1.5-year-old CPU is over 6 times slower.
#14
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts
#15
6809 > 6502
Aug 2003
10010010001000₂ Posts
Can this do numbers in the 100M-digit range (exponents 332,190,000+)? Can they be taken up to the 77-bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core, TF on two, and an LL test on the fourth, while doing other TFs on the GPU.

Last fiddled with by Uncwilly on 2009-12-16 at 02:00
#16
"Oliver"
Mar 2005
Germany
11·101 Posts
Hi ATH,
Quote:
Can you run a benchmark from 2^64 to 2^65 as well? There might be a slowdown in P95 for factors above 2^64.

My CUDA code has only one code path for factors up to 2^70 (maybe 2^71; I have to check, but I'd say there is a 90% chance that 2^71 will work without modifications), so there should be no slowdown when moving to factors > 2^64. :)
#17
"Oliver"
Mar 2005
Germany
11×101 Posts
Hi Uncwilly,
Quote:
77 bit? No. Right now I'm using three 32-bit integers (24 bits of data + 8 bits of carry each), which gives me 72 bits overall. I'm using 24 bits per int because current GPUs can do a 24×24-bit multiply 4 times faster than a 32×32-bit one. I could do 4×24 bits, too, but at the moment I just want a working version that runs efficiently from 2^64 to 2^70 (or 2^71). The CUDA documentation notes that upcoming GPUs can do 32×32-bit multiplies much faster than current ones.
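A minimal host-side sketch of this limb layout (hypothetical names, not mfaktc's actual code; the GPU's fast 24×24→48-bit multiply path is stood in for by ordinary 64-bit host multiplies):

```c
#include <stdint.h>

/* Assumed layout: a 72-bit value held as three 24-bit limbs, each in a
 * 32-bit int so 8 bits remain free for carries, as described above. */
typedef struct { uint32_t d0, d1, d2; } int72; /* v = d0 + d1*2^24 + d2*2^48 */

static int72 to72(uint64_t v) {
    int72 r = { (uint32_t)(v & 0xFFFFFF),
                (uint32_t)((v >> 24) & 0xFFFFFF),
                (uint32_t)(v >> 48) };
    return r;
}

static uint64_t from72(int72 a) { /* only valid for values < 2^64 */
    return (uint64_t)a.d0 | ((uint64_t)a.d1 << 24) | ((uint64_t)a.d2 << 48);
}

/* low 72 bits of a*b, assembled from 24x24->48-bit partial products;
 * each partial product is what mul24 would compute on the GPU */
static int72 mul72lo(int72 a, int72 b) {
    uint64_t c0 = (uint64_t)a.d0 * b.d0;
    uint64_t c1 = (uint64_t)a.d0 * b.d1 + (uint64_t)a.d1 * b.d0 + (c0 >> 24);
    uint64_t c2 = (uint64_t)a.d0 * b.d2 + (uint64_t)a.d1 * b.d1
                + (uint64_t)a.d2 * b.d0 + (c1 >> 24);
    int72 r = { (uint32_t)(c0 & 0xFFFFFF), (uint32_t)(c1 & 0xFFFFFF),
                (uint32_t)(c2 & 0xFFFFFF) };
    return r;
}
```

Since each partial product fits in 48 bits and at most three of them are summed per column, a plain 64-bit accumulator never overflows, which is the headroom the 8 carry bits buy on the GPU side.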
#18
"Oliver"
Mar 2005
Germany
11·101 Posts
No code improvement, just some more benchmarks.
M66362159, sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds

In the range from 2^65 to 2^66 there is one factor, so Prime95 might stop early there:
63205291599831307391 is a factor of M66362159

TF to 2^64 does one precalculation step less, so one extra long division is needed per factor candidate. That is why it is slower than TF from 2^64 to 2^65.
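The underlying test is simple: a candidate f divides M_p = 2^p − 1 exactly when 2^p mod f = 1, computed by binary exponentiation (one modular squaring per bit of p, which is where a saved precalculation step pays off per candidate). A tiny CPU sketch with hypothetical names; it relies on GCC/Clang's unsigned __int128, so it only reaches 64-bit factors, and the 66-bit factor above would need multi-limb arithmetic like the GPU code's:

```c
#include <stdint.h>

/* 64x64-bit modular multiply via a 128-bit intermediate (GCC/Clang) */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m);
}

/* does f divide M_p = 2^p - 1?  True iff 2^p mod f == 1 */
static int divides_mersenne(uint64_t p, uint64_t f) {
    uint64_t r = 1, base = 2 % f;
    while (p) {                         /* right-to-left binary powering */
        if (p & 1) r = mulmod(r, base, f);
        base = mulmod(base, base, f);
        p >>= 1;
    }
    return r == 1;
}
```

For example, M11 = 2047 = 23 × 89, so divides_mersenne(11, 23) and divides_mersenne(11, 89) both hold.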
#19
Einyen
Dec 2003
Denmark
2^2·3·11·23 Posts
Quote:

M66362189:
TF to 2^64: 1373 sec
TF 2^64-2^65: 1741 sec
TF 2^65-2^66: 3889 sec

Last fiddled with by ATH on 2009-12-20 at 22:27
#20
"Oliver"
Mar 2005
Germany
10001010111₂ Posts
Hi!
Thank you for the benchmarks, ATH. :)

A little speed increase, caused by several minimal improvements. :)

M66362159, sieving the first 4000 primes:
TF to 2^64: 205 seconds (needs further testing; quick tests showed no problems)
TF from 2^64 to 2^65: 205 seconds
TF from 2^65 to 2^66: 406 seconds

Some code cleanups as well, and finally a user interface (command line).

Next steps are:
- some more cleanups (remove unused code, ...)
- some more description of the compile-time options
- testing the code on different platforms on Mersenne numbers with known factors (checking that my code finds those factors as well). Here you all can help me. Preferred are people who know how to compile and run CUDA code.
#21
"Oliver"
Mar 2005
Germany
11×101 Posts
Hi,
Good news and bad news! (Don't be afraid, it is not really bad.)

Good news #1: A quick test on my 8800 GTX (G80) ran fine; it produced the same results as my GTX 275 (GT200b). The G80 is the oldest CUDA-capable chip, with compute capability 1.0. If the code works on the G80, it should work on all CUDA-enabled products. :)

Good news #2: I've increased the raw speed of the GPU code. My GTX 275 can now test ~53.1M factor candidates per second (M66362159, tested with 65-bit factors). Performance before the latest optimisation was ~36.6M factor candidates per second. The "trick" is to hack the PTX code (let's say it is like assembler code on a CPU) and replace one instruction: the NVIDIA compiler has no intrinsic for [u]mul24hi, while the instruction exists in PTX. (A 24×24-bit multiply is faster, as mentioned before.)

Bad news #1: The "PTX hack" is ugly!!! I have to check some compilers... There is a patch to enable some more intrinsics, but I was not able to build the compiler. :(

Bad news #2: My siever is too slow. Without the latest optimisation, a single core of a Core 2 running at 3 GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, because the GPU code is faster. I have to think about the possibilities:
(1) Speed up the siever by writing better code (I'm not sure I can do this). If "Fermi" is only twice as fast as the GT200 chip (due to roughly double the number of shaders) and has no other improvements, I need to speed up the siever by another factor of 2.
(2) Write a multithreaded siever. I think I can do this, but I'm not really happy with this solution.
(3) Put the siever on the GPU. I'm not sure whether this would work...
(4) Newer GPUs can run several kernels at the same time. With some modifications to the code, it should be possible to have several instances of the application running at once. If the GPU is too fast for one CPU core, just start another test on a different exponent on a second core, ...

Personally I prefer (4). Any comments?

-----

New benchmark numbers (still on M66362159):

8800 GTX, 3.00 GHz Core 2 Duo, sieving up to 37831 (the 4000th odd prime):
TF to 2^64: 283s

GTX 275, Core 2 Duo overclocked to 4 GHz, sieving up to 3581 (the 500th odd prime; in this configuration the siever already causes some stalls on the GPU):
TF to 2^64: 190s
TF from 2^64 to 2^65: 190s
TF from 2^65 to 2^66: 379s

The raw "factor testing speed" of the 8800 GTX is exactly half that of the GTX 275, so imagine the times if the siever were fast enough...
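For context, the CPU-side siever's job can be sketched like this (a deliberately naive assumed design, not mfaktc's code): candidate factors of M_p have the form q = 2kp + 1 with q ≡ ±1 (mod 8), and any k whose q is divisible by a small sieve prime is struck out before the GPU ever sees it.

```c
#include <stdint.h>
#include <string.h>

#define KRANGE 1000  /* k values sieved per block (arbitrary for the sketch) */

/* Mark surviving k in [k0, k0 + KRANGE); returns the survivor count.
 * Candidates are q = 2*k*p + 1, which must be 1 or 7 mod 8; any q
 * divisible by a small prime s is struck.  A real siever would stride
 * through residue classes instead of testing every (k, prime) pair. */
static int sieve_block(uint64_t p, uint64_t k0, uint8_t keep[KRANGE],
                       const uint32_t *primes, int nprimes) {
    memset(keep, 1, KRANGE);
    for (uint64_t k = k0; k < k0 + KRANGE; k++) {
        uint64_t q = 2 * k * p + 1;
        if ((q & 7) != 1 && (q & 7) != 7)
            keep[k - k0] = 0;           /* wrong residue mod 8 */
    }
    for (int i = 0; i < nprimes; i++) {
        uint32_t s = primes[i];
        for (uint64_t k = k0; k < k0 + KRANGE; k++)
            if (keep[k - k0] && (2 * k * (p % s) + 1) % s == 0
                && 2 * k * p + 1 != s)
                keep[k - k0] = 0;       /* q has a small prime factor */
    }
    int survivors = 0;
    for (int i = 0; i < KRANGE; i++) survivors += keep[i];
    return survivors;
}
```

With p = 11, k0 = 1, and sieve primes {3, 5, 7}, the block keeps k = 1 and k = 4 (the actual factors 23 and 89 of M11) while discarding, e.g., k = 3 (q = 67, which is 3 mod 8).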
#22
Jun 2003
4856₁₀ Posts
Quote:

In fact, the ideal scenario would have the program run a benchmark at runtime to pick the optimal sieve limit.
Thread Tools
Thread | Thread Starter | Forum | Replies | Last Post |
mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1668 | 2020-12-22 15:38 |
The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |