mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2010-01-03, 09:43   #34
ET_
Banned
 
 
"Luigi"
Aug 2002
Team Italia

2²·1,193 Posts
Default

You can keep the 16 segments used for sieving in shared memory (one for each residue class) and synchronize threads only at the end of the whole task.

Updating the segment is a matter of nanoseconds, once you locally store the precomputed gaps for each prime.

Luigi
Old 2010-01-04, 19:54   #35
TheJudger
 
 
"Oliver"
Mar 2005
Germany

2×3×5×37 Posts
Default

Hi,

another performance update.
Using more classes helps to speed up the sieve, since it removes more candidates with small factors from the sieve entirely, at the cost of additional initializations.

Removing the classes whose candidates are
- 3 or 5 mod 8
- multiples of 3 or 5
4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%)

Removing multiples of 7, too:
4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%)

Removing multiples of 11, too:
4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)
-----
One process on the CPU, sieving the first 4000 odd primes:
M66362159 TF from 2^64 to 2^65: 160.7s / 185.5s
M66362159 TF from 2^65 to 2^66: 318.8s / 323.5s
M66362159 TF from 2^66 to 2^67: 631.6s / 620.7s
First time: using 96 classes
Second time: using 960 classes
The range is a bit too small for proper timings when using 960 classes. The reason is that I always fill up the candidate list regardless of whether it is above the range or not, and compared to 96 classes it needs 10x more initializations.
-----
M66362159 TF from 2^64 to 2^65 using 96 classes:
sieving first 4000/25000 odd primes
1 process: 160.7s / 167.5s
2 processes: 285.8s / 238.1s (both processes are doing TF from 64 to 65bit)

Throughput for M66362159 TF from 2^64 to 2^65:
Time estimate for my 4GHz C2D (based on ATH's timings): 1160s
(Since Prime95's TF code fits excellently into the CPU's L1 cache, I've scaled the time up directly by the clock rates.)
How often can I do the TF in one hour?
One process for CUDA and one for Prime95 (assuming they don't slow each other down):
3600s / 160.7s + 3600s / 1160s = 22.4 + 3.1 = 25.5
Two processes for CUDA and none for Prime95:
2 * 3600s / 238.1s = 30.2

So when aiming for TF throughput on my particular system, it is clearly beneficial to dedicate two CPU cores to my CUDA code.
Old 2010-01-04, 21:13   #36
ldesnogu
 
 
Jan 2008
France

219₁₆ Posts
Default

Quote:
Originally Posted by TheJudger View Post
Using more classes helps to speed up the sieve, since it removes more candidates with small factors from the sieve entirely, at the cost of additional initializations.

Removing the classes whose candidates are
- 3 or 5 mod 8
- multiples of 3 or 5
4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%)

Removing multiples of 7, too:
4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%)

Removing multiples of 11, too:
4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)
Do you only rely on initialization to remove these? If so, you should perhaps consider removing them from the sieve itself; that would help you fit wider ranges in your L1 D-cache, at the expense of code size, but I don't think that's critical.
Old 2010-01-04, 22:45   #37
TheJudger
 
 
"Oliver"
Mar 2005
Germany

2·3·5·37 Posts
Default

Hi ldesnogu,

I'm not 100% sure I got the point of your post.
The smallest primes (3, 5, 7, 11) are removed from the sieve entirely.

I'll try to explain how it works. For simplicity, let us ignore the fact that we can remove factors which are 3 or 5 mod 8 (this would just multiply the number of classes by four).
For example we take M23 (p=23) and build only 3 classes (remove all multiples of 3).
The factor candidates (FC) have the form 2kp+1.
Code:
class 0: k={      3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, ...}
class 1: k={  1,  4,  7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...}
class 2: k={  2,  5,  8, 11, 14, 17, 20, 23, 26, 29, 32, 35, ...}
Those lists are shown here only for illustration; they are not generated in the code.

Check whether a class is 0 mod 3: k=3: 2kp+1 mod 3 = 1
Check whether a class is 1 mod 3: k=1: 2kp+1 mod 3 = 2
Check whether a class is 2 mod 3: k=2: 2kp+1 mod 3 = 0

So we can ignore class 2 entirely: ALL candidates in this class are multiples of 3!
Code:
class 0: k={      3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, ...}
class 1: k={  1,  4,  7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...}
class 2: cleared
So we have no multiples of 3 now (talking about FCs). The next small prime is 5.
For each class, find the smallest k where 2kp+1 mod 5 = 0.
These are k=9 for class 0 and k=4 for class 1. Starting at k=9/k=4 in these lists, mark every 5th k as cleared.
Code:
class 0: k={      3,  6, *9, 12, 15, 18, 21,*24, 27, 30, 33, ...}
class 1: k={  1, *4,  7, 10, 13, 16,*19, 22, 25, 28, 31,*34, ...}
class 2: cleared
Repeat sieving for small primes 7, 11, 13, ... up to a certain bound.

Those classes are independent of each other, so we need to handle only a single class at a time. There is no need to try the FCs in ascending order.

My actual implementation uses one of
- 4 * 3 * 5 * 7 = 420 classes, sieve size = N * 11 * 13 * 17 * 19 bits = N * 46189 bits
- 4 * 3 * 5 * 7 * 11 = 4620 classes, sieve size = N * 13 * 17 * 19 * 23 bits = N * 96577 bits
Each k is represented as a single bit.

These sieve sizes allow preinitialized sieves which have already been sieved with the small primes up to 19/23.
Old 2010-01-05, 09:47   #38
ldesnogu
 
 
Jan 2008
France

3×179 Posts
Default

Oh OK, I mistakenly thought you only removed the smaller primes from your sieve using initialization, while in fact your initialization removes some larger ones as well.

In case you don't know, this method is called sieving by wheel.
Old 2010-01-05, 19:06   #39
TheJudger
 
 
"Oliver"
Mar 2005
Germany

456₁₆ Posts
Default

Hi,

Quote:
Originally Posted by ldesnogu View Post
In case you don't know, this method is called sieving by wheel.
I didn't know that this has a specific name. I came up with this idea by myself some years ago (for another project, of course). In my original code I called them "groups", not "classes". ;)
Old 2010-01-05, 19:40   #40
ckdo
 
 
Dec 2007
Cleves, Germany

1022₈ Posts
Default

In other words, you reinvented the wheel.
Old 2010-01-05, 19:45   #41
TheJudger
 
 
"Oliver"
Mar 2005
Germany

2·3·5·37 Posts
Default

nice one, ckdo. :)
Old 2010-01-10, 18:20   #42
TheJudger
 
 
"Oliver"
Mar 2005
Germany

2×3×5×37 Posts
Default

Hello,

for those who want to try: find attached the code :)
In its current state I recommend this only to people who are familiar with CUDA.
The code is NOT ready for everyday usage: you will be able to submit new factors to the PrimeNet server, but you can't submit "no factor from 2^X to 2^Y" results.

Compared to my postings from January 4th there are some speed improvements as well (maybe 10%).

When you start, try running the selftest first (see params.h and enable it there).
The selftest contains ~40 exponents which have known factors between 2^68 and 2^71. To speed up the selftest you can define MORE_CLASSES in params.h. The selftest only checks the class in which the known factor falls, so it is a lot faster than doing the full range.

Try to compile with 'compile_replace_umul.hi_umul24.hi.sh', this gives a speedup for free compared to 'compile.sh'. (See post #21/#26 of this thread).

Once you've created a binary, just run it without any parameters and it will show you how to run it. (You need to enter a valid set of parameters even if you only want to run the selftest...)

I recommend running it on Mersenne numbers with known factors and checking whether my code finds the factor(s) as well. Please report the results of your runs.

It should work on (almost) all CUDA-capable devices.
I'm not 100% sure about G80 (GeForce 8800 GTS 320/640, 8800 GTX, 8800 Ultra and their Quadro/Tesla variants). Maybe you need to disable "USE_PINNED_MEMORY" and "USE_ASYNC_COPY" on these devices (params.h).

The selftest runs fine on an old 8600GTS :)

Documentation is almost entirely missing. :(
The user interface isn't very comfortable, sorry for that ;)

Oliver
Attached Files
File Type: gz mfaktc.tar.gz (24.4 KB, 355 views)
Old 2010-01-10, 18:49   #43
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

11·523 Posts
Default

Quote:
Originally Posted by TheJudger View Post
The selftest runs fine on an old 8600GTS :)
At least I know it works on my card.
How fast is it on the 8600GTS? From my experience it is very slow compared with modern cards like the GTX275, although plenty good enough for the gaming I do.
Is it worth getting it working? I have never compiled anything, as I have only ever used msievegpu. Would there be any chance of binaries for it?
Old 2010-01-10, 19:09   #44
TheJudger
 
 
"Oliver"
Mar 2005
Germany

2×3×5×37 Posts
Default

Hi henryzz,

for factors between 2^64 and 2^71 it is about twice as fast as a single core of ATH's Core 2 Quad.
I like the card: since it is relatively slow, it's easy to spot differences in the runtime of the GPU code, and the CPU never limits the throughput.

The raw speed of the GPU code is easy to estimate, since it scales almost perfectly with the GPU's GFLOPS. I have tested this on
- 8400GS (43.2GFLOPS / 2.3M candidates tested per second)
- 8600GT (113GFLOPS / 6.1M candidates tested per second)
- 8800GTX (518GFLOPS / ~28M candidates tested per second)
- GTX 275 (1011GFLOPS / ~54M candidates tested per second)

I think it is not the right time for precompiled binaries; there are too many compile-time options in the code right now.

I forgot to mention: I have only run it on Linux so far (openSUSE 11.1 x86-64).
If you still want a binary I can create one on my system; let me know which CPU you have and I'll choose the settings accordingly.
You'll need to install the CUDA software as well.

Last fiddled with by TheJudger on 2010-01-10 at 20:01


Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.