20100103, 09:43  #34 
Banned
"Luigi"
Aug 2002
Team Italia
12C0_{16} Posts 
You can take into shared memory the 16 segments used for sieving (one for each residual class) and synchronize threads only at the end of the whole task.
Updating the segment is a matter of nanoseconds, once you locally store the precomputed gaps for each prime. Luigi 
20100104, 19:54  #35 
"Oliver"
Mar 2005
Germany
11·101 Posts 
Hi,
another performance update. Using more classes helps to speedup the sieve (since it removes more candidates with small factors totally out of the sieve at the cost of additional initializations). Remove classes which are  3 or 5 mod 8  multiples of 3 or 5 4 * 3 * 5 = 60 classes, 44 classes cleared, 16 classes remaining (26.7%) Removing multiples of 7, too: 4 * 3 * 5 * 7 = 420 classes, 324 classes cleared, 96 classes remaining (22.9%) Removing multiples of 11, too: 4 * 3 * 5 * 7 * 11 = 4620 classes, 3660 classes cleared, 960 classes remaining (20.8%)  One process on the CPU, sieving the first 4000 odd primes: M66362159 TF from 2^64 to 2^65: 160.7s / 185.5s M66362159 TF from 2^65 to 2^66: 318.8s / 323.5s M66362159 TF from 2^66 to 2^67: 631.6s / 620.7s First time: using 96 classes Second time: using 960 classes The range is a bit to small for proper timings when using 960 classes. The reason is that I always fill up the candidate list no matter if its above the range or not. And compared to 96 classes it need 10x more initializations.  M66362159 TF from 2^64 to 2^65 using 96 classes: sieving first 4000/25000 odd primes 1 process: 160.7s / 167.5s 2 processes: 285.8s / 238.1s (both processes are doing TF from 64 to 65bit) Throughput for M66362159 TF from 2^64 to 2^65: Time estimate for my 4GHz C2D (based on ATHs timings): 1160s (Since Prime95 TF fits excellent into the L1Cache of the CPU I've scaled up the time directly by the clock rates) How often can I do the TF in one hour? One Process for CUDA and one with Prime95 (assuming they don't slow down each other): 3600s / 160.7s + 3600s / 1160s = 22.4 + 3.1 = 25.5 Two Processes for CUDA and no for Prime95: 2 * 3600s / 238.1s = 30.2 So when aiming for TF throughput on my particular system it is clearly beneficial to dedicate 2 CPU cores to my CUDAcode. 
20100104, 21:13  #36  
Jan 2008
France
1000011111_{2} Posts 
Quote:


20100104, 22:45  #37 
"Oliver"
Mar 2005
Germany
11·101 Posts 
Hi ldesnogu,
I'm not 100% sure if I got the point of your post. The smallest primes (3, 5, 7, 11) are removed from the sieve totally. I'll try to explain how it works. Let us ignore the fact that we can remove factors which are 3 or 5 mod 8 for simplicy (this would just increase the number of classes by four). For example we take M23 (p=23) and build only 3 classes (remove all multiples of 3). The factor candidates (FC) have the form 2kp+1. Code:
class 0: k={ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, ...} class 1: k={ 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...} class 2: k={ 2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, ...} Check, if a class is 0 mod 3: k=3: 2kp+1 mod 3 = 1 Check, if a class is 1 mod 3: k=1: 2kp+1 mod 3 = 2 Check, if a class is 2 mod 3: k=2: 2kp+1 mod 3 = 0 So we can ignore class 2 totally. ALL candidates in this class are a multiple of 3! Code:
class 0: k={ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, ...} class 1: k={ 1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, ...} class 2: cleared For each class find the smallest k where 2kp+1 mod 5 = 0 These are k=9 for class 0 and k=4 for class 1. Starting at k=9/k=4 in this lists mark every 5th k as cleared. Code:
class 0: k={ 3, 6, *9, 12, 15, 18, 21,*24, 27, 30, 33, ...} class 1: k={ 1, *4, 7, 10, 13, 16,*19, 22, 25, 28, 31,*34, ...} class 2: cleared Those classes are independent on each other so we need to handle only a single class at one time. There is no need to try those FCs in ascending order. My actual implementation uses one of  4 * 3 * 5 * 7 = 420 classes, sieve size = N * 11 * 13 * 17 * 19 bits = N * 46189 bits  4 * 3 * 5 * 7 * 11 = 4620 classes, sieve size = N * 13 * 17 * 19 * 23 bits = N * 96577 bits Each k is represented as a single bit. These sieve sizes allow to have preinitialized sieves which have allready sieved the small primes up to 19/23. 
20100105, 09:47  #38 
Jan 2008
France
543_{10} Posts 
Oh OK, I mistakenly thought you only removed smaller primes from your sieve by using initialization, while in fact your initialization removes some larger ones
In case you don't know, this method is called sieving by wheel. 
20100105, 19:06  #39 
"Oliver"
Mar 2005
Germany
11·101 Posts 

20100105, 19:40  #40 
Dec 2007
Cleves, Germany
1022_{8} Posts 
In other words, you reinvented the wheel.

20100105, 19:45  #41 
"Oliver"
Mar 2005
Germany
457_{16} Posts 
nice one, ckdo. :)

20100110, 18:20  #42 
"Oliver"
Mar 2005
Germany
1111_{10} Posts 
Hello,
for those who want to try: find attached the code :) At the current state I recommend this only to people how are familar with CUDA. The code is NOT ready for everyday usage, you will be able to submit new factors to the primenet server but you can't submit "no factor from 2^X to 2^Y" results. Compared to my postings from 4. Jan there are some speed improvements aswell (maybe 10%). When you start: try to run the selftest first (see params.h and enable them). The selftest contains ~40 exponents which have known factors between 2^68 to 2^71. To speedup the selftest you can define MORE_CLASSES in params.h. The selftest does only check the class in which the known factor falls so it is alot faster than doing the full range. Try to compile with 'compile_replace_umul.hi_umul24.hi.sh', this gives a speedup for free compared to 'compile.sh'. (See post #21/#26 of this thread). Once you've created a binary just run it without any parameters and it will show you how to run. (You need to enter a valid set of parameters even if you want to run the selftest...) I recommend to run on mersenne numbers with known factors and check if my code finds the factor(s) aswell. Please report the results of you runs. It should work on (allmost) all CUDAcapable devices. I'm not 100% sure about G80 (Geforce 8800 GTS 320/640, 8800 GTX, 8800 Ultra and their Quadro/Telsa variants). Maybe you need to disable "USE_PINNED_MEMORY" and "USE_ASYNC_COPY" on these devices (params.h). The selftest runs fine on an old 8600GTS :) Documentation is allmost missing. :( The userinterface isn't very comfortable, sorry for that ;) Oliver 
20100110, 18:49  #43 
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·2,909 Posts 
At least i know it works on my card
How fast is it on the 8600GTS? From my experience of it it is very slow and pointless compared with modern cards like the GTX275 although plenty good enough for the gaming i do Is it worth me getting it working? I have never compiled something as i have only ever used msievegpu. Would there be any chance of binaries for it? 
20100110, 19:09  #44 
"Oliver"
Mar 2005
Germany
11×101 Posts 
Hi henryzz,
for factors between 2^64 to 2^71 it is about twice as fast as a single core of ath's core 2 quad. I like the card since it is relative slow it's easy to spot differences in runtime of the GPUcode, the CPU never limits the throughput. The RAW speed of the GPU code can be easily estimated since it scales perfect along the GPUs GFLOPS. I have tested this on  8400GS (43.2GFLOPS / 2.3M candidates tested per second)  8600GT (113GFLOPS / 6.1M candidates tested per second)  8800GTX (518GFLOPS / ~28M candidates tested per second)  GTX 275 (1011GFLOPS / ~54M candidates tested per second) I think it is not the right time for precompiled binaries, there are too many compiletime options in the code right now. I forgot to mention: I have run it only on Linux right now (openSUSE 11.1 x8664). If you still want a binary I can create one on my system. Let me known which CPU you have, I'll make some settings than. You need to install the CUDA software aswell. Last fiddled with by TheJudger on 20100110 at 20:01 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
mfakto: an OpenCL program for Mersenne prefactoring  Bdot  GPU Computing  1668  20201222 15:38 
The P1 factoring CUDA program  firejuggler  GPU Computing  753  20201212 18:07 
grmfaktc: a CUDA program for generalized repunits prefactoring  MrRepunit  GPU Computing  32  20201111 19:56 
mfaktc 0.21  CUDA runtime wrong  keisentraut  Software  2  20200818 07:03 
World's seconddumbest CUDA program  fivemack  Programming  112  20150212 22:51 