20091012, 20:06  #1 
(loop (#_fork))
Feb 2006
Cambridge, England
14357_{8} Posts 
Shape of a CUDA lattice siever
These are very random thoughts; I am wondering whether a dumb implementation of lattice sieving running on nvidia's ludicrously fast parallel hardware might manage to run at reasonable speed.
Jason has already written the basic primitives, with a reasonably fast modular inverse. Lattice reduction is not obviously a serious problem, and that at least is grotesquely parallel (one prime per thread). Enumerating lattice points: I should read the bucketsieving paper carefully, because at the moment I think it's necessary to divide the sieve region into slices (eg X=2^15..2^15, Y=Y0..Y0+2^11) to get the buckets to fit in the GPU memory, but IIRC their clever lattice reduction lets you list the lattice points in Y order. Pulling entries out of buckets stored in global memory into 16k shared sieve arrays on each of 27 threadblocks on the GPU: probably fine with one block per bucket. Not sure how persistent the shared memory is, but synchronising and writing 16k back to global memory is trivial. But I can't see how getting the entries into the buckets is going to be anything other than a nightmare on stilts: it's a random store at best (as opposed to the pleasant CPU case where you can be reasonably sure that the ends of the buckets are all in cache), contention on the lastindexforthisbucket array is enormous, and I don't know what the behaviour when umpteen threads write different values to the same location in global memory is. GPGPU guides tend to advise against bucketing, and just suggest writing things out in arbitrary order and then doing a compaction pass and a parallel sort; maybe 60GB/second bandwidth to global memory is enough to let you be foolish in that way. List factorbase primes by the number of bucketentries they're expected to generate and partition among threads so you don't have to be insanely generous with bucket sizes, I suppose. 
20091012, 21:44  #2  
Tribal Bullet
Oct 2004
6720_{8} Posts 
Quote:
For the record, Alex and Bob are primarily responsible for that fast modular inverse. 

20121216, 01:07  #3 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
11426_{8} Posts 
Is it now portable the lattice siever to CUDA code or OpenCL code?
Carlos 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
QS Lattice Siever  R.D. Silverman  Factoring  31  20181008 17:17 
Compiling 64 bit lattice siever on gcc 4.8.5  chris2be8  Factoring  6  20180206 17:22 
OpenCL accellerated lattice siever  pstach  Factoring  1  20140523 01:03 
Msieve / lattice siever with degree 7/8 poly  Batalov  Msieve  54  20100113 19:45 
ggnfs lattice siever misses some primes  fivemack  Factoring  1  20080118 13:47 