20220119, 14:56  #1  
"Ed Hall"
Dec 2009
Adirondack Mtns
7·701 Posts 
Lattice Sieving with GPU
Please redirect me if the following has been discussed elsewhere.
Quote:
When I run CADONFS on my clients, the memory usage is primarily based on the size of the candidate and the options involved, but not so much the number of threads. I'm able to gain a tiny bit of throughput by maximizing the number of clients to use most of the available memory and then adjusting the threads such that all available are in use for those clients. As an example, I can get a little more from two clients running 4 threads each than one client running 8 threads. Each client in all cases uses about the same amount of memory, dependent on the candidate and the options. For the current slate of numbers I'm factoring, the clients are using less than 1G, independent of the number of threads (although my thread count is nowhere near the available threads for a GPU). For the memory limited machines, I run a single client with maximum threads. Would a GPU work for lattice sieving for smaller candidates with a single client, perhaps using less than the full set of GPU cores? Or is the parallelization too much different between GPU and CPU multithreading? (I only did a small amount of CUDA programming back in the CUDA 5 era, so I don't remember the differences.) 

20220119, 18:00  #2 
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
13555_{8} Posts 
You may find https://eprint.iacr.org/2014/397.pdf interesting

20220119, 18:56  #3 
Apr 2020
857 Posts 
As I understand it, a major issue with running lattice sieving on GPUs is that they have poor memory latency compared to CPUs. Every time the sieve hits a lattice entry, that entry must be modified. For largish jobs this will need to be done ~10^9 times per specialq. This is nothing like traditional sieving for primesearching projects, where (a) the number of candidates is much smaller, and (b) candidates are eliminated from the sieve when a single factor is found.
Cofactorization doesn't have this problem  see the paper that David linked. The amount of time spent in cofactorization is much higher with 3 large primes than 2, so the potential benefit from GPUs will be greater on bigger jobs. 
20220119, 21:21  #4  
"Robert Gerbicz"
Oct 2005
Hungary
3^{2}·5^{2}·7 Posts 
Quote:
1. You don't need a full sort, it is enough to reach that a bucket will contain all numbers hits from [x,x+2^13) or so, what will fit in the fast shared memory of gpu. 2. In one round you sort by 24 bits of the input in the fast (small) memory, read and write from the slow gpu global memory but in almost a coalesced way. Of course you need multiple rounds. 3. We can assume that the input is close to random. 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Lattice Sieving Parameters  paul0  Factoring  6  20151120 21:12 
Lattice Sieving  where do I start?  paul0  Factoring  3  20150309 13:54 
Line sieving vs. lattice sieving  JHansen  NFSNET Discussion  9  20100609 19:25 
A question on lattice sieving  joral  Factoring  5  20080403 08:01 
Initialization for lattice sieving  jasonp  Factoring  16  20060112 22:53 