mersenneforum.org > Factoring Projects > Msieve
2021-09-24, 06:17   #56
frmky
Jul 2003, So Cal
3²·13·19 Posts

Quote:
Originally Posted by charybdis
@frmky, for future reference, when I tested this I found that rational side sieving with *algebraic* 3LP was fastest. This shouldn't be too much of a surprise: the rational norms are larger, but not so much larger that 6 large primes across the two sides should split 4/2 rather than 3/3 (don't forget the special-q is a "free" large prime).
I'll try that, thanks!
2021-09-24, 06:21   #57
frmky
Jul 2003, So Cal
100010101111₂ Posts

Quote:
Originally Posted by frmky
filtering yielded
Code:
matrix is 102063424 x 102063602 (51045.3 MB) with weight 14484270868 (141.91/col)
Normally I'd try to bring this down, but testing on a quad V100 system with NVLink gives
Code:
linear algebra completed 2200905 of 102060161 dimensions (2.2%, ETA 129h 5m)
And it's done. LA on the 102M matrix with restarts took 5 days 14 hours.
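A quick sanity check on those figures (my arithmetic, treating the logged numbers as given, not msieve output):

```python
# Values copied from the logs above; the arithmetic is mine.
weight = 14484270868          # total matrix weight (nonzeros)
cols = 102063602              # matrix columns
per_col = weight / cols
print(f"average weight per column: {per_col:.2f}")   # ~141.91, matching the log

# At 2.2% done the solver reported 129h 5m remaining, so the implied
# total runtime is roughly:
done = 2200905 / 102060161    # fraction of dimensions completed
remaining_h = 129 + 5 / 60
total_h = remaining_h / (1 - done)
print(f"implied total runtime: {total_h:.0f} h (~{total_h / 24:.1f} days)")
```

That works out to about 132 hours (~5.5 days), consistent with the reported 5 days 14 hours (134 h) once restart overhead is added.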
2021-09-24, 12:36   #58
charybdis
Apr 2020
21F₁₆ Posts

Quote:
Originally Posted by frmky
I'll try that, thanks!
Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?
2021-09-24, 13:40   #59
pinhodecarlos
"Carlos Pinho"
Oct 2011, Milton Keynes, UK
3×1,663 Posts

Quote:
Originally Posted by charybdis
Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?
95%.
2021-09-24, 15:13   #60
frmky
Jul 2003, So Cal
2223₁₀ Posts

Quote:
Originally Posted by charybdis
Also 250M is very low for alim/rlim at this size; some quick testing suggests the optimum is likely between 500M and 1000M. Is this done to keep memory use low? How many 16f contributors don't have the 1.5GB per thread needed to use lim=500M?
A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.
2021-09-24, 15:50   #61
charybdis
Apr 2020
3·181 Posts

If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%. It also increases the number of relations needed by some unknown amount (almost certainly below 50%), making the LA that bit harder as a result.
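A back-of-envelope reading of that trade-off (my arithmetic; the ~30% and ~50% figures are taken from the post as given, and the 25% extra-relations figure is a hypothetical placeholder, since the post only bounds it below 50%):

```python
# 3LP-on-both-sides trade-off: sec/rel worsens ~30%, yield (relations
# per special-q) improves ~50%, and some unknown fraction of extra
# relations is needed.
sec_per_rel_factor = 1.30     # ~30% more CPU time per relation
yield_factor = 1.50           # ~50% more relations per special-q
extra_rels_needed = 1.25      # hypothetical: 25% more relations required

# Total sieving CPU time scales with (relations needed) * (sec/rel):
cpu_time_factor = sec_per_rel_factor * extra_rels_needed
print(f"total sieving CPU time vs 2LP: {cpu_time_factor:.2f}x")

# But the special-q range needed shrinks, which is what stretches the
# upper limit of doable jobs:
q_range_factor = extra_rels_needed / yield_factor
print(f"special-q range needed vs 2LP: {q_range_factor:.2f}x")
```

So the win is not CPU efficiency (that gets worse) but a smaller special-q range, which is exactly the headroom that stretches the doable-job limit.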

But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this.
2021-10-22, 13:34   #62
ryanp
Jun 2012, Boulder, CO
5×67 Posts

In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?
2021-10-22, 23:00   #63
frmky
Jul 2003, So Cal
4257₈ Posts

Technically it's an MxN matrix with M slightly less than N, but for this question we can approximate it as NxN.

Volta (and I'm hoping Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default of 1.75 billion. The actual limit is that the number of nonzeros in a CUB SpMV call is stored in an int32, so each matrix block must have fewer than 2^31 nonzeros. block_nnz sets an estimate, especially for the transpose matrix, so I've been a bit conservative in setting it at 1.75B. We want to keep the number of blocks reasonably small, since each block of both the normal and transpose matrix needs a 4*(N+1)-byte row-offset array in addition to the 4*num_nonzeros-byte column array in GPU memory.
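To make that concrete, here's a rough illustration (my packaging of the formulas in the post, not msieve code) using the 102M-column matrix from earlier in the thread:

```python
import math

# Block count and storage estimate per the description above, using the
# 102M matrix from earlier in the thread as an example. N approximates
# both dimensions (M is only slightly smaller than N).
N = 102063602                 # matrix dimension
nnz = 14484270868             # total matrix weight (nonzeros)
block_nnz = 1_750_000_000     # default target nonzeros per block

blocks = math.ceil(nnz / block_nnz)
avg_block_nnz = nnz / blocks
print(f"blocks: {blocks}, ~{avg_block_nnz:.2e} nonzeros each")
assert avg_block_nnz < 2**31  # must stay under the int32 CUB SpMV limit

# Per-block row-offset overhead vs the column arrays (4 bytes/entry):
row_offset_mb = 4 * (N + 1) / 1e6
col_array_mb = 4 * nnz / 1e6
print(f"row offsets: ~{row_offset_mb:.0f} MB per block; "
      f"column array: ~{col_array_mb:.0f} MB total per direction")
```

At this size the default gives 9 blocks of ~1.6B nonzeros, comfortably under 2^31, with each extra block costing another ~408 MB row-offset array per direction (in practice the matrix is also split across the MPI grid, so per-GPU numbers are smaller).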

For VBITS, a global memory fetch on current NVIDIA GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on the A100). With VBITS=128, we are only using 16 bytes of that data, with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of the data and thus makes more efficient use of global memory bandwidth in the SpMV. However, each iteration also has multiple (VBITS×N)·(N×VBITS) dense matrix multiplications, which require strided access to arrays; this strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory. In practice, on the V100 I've gotten about equal performance from VBITS of 384 and 512, and poorer performance with smaller values. Of the two I use 384, since it requires less GPU memory. Lower VBITS values are still useful if GPU memory is tight, though. Once I have access to an A100 I will compare VBITS=256 with cudaLimitMaxL2FetchGranularity set to 32 against VBITS=384 or 512 with the default.
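The 7*N*VBITS/8 vector cost is easy to tabulate (my arithmetic, using the 102M matrix from earlier as the example N; on a multi-GPU grid each card only holds its share):

```python
# Vector memory from the 7*N*VBITS/8 formula above, for the VBITS
# choices discussed. N is the 102M matrix from earlier in the thread.
N = 102063602
for vbits in (128, 256, 384, 512):
    mb = 7 * N * vbits / 8 / 1e6
    print(f"VBITS={vbits}: ~{mb:.0f} MB for vectors")
```

Going from 384 to 512 costs roughly another 11 GB of vector storage across the grid, which is why 384 wins when the two are otherwise equal in speed.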

So, in short, unless GPU memory is tight use VBITS=384 and the default block_nnz on V100 and likely on A100 as well.
2021-10-26, 04:05   #64
frmky
Jul 2003, So Cal
3²·13·19 Posts

2,2174M is in LA, so here's one more data point. Running on eight NVLink-connected V100s:
Code:
Sun Oct 24 01:15:27 2021  matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)
Sun Oct 24 01:15:27 2021  sparse part has weight 13874205635 (129.95/col)
...
Sun Oct 24 23:03:59 2021  commencing linear algebra
Sun Oct 24 23:03:59 2021  using VBITS=384
Sun Oct 24 23:03:59 2021  skipping matrix build
Sun Oct 24 23:03:59 2021  initialized process (0,0) of 2 x 4 grid
Sun Oct 24 23:09:35 2021  matrix starts at (0, 0)
Sun Oct 24 23:09:39 2021  matrix is 53382681 x 25338016 (8267.4 MB) with weight 2435546404 (96.12/col)
Sun Oct 24 23:09:39 2021  sparse part has weight 1913870759 (75.53/col)
Sun Oct 24 23:09:39 2021  saving the first 368 matrix rows for later
Sun Oct 24 23:09:46 2021  matrix includes 384 packed rows
Sun Oct 24 23:10:15 2021  matrix is 53382313 x 25338016 (7468.9 MB) with weight 1554978635 (61.37/col)
Sun Oct 24 23:10:15 2021  sparse part has weight 1451172382 (57.27/col)
Sun Oct 24 23:10:15 2021  using GPU 0 (Tesla V100-SXM2-32GB)
Sun Oct 24 23:10:15 2021  selected card has CUDA arch 7.0
Sun Oct 24 23:12:44 2021  commencing Lanczos iteration
Sun Oct 24 23:12:47 2021  memory use: 20898.7 MB
Sun Oct 24 23:12:56 2021  linear algebra at 0.0%, ETA 90h17m
It'll take a bit longer due to queue logistics, but hopefully it'll be done within the week.
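The logged "memory use: 20898.7 MB" can be roughly cross-checked against the formulas from #63. This is my estimate with several unconfirmed assumptions about msieve internals: one block per direction at this size, transpose row offsets sized by the column count, vectors sized by the local column dimension, MB meaning 10^6 bytes, and the 384 dense packed rows and workspace ignored:

```python
# Hedged cross-check of "memory use: 20898.7 MB" using the per-process
# matrix from the log (grid position (0,0) of the 2x4 grid).
rows, cols = 53382313, 25338016        # local matrix after dense-row packing
nnz = 1451172382                       # local sparse weight
vbits = 384

col_arrays = 2 * 4 * nnz                     # column arrays, normal + transpose
row_offsets = 4 * (rows + 1) + 4 * (cols + 1)  # one block per direction (assumed)
vectors = 7 * cols * vbits // 8              # 7*N*VBITS/8, local N (assumed)
total_mb = (col_arrays + row_offsets + vectors) / 1e6
print(f"estimated: ~{total_mb:.0f} MB vs reported 20898.7 MB")
```

That lands within a couple of percent of the reported figure, which at least suggests the column arrays and vectors dominate the footprint.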
2021-10-26, 06:21   #65
pinhodecarlos
"Carlos Pinho"
Oct 2011, Milton Keynes, UK
3·1,663 Posts

And I suppose you will be comparing with the other sieve run with higher LPs; probably there are still some leftovers.
2021-10-26, 07:48   #66
frmky
Jul 2003, So Cal
3²·13·19 Posts

Quote:
Originally Posted by pinhodecarlos
And I suppose you will be comparing with the other sieve run with higher LPs; probably there are still some leftovers.
We didn't sieve it twice. Only a little at the beginning was sieved with 33-bit LPs, and all the relations were combined. There are a few stragglers that I'm not worrying about.