mersenneforum.org  

Old 2021-08-04, 12:22   #12
Xyzzy
 

We played around with 10867_67m1 which is SNFS(270.42) and has a 27M matrix.

025M | 38GB | 50H
100M | 25GB | 51H
500M | 21GB | 59H


The first column is the block size (?) used on the GPU. (25M is the default.)
The second column is the memory used on the GPU.
The third column is the estimated time in hours for the LA phase.

Old 2021-08-04, 12:29   #13
Xyzzy
 

If you are using RHEL 8 (8.4) you can install the proprietary Nvidia driver easily via these directions:

https://developer.nvidia.com/blog/st...arity-streams/

Then you will need these packages installed:

gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1


And possibly:

gmp-devel
zlib-devel


You also have to manually adjust your path variable in ~/.bashrc:

export PATH="/usr/local/cuda-10.2/bin:$PATH"
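
A quick way to confirm the PATH change took effect is sketched below. It is only an illustration: the helper names `prepend_to_path` and `nvcc_location` are made up for this example, and the directory is the one from the export line above.

```python
import os
import shutil

# Directory taken from the export line above.
CUDA_BIN = "/usr/local/cuda-10.2/bin"


def prepend_to_path(directory, path):
    """Return `path` with `directory` at the front, without duplicating it."""
    parts = [p for p in path.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + parts)


def nvcc_location(path):
    """Return the nvcc executable found on `path`, or None if there is none."""
    return shutil.which("nvcc", path=path)


if __name__ == "__main__":
    new_path = prepend_to_path(CUDA_BIN, os.environ.get("PATH", ""))
    print("PATH would become:", new_path)
    print("nvcc found at:", nvcc_location(new_path))
```

If the second line prints `None`, the CUDA packages are probably installed somewhere other than the directory assumed here.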

Old 2021-08-04, 19:32   #14
Xyzzy
 

Here are more benchmarks on the same data:
Code:
VBITS =  64; BLOCKS =   25M; MEM = 37.7GB; TIME =  58.8HR
VBITS =  64; BLOCKS =  100M; MEM = 23.8GB; TIME =  66.5HR
VBITS =  64; BLOCKS =  500M; MEM = 20.0GB; TIME =  98.9HR
VBITS =  64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR

VBITS = 128; BLOCKS =   25M; MEM = 37.4GB; TIME =  49.5HR
VBITS = 128; BLOCKS =  100M; MEM = 24.2GB; TIME =  50.3HR
VBITS = 128; BLOCKS =  500M; MEM = 20.7GB; TIME =  58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME =  61.2HR

VBITS = 256; BLOCKS =   25M; MEM = 39.1GB; TIME =  47.4HR
VBITS = 256; BLOCKS =  100M; MEM = 26.5GB; TIME =  37.2HR
VBITS = 256; BLOCKS =  500M; MEM = 23.2GB; TIME =  37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME =  37.5HR

VBITS = 512; BLOCKS =   25M; MEM = 44.1GB; TIME =  57.1HR
VBITS = 512; BLOCKS =  100M; MEM = 32.2GB; TIME =  43.5HR
VBITS = 512; BLOCKS =  500M; MEM = 28.9GB; TIME =  41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME =  40.9HR
37.2 hours!
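
For anyone repeating this on their own card, a small script can pick the winner out of a grid like the one above. This is a sketch, not part of msieve: the function names are invented for the example, and the optional memory limit is a hypothetical filter for cards with less free memory.

```python
import re

# Benchmark lines copied from the post above.
BENCHMARKS = """\
VBITS =  64; BLOCKS =   25M; MEM = 37.7GB; TIME =  58.8HR
VBITS =  64; BLOCKS =  100M; MEM = 23.8GB; TIME =  66.5HR
VBITS =  64; BLOCKS =  500M; MEM = 20.0GB; TIME =  98.9HR
VBITS =  64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR
VBITS = 128; BLOCKS =   25M; MEM = 37.4GB; TIME =  49.5HR
VBITS = 128; BLOCKS =  100M; MEM = 24.2GB; TIME =  50.3HR
VBITS = 128; BLOCKS =  500M; MEM = 20.7GB; TIME =  58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME =  61.2HR
VBITS = 256; BLOCKS =   25M; MEM = 39.1GB; TIME =  47.4HR
VBITS = 256; BLOCKS =  100M; MEM = 26.5GB; TIME =  37.2HR
VBITS = 256; BLOCKS =  500M; MEM = 23.2GB; TIME =  37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME =  37.5HR
VBITS = 512; BLOCKS =   25M; MEM = 44.1GB; TIME =  57.1HR
VBITS = 512; BLOCKS =  100M; MEM = 32.2GB; TIME =  43.5HR
VBITS = 512; BLOCKS =  500M; MEM = 28.9GB; TIME =  41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME =  40.9HR
"""

LINE = re.compile(
    r"VBITS\s*=\s*(\d+);\s*BLOCKS\s*=\s*(\d+)M;"
    r"\s*MEM\s*=\s*([\d.]+)GB;\s*TIME\s*=\s*([\d.]+)HR"
)


def parse(text):
    """Turn each benchmark line into a dict of numbers."""
    rows = []
    for m in LINE.finditer(text):
        rows.append({
            "vbits": int(m.group(1)),
            "blocks_m": int(m.group(2)),
            "mem_gb": float(m.group(3)),
            "hours": float(m.group(4)),
        })
    return rows


def fastest(rows, mem_limit_gb=None):
    """Return the row with the lowest time, optionally under a memory cap.

    On a tie the first matching row in the input order is returned.
    """
    candidates = [
        r for r in rows
        if mem_limit_gb is None or r["mem_gb"] <= mem_limit_gb
    ]
    return min(candidates, key=lambda r: r["hours"]) if candidates else None


if __name__ == "__main__":
    rows = parse(BENCHMARKS)
    print("fastest overall:", fastest(rows))
    print("fastest under 24 GB:", fastest(rows, mem_limit_gb=24.0))
```

Note that VBITS=256 ties at 37.2 hours for 100M and 500M blocks, so the memory column is the tie-breaker if GPU memory is tight.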

Old 2021-08-05, 01:04   #15
frmky
 

That's great! The older V100 definitely doesn't like the VBITS=256 blocks=100M or 500M settings. It doubles the runtime. Anyone using this really needs to test different settings on their card.
Old 2021-08-06, 14:05   #16
ryanp
 

Trying this out on an NVIDIA A100. Compiled and starts to run. I'm invoking with:

Code:
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2
Code:
Fri Aug  6 12:43:59 2021  commencing linear algebra
Fri Aug  6 12:43:59 2021  using VBITS=256
Fri Aug  6 12:44:04 2021  read 36267445 cycles
Fri Aug  6 12:45:36 2021  cycles contain 123033526 unique relations
Fri Aug  6 12:58:53 2021  read 123033526 relations
Fri Aug  6 13:02:37 2021  using 20 quadratic characters above 4294917295
Fri Aug  6 13:14:54 2021  building initial matrix
Fri Aug  6 13:45:04 2021  memory use: 16201.2 MB
Fri Aug  6 13:45:24 2021  read 36267445 cycles
Fri Aug  6 13:45:28 2021  matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug  6 13:45:28 2021  sparse part has weight 4115543151 (113.48/col)
Fri Aug  6 13:50:59 2021  filtering completed in 1 passes
Fri Aug  6 13:51:04 2021  matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug  6 13:51:04 2021  sparse part has weight 4115543151 (113.48/col)
Fri Aug  6 13:54:35 2021  matrix starts at (0, 0)
Fri Aug  6 13:54:40 2021  matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug  6 13:54:40 2021  sparse part has weight 4115543151 (113.48/col)
Fri Aug  6 13:54:40 2021  saving the first 240 matrix rows for later
Fri Aug  6 13:54:47 2021  matrix includes 256 packed rows
Fri Aug  6 13:55:00 2021  matrix is 36267035 x 36267445 (15850.8 MB) with weight 3758763803 (103.64/col)
Fri Aug  6 13:55:00 2021  sparse part has weight 3574908223 (98.57/col)
Fri Aug  6 13:55:01 2021  using GPU 0 (NVIDIA A100-SXM4-40GB)
Fri Aug  6 13:55:01 2021  selected card has CUDA arch 8.0
Then a long sequence of numbers, and:

Code:
25000136 36267035 221384
25000059 36267035 218336
25000041 36267035 214416
25000174 36267035 211066
25000044 36267035 212574
25000047 36267035 212320
25000174 36267035 207956
25000171 36267035 202904
25000117 36267035 197448
25000171 36267035 191566
25000130 36267035 185008
25000136 36267035 178722
24898531 36267035 168358
3811898 36267445 264
22016023 36267445 48
24836805 36267445 60
27790270 36267445 75
24929949 36267445 75
22849647 36267445 75
24896299 36267445 90
22990599 36267445 90
25502972 36267445 110
23602625 36267445 110
26327686 36267445 135
23662886 36267445 135
26145282 36267445 165
23549845 36267445 165
26371744 36267445 205
23884092 36267445 205
26835429 36267445 255
24055165 36267445 255
26699873 36267445 315
23947051 36267445 315
26916570 36267445 390
24419378 36267445 390
27622355 36267445 485
error (line 373): CUDA_ERROR_OUT_OF_MEMORY
Old 2021-08-06, 14:12   #17
frmky
 

At the end after -nc2, add block_nnz=100000000

Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work.

If you still run out of memory with 1750M, then switch to VBITS=128, start over at 100M, and run through them again. That will use less GPU memory for the vectors, leaving more for the matrix.

Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000, then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1"
That will keep the matrix overflow in CPU memory and move it to the GPU from there as needed. It's slower, but likely still faster than running the CPU version.

Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000"
I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.
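
The order of attempts described above can be written down as a short script. This is a sketch under several assumptions: it starts from VBITS=256 because that is what the failing log shows, the function names and the `binary` parameter are invented for the example, and the file paths are the ones from the earlier msieve invocation. Since VBITS is a compile-time setting, each VBITS value would need its own build of msieve, and the thread does not say whether a matrix built under one VBITS can be reused with skip_matbuild=1 under another.

```python
# block_nnz values in the order suggested above: 100M, 500M, 1000M, 1750M.
BLOCK_NNZ_STEPS = [100_000_000, 500_000_000, 1_000_000_000, 1_750_000_000]


def fallback_ladder(skip_matbuild=False):
    """Yield (vbits, nc2_args) pairs in the order they would be tried."""
    prefix = "skip_matbuild=1 " if skip_matbuild else ""
    # First every block size at VBITS=256, then the same list at VBITS=128.
    for vbits in (256, 128):
        for nnz in BLOCK_NNZ_STEPS:
            yield vbits, f"{prefix}block_nnz={nnz}"
    # Last resort: VBITS=512 with the matrix overflow kept in CPU memory.
    yield 512, f"{prefix}block_nnz=1750000000 use_managed=1"


def msieve_command(nc2_args, binary="./msieve"):
    """Build the argument list for one attempt (paths from the earlier post)."""
    return [
        binary, "-v", "-g", "0",
        "-i", "./f/input.ini",
        "-l", "./f/input.log",
        "-s", "./f/input.dat",
        "-nf", "./f/input.fb",
        "-nc2", nc2_args,
    ]


if __name__ == "__main__":
    for vbits, args in fallback_ladder(skip_matbuild=True):
        print(f"VBITS={vbits}:", " ".join(msieve_command(args)))
```

The script only prints the candidate command lines; actually running them and stopping at the first one that does not hit CUDA_ERROR_OUT_OF_MEMORY is left to the reader.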

Last fiddled with by frmky on 2021-08-06 at 14:55
Old 2021-08-06, 16:39   #18
ryanp
 

Tried a bunch of settings:

* all of 100M, 500M, 1000M, 1750M with VBITS=256 all ran out of memory
* managed to get the matrix down to 27M with more sieving. VBITS=256, 1750M still runs out of memory.
* will try VBITS=128 next with the various settings

Is there any work planned to pick optimal (or at least, functional, won't crash) settings automatically?
Old 2021-08-06, 17:52   #19
frmky
 

Quote:
Originally Posted by ryanp View Post
Is there any work planned to pick optimal (or at least, functional, won't crash) settings automatically?
Optimal and functional are very different parameters. I can try automatically picking a block_nnz value that is more likely to work, but VBITS is a compile-time setting that can't be changed at runtime. Adding use_managed=1 will make it work in most cases but can significantly slow it down, so I've defaulted it to off.
Old 2021-08-06, 18:03   #20
ryanp
 

Looks like it's working now with VBITS=128, and a pretty decent runtime:

Code:
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=1000000000
...
matrix starts at (0, 0)
matrix is 27724170 x 27724341 (13842.2 MB) with weight 3947756174 (142.39/col)
sparse part has weight 3351414840 (120.88/col)
saving the first 112 matrix rows for later
matrix includes 128 packed rows
matrix is 27724058 x 27724341 (12940.4 MB) with weight 3222020630 (116.22/col)
sparse part has weight 3059558876 (110.36/col)
using GPU 0 (NVIDIA A100-SXM4-40GB)
selected card has CUDA arch 8.0
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000043 27724058 8774182
1000000028 27724058 9604503
1000000099 27724058 8923418
59558706 27724058 422238
1082873143 27724341 40960
954052655 27724341 1455100
916348921 27724341 16939530
106284157 27724341 9288468
commencing Lanczos iteration
vector memory use: 2961.3 MB
dense rows memory use: 423.0 MB
sparse matrix memory use: 24188.7 MB
memory use: 27573.0 MB
Allocated 82.0 MB for SpMV library
Allocated 88.6 MB for SpMV library
linear algebra at 0.0%, ETA 20h11m
checkpointing every 1230000 dimensions
linear algebra completed 12223 of 27724341 dimensions (0.0%, ETA 20h46m)
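
The memory figures in this log can be checked with a little arithmetic. The sketch below assumes that "MB" in the log means 2^20 bytes and that a block vector holds VBITS bits per matrix dimension. The factor of seven is an inference from the numbers only; nothing in the thread says how many vectors the GPU code keeps.

```python
MIB = 1024 * 1024  # assumption: the log's "MB" is 2**20 bytes

DIMENSIONS = 27724341  # matrix dimension from the log above
VBITS = 128            # build used for this run


def block_vector_mb(dimensions, vbits):
    """Size of one block vector: vbits bits for every matrix dimension."""
    return dimensions * vbits / 8 / MIB


one = block_vector_mb(DIMENSIONS, VBITS)
# Matches "dense rows memory use: 423.0 MB" (128 packed rows).
print(f"one block vector:    {one:.1f} MB")
# Matches "vector memory use: 2961.3 MB" if seven vectors are held.
print(f"seven block vectors: {7 * one:.1f} MB")
# The three parts reported in the log add up to the reported total.
total = 2961.3 + 423.0 + 24188.7
print(f"sum of parts:        {total:.1f} MB")
```

Under the same assumptions both the vector and dense-row figures scale linearly with VBITS, which fits the earlier advice that a smaller VBITS leaves more GPU memory for the sparse matrix.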
Old 2021-08-06, 18:03   #21
frmky
 

The 17.5M matrix for 2,1359+ took just under 15 hours on a V100.
Old 2021-08-06, 18:20   #22
frmky
 

Quote:
Originally Posted by ryanp View Post
Looks like it's working now with VBITS=128, and a pretty decent runtime:
A 27.7M matrix in 21 hours. The A100 is a nice card!

Do you have access to an A40 to try it? I'm curious if the slower global memory significantly increases the runtime.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.