#34
Aug 2002
10000101101101₂ Posts
We are not sure if this is interesting or not.
13_2_909m1 - Near-Cunningham - SNFS(274)

This is a big (33 bit?) job. The msieve.dat file, uncompressed and with duplicates and bad relations removed, is 49GB.

Code:
$ ls -lh
total 105G
-rw-rw-r--. 1 m m  36G Sep  8 20:33 13_2_909m1.dat.gz
drwx------. 2 m m   50 Aug  4 12:17 cub
-r--------. 1 m m  29K Aug  4 12:16 lanczos_kernel.ptx
-r-x------. 1 m m 3.4M Aug  4 12:16 msieve
-rw-rw-r--. 1 m m  49G Sep  8 22:02 msieve.dat
-rw-rw-r--. 1 m m 4.2G Sep  9 14:17 msieve.dat.bak.chk
-rw-rw-r--. 1 m m 4.2G Sep  9 14:54 msieve.dat.chk
-rw-rw-r--. 1 m m 969M Sep  9 12:11 msieve.dat.cyc
-rw-rw-r--. 1 m m  12G Sep  9 12:11 msieve.dat.mat
-rw-rw-r--. 1 m m  415 Sep  2 19:15 msieve.fb
-rw-rw-r--. 1 m m  13K Sep  9 15:10 msieve.log
-r--------. 1 m m 108K Aug  4 12:16 stage1_core.ptx
-rw-rw-r--. 1 m m  264 Sep  2 19:15 worktodo.ini

Code:
commencing linear algebra
using VBITS=256
skipping matrix build
matrix starts at (0, 0)
matrix is 27521024 x 27521194 (12901.7 MB) with weight 3687594306 (133.99/col)
sparse part has weight 3106904079 (112.89/col)
saving the first 240 matrix rows for later
matrix includes 256 packed rows
matrix is 27520784 x 27521194 (12034.4 MB) with weight 2848207923 (103.49/col)
sparse part has weight 2714419599 (98.63/col)
using GPU 0 (Quadro RTX 8000)
selected card has CUDA arch 7.5
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000013 27520784 9680444
1000000057 27520784 11295968
714419529 27520784 6544782
1039631367 27521194 100000
917599197 27521194 3552480
757189035 27521194 23868304
commencing Lanczos iteration
vector memory use: 5879.2 MB
dense rows memory use: 839.9 MB
sparse matrix memory use: 21339.3 MB
memory use: 28058.3 MB
Allocated 123.0 MB for SpMV library
Allocated 127.8 MB for SpMV library
linear algebra at 0.0%, ETA 49h57m
linear algebra completed ... of 27521194 dimensions (0.0%, ETA 49h57m)
checkpointing every 570000 dimensions
linear algebra completed 925789 of 27521194 dimensions (3.4%, ETA 45h13m)
received signal 2; shutting down
linear algebra completed 926044 of 27521194 dimensions (3.4%, ETA 45h12m)
lanczos halted after 3628 iterations (dim = 926044)
BLanczosTime: 5932
elapsed time 01:38:53
current factorization was interrupted

We have the raw files saved if there are other configurations worth investigating. If so, just let us know!
#35
"Curtis"
Feb 2005
Riverside, CA
160B₁₆ Posts
It's a 32/33 hybrid, with a healthy amount of oversieving (I wanted a matrix below 30M dimensions, success!).
I'm impressed that it fits on your card, and 50 hr is pretty amazing. I just started the matrix a few hours ago on a 10-core Ivy Bridge; the ETA is 365 hr. If you have the free cycles to run it, please be my guest! The 20+ core-weeks saved are enough to ECM the next candidate.
#36
Jul 2003
So Cal
19×137 Posts
I spent time with Nsight Compute looking at the SpMV kernel. As expected for SpMV, it's memory-bandwidth limited, so increasing occupancy to hide latency should help. I adjusted parameters to reduce both register and shared-memory use, which increased the occupancy. This yielded a runtime improvement of only about 5% on the V100, but it may differ on other cards. I also increased the default block_nnz to 1750M to reduce global memory use a bit.
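To give a concrete picture of the kind of tuning involved, here is a minimal toy CUDA sketch, not the actual msieve kernel (the kernel name, parameters, and launch bounds are made up for illustration): a GF(2) CSR SpMV whose inner loop is just gathered XORs, with __launch_bounds__ used to cap register use so more blocks can stay resident per SM.

Code:
// Toy GF(2) CSR SpMV, for illustration only -- not msieve's kernel.
// __launch_bounds__(256, 4) asks the compiler to keep register use low
// enough that at least 4 blocks of 256 threads can be resident per SM,
// raising occupancy so the scattered loads from x[] can be overlapped.
#include <stdint.h>

__global__ void __launch_bounds__(256, 4)
spmv_gf2_csr(const uint32_t *row_ptr,   // CSR row offsets, n_rows + 1 entries
             const uint32_t *col_idx,   // column index of each nonzero
             const uint64_t *x,         // one 64-bit word per column
             uint64_t       *y,         // one 64-bit word per row
             uint32_t        n_rows)
{
    uint32_t row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows)
        return;

    // Over GF(2) a multiply-accumulate is just an XOR of the vector
    // words selected by this row's nonzeros, so the kernel does almost
    // no arithmetic and lives or dies by memory bandwidth and latency.
    uint64_t acc = 0;
    for (uint32_t j = row_ptr[row]; j < row_ptr[row + 1]; j++)
        acc ^= x[col_idx[j]];

    y[row] = acc;
}

In a kernel like this the main lever is how many loads are in flight at once, which is why trimming registers and shared memory to raise occupancy can matter even though it only buys a few percent on a V100.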
#37
Jul 2003
So Cal
101000101011₂ Posts
Today I expanded the allowed values of VBITS to any of 64, 128, 192, 256, 320, 384, 448, or 512. This works on both CPUs and GPUs, but I don't expect much, if any, speedup on CPUs. As a GPU benchmark, I tested a 42.1M matrix on two NVLink-connected V100s. Here are the results.
Code:
VBITS    Time (hours)
  64        109.5
 128         63.75
 192         50
 256         40.25
 320         40.25
 384         37.75
 448         40.25
 512         37.25
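For intuition about why wider VBITS helps on the GPU, here is a hypothetical sketch along the same lines as the toy kernel a few posts up (again not msieve's code; the VWORDS layout and kernel name are illustrative): each vector element is VBITS/64 64-bit words, so every column index fetched from the matrix is reused VBITS/64 times, amortizing the index traffic.

Code:
// Toy block-wide GF(2) CSR SpMV, for illustration only.
#include <stdint.h>

#ifndef VBITS
#define VBITS 256                  // block width in bits
#endif
#define VWORDS (VBITS / 64)        // 64-bit words per vector element

__global__ void spmv_gf2_block(const uint32_t *row_ptr,
                               const uint32_t *col_idx,
                               const uint64_t *x,   // n_cols * VWORDS words
                               uint64_t       *y,   // n_rows * VWORDS words
                               uint32_t        n_rows)
{
    uint32_t row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows)
        return;

    uint64_t acc[VWORDS] = {0};
    for (uint32_t j = row_ptr[row]; j < row_ptr[row + 1]; j++) {
        // One col_idx load now feeds VWORDS XORs, so the per-nonzero
        // index overhead shrinks as VBITS grows.
        const uint64_t *xv = x + (uint64_t)col_idx[j] * VWORDS;
        for (int w = 0; w < VWORDS; w++)
            acc[w] ^= xv[w];
    }
    for (int w = 0; w < VWORDS; w++)
        y[(uint64_t)row * VWORDS + w] = acc[w];
}

Once the vector-word traffic dominates the index traffic there is little left to amortize, which may be why the timings above flatten out between VBITS=320 and 512.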
#38
Aug 2002
10000101101101₂ Posts
Our system has a single GPU. When we are doing compute work on the GPU, the display lags. We can think of two ways to fix this:

1. Some sort of niceness assignment to the compute process.
2. Limiting the compute process to less than 100% of the GPU.

Is either of these approaches possible?
#39
Aug 2002
43×199 Posts
Since GPU LA is so fast, should we rethink how many relations are generated by the sieving process?
#41
Jul 2003
So Cal
19·137 Posts
Quote:
Edit: Lowering VBITS will also reduce kernel runtimes, but don't go below 128. See the benchmark a few posts above. Also, you can't change VBITS in the middle of a run. You would need to start over from the beginning. You can change block_nnz during a restart.

Last fiddled with by frmky on 2021-09-17 at 15:40
#43
"Curtis"
Feb 2005
Riverside, CA
3³·11·19 Posts
Quote:
Another way to view this is to aim for the number of relations one would use if one were doing the entire job on one's own equipment, and then add just a bit to reduce the chance of needing to ask for more Q from admin (like round Q up to the nearest 5M or 10M increment). |
#44
Aug 2002
20555₈ Posts
What is the difference in relations needed between TD=120 and TD=100? (Do we have this data?)
We think a GPU could do a TD=100 job faster than a CPU could do a TD=120 job. Personally, we don't mind having to rerun matrix building if there aren't enough relations. We don't know whether it is a drag for the admins to add additional relations, but if it isn't a big deal, the project could probably run more efficiently.

There doesn't seem to be a shortage of LA power, so maybe the project could skew a bit in favor of more jobs overall with fewer relations per job? Is the bottleneck server storage space?

What percentage in CPU-hours is the sieving versus the post-processing work? Does one additional hour of post-processing "save" 1000 hours of sieving? More? Less?

(We lack the technical knowledge and vocabulary to express what we are thinking. Hopefully what we wrote makes a little sense.)