Faster GPU-ECM with CGBN (https://www.mersenneforum.org/showthread.php?t=27103)

chris2be8 2021-09-04 17:08

Thanks, but I've already tried that:
[code]
4core:/etc/modprobe.d # cat 60-blacklist.nouveau.conf
blacklist nouveau
options nouveau modeset=0
[/code]

And it is in the current initramfs:
[code]
4core:/etc/modprobe.d # lsinitrd -f /etc/modprobe.d/60-blacklist.nouveau.conf
blacklist nouveau
options nouveau modeset=0
[/code]

lsmod doesn't show any nvidia kernel modules:
[code]
4core:/etc/modprobe.d # lsmod | grep -i nvidia
4core:/etc/modprobe.d #
[/code]

On my system where CUDA (but not cgbn) works:
[code]
root@sirius:~# lsmod | grep nvidia
nvidia_uvm 876544 0
nvidia_drm 49152 5
nvidia_modeset 1122304 14 nvidia_drm
nvidia 19517440 682 nvidia_uvm,nvidia_modeset
drm_kms_helper 180224 1 nvidia_drm
drm 483328 8 drm_kms_helper,nvidia_drm
ipmi_msghandler 102400 2 ipmi_devintf,nvidia
[/code]

paulunderwood 2021-09-04 17:20

Did you install through Yast or a direct download from nVidia?

chris2be8 2021-09-04 17:26

Digging a bit further, I don't think the nvidia kernel modules are correctly installed:
[code]
4core:/lib/modules # find . -name 'nvidia*'
./4.12.14-lp150.12.82-default/updates/nvidia-uvm.ko
./4.12.14-lp150.12.82-default/updates/nvidia-modeset.ko
./4.12.14-lp150.12.82-default/updates/nvidia.ko
./4.12.14-lp150.12.82-default/updates/nvidia-drm.ko
./5.3.18-57-default/weak-updates/updates/nvidia-uvm.ko
./5.3.18-57-default/weak-updates/updates/nvidia-modeset.ko
./5.3.18-57-default/weak-updates/updates/nvidia.ko
./5.3.18-57-default/weak-updates/updates/nvidia-drm.ko
./5.3.18-57-default/kernel/drivers/net/ethernet/nvidia
./5.3.18-57-preempt/kernel/drivers/net/ethernet/nvidia
./5.3.18-59.19-preempt/kernel/drivers/net/ethernet/nvidia
./5.3.18-59.19-default/weak-updates/updates/nvidia-uvm.ko
./5.3.18-59.19-default/weak-updates/updates/nvidia-modeset.ko
./5.3.18-59.19-default/weak-updates/updates/nvidia.ko
./5.3.18-59.19-default/weak-updates/updates/nvidia-drm.ko
./5.3.18-59.19-default/kernel/drivers/net/ethernet/nvidia

4core:/lib/modules # uname -r
5.3.18-59.19-preempt
[/code]

So the kernel I'm running won't find them: it looks in 5.3.18-59.19-preempt, but they are installed under 5.3.18-59.19-default (next question: how to fix this cleanly). But at least I think I know where I'm going now.
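
The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of the original post: the function names `nvidia_modules_for` and `has_nvidia_modules` are made up for illustration, and it assumes the modules live under `/lib/modules/<release>` as in the `find` output above.

```shell
# List nvidia*.ko files (optionally compressed) installed for one kernel release.
# $1: kernel release (defaults to the running kernel, `uname -r`)
# $2: modules root (defaults to /lib/modules)
# Both function names are hypothetical helpers, not existing tools.
nvidia_modules_for() (
    release=${1:-$(uname -r)}
    root=${2:-/lib/modules}
    find "$root/$release" -name 'nvidia*.ko*' 2>/dev/null
)

# Succeed only if the given (or running) kernel has at least one nvidia module.
has_nvidia_modules() {
    [ -n "$(nvidia_modules_for "$@")" ]
}
```

Running `has_nvidia_modules || echo "no nvidia modules for $(uname -r)"` would flag the situation in this post, where the `-preempt` tree holds only the unrelated `drivers/net/ethernet/nvidia` directory.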

chris2be8 2021-09-04 17:28

[QUOTE=paulunderwood;587259]Did you install through Yast or a direct download from nVidia?[/QUOTE]

Via zypper on the command line, following the instructions on Nvidia's web site: [url]https://developer.nvidia.com/cuda-downloads[/url]

EdH 2021-09-04 18:04

Some of the instructions I saw in the past had a separate step, almost hidden, that was required to install the driver. Is it possible there is a driver install step missing in your procedure?

For my Ubuntu repository install of 10.2, it automatically installs the 470 driver, no matter what I have beforehand.

Is there an equivalent to this Ubuntu command?
[code]
sudo ubuntu-drivers devices
WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
== /sys/devices/pci0000:00/0000:00:03.0/0000:01:00.0 ==
modalias : pci:v000010DEd00000FFDsv0000103Csd00000967bc03sc00i00
vendor : NVIDIA Corporation
model : GK107 [NVS 510]
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-450 - third-party non-free
driver : nvidia-driver-460-server - distro non-free
driver : nvidia-driver-455 - third-party non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-340 - distro non-free
driver : nvidia-driver-465 - third-party non-free
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-470 - third-party non-free recommended
driver : nvidia-driver-418 - third-party non-free
driver : nvidia-driver-410 - third-party non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-440 - third-party non-free
driver : nvidia-driver-460 - third-party non-free
driver : xserver-xorg-video-nouveau - distro free builtin
[/code]
Would something like that be of any help?

chris2be8 2021-09-04 20:10

After rebooting using the 5.3.18-59.19-default kernel the nvidia drivers are picked up:
[code]
4core:~ # lspci -v -s 01:00
01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 3978
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at e0000000 (64-bit, prefetchable) [size=256M]
Memory at f0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
[virtual] Expansion ROM at f7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
[/code]

I'll need to fix that but it can wait for now.
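
The two lines of interest in the `lspci -v` output above are "Kernel driver in use" and "Kernel modules" (where nouveau is still listed as a candidate). A small sketch for pulling them out of that output follows; the function names are hypothetical, and the parsing assumes the line labels appear exactly as shown above.

```shell
# Print the driver named on the "Kernel driver in use:" line of
# `lspci -v` (or `lspci -k`) output read from stdin.
# Hypothetical helper name, for illustration only.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}

# Print the candidate modules from the "Kernel modules:" line, one per line.
candidate_modules() {
    sed -n 's/^[[:space:]]*Kernel modules:[[:space:]]*//p' | tr -d ' ' | tr ',' '\n'
}
```

For example, `lspci -v -s 01:00 | driver_in_use` should print the bound driver, and `lspci -v -s 01:00 | candidate_modules` lists every module the kernel could load for the device.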

Then I started testing things ...

msieve works OK:
[code]
Sat Sep 4 19:10:51 2021 Msieve v. 1.54 (SVN 1043)
Sat Sep 4 19:10:51 2021 random seeds: 6e515738 cae1a347
Sat Sep 4 19:10:51 2021 factoring 1522605027922533360535618378132637429718068114961380688657908494580122963258952897654000350692006139 (100 digits)
Sat Sep 4 19:10:51 2021 no P-1/P+1/ECM available, skipping
Sat Sep 4 19:10:51 2021 commencing number field sieve (100-digit input)
Sat Sep 4 19:10:51 2021 commencing number field sieve polynomial selection
Sat Sep 4 19:10:51 2021 polynomial degree: 4
Sat Sep 4 19:10:51 2021 max stage 1 norm: 1.16e+17
Sat Sep 4 19:10:51 2021 max stage 2 norm: 8.33e+14
Sat Sep 4 19:10:51 2021 min E-value: 9.89e-09
Sat Sep 4 19:10:51 2021 poly select deadline: 54
Sat Sep 4 19:10:51 2021 time limit set to 0.01 CPU-hours
Sat Sep 4 19:10:51 2021 expecting poly E from 1.49e-08 to > 1.71e-08
Sat Sep 4 19:10:51 2021 searching leading coefficients from 10000 to 1000000
Sat Sep 4 19:10:52 2021 using GPU 0 (NVIDIA GeForce GTX 970)
Sat Sep 4 19:10:52 2021 selected card has CUDA arch 5.2
Sat Sep 4 19:11:19 2021 polynomial selection complete
Sat Sep 4 19:11:19 2021 elapsed time 00:00:28
[/code]

But I've been having fun with ecm.

The problem with conftest turned out to be:
[code]
chris@4core:~> gcc-9 -o conftest -I/usr/local/cuda/include -g -O2 -I/usr/local/cuda/include -Wl,-rpath,/usr/local/cuda/lib64 -L/usr/local/cuda/lib64 conftest.c -lcudart -lstdc++ -lcuda -lrt -lm -lm -lm -lm -lm
/usr/lib64/gcc/x86_64-suse-linux/9/../../../../x86_64-suse-linux/bin/ld: cannot find -lstdc++
collect2: error: ld returned 1 exit status
[/code]

So changing ./configure line 15498 from [c]CUDALIB="-lcudart -lstdc++"[/c] to [c]CUDALIB="-lcudart"[/c] made it work OK.
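
The same edit can be scripted instead of made by hand. This is a sketch only: `drop_libstdcxx` is a made-up helper name, and it assumes the assignment appears in `configure` exactly as quoted above. Since `configure` is a generated file, regenerating it would presumably undo the change.

```shell
# Drop -lstdc++ from the CUDALIB assignment in a configure script.
# $1: path to the configure script; the original is kept as "$1.bak".
# Hypothetical helper; matches only the exact assignment quoted in the post.
drop_libstdcxx() {
    sed -i.bak 's/CUDALIB="-lcudart -lstdc++"/CUDALIB="-lcudart"/' "$1"
}
```

It would be called as `drop_libstdcxx ./configure` from the gmp-ecm source directory before running `./configure` again.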

I then got a lot of errors like this:
[code]
Instruction 'vote' without '.sync' is not supported on .target sm_70 and higher from PTX ISA version 6.4
[/code]

So I edited the Makefile to build only for sm_52, since that's all I need.
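
For reference, the per-architecture nvcc option has the form seen in the build log later in this post (`--generate-code arch=compute_52,code=sm_52`). The sketch below builds that option string for a chosen list of compute capabilities; `gencode_flags` is a hypothetical helper, and where the string goes in the Makefile depends on the build and is not shown here.

```shell
# Print nvcc --generate-code options for each compute capability given
# as an argument without the dot (52 for 5.2, 70 for 7.0, ...).
# Hypothetical helper; runs in a subshell so no variables leak out.
gencode_flags() (
    flags=
    for cc in "$@"; do
        flags="$flags${flags:+ }--generate-code arch=compute_$cc,code=sm_$cc"
    done
    printf '%s\n' "$flags"
)
```

With a single argument, `gencode_flags 52` prints the one option needed for a GTX 970, which msieve reports as CUDA arch 5.2.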

But trying to build CGBN support I get:
[code]
chris@4core:~/ecm-cgbn/gmp-ecm> make
make all-recursive
make[1]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm'
Making all in x86_64
make[2]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm/x86_64'
make[2]: Nothing to be done for 'all'.
make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm/x86_64'
make[2]: Entering directory '/home/chris/ecm-cgbn/gmp-ecm'
/bin/sh ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/home/chris/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_52,code=sm_52 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile: /usr/local/cuda/bin/nvcc --compile -I/home/chris/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_52,code=sm_52 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o
cgbn_stage1.cu(435): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(757): here

cgbn_stage1.cu(442): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(757): here

cgbn_stage1.cu(435): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(760): here

cgbn_stage1.cu(442): error: identifier "cgbn_swap" is undefined
detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(760): here

4 errors detected in the compilation of "cgbn_stage1.cu".
make[2]: *** [Makefile:2571: cgbn_stage1.lo] Error 1
make[2]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm'
make[1]: *** [Makefile:1903: all-recursive] Error 1
make[1]: Leaving directory '/home/chris/ecm-cgbn/gmp-ecm'
make: *** [Makefile:783: all] Error 2
[/code]
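
The "identifier cgbn_swap is undefined" errors suggest that the CGBN headers on the include path do not provide that function. A quick way to confirm this is to search the include directory for the name. This is a sketch; `cgbn_has_symbol` is a made-up helper, and the directory is the one passed with `-I` in the log above.

```shell
# Succeed if any file under the given directory mentions the symbol as a whole word.
# $1: CGBN include directory, $2: symbol name (defaults to cgbn_swap)
# Hypothetical helper; relies on grep -r, available in GNU and BSD grep.
cgbn_has_symbol() {
    grep -rqw -- "${2:-cgbn_swap}" "$1"
}
```

For example, `cgbn_has_symbol /home/chris/CGBN/include/cgbn || echo "cgbn_swap not found in this CGBN checkout"`.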

This is after several attempts to run make, so hopefully only the relevant messages.

But I've got an older version of ecm working on the GPU (at last!), so I'll leave it for now.

frmky 2021-09-04 22:01

[QUOTE=SethTro;587047]I halved compile time by adding cgbn_swap and avoiding inlining double_add_v2 twice.
[/QUOTE]

Does it affect the runtime? I don't care much about the compile time. Just compile a few small kernels for testing, and once it's stable include a good coverage of kernels and just let it compile overnight if necessary. In my current build I included all of
[CODE]
typedef cgbn_params_t<4, 256> cgbn_params_256;
typedef cgbn_params_t<4, 512> cgbn_params_512;
typedef cgbn_params_t<8, 768> cgbn_params_768;
typedef cgbn_params_t<8, 1024> cgbn_params_1024;
typedef cgbn_params_t<8, 1536> cgbn_params_1536;
typedef cgbn_params_t<8, 2048> cgbn_params_2048;
typedef cgbn_params_t<16, 3072> cgbn_params_3072;
typedef cgbn_params_t<16, 4096> cgbn_params_4096;
typedef cgbn_params_t<16, 5120> cgbn_params_5120;
typedef cgbn_params_t<16, 6144> cgbn_params_6144;
typedef cgbn_params_t<16, 7168> cgbn_params_7168;
typedef cgbn_params_t<16, 8192> cgbn_params_8192;
typedef cgbn_params_t<32, 10240> cgbn_params_10240;
typedef cgbn_params_t<32, 12288> cgbn_params_12288;
typedef cgbn_params_t<32, 14336> cgbn_params_14336;
typedef cgbn_params_t<32, 16384> cgbn_params_16384;
typedef cgbn_params_t<32, 18432> cgbn_params_18432;
typedef cgbn_params_t<32, 20480> cgbn_params_20480;
typedef cgbn_params_t<32, 22528> cgbn_params_22528;
typedef cgbn_params_t<32, 24576> cgbn_params_24576;
typedef cgbn_params_t<32, 28672> cgbn_params_28672;
typedef cgbn_params_t<32, 32768> cgbn_params_32768;
[/CODE]
and it took a little over an hour to compile for sm_70.
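
A list like the one above follows a regular pattern, so it can be generated rather than typed. The sketch below is an illustration only: `emit_cgbn_typedefs` is a hypothetical helper, and it treats the two template arguments simply as a pair of numbers without assuming what they mean.

```shell
# Read "first second" pairs from stdin and print one typedef per pair,
# named after the second parameter as in the list above.
# Hypothetical helper, not part of gmp-ecm or CGBN.
emit_cgbn_typedefs() {
    while read -r first second; do
        [ -n "$second" ] || continue   # skip blank or incomplete lines
        printf 'typedef cgbn_params_t<%s, %s> cgbn_params_%s;\n' "$first" "$second" "$second"
    done
}
```

For a short test build, `printf '4 512\n8 1024\n' | emit_cgbn_typedefs` would emit just the two kernel sizes that appear in the error log earlier in the thread.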

paulunderwood 2021-09-05 02:24

[QUOTE=chris2be8;587267]
So changing ./configure line 15498 from [c]CUDALIB="-lcudart -lstdc++"[/c] to [c]CUDALIB="-lcudart"[/c] made it work OK.
[/QUOTE]

Use YaST to search for the development package for libstdc++ and install it (and its dependencies), then link with -lstdc++.

SethTro 2021-09-05 03:28

[QUOTE=chris2be8;587267]
This is after several attempts to run make, so hopefully only the relevant messages.

But I've got an older version of ecm working on the GPU (at last!), so I'll leave it for now.[/QUOTE]

This is an easy fix; you are on the home stretch!

I committed a change that depends on [url]https://github.com/NVlabs/CGBN/pull/17[/url] being accepted. I've now committed another change that reverts it to three cgbn_set calls for now. After you `git pull`, everything should build!

Alternatively, you can replace your CGBN directory with this one: `git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git`

SethTro 2021-09-05 03:40

[QUOTE=frmky;587274]Does it affect the runtime? I don't care much about the compile time. Just compile a few small kernels for testing, and once it's stable include a good coverage of kernels and just let it compile overnight if necessary. In my current build I included all of
[CODE]
typedef cgbn_params_t<4, 256> cgbn_params_256;
typedef cgbn_params_t<4, 512> cgbn_params_512;
typedef cgbn_params_t<8, 768> cgbn_params_768;
typedef cgbn_params_t<8, 1024> cgbn_params_1024;
.........
typedef cgbn_params_t<32, 32768> cgbn_params_32768;
[/CODE]
and it took a little over an hour to compile for sm_70.[/QUOTE]

It doesn't reduce runtime, but it does make it faster for me to test things and slightly reduces register pressure.

chris2be8 2021-09-05 05:35

[QUOTE=SethTro;587290]Alternatively, you can replace your CGBN directory with this one: `git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git`[/QUOTE]

That fails:
[code]
chris@4core:~> git clone -b cgbn_swap git@github.com:sethtroisi/CGBN.git
Cloning into 'CGBN'...
The authenticity of host 'github.com (140.82.121.4)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'github.com,140.82.121.4' (RSA) to the list of known hosts.
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
[/code]


And 'git pull' does nothing:
[code]
chris@4core:~/CGBN> git pull
Already up to date.
[/code]

Unless I'm not using it correctly.
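
"Permission denied (publickey)" means the SSH form of the address was used without an SSH key that GitHub recognises. Public repositories can normally be cloned anonymously over HTTPS instead. The sketch below rewrites the address accordingly; `github_https_url` is a made-up helper, and it assumes the fork is public.

```shell
# Rewrite an scp-style GitHub SSH address (git@github.com:owner/repo.git)
# as the equivalent HTTPS URL; any other input is printed unchanged.
# Hypothetical helper name, for illustration only.
github_https_url() {
    printf '%s\n' "$1" | sed 's|^git@github\.com:|https://github.com/|'
}
```

The clone would then be `git clone -b cgbn_swap "$(github_https_url git@github.com:sethtroisi/CGBN.git)"`, run from the directory that should contain the new CGBN checkout.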


All times are UTC.