I spent a good part of this week trying to implement fast squaring for CGBN. Ultimately [URL="https://github.com/NVlabs/CGBN/issues/19#issuecomment-966779554"]my code[/URL] was 10% slower and still had edge cases that broke.
In the best case, with squaring 100% faster than multiplication, there are 4 `mont_sqr` and 4 `mont_mul`, so it would only be 8 / (4 / 2 + 4) - 1 = 33% faster. Using [URL="https://gmplib.org/manual/Basecase-Multiplication"]GMP's 50% faster number[/URL] it would be 8 / (4 / 1.5 + 4) - 1 = 20% faster. I'll reach out to the author of the repo, because they mention fast squaring in their paper "Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs" [url]http://www.acsel-lab.com/arithmetic/arith23/data/1616a047.pdf[/url], but it's unlikely to happen. |
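For what it's worth, the arithmetic above can be sketched as a quick script. This is a back-of-the-envelope model only, assuming a multiplication costs 1 unit and a squaring costs some fraction of that:

```python
# Back-of-the-envelope speedup from making squaring cheaper than
# multiplication, using the per-iteration operation counts above:
# 4 mont_sqr + 4 mont_mul, with one multiplication costing 1 unit.

def sqr_speedup(n_sqr, n_mul, sqr_cost):
    """Fractional speedup when a squaring costs sqr_cost multiplications."""
    baseline = n_sqr + n_mul            # squaring as expensive as multiplying
    improved = n_sqr * sqr_cost + n_mul
    return baseline / improved - 1

print(f"{sqr_speedup(4, 4, 1 / 2):.0%}")    # squaring twice as fast -> 33%
print(f"{sqr_speedup(4, 4, 1 / 1.5):.0%}")  # GMP's ~1.5x faster number -> 20%
```

This also shows why the payoff is capped: with only half the operations being squarings, even a free `mont_sqr` would only double the speed of half the work.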
Just tried to upgrade my version of this as I was on a fairly old version and certain numbers were crashing.
Compiling failed with the following error:
[CODE]
/bin/bash ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile:  /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o

cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

4 errors detected in the compilation of "cgbn_stage1.cu".
[/CODE]
Have I messed something up while updating my local git repository, or is the gpu_integration branch currently broken? |
I may have discovered the issue: I think I need to update CGBN.
edit: confirmed |
My GTX 970 has burnt out, so I've had to replace it with an RTX 3060 Ti. That's sm_86, so I had to reinstall ECM-GPU.
After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again: [c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration[/c] But ./configure doesn't support a GPU arch above 75, so I had to run: [c]./configure --enable-gpu=75 --with-cuda=/usr/local/cuda CC=gcc-9 --with-cgbn-include=/home/chris/CGBN/include/cgbn[/c] and then manually update the makefiles to sm_86.

nvcc -h says, in part:
[code]
--gpu-code <code>,...                                      (-code)
        Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
        nvcc embeds a compiled code image in the resulting executable for each
        specified <code> architecture, which is a true binary load image for
        each 'real' architecture (such as sm_50), and PTX code for the
        'virtual' architecture (such as compute_50). During runtime, such
        embedded PTX code is dynamically compiled by the CUDA runtime system
        if no binary load image is found for the 'current' GPU.
        Architectures specified for options '--gpu-architecture' and
        '--gpu-code' may be 'virtual' as well as 'real', but the <code>
        architectures must be compatible with the <arch> architecture. When
        the '--gpu-code' option is used, the value for the
        '--gpu-architecture' option must be a 'virtual' PTX architecture.
        For instance, '--gpu-architecture=compute_60' is not compatible with
        '--gpu-code=sm_52', because the earlier compilation stages will assume
        the availability of 'compute_60' features that are not present on
        'sm_52'.
        Note: the values compute_30, compute_32, compute_35, compute_37,
        compute_50, sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and
        may be removed in a future release.
        Allowed values for this option: 'compute_35','compute_37','compute_50',
        'compute_52','compute_53','compute_60','compute_61','compute_62',
        'compute_70','compute_72','compute_75','compute_80','compute_86',
        'compute_87','lto_35','lto_37','lto_50','lto_52','lto_53','lto_60',
        'lto_61','lto_62','lto_70','lto_72','lto_75','lto_80','lto_86',
        'lto_87','sm_35','sm_37','sm_50','sm_52','sm_53','sm_60','sm_61',
        'sm_62','sm_70','sm_72','sm_75','sm_80','sm_86','sm_87'.
[/code]
That version of nvcc has an option, --list-gpu-code, to list the GPU architectures the compiler supports, but older versions of nvcc don't have it. On older versions something like this will probably produce the list: [c]nvcc -h | grep -o -E 'sm_[0-9]+' | sort -u[/c] But that won't work cleanly on 11.6, because the help still mentions sm_30 in the deprecation note even though sm_30 is no longer a valid target.

It seems to work OK, but I've not tried it on a big job yet. And I need to update my scripts, because the new GPU does 2432 stage 1 curves per run, which limits its use if I just need to do t30.

@SethTro, can you update configure to support sm_86?

One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it. |
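A slightly more careful version of that grep, sketched in Python: it keeps only names from the quoted "Allowed values" list, so architectures that appear solely in the deprecation note (like sm_30 on 11.6) are excluded. HELP_TEXT here is an abbreviated stand-in for the real `nvcc -h` output, not the full text:

```python
import re

# Abbreviated stand-in for `nvcc -h` output; in practice you would read
# it with subprocess.run(["nvcc", "-h"], capture_output=True, text=True).
HELP_TEXT = """
Note: the values compute_30, compute_32, compute_35, compute_37,
compute_50, sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may
be removed in a future release.
Allowed values for this option: 'compute_75','compute_80','compute_86',
'sm_35','sm_37','sm_50','sm_52','sm_75','sm_80','sm_86','sm_87'.
"""

def supported_sms(help_text):
    # The allowed values are single-quoted in the help text, while the
    # deprecation note lists names unquoted, so matching only quoted
    # sm_NN names skips the deprecated-only entries.
    return sorted(set(re.findall(r"'(sm_\d+)'", help_text)))

print(supported_sms(HELP_TEXT))
# sm_30 is absent even though the help text mentions it.
```

On recent toolkits `nvcc --list-gpu-code` remains the cleaner answer; this is only a fallback for older versions.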
[QUOTE=chris2be8;601028]My GTX 970 has burnt out,. . .
One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.[/QUOTE]Sorry to hear about your card. Can it be repaired? You're probably aware of this site, but I've been having good luck at [URL="https://www.techpowerup.com/gpu-specs/"]techpowerup[/URL] for all the details on the various cards, e.g. [URL]https://www.techpowerup.com/gpu-specs/geforce-rtx-3060-ti.c3681[/URL], which shows CUDA (i.e. compute capability) 8.6. |
And I've got another problem with the new card:
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 110000000 110000000 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=110000004, sigma=3:3698165927-3:3698168358 (2432 curves)
ecm: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.
Aborted (core dumped)
[/code]
It did t50 (B1 up to 43000000) OK, but failed with B1=110000000. Testing various B1's, it fails at 70000000 but works at 60000000:
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 60000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=60000000, B2=1, sigma=3:4285427795-3:4285430226 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 65 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 3151ms of CPU time / 2557979ms of GPU time
[/code]
And I've just started a test at B1=110000000 *without* -cgbn and it seems to be running (the failures happened after a few seconds). I may be able to get round this by not using -cgbn, but that's not ideal.
[code]
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -save test2.save 110000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:2243519347-3:2243521778 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 1024
GPU: numRegsPerThread = 30 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 76x1x1 (2432 parallel curves)
[/code]
@SethTro, do you want any more information about this bug? I can probably get a core dump if you want.

@EdH, I don't think the old card can be repaired, it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs. |
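If I'm reading that assertion right, num_bits is the bit length of the stage 1 exponent s (the product of all prime powers up to B1). By the prime number theorem log(s) is roughly B1, i.e. about B1/ln 2 ≈ 1.44·B1 bits, which would put the hard-coded 10^8 limit right around B1 = 6.9e7 and explain why 6e7 works but 7e7 fails. A rough check of that reading (my interpretation, not verified against the source):

```python
import math

# Rough estimate of the bit length of the stage 1 exponent s, the
# product of all prime powers <= B1.  By the prime number theorem
# log(s) ~ B1, so s has about B1 / ln(2) bits.  (This is my reading of
# what num_bits in the assertion counts; it is an estimate, not code
# from gmp-ecm.)

ASSERT_LIMIT = 100_000_000  # the upper bound in the failing assertion

def approx_s_bits(B1):
    return B1 / math.log(2)

for B1 in (43_000_000, 60_000_000, 70_000_000, 110_000_000):
    bits = approx_s_bits(B1)
    status = "ok" if bits <= ASSERT_LIMIT else "exceeds assertion limit"
    print(f"B1={B1:>11,}: ~{bits / 1e6:.0f}M bits ({status})")
```

If that's right, the non-cgbn path working at B1=110000000 just means it doesn't impose the same fixed cap on the exponent size.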
[QUOTE=chris2be8;601085]. . .
@EdH, I don't think the old card can be repaired, it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.[/QUOTE]Which I haven't been trying to make use of yet. I still haven't figured out the correlation between SMs/cores/etc. and how many parallel curves ECM runs. |
[QUOTE=chris2be8;601028]After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
[c]git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration[/c] [/QUOTE] CGBN has been merged into the main branch; it's probably better to use [c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c]

[QUOTE=chris2be8;601028]And I need to update my scripts because the new GPU does 2432 stage 1 curves per run. [/QUOTE] There are 4864 shader units on this card according to the technical info linked above. So if that is correct, it's better to run 4864 curves at once.

[QUOTE=chris2be8;601028]Which limits its use if I just need to do t30. [/QUOTE] Why? If you just want to do t30, use 4864 curves with a lower bound of 37e4 and skip stage 2. That should be about a t30. Unless you have a very powerful CPU it should be faster. |
[QUOTE=Gimarel;601117]CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c][/QUOTE]Is this where I should retrieve GMP-ECM rather than the svn source I reference, or is the svn source still current? Is the git source the official one?

[QUOTE=Gimarel;601117] There are 4864 shader units on this card according to the technical infos linked above. So if this is correct, it's better to run 4864 curves at once.[/QUOTE]This is confusing to me. GMP-ECM defaults to 64 curves for an NVS 510 with 192 shader units, while for my K20X with 2688 shader units the default is 896 curves. If I double (triple, etc.) the curves, it doubles (triples, etc.) the GPU time taken. This is all with the svn download. All help in understanding this is appreciated. |
I don't know. I have a 2060 Super that has 2176 shader units. Anything below 2176 curves takes as much time as 2176 curves. Total throughput is about 5-10% better for 4352 concurrent curves.
|
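Not an authoritative answer, but the defaults reported in this thread all fit one pattern: the compile line earlier shows -DECM_GPU_CURVES_BY_BLOCK=32, and every default mentioned equals 32 curves per block × 2 blocks per SM × the card's SM count. This is inferred purely from the numbers posted here, not from reading the GMP-ECM source:

```python
# Guessed formula for GMP-ECM's default GPU curve count, inferred from
# the defaults reported in this thread (NOT taken from the source):
#   curves = ECM_GPU_CURVES_BY_BLOCK * blocks-per-SM * number of SMs

CURVES_BY_BLOCK = 32   # from -DECM_GPU_CURVES_BY_BLOCK=32 in the compile line
BLOCKS_PER_SM = 2      # assumed; it makes all four reported defaults fit

def default_curves(num_sms):
    return CURVES_BY_BLOCK * BLOCKS_PER_SM * num_sms

# SM counts come from shader units / cores-per-SM (192 for Kepler,
# 64 for Turing, 128 for Ampere).
print(default_curves(1))    # NVS 510:    192/192 = 1 SMX   -> 64
print(default_curves(14))   # K20X:       2688/192 = 14 SMX -> 896
print(default_curves(34))   # 2060 Super: 2176/64 = 34 SMs  -> 2176
print(default_curves(38))   # 3060 Ti:    4864/128 = 38 SMs -> 2432
```

That would also explain the timing behaviour: anything up to the default count fills the grid once, so it takes the same time, and doubling the curves doubles the number of grid launches.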
[QUOTE=Gimarel;601117]CGBN has been merged into the main branch, it's probably better to use
[c]git clone https://gitlab.inria.fr/zimmerma/ecm.git[/c] . . .[/QUOTE]I am confused (yet again). How do I start from scratch to compile GMP-ECM with CGBN for an sm_35 card? |