mersenneforum.org > Factoring Projects > Factoring
2021-11-12, 04:04   #100
SethTro

I spent a good part of this week trying to implement fast squaring for CGBN. Ultimately my code was 10% slower and still broke on some edge cases.

In the best case, with squaring twice as fast as multiplication (100% faster), there are 4 `mont_sqr` and 4 `mont_mul`, so it would only be 8 / (4 / 2 + 4) - 1 = 33% faster.

Using GMP's figure of 50% faster squaring, it would be 8 / (4 / 1.5 + 4) - 1 = 20% faster.

I'll reach out to the author of the repo, because they mention fast squaring in their paper "Optimizing Modular Multiplication for NVIDIA's
Maxwell GPUs" (http://www.acsel-lab.com/arithmetic/...a/1616a047.pdf), but it's unlikely to happen.
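For reference, a quick sketch of the arithmetic above (the 4 `mont_sqr` / 4 `mont_mul` counts are from the post; the `speedup` helper is just illustrative):

```python
def speedup(sqr_cost, n_sqr=4, n_mul=4):
    """Overall speedup from squarings costing sqr_cost each (mont_mul = 1.0)."""
    before = n_sqr + n_mul           # today: squarings done as full multiplies
    after = n_sqr * sqr_cost + n_mul
    return before / after - 1

print(f"{speedup(0.5):.0%}")      # squaring 100% faster (half the cost) -> 33%
print(f"{speedup(1 / 1.5):.0%}")  # GMP's ~50% faster squaring -> 20%
```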
2021-11-27, 22:05   #101
henryzz

I just tried to upgrade my build of this, as I was on a fairly old version and certain numbers were crashing.

Compiling has failed with the following error:

Code:
/bin/bash ./libtool --tag=CC --mode=compile /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include  -DECM_GPU_CURVES_BY_BLOCK=32  --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include  -DWITH_GPU -o cgbn_stage1.lo cgbn_stage1.cu -static
libtool: compile:  /usr/local/cuda/bin/nvcc --compile -I/mnt/c/Users/david/Downloads/gmp-ecm-gpu_integration/CGBN/include/cgbn -lgmp -I/usr/local/cuda/include -DECM_GPU_CURVES_BY_BLOCK=32 --generate-code arch=compute_75,code=sm_75 --ptxas-options=-v --compiler-options -fno-strict-aliasing -O2 --compiler-options -fPIC -I/usr/local/cuda/include -DWITH_GPU cgbn_stage1.cu -o cgbn_stage1.o
cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<4U, 512U>]"
(800): here

cgbn_stage1.cu(437): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(444): error: identifier "cgbn_swap" is undefined
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

cgbn_stage1.cu(407): warning: variable "temp" was declared but never referenced
          detected during instantiation of "void kernel_double_add<params>(cgbn_error_report_t *, uint32_t, uint32_t, uint32_t, char *, uint32_t *, uint32_t, uint32_t, uint32_t) [with params=cgbn_params_t<8U, 1024U>]"
(803): here

4 errors detected in the compilation of "cgbn_stage1.cu".
Have I messed something up while updating my local git repository or is the gpu_integration branch broken currently?
2021-11-28, 04:43   #102
henryzz

I may have discovered the issue. I think I need to update CGBN.
edit: confirmed

Last fiddled with by henryzz on 2021-11-28 at 05:21
2022-03-03, 16:45   #103
chris2be8

My GTX 970 has burnt out, so I've had to replace it with an RTX 3060 Ti. That's sm_86, so I had to reinstall ECM-GPU.

After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration
But ./configure doesn't support a GPU arch above 75, so I had to run:
./configure --enable-gpu=75 --with-cuda=/usr/local/cuda CC=gcc-9 --with-cgbn-include=/home/chris/CGBN/include/cgbn
and then manually update the makefiles to sm_86.

nvcc -h says in part
Code:
--gpu-code <code>,...                           (-code)                         
        Specify the name of the NVIDIA GPU to assemble and optimize PTX for.
        nvcc embeds a compiled code image in the resulting executable for each specified
        <code> architecture, which is a true binary load image for each 'real' architecture
        (such as sm_50), and PTX code for the 'virtual' architecture (such as compute_50).
        During runtime, such embedded PTX code is dynamically compiled by the CUDA
        runtime system if no binary load image is found for the 'current' GPU.
        Architectures specified for options '--gpu-architecture' and '--gpu-code'
        may be 'virtual' as well as 'real', but the <code> architectures must be
        compatible with the <arch> architecture.  When the '--gpu-code' option is
        used, the value for the '--gpu-architecture' option must be a 'virtual' PTX
        architecture.
        For instance, '--gpu-architecture=compute_60' is not compatible with '--gpu-code=sm_52',
        because the earlier compilation stages will assume the availability of 'compute_60'
        features that are not present on 'sm_52'.
        Note: the values compute_30, compute_32, compute_35, compute_37, compute_50,
        sm_30, sm_32, sm_35, sm_37 and sm_50 are deprecated and may be removed in
        a future release.
        Allowed values for this option:  'compute_35','compute_37','compute_50',
        'compute_52','compute_53','compute_60','compute_61','compute_62','compute_70',
        'compute_72','compute_75','compute_80','compute_86','compute_87','lto_35',
        'lto_37','lto_50','lto_52','lto_53','lto_60','lto_61','lto_62','lto_70',
        'lto_72','lto_75','lto_80','lto_86','lto_87','sm_35','sm_37','sm_50','sm_52',
        'sm_53','sm_60','sm_61','sm_62','sm_70','sm_72','sm_75','sm_80','sm_86',
        'sm_87'.
That version of nvcc has an option, --list-gpu-code, to list the GPU architectures the compiler supports. Older versions don't have that option.

Older versions of nvcc will probably give a list with:
Code:
nvcc -h | grep -o -E 'sm_[0-9]+' | sort -u
But that won't quite work for 11.6, because the help text still mentions sm_30 in its deprecation note even though sm_30 is no longer a valid value.

It seems to work OK, but I've not tried it on a big job yet. And I need to update my scripts, because the new GPU does 2432 stage 1 curves per run, which limits its use if I just need to do a t30.

@SethTro, can you update configure to support sm_86?

One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.
2022-03-04, 03:45   #104
EdH

Quote:
Originally Posted by chris2be8
My GTX 970 has burnt out . . .
One other grouse is that Nvidia seem to regard details of what level of CUDA you need for a given card as top secret information. I wasted a lot of time searching for it.
Sorry to hear about your card. Can it be repaired?

You're probably aware of this site, but I've been having good luck at techpowerup for all the details on the various cards.

e.g. https://www.techpowerup.com/gpu-spec...-3060-ti.c3681, which shows CUDA 8.6.
2022-03-04, 16:39   #105
chris2be8

And I've got another problem with the new card:
Code:
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 110000000 110000000 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=110000004, sigma=3:3698165927-3:3698168358 (2432 curves)
ecm: cgbn_stage1.cu:525: char* allocate_and_set_s_bits(const __mpz_struct*, int*): Assertion `1 <= num_bits && num_bits <= 100000000' failed.
Aborted (core dumped)
It did t50 (B1 up to 43000000) OK, but failed with B1=110000000.

Testing various B1 values, it fails at 70000000 but works at 60000000:
Code:
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -cgbn -save test1.save 60000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=60000000, B2=1, sigma=3:4285427795-3:4285430226 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 65 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 3151ms of CPU time / 2557979ms of GPU time
And I've just started a test at B1=110000000 *without* -cgbn, and it seems to be running (the failures happened after a few seconds). I may be able to get round this by not using -cgbn, but that's not ideal.
Code:
tests/b58+148> /home/chris/ecm-cgbn/gmp-ecm/ecm -gpu -save test2.save 110000000 1 <b58+148.ini
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:2243519347-3:2243521778 (2432 curves)
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 1024
GPU: numRegsPerThread = 30 sharedMemPerBlock = 24576 bytes
GPU: Block: 32x32x1 Grid: 76x1x1 (2432 parallel curves)
@SethTro, do you want any more information about this bug? I can probably get a core dump if you want.
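For what it's worth, the assertion bound lines up with the size of the stage 1 exponent: s = lcm(1, ..., B1) has roughly B1 / ln 2 bits by the prime number theorem, so the 100000000-bit cap in the assert is crossed somewhere between B1=60e6 and B1=70e6, which matches where the failures start. A rough back-of-envelope check (my own estimate, not taken from the GMP-ECM source):

```python
import math

S_BITS_CAP = 100_000_000  # the upper bound in the failing assertion

# log2(lcm(1..B1)) is roughly B1 / ln 2 by the prime number theorem
for b1 in (43_000_000, 60_000_000, 70_000_000, 110_000_000):
    bits = b1 / math.log(2)
    status = "OK" if bits <= S_BITS_CAP else "assert fails"
    print(f"B1={b1}: ~{bits:.3g} bits -> {status}")
```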

@EdH, I don't think the old card can be repaired; it smells of burnt plastic. And the one thing techpowerup doesn't say is what level of CUDA drivers and runtime the card needs.
2022-03-05, 00:44   #106
EdH

Quote:
Originally Posted by chris2be8
. . .
@EdH, I don't think the old card can be repaired, it smells of burnt plastic. And the one thing techpowerup don't say is what level of CUDA drivers and runtime the card needs.
Which I haven't been trying to make use of yet. I still haven't figured out the correlation between SMs/cores/?? and how many curves ECM runs in parallel.
2022-03-05, 06:13   #107
Gimarel

Quote:
Originally Posted by chris2be8
After updating CUDA to the latest driver and runtime version (11.6) I fetched gmp-ecm again:
git clone https://github.com/sethtroisi/gmp-ecm/ -b gpu_integration
CGBN has been merged into the main branch, so it's probably better to use
git clone https://gitlab.inria.fr/zimmerma/ecm.git

Quote:
Originally Posted by chris2be8
And I need to update my scripts because the new GPU does 2432 stage 1 curves per run.
There are 4864 shader units on this card according to the technical info linked above. So if that's correct, it's better to run 4864 curves at once.

Quote:
Originally Posted by chris2be8
Which limits its use if I just need to do t30.
Why? If you just want to do a t30, run 4864 curves with a B1 bound of 37e4 and skip stage 2. That should be about a t30. Unless you have a very powerful CPU, it should be faster.
2022-03-05, 13:55   #108
EdH

Quote:
Originally Posted by Gimarel
CGBN has been merged into the main branch, it's probably better to use
git clone https://gitlab.inria.fr/zimmerma/ecm.git
Is this where I should retrieve GMP-ECM rather than the svn source I reference, or is the svn source still current? Is the git source the official one?

Quote:
Originally Posted by Gimarel
There are 4864 shader units on this card according to the technical infos linked above. So if this is correct, it's better to run 4864 curves at once.
This is confusing to me. GMP-ECM defaults to 64 curves for an NVS 510 with 192 shading units, and for my K20X with 2688 shading units the default is 896 curves. If I double (triple, etc.) the curves, it doubles (triples, etc.) the GPU time taken. This is all with the svn download.

All help in understanding this is appreciated.
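One pattern that does fit all the numbers quoted in this thread: the default looks like 64 curves per streaming multiprocessor (SM), not per shader unit. The SM counts below come from the cards' published specs and are my addition, so treat this as a guess to verify:

```python
# (card, SM count per public specs, curve count quoted in this thread)
cards = [
    ("NVS 510",         1,   64),
    ("Tesla K20X",     14,  896),
    ("RTX 2060 Super", 34, 2176),
    ("RTX 3060 Ti",    38, 2432),
]
for name, sms, curves in cards:
    print(f"{name}: {curves} curves / {sms} SMs = {curves // sms} per SM")
```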
2022-03-05, 14:45   #109
Gimarel

I don't know. I have a 2060 Super that has 2176 shader units. Anything below 2176 curves takes as much time as 2176 curves. Total throughput is about 5-10% better for 4352 concurrent curves.
2022-03-06, 13:30   #110
EdH

Quote:
Originally Posted by Gimarel
CGBN has been merged into the main branch, it's probably better to use
git clone https://gitlab.inria.fr/zimmerma/ecm.git
. . .
I am confused (yet again). How do I start from scratch to compile GMP-ECM with CGBN for an sm_35 card?
