2022-03-08, 13:32   #133
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)
I understood the -gpucurves option, but what confused me were the:
Code:
CGBN<512, 4> running kernel<56 block x 256 threads> input number is 246 bits
lines. I see now that they are based on the input number size and automatically taken care of by the program. I had thought maybe there were more options to provide.
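
(For anyone else wondering: a quick way to see which kernel a given input lands on is to start a short verbose run and read that line back. This is only a sketch; N stands for whatever composite you are about to run, and the small B1=11e4, B2=0 bounds are placeholders to keep the probe quick.)
Code:
# N is your composite; tiny bounds keep this to a few seconds
echo "$N" | ./ecm -gpu -cgbn -v 11e4 0
# the output names the kernel that was selected, e.g.
#   CGBN<512, 4> running kernel<56 block x 256 threads> input number is 246 bits
# and, when a smaller custom kernel would fit the input, -v can also print a hint
# like the one seen later in this thread:
#   Compiling custom kernel for 640 bits should be ~144% faster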

Thanks for helping me understand this and for a great speedup.
2022-03-08, 16:39   #134
chris2be8 (Sep 2009)
ecm-gpu downloaded from https://gitlab.inria.fr/zimmerma/ecm.git works for B1=11e7:
Code:
chris@4core:~/ecm-cgbn.2/ecm> date;time ./ecm -gpu -cgbn -save test2.save 110000000 1 <b58+148.ini;date
Tue  8 Mar 08:18:31 GMT 2022
GMP-ECM 7.0.5-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Input number is 1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417 (172 digits)
Using B1=110000000, B2=1, sigma=3:35896186-3:35898617 (2432 curves)
GPU: Large B1, S = 158705536 bits = 151 MB
GPU: Using device code targeted for architecture compile_86
GPU: Ptx version is 86
GPU: maxThreadsPerBlock = 896
GPU: numRegsPerThread = 67 sharedMemPerBlock = 0 bytes
Computing 2432 Step 1 took 4508ms of CPU time / 4674180ms of GPU time

real	78m0.885s
user	0m10.992s
sys	0m2.513s
Tue  8 Mar 09:36:32 GMT 2022
This is after updating the Makefiles to use --generate-code arch=compute_86,code=sm_86.
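
(One blunt way to make that edit across the generated Makefiles is a sed one-liner; just a sketch, it assumes the files already contain an arch=compute_XX,code=sm_XX pair to rewrite, keeps .bak backups, and the 86 target only applies to Ampere-class cards.)
Code:
# rewrite whatever compute/sm pair the build generated to target compute 8.6
sed -i.bak 's/compute_[0-9]*,code=sm_[0-9]*/compute_86,code=sm_86/g' Makefile*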

The older version without -cgbn took about 9 hours to do the same job. Many thanks for the speed up.
2022-03-08, 19:43   #135
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

Sorry if this has an "elementary" answer, but is there an optimum value that B1 should be a multiple of?

I'm currently basing my B1 values on what 896 curves need for the different t-levels. Should I adjust B1 to a close multiple of a base value and then adjust -gpucurves accordingly, or am I complicating things?
2022-03-08, 20:46   #136
SethTro ("Seth", Apr 2019)
Quote:
Originally Posted by EdH
Sorry if this has an "elementary" answer, but is there an optimum value that B1 should be a multiple of?

I'm currently basing my B1 values on what 896 curves need for the different t-levels. Should I adjust B1 to a close multiple of a base value and then adjust -gpucurves accordingly, or am I complicating things?

TL;DR: If you are still running B2, you should probably set B1 for each t-level based on this chart, then round the number of curves to the nearest multiple of 896. This is probably within 20% of optimal for >= t45. You could optimize slightly by increasing B1 if you round down or decreasing B1 if you round up (so that ecm -v prints an "expected number of curves to find a factor" equal to the number of curves you are actually running).
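
Here is a small sketch of that rounding, purely illustrative: the 2350-curve figure is made up, and 896 is assumed to be your card's -gpucurves batch size.
Code:
# curves_needed: whatever your chart / "ecm -v" reports for the target t-level
curves_needed=2350
gpucurves=896
batches=$(( (curves_needed + gpucurves - 1) / gpucurves ))   # round up to whole batches
echo "run $(( batches * gpucurves )) curves in $batches batch(es) of $gpucurves"
# rounding down instead would be $(( curves_needed / gpucurves )) batches,
# paired with a slightly larger B1 as described above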

In practice, for small factors everything is really fast, so for a single number it hardly matters; but if you were working on factordb or a huge number of inputs (>5000) you would want to do something smarter. In theory the code could run one curve on each of 896 different numbers, or something along those lines.

It can also make sense to tune the B1/B2 ratio based on how much RAM you have and how fast your CPU is versus your GPU. For example, see the discussion here. I wrote some hacky shell code to do this at sethtro/misc-scripts/ecm_gpu_optimizer.
2022-03-08, 22:36   #137
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

Thanks. This gives me something to study. Unfortunately, the machine I was able to get to run the GPU has only 2 cores and 8 GB of RAM. But I have a script now that sends the residues to a second machine and moves on to the next B1 level. Of course, now the GPU is the bottleneck, since I'm only running stage 1 operations on its machine. I'm still looking at what might be best for my setup.
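
(For anyone copying this setup, a minimal sketch of the stage-1-on-GPU / stage-2-on-CPU handoff. The hostname, filename, and the 43e6 bound are made up; -save and -resume are standard GMP-ECM options, and B2=1 skips stage 2 as in the runs shown earlier.)
Code:
# stage 1 only on the GPU box, residues written to the save file
echo "$N" | ./ecm -gpu -cgbn -save t50.save 43e6 1
# ship the residues to the CPU box and run stage 2 there with the default B2
scp t50.save cpubox:~/ecm-work/
ssh cpubox 'cd ~/ecm-work && ./ecm -resume t50.save 43e6'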
2022-03-17, 08:55   #138
SethTro ("Seth", Apr 2019)

Quote:
Originally Posted by chris2be8
The older version without -cgbn took about 9 hours to do the same job. Many thanks for the speed up.
Fun fact: if you follow the advice about custom kernel sizes, you can potentially make this an additional 40% faster.

Code:
$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:1276799189-3:1276800020 (832 curves)
Compiling custom kernel for 640 bits should be ~144% faster
CGBN<1024, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1158 bits/call, 96372/1586512 (6.1%), ETA 106 + 7 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 212172/1586512 (13.4%), ETA 97 + 15 = 113 seconds (~135 ms/curves)
Computing 1158 bits/call, 327972/1586512 (20.7%), ETA 89 + 23 = 112 seconds (~135 ms/curves)

After changing
-  typedef cgbn_params_t<8, 1024>  cgbn_params_1024;
+  typedef cgbn_params_t<8, 640>  cgbn_params_1024;

$ echo "1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417" | ./ecm -cgbn -v 11e5 0
GMP-ECM 7.0.5-dev [configured with GMP 6.2.99, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Using B1=1100000, B2=0, sigma=3:230651649-3:230652480 (832 curves)
CGBN<640, 8> running kernel<26 block x 256 threads> input number is 569 bits
Computing 1863 bits/call, 146292/1586512 (9.2%), ETA 67 + 7 = 74 seconds (~89 ms/curves)
Computing 1863 bits/call, 332592/1586512 (21.0%), ETA 60 + 16 = 76 seconds (~92 ms/curves)
Computing 1863 bits/call, 518892/1586512 (32.7%), ETA 52 + 25 = 77 seconds (~93 ms/curves)

Last fiddled with by SethTro on 2022-03-17 at 09:07
2022-03-17, 09:16   #139
Gimarel (Apr 2010)

If you're trying custom kernel sizes, also try 768 bits. For me (GTX 2060 Super) that's faster than 640 bits.
2022-03-17, 09:37   #140
henryzz ("David", Sep 2007, Liverpool (GMT/BST))

Quote:
Originally Posted by Gimarel
If you're trying custom kernel sizes, also try 768 bits. For me (GTX 2060 Super) that's faster than 640 bits.
If that's the case, then a kernel benchmark that identifies the fastest kernels for each card would be useful. I currently have a version with all the possible kernels added, up to 300 digits or so.
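
(A rough sketch of what such a benchmark could look like, assuming one binary per candidate kernel size has been built by editing the cgbn_params_t typedef as shown above; the ./ecm-640 style names are made up, and the 172-digit number from earlier in the thread serves as a fixed 569-bit workload.)
Code:
N=1044362381090522430349272504349028000743722878937901553864893424154624748141120681170432021570621655565526684395777956912757565835989960001844742211087555729316372309210417
for bits in 640 768 1024; do
    echo "== $bits-bit kernel =="
    echo "$N" | ./ecm-$bits -gpu -cgbn 11e5 0   # compare the "Step 1 took ... of GPU time" lines
done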
2022-03-17, 16:53   #141
chris2be8 (Sep 2009)

Quote:
Originally Posted by SethTro
Fun fact: if you follow the advice about custom kernel sizes, you can potentially make this an additional 40% faster.
That won't be much help to me; it already takes the CPU much longer to do stage 2 than the GPU takes to do stage 1.

I've looked at your chart of recommended B1 and B2 values, but it confuses my script's calculations of how much ECM to do for a number of a given size. I need to do some serious thinking to get it all to work together.
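
(If the CPU side is the bottleneck, one workaround is to split the stage-1 save file and resume the chunks in parallel; a sketch only, assuming one residue per line in the save file and four free cores, with the 2432-curve / B1=110000000 figures taken from the run earlier in the thread.)
Code:
# 2432 saved curves split across 4 cores = 608 residues per chunk
split -l 608 test2.save chunk_
for f in chunk_*; do
    ./ecm -resume "$f" 110000000 &   # stage 2 with the default B2 for B1=11e7
done
wait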
2022-04-03, 03:52   #142
wombatman (May 2013)

Hi, I've built this under WSL2 and everything works quite nicely, but when I run the test file (gpu_throughput_test.sh), CGBN fails when the input number is large enough:

"No available CGBN Kernel large enough to process N(1864 bits)"

I saw some posts earlier in the thread that might apply, but I thought it would be best to ask before I start messing with anything.
2022-04-03, 06:22   #143
SethTro ("Seth", Apr 2019)

Quote:
Originally Posted by wombatman
Hi, I've built this under WSL2 and everything works quite nicely, but when I run the test file (gpu_throughput_test.sh), CGBN fails when the input number is large enough:

"No available CGBN Kernel large enough to process N(1864 bits)"

I saw some posts earlier in the thread that might apply, but I thought it would be best to ask before I start messing with anything.
This is expected. I'm balancing binary size and compile time against the range of numbers that can be tested.

If you want to run ECM on numbers > 1020 bits, look around line 670 in cgbn_stage1.cu.
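
(The kernel sizes appear to be defined via cgbn_params_t typedefs, per the diff quoted earlier in the thread, so grepping for that name should locate the spot even if the line number has drifted; a trivial sketch.)
Code:
grep -n 'cgbn_params_t<' cgbn_stage1.cu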

Last fiddled with by SethTro on 2022-04-03 at 06:22