mersenneforum.org > YAFU three large primes

2020-05-22, 04:37   #34
LaurV
Romulan Interpreter

Jun 2011
Thailand

5·1,889 Posts

Wow! Ben, that is freaking wonderful! And you kept it for yourself for such a long time! Now I will have to upgrade yafu... (grrr... still using old versions here and there, you know, if it works, don't fix it...) (unfortunately, since I discovered pari, I did most of my "numerology" in it, so the loops and screws came somehow too late, but be sure I will give it a try, I still have some ideas in a far side corner of the brain, which I never had the time/knowledge/balls to follow...)

Last fiddled with by LaurV on 2020-05-22 at 04:39
2020-05-22, 16:24   #35
bsquared

"Ben"
Feb 2007

2×3²×191 Posts

Quote:
 Originally Posted by LaurV Wow! Ben, that is freaking wonderful! And you kept it for yourself for such a long time! Now I will have to upgrade yafu... (grrr... still using old versions here and there, you know, if it works, don't fix it...) (unfortunately, since I discovered pari, I did most of my "numerology" in it, so the loops and screws came somehow too late, but be sure I will give it a try, I still have some ideas in a far side corner of the brain, which I never had the time/knowledge/balls to follow...)
You'll probably have to get the latest wip-branch SVN, so put it somewhere that it won't bother existing installs. I have had some success getting it to work, but chances are it won't work well. But if you give it a shot, let me know how it breaks :)

Last fiddled with by bsquared on 2020-05-22 at 16:24

2020-05-27, 18:48   #36
bsquared

"Ben"
Feb 2007

D6E₁₆ Posts

I've been spending a little time with the TLP variation again... integrating jasonp's batch factoring code and investigating parameters. The batch factoring code provides a huge speedup for TLP, easily twice as fast as before at C110 sizes. The crossover with DLP quadratic sieve now appears to be around C110, although I'm still not convinced I have found optimal parameters for TLP. There are a lot of influential parameters. So as of now, TLP is still slower than regular DLP in sizes of interest to the quadratic sieve (C95-C100).

Code:
C110
48178889479314834847826896738914354061668125063983964035428538278448985505047157633738779051249185304620494013

80 threads of cascade-lake based xeon:
DLP: 1642 seconds for sieving
TLP: 1578 seconds for sieving
I've also revisited some of the core sieving code for modern instruction sets (AVX512). Inputs above C75 or so are about 10-20% faster now (tested mostly on cascade-lake xeon system).
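As I understand it, jasonp's batch factoring code is in the spirit of Bernstein's batch smoothness test: one remainder tree of the factor-base prime product against many sieve residues replaces per-residue trial division over the large-prime region. A minimal Python sketch of the idea (toy sizes, none of the real code's blocking or thresholds):

```python
from math import gcd, prod

def product_tree(xs):
    """Levels of pairwise products: leaves first, root (product of all) last."""
    tree = [list(xs)]
    while len(tree[-1]) > 1:
        level = tree[-1]
        tree.append([prod(level[i:i + 2]) for i in range(0, len(level), 2)])
    return tree

def remainder_tree(n, xs):
    """n mod x for every x in xs, computed top-down through the product tree."""
    tree = product_tree(xs)
    rems = [n % tree[-1][0]]
    for level in reversed(tree[:-1]):
        rems = [rems[i // 2] % v for i, v in enumerate(level)]
    return rems

def smooth_parts(candidates, primes):
    """Bernstein's trick: the factor-base-smooth part of candidate x is
    gcd(x, (z mod x)^(2^e) mod x), where z = prod(primes) and 2^e >= lg x."""
    z = prod(primes)
    parts = []
    for x, y in zip(candidates, remainder_tree(z, candidates)):
        e = 1
        while e < x.bit_length():
            y = y * y % x
            e *= 2
        parts.append(gcd(x, y))
    return parts
```

A candidate is fully smooth exactly when its returned smooth part equals the candidate itself; e.g. `smooth_parts([12, 14, 45], [2, 3, 5])` gives `[12, 2, 45]` (14 = 2·7 keeps the non-smooth 7 out of its smooth part). The win comes from amortizing one big product across the whole batch instead of dividing each residue by each prime.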
2020-07-12, 16:06   #37
Till

"Tilman Neumann"
Jan 2016
Germany

2²·109 Posts

Quote:
 Originally Posted by bsquared I found this thesis comparing Kleinjung's algorithm with traditional SIQS: https://prism.ucalgary.ca/bitstream/...=2&isAllowed=y The re-explanation of the algorithm is very helpful. Lots of comparisons are presented on up to 480-bit discriminants. One thing it appears to be lacking, however, is a comparison of a well-tuned traditional siqs with a well-tuned Kleinjung approach. Of course, traditional siqs will be very slow in comparison with Kleinjung when the parameters are the same (e.g., block size 2048). And vice versa (Kleinjung does not do well with large sieve area). I am most curious to see how they compare when each are tuned to their respective strengths. Still, it is a very nice thesis report. I can see better now that Kleinjung's algorithm is compatible with the rest of siqs factorization (e.g., polynomial generation, filtering and matrix steps), which might ease integration of the new approach with existing implementations.
Instead of comparing a well-tuned Kleinjung SIQS to a well-tuned polynomial-major sieve, I did an intermediate step:
Compare a polynomial-major SIQS to a prime-major SIQS. (both single-threaded)

The conversion of my old (polynomial-major) SIQS to a prime-major SIQS took two steps:
1. Compute the solution arrays for all (a, b)-polynomials of one a-parameter outside the sieve.
2. In the sieve, have outer loop over primes, inner loop over polynomials.
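Stripped of all the real bookkeeping (log approximations, bucket sieving, report scanning) and using hypothetical names, the two loop orders differ only in nesting; the prime-major version touches every polynomial's sieve array on every prime, which is where the extra memory traffic comes from:

```python
def sieve_poly_major(num_polys, primes, roots, logs, size):
    """Polynomial-major: outer loop over polynomials, so one sieve array
    stays hot in cache while all primes are walked over it."""
    arrays = [bytearray(size) for _ in range(num_polys)]
    for b in range(num_polys):
        for j, p in enumerate(primes):
            for start in roots[b][j]:          # two roots per (polynomial, prime)
                for x in range(start, size, p):
                    arrays[b][x] += logs[j]
    return arrays

def sieve_prime_major(num_polys, primes, roots, logs, size):
    """Prime-major: outer loop over primes, inner loop over polynomials,
    so every polynomial's array is touched once per prime."""
    arrays = [bytearray(size) for _ in range(num_polys)]
    for j, p in enumerate(primes):
        for b in range(num_polys):
            for start in roots[b][j]:
                for x in range(start, size, p):
                    arrays[b][x] += logs[j]
    return arrays
```

Both produce identical sieve arrays; only the access pattern differs, which is exactly why precomputing the solution arrays (step 1 above) is a prerequisite for the prime-major order.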

At first, performance was very bad. I had to switch two parameters to improve that:
1. Increase the sieve array size.
2. Reduce qCount = the number of q's whose product gives the a-parameter.

These measures improved performance a lot, but the result is still bad.
Here is a comparison to the polynomial-major sieve:

Code:

200 bit: Poly-mj      4s, 789ms, Prime-mj      11s, 224ms, ratio = 2.34370
210 bit: Poly-mj      8s, 22ms,  Prime-mj      21s, 269ms, ratio = 2.65133
220 bit: Poly-mj     19s, 609ms, Prime-mj      55s, 826ms, ratio = 2.84695
230 bit: Poly-mj     40s, 232ms, Prime-mj  2m, 12s, 960ms, ratio = 3.30483
240 bit: Poly-mj     55s, 305ms, Prime-mj  3m, 12s, 622ms, ratio = 3.48290, Prime-mj sieveArraySize = 119552
250 bit: Poly-mj 2m, 15s, 526ms, Prime-mj  8m, 11s, 63ms,  ratio = 3.62338, Prime-mj sieveArraySize = 159744
260 bit: Poly-mj 4m, 49s, 561ms, Prime-mj 43m, 30s, 551ms, ratio = 9.01554, Prime-mj sieveArraySize = 206848
All timings of the prime-major SIQS were taken with qCount=6; using qCount=7 at 260 bit took 47m, 17s, 109ms.

For smaller N, the performance gap stems mostly from additional array accesses and another loop for the largest primes that is not necessary in the polynomial-major version.
The performance breakdown at 260 bit might indicate too many L3 cache misses.
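A back-of-envelope working-set estimate supports the cache theory (assuming the usual 2^(qCount−1) b-polynomials per a-parameter; the cache comparison is illustrative, not measured):

```python
# Working set of the prime-major inner loop at 260 bit, using the
# sieveArraySize and qCount reported in the table above.
sieve_bytes = 206848          # Prime-mj sieveArraySize at 260 bit
polys_per_a = 2 ** (6 - 1)    # SIQS: 2^(qCount-1) b-polynomials per a-parameter
working_set = sieve_bytes * polys_per_a   # all arrays live at once per prime
print(working_set / 2 ** 20)  # several MiB, vs. one ~200 KiB array
                              # live at a time in the polynomial-major loop
```

Several MiB of sieve arrays cycled through on every prime is plausibly past a per-core L3 share, while the polynomial-major order keeps a single array resident.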

Kleinjung's SIQS might remedy many of the problems of a silly prime-major sieve. But maybe we can see here what the challenge is.

-----------------------------------------------

@Jason: It seems that Colin Percival's stuff has never been published? The only thing I found were some comments in another forum...

Last fiddled with by Till on 2020-07-12 at 16:24

2020-07-13, 15:52   #38
jasonp
Tribal Bullet

Oct 2004

2·29·61 Posts

It was never published. When I asked him about it in ~2007 he said he was too busy with his startup company to work on it.
2021-03-18, 18:17   #39
bsquared

"Ben"
Feb 2007

2·3²·191 Posts

Quote:
 Originally Posted by bsquared I've been spending a little time with the TLP variation again... integrating jasonp's batch factoring code and investigating parameters. The batch factoring code provides a huge speedup for TLP, easily twice as fast as before at C110 sizes. The crossover with DLP quadratic sieve now appears to be around C110, although I'm still not convinced I have found optimal parameters for TLP. There are a lot of influential parameters. So as of now, TLP is still slower than regular DLP in sizes of interest to the quadratic sieve (C95-C100). Code: C110 48178889479314834847826896738914354061668125063983964035428538278448985505047157633738779051249185304620494013 80 threads of cascade-lake based xeon: DLP: 1642 seconds for sieving TLP: 1578 seconds for sieving I've also revisited some of the core sieving code for modern instruction sets (AVX512). Inputs above C75 or so are about 10-20% faster now (tested mostly on cascade-lake xeon system).
Another update (every now and then I can't help coming back to this...):

First instance of a C100 faster by TLP!

Standard double-large-prime (small-blocks, large-core-count server):
Code:
./yafu "siqs(1802716097522165018257858828415111497060066282677325501816640492782221110851604465066510547671104729)"
-v -threads 72 -siqsNB 6 -inmem 200

116794 rels found: 28831 full + 87963 from 1988560 partial, (24888.91 rels/sec)

Elapsed time: 81.0558 sec
QS elapsed time = 81.2256 seconds.

Tweaked triple-large-prime (small-blocks, large-core-count server):
Code:
./yafu "siqs(1802716097522165018257858828415111497060066282677325501816640492782221110851604465066510547671104729)"
-v -threads 72 -siqsNB 6 -forceTLP -siqsBDiv 3 -siqsMFBT 2.6 -siqsMFBD 1.9 -noopt -siqsTFSm 18 -siqsLPB 27 -siqsTF 95 -siqsB 60000 -inmem 200 -siqsBT 250000

last: 60736, now: 10347 full, 147634 slp, 683638 dlp, 159612 tlp, (10763 r/sec)

TLP filtering time = 2.8450 seconds.
QS elapsed time = 79.5713 seconds.
The key parameter seems to be the factor base size: it is less than half that of the double-large-prime version. The rest of the tweaks maximize the relation-discovery rate given that choice. Also helpful was a revamped relation-estimation method, which greatly reduced the number of filtering attempts.
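The relation-estimation step isn't shown, but for the double-large-prime case the standard count of extra relations obtainable from partials is the cycle count of the graph whose vertices are large primes, maintainable incrementally with union-find (a sketch; by convention full relations get large prime 1, and the TLP case needs a hypergraph generalization where the count is only an estimate):

```python
class DSU:
    """Union-find with path halving, keyed by large-prime value."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def count_cycles(partials):
    """partials: list of (p, q) large-prime pairs, one edge per relation.
    Independent cycles = edges - vertices + components; each cycle
    combines into one factor-base-only relation for the matrix step."""
    dsu = DSU()
    nodes = set()
    for p, q in partials:
        nodes.update((p, q))
        dsu.union(p, q)
    components = len({dsu.find(x) for x in nodes})
    return len(partials) - len(nodes) + components
```

For example, two partials sharing the same large prime pair form one cycle, and a triangle of three single-large-prime partials also yields exactly one combined relation. Tracking this count as relations arrive is what tells the sieve when a filtering attempt is worth running.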

It might be difficult to get these parameter tweaks automated, but I thought this was a neat hero experiment at least.

p.s. C100 in < 80 seconds! That's a long way from where yafu started. Granted, on a fairly massive machine.

2021-03-21, 22:04   #40
Till

"Tilman Neumann"
Jan 2016
Germany

2²·109 Posts

Amazing performance (for a C-100, no matter if 2 or 3 large primes). Can you tell how much better your current version is on that number compared to YaFU 1.34.5? Just for fun maybe compare RSA-100, too.

Last fiddled with by Till on 2021-03-21 at 22:09 Reason: "no matter if 2 or 3 large primes" comment
2021-03-22, 00:14   #41
bsquared

"Ben"
Feb 2007

2·3²·191 Posts

Quote:
 Originally Posted by Till Amazing performance (for a C-100, no matter if 2 or 3 large primes). Can you tell how much better your current version is on that number compared to YaFU 1.34.5? Just for fun maybe compare RSA-100, too.
Sure. On the same machine I get:
Code:
random C100
vNew, DLP        81 sec
vNew, TLP        79 sec
v1.34.5, DLP     129 sec

RSA-100
vNew, DLP        103 sec
vNew, TLP        108 sec
v1.34.5, DLP     179 sec
Looks like the new version takes about 60% of the time compared to the current release. Pretty much all of that is using new AVX-512 stuff. Didn't repeat the TLP success on the more difficult RSA100, though.

Last fiddled with by bsquared on 2021-03-22 at 00:15

2021-03-22, 10:39   #42
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

13356₈ Posts

Quote:
 Originally Posted by bsquared Sure. On the same machine I get: Code: random C100 vNew, DLP 81 sec vNew, TLP 79 sec v1.34.5, DLP 129 sec RSA-100 vNew, DLP 103 sec vNew, TLP 108 sec v1.34.5, DLP 179 sec Looks like the new version takes about 60% of the time compared to the current release. Pretty much all of that is using new AVX-512 stuff. Didn't repeat the TLP success on the more difficult RSA100, though.
Is there a reason why RSA-100 is so much slower?
My dad and I are currently working towards adding TLP to our SIQS implementation (record is C120 currently, I think). The current target is ECM with Edwards curves. Is there a particular reason why you (and GMP-ECM) avoided Edwards curves?

Last fiddled with by henryzz on 2021-03-22 at 10:42

2021-03-22, 13:19   #43
bsquared

"Ben"
Feb 2007

2·3²·191 Posts

Quote:
 Originally Posted by henryzz Is there a reason why RSA-100 is so much slower?
I suspect because the RSA numbers were chosen to have poor QS factor-base properties, but I haven't verified this.
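The factor-base-quality hypothesis is easy to check directly: only primes p with (N|p) = 1 enter the QS factor base, and a "typical" N qualifies for roughly half of the small primes. A quick count via Euler's criterion (toy trial-division primality test, fine for small bounds):

```python
def factor_base_density(n, bound):
    """Count odd primes p <= bound (not dividing n) for which n is a
    quadratic residue mod p, i.e. n^((p-1)/2) == 1 (mod p)."""
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    primes = [p for p in range(3, bound + 1) if is_prime(p) and n % p]
    usable = sum(1 for p in primes if pow(n, (p - 1) // 2, p) == 1)
    return usable, len(primes)
```

An n that falls well below the ~50% ratio forces a sparser factor base for the same bound, which means more sieving per relation; e.g. `factor_base_density(2, 30)` gives `(3, 9)`, since 2 is a residue only mod primes p ≡ ±1 (mod 8). Running this on RSA-100 versus a random C100 would confirm or refute the suspicion.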

Quote:
 Originally Posted by henryzz My dad and I are currently working towards adding TLP to our SIQS implementation(record is C120 currently I think). The current target is ECM with Edwards curves. Is there a particular reason why you(and GMP-ECM) avoided Edwards curves?
Congrats on the C120! I have looked at the various Edwards curves papers, but haven't found the energy yet to implement any of it. So for me at least it was a path of least resistance: the C&P and gmp-ecm references were too good to not use for a first implementation. And after that was working, the amount of additional work necessary to get an incremental improvement (hopefully) from Edwards curves has just been a non-starter.

2021-03-22, 18:57   #44
Till

"Tilman Neumann"
Jan 2016
Germany

22·109 Posts

Quote:
 Originally Posted by bsquared Sure. On the same machine I get: Code: random C100 vNew, DLP 81 sec vNew, TLP 79 sec v1.34.5, DLP 129 sec RSA-100 vNew, DLP 103 sec vNew, TLP 108 sec v1.34.5, DLP 179 sec Looks like the new version takes about 60% of the time compared to the current release. Pretty much all of that is using new AVX-512 stuff. Didn't repeat the TLP success on the more difficult RSA100, though.

Thanks for the data and congrats, great improvements really!

