mersenneforum.org > YAFU three large primes

2019-07-22, 15:09   #1
bsquared

"Ben"
Feb 2007

3,617 Posts

three large primes

Thought about posting this in the Happy Me thread, but I didn't want to pollute it with any discussion that may ensue. I've recently completed implementing the triple large prime variation for yafu's SIQS. I think it is really neat, and along the way it sparked some good discussion about the best way to process larger residues. As a weekend stress test, and because parameterization info was readily available, I decided to re-factor the C135 from Xilman's paper. The job actually finished and produced the factorization. Yay!

Code:
==== sieve params ====
n = 135 digits, 446 bits
factor base: 550016 primes (max prime = 17158543)
single large prime cutoff: 1073741824 (2^30)
double large prime range from 294415597882849 to 18014398509482000
triple large prime range from 5051742696143573745664 to 154742504910672259484483584
allocating 17 large prime slices of factor base
buckets hold 2048 elements
large prime hashtables have 8912896 bytes
using AVX512 enabled 32k sieve core
sieve interval: 32 blocks of size 32768
polynomial A has ~ 18 factors
using multiplier of 1
using SPV correction of 21 bits, starting at offset 32
trial factoring cutoff at 127 bits
==== sieving in progress (256 threads): 550080 relations needed ====
last: 588203, now: 75365 full, 957982 slp, 2002324 dlp (2037katt, 15316kprp), 9687124 tlp (986318katt, 1416158kprp), (44 r/sec)
sieving required 917086811 total polynomials (27829 'A' polynomials)
trial division touched 1734948798 sieve locations out of 1923270439862272
brent-rho: 159 failures, 2037449 attempts, 135291144 outside range, 15316534 prp, 2002324 useful
total reports = 1734948798, total surviving reports = 153678474
total blocks sieved = 2858995264, avg surviving reports per block = 0.05
TLP filtering time = 2543.1387 seconds.
QS elapsed time = 266753.9892 seconds.

==== post processing stage (msieve-1.38) ====
read 12722795 relations
begin singleton removal with 12722795 relations
now at pass 0: 12722795 relations
now at pass 1: 5763606 relations
now at pass 2: 4099820 relations
now at pass 3: 3545870 relations
now at pass 4: 3330790 relations
now at pass 5: 3240557 relations
now at pass 6: 3201113 relations
now at pass 7: 3183243 relations
now at pass 8: 3175295 relations
now at pass 9: 3171772 relations
now at pass 10: 3170185 relations
now at pass 11: 3169442 relations
now at pass 12: 3169148 relations
now at pass 13: 3169006 relations
now at pass 14: 3168943 relations
now at pass 15: 3168908 relations
now at pass 16: 3168895 relations
now at pass 17: 3168890 relations
reduce to 3168890 relations in 18 passes
attempting to read 3168890 relations
recovered 3168803 relations
recovered 3163236 polynomials
commencing rbp/pbr table initialization
found 75365 full relations
pbr table size = 3168803
rbp table size = 2562969
commencing cycle formation pass 1
commencing cycle formation pass 2
commencing cycle formation pass 3
commencing cycle formation pass 4
commencing cycle formation pass 5
commencing cycle formation pass 6
commencing cycle formation pass 7
commencing cycle formation pass 8
commencing cycle formation pass 9
commencing cycle formation pass 10
commencing cycle formation pass 11
commencing cycle formation pass 12
expected 605834 cycles, found 605834 cycles
found 530469 cycles from partial relations
maximum cycle length = 776
82.5% of cycles from partials involve a tlp
found 605834 cycles from 3168803 relations in 12 passes
distribution of cycle lengths:
length 1 : 75365
length 2 : 37018
length 3 : 24248
length 4 : 19387
length 5 : 16863
length 6 : 14880
length 7 : 13851
length 9+: 404222
largest cycle: 776 relations
building matrix with 605834 columns
matrix is 550016 x 605834 (1070.5 MB) with weight 275787690 (455.22/col)
sparse part has weight 275787690 (455.22/col)
filtering completed in 4 passes
matrix is 548649 x 548713 (785.0 MB) with weight 201395563 (367.03/col)
sparse part has weight 201395563 (367.03/col)
saving the first 48 matrix rows for later
matrix is 548601 x 548713 (725.5 MB) with weight 193262029 (352.21/col)
sparse part has weight 184707217 (336.62/col)
matrix includes 64 packed rows
using block size 65536 for processor cache size 16896 kB
commencing Lanczos iteration
memory use: 519.5 MB
lanczos halted after 8678 iterations (dim = 548596)
recovered 12 nontrivial dependencies
Lanczos elapsed time = 3164.9000 seconds.
Sqrt elapsed time = 11.7200 seconds.
SIQS elapsed time = 3235.4824 seconds.

***factors found***

P66 = 337779774700456816455577092228603627733197301999086530154776370553
P69 = 346129173115857975809709331088291920685569205287238835924196565083957

A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours. Of course, most of that improvement is due to advances in hardware. And also, of course, a number of this size is best done by NFS.

I have done many, many smaller jobs as well, learning some things about the parameters that influence TLP jobs. I was hoping that it would be effective even at smaller (90-100 digit) inputs, but unfortunately, so far it is not really any faster than the double large prime variation. The trouble seems to be that in order to get the critical mass of TLPs necessary to see an explosion of cycles, one has to greatly increase the number of sieve reports processed by trial division and the TLP co-factoring code. This tends to slow things down to the point where the DLP implementation can overtake it. I ran a C120 with both DLP and TLP, and the DLP was faster. I will continue to run experiments, and I hope to have the code in a state where I can check it in soon.
2019-07-25, 18:56   #2
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

2^5·11·17 Posts

Where is the majority of time spent here? Is it the trial division or the rho? I am a little surprised that you used rho given your recent posts about your super fast ecm routine.
2019-07-25, 20:39   #3
Till

"Tilman Neumann"
Jan 2016
Germany

3^2×5×11 Posts

Quote:
 Originally Posted by bsquared A job that took 3 seasons back in 2001 can now be done on a single machine in 75 hours. Of course, most of that improvement is due to advances in hardware. And also, of course, a number of this size is best done by NFS.

Sorry, but taking Moore's law into account, your result does not look too impressive.

Original computation time (2001): 231 days ~= 5544 hours.
New computation time (2019): 75 hours

Assuming that computing speed doubles every 2 years, we have
original computation time / 2^9 ~= 10.8 hours
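For concreteness, the scaling argument above can be written out as a quick sanity check (not part of the original post):

```python
# Moore's-law rescaling of the 2001 run (231 days) to 2019,
# assuming computing speed doubles every 2 years.
original_hours = 231 * 24                 # 231 days ~= 5544 hours
doublings = (2019 - 2001) // 2            # 9 doublings in 18 years
scaled_hours = original_hours / 2 ** doublings
print(round(scaled_hours, 1))             # ~10.8 hours
```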

2019-07-26, 05:15   #4
bsquared

"Ben"
Feb 2007

3,617 Posts

Quote:
 Originally Posted by henryzz Where is the majority of time spent here? Is it the trial division or the rho? I am a little surprised that you used rho given your recent posts about your super fast ecm routine.
I use rho only for splitting potential double large prime residues, i.e., those residues between B2 and B2^2 which are not prime, where B2 is the large prime bound. This takes a very small portion of the time. I used the ECM routine for processing potential TLP residues. The process is similar to the one in the paper: a prp check, ECM to try to extract a factor less than B2, another prp check and a size-bounds check on the residue if successful, and a final ECM to try to split composites that survive. I'll get some timing numbers on that. The code was not fully instrumented when I ran it... I just ran what I had as a test over a weekend that I was going to be away.
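That residue-processing pipeline can be sketched roughly as follows. This is illustrative only: Pollard's rho stands in for yafu's vectorized ECM call, and the names `find_factor`, `split_residue`, and `lp_bound` are invented for the sketch.

```python
import math
import random

def is_prp(n, k=20):
    """Miller-Rabin probable-prime test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def find_factor(n):
    """Pull one nontrivial factor out of composite n.
    Pollard rho (Floyd cycle-finding) here; yafu uses ECM instead."""
    if n % 2 == 0:
        return 2
    while True:
        c = random.randrange(1, n)
        x = y = 2
        d = 1
        while d == 1:
            x = (x * x + c) % n
            y = (y * y + c) % n
            y = (y * y + c) % n
            d = math.gcd(abs(x - y), n)
        if d != n:          # retry with a new constant on failure
            return d

def split_residue(r, lp_bound):
    """PRP check -> factor extraction -> PRP/size check on the
    cofactor -> recurse on anything still composite.  Returns the
    list of large primes of r, or None if r cannot be used."""
    if r == 1:
        return []
    if is_prp(r):
        return [r] if r < lp_bound else None   # prime but too big: useless
    f = find_factor(r)
    a = split_residue(f, lp_bound)
    b = split_residue(r // f, lp_bound) if a is not None else None
    if a is None or b is None:
        return None
    primes = a + b
    return primes if len(primes) <= 3 else None  # keep at most a TLP
```

For example, `split_residue(101 * 103 * 107, 1024)` recovers the three large primes, while a residue whose prime parts exceed `lp_bound` comes back as `None`.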

So far I have been disappointed by the results as it doesn't seem to be any faster than DLP. I'm hoping that I just haven't yet figured out the proper parameterizations. Smaller factor base bounds and lower trial division thresholds seem to help, to a point. I would also like to try jasonp's batch factoring/gcd method.

2019-07-26, 05:51   #5
bsquared

"Ben"
Feb 2007

3,617 Posts

Quote:
 Originally Posted by Till Sorry, but taking Moore's law into account your result does not look too impressive. Original computation time (2001): 231 days ~= 5544 hours. New Computation time (2019): 75 hours Assuming that computing speed doubles every 2 years we have Original computation time/2^9 = 10.8 hours
I think that scaling assumes they also used one machine for that length of time, whereas the paper mentions that all five authors did some work. But anyway, what I said may have come across wrong. I wasn't trying to compare the TLP implementations. I was just impressed, and somewhat surprised, that hardware today (OK, maybe not common hardware) allows one to kick off a C135 by QS and have it finished over a weekend.

More concerning to me is the fact that the TLP doesn't seem any better than DLP at the moment. It's very possible that I'm still doing something wrong, some bug or oversight that is limiting the TLP speed, or maybe it is the parameters.

Or maybe it is a matter of trying a large enough number. If I were to try something big enough to take the KNL 230 days to do, then maybe TLP would be the previously observed 1.7x faster than DLP. How big would that have to be? About 6 doublings of 75 hours, which at 3 digits per doubling (rough QS rule of thumb) is about a C153. That seems ridiculous to contemplate doing by QS. I think I will continue trying things at smaller sizes :)
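The back-of-envelope estimate above, written out (the 3-digits-per-runtime-doubling figure is the rough rule of thumb from the post, not a precise model):

```python
import math

knl_hours = 230 * 24      # a job taking the KNL 230 days
c135_hours = 75           # the weekend C135 run
doublings = math.log2(knl_hours / c135_hours)   # ~6 doublings of runtime
digits = 135 + 3 * doublings                    # ~3 extra digits per doubling
print(round(doublings, 1), round(digits))       # about 6 doublings, roughly a C153
```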

Last fiddled with by bsquared on 2019-07-26 at 05:59 Reason: another thought...

2019-07-26, 15:41   #6
Till

"Tilman Neumann"
Jan 2016
Germany

495₁₀ Posts

Quote:
 Originally Posted by bsquared I was just impressed and somewhat surprised that hardware today (ok, maybe not common hardware) allows one to kick off a C135 by QS and have it be finished over a weekend.

Actually I am impressed too (despite my calculation above).

Do you know Thorsten Kleinjung's SIQS variant?
http://www.ams.org/journals/mcom/201...-2015-03058-0/

He seems to have made an algorithmic improvement to SIQS.

There are some numbers for inputs of up to 130 digits (measured in core years).

2019-07-26, 16:27   #7
bsquared

"Ben"
Feb 2007

3,617 Posts

Quote:
 Originally Posted by Till Actually I am impressed too (despite my calculation above). Do you know Thorsten Kleinjung's SIQS variant? http://www.ams.org/journals/mcom/201...-2015-03058-0/ He seems to have made an algorithmic improvement to SIQS. There are some numbers for inputs of up to 130 digits (measured in core years).

2019-07-26, 18:45   #8
bsquared

"Ben"
Feb 2007

111000100001₂ Posts

Quote:
 Originally Posted by henryzz Where is the majority of time spent here? Is it the trial division or the rho? I am a little surprised that you used rho given your recent posts about your super fast ecm routine.
Some timing data (values in percent).
Code:
size   params                                poly-bucket sieve-med   sieve-lp   tdiv-lp   tdiv-resieve   tdiv-med   ecm     sum
C95    default TLP                           40.5        30.7        7.5        2.8       1              0.55       12.29   95.34
C95    default DLP                           44.7        35.5        9.5        3.7       1.4            0.6        0.84    96.24
C100   default TLP                           38.5        26.2        7.5        2.4       0.9            0.4        20.19   96.09
C100   default DLP                           47.3        34          9.4        3.2       1.2            0.52       0.61    96.23
C105   default TLP                           36.8        25.1        6.9        1.9       0.56           0.24       25.81   97.31
C105   default DLP                           50.5        33.9        9          2.33      1              0.4        0.23    97.36
C110   default TLP                           41          24.5        7.6        2         0.56           0.26       21.68   97.6
C110   default DLP                           53.5        31.9        9.4        2         0.63           0.3        0.1     97.83
C110   TLP  MFBT 2.9                 TF 107  26.9        16.4        5.1        1.6       0.4            0.23       46.87   97.5
C135   TLP  MFBT 3.0 LPB 32 B 496656         49.6        18.7        6.1        0.93      0.24           0.1        22.98   98.65
C135   TLP  MFBT 2.9 LPB 30 B 550000         59.2        21.3        7.3        1.1       0.28           0.13       9.58    98.89
C135   TLP  MFBT 2.9 LPB 30 B 550000 TF 120  58          20.8        7.2        1.6       0.4            0.18       10.57   98.75
C135   DLP           LPB 30 B 550000         65.4        23.4        8          1.5       0.4            0.15       0       98.85
Where:
poly-bucket: root calculations for new B-polys for each prime, and placement into buckets
sieve-lp: emptying the appropriate bucket into the sieved block
tdiv-lp: dividing out bucketed primes from sieve reports
tdiv-resieve: dividing out primes between 2^13 and 1.5*2^15 from sieve reports
tdiv-med: dividing out primes between 2^10 and 2^13 from sieve reports
ecm: processing TLP or DLP residues after tdiv-* have finished processing sieve reports
TLP: triple large prime variation
DLP: double large prime variation
LPB: large prime bound (bits)
B: factor base bound (# of primes)
MFBT: exponent of LPB. Reports below (2^LPB)^MFBT are subjected to TLP processing
TF: trial factoring bound
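Putting those knobs together, the routing of a report after trial division works roughly like this. This is a sketch: LPB 30 and MFBT 2.9 appear in the table above, but the MFBD value of 1.8 (which matches the DLP upper bound 2^54 in the C135 log) is an assumption, as is the function name.

```python
def route_report(residue, lpb=30, mfbd=1.8, mfbt=2.9):
    """Classify the residue left after trial division of a sieve report.
    Thresholds follow the MFBT definition above; mfbd=1.8 is assumed."""
    b2 = 2 ** lpb
    if residue == 1:
        return "full"       # factored completely over the factor base
    if residue < b2:
        return "slp"        # one large prime
    if residue < b2 ** mfbd:
        return "dlp"        # split with rho (or squfof)
    if residue < b2 ** mfbt:
        return "tlp"        # prp check + ECM pipeline
    return "discard"        # too large to be worth splitting
```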

The above routines are > 95% of the runtime.

Up to C110, TLP transfers about 10-25% of the runtime from poly-bucket and elsewhere into ECM. Or quite a bit more if TF is tweaked down. Overall this is a speed loss, as the TLPs are not as effective at combining into cycles as DLPs; the default parameters do not allow discovery of enough TLPs. Forcing the parameters into a state where TLP cycle formation is high only results in a more significant speed loss at these sizes... at least in experiments so far.

The runtimes for complete factorizations are actually pretty close between DLP and TLP up to C110. I view this as a good thing in the long run since further optimization of poly-bucket is probably not going to happen, but there is still room to influence ECM (e.g., jasonp's batch factoring).
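The batch factoring/GCD idea referenced here (Bernstein-style: one precomputed product of all primes up to the bound, shared across many residues) can be sketched naively. A real implementation uses a remainder tree to amortize the big reduction; `smooth_part` is a name invented for this sketch.

```python
from math import gcd

def smooth_part(r, prime_product):
    """Return the B-smooth part of residue r, given the precomputed
    product of all primes up to B.  One modular reduction replaces
    trial division by every prime individually."""
    g = gcd(prime_product % r, r)   # product of the distinct primes dividing r
    s = 1
    while g > 1:
        s *= g
        r //= g
        g = gcd(g, r)               # pick up repeated prime factors
    return s
```

For example, with the primes up to 10 (product 2·3·5·7 = 210), `smooth_part(8 * 5 * 101, 210)` returns 40, leaving the non-smooth cofactor 101.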

EDIT:
I am re-running the LPB 32 case for the test C135 right now. Relation discovery rate is quite a bit higher, as expected. We'll see if this parameterization is better for TLP cycle growth and a net reduction in job time.

Last fiddled with by bsquared on 2019-07-26 at 19:51 Reason: lpb mfbt detail

2019-07-26, 18:50   #9
jasonp
Tribal Bullet

Oct 2004

5·709 Posts

How do you handle the filtering for datasets that include TLP relations?

Last fiddled with by jasonp on 2019-07-26 at 18:50
2019-07-26, 19:09   #10
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

3^3·419 Posts

I'm sure Ben is aware of this but others may not be: in our paper we stated that GNFS was a better option than TMPQS for numbers of this size. Our experiment was purely to investigate the cycle explosion behaviour and to characterize the phase transition when it happened.

Apropos another comment: SSW ran a modified version of his SIQS implementation for his contribution; the rest of us used Lenstra & Manasse's "factoring-by-email" MPQS as modified by me for TLP. Also, as stated in the paper, we chose to use an (at the time) elderly implementation so that we could make a direct comparison with the RSA-129 effort. I even fired up a DECstation which had been living in my loft so that identically the same hardware ran the sieving code for each project. It's not surprising that Ben sees somewhat different results, because the underlying multi-precision arithmetic libraries are surely very different.

I agree with another comment: a weekend is a remarkably short time for a single machine to factor a C135 by QS.

Last fiddled with by xilman on 2019-07-26 at 19:17
2019-07-26, 19:09   #11
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10110000110001₂ Posts

Quote:
 Originally Posted by jasonp How do you handle the filtering for datasets that include TLP relations?
With difficulty, if my experience is anything to go by.

