mersenneforum.org Some code for factoring numbers up to 2^63

2017-11-01, 18:32   #67
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

2²·227 Posts

Quote:
 Originally Posted by jasonp If you want another (not well optimized) SIQS, I wrote this in 2005 and modernized it a short while ago.
Thank you very much for sharing this, Jason. Is this also public domain? The files don't have any author or licensing info so I'd like to verify. It looks like this is all new on 2014-09-28 and not related to any of the RDS code, but instead your 2007 tinyqs (you mention it's originally from 2005).

My initial performance indications are great for this. The x86-64 "pbrent" codes are slightly faster up to 63-64 bits, but co-siqs doesn't use any asm. From 64-70 bits it's debatable, but co-siqs is certainly close if not ahead, and up to 126 bits it is faster than anything else I have tried.

I wrote a "SQUFOF126" which is a modified version of my version of Ben's racing SQUFOF, using 64-bit native int core and GMP for the setup. It's at least in the ballpark, but not faster than co-siqs. It seems SQUFOF has been pushed out of every range.

For non even-split semiprimes, I find a small amount of low-overhead p-1 in front can as much as halve the average time taken. My experience has been that ECM isn't as useful for this sort of optimistic up-front test on small sizes. I need to add a standalone small p-1 to my todo (currently it uses my prime iterator which is a little higher cost than a simple array as well as pulling in a lot more code, albeit more flexible).
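A minimal sketch of the kind of low-overhead stage-1 p-1 pre-test described above (an editor's illustration, not danaj's code; the bound B1 and the crude trial-division primality check are purely illustrative, whereas a tuned version would use a precomputed prime array and a stage 2):

```python
from math import gcd

def pminus1(n, B1=1000):
    """Minimal Pollard p-1 stage 1: return a nontrivial factor of n, or None.

    Finds a prime p | n whenever p-1 is B1-smooth, by raising a base to
    every prime power <= B1 and taking a gcd.  Illustrative only.
    """
    a = 2
    for q in range(2, B1 + 1):
        # crude primality check; fine at this toy scale
        if all(q % d for d in range(2, int(q**0.5) + 1)):
            qe = q
            while qe * q <= B1:        # largest power of q not exceeding B1
                qe *= q
            a = pow(a, qe, n)
    g = gcd(a - 1, n)
    return g if 1 < g < n else None
```

For a non-even-split semiprime such as 101·227 = 22927 (101-1 = 2²·5² is very smooth), a tiny bound already succeeds, which is why the up-front cost can pay for itself on average.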

2017-11-01, 20:20   #68
Till

"Tilman Neumann"
Jan 2016
Germany

2×3×71 Posts

Quote:
 Originally Posted by danaj I wrote a "SQUFOF126" which is a modified version of my version of Ben's racing SQUFOF, using 64-bit native int core and GMP for the setup. It's at least in the ballpark, but not faster than co-siqs. It seems SQUFOF has been pushed out of every range.
Hi Dana,
if you use multipliers of the form 1680 × "square-free sequence" = {1, 2, 3, 5, 6, 7, 10, 11, 13, ...} and switch multipliers after ~1.5*N^(1/5) iterations, you'd get the algorithm I proposed in my PSIQS thread. It will only work up to some 90+ bits because of the larger multipliers, but it looked quite fine in Java. I am curious how a highly optimized C/asm implementation compares.

Btw. 1680 is just one very smooth constant I found to work well. 5040, 10080 or the like could be slightly better.
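The multiplier schedule Till describes can be sketched in a few lines (an editor's illustration; function names are made up, and the 1.5*N^(1/5) switch bound is taken directly from the post):

```python
def squarefree(k):
    """True if no prime square divides k."""
    d = 2
    while d * d <= k:
        if k % (d * d) == 0:
            return False
        d += 1
    return True

def multipliers(count, base=1680):
    """First `count` SQUFOF multipliers: base times the squarefree numbers."""
    out, k = [], 1
    while len(out) < count:
        if squarefree(k):
            out.append(base * k)
        k += 1
    return out

def switch_bound(n):
    """Iterations to spend per multiplier before switching, ~1.5 * N^(1/5)."""
    return int(1.5 * n ** 0.2)
```

With base 1680 the sequence starts 1680, 3360, 5040, 8400, 10080, ..., matching 1680 times {1, 2, 3, 5, 6, ...}; swapping in 5040 or 10080 as the base is the variation mentioned above.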

2017-11-02, 20:40   #69
jasonp
Tribal Bullet

Oct 2004

2×3×19×31 Posts

Quote:
 Originally Posted by danaj Thank you very much for sharing this, Jason. Is this also public domain? The files don't have any author or licensing info so I'd like to verify. It looks like this is all new on 2014-09-28 and not related to any of the RDS code, but instead your 2007 tinyqs (you mention it's originally from 2005).
My pleasure; consider it public domain and include it anywhere.

The code has nothing to do with RDS' original sieve, it was part of an early project to build a line sieve suitable for kilobit-scale GNFS. I never did anything else with it and dusted it off when there was a chance RDS would contribute to the group effort for sieving 2,991- if his code could be overhauled to scale up to problems that large. Unfortunately just adding 3LP cofactorization wasn't enough to make his lattice sieve scale up to handle large numbers, but at least the code is out there.

ISTR the SIQS code that is in the Franke-Kleinjung sievers was twice as fast as this when I last benchmarked it a lifetime ago, but it does have a 96-bit limit.

2017-11-02, 22:55   #70
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

2×2,909 Posts

Quote:
 Originally Posted by jasonp ISTR the SIQS code that is in the Franke-Kleinjung sievers was twice as fast as this when I last benchmarked it a lifetime ago, but it does have a 96-bit limit.
The code in the v5 sievers has a much higher limit.

2017-11-08, 18:08   #71
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

5×17² Posts

The key idea used is Bernstein's method; see https://cr.yp.to/papers/sf.pdf.
We factorize in polynomial time(!) per integer, but this is only true if you have MANY integers to factorize,
hence the method also needs a lot of memory.
My tests factorized 2e6 and then 9e6 fifty-five-bit integers; in the latter
case we are close to an optimal setup, but even in the first case, when we have only 2e6 integers,
the time used is still not that bad.

I have some improvements on this algorithm, which is why the code is somewhat longer.
It also contains an improvement on this idea: http://www.mersenneforum.org/showpos...40&postcount=8
(we use only 3 bases(!))
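For readers who want the shape of Bernstein's batch idea before reading the attached code, here is an editor's sketch (not the posted code; helper names are made up) of the core step: a product/remainder tree gives P mod n for every n at once, and repeated squaring then yields each number's smooth part over the prime set:

```python
from math import gcd, prod

def remainder_tree(x, mods):
    """Compute x mod m for every m in mods via a product/remainder tree."""
    if not mods:
        return []
    tree = [list(mods)]
    while len(tree[-1]) > 1:                      # product tree, leaves upward
        lvl = tree[-1]
        tree.append([prod(lvl[i:i + 2]) for i in range(0, len(lvl), 2)])
    rems = [x % tree[-1][0]]
    for lvl in reversed(tree[:-1]):               # remainder tree, root downward
        rems = [rems[i // 2] % m for i, m in enumerate(lvl)]
    return rems

def batch_smooth_part(nums, primes):
    """Smooth part of each n in nums with respect to primes, batch style."""
    P = prod(primes)                              # one big product of all primes
    out = []
    for n, z in zip(nums, remainder_tree(P, nums)):
        # squaring z = P mod n until 2^e >= log2(n) covers prime multiplicities
        e = 1
        while e < n.bit_length():
            z = z * z % n
            e *= 2
        out.append(gcd(n, z))                     # gcd(n, 0) == n: fully smooth
    return out
```

For example, with primes {2, 3, 5, 7} the smooth parts of 88 = 2³·11, 77 = 7·11 and the prime 97 come out as 8, 7 and 1 in a single pass; the cost is dominated by multiplications in the trees, which is exactly why the batch is near-polynomial time per integer only when there are many integers.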

Sample runs of my code.
Code:
gerbicz@gerbicz:~/gmp-6.1.2$ g++ -static -m64 -O2 -fomit-frame-pointer -frename-registers -flto -mtune=skylake -march=skylake -mavx2 -Wall -o bernstein bernstein.c -lgmp -lm
gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 1 -in semi_55_9e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542;
primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Processed 9000000 lines in 349 sec.; total 9000000 lines in 349 sec.
Total processed 9000000 numbers in 349 sec.
There were (too) few numbers in the file, for optimal performance we would expect roughly 9957119 numbers within the same range and difficulty.
gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 1 -in semi_60_15e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542;
primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Read input=semi_60_15e6.txt
Processed 15000000 lines in 1390 sec.; total 15000000 lines in 1390 sec.
Total processed 15000000 numbers in 1390 sec.
There were (too) few numbers in the file, for optimal performance we would expect roughly 51630451 numbers within the same range and difficulty.
gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 1 -in semi_55_2e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542;
primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Processed 2000000 lines in 93 sec.; total 2000000 lines in 93 sec.
Total processed 2000000 numbers in 93 sec.
There were (too) few numbers in the file, for optimal performance we would expect roughly 9957292 numbers within the same range and difficulty.
gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 0 -in easy_40_4e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542;
primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Read input=easy_40_4e6.txt
Processed 469841 lines in 3 sec.; total 469841 lines in 3 sec.
Processed 475395 lines in 3 sec.; total 945236 lines in 6 sec.
Processed 470891 lines in 2 sec.; total 1416127 lines in 8 sec.
Processed 472562 lines in 3 sec.; total 1888689 lines in 11 sec.
Processed 471654 lines in 3 sec.; total 2360343 lines in 14 sec.
Processed 472816 lines in 2 sec.; total 2833159 lines in 16 sec.
Processed 471924 lines in 3 sec.; total 3305083 lines in 19 sec.
Processed 470969 lines in 3 sec.; total 3776052 lines in 22 sec.
Processed 223948 lines in 1 sec.; total 4000000 lines in 23 sec.
Total processed 4000000 numbers in 23 sec.
gerbicz@gerbicz:~/gmp-6.1.2$
In table format, comparing only with the fast factor63 (I failed to install some others):

Code:
             factor63.c   bernstein.c
semi_55_9e6:   847 sec.   349 sec.
semi_60_15e6: 3407 sec.  1390 sec.
semi_55_2e6:   201 sec.    93 sec.
easy_40_4e6:     5 sec.    23 sec.
(for a file named string_L_n we factorize n integers, each of L bits)

Yeah, for "easy" numbers my algorithm is slower: not only at 40 bits, but also at 60 bits, and maybe even at 64 bits.

In the code you can see the PARI/GP functions used to generate these files.

P.S. I have 16 GB of RAM, and in the 60-bit test alone (factorization of 15e6 hard numbers)
we used all the memory for a few seconds and dipped into swap,
but we survived. [We don't use a constant amount of memory, so this helps.]

If you have many numbers in the file, then for a better run they should be
sorted in increasing order [the PARI/GP method does not do this].
Attached Files
 bernstein.zip (48.1 KB, 67 views)

2017-11-08, 18:31   #72
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

2²×227 Posts

Thanks Robert, that's quite interesting.

Re hashed bases: a single hashed base is definitely a win for 32-bit inputs, but since a Lucas test takes about 1.5x the cost of an M-R test, I find it nicer to just do that, get faster overall performance, and skip the big table as well. Though it depends on your test input set (if they're all SPSP2 pseudoprimes, then the 2 or 3 M-R tests will look good). It also depends on your implementations, of course. The 1.5x or so cost is for x86-64 with Montgomery math. GMP for larger inputs is something in the 1.5x to 1.8x range if I recall correctly, but I'd have to dig to get the measurements. LibTomMath was more like 6x, but maybe it could be sped up.

Always nice to have more options, and I'll have to play with it.
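For context on the cost comparison above, the unit being weighed against a Lucas test is a single Miller-Rabin strong-pseudoprime round, which can be sketched as follows (an editor's illustration, not danaj's code):

```python
def mr_round(n, a):
    """One Miller-Rabin strong-pseudoprime round: True if n passes base a."""
    if n < 3 or n % 2 == 0:
        return n == 2
    d, s = n - 1, 0
    while d % 2 == 0:                 # write n-1 = d * 2^s with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x in (1, n - 1):
        return True
    for _ in range(s - 1):            # keep squaring, looking for -1 mod n
        x = x * x % n
        if x == n - 1:
            return True
    return False
```

Composites such as 2047 = 23·89 pass base 2 but fail base 3, which is exactly why the choice between a hashed single base, several fixed bases, or one M-R round plus a Lucas test is a trade-off over the expected input set.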
2017-11-08, 18:59   #73
arbooker

"Andrew Booker"
Mar 2013

85₁₀ Posts

Quote:
 Originally Posted by R. Gerbicz The used key idea is the Bernstein's method, see: https://cr.yp.to/papers/sf.pdf.
Nice work! I was aware of this algorithm of Bernstein and his toy implementation. It's nice to see an optimized version and impressive that it compares favorably to the one-off algorithms even for small bit sizes.

Probably the optimal strategy is to do some low-level Pollard rho to find the small prime factors, then run Bernstein to take care of all the stubborn cases in one go.
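The rho pre-pass suggested here can be sketched minimally (an editor's illustration using the Floyd-cycle variant with a retry on the rare degenerate case; not tied to any code in this thread):

```python
from math import gcd

def pollard_rho(n):
    """Floyd-cycle Pollard rho: return a nontrivial factor of composite n.

    Retries with a different polynomial x^2 + c if a run degenerates;
    returns None only if every retry fails (or n is prime).
    """
    if n % 2 == 0:
        return 2
    for c in range(1, 20):
        x = y = 2
        d = 1
        while d == 1:
            x = (x * x + c) % n           # tortoise: one step
            y = (y * y + c) % n           # hare: two steps
            y = (y * y + c) % n
            d = gcd(abs(x - y), n)
        if d != n:                        # d == n means this c degenerated
            return d
    return None
```

Stripping small prime factors this way first, and then sending only the stubborn cofactors through one big Bernstein product-tree pass, is the hybrid strategy described above.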

2017-11-08, 19:20   #74
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

5×17² Posts

Quote:
 Originally Posted by danaj Thanks Robert, that's quite interesting. Re hashed bases, a single hashed base is definitely a win for 32-bit inputs, but since a Lucas test takes about 1.5x the cost of a M-R test, I find it nicer to just do that and get faster overall performance and skip the big table as well. Though it depends on your test input set (if they're all SPSP2 pseudoprimes, then the 2 or 3 M-R test will look good).
In the interesting case of hard numbers (diff=1) we do not make even a single isprime call.
For easy numbers we call it at most once per number: nothing for c<2^32 (just a table lookup)
or when c<=maxp^2 (where we know that c=1 or c is prime), maxp being the largest prime in the smaller product tree.
(There is also a special saving here when you use composites=1 as a switch.)

With your decent isprime() functions the method would be somewhat faster in the easy case
[you would have to replace the isprime64 function].
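The case analysis above can be written out as a tiny decision function (an editor's sketch with made-up names, not the posted code): after every prime up to maxp has been divided out, a cofactor not exceeding maxp² cannot be composite, and anything below 2^32 is settled by the precomputed table.

```python
def classify_cofactor(c, maxp, table_bits=32):
    """Decide whether cofactor c needs a real isprime call.

    Assumes all prime factors <= maxp (the largest prime in the smaller
    product tree) have already been divided out of c.
    """
    if c == 1:
        return "unit"
    if c < (1 << table_bits):
        return "table lookup"          # precomputed isprime table suffices
    if c <= maxp * maxp:
        return "prime"                 # any factorization would need a factor <= maxp
    return "needs isprime call"
```

Only the last branch ever costs an isprime call, which is why hard (diff=1) inputs need none at all.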

Quote:
 Originally Posted by danaj LibTomMath was more like 6x but maybe it could be sped up.
Among the costly operations on large integers we mostly do multiplication and no hard division (only division by 2^e, which is cheap); it is interesting to ask why. This is due to the
scaled remainder tree (SRT in the code): by doing only one hard floating-point division and a multiplication we avoid all the other integer divisions that are common in a "regular" remainder tree.
Though, interestingly, this SRT gave almost no speedup in the code.

2017-11-08, 19:56   #75
R. Gerbicz

"Robert Gerbicz"
Oct 2005
Hungary

5A5₁₆ Posts

Quote:
 Originally Posted by arbooker Nice work! I was aware of this algorithm of Bernstein and his toy implementation. It's nice to see an optimized version and impressive that it compares favorably to the one-off algorithms even for small bit sizes.
I don't remember that toy.c. My basic implementation, which I called vintage, was this bernstein_demo.c (renamed today); see just one test run:
Code:
gerbicz@gerbicz:~/gmp-6.1.2$ ./demo
Polinom ideju faktorizalas egyszerre, sok 'kis' szamra D. J. Bernstein modszere alapjan.
Hany bites random szamokat szeretnel (<63 bit) ? 40
Hanyat generaljak? 1000000
A random szamok maximuma=1099510120925
Primek szama a maximum negyzetgyokeig=82025
Kesz 7 masodperc alatt.
Faktorizalas eredmenye a ki.txt fileban talalhato.
gerbicz@gerbicz:~/gmp-6.1.2$
I left the text in Hungarian: the first question asks for L, the second for the n value; the code then generates n random integers, each of L bits, and factorizes them. The output is in ki.txt. This version has the L<63 limitation [the original Bernstein method can factorize up to L=64], uses more memory for large L, and uses only one remainder tree, but it still has decent speed, though it is slower than bernstein.c. It displays the time used on the second-to-last line.

Quote:
 Originally Posted by arbooker Probably the optimal strategy is to do some low-level Pollard rho to find the small prime factors, then run Bernstein to take care of all the stubborn cases in one go.
Note that in this version, in the easy case (diff=0), we actually first extract the small prime factors using smaller product trees, but with the same algorithm. For the remaining composite(!) cofactors we use the big trees (with primes up to max sqrt(n)).
That is why diff=0 is faster than diff=1, of course only if there were easy numbers in the input file.
But it is possible that extracting the smallish factors with, for example, Pollard rho would be faster.
Attached Files
 bernstein_demo.zip (1.8 KB, 57 views)

2017-11-09, 10:02   #76
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

5818₁₀ Posts

Could Bernstein's method be useful for splitting large primes in SIQS or GNFS? It is certainly possible to have a lot of numbers to factor simultaneously. I get the impression that memory usage would be an issue for GNFS at least. I need to remind myself how Bernstein's method works.
2017-11-09, 12:50   #77
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

22×227 Posts

Quote:
 Originally Posted by henryzz Could Bernstein's method be useful for splitting large primes in SIQS or GNFS? It is certainly possible to have a lot of numbers to factor simultaneously. I get the impression that memory usage would be an issue for GNFS at least. I need to remind myself how Bernstein's method works.
Memory usage is a big deal for my application, but something like this for much larger inputs has been on my wish list for ECPP for a while. There we get 50-500 numbers and the goal is to strip out factors until we get a satisfactory one or decide we should give up and try a different tree. Not the sheer volume of numbers that QS or NFS has, and significantly larger (hundreds or thousands of digits). Primo does a variety of clever things here.

