#67
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2^2·227 Posts
Quote:
My initial performance indications are great for this. The x86-64 "pbrent" codes are slightly faster up to 63-64 bits, but co-siqs doesn't use any asm. From 64-70 bits it's debatable, but co-siqs is certainly close if not ahead, and up to 126 bits it is faster than anything else I have tried. I wrote a "SQUFOF126", a modified version of my version of Ben's racing SQUFOF, using a native 64-bit int core and GMP for the setup. It's at least in the ballpark, but not faster than co-siqs. It seems SQUFOF has been pushed out of every range. For semiprimes that aren't an even split, I find a small amount of low-overhead p-1 in front can as much as halve the average time taken. My experience has been that ECM isn't as useful for this sort of optimistic up-front test at small sizes. I need to add a standalone small p-1 to my todo list (currently it uses my prime iterator, which is a little higher cost than a simple array and pulls in a lot more code, albeit more flexibly).
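For reference, a minimal sketch of the kind of cheap p-1 stage-1 front end described above (Python; my own illustration, not Dana's code - the bound of 500, the base 2, and the inline trial-division sieve are arbitrary choices):

```python
from math import gcd

def pminus1(n, bound=500):
    """Tiny Pollard p-1 stage 1, cheap enough to try before a heavier
    method.  Returns a nontrivial factor of n, or None on failure."""
    # primes up to bound, by trial division (a real implementation
    # would use a precomputed array, as discussed above)
    primes = [p for p in range(2, bound)
              if all(p % q for q in range(2, int(p**0.5) + 1))]
    a = 2
    for p in primes:
        # raise a to the largest power of p not exceeding bound
        pk = p
        while pk * p <= bound:
            pk *= p
        a = pow(a, pk, n)
    g = gcd(a - 1, n)
    return g if 1 < g < n else None
```

This pulls out p quickly whenever p-1 is smooth: for example it finds 2521 in 2521 * 1000003, because 2520 = 2^3 * 3^2 * 5 * 7 is 500-smooth while 1000002 is not.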
#68
"Tilman Neumann"
Jan 2016
Germany
2×3×71 Posts
Quote:
If you use multipliers of the form 1680 * (squarefree sequence) = {1, 2, 3, 5, 6, 7, 10, 11, 13, ...} and switch multipliers after ~1.5*N^(1/5) iterations, you'd get the algorithm I proposed in my PSIQS thread. It will only work up to some 90+ bits because of the larger multipliers, but it looked quite fine in Java. I am curious how a highly optimized C/asm implementation compares. By the way, 1680 is just one very smooth constant I found to work well; 5040, 10080 or the like could be slightly better.
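The multiplier schedule in the quote can be sketched directly (Python; my own illustration - the sequence and the iteration cap come from the description above, the function names are mine):

```python
from itertools import islice

def is_squarefree(s):
    """True if no prime square divides s."""
    d = 2
    while d * d <= s:
        if s % (d * d) == 0:
            return False
        d += 1
    return True

def multipliers(base=1680):
    """Yields base * {1, 2, 3, 5, 6, 7, 10, 11, 13, ...} as in the quote."""
    s = 0
    while True:
        s += 1
        if is_squarefree(s):
            yield base * s

def max_iterations(N):
    """Switch to the next multiplier after ~1.5 * N^(1/5) iterations."""
    return int(1.5 * N ** 0.2)
```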
#69
Tribal Bullet
Oct 2004
2×3×19×31 Posts
Quote:
The code has nothing to do with RDS's original sieve; it was part of an early project to build a line sieve suitable for kilobit-scale GNFS. I never did anything else with it, and dusted it off when there was a chance RDS would contribute to the group effort for sieving 2,991- if his code could be overhauled to scale up to problems that large. Unfortunately, just adding 3LP cofactorization wasn't enough to make his lattice sieve scale up to handle large numbers, but at least the code is out there. I seem to recall the SIQS code in the Franke-Kleinjung sievers was twice as fast as this when I last benchmarked it a lifetime ago, but it does have a 96-bit limit.
#70
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2×2,909 Posts
#71
"Robert Gerbicz"
Oct 2005
Hungary
5×17^2 Posts
The key idea used is Bernstein's method, see: https://cr.yp.to/papers/sf.pdf.
We factorize in polynomial time(!) per integer, but this is only true if you have MANY integers to factorize, hence this method also needs a lot of memory. There is some trade-off here if you simply have not got enough memory; see my tests where we factorize 2e6 and also 9e6 55-bit integers. In the latter case we are close to an optimal setup, but even in the first case, with only 2e6 integers, the time used is still not that bad. I have some improvements on this algorithm, which is why this code is somewhat longer. It also contains an improvement on this idea: http://www.mersenneforum.org/showpos...40&postcount=8 (we use only 3 bases(!)).

Sample runs of my code.

Code:
gerbicz@gerbicz:~/gmp-6.1.2$ g++ -static -m64 -O2 -fomit-frame-pointer -frename-registers -flto -mtune=skylake -march=skylake -mavx2 -Wall -o bernstein bernstein.c -lgmp -lm
gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 1 -in semi_55_9e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542; primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Read input=semi_55_9e6.txt
Processed 9000000 lines in 349 sec.; total 9000000 lines in 349 sec.
Total processed 9000000 numbers in 349 sec.
There were (too) few numbers in the file, for optimal performance we would expect roughly 9957119 numbers within the same range and difficulty.

gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 1 -in semi_60_15e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542; primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Read input=semi_60_15e6.txt
Processed 15000000 lines in 1390 sec.; total 15000000 lines in 1390 sec.
Total processed 15000000 numbers in 1390 sec.
There were (too) few numbers in the file, for optimal performance we would expect roughly 51630451 numbers within the same range and difficulty.

gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 1 -in semi_55_2e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542; primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Read input=semi_55_2e6.txt
Processed 2000000 lines in 93 sec.; total 2000000 lines in 93 sec.
Total processed 2000000 numbers in 93 sec.
There were (too) few numbers in the file, for optimal performance we would expect roughly 9957292 numbers within the same range and difficulty.

gerbicz@gerbicz:~/gmp-6.1.2$ ./bernstein -composites 0 -diff 0 -in easy_40_4e6.txt
Simultaneous fast factorization for many (small) numbers using D. J. Bernstein's method.
primepi(2^16)=6542; primepi(2^32)=203280221;
Setup took 4 sec. [to build the isprime table]; not counted in total time(!)
Read input=easy_40_4e6.txt
Processed 469841 lines in 3 sec.; total 469841 lines in 3 sec.
Processed 475395 lines in 3 sec.; total 945236 lines in 6 sec.
Processed 470891 lines in 2 sec.; total 1416127 lines in 8 sec.
Processed 472562 lines in 3 sec.; total 1888689 lines in 11 sec.
Processed 471654 lines in 3 sec.; total 2360343 lines in 14 sec.
Processed 472816 lines in 2 sec.; total 2833159 lines in 16 sec.
Processed 471924 lines in 3 sec.; total 3305083 lines in 19 sec.
Processed 470969 lines in 3 sec.; total 3776052 lines in 22 sec.
Processed 223948 lines in 1 sec.; total 4000000 lines in 23 sec.
Total processed 4000000 numbers in 23 sec.
Code:
              factor63.c   bernstein.c
semi_55_9e6:    847 sec.     349 sec.
semi_60_15e6:  3407 sec.    1390 sec.
semi_55_2e6:    201 sec.      93 sec.
easy_40_4e6:      5 sec.      23 sec.

Yeah, for "easy" numbers my algorithm is slower, not only at 40 bits, but also at 60 bits and maybe even at 64 bits. In the code you can see the PARI/GP functions used to generate these files, but you can also download them, in the same order as above:
https://drive.google.com/file/d/1WLQ...ew?usp=sharing
https://drive.google.com/file/d/1CJG...ew?usp=sharing
https://drive.google.com/file/d/14sn...ew?usp=sharing
https://drive.google.com/file/d/13wQ...ew?usp=sharing

ps. I have 16 GB of RAM, and only in the 60-bit test (factorization of 15e6 hard numbers) did we use all memory for some seconds and dip into swap, but we survived. [We don't use a constant amount of memory the whole time, so this helps.] If you have many numbers in the file, then for a better run they should be sorted in increasing order [the PARI/GP method does not do this].
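For readers unfamiliar with the method: the core of Bernstein's batch algorithm is a product tree over the primes plus a remainder tree over the inputs, followed by repeated squaring to extract each input's smooth part. A minimal Python sketch of that core idea (my own illustration, not Robert's C code; no scaled remainder tree and no other optimizations):

```python
from math import gcd

def product_tree(xs):
    """Bottom-up product tree; tree[0] is the leaves, tree[-1] one product."""
    tree = [list(xs)]
    while len(tree[-1]) > 1:
        row = tree[-1]
        tree.append([row[i] * row[i + 1] if i + 1 < len(row) else row[i]
                     for i in range(0, len(row), 2)])
    return tree

def remainder_tree(P, xs):
    """P mod x for every x in xs, walking down a product tree of the xs."""
    tree = product_tree(xs)
    rems = [P % tree[-1][0]]
    for row in reversed(tree[:-1]):
        # child i in this row has parent i // 2 in the row above
        rems = [rems[i // 2] % v for i, v in enumerate(row)]
    return rems

def smooth_parts(primes, xs):
    """Smooth part of each x with respect to the prime set, batch-style."""
    P = product_tree(primes)[-1][0]          # one big product of all primes
    out = []
    for x, r in zip(xs, remainder_tree(P, xs)):
        # square r until the implied exponent 2^e exceeds log2(x),
        # so repeated prime factors of x are fully captured
        e = max(1, x.bit_length().bit_length())
        for _ in range(e):
            r = r * r % x
        out.append(gcd(r, x))
    return out
```

With primes up to 100, an input like 9696 = 2^5 * 3 * 101 yields smooth part 96, while a semiprime with both factors above the bound yields 1; that gcd result is what tells the batch which numbers still need more work.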
#72
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2^2×227 Posts
Thanks Robert, that's quite interesting.
Re hashed bases: a single hashed base is definitely a win for 32-bit inputs, but since a Lucas test costs about 1.5x a M-R test, I find it nicer to just do that, get fast overall performance, and skip the big table as well. Though it depends on your test input set (if the inputs are all base-2 strong pseudoprimes, then the 2- or 3-base M-R test will look good). It also depends on your implementations, of course. The ~1.5x cost is for x86-64 with Montgomery math. GMP for larger inputs is something in the 1.5x to 1.8x range if I recall correctly, but I'd have to dig to get the measurements. LibTomMath was more like 6x, but maybe it could be sped up. Always nice to have more options, and I'll have to play with it.
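For context, the strong probable-prime (M-R) round being discussed, in a short Python sketch (my own illustration; the fixed 12-prime base set is the standard deterministic choice below 2^64, not the hashed single base or the Lucas test mentioned above):

```python
def is_spsp(n, a):
    """One strong probable-prime (Miller-Rabin) round for n to base a."""
    if n % a == 0:
        return n == a
    d, s = n - 1, 0
    while d % 2 == 0:          # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    x = pow(a, d, n)
    if x == 1 or x == n - 1:
        return True
    for _ in range(s - 1):
        x = x * x % n
        if x == n - 1:
            return True
    return False

def is_prime64(n):
    """Deterministic for n < 2^64: the first 12 prime bases suffice."""
    if n < 2:
        return False
    return all(is_spsp(n, a)
               for a in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37))
```

The classic strong pseudoprime 2047 = 23 * 89 passes the base-2 round but fails the full set, which is exactly the input class Dana mentions skewing benchmarks.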
#73
"Andrew Booker"
Mar 2013
85 Posts
Quote:
Probably the optimal strategy is to do some low-level Pollard rho to find the small prime factors, then run Bernstein to take care of all the stubborn cases in one go.
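A minimal sketch of such a Pollard rho front end (Python, using Floyd cycle-finding rather than the usual faster Brent variant; my own illustration, and it assumes n is composite):

```python
from math import gcd

def pollard_rho(n):
    """Find a nontrivial factor of a composite n (may loop if n is prime)."""
    if n % 2 == 0:
        return 2
    for c in range(1, 20):          # retry with a different polynomial x^2 + c
        x = y = 2
        d = 1
        while d == 1:
            x = (x * x + c) % n     # tortoise: one step
            y = (y * y + c) % n
            y = (y * y + c) % n     # hare: two steps
            d = gcd(abs(x - y), n)
        if d != n:
            return d                # nontrivial factor found
    return None                     # every retry degenerated
```

The idea of the combined strategy is that rho strips the cheap small factors in negligible time, leaving only the stubborn near-even-split semiprimes for the memory-hungry batch step.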
#74
"Robert Gerbicz"
Oct 2005
Hungary
5×17^2 Posts
Quote:
For easy numbers we are making at most one isprime call per number; here nothing at all for c<2^32 (just a table lookup), or when c<=maxp^2 (then we know that c=1 or c is prime), where maxp is the max prime in the smaller product tree. (And here there is also a special saving when you use composites=1 as a switch.) With your decent isprime() functions the method would be somewhat faster for the easy case [you have to replace the isprime64 function]. Among the costly operations on large integers we are in most cases doing multiplication, with no hard division (only division by 2^e, which is cheap); it is interesting to ask why. This is due to the scaled remainder tree (SRT in the code): by doing only one hard floating-point division and multiplication, we avoid all the other integer divisions that are common in a "regular" remainder tree. Though, interestingly, this SRT gave almost no speedup in the code.
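The cofactor shortcut described here can be sketched as follows (Python; my own illustration, the function name and return labels are hypothetical, not names from Robert's code):

```python
def classify_cofactor(c, maxp, isprime32):
    """Classify the cofactor c left after all prime factors <= maxp
    have been removed.  isprime32 is assumed to be a fast table
    lookup valid for c < 2^32."""
    if c == 1:
        return "fully factored"
    if c <= maxp * maxp:
        # a composite c would have a factor <= sqrt(c) <= maxp,
        # but all such factors were already stripped, so c is prime
        return "prime"
    if c < 2 ** 32:
        return "prime" if isprime32(c) else "composite"
    return "needs a general primality test"
```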
#75
"Robert Gerbicz"
Oct 2005
Hungary
5A5_16 Posts
Quote:
Code:
gerbicz@gerbicz:~/gmp-6.1.2$ ./demo
Polinom ideju faktorizalas egyszerre, sok 'kis' szamra D. J. Bernstein modszere alapjan.
Hany bites random szamokat szeretnel (<63 bit) ? 40
Hanyat generaljak? 1000000
A random szamok maximuma=1099510120925
Primek szama a maximum negyzetgyokeig=82025
Kesz 7 masodperc alatt. Faktorizalas eredmenye a ki.txt fileban talalhato.
gerbicz@gerbicz:~/gmp-6.1.2$

[The demo's prompts are in Hungarian: "Polynomial-time factorization at once, for many 'small' numbers, based on D. J. Bernstein's method. How many bits should the random numbers have (<63 bits)? 40. How many should I generate? 1000000. Maximum of the random numbers=1099510120925. Number of primes up to the square root of the maximum=82025. Done in 7 seconds. The factorization results are in the file ki.txt."]

Quote:
That is why diff=0 is faster than diff=1, of course only if there were easy numbers in the input file. But it could well be that extracting the smallish factors with, for example, Pollard rho would be faster.
#76
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
5818 Posts
Could Bernstein's method be useful for splitting large primes in SIQS or GNFS? It is certainly possible to have a lot of numbers to factor simultaneously. I get the impression that memory usage would be an issue for GNFS at least. I need to remind myself how Bernstein's method works.
#77
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2^2×227 Posts
Quote:
Thread | Thread Starter | Forum | Replies | Last Post |
Factoring Mersenne numbers | paulunderwood | Miscellaneous Math | 18 | 2017-08-27 14:56 |
Factoring Big Numbers In c# | ShridharRasal | Factoring | 10 | 2008-03-20 17:17 |
Question about factoring code | Peter Nelson | Software | 9 | 2005-03-30 18:28 |
Factoring Fermat Numbers | Axel Fox | Software | 14 | 2003-07-04 18:57 |
Questions about SSE2 code and Factoring | Joe O | Software | 2 | 2002-09-13 23:39 |