mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Factoring (https://www.mersenneforum.org/forumdisplay.php?f=19)
-   -   Some code for factoring numbers up to 2^63 (https://www.mersenneforum.org/showthread.php?t=22525)

 arbooker 2017-08-19 14:44

Some code for factoring numbers up to 2^63

While working on aliquot sequences last year, I wrote some fast code for factoring and primality testing of numbers < 2^63. I figured that it might be useful to others, so I've made it available on github at
[url]https://github.com/arbooker/factor63[/url]

One feature of the code is a guaranteed upper bound on the running time for factoring. It's based on Pollard rho, with a data table for resolving collisions and prime factors with long cycles. The primality testing is based on Jan Feitsma's base-2 pseudoprime database.
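To give the idea of the primality side: by Feitsma's database, the only odd composites below 2^64 that pass a single base-2 strong probable-prime test form a known, tabulated list, so one modular exponentiation plus a table lookup decides primality. Here is a minimal sketch of just the test (plain __uint128_t mulmod; the table lookup and the Montgomery arithmetic of the real code are omitted):

```c
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)((unsigned __int128)a * b % n);
}

static uint64_t powmod(uint64_t a, uint64_t e, uint64_t n) {
    uint64_t r = 1;
    while (e) {
        if (e & 1) r = mulmod(r, a, n);
        a = mulmod(a, a, n);
        e >>= 1;
    }
    return r;
}

/* 1 if odd n > 2 is a base-2 strong probable prime.  Composites that
 * return 1 here are exactly the tabulated base-2 SPSPs. */
int sprp2(uint64_t n) {
    uint64_t d = n - 1;
    int s = 0;
    while ((d & 1) == 0) { d >>= 1; s++; }
    uint64_t x = powmod(2, d, n);
    if (x == 1 || x == n - 1) return 1;
    for (int i = 1; i < s; i++) {
        x = mulmod(x, x, n);
        if (x == n - 1) return 1;
    }
    return 0;
}
```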

Sergei Chernykh has released similar code for factoring that works up to 2^64. I stopped at 2^63 both to simplify the coding of Montgomery arithmetic and to keep the data table under 4GB.
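The simplification from staying below 2^63 is that the REDC accumulation t + m*n always fits in 128 bits with room to spare, so no carry handling is needed. A minimal sketch of 64-bit Montgomery multiplication under that assumption (illustrative helper names, not the factor63 source):

```c
#include <stdint.h>

/* npi = -n^{-1} mod 2^64 for odd n, by Hensel lifting: n*n = 1 (mod 8),
 * and each step inv *= 2 - n*inv doubles the number of correct low bits. */
static uint64_t mont_npi(uint64_t n) {
    uint64_t inv = n;
    for (int i = 0; i < 5; i++) inv *= 2 - n * inv;
    return (uint64_t)0 - inv;
}

/* REDC: returns a*b*R^{-1} mod n with R = 2^64.  Since a, b < n < 2^63,
 * t + m*n stays below 2^128 -- the simplification mentioned above. */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t n, uint64_t npi) {
    unsigned __int128 t = (unsigned __int128)a * b;
    uint64_t m = (uint64_t)t * npi;
    uint64_t r = (uint64_t)((t + (unsigned __int128)m * n) >> 64);
    return r >= n ? r - n : r;
}

/* Conversions into and out of Montgomery form (x <-> x*R mod n). */
static uint64_t to_mont(uint64_t a, uint64_t n) {
    return (uint64_t)(((unsigned __int128)a << 64) % n);
}
static uint64_t from_mont(uint64_t a, uint64_t n, uint64_t npi) {
    return mont_mul(a, 1, n, npi);
}
```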

The code is written for GCC on 64-bit Linux. I'm not planning to port it to any other platforms, but I don't mind if other people with the know-how would like to do so (that's why I put it on github). As far as I can see, the only system-dependent things that would need to be ported are the POSIX function mmap(), support for 128-bit arithmetic (GCC's __uint128_t), and GCC's __builtin_ctzl()/__builtin_clzl().

 Till 2017-10-05 18:27

Did you compare it to a good SQUFOF implementation?
In my experiments, SQUFOF turns out to be much faster than any Pollard-rho-style algorithm, though yours may be much more refined...
Till

 danaj 2017-10-05 20:25

[QUOTE=Till;469303]Did you compare it to a good SQUFOF implementation?
In my experiments, SQUFOF turns out to be much faster than any Pollard-rho-style algorithm, though yours may be much more refined...
Till[/QUOTE]

I haven't yet looked at it in detail, but a couple of comments about that.

GNU factor uses both Pollard rho and SQUFOF. Pollard rho has its place, especially since it's quite amenable to Montgomery math.

Yafu is another place to look for some very well optimized 64-bit factoring, though that code isn't well separated from the rest of its source.

I use small trial factoring followed by some Pollard-Rho (Brent), P-1, and SQUFOF. Your mileage will vary. I found it to be faster than Pari/GP and GNU factor. My is_prime is about 4-5x faster than Pari's C function uisprime.
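For anyone curious, the first two stages of that pipeline can be sketched in a few lines (my simplified stand-ins, not MPU's actual code; the P-1 and SQUFOF stages and the Montgomery math are omitted):

```c
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)((unsigned __int128)a * b % n);
}

static uint64_t gcd(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* Small trial factoring.  Odd composites in the loop can never divide n
 * first, since their prime factors are tried earlier.  Returns 0 if no
 * factor below 100 is found. */
static uint64_t small_factor(uint64_t n) {
    if (n % 2 == 0) return 2;
    for (uint64_t p = 3; p < 100 && p * p <= n; p += 2)
        if (n % p == 0) return p;
    return 0;
}

/* Pollard rho with Brent's cycle detection (gcd every step; a tuned
 * version batches the gcds).  Returns a nontrivial factor of composite n,
 * or 0 if the budget runs out or the walk collides modulo every prime
 * factor at once. */
static uint64_t rho_brent(uint64_t n, uint64_t c, uint64_t budget) {
    uint64_t x, y = 2, r = 1;
    while (budget) {
        x = y;
        for (uint64_t i = 0; i < r && budget; i++, budget--) {
            y = (mulmod(y, y, n) + c) % n;
            uint64_t d = gcd(x > y ? x - y : y - x, n);
            if (d != 1) return d == n ? 0 : d;
        }
        r <<= 1;
    }
    return 0;
}
```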

For my measurements I typically tune for performance on random inputs. That means some wasted time if you know the inputs are skewed toward semiprimes.

As for factor63, the tiny code size is nice, but the required 3.8GB download for the mmapped data file is obnoxious and makes the whole thing unusable for most purposes. I couldn't distribute a project that uses it. But I can see that it could work for people who just want to run something on their own machine.

I recommend adding an #ifdef'd main function that does simple command-line factoring.

The isprime for large values is about 1.6x faster than mine, though it requires the huge table and stops at 63 bits. The Fermat test is nice and fast regardless.

For factoring, it's a bit interesting. The actual CPU time used is pretty good -- about 30% faster than mine for 62-bit inputs -- but the total wall-clock time is worse due to having to traverse the giant file.

Given the input 8871153694124723083 it fails with a floating-point exception; it works with the iteration count bumped to 500000. In my case I run SQUFOF if Pollard-Brent fails to find a factor, then tune the iteration count for the best performance (the number varies by input size and by how fast mulmods are, since I have to run on other platforms without x86 or __uint128_t). It seems you need a backup plan, or to raise the iteration count to something larger.
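The backup plan can be as simple as a driver that escalates: retry rho with a new polynomial constant and a bigger budget, and only then hand off to a guaranteed method. A hypothetical sketch (a stand-in Floyd-style rho, not factor63's routine):

```c
#include <stdint.h>

static uint64_t mulmod_(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)((unsigned __int128)a * b % n);
}
static uint64_t gcd_(uint64_t a, uint64_t b) {
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* One bounded rho attempt with f(x) = x^2 + c.  Returns a nontrivial
 * factor of composite n, or 0 on failure (budget spent, or the tortoise
 * and hare met modulo every prime factor at once). */
static uint64_t rho_try(uint64_t n, uint64_t c, uint64_t iters) {
    uint64_t x = 2, y = 2;
    while (iters--) {
        x = (mulmod_(x, x, n) + c) % n;          /* tortoise: 1 step */
        y = (mulmod_(y, y, n) + c) % n;          /* hare: 2 steps */
        y = (mulmod_(y, y, n) + c) % n;
        uint64_t d = gcd_(x > y ? x - y : y - x, n);
        if (d != 1 && d != n) return d;
        if (d == n) return 0;                    /* full collision: retry */
    }
    return 0;
}

/* Escalating driver: vary c and grow the budget before giving up.
 * A real driver would fall back to SQUFOF here instead of returning 0. */
uint64_t rho_with_backup(uint64_t n) {
    for (uint64_t c = 1; c <= 4; c++) {
        uint64_t d = rho_try(n, c, 100000ULL << c);
        if (d) return d;
    }
    return 0;   /* caller should invoke a guaranteed method */
}
```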

I've been meaning for years to pull my code out into a generic library. I'm kind of thinking now that I should merge it with my GMP code for similar functionality. My holdup is mainly that this is a lot of work that isn't fun numerical coding :).

 CRGreathouse 2017-10-05 20:42

Jérôme Milan [url=https://hal.inria.fr/inria-00188645v3/document]compared algorithms[/url] and found that, of the algorithms he tested, SQUFOF was far and away the king from 45 bits until the crossover with SIQS around 59 bits. ECM, CFRAC, McKee, MPQS, and QS didn't even get close. Rho, unfortunately, wasn't compared, but maybe someone could run factor63 against [url=http://www.lix.polytechnique.fr/Labo/Jerome.Milan/tifa/tifa.xhtml]TIFA[/url]'s siqs_factors and squfof_factors? And/or Msieve and yafu?

 danaj 2017-10-05 21:30

[QUOTE=CRGreathouse;469312]Jérôme Milan [url=https://hal.inria.fr/inria-00188645v3/document]compared algorithms[/url] and found that, of the algorithms he tested, SQUFOF was far and away the king from 45 bits until the crossover with SIQS around 59 bits.[/quote]

In section 3 it says the composites are all semiprimes, making it an entirely different problem from looking at random n-bit inputs. If the goal is to optimize average time on random n-bit inputs, then it's definitely worth doing some work to look for small factors, and Pollard rho is often a good choice to run before SQUFOF.

[QUOTE=CRGreathouse;469312]ECM, CFRAC, McKee, MPQS, and QS didn't even get close. Rho, unfortunately, wasn't compared, but maybe someone could run factor63 against [url=http://www.lix.polytechnique.fr/Labo/Jerome.Milan/tifa/tifa.xhtml]TIFA[/url]'s siqs_factors and squfof_factors? And/or Msieve and yafu?[/QUOTE]

factor63's big table of numbers makes it problematic for such a comparison. Not only does it have a mod-2 bitmask of primality and millions of pseudoprimes, but it also includes 146 million 32-bit numbers used by the Pollard-rho implementation to resolve collisions.

 bsquared 2017-10-05 21:36

[QUOTE=CRGreathouse;469312]Jérôme Milan [url=https://hal.inria.fr/inria-00188645v3/document]compared algorithms[/url] and found that, of the algorithms he tested, SQUFOF was far and away the king from 45 bits until the crossover with SIQS around 59 bits. ECM, CFRAC, McKee, MPQS, and QS didn't even get close. Rho, unfortunately, wasn't compared, but maybe someone could run factor63 against [url=http://www.lix.polytechnique.fr/Labo/Jerome.Milan/tifa/tifa.xhtml]TIFA[/url]'s siqs_factors and squfof_factors? And/or Msieve and yafu?[/QUOTE]

The Lehman method is also good for small inputs... Yafu uses it below 40 bits.
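Lehman's method is essentially Fermat's difference-of-squares search applied to k*n for small multipliers k, which bounds the work at O(n^(1/3)). A sketch of just the underlying Fermat step (not Yafu's implementation; fine for modest n, where x*x cannot overflow):

```c
#include <stdint.h>

/* Integer square root by Newton iteration; returns floor(sqrt(n)). */
static uint64_t isqrt(uint64_t n) {
    if (n < 2) return n;
    uint64_t x = n, y = (x + 1) / 2;
    while (y < x) { x = y; y = (x + n / x) / 2; }
    return x;
}

/* Fermat's search: find x with x^2 - n = y^2, giving n = (x-y)(x+y).
 * Lehman speeds this up by running the same search on k*n for small k.
 * Returns a (possibly trivial) factor, or 0 if the budget runs out. */
uint64_t fermat_step(uint64_t n, uint64_t tries) {
    uint64_t x = isqrt(n);
    if (x * x < n) x++;                 /* start at ceil(sqrt(n)) */
    for (; tries--; x++) {
        uint64_t t = x * x - n;
        uint64_t y = isqrt(t);
        if (y * y == t) return x - y;
    }
    return 0;
}
```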

[URL="http://www.mersenneforum.org/showpost.php?p=421920&postcount=14"]Here [/URL]is some comparative data for squfof implementations. (Yafu's squfof, now an AVX2-vectorized version, is about twice as fast as was reported there.)

I'll see if I can get Factor63 running on the same system to compare.

 bsquared 2017-10-06 02:34

[CODE]
/* Test driver appended to factor63.c (hence the one-file compile below);
   u64/i64 and the factor63 API come from that file. */
int main(int argc, char **argv)
{
    u64 state = 28953;
    i64 p[64];
    int i, d, num = 10000, e[64];
    FILE *fid;

    initfactor63("factor.bin");

    fid = fopen("results.txt", "w");
    for (i = 0; i < num; i++)
    {
        /* 64-bit LCG; the shift keeps inputs below 2^63 */
        state = 6364136223846793005ULL * state + 1442695040888963407ULL;
        d = factor63(p, e, state >> 1);
        fprintf(fid, "%lu has %d factors\n", state >> 1, d);
    }

    fclose(fid);
    return 0;
}
[/CODE]

I get about 1.2 ms per random input, on average:
[CODE]
493 % gcc -O2 factor63.c -lm
494 % time ./a.out
0.226u 0.240s 0:12.83 3.5% 0+0k 0+664io 1529pf+0w
[/CODE]

 CRGreathouse 2017-10-06 04:08

Do you have comparison figures for other programs/algorithms? The link only goes to 59 bits.

 danaj 2017-10-06 05:14

[QUOTE=bsquared;469323]
I get about 1.2 ms per random input, on average:
[CODE] 493 % gcc -O2 factor63.c -lm
494 % time ./a.out
0.226u 0.240s 0:12.83 3.5% 0+0k 0+664io 1529pf+0w[/CODE][/QUOTE]
I'm not following how that turns into 1.2ms (0.226s for 10k inputs). Wouldn't that be 23us?

On my Macbook, using the same wrapper but bumped to 100k inputs. Note the real vs. user time.
[code]
$ gcc -O2 -march=native factor63.c
$ time ./a.out

real 0m5.312s
user 0m2.066s
sys 0m0.472s
[/code]

Using a wrapper around the Math-Prime-Util factor_exp call that gives identical 100k-line output:
[code]
real 0m2.728s
user 0m2.681s
sys 0m0.026s
[/code]

With a bit more P-R before calling SQUFOF I can shave off another 0.1s, but that's still not 2.0. At that point only 6 of the 100,000 numbers make it past the P-R at all; with my default settings, 108/100,000 get through. So almost all the time is spent in P-R.

I'd love to have a faster SQUFOF even though it looks like it wouldn't make much difference here (though it would if we made the inputs harder). About a year ago I finally converted my P-R-Brent to Montgomery math, which gave a nice speedup.

FLINT2 ([url]https://github.com/fredrik-johansson/flint2[/url]), n_factor from ulong_extras. Same shasum.
[code]$ time ./a.out

real 0m27.774s
user 0m26.480s
sys 0m0.353s[/code]

 danaj 2017-10-06 06:57

GNU factor, which handles up to 127 bits. Similar wrapper, calling factor() directly so it skips all the text-based input. Same 100k-line output.

The factor() code does some trial division first, then calls the appropriate algorithm via either the one-word path (what we use) or the two-word path (for up to 127 bits). So we just use the normal path. The 64-bit Pollard-rho main loop looks quite similar to factor63's and MPU's, just with different code snippets for the mont_mulmod (aka mulredc) and different inner/outer loop counts depending on tuning. factor63 goes in a completely different direction once it finds a gcd != 1 and exits the main loop.

Pollard-Rho:

[code]$ time ./factor

real 0m4.046s
user 0m4.005s
sys 0m0.029s[/code]

SQUFOF:

[code]$ time ./factor

real 0m27.199s
user 0m26.372s
sys 0m0.203s[/code]

 danaj 2017-10-06 07:03

[QUOTE=danaj;469327]I'm not following how that turns into 1.2ms (0.226s for 10k inputs). Wouldn't that be 23us?[/quote]Finally decoded this, not long after my edit window closed: the 1.2ms is wall time (12.83s / 10k), while the 0.226u is user CPU time, with the difference going to reading the big file. So on my Macbook with an SSD there is a penalty for using the big file, but it's ~2-3x. On your machine it was a massive 50x penalty. Hence the 1283us instead of 23us. Ouch.
