mersenneforum.org  

Old 2017-10-06, 07:21   #12
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH


I got TIFA to compile, but it's really not set up for this sort of test. I'm not sure I want to slog through writing a factor() routine for it when I'm dubious about how competitive it will be.

I think yafu will be good to see.

Probably the best competition is NT Factor, since it is a small self-contained C file with two small header files. It seems to do very well and has the advantages that (1) it supports 127-bit inputs instead of 63-bit, or could be trimmed for 64-bit support, (2) it doesn't rely on a 128-bit compiler type, and (3) there is no huge data file.

Right now my code seriously needs to be turned into a more generic C library so it's easier to use standalone, but it will regardless be a lot more code.
Old 2017-10-06, 14:08   #13
bsquared
 
 
"Ben"
Feb 2007


Quote:
Originally Posted by danaj
Finally decoded this, not long after my edit window closed. So on my Macbook with SSD there is a penalty for using the big file but it's ~2-3x. On your machine it was a massive 50x penalty. Hence the 1283us instead of 23us. Ouch.
Ouch indeed. Sorry, I didn't notice that at first, and I now realize that I had put the huge file on a network-mounted drive. With it local and with 100,000 inputs I get:

Code:
real    0m4.104s
user    0m3.099s
sys     0m0.093s
yafu is not very good at this problem. There is way too much overhead using a batchfile (100k open/append/close operations on a logfile). The newest version has a "factor range" function, which is 10x slower when factoring the range of numbers between 2^63-200000 and 2^63-100000 (although again, there is quite a bit of logging overhead):

Code:
 time ./yafu "frange(2^63-200000,2^63-100000)" > out.txt
46.660u 2.174s 1:38.26 49.6%    0+0k 0+11208io 1pf+0w
Old 2017-10-06, 15:16   #14
arbooker
 
 
"Andrew Booker"
Mar 2013


Quote:
As for factor63, the tiny code size is nice, but the 3.8GB mmapped file required for downloading is obnoxious and makes the whole thing unusable for most purposes.
Obnoxious? I'm not forcing you to use it or even advocating it over anything else. I'm just putting the code out there in case anyone else finds it useful. This isn't the reaction that I was expecting.

That said, I agree that it isn't well suited for a project with wide distribution. The file size could be reduced, since the largest part is a factorization table that could be generated at initialization time. I've put it in the file simply because that's more convenient.
Quote:
The actual CPU time used is pretty good -- about 30% faster than mine for 62-bit inputs, but the total clock time is worse due to having to traverse the giant file.
That's true if you run it as a one-off, but it gets faster once your system has cached the file. The typical use case that I imagine is factoring lots of small numbers on a multi-core system, with the data table shared between them.
Quote:
Given the input 8871153694124723083 it will fail with a floating point exception. It works with iterations bumped to 500000.
Oops, that was a bug, now corrected. There is no harm in increasing the iteration count, but there is no need either. Nothing is left to chance, so if there are no further bugs then the code will always run correctly with a reasonable worst-case running time. 300000 iterations is close to optimal for my system, though admittedly I haven't tried very hard to tune it.
Old 2017-10-06, 15:21   #15
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

38C16 Posts
Default

I figured the goal was to look at just the C function, so for Perl/ntheory I wrote a similar driver in the XS.xs (the C interface) file. But it turns out I get basically the same time just calling it directly, assuming /tmp/n.txt has the list of 100k numbers produced by the LCG:

Code:
$ function mpufile { perl -Iblib/lib -Iblib/arch -MMath::Prime::Util=:all -nE "$1" $2; }
$ time mpufile 'chomp; say "$_ has ",scalar(factor_exp($_))," factors";' /tmp/n.txt | shasum
0778274b7c0b2679825496df0c9a62222dc22593  -

real	0m2.722s
user	0m2.730s
sys	0m0.027s
Your mention of a logfile reminds me of TIFA's setup, which has individual programs that take exactly one input and print a bunch of text for a single factoring invocation. They recommend a Perl script that calls the various algorithm executables to completely factor a number or list of numbers and log the results. The overhead is very high. There may be a C factoring method buried in there, and certainly one could be written.

It looks like spfactorlist could be modified fairly easily to roll the LCG and take n = state>>1 instead of n = nstart+i. I'm not sure the timing results would differ much from the linear range you ran. I would have thought Lehman would have helped, but if most of the time is in SQUFOF then that is not dissimilar to my and NT Factor's SQUFOF results, which were 27-35 seconds on my machine.

Where SQUFOF seems to excel is in the larger semiprimes that take excessively long for Pollard-Rho. That would be another good test.

Last fiddled with by danaj on 2017-10-06 at 15:26 Reason: Add mpufile alias
Old 2017-10-06, 18:27   #16
arbooker
 
 
"Andrew Booker"
Mar 2013

5·17 Posts
Default

Quote:
Originally Posted by arbooker
Nothing is left to chance, so if there are no further bugs then the code will always run correctly with a reasonable worst-case running time.
Well, that turned out to be wishful thinking. Taking this chance to look through the code again for the first time in a while, I found and fixed another bug.

There are a lot more restrictions on the iteration count than I thought. In particular, it's implicitly assumed that the chosen iteration count actually occurs for some prime. Since that isn't true for 500000 (by which point the stopping times get very sparse), the effect of your change was to make it go through trial division by all primes when Pollard rho fails. I have added some comments to the code to make the restrictions clear.

I've also repaired the earlier bug in a better way. The only numbers left standing by that point in the code are prime squares and semiprimes, so we can avoid the final primality test.
arbooker is offline   Reply With Quote
Old 2017-10-06, 22:40   #17
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

22×227 Posts
Default

I haven't seen any issues with the newest code.

I did some benchmarks of random semiprimes. There may be a more convenient way to share this, but I used what I know.
Code:
perl -Mntheory=:all -E 'csrand(0); say random_semiprime(63) for 1..100000'
generates 100k 63-bit composites, each a random 31-bit prime times a random 32-bit prime, in a bit less than half a second. Times below are real/user for various bit sizes:

Code:
bits     factor63           MPU                NTFactor
 40       1.56 /  1.45       1.84 /  1.79       2.26 /  2.15
 55      17.10 / 14.47      18.65 / 18.12      22.68 / 22.37
 60      44.68 / 34.74      46.18 / 43.76      56.92 / 55.15
 63      84.48 / 57.57      73.25 / 71.14      93.32 / 92.51
 64      ----- / -----     101.36 / 99.61     113.06 /111.19
Code:
mpu 'csrand(0); say random_semiprime(63) for 1..100000' | time ./f63 | shasum
mpu 'csrand(0); say random_semiprime(63) for 1..100000' | time ~/src/nt-factor/factor | shasum
mpu 'csrand(0); say random_semiprime(63) for 1..100000' | time perl -Iblib/lib -Iblib/arch -Mntheory=:all -nE 'chomp; say "$_: ",join(" ",factor($_));' | shasum
I took the opportunity to adjust the number of PR rounds my code does before moving to SQUFOF, which made it a bit faster at the high end. My main() for factor63 allows reading from stdin and sorts the factors, so it acts like good old BSD factor (which GNU NT Factor emulates, as does my factor.pl script).

Last fiddled with by danaj on 2017-10-06 at 22:53
Old 2017-10-06, 23:25   #18
arbooker
 

Quote:
Originally Posted by danaj
I did some benchmarks of random semiprimes.
Thanks, that's a really helpful comparison. I am perplexed by the difference between user and wall time that you keep observing, though. I ran a similar test of 100k semiprimes on my system and saw almost no difference (0.6 seconds out of about 70s total). How much memory does your test system have? Maybe the OS is never loading the whole file into memory, which will surely degrade performance. Does it do better if you replace mmap by malloc+read?
Old 2017-10-06, 23:42   #19
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

90810 Posts
Default

Quote:
Originally Posted by arbooker
Thanks, that's a really helpful comparison. I am perplexed by the difference between user and wall time that you keep observing, though. I ran a similar test of 100k semiprimes on my system and saw almost no difference (0.6 seconds out of about 70s total). How much memory does your test system have? Maybe the OS is never loading the whole file into memory, which will surely degrade performance. Does it do better if you replace mmap by malloc+read?
16GB. It's a Mac though, so goodness knows what stuff they've got going on.

What I see is that if you run it again, almost all of the time difference goes away. I'm guessing the entire file stays in your cache because you run lots of experiments with it. If I work on other things for a while and then run the same command again, it once more adds significant extra time; running again on the same dataset adds only 5% or so. It really depends on the client's access pattern.

Note for benchmark: The csrand(0) seeds the RNG so we get the same output every time. For each bit size the shasum matched for the three programs.

Last fiddled with by danaj on 2017-10-06 at 23:45
Old 2017-10-07, 10:30   #20
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)


It may be worth considering the QS implementation that is part of GGNFS. It has, at least, a very good reputation for factoring small semiprimes quickly. It would be nice to know whether we now have better code that should take its place in GGNFS.

Also interesting would be a plot of average time versus input size in bits for each algorithm (or combination of algorithms) and implementation, similar to how GGNFS and CADO were compared.
Old 2017-10-07, 10:50   #21
arbooker
 

Quote:
Originally Posted by danaj
What I see is that if you run it again almost all the time difference goes away.
OK, at least it's working. Do you see a significant difference when run for the first time with a long test (an hour, say)? I agree that for any problem whose running time is measured in minutes or seconds, this isn't the best approach. But that case isn't worth optimizing much anyway.
Old 2017-10-07, 21:45   #22
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

22×227 Posts
Default

Quote:
Originally Posted by henryzz
It may be worth considering the qs implementation that is part of GGNFS. It at least had a very good reputation for factoring small semiprimes quickly. It would be nice to know if we now have better code that should take its place in ggnfs.

Also interesting would be a plot of the average time for each algorithm/combination of algorithms/implementation for n bits similar to how ggnfs and cado were compared.
Perhaps someone could try stripping just this section out, since compiling the full GGNFS is an exercise in frustration (the issues are in code that isn't relevant here, such as blanczos64). GGNFS uses Jason's 62-bit SQUFOF, but during relation finding in clsieve.c, not for direct factoring.

I'm surprised about QS being used for 64-bit numbers. That seems really small -- most QS implementations won't work on such small numbers. yafu's smallmpqs.c does.


A comment on Pollard-Rho vs. SQUFOF. Pollard-Rho's inner loop is just two mulmods and two addmods (with a deferred gcd), so fast 64-bit mulmods are critical: with an asm mulmod it's a lot faster, and with a proper Montgomery mulmod (either asm or using the compiler's 128-bit type) faster yet. Both factor63 and MPU use a fast Montgomery mulmod, though MPU falls back to other methods on other platforms. SQUFOF's inner loop is more or less 4 multiplies, 2 divides, and a perfect-square check, all standard integer operations, so it is far less platform-dependent. x86_64 is rather ubiquitous these days, so that probably matters less than it did 10 years ago.


I ran the random-semiprime test from before, calling yafu -silent with a line "smallmpqs(...)" or similar for each number. It still does verbose logging to factor.log, and the generic yafu front end isn't optimized for such tiny inputs (whereas, for instance, GNU NT Factor and BSD factor have extremely low overhead). Different front-end design goals.

Code:
40-bit:
  75.90 real / 62.94 user    n
  58.90 real / 54.18 user    rho(n)
  40.19 real / 36.21 user    smallmpqs(n)
  37.02 real / 27.16 user    squfof(n)
  
 bits      factor63           MPU                yafu QS
 40       1.56 /  1.45       1.84 /  1.79      40.19 / 36.21
 55      17.10 / 14.47      18.65 / 18.12      70.93 / 65.09
 60      44.68 / 34.74      46.18 / 43.76     105.81 / 96.92
 63      84.48 / 57.57      73.25 / 71.14     105.09 / 99.66
 64      ----- / -----     101.36 / 99.61     107.12 /102.08
Now looking at SQUFOF only with Pollard-Rho-Brent to compare:
Code:
 bits   MPU pbrent       MPU SQUFOF         yafu SQUFOF
 40     1.69 /  1.67      2.43 /  2.40      37.02 / 27.16
 55    19.43 / 18.34     31.20 / 30.44      62.04 / 50.49
 60    45.62 / 44.11     86.31 / 84.92     103.87 / 91.31
 63    71.60 / 70.51    240.61 /236.82      --    / --
 64    85.67 / 84.43    326.82 /321.82      --    / --
At 40 bits and below, yafu's SQUFOF falls back to Lehman. It sure looks like that is a lot slower than the Pollard-Rho methods in factor63 and MPU, but it could all be overhead. My SQUFOF works on 63- and 64-bit inputs but is obviously suffering. For MPU I limited SQUFOF to 1M iterations, which means it fails to factor 4 of the 63-bit inputs (those take another minute-plus of iterations to succeed) and 12 of the 64-bit inputs.


This seems to be turning into an update of my old 2009 presentation, albeit limited to just 64-bit.

Last fiddled with by danaj on 2017-10-07 at 21:49


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.