mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Ernst's Mfactor program (https://www.mersenneforum.org/showthread.php?t=25009)

kriesel 2019-12-06 20:34

Ernst's Mfactor program
 
This thread is being started as a place to put public discussion of how to build, use, or improve the Mfactor program whose source is included with the source of Mlucas. I'm imagining this thread would encompass general considerations, and cpu-specific considerations, and that another thread, probably in the gpu-computing subforum, would address gpu-specific considerations (and include an early cross reference link to this thread).
After things shake out a bit I'll probably condense some of it into a separate reference thread. Or two.

ewmayer 2019-12-06 23:00

Ken, thanks for the thread - I will post "current status of the code" note here once I've done a few builds with various compile flags to shake out any issues in the current codebase. So, using the current Mlucas v19 codebase, let's get started:

o You need to renamed your local factor.c.txt file back to the buildable factor.c, and disable the little chunk of test code starting at line 1908 of that file by changing the #if 1 to #if 0;

o I already found 1 small code-fiddle that is needed, to un-break the "factor found" probable-prime testing call: line 5230 of mi64.c needs to be changed to
[i]
if(q) ASSERT(HERE, lenQ != 0x0, "If quotient requested, quotient-length pointer must be provided!");
[/i]
That allowed me to build and run my standard quicktest, which reproduces one of the factors of mm31 - I successfully used both gcc and clang on my Macbook, the run needs around 2 min, this again is single-threaded:
[i]
cd ~/mlucas/src/obj_mfac
gcc -c -Os ../get*c && rm get_preferred_fft_radix.o
gcc -c -Os ../imul_macro.c ../mi64.c ../qfloat.c ../rng_isaac.c ../two*c ../types.c ../util.c
gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 ../factor.c
gcc -o Mfactor *o -lm
time ./Mfactor -m 2147483647 -bmax 68 -passmin 15
[/i]
Here, -DTRYQ=4 activates the 4-way pure-integer modexp codepath, for which there is optimized x86_64 inline-asm. Try it yourselves!

Playing with the multithreaded build option now ... more on that, the various SIMD/floatig-point-mod and nVidia GPU build options later.

kriesel 2019-12-07 01:53

An update in factor.c, so obvious, even I can see it::[CODE]/*...Known Mersenne prime exponents. This array must be null-terminated. */
const uint32 knowns[] = {2,3,5,7,13,17,19,31,61,89,107,127,521,607,1279,2203,2281,3217,4253,4423,9689,9941
,11213,19937,21701,23209,44497,86243,110503,132049,216091,756839,859433,1257787,1398269,2976221,3021377,6972593
,13466917,20996011,24036583,25964951,30402457,32582657,37156667,42643801,43112609,57885161[B],74207281,77232917[/B]
[B] ,82589933[/B],0x0};
[/CODE]

ewmayer 2019-12-07 19:48

@Ken: Thanks for the reminder, that array is used to check for valid exponents for the double-Mersenne TFing option, which I'll describe later, when I get around to detailing how to build for TFing beyond 96 bits.
---------------

Note re. the sample quick-test case: The default build mode (at least for non-GPU builds) divides the TFing into 16 separate passes (numbered 0-15), each of which uses candidates from one of the 16 permissible (mod 60) residue classes based on the particular Mersenne exponent used. For the mm31 testcase (p = 2147483647) the factor in question happens to be found on the last of the 16 paases, so rather than do all 16 passes (-passmin 0, the -passmax 15 can be omitted if one wants to do all passes from the passmin argument through 15), or some subrange not necessarily ending in pass 15 (-passmin [i]low[/i] -passmin [i]high[/i]), we simply start on pass 15, which is the last-possible one.

Here is the actual-TFing part of the output for the 64-bit-int-mul-based build, on 1 core of my 2GHz Core2Duo Macbook - one can use the count of trial divides to compute a decent approximate cycle count per candidate, lumping the sieving work together wih the modexp computations:

Pass = 15............................
Factor with k = 56474845800. This factor is a probable prime.
.....
M(2147483647) has 1 factors in range k = [0, 68726898240], passes 15-15
Performed 171253887 trial divides
User time: 128 sec

Next, the code also supports a floating-double-based modexp build option, which allows factor candidates up to 78 bits, and splits each such into three 26-bit pieces. The only reason one would use this build option in scalar-double (non-SIMD) mode is if one's target hardware had dismally slow integer multiply; the scalar-double code was really developed to lead the way to SIMD-based computation, since in the past 15 years the major hardware manufacturers have put disproportionately more silicon into their SIMD FPUs. For example, take 64-bit integer multiply - x86_64 is very fast at this, relative to its double-mul capability. In the SIMD, on the other hand, none of the Intel/AMD offerings has an instruction to compute the high 64 bits of a 64x64-bit double-wide integer product. In the latest-greatest avx-512 hardware Intel added an instruction to compute the upper 52 bits of such a product, by leveraging the already-present FMA capability - more on this wide-integer-product trick using doubles and FMA in a later post.

So I'll jump right to the SSE2 modpow, which is double-float based, with available TRYQ (factor-at-a-time-to-try) counts 2,4,8 ... larger values have better instruction latency hiding, but note TRYQ=8 is only available on SSE2 systems which support the ROUND instruction, i.e. starting with the last generation of the Core2 architecture ... my Core2Duo Macbook does not support ROUND and thus throws an Illegal instruction exception when built with TRYQ=8. (For AVX and later this is not an issue, but those code paths are discussed further on). This needs rebuild of 2 files, the setup here is a little dumb, it needs both USE_FLOAT and USE_SSE2 to be defined. As I noted, TRYQ=4 was the fastest viable option on my Core2:

clang -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DUSE_FLOAT -DUSE_SSE2 ../factor.c ../twopmodq80.c

On my Core2Duo Macbook the above build with TRYQ=2 is 2.5x slower that the pure-64-bit-int-based one, and TRYQ=4 is ~50% slower:

TRYQ=2: 317 sec
TRYQ=4: 203 sec

In order to test out TRYQ=8 I could do an SSE2-enabled build on my Haswell, but there seems little point since there we have avx and avx2 to play with. Based on the above trend, SSE2 with TRYQ=8 looks like it might be close to speed-competitive with the 64-bit-int build, but I doubt it would be faster. More on AVX/AVX2 in my next post.

kriesel 2019-12-07 22:24

test hardware
 
I have a motley assortment of old gear available for test including plain pentium, Pentium MMX, Pentium II, Pentium III, some old 32-bit laptops, twin systems containing Core2 Duo E8200 (Vista and Win 7 Pro) which is the rough minimum to be worth the electrical heating, and a few flavors of dual-Xeons, a mix of Windows 7 & 10, also i3-370M, i7500U, and i-8750H. Only very few of this assortment have an msys2/mingw build environment. A Xeon E5645 Win7 system does (mostly for gpuowl builds).
It's hard to tell what Mlucas or Mfactor flags correspond to what processor models.

kriesel 2019-12-07 22:40

some MFactor past history threads
 
Thread re Ernst's Mfactor program, primarily if not entirely cpu oriented, begins late 2005:
[URL="https://www.mersenneforum.org/showthread.php?t=4939&highlight=mfactor"]https://www.mersenneforum.org/showthread.php?t=4939[/URL]

Another thread, re getting set up for CUDA gpu MFactor compiles on linux, begins late 2014, terminating in a batch of questions Ernst had about efficient gpu implementation:
[URL]https://www.mersenneforum.org/showthread.php?t=19457[/URL]
Since that's from very early 2015, nearly 5 years ago, some of them may no longer be relevant. It's probably best to leave it alone for now, until Ernst has a chance to review/revise the list. [URL]https://www.mersenneforum.org/showpost.php?p=392130&postcount=67[/URL]


Are there more?

ATH 2019-12-08 04:49

I noticed that Mfactor from MLucas 19.1 says p<2[SUP]50[/SUP] while Mfactor from Mlucas 14.1 can do up to 2kp+1 < 2[SUP]128[/SUP], I found a factor of a 98.6 bit exponent.

I compiled Mfactor 19.1 i Windows with MSYS2 but got some errors when trying to run your test:

[CODE]gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 get*c imul_macro.c mi64.c qfloat.c rng_isaac.c two*c types.c util.c factor.c
rm get_preferred_fft_radix.o
gcc -o Mfactor *o -lm

mfactor -m 2147483647 -kmin 1 -kmax 68726898240 -passmin 15 -passmax 15

TRYQ = 4, max sieving prime = 1299721
Time to set up sieve = 00:00:00.025
pass = 15............................ERROR: at line 5230 of file mi64.c
Assertion failed: Either both or neither of quotient-array and quotient-length pointers must be provided!
[/CODE]

[CODE]gcc -c -Os -DFACTOR_STANDALONE -DTRYQ=4 -DUSE_FLOAT -DUSE_SSE2 get*c imul_macro.c mi64.c qfloat.c rng_isaac.c two*c types.c util.c factor.c
rm get_preferred_fft_radix.o
gcc -o Mfactor *o -lm

mfactor -m 2147483647 -kmin 1 -kmax 68726898240 -passmin 15 -passmax 15

Testing 63-bit factors...
Testing 64-bit factors...
Testing 65-bit factors...
Testing 96-bit factors...
ERROR: at line 1366 of file factor_test.h
Assertion failed: Approx and exactly-computed k differ!
[/CODE]

ewmayer 2019-12-08 20:13

@Andreas:
[QUOTE=ATH;532343]I noticed that Mfactor from MLucas 19.1 says p<2[SUP]50[/SUP] while Mfactor from Mlucas 14.1 can do up to 2kp+1 < 2[SUP]128[/SUP], I found a factor of a 98.6 bit exponent.[/quote]
Where are you getting that 50-bit limit ... can you copy the warnig/assertion here?
Re. 1st error, see my code-fiddle notes in post #2 for the needed 1-line fix in mi64.c.
Re. 2nd error, don't see that in my sse2 builds ... if you have gdb, recompile factor.c and util.c with -O1 -g3 -ggdb, set a breakpoint at util.c:100 and examine the variable values involved in the failure.

@Ken: Thanks for the GPU-build thread links ... my file with GPU build notes and instructions was victim of my Macbook HD-failure a few years back (I had all the main Mlucas-dev stuff backed up in various places, but a few valuable "notes" files got lost), am currently trying nvcc-compiles of a few source files, those old links may be helpful.

kriesel 2019-12-08 23:35

[QUOTE=ewmayer;532380]@Andreas:

Where are you getting that 50-bit limit ... can you copy the warnig/assertion here?
Re. 1st error, see my code-fiddle notes in post #2 for the needed 1-line fix in mi64.c.
Re. 2nd error, don't see that in my sse2 builds ... if you have gdb, recompile factor.c and util.c with -O1 -g3 -ggdb, set a breakpoint at util.c:100 and examine the variable values involved in the failure.

@Ken: Thanks for the GPU-build thread links ... my file with GPU build notes and instructions was victim of my Macbook HD-failure a few years back (I had all the main Mlucas-dev stuff backed up in various places, but a few valuable "notes" files got lost), am currently trying nvcc-compiles of a few source files, those old links may be helpful.[/QUOTE]I also had 50 bit exponent limit and 96 bit factor, in the available software tabulation, and could not figure out where I got it from. I gather from reading a bit of your Mfactor code, that the values will vary depending on the build defines and target hardware. I feel a table coming on eventually.
Re your build notes/instructions, ouch. Those, and your time, are valuable enough that I suggest including such documentation in a redundant backup scheme. Someone I used to supervise ran lots of servers and desktops but some important things went in his spiral bound paper notebook. Documentation is not useful if the server or drive it lives on is down!

ewmayer 2019-12-09 02:48

[QUOTE=kriesel;532391]I also had 50 bit exponent limit and 96 bit factor, in the available software tabulation, and could not figure out where I got it from.[/QUOTE]

Should've looked in the (to me) obvious place first:
[i]
MacBook:src ewmayer$ grep 50 Mdata.h
128: #define MAX_BITS_P 50
[/i]
There are smaller limits in at least one build type: CUDA/GPU builds require p < 2^32.

kriesel 2019-12-09 08:57

1 Attachment(s)
Here's a limits table draft . I vaguely recall the CUDA factor limit being something like max k = 64 bits, so max factor candidate q would be 64+pbits+1.


All times are UTC. The time now is 02:42.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.