mersenneforum.org

Old 2021-01-13, 15:17   #67
pvn

tdulcet,


Ah, this is very helpful. I spent a good bit of time yesterday doing something similar, essentially building a barebones version of this to build Docker images.



For Intel, I just built multiple binaries (for SSE, AVX, AVX2, and AVX-512) and use an entrypoint script to determine at runtime what hardware is available and run the right binary.
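A minimal sketch of such an entrypoint - the binary names and paths here are placeholders, not the ones actually in the image:
Code:
#!/bin/sh
# Pick the most capable binary the host CPU supports at container start.
flags=$(grep -m1 '^flags' /proc/cpuinfo)
case "$flags" in
    *avx512f*) exec /mlucas/Mlucas_avx512 "$@" ;;
    *avx2*)    exec /mlucas/Mlucas_avx2   "$@" ;;
    *avx*)     exec /mlucas/Mlucas_avx    "$@" ;;
    *)         exec /mlucas/Mlucas_sse2   "$@" ;;
esac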



I have had some trouble with the build on ARM, though, so for now I'm just using the precompiled binaries, with a similar routine in the entrypoint script to run the nosimd/c2simd binary as needed.


The Docker image is at pvnovarese/mlucas_v19:latest (it's a multi-arch image, with both aarch64 and x86_64).



The Dockerfile etc. can be found here: https://github.com/pvnovarese/mlucas_v19
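Pulling and running the published image is just the standard Docker two-step (the flags here are only an illustration; see the repo's README for the intended invocation):
Code:
docker pull pvnovarese/mlucas_v19:latest
docker run -d --name mlucas pvnovarese/mlucas_v19:latest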


I will review your script as well; it looks like you've thought a lot more about this than I have. :)
Old 2021-01-14, 00:09   #68
ewmayer

Quote:
Originally Posted by tdulcet View Post
My install script for Linux currently follows the recommended instructions on the Mlucas README for each architecture to hopefully provide the best performance for most users, but I would be interested in adding this feature to automatically try different combinations of CPU cores/threads and then pick the one with the best performance, although I am not sure what the correct procedure is to do this for each architecture and CPU or how the -DUSE_THREADS compile flag factors in. The script's goal is to automate the entire download, build, setup and run process for Mlucas, so I think this could be an important component of that. I have not received any feedback on the script so far, so I am also not sure if there is any interest in this feature or what percentage of systems it would affect.
-DUSE_THREADS is needed to enable multithreaded build mode; without it you get a single-threaded build, which would only be useful if all you ever wanted to do were to run one such 1-thread job per core. Even in that case, the core-affinity stuff (the -cpu argument) is not available for such builds, so you'd basically be stuck firing up a bunch of executable images, each from its own run directory (unique worktodo.ini file, copy of mlucas.cfg and primenet.py script to manage work for that directory) and hoping the OS does a good job managing the core affinities.

(Basically, there's just no good reason to omit the above flag anymore).
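For reference, a threaded x86_64 build along the lines of the README recipe looks something like this - the exact SIMD flags depend on your CPU; AVX2 is shown here purely as an illustration:
Code:
gcc -c -O3 -DUSE_THREADS -DUSE_AVX2 -mavx2 ../src/*.c
gcc -o Mlucas *.o -lm -lpthread -lrt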

Re. some kind of script to automate the self-testing using various suitable candidate -cpu arguments, that would indeed be useful. George uses the freeware hwloc library in his Prime95 code to suss out the topology of the machine running the code - I'd considered using it for my own code as well in the past, but had seen a few too many threads that boiled down to "hwloc doesn't work properly on my machine" and needed intervention by George re. that library, for my taste. In any event, let me think on it more; perhaps some playing-around with that library by those of you interested in this aspect would be a good starting point.
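For anyone who wants to start poking at the topology question from the shell rather than via the hwloc C API, something like this (purely illustrative; the hwloc-ls tool is only present if the hwloc utilities are installed) shows the logical-CPU-to-core mapping that the -cpu flag has to respect:
Code:
# Map logical CPUs to physical cores/sockets; on Intel the HT sibling of core N
# typically shows up as logical CPU N+cores, on AMD as logical CPU 2N+1.
lscpu -p=CPU,CORE,SOCKET | grep -v '^#'
# If the hwloc utilities are installed, this gives the same picture:
command -v hwloc-ls >/dev/null && hwloc-ls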
Old 2021-01-14, 13:21   #69
tdulcet

Quote:
Originally Posted by pvn View Post
Dockerfile etc can be found here: https://github.com/pvnovarese/mlucas_v19


I will review your script as well, it looks like you've thought a lot more about this than I have :)
Nice! Thanks. With my script you should be able to compile Mlucas on demand: since it uses a parallel Makefile with one job per CPU thread, it should only take a couple of minutes or less to compile on most systems. It uses the -march=native compile flag on x86 systems, so the resulting binaries should also be slightly faster, although they are generally not portable. What was the issue you had building on ARM?
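The parallel compile step amounts to something like this (assuming the generated Makefile honors a conventional CFLAGS variable):
Code:
# One compile job per CPU thread; -march=native targets the build machine itself,
# which is why the resulting binary generally is not portable to other CPUs.
make -j "$(nproc)" CFLAGS="-O3 -march=native -DUSE_THREADS"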

There is a longstanding issue on 32-bit ARM, where GCC hangs when compiling the mi64.c file. If you remove the -O3 optimization, you instead get these errors:
Quote:
../src/mi64.c: In function ‘mi64_shl_short’:
../src/mi64.c:1038:2: error: unknown register name ‘rsi’ in ‘asm’
__asm__ volatile (\
^~~~~~~
../src/mi64.c:1038:2: error: unknown register name ‘rcx’ in ‘asm’
../src/mi64.c:1038:2: error: unknown register name ‘rbx’ in ‘asm’
../src/mi64.c:1038:2: error: unknown register name ‘rax’ in ‘asm’
../src/mi64.c: In function ‘mi64_shrl_short’:
../src/mi64.c:1536:2: error: unknown register name ‘rsi’ in ‘asm’
__asm__ volatile (\
^~~~~~~
../src/mi64.c:1536:2: error: unknown register name ‘rcx’ in ‘asm’
../src/mi64.c:1536:2: error: unknown register name ‘rbx’ in ‘asm’
../src/mi64.c:1536:2: error: unknown register name ‘rax’ in ‘asm’
Quote:
Originally Posted by ewmayer View Post
(Basically, there's just no good reason to omit the above flag anymore).
OK, thanks for the info. That is what I thought. I just wanted to make sure that there was not some edge case where my script should omit the flag.

Quote:
Originally Posted by ewmayer View Post
Re. some kind of script to automate the self-testing using various suitable candidate -cpu arguments, that would indeed be useful. George uses the freeware hwloc library in his Prime95 code to suss out the topology of the machine running the code - I'd considered using it for my own as well in the past, but had seen a few too many threads that boiled down to "hwloc doesn't work properly on my machine" and needing some intervention re. that library by George for my taste. In any event, let me think on it more, and perhaps some playing-around with that library by those of you interested in this aspect would be a good starting point.
OK, I was just thinking that there was some procedure my script could use, given the CPU (Intel, AMD or ARM) and the number of CPU cores and CPU threads, to generate all the candidate combinations for the -cpu argument that could realistically give the best performance. It could then try the different candidate combinations (as described in the two examples of your previous post) and pick the one with the best performance.

Based on the "Advanced Users" and "Advanced Usage" sections of the Mlucas README, for an example 8 core/16 thread system, this is my best guess of the candidate combinations to try with the -cpu argument:

Intel
Code:
0     (1-threaded)
0:1     (2-threaded)
0:3     (4-threaded)
0:7     (8-threaded)
0:15     (16-threaded)
0,8     (2 threads per core, 1-threaded) (current default)
0:1,8:9     (2 threads per core, 2-threaded)
0:3,8:11     (2 threads per core, 4-threaded)

AMD

Code:
0     (1-threaded)
0:3:2     (2-threaded)
0:7:2     (4-threaded)
0:15:2     (8-threaded)
0:1     (2 threads per core, 1-threaded) (current default)
0:3     (2 threads per core, 2-threaded)
0:7     (2 threads per core, 4-threaded)
0:15     (2 threads per core, 8-threaded)

ARM
(8 core/8 thread)
Code:
0     (1-threaded)
0:3     (4-threaded) (current default)
0:7     (8-threaded)
I am not sure if these are all the combinations worth testing or if we could rule any of them out.
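Purely as a sketch of how the Intel list above could be generated mechanically (assuming the Intel-style numbering in which the hyperthread sibling of physical core N is logical CPU N+8):
Code:
#!/bin/bash
# Emit the 1-per-core and 2-per-core -cpu candidates for an 8c/16t Intel CPU.
cores=8
for t in 1 2 4 8; do
    if (( t == 1 )); then
        echo "0            (1-threaded)"
        echo "0,$cores          (2 threads per core, 1 core)"
    else
        echo "0:$((t-1))          (${t}-threaded, 1 per core)"
        echo "0:$((t-1)),$cores:$((cores+t-1))     (${t} cores, 2 threads per core)"
    fi
done
echo "0:15         (16-threaded, 2 per core)"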

Last fiddled with by tdulcet on 2021-01-14 at 13:22
Old 2021-01-16, 21:06   #70
ewmayer

@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/llvm buildability on Arm; the problem was first IDed on the new Apple M1 CPU but is more general), so alas, no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)

First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code. As the mods in that file are small I will attach it here; I suggest you save a copy of the old one so you can diff and see the changes for yourself. Briefly, a big chunk of x86_64 inline-asm needed extra wrapping inside a '#ifdef YES_ASM' preprocessor directive. That flag is def'd (or not) in mi64.h like so:
Code:
  #if(defined(CPU_IS_X86_64) && defined(COMPILER_TYPE_GCC) && (OS_BITS == 64))
	#define YES_ASM
  #endif
Re. your core/thread-combos-to-try on an example 8c/16t system, those look correct. The remaining trick, though, is figuring out which of the most promising c/t combos give the best total-throughput on the user's system. For example - sticking to just 1-thread-per-physical-core for the moment - we expect 1t to run roughly 2x slower than 2t. Say the ratio is 1.8, and the user has an 8-core system. The real question is, how does the total-throughput compare for 8x1t jobs versus 4x2t?
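To make that concrete with made-up numbers - say 1c1t comes in at 18.0 msec/iter and 1c2t at 10.0 msec/iter, i.e. the 1.8 ratio above - the comparison is just:
Code:
# Hypothetical timings; total throughput = jobs * 1000/msec_per_iter
awk 'BEGIN {
    printf "8 x 1t: %.0f iters/sec\n", 8*1000/18.0   # ~444 iters/sec
    printf "4 x 2t: %.0f iters/sec\n", 4*1000/10.0   #  400 iters/sec
}'
So with those made-up numbers the eight 1-thread jobs win on total throughput, even though each individual job runs slower.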

Similarly, we usually see a steep dropoff in parallel scaling beyond 4 cores - but that need not imply that running two 4-thread jobs is better than one 8-thread one. If said dropoff is due to the workload saturating the memory bandwidth, we might well see a similar performance hit with two 4-thread jobs.
Attached Files
File Type: bz2 mi64.c.bz2 (75.6 KB, 41 views)
ewmayer is offline   Reply With Quote
Old 2021-01-16, 22:23   #71
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

1162610 Posts
Default

Addendum: OK, I think the roadmap needs to look something like this - abbreviation-wise, 'c' refers to physical cores, 't' to threadcount:

1. Based on the user's HW topology, identify a set of 'most likely to succeed' core/thread combos, like tdulcet did in his above post. For x86 this needs to take into account the different core-numbering conventions used by Intel and AMD;

2. For each combo in [1], run the automated self-tests, and save the resulting mlucas.cfg file under a unique name, e.g. for 4c/8t call it mlucas.cfg.4c.8t;

3. The various cfg-files hold the best FFT-radix combo to use at each FFT length for the given c/t combo, i.e. in terms of maximizing total throughput on the user's system we can focus on just those. So let's take a hypothetical example: Say on my 8c/16t AMD processor the round of self-tests in [1] has shown that, using just 1c, 1c2t is 10% faster than 1c1t. We now need to see how 1c2t scales to all physical cores, across the various FFT lengths in the self-test. E.g. at FFT length 4096K, say the best radix combo found for 1c2t is 64,32,32,32 (note the product of those = 2048K rather than 4096K because, to match general GIMPS convention, "FFT length" refers to #doubles, but Mlucas uses an underlying complex FFT, so the individual radices are complex and refer to pairs-of-doubles). So we next want to fire up 8 separate 1c2t jobs at 4096K, each using that radix combo and running on a distinct physical core; thus our 8 jobs would use -cpu flags (I used AMD for my example to avoid the confusion Intel's convention would cause here) 0:1, 2:3, 4:5, 6:7, 8:9, 10:11, 12:13 and 14:15, respectively. I would further like to specify the foregoing radix combo via the -radset flag, but here we hit a small snag: at present, there is no way to specify an actual radix-combo. Instead one must find the target FFT length in the big case() table in get_fft_radices.c and match the desired radix-combo to a case-index. For 4096K, we see 64,32,32,32 maps to 'case 7', so we'd use -radset 7 for each of our 8 launch-at-same-time jobs. I may need to do some code-fiddling to make that less awkward.

Anyhow, since we're now using just 1 radix-combo at each FFT length and we want a decent timing sample not dominated by start-up init and thread-management overhead, we might use -iters 1000 for each of our 8 jobs. Launched at more-or-less the same time, they will have a range of msec/iter timings t0-t7, which we convert into total throughput in iters/sec via 1000*(1/t0+1/t1+1/t2+1/t3+1/t4+1/t5+1/t6+1/t7). Repeat for each FFT length of interest, generating a set of total-throughput numbers (a shell sketch of this step follows after item 4 below).

4. Repeat [3] for each c/t combo in [1]. It may well prove the case that a single c/t combo does not give best total throughput across all FFT lengths, but for a first cut it seems best to generate some kind of weighted average-across-all-FFT-lengths for each c/t combo and pick the best one. In [3] we generated total-throughput iters/sec numbers at each FFT length; maybe multiply each by its corresponding FFT length and sum over all FFT lengths.
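Here is a rough shell sketch of [3] for the 4096K example above - the binary path, log names and the exact 'msec/iter' output format are assumptions, not gospel:
Code:
#!/bin/bash
# Launch eight 1c2t timing runs at FFT length 4096K, one per physical core
# (AMD-style numbering), all using the same radix set (index 7 per the example).
cpus=(0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15)
for c in "${cpus[@]}"; do
    d="run_${c/:/_}"; mkdir -p "$d"
    ( cd "$d" && ../Mlucas -fftlen 4096 -iters 1000 -radset 7 -cpu "$c" > self_test.log 2>&1 ) &
done
wait
# Convert the per-job msec/iter timings t0..t7 into total throughput:
# iters/sec = 1000*(1/t0 + ... + 1/t7).
for d in run_*; do
    grep -ho 'msec/iter *= *[0-9.]*' "$d"/self_test.log "$d"/mlucas.cfg 2>/dev/null | tail -1
done | awk '{ tot += 1000/$NF } END { printf "total throughput: %.1f iters/sec\n", tot }'
Repeat the same launch for every FFT length of interest and for each c/t combo from [1], and you have the per-combo numbers that [4] needs to weight and compare.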

Last fiddled with by ewmayer on 2021-01-16 at 22:24
Old 2021-01-17, 16:05   #72
tdulcet

Quote:
Originally Posted by ewmayer View Post
@tdulcet: Extremely busy this past month working on a high-priority 'intermediate' v19.1 release (this will restore Clang/llvm buildability on Arm, problem was first IDed on the new Apple M1 CPU but is more general), alas no time to give the automation of best-total-throughput-finding the attention it deserves. But that's where folks like you come in. :)
No problem. I will look forward to your new v19.1 release.

Quote:
Originally Posted by ewmayer View Post
First off - the mi64.c compile issue has been fixed in the as-yet-unreleased 19.1 code, as the mods in that file are small I will attach it here
Thanks for the fix. This had been preventing me from running Mlucas on my Raspberry Pis for a couple of years, so it is great that it will now work.

Quote:
Originally Posted by ewmayer View Post
Re. your core/thread-combos-to-try on an example 8c/16t system, those look correct. The remaining trick, though, is figuring out which of the most promising c/t combos give the best total-throughput on the user's system. For example - sticking to just 1-thread-per-physical-core for the moment - we expect 1t to run roughly 2x slower that 2t. Say the ratio is 1.8, and the user has an 8-core system. The real question is, how does the total-throughput compare for 8x1t jobs versus 4x2t?
Yes, this will be difficult. I implemented a preliminary version that follows the instructions on the Mlucas README. Specifically, it will multiply the 4x2t msec/iter times by 1.5 before comparing them to the 8x1t times. Multiplying by 2 would of course produce different results in this case.

Quote:
Originally Posted by ewmayer View Post
Addendum: OK, I think the roadmap needs to look something like this
Wow, thanks for the detailed roadmap, it is very helpful!

1. OK, I wrote Bash code to automatically generate the combinations from my previous post above, for the user's CPU and number of CPU cores/threads. It will generate a nice table like one of these for example:
Code:
The CPU is Intel.
#  Workers/Runs  Threads         -cpu arguments
1  1             16, 2 per core  0:15
2  1             8, 1 per core   0:7
3  2             4, 1 per core   0:3 4:7
4  4             2, 1 per core   0:1 2:3 4:5 6:7
5  8             1, 1 per core   0 1 2 3 4 5 6 7
6  2             4, 2 per core   0:3,8:11 4:7,12:15
7  4             2, 2 per core   0:1,8:9 2:3,10:11 4:5,12:13 6:7,14:15
8  8             1, 2 per core   0,8 1,9 2,10 3,11 4,12 5,13 6,14 7,15

The CPU is AMD.
#  Workers/Runs  Threads         -cpu arguments
1  1             16, 2 per core  0:15
2  1             8, 1 per core   0:15:2
3  2             4, 1 per core   0:7:2 8:15:2
4  4             2, 1 per core   0:3:2 4:7:2 8:11:2 12:15:2
5  8             1, 1 per core   0 2 4 6 8 10 12 14
6  2             4, 2 per core   0:7 8:15
7  4             2, 2 per core   0:3 4:7 8:11 12:15
8  8             1, 2 per core   0:1 2:3 4:5 6:7 8:9 10:11 12:13 14:15

The CPU is ARM.
#  Workers/Runs  Threads  -cpu arguments
1  1             8        0:7
2  2             4        0:3 4:7
3  4             2        0:1 2:3 4:5 6:7
4  8             1        0 1 2 3 4 5 6 7
The combinations are the same as in my previous post above, except that I added a 2-threaded combination for ARM and the ordering is different.

2. Done.
3./4. Interesting, this is going to be a lot more complex to implement than I originally thought. The switch statement in get_fft_radices.c is too big to store in my script, and creating an awk command to extract the case number based on the FFT length and radix combo would obviously be extremely difficult, particularly because there are nested switch statements. I am going to have to think about how best to do this... I welcome suggestions from anyone who is reading this. In the meantime, I wrote code to directly compare the adjusted msec/iter times from the mlucas.cfg files from step #2 (a stripped-down sketch of that comparison follows after the tables below). This of course does not account for any of the scaling issues that @ewmayer described. It will generate two tables (the fastest combination and the rest of the combinations tested) like these for my 6 core/12 thread Intel system for example:
Code:
Fastest
#  Workers/Runs  Threads        First -cpu argument  Adjusted msec/iter times
6  6             1, 2 per core  0,6                  8.47  9.69  10.72  12.26  12.71  14.53  14.76  16.54  16.1  18.89  20.94  23.94  26.39  28.85  29.16  32.98

Mean/Average faster     #  Workers/Runs  Threads         First -cpu argument  Adjusted msec/iter times
3.248 ± 0.101 (324.8%)  1  1             12, 2 per core  0:11                 28.92  31.74  33.78  38.64  42.66  44.52  46.56  51.06  52.26  61.8  70.14  79.38  88.62  92.94  97.26  106.2
3.627 ± 0.146 (362.7%)  2  1             6, 1 per core   0:5                  34.14  34.8  39.66  45.48  47.28  51.18  51.3  56.64  60.78  66.24  73.02  87.78  96.12  102.6  108.12  116.22
2.607 ± 0.068 (260.7%)  3  2             3, 1 per core   0:2                  22.98  25.53  27.3  30.12  33.63  37.83  36.72  42.15  42.69  48.9  54.66  61.68  71.19  76.44  77.49  87.36
1.736 ± 0.029 (173.6%)  4  6             1, 1 per core   0                    14.41  17.1  18.46  20.88  22.72  24.99  25.36  28.38  28.82  32.67  36.06  40.67  46.12  49.87  51.32  58.26
1.816 ± 0.047 (181.6%)  5  2             3, 2 per core   0:2,6:8              16.11  18.09  19.32  21.99  23.64  25.41  25.92  29.85  30.57  33.78  38.7  42.42  48.57  51.12  52.53  59.19
The two tables show, for example, that 1-threaded with 2 threads per core is ~1.7 times faster than 1-threaded with 1 thread per core.
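A stripped-down version of that comparison might look like the following - the mlucas.cfg field layout ('<fft>  msec/iter = <time> ...') and the cfg_throughput.sh name are just assumptions for illustration:
Code:
#!/bin/bash
# Usage (hypothetical): ./cfg_throughput.sh mlucas.cfg.1c.2t 6
# Prints per-FFT-length throughput for one c/t combo, scaled by the number of
# identical workers that combo would run across the whole machine.
cfg=$1; workers=$2
awk -v w="$workers" '
    /msec\/iter/ {
        for (i = 1; i < NF; i++)
            if ($i ~ /msec\/iter/) { printf "FFT %6sK: %8.3f iters/sec\n", $1, w * 1000 / $(i+2); break }
    }' "$cfg"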
Old 2021-01-18, 20:49   #73
ewmayer

@tdulcet - How about I add support in v19.1 for the -radset flag to take either an index into the big table, or an actual set of comma-separated FFT radices? Shouldn't be difficult - if the expected -radset[whitespace]numeric arg pair is immediately followed by a comma, the code assumes it's a set of radices, reads those in, checks that said set is supported and if so runs with it.

I expect - fingers crossed, still plenty of work to do - to be able to release v19.1 around EOM, so you'd have to wait a little while, but it sounds like this is the way to go.

Edit: Why make people wait - here is a modified version of Mlucas.c which supports the above-described -radset argument. You should be able to drop it into your current v19 source archive, but I suggest you save the old Mlucas.c under a different name - maybe add a '.bak' - so you can diff the 2 versions to see the changes, the first and most obvious of which is the version number, now bumped up to 19.1.

Note that a user-supplied radix set is considered "advanced usage" in the sense that I assume users of it know what they are doing, though I have included a set of sanity checks on inputs. Most important is to understand the difference in FFT-length conventions between the -fftlen and -radset args: -fftlen supplies a real-vector FFT length in Kdoubles; -radset [comma-separated list of radices] specifies a corresponding set of complex-FFT radices. If the user has supplied a real-FFT length (in Kdoubles) via -fftlen, the product of the complex-FFT radices (call it 'rad_prod') must correspond to half that value, accounting for the Kdoubles scaling of the former. In C-code terms, we require that (rad_prod>>9) == fftlen.

Note that even though this is strictly-speaking redundant, the -fftlen arg is required even if the user supplies an actual radix set; this is for purposes of sanity-checking the latter, because the above-described differing conventions make it easy to get confused. Using any of the radix sets listed in the mlucas.cfg file along with the corresponding FFT length is of course guaranteed to be OK.

Examples: After building the attached Mlucas.c file and relinking, try running the resulting binary with the following sets of command-line arguments to see what happens:

-iters 100 -fftlen 1664 -radset 0
-iters 100 -fftlen 1664 -radset 208,16,16,16
-iters 100 -fftlen 1668 -radset 208,16,16,16
-iters 100 -fftlen 1664 -radset 207,16,16,16
-iters 100 -fftlen 1664 -radset 208,8,8,8,8
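As a quick consistency check on the second and third of those, the only arithmetic involved is the (rad_prod>>9) == fftlen rule above:
Code:
# Product of the complex radices, shifted down by 9 to convert to real-FFT Kdoubles:
echo $(( (208*16*16*16) >> 9 ))   # prints 1664 -> consistent with -fftlen 1664, not with 1668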

Last fiddled with by ewmayer on 2021-02-13 at 20:06 Reason: Deleted attachment; code now part of v19.1
Old 2021-01-19, 13:46   #74
tdulcet

Quote:
Originally Posted by ewmayer View Post
@tdulcet - How about I add support in v19.1 for the -radset flag to take either an index into the big table, or an actual set of comma-separated FFT radices?
That would be very helpful to automate this!

Quote:
Originally Posted by ewmayer View Post
Edit: Why make people wait - here is a modified version of Mlucas.c which supports the above-described -radset argument.
Wow, thanks for doing it so quickly! This will be very helpful. I committed and pushed the changes I described in my previous post to GitHub here, which basically implements steps #1, #2 and part of #4. I will now get started on step 3 and the rest of 4 using your new version of Mlucas.c.

In my previous post on an example 8c/16t system, I said it will multiply the 4x2t msec/iter times by 1.5 before comparing them to the 8x1t times, following the instructions on the Mlucas README. After doing more testing, I was getting unexpected results with that formula ((CPU cores / workers) - 0.5), so it will now multiply the times by (CPU cores / workers), i.e. by 2 for this example. This should be irrelevant once I implement step 3.

I thought I should note that some systems like the Intel Xeon Phi can have more than two CPU threads per CPU core. The Mlucas README does not mention this case, but my script should correctly handle it for Intel and AMD x86 systems. For example, on a 64 core/256 thread Intel Xeon Phi system it would try these combinations (only showing the first -cpu argument for brevity):
Code:
#   Workers/Runs  Threads          -cpu arguments
1   1             64, 1 per core   0:63
2   2             32, 1 per core   0:31
3   4             16, 1 per core   0:15
4   8             8, 1 per core    0:7
5   16            4, 1 per core    0:3
6   32            2, 1 per core    0:1
7   64            1, 1 per core    0
8   1             128, 2 per core  0:63,64:127
9   2             64, 2 per core   0:31,64:95
10  4             32, 2 per core   0:15,64:79
11  8             16, 2 per core   0:7,64:71
12  16            8, 2 per core    0:3,64:67
13  32            4, 2 per core    0:1,64:65
14  64            2, 2 per core    0,64
15  1             256, 4 per core  0:63,64:127,128:191,192:255
16  2             128, 4 per core  0:31,64:95,128:159,192:223
17  4             64, 4 per core   0:15,64:79,128:143,192:207
18  8             32, 4 per core   0:7,64:71,128:135,192:199
19  16            16, 4 per core   0:3,64:67,128:131,192:195
20  32            8, 4 per core    0:1,64:65,128:129,192:193
21  64            4, 4 per core    0,64,128,192

Last fiddled with by tdulcet on 2021-01-19 at 13:52
Old 2021-01-19, 20:19   #75
ewmayer

@tdulcet: Glad to be of service to someone else who wants to be of service, or something. :)

o Re. KNL, yes I have a barebones one sitting next to me and running a big 64M-FFT primality test, 1 thread on each of physical cores 0:63. On KNL I've never found any advantage from running this kind of code with more than 1 thread per physical core.

o One of your timing samples above mentioned getting nearly a 2x speedup from running 2 threads on 1 physical core, with the other cores unused. I suspect that may be the OS actually putting 1 thread on each of 2 physical cores. Remember, those pthread affinity settings are treated as *hints* to the OS; we hope that under heavy load the OS will respect them, because there are then no otherwise-idle physical cores it can bounce threads to.

o You mentioned the mi64.c missing-x86-preprocessor-flag-wrapper was keeping you from building on your Raspberry Pi - was that even with -O3? And did you as a result just use the precompiled Arm/Linux binaries on that machine?
Old 2021-01-20, 07:46   #76
joniano
Possible Symptoms of a Bug on ARM64 Build - Running too fast

Hello Folks - I recently got Mlucas running on a Raspberry Pi 4, 8GB of RAM, running Ubuntu and I am doing PRP checks on large primes.

I'm assuming either Mlucas is extremely fast and consistent or I'm running into some sort of a bug.

If you look at a few lines of the ".stat" file for one of my recent exponents, you'll see that every few seconds I blast through 10,000 iterations at exactly the same msec/iter speed, and it seems to take under a day to fully PRP test a new number.

Code:
[2021-01-19 21:42:45] M110899639 Iter# = 110780000 [99.89% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:48] M110899639 Iter# = 110790000 [99.90% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:50] M110899639 Iter# = 110800000 [99.91% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:52] M110899639 Iter# = 110810000 [99.92% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:55] M110899639 Iter# = 110820000 [99.93% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:57] M110899639 Iter# = 110830000 [99.94% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:42:59] M110899639 Iter# = 110840000 [99.95% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:01] M110899639 Iter# = 110850000 [99.96% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:04] M110899639 Iter# = 110860000 [99.96% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:06] M110899639 Iter# = 110870000 [99.97% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:08] M110899639 Iter# = 110880000 [99.98% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:10] M110899639 Iter# = 110890000 [99.99% complete] clocks = 00:15:20.953 [ 92.0954 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
[2021-01-19 21:43:13] M110899639 Iter# = 110899639 [100.00% complete] clocks = 00:15:20.953 [ 95.5445 msec/iter] Res64: 461E323B49699D73. AvgMaxErr = 0.000000000. MaxErr = 0.000000000. Residue shift count = 13555775.
M110899639 is not prime. Res64: 243C3E785D7D8345. Program: E19.0. Final residue shift count = 13555775
M110899639 mod 2^35 - 1 =          20387533375
M110899639 mod 2^36 - 1 =          12983321457
Does this look suspicious to anyone else?

I also run Prime95 on a seemingly much more powerful Core i7-7700 and that is taking about 14 days to PRP-test a single number, which is what is making me question this.
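One quick way to quantify how suspicious it is from the shell - the p<exponent>.stat file name here is just a guess at my setup:
Code:
# Count how often each interim Res64 value appears; on a healthy run every
# 10000-iteration checkpoint line should show a different residue.
grep -o 'Res64: [0-9A-F]*' p110899639.stat | sort | uniq -c | sort -rn | head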

I'm glad to provide more detail if it would help troubleshoot.
Old 2021-01-20, 09:28   #77
LaurV

Quote:
Originally Posted by joniano View Post
Does this look suspicious to anyone else?
Yep. Very. The residues are the same, which is close to impossible - like one in k chances, where k is much larger than the number of particles in the universe.
Unfortunately I can't help - I'm neither a Linux nor an Mlucas guy.