mersenneforum.org Alternative LLR tester implementation

2020-03-05, 10:48   #1
kuratkull

Mar 2007
Estonia

Alternative LLR tester implementation

Hi people!

I implemented an end-to-end Riesel Prime tester.
https://github.com/dkull/rpt

Since I wrote the README with you here in mind, I will just copy paste it from the project.

I have only built it on 64-bit Linux, but other platforms will probably work. I have attached the compiled Linux executable to this post.

Any discussion on the topic of these kinds of tools, methods, algorithms is welcome.

RPT - Riesel Prime Tester

Experimental Lucas-Lehmer-Riesel primality tester.

Note

This project is in no way 'production ready': it does no error checking and does not save state. I wouldn't recommend swapping out your LLR64 for this (yet). Although I have noticed that if a CPU is stable with LLR (for a while), it should be okay to run it without error checking... am I right about this?
Also, it is important to note that this software and Jean Penné's software (LLR64, http://jpenne.free.fr/index2.html) do (almost) exactly the same thing, using the same libraries, so there are no performance gains to be had - we both rely on the speed of the GWNum library. LLR64 is probably faster in many cases (possibly due to using C/GCC or some other magic), even though it does error checking. But I have noticed that RPT is a smidge faster when using larger k's in threaded mode.

What this project is

This is me being interested in how Lucas-Lehmer-Riesel primality proving works - from end to end. I first ran Jean's LLR64 software in 2007 and found my first primes that made it into the TOP5000 (https://primes.utm.edu/primes/lists/all.txt). I stopped for over a decade, but the topic always lingered in my mind. In 2019 I started sieving/testing again, curiosity got the best of me, and I decided to implement most of what LLR64 does with Riesel primes.
The core LLR loop is actually trivial and can be implemented in no time. Much of the complexity comes from needing to find U0 for k > 1. E.g. for Mersenne primes (k=1), U0 == 4; for k > 1, U0 needs to be calculated, and naive implementations are slow for large k's. I have three different implementations (naive, less naive, and optimal) in this project. The optimal one is the same one used in LLR64, which needs only O(bitlen(k)) modular multiplications.
This project will probably also implement the PRP primality testing used in PFGW (https://sourceforge.net/projects/openpfgw/).
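To make the "naive implementations are slow for large k's" point concrete: the straightforward way to get V_k walks the Lucas V-sequence one step at a time, costing k-1 modular multiplications, whereas the ladder method needs only about bitlen(k) squarings. A minimal Python sketch of the naive method (the function name is mine, not RPT's):

```python
def v_k_naive(p, k, N):
    """Naive O(k) evaluation of the Lucas sequence V_j(p) mod N:
    V_0 = 2, V_1 = p, V_{j+1} = p*V_j - V_{j-1}."""
    prev, cur = 2 % N, p % N
    for _ in range(k - 1):
        prev, cur = cur, (p * cur - prev) % N
    return cur

# V_5(4) = 4^5 - 5*4^3 + 5*4 = 724
print(v_k_naive(4, 5, 10**9))  # -> 724
```

A k around 39 million would need ~39 million of these steps, versus roughly 25 squarings for the ladder - which is why the optimal method matters.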

What this project is not

This is not an attempt to replace LLR64. LLR64 has a lot of years of work behind it, in both features and stability/safety; this project is no match for that. This project does not aim to provide factoring, trivial candidate elimination, resumability, safety, support for different prime formats, etc.

Building

Requires the Zig compiler, plus GWNum and GMP in the project directory. Both library dependencies need to be built first in their respective ways; they are not complicated to build. The Zig compiler can be downloaded in binary form from https://ziglang.org/download/

Running

$ ./rpt <k> <n> [threads]

$ ./rpt 39547695 506636 4
=== RPT - Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 39547695*2^506636-1 [152521 digits] on 4 threads
step 1. find U0 ...
found V1 [11] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 66ms
step 2. llr test ...
9%.19%.29%.39%.49%.59%.69%.78%.88%.98%.
llr took 73827ms
#> 39547695*2^506636-1 [152521 digits] IS PRIME

Pseudocode of the whole thing

Code:
const k, n, b, c;  # must be set; for Riesel numbers b = 2, c = -1
const N = k * b ^ n - 1;

# find V1 using Jacobi symbols: the smallest v1 such that
# jacobi(v1 - 2, N) == 1 and jacobi(v1 + 2, N) == -1
while (v1 := 1) : () : (v1 += 1) {
    if jacobi(v1 - 2, N) == 1 && jacobi(v1 + 2, N) == -1 {
        break
    }
}

# fastest method [ O(bitlen(k)) multiplications ] to find U0
# this computes V_k(v1) of the Lucas sequence with a left-to-right ladder
x = v1
y = (v1 * v1) - 2
k_bitlen = bitlen(k)
while (i := k_bitlen - 2) : (i > 0) : (i -= 1) {
    if k.bits[i] == 1 {
        x = ((x * y) - v1) mod N
        y = ((y * y) - 2) mod N
    } else {
        y = ((x * y) - v1) mod N
        x = ((x * x) - 2) mod N
    }
}
# final step handles bit 0, which is always 1 since k is odd
x = (x * y) - v1
u = u0 = x mod N

# Lucas-Lehmer-Riesel primality test starting from U0
while (i := 1) : (i < n - 1) : (i += 1) {
    u = ((u * u) - 2) mod N
}
'prime!' if u == 0
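For anyone who wants to experiment without GWNum, here is a pure-Python rendering of the same algorithm, using plain big-integer arithmetic (so it is only practical for small n; the function names are mine, not RPT's):

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, via quadratic reciprocity."""
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:           # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                 # reciprocity step
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def find_v1(N):
    """Smallest v1 with jacobi(v1-2, N) == 1 and jacobi(v1+2, N) == -1."""
    v1 = 1
    while not (jacobi(v1 - 2, N) == 1 and jacobi(v1 + 2, N) == -1):
        v1 += 1
    return v1

def lucas_u0(v1, k, N):
    """V_k(v1) mod N by a left-to-right Lucas ladder; needs odd k > 1."""
    x, y = v1 % N, (v1 * v1 - 2) % N           # (V_1, V_2)
    for i in range(k.bit_length() - 2, 0, -1):
        if (k >> i) & 1:
            x, y = (x * y - v1) % N, (y * y - 2) % N
        else:
            x, y = (x * x - 2) % N, (x * y - v1) % N
    return (x * y - v1) % N                    # bit 0 is 1 since k is odd

def llr_is_prime(k, n):
    """True iff N = k*2^n - 1 is prime (requires odd k with 1 < k < 2^n)."""
    N = k * (1 << n) - 1
    u = lucas_u0(find_v1(N), k, N)
    for _ in range(n - 2):
        u = (u * u - 2) % N
    return u == 0

print(llr_is_prime(3, 4))   # 3*2^4-1 = 47, prime -> True
print(llr_is_prime(5, 5))   # 5*2^5-1 = 159 = 3*53 -> False
```

The real implementations replace the `(u * u - 2) % N` step with GWNum's FFT-based squaring, which is where all the speed comes from.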
Attached Files
 rpt.tar.gz (1.55 MB)


2020-03-05, 14:41   #2
kuratkull

Mar 2007
Estonia

There is already much newer information on the Github page - for example detailed build instructions and some limitations and other tweaks to the software.
2020-03-06, 16:40   #3
Thomas11

Feb 2003

Looks quite interesting! I did some tests. Single thread performance is indeed almost identical to LLR64. However, multithreading doesn't seem to work as expected:

single-threaded:
Code:
time ./rpt 3 234760 1
=== RPT - Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 3*2^234760-1 [70671 digits] on 1 threads
step 1. find U0 ...
found V1 [3] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 0ms
step 2. LLR test ...
X
LLR took 11946ms
#> 3*2^234760-1 [70671 digits] IS PRIME

real    0m12,006s
user    0m12,932s
sys     0m0,000s
4 threads:
Code:
time ./rpt 3 234760 4
=== RPT - Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 3*2^234760-1 [70671 digits] on 4 threads
step 1. find U0 ...
found V1 [3] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 1ms
step 2. LLR test ...
X
LLR took 26103ms
#> 3*2^234760-1 [70671 digits] IS PRIME

real    0m26,161s
user    1m6,958s
sys     0m18,038s
2020-03-07, 00:39   #4
kuratkull

Mar 2007
Estonia

Yes, I also tried it on larger N's today and it seems that the current version lags behind more and more at larger N's compared to LLR64. I'm not sure why, because during development I saw faster speeds than LLR64 in some cases. The weird thing is that the core loop is basically just calling the square function in GWNum - this function call takes 99.99% of the time in the loop. I am looking into it with the goal of at least matching the speed of LLR64, because it should be doable, given that the core loop of LLR64 is exactly the same.

Though you might be surprised to see that LLR64 is also slower with smaller N's in threaded mode (I mean the same smallish N [for example 1*2^859433-1] is slower multithreaded than single-threaded, both with RPT and LLR64). I just discovered this today when benchmarking the slowness issue. This is probably due to small N's losing too much CPU time to context switching. Larger N's should benefit from multithreading, while small N's don't.

PS! By N I mean the whole number k*2^n-1 (its size determines the digit count).
2020-03-07, 06:36   #5
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

In general, LLR does not benefit from splitting an FFT into threads such that the FFT size divided by the threadcount is smaller than 128K. That is, a 256K FFT size will run 2-threaded quite nicely. FFTs below 200K in size are not efficient 2-threaded. I've run a ~400K FFT 3-threaded with reasonable success.
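VBCurtis's rule of thumb (keep at least ~128K of FFT per thread) can be expressed as a tiny helper - purely illustrative, not part of RPT or LLR64, and the name and core count default are mine:

```python
def max_useful_threads(fft_size_k, cores=4):
    """Heuristic from this thread: don't split an FFT so that each
    thread gets less than ~128K of it; never exceed the core count."""
    return max(1, min(cores, fft_size_k // 128))

print(max_useful_threads(256))  # -> 2 (256K splits nicely in two)
print(max_useful_threads(384))  # -> 3 (the ~400K, 3-threaded case)
print(max_useful_threads(100))  # -> 1 (too small to split at all)
```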
2020-03-07, 11:54   #6
kuratkull

Mar 2007
Estonia

(running all this on a Skylake i7)

Using zero-padded FMA3 FFT length 384K for both:
4 threads: 39547695*2^3664022-1 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec.
3 threads: 39547695*2^3664034-1 is not prime. LLR Res64: 7F70D51694B8E6AF Time : 1645.871 sec.

384 / 4 = 96
384 / 3 = 128

I would have expected it to work better with 3 threads. Will have to look into optimal settings/recommendations for the user.

Also I updated my code so it should be at least as fast as LLR64, most likely faster (due to me not using checks and safety features in the core loop). On my CPU RPT seems to be about 1% faster.

39547695*2^506636-1 has 50KB FFT size:
./rpt 39547695 506636 1 took 68820ms
./rpt 39547695 506636 2 took 65739ms
./rpt 39547695 506636 3 took 57495ms
./rpt 39547695 506636 4 took 63035ms

./llr64 -d -q"39547695*2^506636-1" -t1 -> Time : 69.294 sec.
./llr64 -d -q"39547695*2^506636-1" -t2 -> Time : 66.765 sec.
./llr64 -d -q"39547695*2^506636-1" -t3 -> Time : 59.057 sec.
./llr64 -d -q"39547695*2^506636-1" -t4 -> Time : 63.470 sec.

And comparing larger n's:
4 threads: 39547695*2^3664022-1 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec.
4 threads: rpt 1397.022 seconds

I pushed the new binary (0.0.2) to Github releases: https://github.com/dkull/rpt/releases/tag/v0.0.2
2020-03-07, 13:41   #7
kuratkull

Mar 2007
Estonia

https://github.com/dkull/rpt/releases/tag/v0.0.3

RPT can now choose the best threadcount for you. Just give 0 as the threadcount. This information should be transferable to LLR64, eg. you can test the best threadcount and use that in LLR64 with the same k,n.

Currently the best one has to be at least 5% better than the last best; this ensures we don't waste cores for negligible improvements. Also this minimizes the effects of jitter.

Code:
./rpt_release_linux64_0_0_3 39547695 3664175 0
=== RPT - Riesel Prime Tester v0.0.3 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 39547695*2^3664175-1 [1103035 digits] on 0 threads
step #1 find U0 ...
found V1 [11] using Jacobi Symbols in 1ms
found U0 using Lucas Sequence in 390ms
step #1.5 benchmark threadcount ...
threads 1 took 1319ms for 1000 iterations
threads 2 took 770ms for 1000 iterations
threads 3 took 552ms for 1000 iterations
threads 4 took 489ms for 1000 iterations
threads 5 took 514ms for 1000 iterations
threads 6 took 543ms for 1000 iterations
threads 7 took 513ms for 1000 iterations
threads 8 took 466ms for 1000 iterations
using fastest threadcount 4
FFT size 384KB
step #2 LLR test ...

./rpt_release_linux64_0_0_3 39547695 506636 0
=== RPT - Riesel Prime Tester v0.0.3 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 39547695*2^506636-1 [152521 digits] on 0 threads
step #1 find U0 ...
found V1 [11] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 65ms
step #1.5 benchmark threadcount ...
threads 1 took 159ms for 1000 iterations
threads 2 took 155ms for 1000 iterations
threads 3 took 149ms for 1000 iterations
threads 4 took 164ms for 1000 iterations
threads 5 took 162ms for 1000 iterations
threads 6 took 169ms for 1000 iterations
threads 7 took 174ms for 1000 iterations
threads 8 took 192ms for 1000 iterations
using fastest threadcount 3
FFT size 50KB
step #2 LLR test ...
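The 5% rule described above can be sketched like this (a reconstruction of the described behaviour, not RPT's actual code). Fed the benchmark numbers from the two runs, it picks 4 and 3 threads respectively, matching the printed output:

```python
def pick_threadcount(timings_ms, min_gain=0.05):
    """Pick the fastest threadcount from a benchmark, but only adopt a
    higher count if it beats the current best by at least min_gain (5%),
    so negligible improvements and jitter don't waste cores."""
    best_t, best_ms = 1, timings_ms[0]
    for t, ms in enumerate(timings_ms[1:], start=2):
        if ms < best_ms * (1.0 - min_gain):
            best_t, best_ms = t, ms
    return best_t

# Benchmark data from the two runs above (ms per 1000 iterations):
print(pick_threadcount([1319, 770, 552, 489, 514, 543, 513, 466]))  # -> 4
print(pick_threadcount([159, 155, 149, 164, 162, 169, 174, 192]))   # -> 3
```

Note how in the first run 8 threads is nominally fastest (466ms vs 489ms), but the 5% threshold keeps the choice at 4 threads.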
2020-03-07, 16:42   #8
VBCurtis

"Curtis"
Feb 2005
Riverside, CA


Quote:
 Originally Posted by kuratkull
 (running all this on a Skylake i7)
 Using zero-padded FMA3 FFT length 384K for both:
 4 threads: 39547695*2^3664022-1 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec.
 3 threads: 39547695*2^3664034-1 is not prime. LLR Res64: 7F70D51694B8E6AF Time : 1645.871 sec.
 384 / 4 = 96
 384 / 3 = 128
 I would have expected it to work better with 3 threads. Will have to look into optimal settings/recommendations for the user.
4 * 1409 = 5636 thread-sec
3 * 1645 = 4935 thread-sec

I consider 15% more thread-sec 'inefficient'. I certainly didn't make clear what I meant the first time, sorry! If we did a similar comparison for 2- vs 3- threaded, I imagine the 2-threaded instance would be ~4800 thread-sec, within 3-4% of the 3-threaded instance. That's a tradeoff I'm willing to make for the admin simplicity of fewer clients running, but 15% is not.

2020-03-08, 02:29   #9
Happy5214

"Alexander"
Nov 2008
The Alamo City

I've never heard of Zig before. It looks pretty cool at first glance. I hope your code is more readable than the actual LLR code. I've actually used the Jacobi symbol code in class as an example of ugly code (namely the result variable being named "resul").
2020-03-08, 12:46   #10
kuratkull

Mar 2007
Estonia

@VBCurtis Aah right, I see what you mean. I personally just try to optimize the time spent on a candidate with little regard for used CPU time/power - unless adding an extra core really gives no benefit (probably around 5-10% for me).

@Happy5214 About Zig: yes, it's a pretty neat language. This project is as much about learning/testing Zig as it is about learning the LLR test. Zig hasn't let me down once during this project, and binding to C libraries was as easy as it can probably be. And I am glad to see that I managed to match/exceed (with cheating) the speed of LLR64 - I am not one to throw away performance gains lightly, so I'm glad Zig was able to handle that. I am also pretty happy that GMP provides a function for the Jacobi symbol, so I didn't have to implement it myself.
2020-03-12, 19:20   #11
kuratkull

Mar 2007
Estonia

https://github.com/dkull/rpt/releases/tag/v0.1.0

0.1.0
-----
* fixed issue with gwstartnextfft false-negatives
* selftest success at least up to n=150000 (for k<300)
* print LLR64-compatible residue
* automatic niceness 19 (not configurable)
* extra logging when errors happen (won't catch all)
* result output says 'maybe' if errors were seen
* speed still on par with LLR64

So basically it seems to work - though I still wouldn't bet my head on it working in all cases (yet), though I currently do not know a case where it would fail silently. It currently crashes for large n's (>9M). Also the argument format has changed, and setting threads to 0 causes RPT to auto-benchmark to find the fastest number of threads.

Currently my guarantee is that I ran --selftest 150000, which means all k<300 are tested up to known primes with n=150000, and a negative case is generated for each (n - 1 if it is not itself a known prime).
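The selftest scheme described above - test every known prime up to the limit, plus a composite control at n-1 - could be generated roughly like this (a hypothetical reconstruction; RPT's actual selftest code may differ):

```python
def selftest_cases(known_primes, max_n):
    """known_primes: dict mapping k to the sorted exponents n for which
    k*2^n-1 is a known prime. Yields (k, n, expected_prime) cases: every
    known prime up to max_n, plus a composite control at n-1 unless n-1
    is itself a known prime exponent for that k."""
    for k, exponents in known_primes.items():
        for n in exponents:
            if n > max_n:
                break
            yield (k, n, True)
            if n - 1 > 1 and (n - 1) not in exponents:
                yield (k, n - 1, False)

# e.g. 5*2^4-1 = 79 is prime, so we also test 5*2^3-1 = 39 = 3*13:
print(list(selftest_cases({5: [4]}, 150000)))
# -> [(5, 4, True), (5, 3, False)]
```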

