mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2023-01-07, 01:10   #34
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

176518 Posts
Default

Save files would be a requirement for significant end user adoption providing other concerns are addressed, IMO.
So too would be input from a worktodo file containing a queue of work, and output to a results file, plain text in both cases. For examples for other programs, see https://www.mersenneforum.org/showpo...8&postcount=22 and https://www.mersenneforum.org/showpo...9&postcount=33.

A few comparisons, before leaving the subject:
subjects of this thread:
GP_LLT 209KB (not far from the average size of a virus)
proof.exe 149 KB (ditto)
MPOC.exe 4MB highly compressible 22K zip size
(no source or documentation for any of them)

Some old friends, and their exe sizes:
Code:
mfaktc 1.5-1.8MB	source available
mfakto 1.6MB		"
CLLucas 293KB 		source available
CUDALucas 350KB-1200KB  "
CUDAPm1	530KB		"
gpuowl 1.8MB		"
Mlucas 30MB		distributed as source only
Prime95 40MB		source available with a small exception related to security strings
(all with help or readme files, plain text input, output, etc.; & a smattering of logging among them)

Windows thread creation is variously stated as ~40 microseconds each; linux 10, and the value of keeping a thread around for a long time is discussed at https://stackoverflow.com/questions/...ating-a-thread
8Mi threads / iteration x 40usec/thread / 64 cores ~ 5.24 seconds estimated elapsed thread creation overhead per iteration for GP_LLT @ ~OBD. Probably longer because Xeon Phi cores are clocked slow. ~552. years for thread creation overhead!

Quote:
Originally Posted by Dr Autonomy View Post
The way to find out how good your hypothetical estimate is, would be to simply run a few (say, three) iterations of any Mersenne number you like under GP_LLT
Three, or ten or ~30 is not sufficient. For large exponents, the early iterations produce the same residue regardless of exponent, so correct results in early iterations are extremely easy to fake, in a program for which source code is withheld. Hence George specifying 100, which is deep enough to make them exponent-specific, and numerous enough to discouragel programs with really atrocious run time scaling, as well as fakes.

Ok I think that about does it. It's been fun.

Last fiddled with by kriesel on 2023-01-07 at 01:42
kriesel is online now   Reply With Quote
Old 2023-01-09, 11:36   #35
S485122
 
S485122's Avatar
 
"Jacob"
Sep 2006
Brussels, Belgium

196710 Posts
Default

Took a snapshot of a virtual machine. CPU i9-10920X down-clocked to 2200 MHz, 2666 MHz DDR4 memory.
Started it, disabled network.

Started test on M62818631 with GP_LLT 1.0.0.5.

Factoring is extremely slow : it uses one core only. Going from k=1 to k=24000 (2*24000*62818631 is less than 42 bits !) took more than 20 minutes. Prime95 goes from k=1 to k=140023 (26 to 42 bits factors) in no time that I could measure, lets say 1 second. For factoring the program is more than a thousand times slower than Prime95 (which is CPU based, graphic cards bring trial factoring speed to another height.)

Then I started the test of that number : the ETA was somewhere in 2028, iteration times where on the order of 2,7 seconds (compared with 0,0026 seconds for Prime95.) Again a thousand times slower than Prime95. After some 120 iterations the files size was already 8 MB. The whole run would increase the file to about 4 TB.

The output files are not readable as such. The Proof program asks to choose a o0 length rdl file (but probably uses the nsp file).

The MPC program measures something and gives the output as percentages.

The conclusion is "Bravo !" for programming, but there is some work to do to start being useful for current GIMPS work.

Reverted the virtual machine to the situation before the snapshot.

Last fiddled with by S485122 on 2023-01-09 at 11:55 Reason: ETA kept jumping around, used it timings instead.
S485122 is offline   Reply With Quote
Old 2023-01-09, 14:10   #36
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

11111101010012 Posts
Default

Quote:
Originally Posted by S485122 View Post
Started test on M62818631 with GP_LLT 1.0.0.5.

Factoring is extremely slow: it uses one core only. Going from k=1 to k=24000 (2*24000*62818631 is less than 42 bits !) took more than 20 minutes. Prime95 goes from k=1 to k=140023 (26 to 42 bits factors) in no time that I could measure, lets say 1 second. For factoring the program is more than a thousand times slower than Prime95 (which is CPU based, graphic cards bring trial factoring speed to another height.)

Then I started the test of that number : the ETA was somewhere in 2028, iteration times were on the order of 2,7 seconds (compared with 0,0026 seconds for Prime95.) Again a thousand times slower than Prime95. After some 120 iterations the files size was already 8 MB. The whole run would increase the file to about 4 TB.

The output files are not readable as such. The Proof program asks to choose a o0 length rdl file (but probably uses the nsp file).

The MPC program measures something and gives the output as percentages.

The conclusion is "Bravo !" for programming, but there is some work to do to start being useful for current GIMPS work.

Reverted the virtual machine to the situation before the snapshot.
Thanks for trying it out briefly. i guess the good news is it does do some Mersenne number crunching. Your prime95 comparison was done in the VM also? (Overhead would not be a thousandfold, but not zero either.)

That low GP_LLT version # is funny; https://mersenneforum.org/showpost.p...27&postcount=8 indicates v2.2 on the zip file.

Somewhere he uses the term divide in describing its trial factoring. Perhaps it is so slow from doing long division rather than modpow. FYI, for benchmarking purposes, set TimeStamp=2 in prime95's prime.txt for 1 second worker window timing resolution. One could run several bits deeper in prime95 TF to get some semblance of timing resolution.

I wonder if it finds low factors within its theoretical & practical reach.
https://www.mersenne.ca/exponent/10000103 has a small factor.
Or how it handles multiple small factors.
https://www.mersenne.ca/exponent/10000303 has 3 small factors.

Did you happen to extract any resultant res64/iteration-count pairs from any file? It would be interesting to see whether it's at least producing correct res64 values. Especially after iteration #25 or so (exponent dependent), when the mod 2p-1 begins to have effect. For p=10000019, prime95 yields FA16B251080501DC for the 100th iteration res64 (which it numbers 102) on a seed 4 LL start. Other apps should match. (It's not like PRP, where there are many types, and prime95's interim residues don't match mlucas or gpuowl for type 1 interim residues, but do match at the final output.)

You mention seeing file growth to 8MB in 120 iterations on ~62M. What's growing, and did it appear to be linear with iteration count? (Prime95 does some file space allocation or preallocation which generates space growth early that is not linear with iteration count.)

Did you see any evidence of the ability to stop and resume a primality test from a save file? User set save intervals?

"Some work to do" indeed for raising it from academic exercise to useful computation rate.
A day well spent in a good university math library or on this forum, before beginning to code, could have helped greatly in spending the programming effort to better effect. I hope the OP found the journey educational.
Maybe after a break he'll try again with a more efficient choice of algorithms and design.

I'm putting together some interim and final residues for some exponents, spaced about 10:1 in exponent and iteration count, both LL and PRP type 1, from multiple applications. The final residues will be a subset due to run time or other limitations on available software and hardware for even the top 3 performing GIMPS primality test programs. It will likely become a developer's corner reference thread post.
kriesel is online now   Reply With Quote
Old 2023-01-09, 15:20   #37
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

176518 Posts
Default

Quote:
Originally Posted by S485122 View Post
Factoring is extremely slow : it uses one core only.
Well there's about a factor of 16 performance to be had.

Did you confirm its LL uses multiple threads? How much ram was used?
kriesel is online now   Reply With Quote
Old 2023-01-09, 18:02   #38
S485122
 
S485122's Avatar
 
"Jacob"
Sep 2006
Brussels, Belgium

7×281 Posts
Default

Quote:
Originally Posted by kriesel View Post
Well there's about a factor of 16 performance to be had.

Did you confirm its LL uses multiple threads? How much ram was used?
I am through with testing that program.
The version is the version of the executable contained in the compressed file.
S485122 is offline   Reply With Quote
Old 2023-01-10, 00:18   #39
Dr Autonomy
 
Jan 2023
Alberta, Canada

1210 Posts
Default Merci

Thanks very much for the feedback, Jacob. I appreciate it. Truth be told though, it's a smidge faster than I thought it would be..

You're right that the Trial Factoring sucks. I just threw it in there to prevent someone from doing an entire LLT for no reason (i.e. a discoverable small factor).

Also, the "whole run" of M62818631 would be 960 Meg. The "snp" file (short for "snapshot") is just the last successful iteration (in full), plus some other metrics, for restart purposes. The "rdl" file (short for ResiDue List), will contain exactly (P-1) 16 byte records (960 Meg) when the LLT is complete; one for each iteration, numbered from zero (S == 4).

So on a slowed down I9, your ETA was 2028; on mine, a battered I5 with only four cores, it was Feb, 2040.

That sounds about right..

Still curious as ever though, to see what it could do on a 128-core system. It was built to fully exploit multiple cores, so I've been looking at those systems. Surprisingly affordable they are now.

Also correct when you point out that the rdl file can sometimes be zero length. Until the candidate is proven prime, the rdl file only contains "blocks" of iterations, a "block" being 32,768 iterations. If you've performed less than a block, the remainder is stored in the snp file. It was just efficient to do it that way.

More than you wanted to know, I'm sure.

And thanks for the Bravo. Do you think I should replace the grammar school squaring with a parallelized Karatsuba? Karatsuba gets kind of klunky, and expensive (bytes), when you parallelize it.
Dr Autonomy is offline   Reply With Quote
Old 2023-01-10, 00:47   #40
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

5·1,373 Posts
Default

Quote:
Originally Posted by Dr Autonomy View Post
Do you think I should replace the grammar school squaring with a parallelized Karatsuba? Karatsuba gets kind of klunky, and expensive (bytes), when you parallelize it.
Time is expensive, memory is cheap.
retina is online now   Reply With Quote
Old 2023-01-10, 03:49   #41
Uncwilly
6809 > 6502
 
Uncwilly's Avatar
 
"""""""""""""""""""
Aug 2003
101×103 Posts

101011100010012 Posts
Default

Quote:
Originally Posted by Dr Autonomy View Post
And thanks for the Bravo. Do you think I should replace the grammar school squaring with a parallelized Karatsuba? Karatsuba gets kind of klunky, and expensive (bytes), when you parallelize it.
Maybe study up on FFT and try to implement that. If you want to speed your program up, that is a much better choice. It will take to to learn and implement. But, if you make a serious attempt at it, you will get plenty of help with your code. This assumes that you share your code when you ask for help.
Uncwilly is online now   Reply With Quote
Old 2023-01-10, 06:20   #42
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5·1,621 Posts
Default

Learn from the masters. Multilevel ffts; multiple sizes and shapes. Irrational base discrete weighted transforms to avoid need for zero padding. No grammar school or karatsuba to be found.
Download the software and read sources and documentation, and reference info. Try the programs out.

gpuowl: 3 level fft; 2 power-of-two for speed and one in the middle up to 13-smooth for flexibility on total size. Usually 13-smooth is not faster than the next higher 7-smooth fft length.
from the help output, height:middle:tail alternatives for equal length,
Code:
FFT 3.25M [  5.11M -   63.93M]  256:13:512 512:13:256
FFT 3.50M [  5.51M -   68.67M]  1K:7:256 256:14:512 256:7:1K 512:14:256 512:7:512
prime95: 2 level fft and cache line multiplier, well optimized by benchmarking many choices.
mlucas: 3 to 5 level fft typically. Up to 31-smooth IIRC. No single level uses as big as 1024. Most are power-of-two, which typically are the most efficient. It includes benchmarking the several alternate layouts per length and records what is best for the hardware present.
Learn non-power-of-two ffts. These allow tailoring the overall length for a closer more efficient fit to the task size.

Adopt type 1 PRP. It enables FAR SUPERIOR error detection and correction by rewind to last known good save file.
100.000000% good full test for PRP with Gerbicz error check, vs LL's limited 50% detection rate of erroneous full residues.
In practice, an LL first test is useless. These are typically in error 2% of the time per test. So cost is 1 first test + 1 DC + 4% probability of a third test for verification, 2.04 total.
PRP also enables proof file generation, and a power 9 proof file allows verification for ~0.2% of the effort of a first test. PRP effort for test + verification ~1.002 tests, total; 2.04/1.002 ~ 2.036 times faster than LL + LLDC + sometimes LLTC etc.
It's more cost effective to ignore an LL first test and perform a PRP with proof generation on the same exponent,
saving 1.02-1.002 = .018 test equivalent.

Last fiddled with by kriesel on 2023-01-10 at 06:24
kriesel is online now   Reply With Quote
Old 2023-01-10, 23:04   #43
Dr Autonomy
 
Jan 2023
Alberta, Canada

22·3 Posts
Default PRP

"PRP" stands for "Probable Prime", right?

Will the EFF accept as a "discovery" that some number is "probably prime"?

I believe the answer is no.

You are right that I don't know much about "Multilevel ffts". But they're sequential algorithms, right? Or has someone parallelized them? Dunno.

I just think that in an increasingly parallel-processed world, you have to rethink the sequential algorithms, especially recursive ones, towards a more uniform parallel computational framework. Because Moore's Law is waning, and it's being replaced by parallel processors/processing, and the software has to change to keep up.

End of diatribe.

BTW, Jacob was also right about the version of GP_LLT. Microsoft automatically puts a "version" resource in its compiled executables, which I didn't maintain. It defaults to whatever he said it was. I could replace it with a new one, but then I'd have to suffer another four pages of accusations and name-calling. Nah. Pass.

As for source code, I have no idea what I might do with it. Maybe just bite the bullet, buy the damn supercomputer, and win both remaining EFF awards with it. Except the hardware isn't quite there yet..

Or I might sell it, lock, stock, and barrel. Dunno.

Or I might publish.

Hard to say.
Dr Autonomy is offline   Reply With Quote
Old 2023-01-10, 23:19   #44
Uncwilly
6809 > 6502
 
Uncwilly's Avatar
 
"""""""""""""""""""
Aug 2003
101×103 Posts

3·5·743 Posts
Default

But, with the error checking and the ability to verify the PRP run (which eliminates the need for a DC), PRP is now the best tool to scan exponents for primality. If a number passes, then it will be confirmed by several different hardware architectures with several different programs, doing the LL test. The PRP test, as it is now, also us to test and verify more numbers quicker.

Some processes that are sequential can also be parallelized.
Uncwilly is online now   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Any way to pipe GMP mpz_t into PFGW? hansl Software 3 2019-04-09 19:44
Is there any program... mart_r Software 2 2009-11-15 20:06
So you think you can program rogue Lounge 5 2009-10-02 15:02
Program Primeinator Information & Answers 5 2009-07-16 21:42
which program? drakkar67 Prime Sierpinski Project 14 2005-11-29 06:25

All times are UTC. The time now is 01:31.


Thu Oct 5 01:31:23 UTC 2023 up 21 days, 23:13, 0 users, load averages: 0.81, 0.84, 0.81

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔