"TF79LL86GIMPS96gpu17"
Observations
- At exponents below ~1M, run time is about linear in the exponent, because GP_LLT implements only one fixed-size 64K-word fft, and that width is more than enough up to ~1.2M exponent. At 100K exponent this makes it markedly slower, even while using as many threads as it can, than a simple single-threaded perl PRP or LL test program on the same hardware.
- Task manager indicates saturated use of one hyperthread at 1M exponent, and of 8 threads at the test exponents of 10M or larger. This is consistent with the likely number of 64K ffts being employed and an assumption that the 64K fft is itself single threaded.
- The program's accompanying database has some unexpected gaps. In one case it allowed an LL test of a modest exponent (100019) that has a known factor of a mere 21 bits, even though the code is intended to block testing of exponents with known factors.
- The GP_LLT program does not compute all the iterations. It skips five at the beginning.
Since its minimum exponent is 67, and the seed value four, it can preload the fixed precomputed 61 bit result of iteration 5 in a single 64 bit word and begin from there. (See also https://www.mersenneforum.org/showpo...41&postcount=3 from a few years ago.)
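A minimal sketch of the preloading idea, in Python rather than anything GP_LLT actually uses (I have not seen its source; the function names are made up for illustration). The first five Lucas-Lehmer iterations from seed 4 need no modular reduction for any p >= 67, so their result is a constant that can be reused, leaving p-7 iterations:

```python
def ll_seed_terms(n=5, seed=4):
    """Return [s0, ..., sn] of s -> s*s - 2 with no modular reduction."""
    terms = [seed]
    for _ in range(n):
        terms.append(terms[-1] ** 2 - 2)
    return terms

# Fixed value after iteration 5; it does not depend on the exponent.
S5 = ll_seed_terms()[5]

def ll_is_prime(p, skip=True):
    """Lucas-Lehmer test of 2**p - 1 for an odd prime p.

    With skip=True the test starts from the preloaded S5, which is only
    valid when S5 < 2**p - 1 (true for every p >= 67).
    """
    m = (1 << p) - 1
    if skip:
        s, done = S5, 5
    else:
        s, done = 4, 0
    # p-2 iterations in total; 'done' of them are already accounted for.
    for _ in range(p - 2 - done):
        s = (s * s - 2) % m
    return s == 0

if __name__ == "__main__":
    print(S5, S5.bit_length())
    for p in (67, 89, 107, 127):
        print(p, ll_is_prime(p), ll_is_prime(p, skip=False))
```

Both starting points give the same verdict, since the skipped iterations are identical for every allowed exponent. The saving of five iterations out of p-2 is negligible at any exponent of interest.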
- The GP_LLT program seems inconsistent in iteration count. Its GUI seems to indicate it will compute p-1 iterations.
It took a while for me to realize this was an error of interpretation; it displays the number of the current iteration, and includes the current iteration in the subtotal yet to be completed; the sum of those two is p-1. It stops at p-2, correctly. It avoids the first five as noted above.
- The proof viewer program seems to misrepresent iteration timings as xx,xxx.x msec (milliseconds, 10^-3 seconds), while the timing actually corresponds to xx,xxx.x microseconds (10^-6 seconds). The SI symbol for milliseconds is ms (often written msec); for microseconds a Greek mu is needed (µs). See https://www.nist.gov/pml/owm/metric-si-prefixes When limited to ASCII character sets, the mu is often approximated as u, which does not conflict with any of the other SI prefixes listed there.
- Because of the known-status blocking implemented by GP_LLT, I ran GP_LLT iteration timing on the nearest allowed exponents instead, and scaled the run time according to exponent where needed. The difference between p-1, p-2, and p-7 iterations is insignificant in the range p=10^5 and up. (None of the test exponents used were close enough to a multiple of 65536 for the minor exponent differences to affect the number of 64Ki-word fft superwords in GP_LLT's operation.)
- The run time of GP_LLT is far longer than that of commonly used GIMPS software on the same hardware. Orders of magnitude longer. I found no scenario in which GP_LLT ought to be used for GIMPS goals.
GP_LLT run time per primality test at exponents of current interest, p>100M, scales as about O(p^3); depending on how many values are used or excluded from a graph fit, the fitted power is 2.88 to 2.97. Commonly used GIMPS applications scale as about O(p^2.1), which is an approximation to p^2 log p log log p.
GP_LLT run time at modest exponents is at best hundreds of times longer than for prime95 on the same hardware (146.7-fold at 1M; 1765-fold at 100M; 14643-fold at 1G), and is thousands to tens of thousands of times longer at exponents of current and future prime-seeking interest. (Hundreds of thousands of times slower than gpuowl on a good GPU.)
At OBD, individual GP_LLT iterations are estimated to take 6+ hours each on the i5-1035G1, corresponding to a primality testing time of 2.3+ MILLION years. A factor of 10, 100 or 1000 faster hardware is not sufficient to overcome that. A prudent user would use faster hardware with faster software also.
It's even slower using all 8 hyperthreads, above ~150M exponent, than my simple primitive single threaded perl scripts (which are also too slow to be of any practical prime hunting use).
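A rough cross-check on the ~3 power, using only the three slowdown figures quoted above and assuming prime95 follows p^2.1 over this range. Each slope uses just two points, so treat it as a sanity check, not a fit. The names are illustrative only:

```python
import math

# GP_LLT run time divided by prime95 run time, as quoted above.
slowdown = {
    1_000_000: 146.7,
    100_000_000: 1765.0,
    1_000_000_000: 14643.0,
}

# Approximate prime95 scaling power quoted in the text (an assumption here).
PRIME95_POWER = 2.1

def slowdown_power(p_lo, p_hi, ratios=slowdown):
    """Log-log slope of the slowdown ratio between two exponents."""
    return math.log(ratios[p_hi] / ratios[p_lo]) / math.log(p_hi / p_lo)

if __name__ == "__main__":
    for p_lo, p_hi in ((1_000_000, 100_000_000),
                       (100_000_000, 1_000_000_000)):
        gap = slowdown_power(p_lo, p_hi)
        print(f"{p_lo:>13,} to {p_hi:>13,}: slowdown ~ p^{gap:.2f}, "
              f"implied GP_LLT ~ p^{PRIME95_POWER + gap:.2f}")
```

Between 100M and 1G the slowdown itself grows by a power a bit above 0.9, which added to 2.1 lands near 3, in the same neighborhood as the graph fits.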
- Memory usage is about linear with exponent. No paging was observed.
M99999847 469M ram occupied, ~4.918 bytes per bit of exponent (about 39 times the p/8 bytes of a packed binary residue).
M1000000007 4632M ram occupied, ~4.857 bytes per bit of exponent.
M3321928171 15660436K = 15293M ram occupied; ~4.827 bytes per bit of exponent. For scale, one packed binary copy plus one copy in an fft layout at 17.5 bits per 64-bit word would come to only ~0.58 bytes per bit of exponent.
This means p ~ 2^32 would require more than ~20 GiB of installed ram to avoid thrashing. The test system has 64 GiB installed. Ram is not limiting; run time is.
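The ratios quoted above can be reproduced by dividing occupied ram in bytes by the exponent p, taking Task Manager's K and M as KiB and MiB (an assumption on my part). A small sketch, with made-up helper names, that also projects the requirement at p ~ 2^32 from the largest measurement:

```python
# (exponent, occupied ram in KiB); the MiB figures are converted to KiB.
observations = [
    (99_999_847, 469 * 1024),
    (1_000_000_007, 4632 * 1024),
    (3_321_928_171, 15_660_436),
]

def bytes_per_exponent_bit(p, kib):
    """Occupied bytes divided by the exponent (bits in the residue)."""
    return kib * 1024 / p

def projected_gib(p, ratio):
    """Projected ram in GiB if usage stays linear at 'ratio' bytes per bit."""
    return p * ratio / 2**30

if __name__ == "__main__":
    for p, kib in observations:
        r = bytes_per_exponent_bit(p, kib)
        # A packed binary residue is p/8 bytes, so multiply by 8 to compare.
        print(f"M{p}: {r:.3f} bytes/bit, {8 * r:.1f} x packed binary")
    last = bytes_per_exponent_bit(*observations[-1])
    print(f"p = 2^32: ~{projected_gib(2**32, last):.1f} GiB")
```

The slow decline in the ratio with exponent suggests a small fixed overhead on top of the linear term, though three points are not enough to say more.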
- On very short runs, e.g. 100019 (~27 minutes on i5-1035G1), it produced the correct final residue. Runs of days, weeks, or months are common in the more efficient software. GP_LLT reliability on longer runs is undetermined.
- ECC ram is not sufficient to forestall some run time errors. See for example, at https://mersenneforum.org/showpost.p...&postcount=864, the reproducible difficulties (on two different ECC ram equipped systems) with M333043493 in prime95 v30.8b15 at 18M fft length, finally resolved by forcing 20M fft length.
- .SNP files occupy at least p/8 bytes plus a small additional amount, as little as 200 bytes, which I think increases with the number of iterations that have been run.
There is apparently no provision for backup whole residues implemented in the program, to retreat to if an error occurs.
- In https://mersenneforum.org/showpost.p...7&postcount=30 I computed some lower bounds for GP_LLT iteration time for a ~1G exponent. Comparing actual run time shows it over 50 times slower than that bound:
Code:
Exponent     Low bound sec/iter   Measured sec/iter   Ratio actual/bound
1000000007   34.1                 2010.               58.9
- In a run on OBD exponent 3321928171, lasting more than 3 days, GP_LLT produced an empty .rdl file and no .snp file at all during the run, so the program-calculated timing and interim res64s could not be viewed with a hex editor while the run continued. Iteration timing was instead estimated from elapsed time and the displayed iteration count. Presumably if the OS or program had crashed, all progress would have been lost. After suspending the run and then selecting exit in the program, a .snp file was created. This claimed 14 iterations, but the proof viewer displayed residues only for #6 thru 13 (8 total). The iteration numbers could be selected for copy/paste in the proof viewer display, but not the interim residue values or timing values. The average of the recorded iteration times was 22,165 seconds per iteration. That extrapolates to a total run time of about 2.3 million years.
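The extrapolation is simple arithmetic: p-2 iterations at the measured average time each. A short sketch, assuming a 365.25-day year and ignoring the handful of iterations GP_LLT skips at the start (the function name is mine):

```python
SECONDS_PER_YEAR = 365.25 * 86400  # Julian year, an assumption for rounding

def ll_runtime_years(p, sec_per_iter):
    """Estimated full LL test duration: p-2 iterations at a fixed rate."""
    return (p - 2) * sec_per_iter / SECONDS_PER_YEAR

if __name__ == "__main__":
    p_obd = 3_321_928_171
    avg = 22_165.0  # seconds per iteration, from the recorded timings
    print(f"{avg / 3600:.2f} hours per iteration")
    print(f"{ll_runtime_years(p_obd, avg):,.0f} years for M{p_obd}")
    # Same estimate for the ~1G exponent at its measured 2010 sec/iter.
    print(f"{ll_runtime_years(1_000_000_007, 2010.0):,.0f} years for M1000000007")
```

Even a 1000-fold faster machine would leave the OBD estimate at over two thousand years.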
- Responsiveness of the test system's GUI is substantially affected by the exponent being run. 10M or below was normal responsiveness. 100M was laggy; 332M very laggy, 1G worse; 3.32G extremely difficult and slow to accomplish anything, basically useless for anything else. At 3.32G it was not possible to launch Task Manager.
- The ETA computation/display in the GP_LLT program becomes quite capricious at very large exponents. I saw fluctuations of thousands of years from time to time, including estimated completion times slightly earlier in the day than the time at which they were viewed, while billions of iterations remained to be performed, at several hours each. Perhaps the elapsed time computation is overflowing.
- A small exponent I ran to completion (100,019) produced a result matching the output of my LL perl script.
Top of reference tree: https://mersenneforum.org/showthread.php?t=24607
Last fiddled with by kriesel on 2023-03-15 at 23:13