20230104, 00:04  #23  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×17×227 Posts 
Quote:
From Don Knuth on at least, grade school multiplication has been understood as ~c m n complexity, where m and n are operand lengths. In squaring, m = n, hence c m^{2}. Yes almost half the digit by digit multiplies can be skipped, just added in twice, after initializing an array with the diagonal terms, as I did back in the 80's, or shifted left by one bit and added once if that is faster. Half is little help when countered by an order disadvantage of thousands or millions for the operands of interest. https://en.wikipedia.org/wiki/Big_O_...mmon_functions Still waiting to see any real signs of appreciation of the feedback that you requested. Just because there's 2n or 4n sets of registers in a processor core, does not mean one gets 2n or 4n computational throughput. Good code in this area tends to be memory bound. I have systems with 256 and 272 hyperthreads (xeon phi are x4). Last fiddled with by kriesel on 20230104 at 00:55 

20230104, 00:37  #24 
Apr 2020
1023_{10} Posts 
Wrong. It is complexity O(n^{2}). It is also complexity O(T_{n}). Or complexity O(1000000n^{2}) for that matter. I suggest you read up on bigO notation.

20230104, 01:47  #25 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
1A73_{16} Posts 
Yeah it is.
But no matter, it still doesn't stand a chance when compared to FFT. If FFTs scare you, then it would be useful to have code doing NTT LLs. Even though LLs are not much use nowadays for MP testing, they are still good for verification when/if a new MP is found. Are you up to it? Last fiddled with by retina on 20230104 at 03:33 
20230104, 21:23  #27 
Jan 2023
Alberta, Canada
2^{2}·3 Posts 
"Methodology Proof Of Concept"  "MPOC".
Compares timings for FFT and Karatsuba, of varying block sizes, against gradeschool multiplication, as a percentage. 
20230104, 22:15  #28 
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
25416_{8} Posts 
What's with the posting of the .exe? Can't you post some source code? Please. Or instead post timings of selected exponents.

20230104, 22:27  #29  
Feb 2012
Prague, Czech Republ
2×101 Posts 
Quote:
a commonly trusted party, as soon as a moderator sees it. (FTR: I'm don't use Windows anywhere, it's not about me.) 

20230105, 19:54  #30 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·17·227 Posts 
Comparison of estimated and measured run times
Following is an attempt to compute a lower bound estimate for run time of GP_LLT primality testing at 1Gbit and 1Gdigit, and provide comparative values for other available software. It is based on Dr Autonomous posts earlier in this thread, as I understood them (64Ki fft length pass, and up to 4096 wide grammar school O(n^2) pass, various assumptions that favor GP_LLT, and empirical measurements of prime95 speed on specific hardware, & extrapolation to larger fft lengths unimplemented in prime95. Comparison to Mlucas benchmarks and estimated gpuowl run time are also included.
Benchmarking in prime95 v30.8b15, 64Ki length, 4 cores 1 worker gives 0.09 msec/iter; 4 cores 4 workers gives 0.26 msec/iter each in Windows on an otherwiseidle i51035G1 ("modern" 10th generation AVX512 capable laptop i5 processor https://www.intel.com/content/www/us...fications.html). Let us suppose for run time estimation purposes that GP_LLT is equally efficient as prime95's, at the 64Ki fft performance level, take the GP_LLT author's word that the layers above are grammar school multiplication, and estimate its performance at 64Mi. Also assume it is a triangular squaring, which would be 1024^2/2 + 1024/2 partial products to compute, or 524800 superdigit multiplies. (I think it very unlikely anyone's early version of an fft implementation would be equally as efficient as the hardwarespecifictuned and fastestbenchmarkofalternativeswins ffts of George or Ernst, or close, but assuming speed parity is convenient for ballpark estimating run time lower bound. I would be astounded to find any that were significantly faster on same hardware and inputs and still producing correct results.) To compute a simple lower bound for iteration time, assume no setup time and no more summing and carry propagation time beyond that included in the 64Ki fft, and zero overhead for any error checking. (Which means zero time used in the grammar school multiplication pass of the GP_LLT code, for setup, summing partial products, double for the offdiagonal terms, carry propagation with 2 included, and any error checks, yet accurate results computed.) Also perfect parallelism performance at 4 cores, or more cores on other hardware. And zero incremental time for mod 2^p1 at each iteration at the grammar school multiplication level. And efficient enough code throughout, that GP_LLT is memory bandwidth bound, like prime95 typically is. Estimate GP_LLT iteration time as 0.26 msec/64Ki_superdigit x 524800 superdigit_multiplies/iteration / 4 cores = 34.112 seconds/iteration. A 1Gbit exponent's 1G iterations would take > 34.112 seconds/iteration x 1G iterations / 3600 / 24 / 365 ~ 1082. years. The GP_LLT author indicated using 4096 superdigits wide for OBD iterations, in grammar school multiplication. A gigadigit exponent's 3.322G iterations would take ~ 34.112 seconds x 4^2 x 3.322E9 / 3600 / 24 / 365 = 57,494. years. (I've seen worse scaling.) (In prime95 we could do better than that, because it supports nonpoweroftwo lengths. But GP_LLT perhaps does not, at least yet, since the author indicated using 4096 width of 64K size superdigits = 256M, for OBD computations, not the 192M that Mlucas could do it with. I don't yet understand why, because in my ancient experience it's straightforward to code a grammar school algorithm with length variable at granularity 1 word. Which would save ~1(3/4)^2 of the run time, ~44%.) Now let's compare prime95's 64Mi or 56Mi fft benchmark times for the same i51035G1 processor. (Posted in last attachment of https://www.mersenneforum.org/showpo...19&postcount=5) For 64Mi, best throughput is 6.23 iters/sec; for 56Mi, 7.46 iters/sec, corresponding to 0.1605 and 0.134 seconds/iter. (For no use of hyperthreading; hyperthreading gave lower performance in prime95 benchmarking.) Using the slower 64Mi timing, prime95 could do one 1G PRP or LL in 0.1605 1G / 3600 / 24 / 365 ~ 5.09 years. That is 1082. / 5.09 ~ 213. times faster than the estimated lower bound run time for GP_LLT. (The 56Mi length would be ~213. * 7.46 / 6.23 ~ 255. times faster than the estimate computed here for GP_LLT at 1Gbit.) A factor of two or more speed disadvantage on same hardware is enough to nearly eliminate usage of established software. CUDALucas is no longer being used much on hardware that can run gpuowl, for example. For a rough estimate of what run time for prime95 might be on the i51035G1 if extended to 192Mi fft, consider the 16M and 48M benchmark throughputs, 30.78 iters/sec and 8.73 respectively, a ratio of 3.526. Extrapolating from 64Mi to 192Mi by that ratio gives 6.23 /3.526 ~ 1.77 iters/sec, ~566. msec/iter for the i51035G1, corresponding to ~59.6 years. Performing a similar extrapolation on a Xeon Phi 7250 benchmark using singleworker nonHT timings (see second benchmark attachment at https://www.mersenneforum.org/showpo...4&postcount=11) yields ~23.94 iter/sec / (105.72 /34.84) ~ 7.89 iter/sec, corresponding to 0.127 sec/iteration, ~13.35 years. (I think it unlikely that George will create such an extension while there are more productive uses for his time.) The highest performance CPU hardware I have currently (Xeon Phi 7250 68 core & X4 HT) benchmarked at 30.7 iters/sec at 56Mi in prime95 (without hyperthreading, which was also benchmarked, and at almost every fft length supported, slower). That's 4.12 times the performance of the i51035G1. That would reduce GP_LLT's OBD primality test run time estimate to ~13,955. years. Using dimension 3072^2 rather than 4096^2 in the grammar school pass would reduce that estimate to ~7,850. years. Buying newer compatible hardware with ten or even a hundred times the performance if available at an acceptable price is not sufficient to reduce the GP_LLT run time estimate to a sufficiently small fraction of a human lifetime. Next I estimate what replacing the grammar school multiplication pass in GP_LLT with Karatsuba might provide. I optimistically assume the overhead of Karatsuba's increased number of additions and subtractions would be zero. Using maximum recursion depth of 12 for the 4096 = 2^12 wide pass, to reduce the number of 64Ki fft based superdigit multiplications required, (3/4)^12 ~0.0317. 0.0317 x 13955 ~ 442. years. Or recursion level 11 for the hypothesized 3072 wide pass, 0.0422 x 7850 ~ 332. years. And remember, these are thought to be optimistic lower bounds due to the effect of numerous favorable assumptions and approximations. In Mlucas, on dualXeon e52697v2 hardware or e52690, which in typical prime95 GIMPS work is slower than the Xeon Phi's combined throughput, I have benchmarked 192M fft length down to about 400. msec/iter. https://www.mersenneforum.org/showpo...8&postcount=20 For minimal prime exponent OBD primality testing on a noknownfactor candidate, that would be p~3,321,928,171 or higher, and 3321928170 iterations or more in PRP (PRP chosen for superior error detection possibilities, pretty much mandatory at such long expected run times); 0.4 sec/iteration x 3321928170 iterations / 3600 / 24 / 365 = 140. years. So ~70 years on the Xeon Phi (which I've not benchmarked in Mlucas in Linux yet to be able to use all or most available cores). That's still much too long for someone my age or near it. It's also orders of magnitude too long to run without state of the art numbertheorybased error detection and correction such as the Gerbicz error check for PRP type 1. In gpuowl of a recent efficient fast version, if extended to higher fft lengths than currently supported and released, iteration time would be ~60. msec/iteration at 192M fft length (3pass such as 4K:6:4K), corresponding to 0.06 sec/iteration x 3321928170 iterations / OBD primality test / 3600 / 24 / 365 ~ 6.32 years per OBD PRP/GEC test. Gpuowl's superior speed and power efficiency on selected modern GPUs is why even the authors of prime95 and Mlucas run a significant amount of work on gpuowl. An LL test of equal duration is almost certain to be in error, even if guarded by the Jacobi symbol check's 50% detection rate, and other error checks known to be applicable to LL on fft. To be competitive for GIMPS use, a fast accurate implementation of the best available algorithm is needed. To be competitive, GP_LLT or any primality test software needs a multipass completely fft implementation, like is implemented in prime95, Mlucas, and gpuowl. I speculate a shallow x2, x3 or x4 fft length extension by grammar school or Karatsuba pass might be justifiable for very rare use if it is easier to implement than fully fft. Last fiddled with by kriesel on 20230105 at 20:08 
20230106, 20:43  #31  
Jan 2023
Alberta, Canada
12_{10} Posts 
Quote:
But maybe I'm wrong. The way to find out how good your hypothetical estimate is, would be to simply run a few (say, three) iterations of any Mersenne number you like under GP_LLT on your "Xeon Phi 7250 68 core" machine, exit, then view the resultant "rdl" file with "Proof.exe". It will tell you EXACTLY how long each iteration took, in microseconds (actually, tenths of a microsecond). What's difficult about that? The hardest part would be uploading to this forum the resultant screenshot, or better yet, both the "rdl" and "snp" files. That's all I'm looking for here. 

20230106, 23:12  #32  
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2·17·227 Posts 
Quote:
Quote:
Quote:
Quote:
4096^2 =16Mi threads at once, or ~half that, for a few iterations on an OBD LL? For an app that is performancecritical, yikes! What do you think that does to the caches when milions of threads compete for a a relative few cores and get cycled in and out before completion? Or can you guarantee they each complete within a timeslice? Or what it does to total RAM usage? "A basic kernel stack is 12K on 32bit Windows and 24K on 64bit Windows." https://techcommunity.microsoft.com/...ds/bap/723824 (somewhat dated, referencing XP, found in a web search specifying Windows 10) Also, https://superuser.com/questions/1469...inwindows10 4096^2 = 16M threads x 24KB/thread = 384 GB for the thread count overhead it appears may be needed to time a few iterations on an OBD with your software. That's the maximum possible installed costly LRDIMM memory in the Xeon Phi, and 6 times the max in the i51035G1, and 3 times the memory currently installed in any single box accessible to me. That's for the thread management, not the app's data storage or code, or the operating system, or page caching, etc. Maybe that explains why you won't post results; perhaps you can't run it on your old i5 for lack of memory. The above linked techcommunity article concludes with "In any case, applications that create enough threads or processes to get anywhere near these limits should rethink their design, as there are almost always alternate ways to accomplish the same goals with a reasonable number. For instance, the general goal for a scalable application is to keep the number of threads running equal to the number of CPUs (with NUMA changing this to consider CPUs per node) and one way to achieve that is to switch from using synchronous I/O to using asynchronous I/O and rely on I/O completion ports to help match the number of running threads to the number of CPUs." In post 9 you indicate you've run M6972593, on your i5 of unspecified model and memory capacity. 2M fft in prime95 covers to ~39.5M exponent; 512K would to ~10M; 256K to ~5M. So GP_LLT might employ 64K superdigit x 8x8 grammar school pass which would be a very manageable 64 threads. Or 8x8/2 +8/2 = 36. The small number of threads would have hidden there the impact of n^2/2 threads exploding in number for larger exponents, which I regard as a fundamental design flaw if I've understood your latest post correctly, that you launch a separate thread for each of the required grammar school squaring partial products' computation as simultaneously as possible. Spawning so many threads is not free in memory, nor in processor time. Quote:
Quote:
Quote:
I notice you steadfastly refuse to run yourself and disclose well documented benchmarks on well described hardware. Or source code. Or disclose what hardware is required. Any competent author of assembly code ought REMEMBER what instruction set enhancements are required because he studied and coded for them. We have no such difficulties getting such essential information from the authors of the software in common use here. And they respond graciously to feedback both requested or volunteered, whether positive, neutral, or negative. Try it. Is the software even capable of performing the computations you claim it does? Or is it just a scout program for credit card numbers etc. or a new zeroday vehicle? Does it contain a 64Ki fft, but lifted from somewhere else, perhaps now distributed without authorization or attribution? Source code and build instructions address such doubts. I am increasingly left with the impression you don't know what you're talking about, and don't know you don't know, and try to compensate for our reluctance to trust our systems or networks to a piginapoke executable or two or three by bluster. That does not work well around here. (At least with a pig one can hear a squeal and know the species!) Continuing on, assuming for the moment it's a legit program, and just beset by a culture clash of some sort: Suggestions for command line parameters: exponent <base10_integer> LL [<iters>] TF <kmin> <kmax> logfile <filename.txt> implying perform on exponent base10_integer, either LL for iters iterations (or to completion if not specified), or TF from k=kmin through k=kmax, or maybe both, TF first, optionally logging every LL iteration or TF class, to a plain text ASCII logfile named filename.txt or whatever, until completion or program halt, frequently flushing the buffer to disk for minimal loss in case of early program termination, power fail, etc. Plain text logs are standard in GIMPS apps. I don't recall seeing mention whether the program periodically saves a work in progress state for possible resumption if interrupted. If so, launch without any parameters, or perhaps with command line parameter resume, would continue the interrupted work from that saved state. Last fiddled with by kriesel on 20230106 at 23:44 

20230106, 23:47  #33 
"Curtis"
Feb 2005
Riverside, CA
37·157 Posts 
Just stop assuming it's a legit program. Any "author" who badgers people into testing when he refuses to do it himself should be assumed to be misrepresenting his software.

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Any way to pipe GMP mpz_t into PFGW?  hansl  Software  3  20190409 19:44 
Is there any program...  mart_r  Software  2  20091115 20:06 
So you think you can program  rogue  Lounge  5  20091002 15:02 
Program  Primeinator  Information & Answers  5  20090716 21:42 
which program?  drakkar67  Prime Sierpinski Project  14  20051129 06:25 