mersenneforum.org New LLT program in the pipe..

2023-01-04, 00:04   #23
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2×17×227 Posts

Quote:
 Originally Posted by Dr Autonomy For the record, Grade school squaring is NOT time complexity O(p^2), it's O(T_p). The latter is approx. half the former. GP_LLT was written to achieve that "impossible" task..
Small differences in a scaling law's leading constant are of little interest compared to the large difference in order between ~k1·w^2 and k2·w·log w·log log w, where w is the number of words required for an exponent p, for LARGE p. The factor-of-two saving, and the considerable overhead of the more complex algorithm, are both overwhelmed at sufficiently large p.
From Don Knuth on at least, grade school multiplication has been understood as ~c·m·n complexity, where m and n are the operand lengths. In squaring, m = n, hence c·m^2. Yes, almost half the digit-by-digit multiplies can be skipped: compute each cross product once and add it in twice, after initializing an array with the diagonal terms, as I did back in the 80's, or shift it left by one bit and add it once if that is faster. A factor of two is little help when countered by an order disadvantage of thousands or millions for the operands of interest. https://en.wikipedia.org/wiki/Big_O_...mmon_functions
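The symmetry trick described above (diagonal terms once, cross terms doubled) can be sketched as follows. This is an illustrative sketch only, not GP_LLT's actual code; the function name and base are my own choices. The work is n(n+1)/2 digit multiplies instead of n^2, which is still O(n^2).

```python
def gradeschool_square(digits, base=10):
    """Square a little-endian base-`base` digit array by the grade-school
    method, exploiting symmetry: each cross product a[i]*a[j] (i < j) is
    computed once and doubled; diagonal terms a[i]^2 are added once."""
    n = len(digits)
    acc = [0] * (2 * n)
    for i in range(n):
        acc[2 * i] += digits[i] * digits[i]          # diagonal term, once
        for j in range(i + 1, n):
            acc[i + j] += 2 * digits[i] * digits[j]  # cross term, doubled
    # carry propagation back to single digits
    carry = 0
    for k in range(2 * n):
        carry, acc[k] = divmod(acc[k] + carry, base)
    return acc  # little-endian digits of the square
```

For example, squaring 123 (digits [3, 2, 1]) yields the digits of 15129.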

Still waiting to see any real signs of appreciation of the feedback that you requested.

Just because there are 2n or 4n sets of registers in a processor core does not mean one gets 2n or 4n computational throughput. Good code in this area tends to be memory bound. I have systems with 256 and 272 hyperthreads (Xeon Phi cores are x4 HT).

Last fiddled with by kriesel on 2023-01-04 at 00:55

2023-01-04, 00:37   #24
charybdis

Apr 2020

1023₁₀ Posts

Quote:
 Originally Posted by Dr Autonomy For the record, Grade school squaring is NOT time complexity O(p^2), it's O(T_p). The latter is approx. half the former.
Wrong. It is complexity O(n^2). It is also complexity O(T_n). Or complexity O(1000000·n^2) for that matter. I suggest you read up on big-O notation.
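The point here is that big-O absorbs constant factors: the triangular number T_n = n(n+1)/2 is within a constant factor of n^2 (the ratio tends to 1/2), so O(T_n) and O(n^2) name the same complexity class. A quick numeric check:

```python
# T_n / n^2 = (n + 1) / (2n) tends to 1/2 as n grows; it is bounded above
# and below by constants, so T_n work and n^2 work are the same big-O class.
for n in (10, 1000, 100000):
    T = n * (n + 1) // 2
    print(n, T / n ** 2)  # 0.55, 0.5005, 0.500005, ... -> 1/2
```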

2023-01-04, 01:47   #25
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

1A73₁₆ Posts

Quote:
 Originally Posted by Dr Autonomy Grade school squaring is NOT time complexity O(p^2), ...
Yeah it is.

But no matter, it still doesn't stand a chance when compared to FFT.

If FFTs scare you, then it would be useful to have code doing NTT LLs. Even though LLs are not much use nowadays for MP testing, they are still good for verification when/if a new MP is found.
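The NTT idea can be sketched like so. This is a toy, not production LL code: it does an exact integer convolution mod the small NTT-friendly prime 998244353 (primitive root 3), which sidesteps floating-point round-off entirely. A real LL implementation would need far larger transform and word sizes and a Mersenne-friendly weighting; the function names here are mine.

```python
MOD = 998244353  # 119 * 2^23 + 1, a common NTT prime
ROOT = 3         # primitive root of MOD

def ntt(a, invert=False):
    """In-place iterative number-theoretic transform; len(a) a power of two."""
    n = len(a)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)  # modular inverse root
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a

def ntt_square(digits, base=10):
    """Square a little-endian digit array via transform, pointwise square,
    inverse transform, then carry propagation. Exact in integer arithmetic."""
    n = 1
    while n < 2 * len(digits):
        n <<= 1  # pad so the cyclic convolution has no wraparound
    a = ntt(list(digits) + [0] * (n - len(digits)))
    a = ntt([x * x % MOD for x in a], invert=True)
    carry, out = 0, []
    for x in a:
        carry, d = divmod(x + carry, base)
        out.append(d)
    while carry:
        carry, d = divmod(carry, base)
        out.append(d)
    return out  # little-endian digits of the square
```

Squaring 123 this way reproduces the digits of 15129, with every intermediate value an exact integer.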

Are you up to it?

Last fiddled with by retina on 2023-01-04 at 03:33

2023-01-04, 03:20   #26
chalsall
If I May

"Chris Halsall"
Sep 2002

10110000100001₂ Posts

Quote:
 Originally Posted by retina Are you up to it?
An early mentor of mine introduced me to the concept of the FFT.

The idea of using it to do large multiplication blew my mind at the time. What people are doing with it now continues to blow my mind.

2023-01-04, 21:23   #27
Dr Autonomy

Jan 2023

2²·3 Posts

"Methodology Proof Of Concept" - "MPOC".

Compares timings for FFT and Karatsuba, of varying block sizes, against grade-school multiplication, as a percentage.
Attached Files
 MPOC.zip (22.1 KB, 28 views)

2023-01-04, 22:15   #28
Uncwilly
6809 > 6502

"""""""""""""""""""
Aug 2003
101×103 Posts

25416₈ Posts

What's with the posting of the .exe? Can't you post some source code? Please. Or instead post timings of selected exponents.
2023-01-04, 22:27   #29
jnml

Feb 2012
Prague, Czech Republic

2×101 Posts

Quote:
 Originally Posted by Uncwilly What's with the posting of the .exe? Can't you post some source code? Please. Or instead post timings of selected exponents.
I for one suggest deleting any post that links to a binary executable, zipped/archived or not, that is not cryptographically signed by a commonly trusted party, as soon as a moderator sees it.

(FTR: I don't use Windows anywhere, so it's not about me.)

2023-01-06, 20:43   #31
Dr Autonomy

Jan 2023

12₁₀ Posts

Quote:
 Originally Posted by kriesel The highest performance CPU hardware I have currently (Xeon Phi 7250 68 core & X4 HT) benchmarked at 30.7 iters/sec at 56Mi in prime95 (without hyperthreading, which was also benchmarked, and at almost every fft length supported, slower). That's 4.12 times the performance of the i5-1035G1. That would reduce GP_LLT's OBD primality test run time estimate to ~13,955 years. Using dimension 3072^2 rather than 4096^2 in the grammar school pass would reduce that estimate to ~7,850 years.
I think your analysis does not fully take into account the fact that ALL of what you are calling the "superdigit multiplications" in GP_LLT are performed concurrently - i.e. in parallel. They are implemented as software threads which, if I'm not mistaken, the OS will attempt to implement, as far as possible, as hardware threads.

But maybe I'm wrong. The way to find out how good your hypothetical estimate is would be to simply run a few (say, three) iterations of any Mersenne number you like under GP_LLT on your "Xeon Phi 7250 68 core" machine, exit, then view the resultant "rdl" file with "Proof.exe". It will tell you EXACTLY how long each iteration took, in microseconds (actually, tenths of a microsecond).

What's difficult about that? The hardest part would be uploading to this forum the resultant screenshot, or better yet, both the "rdl" and "snp" files. That's all I'm looking for here.

2023-01-06, 23:12   #32
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·17·227 Posts

Quote:
 Originally Posted by kriesel Benchmarking in prime95 v30.8b15, 64Ki length, 4 cores 1 worker gives 0.09 msec/iter; 4 cores 4 workers gives 0.26 msec/iter each in Windows on an otherwise-idle i5-1035G1 ("modern" 10th generation AVX512 capable laptop i5 processor)
That's an example of 4 threads on a single task having efficiency less than one: 0.09 msec × 4 cores = 0.36 core-msec, vs 0.26 core-msec for single-core tasks; ~72% thread efficiency at x4.
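Spelling out that arithmetic (values taken from the benchmark quoted above): a 4-core worker burns 0.36 core-msec per iteration where four independent single-core workers each spend 0.26 core-msec, so the 4-thread run achieves roughly 72% of ideal scaling.

```python
# Parallel efficiency of the quoted prime95 benchmark:
# core-msec consumed per iteration by one 4-core worker vs one 1-core worker.
one_worker_ms = 0.09   # msec/iter, 4 cores on one worker
cores = 4
per_core_ms = 0.26     # msec/iter for each of 4 single-core workers
efficiency = per_core_ms / (one_worker_ms * cores)
print(f"{efficiency:.0%}")  # ~72%
```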

Quote:
 Estimate GP_LLT iteration time as 0.26 msec/64Ki_superdigit x 524800 superdigit_multiplies/iteration / 4 cores = 34.112 seconds/iteration.
(Assumed perfect multithread efficiency of utilization of the 4 physical cores for GP_LLT.) We cannot justify assuming /8 hyperthreads, because caches, memory data lines, etc. are not duplicated; only registers and not much else, IIRC. Even Intel only claims ~30% gain from hyperthreading, and that depends on there being some idle time, which well designed number crunching software avoids. That would be 1.3x, not 2x. https://www.intel.com/content/www/us...threading.html

Quote:
 The highest performance CPU hardware I have currently (Xeon Phi 7250 68 core & X4 HT) benchmarked at 30.7 iters/sec at 56Mi in prime95 (without hyperthreading, which was also benchmarked, and at almost every fft length supported, slower). That's 4.12 times the performance of the i5-1035G1.
That 4.12 times performance ratio is the MEASURED AGGREGATE THROUGHPUT, of 68 slower-clocked lower-wattage x4 HT cores on the 7250 Xeon Phi & very fast MCDRAM, divided by the measured aggregate throughput of 4 faster clocked newer-design x2 HT cores on the i5-1035G1 & SODIMM ram. It would be wrong to multiply that figure by number of cores ratio and number of hyperthreads ratio between the two processors. The prime95 program can support and benchmark dozens of cores and hundreds of threads provided for by this hardware, and that's what the 4.12 ratio is based on.

Quote:
 Originally Posted by Dr Autonomy I think your analysis does not fully take into account the fact that ALL of what you are calling the "superdigit multiplications" in GP_LLT are performed concurrently - i.e. in parallel. They are implemented as software threads which, if I'm not mistaken, the OS will attempt to implement, as far as possible, as hardware threads.
Clearly you are wrong about the effect of launching many more threads in parallel than the hardware provides simultaneous execution support for. It's likely to cause thrashing and suboptimal performance. A better strategy is to do launches in batches of ~n to ~n×HT, where n is the number of physical cores, or perhaps keep n×(HT+1) in the queue until work for the iteration is nearly completed. (Prime95 does a thorough job of keeping the available hardware well utilized with a single thread per core during fft operations, and the cache hit rate high; hyperthreading, that is, doubling or more the number of threads per physical core, very frequently results in lowered performance, probably due to memory bandwidth limits and lowered cache efficiency.)
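The batching strategy suggested above can be sketched with a fixed worker pool sized to the core count, instead of one OS thread per partial product. This is an illustrative sketch of the alternative design, not GP_LLT's code; the function and block representation are my own.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def partial_products(blocks):
    """Compute all b*(b+1)/2 block products of a symmetric grade-school
    squaring pass, but queue them on a pool of ~cpu_count threads rather
    than spawning one thread per product."""
    b = len(blocks)
    tasks = [(i, j) for i in range(b) for j in range(i, b)]
    results = {}
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        futures = {pool.submit(lambda i=i, j=j: blocks[i] * blocks[j]): (i, j)
                   for i, j in tasks}
        for fut, ij in futures.items():
            results[ij] = fut.result()
    return results  # keyed by (i, j), i <= j
```

The pool keeps thread count, stack memory, and context-switch overhead bounded regardless of how many partial products an iteration needs.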

4096^2 = 16Mi threads at once, or ~half that, for a few iterations on an OBD LL? For an app that is performance-critical, yikes! What do you think that does to the caches when millions of threads compete for a relative few cores and get cycled in and out before completion? Or can you guarantee they each complete within a timeslice? Or what it does to total RAM usage?

"A basic kernel stack is 12K on 32-bit Windows and 24K on 64-bit Windows."
https://techcommunity.microsoft.com/...ds/ba-p/723824 (somewhat dated, referencing XP, found in a web search specifying Windows 10)
Also, https://superuser.com/questions/1469...-in-windows-10

4096^2 = 16M threads x 24KB/thread = 384 GB for the thread count overhead it appears may be needed to time a few iterations on an OBD with your software. That's the maximum possible installed costly LRDIMM memory in the Xeon Phi, and 6 times the max in the i5-1035G1, and 3 times the memory currently installed in any single box accessible to me. That's for the thread management, not the app's data storage or code, or the operating system, or page caching, etc. Maybe that explains why you won't post results; perhaps you can't run it on your old i5 for lack of memory.
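Checking that memory estimate (figures from the thread: 4096^2 threads, 24 KiB kernel stack per thread on 64-bit Windows):

```python
# Kernel stack overhead alone for one thread per 4096x4096 superdigit product.
threads = 4096 ** 2        # 16,777,216 threads
kstack = 24 * 1024         # bytes of kernel stack per thread, 64-bit Windows
print(threads * kstack / 2 ** 30, "GiB")  # 384.0 GiB
```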

The above linked techcommunity article concludes with
"In any case, applications that create enough threads or processes to get anywhere near these limits should rethink their design, as there are almost always alternate ways to accomplish the same goals with a reasonable number. For instance, the general goal for a scalable application is to keep the number of threads running equal to the number of CPUs (with NUMA changing this to consider CPUs per node) and one way to achieve that is to switch from using synchronous I/O to using asynchronous I/O and rely on I/O completion ports to help match the number of running threads to the number of CPUs."

In post 9 you indicate you've run M6972593, on your i5 of unspecified model and memory capacity.
2M fft in prime95 covers to ~39.5M exponent; 512K would cover to ~10M; 256K to ~5M.
So GP_LLT might employ a 64K superdigit x 8x8 grammar school pass, which would be a very manageable 64 threads. Or 8x8/2 + 8/2 = 36. The small number of threads would have hidden the impact of the n^2/2 thread count exploding for larger exponents, which I regard as a fundamental design flaw if I've understood your latest post correctly: that you launch a separate thread for each of the required grammar school squaring partial products' computations, as simultaneously as possible. Spawning so many threads is not free in memory, nor in processor time.
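The growth being objected to: with b superdigit blocks, a symmetric grade-school pass needs b(b+1)/2 products, one thread each under the design as described.

```python
# Thread count per iteration if each symmetric partial product gets a thread.
for b in (8, 64, 4096):
    print(b, "blocks ->", b * (b + 1) // 2, "threads")
# 8 -> 36 (manageable); 64 -> 2,080; 4096 -> 8,390,656 (not manageable)
```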

Quote:
 But maybe I'm wrong.
That seems clear to me. But perhaps you've been unclear in your explanation of how the program operates. Something that source code would help with.

Quote:
 The way to find out how good your hypothetical estimate is, would be to simply run a few (say, three) iterations of any Mersenne number you like under GP_LLT on your "Xeon Phi 7250 68 core" machine, exit, then view the resultant "rdl" file with "Proof.exe". It will tell you EXACTLY how long each iteration took, in microseconds (actually, tenths of a microsecond).
Somehow, the software you posted as an attachment and described as providing milliseconds/iteration timing now provides 100 ns/iteration timing resolution? Re iteration times, I'm confused. Post 9 says milliseconds (units?). Post 21 says microseconds. Post 31 says tenths of microseconds (resolution?). This is the sort of circus that ensues from withholding source code and other documentation.

Quote:
Planning. Preparation, both for possibly hostile software, and for thorough investigation of speed and accuracy, and GP_LLT data file content and structure, etc, at multiple exponents, well spaced, for empirical determination of run time scaling, both for GP_LLT and for prime95 & perhaps others in the same sandbox instance. Other priorities. And I'm under no obligation to do your software testing and black box debugging for you.

I notice you steadfastly refuse to run yourself, and disclose well documented benchmarks on well described hardware. Or source code. Or disclose what hardware is required. Any competent author of assembly code ought to REMEMBER what instruction set enhancements are required, because he studied and coded for them. We have no such difficulties getting such essential information from the authors of the software in common use here. And they respond graciously to feedback, both requested and volunteered, whether positive, neutral, or negative. Try it.
Is the software even capable of performing the computations you claim it does? Or is it just a scout program for credit card numbers etc. or a new zero-day vehicle? Does it contain a 64Ki fft, but lifted from somewhere else, perhaps now distributed without authorization or attribution? Source code and build instructions address such doubts.

I am increasingly left with the impression that you don't know what you're talking about, don't know that you don't know, and try to compensate with bluster for our reluctance to trust our systems or networks to a pig-in-a-poke executable or two or three. That does not work well around here. (At least with a pig one can hear a squeal and know the species!)

Continuing on, assuming for the moment it's a legit program, and just beset by a culture clash of some sort:

Suggestions for command line parameters:
-exponent <base10_integer>
-LL [<iters>]
-TF <kmin> <kmax>
-logfile <filename.txt>
implying perform on exponent base10_integer, either -LL for iters iterations (or to completion if not specified), or TF from k=kmin through k=kmax, or maybe both, TF first,
optionally logging every LL iteration or TF class, to a plain text ASCII logfile named filename.txt or whatever, until completion or program halt, frequently flushing the buffer to disk for minimal loss in case of early program termination, power fail, etc.
Plain text logs are standard in GIMPS apps.

I don't recall seeing mention whether the program periodically saves a work in progress state for possible resumption if interrupted. If so, launch without any parameters, or perhaps with command line parameter -resume, would continue the interrupted work from that saved state.

Last fiddled with by kriesel on 2023-01-06 at 23:44

 2023-01-06, 23:47 #33 VBCurtis     "Curtis" Feb 2005 Riverside, CA 37·157 Posts Just stop assuming it's a legit program. Any "author" who badgers people into testing when he refuses to do it himself should be assumed to be misrepresenting his software.

