#45 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
8066₁₀ Posts |
Quote:
Before a new prime find is accepted as real, not just an error, it is checked by running multiple software on multiple hardware types by multiple people. That's the time for LL testing. Credit for the discovery, and prize award, goes to the first person to complete a PRP or LL on the candidate and generate and report the PRP-P or LL-P result. After verification.
https://www.mersenneforum.org/showpo...5&postcount=14

Re multilevel ffts, & has someone parallelized them: I gave you three software examples already. They've been parallelized for years or decades.

Selling your source code that's now documented as ~1000 times too slow to use for DC, and will be relatively slower yet at higher exponents than the current DC level? Too funny. Don't blame the hardware for how slow your software is. Do better. You've already been given the map.

What reference works have you read on parallel programming?

Last fiddled with by kriesel on 2023-01-11 at 00:03
#46 |
If I May
"Chris Halsall"
Sep 2002
Barbados
2²·5·571 Posts |
#47 | ||||
P90 years forever!
Aug 2002
Yeehaw, FL
2²×2,089 Posts |
Quote:
If your program concludes "prime", the EFF will also require you to gather several independent LL tests.
Quote:
Quote:
Note that LL (and PRP) tests are sequential in that each iteration requires the result of the previous iteration.
Quote:
View this entire episode as a learning experience. You probably learned a bit writing your program. If I may, I'd recommend two more possible "lessons learned":
1) Do more research into the state-of-the-art before diving into a major project. If you had done so, you would have learned how currently available programs tackle the problem and realized that significantly improving on them is a tall order.
2) Learn how to do back-of-the-envelope calculations regarding the difficulty of a task. By doing so you'd realize that achieving the next EFF goal would require large investments of time and GREAT luck. Furthermore, you'll spend more in electricity and hardware than the value of the EFF award. Achieving the billion digit EFF goal is not possible at this time.

What an odd analogy.
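As an aside on the remark above that LL tests are sequential, the following is a minimal Python sketch of the Lucas-Lehmer recurrence. It is only an illustration: the function name `lucas_lehmer` is made up here, and it relies on Python's built-in big integers rather than the FFT-based squaring that production programs use, so it is practical for small exponents only.

```python
def lucas_lehmer(p):
    """Return True if the Mersenne number 2**p - 1 is prime (p must be prime)."""
    if p == 2:
        return True          # M2 = 3 is prime; the recurrence below needs p >= 3
    m = (1 << p) - 1         # the Mersenne number Mp
    s = 4                    # LL seed
    for _ in range(p - 2):
        # Each step needs the previous value of s, so the iterations
        # cannot be run in parallel; only the squaring itself can be.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31):
        print(p, lucas_lehmer(p))
```

The loop body is the whole algorithm; all of the parallelism available to a real implementation lives inside the single big-number squaring.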
#48 | |
"Jacob"
Sep 2006
Brussels, Belgium
7·281 Posts |
Quote:
The cool reception you got was because you asked the forum members to run a program but "refused" to give results from that program run on your own machine. You could have compared the speed of your program with Prime95 on your own machine...

Last fiddled with by S485122 on 2023-01-11 at 10:30
#49 | ||
Jan 2023
Alberta, Canada
2²×3 Posts |
Thanks for your honesty on that score. I would definitely agree with your use of the word "modest" here.
Quote:
Thanks.
Quote:
I just found out that the "socket" that houses all Threadripper CPUs, including the 64-core incarnation, has a published average DRAM access rate of 102.4 GB/s, implemented by four 72-bit DRAM channels (eight total, four disabled, apparently by design), which translates to 15,728.64 qwords per microsecond. Since GP_LLT requires approximately 274,945,015,808 qword transfers per iteration for a billion decimal digit prime candidate, that works out to approx. 8.74 seconds per iteration on a system with two 64-core Threadripper CPUs. That's approximately 3,672.38 times slower than it needs to be to finish the BDD LLT in three months.

Somewhat disappointing to be sure, but that's today. Tomorrow may yet hold better prospects.
#50 | |
"Mihai Preda"
Apr 2015
1453₁₀ Posts |
Quote:
#51 | |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×37×109 Posts |
Quote:
One could consider skipping the first few iterations by storing one that is still compact enough to be fast to retrieve, say S4 instead of beginning with S0, but that is approximately a 1-100 ppb savings in practice. (I see prime95's first few iterations run faster than the full-Mp-width later iterations, so there may be even less potential savings.) Up to the first ~25-30 iterations are unaffected by mod Mp. Reading them from storage is likely to be slower and more error prone than generating them in place.

Something to consider is whether it pays to switch to higher than res64 at some point. The number of res64 bits that differ from the error-produced res64 0x02 (CUDALucas is notorious for that) is limited at p~1G, and at somewhat higher p it vanishes.

Following are low-128-bit and low-64-bit residues by LL seed 4 iteration number for M999999937.
Code:
Perlll6.pl v 0.10 Jan 11 2023 (C) 2019-2023 Kriesel
LL seed 4 begin p=999999937
iter 1 x 0x0000000000000000000000000000000e 0x000000000000000e
iter 2 x 0x000000000000000000000000000000c2 0x00000000000000c2
iter 3 x 0x00000000000000000000000000009302 0x0000000000009302
iter 4 x 0x000000000000000000000000546b4c02 0x00000000546b4c02
iter 5 x 0x00000000000000001bd696d9f03d3002 0x1bd696d9f03d3002
iter 6 x 0x0306f7b285eead7d8cc88407a9f4c002 0x8cc88407a9f4c002
iter 7 x 0xef7e6a06ce335de155599f9d37d30002 0x55599f9d37d30002
iter 8 x 0xf850ea776d9b1220f460d65ddf4c0002 0xf460d65ddf4c0002
iter 9 x 0x8f625008a4fdb542e180d8077d300002 0xe180d8077d300002
iter 10 x 0xe649e87ae88f08849bdb491df4c00002 0x9bdb491df4c00002
iter 11 x 0xa8d39cfc7cbf7ba94cebb477d3000002 0x4cebb477d3000002
iter 12 x 0x903f8514817c09730b97d1df4c000002 0x0b97d1df4c000002
iter 13 x 0x93965c5abff22aa9acef477d30000002 0xacef477d30000002
iter 14 x 0x50dc102b27fef87e9cbd1df4c0000002 0x9cbd1df4c0000002
iter 15 x 0xdbc7f9d128e0bf7902f477d300000002 0x02f477d300000002
iter 16 x 0xa64ec20e91d0d5cd0bd1df4c00000002 0x0bd1df4c00000002
iter 17 x 0xb8fad5312c20d5c42f477d3000000002 0x2f477d3000000002
iter 18 x 0x14703a32fe5b4010bd1df4c000000002 0xbd1df4c000000002
iter 19 x 0x7c14bfb0d6eb9042f477d30000000002 0xf477d30000000002
iter 20 x 0x16f06d113397410bd1df4c0000000002 0xd1df4c0000000002
iter 21 x 0x1d9899224ced042f477d300000000002 0x477d300000000002
iter 22 x 0x93d0b2611cb410bd1df4c00000000002 0x1df4c00000000002
iter 23 x 0x2627a70302d042f477d3000000000002 0x77d3000000000002
iter 24 x 0x06ec73f50b410bd1df4c000000000002 0xdf4c000000000002
iter 25 x 0x008f4e642d042f477d30000000000002 0x7d30000000000002
iter 26 x 0x50152290b410bd1df4c0000000000002 0xf4c0000000000002
iter 27 x 0x1dd31a42d042f477d300000000000002 0xd300000000000002
iter 28 x 0x4f35690b410bd1df4c00000000000002 0x4c00000000000002
iter 29 x 0x0d6c8599332f8b2e266149b63b09f206 0x266149b63b09f206
iter 30 x 0xe2fb7a34e167f4fe3f6b4dc8516d2516 0x3f6b4dc8516d2516
iter 31 x 0xaa4dc1ae5a118ae3517e288a43427713 0x517e288a43427713
iter 32 x 0x364993ae63255462369430ee8cf115ac 0x369430ee8cf115ac
Code:
[Jan 12 19:09:10] Worker starting
[Jan 12 19:09:10] Setting affinity to run worker on CPU core #1
[Jan 12 19:09:13] Setting affinity to run helper thread 1 on CPU core #2
[Jan 12 19:09:13] Setting affinity to run helper thread 2 on CPU core #3
[Jan 12 19:09:13] Setting affinity to run helper thread 3 on CPU core #4
[Jan 12 19:09:15] Starting Gerbicz error-checking PRP test of M1168999969 using AVX-512 FFT length 64M, Pass1=2K, Pass2=32K, clm=1, 4 threads
[Jan 12 19:09:15] Preallocating disk space for the proof interim residues file p1168999969.residues
[Jan 12 19:39:34] Error pre-allocating proof interim residues file
[Jan 12 19:39:35] Errno: 28, No space left on device
[Jan 12 19:39:35] DOSerrno: 112
[Jan 12 19:39:35] Will use proof power 8 instead of 10.
[Jan 12 19:39:35] PRP proof using power=8 and 64-bit hash size.
[Jan 12 19:39:35] Proof requires 37.4GB of temporary disk space and uploading a 1315MB proof file.
[Jan 12 19:40:39] M1168999969 interim PRP residue 000000000000001B at iteration 1
[Jan 12 19:40:50] M1168999969 interim PRP residue 000000000000088B at iteration 2
[Jan 12 19:41:00] M1168999969 interim PRP residue 0000000000DAF26B at iteration 3
[Jan 12 19:41:07] M1168999969 interim PRP residue 000231C54B5F6A2B at iteration 4
[Jan 12 19:41:13] M1168999969 interim PRP residue D310B7D97DD4E9AB at iteration 5
[Jan 12 19:41:19] M1168999969 interim PRP residue 2AC0B180838228AB at iteration 6
[Jan 12 19:41:25] M1168999969 interim PRP residue 9B5ACA650265A6AB at iteration 7
[Jan 12 19:41:33] M1168999969 interim PRP residue B47759B0D250A2AB at iteration 8
[Jan 12 19:41:41] M1168999969 interim PRP residue DF36E033DAB69AAB at iteration 9
[Jan 12 19:41:50] M1168999969 interim PRP residue C94525688DC28AAB at iteration 10
[Jan 12 19:42:00] M1168999969 interim PRP residue FC7BEC947CDA6AAB at iteration 11
[Jan 12 19:42:12] M1168999969 interim PRP residue 6346CE367F0A2AAB at iteration 12
[Jan 12 19:42:23] M1168999969 interim PRP residue 628CE0A31369AAAB at iteration 13
[Jan 12 19:42:37] M1168999969 interim PRP residue B332521E7C28AAAB at iteration 14
[Jan 12 19:42:52] M1168999969 interim PRP residue 8FA2E79E4DA6AAAB at iteration 15
[Jan 12 19:43:08] M1168999969 interim PRP residue 4B1EDCC1F0A2AAAB at iteration 16
[Jan 12 19:43:24] M1168999969 interim PRP residue 7CA1EF99369AAAAB at iteration 17
[Jan 12 19:43:42] M1168999969 interim PRP residue 4AD4B787C28AAAAB at iteration 18
[Jan 12 19:44:03] M1168999969 interim PRP residue 9BECD064DA6AAAAB at iteration 19
[Jan 12 19:44:26] M1168999969 interim PRP residue 50E7261F0A2AAAAB at iteration 20
[Jan 12 19:44:53] M1168999969 interim PRP residue 0604619369AAAAAB at iteration 21
[Jan 12 19:45:21] M1168999969 interim PRP residue 9CE1187C28AAAAAB at iteration 22
[Jan 12 19:45:52] M1168999969 interim PRP residue 7D23864DA6AAAAAB at iteration 23
[Jan 12 19:46:25] M1168999969 interim PRP residue 07CC61F0A2AAAAAB at iteration 24
[Jan 12 19:47:00] M1168999969 interim PRP residue 45AE19369AAAAAAB at iteration 25
[Jan 12 19:47:39] M1168999969 interim PRP residue 63B187C28AAAAAAB at iteration 26
[Jan 12 19:48:27] M1168999969 interim PRP residue 28B864DA6AAAAAAB at iteration 27
[Jan 12 19:49:38] M1168999969 interim PRP residue D6C61F0A2AAAAAAB at iteration 28
[Jan 12 19:50:59] M1168999969 interim PRP residue 816D7BEF30D75130 at iteration 29
[Jan 12 19:52:10] M1168999969 interim PRP residue 275063E494808E2A at iteration 30
[Jan 12 19:53:21] M1168999969 interim PRP residue A950797E4050E41B at iteration 31
[Jan 12 19:54:37] M1168999969 interim PRP residue FFB5800303D3F3A4 at iteration 32

https://www.mersenneforum.org/showpo...72&postcount=9

2) I'm not following your data rate math there.
102.4 GB/s * 1qword/8B = 12.8E9 qword/sec; / (1E6 microsec/second) = 12.8E3 qword/microsec = 12800. qword/microsec.
Also, that's probably a marketing figure, not achievable in real useful software.
Even if it meant GiB/sec, that's a factor of 1.024^3, 12800 * 1.07374... ~13744 qword/microsec.
A link for specifications source would be useful.
Mainly though any code for such large undertakings must be much more efficient in read/write from main ram, mainly using cache well, and George already wrote about what is feasible.

Last fiddled with by kriesel on 2023-01-13 at 20:06
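The early rows of the seed-4 residue listing in the post above can be reproduced without any big-number arithmetic, because no mod Mp reduction happens until the sequence value grows to about p bits. The sketch below is not taken from Perlll6.pl; the function names are made up, and the bit-length estimate assumes S_k is approximately (2 + sqrt 3)^(2^k).

```python
import math

MASK64 = (1 << 64) - 1


def early_ll_res64(iterations):
    """Low 64 bits of the unreduced LL sequence (seed 4).

    Only meaningful for iterations before the first mod-Mp reduction.
    """
    s = 4
    out = []
    for _ in range(iterations):
        s = (s * s - 2) & MASK64   # low 64 bits survive squaring exactly
        out.append(s)
    return out


def first_reduced_iteration(p):
    """Estimate the first iteration whose value reaches p bits."""
    bits_per_unit = math.log2(2 + math.sqrt(3))   # about 1.9 bits
    k = 0
    while (2 ** k) * bits_per_unit < p:
        k += 1
    return k


if __name__ == "__main__":
    for i, r in enumerate(early_ll_res64(28), start=1):
        print(f"iter {i} 0x{r:016x}")
    print("first reduced iteration:", first_reduced_iteration(999999937))
```

For p = 999999937 the estimate gives iteration 29, which is where the listing stops showing the trailing `...0002` pattern.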
#52 | ||
Jan 2023
Alberta, Canada
2²×3 Posts |
Quote:
102.4 GB/s = 12,800 qwords per microsecond. Therefore 274,945,015,808 qword transfers takes 10.74 seconds on a dual 64-core CPU system, 4,512.61 times slower than it needs to be. So basically, three and a half orders of magnitude. That would be achievable if every die (or better yet, every core) had its own bus, but that could never happen, right? Just thinking out loud..

Oh, the link, right: Wikichip sTRX4
Quote:
As for PRP, I looked at the math. It's above my pay grade. Might as well try to convince me to pursue a metric tensor simulation. "A man's got to know his limitations" - Clint Eastwood..

P.S. (an hour after the OP; sorry): I believe ECC is sufficient for ensuring the accuracy of the LLT, and CRC32 sufficient for ensuring file integrity.

Last fiddled with by Dr Autonomy on 2023-01-14 at 00:42 Reason: Added a P.S.
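The corrected bandwidth estimate at the top of this post can be checked with a few lines of Python. This is a back-of-the-envelope sketch only: the 102.4 GB/s figure and the transfer count are the ones quoted in the thread, while the 91-day reading of "three months" and the use of 10^9 / log10(2) as the exponent for a billion-digit candidate are assumptions made here, so the final ratio will not match the quoted 4,512.61 exactly.

```python
import math

BYTES_PER_SECOND = 102.4e9             # per-socket figure quoted in the thread
QWORD_BYTES = 8
TRANSFERS_PER_ITER = 274_945_015_808   # qword transfers per iteration, as quoted
SOCKETS = 2

# 102.4 GB/s expressed as qwords per microsecond
qwords_per_us = BYTES_PER_SECOND / QWORD_BYTES / 1e6

# Seconds per iteration if memory bandwidth were the only limit
sec_per_iter = TRANSFERS_PER_ITER / qwords_per_us / 1e6 / SOCKETS

# Assumption: smallest exponent giving a billion decimal digits
exponent = math.ceil(1e9 / math.log10(2))
iterations = exponent - 2              # an LL test runs p - 2 iterations

# Assumption: "three months" taken as 91 days
target_sec_per_iter = 91 * 86400 / iterations
slowdown = sec_per_iter / target_sec_per_iter

if __name__ == "__main__":
    print(f"{qwords_per_us:.1f} qwords/us")
    print(f"{sec_per_iter:.2f} s per iteration")
    print(f"about {slowdown:.0f}x too slow for a 91-day test")
```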
#53 |
Jan 2021
California
23E₁₆ Posts |
And you'd be wrong. ECC won't ensure the integrity of the LLT - it will just make it much more likely that the result is not corrupted due to a memory error, but definitely won't guard against program bugs or any of a number of other things that can go wrong in the hardware during the computation.
#54 | |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2³·857 Posts |
Quote:
For protection from malicious changes SHA3 is likely a far better choice. But be mindful of how you handle the hash; it can also suffer damage and/or be altered. It's tricky.

However none of that is important because there is already a robust system in place for this stuff, known as the Gerbicz error check, that is used for the PRP runs. Unless there is something better to improve on that, then things like CRC and SHA are not needed. But ECC is still necessary IMO.
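To make the CRC versus SHA3 comparison above concrete, here is a small Python sketch using only the standard library (`zlib.crc32` and `hashlib.sha3_256`). The helper names are invented for illustration. Both checksums catch an accidental bit flip; the difference is that CRC32 is a 32-bit non-cryptographic code that is easy to forge, whereas SHA3-256 is designed to resist deliberate tampering.

```python
import hashlib
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC32 of data as 8 hex digits (guards against accidental damage only)."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def sha3_256_hex(data: bytes) -> str:
    """SHA3-256 digest of data as 64 hex digits."""
    return hashlib.sha3_256(data).hexdigest()


if __name__ == "__main__":
    original = b"interim residue file contents"
    damaged = bytearray(original)
    damaged[0] ^= 0x01                 # simulate a single flipped bit
    damaged = bytes(damaged)

    print(crc32_hex(original), crc32_hex(damaged))
    print(sha3_256_hex(original))
    print(sha3_256_hex(damaged))
```

As the post notes, either value still has to be stored and transported somewhere trustworthy, since a damaged or replaced checksum protects nothing.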
#55 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
17602₈ Posts |
Insisting on staying with LL for production testing will prevent acceptance of the program by end users.
Moot while it is orders of magnitude too slow. Fix the design, algorithmic and implementation related speed issues, and all-LL/no-PRP is still a showstopper for many.

Staying with LL only, no PRP, locks in a factor of more than two computing cost disadvantage, for verified result, inherently, as a number theoretic bound on run time. A factor of two speed disadvantage is enough to deter widespread usage. (Gpuowl replaced cllucas on AMD GPUs while still LL based, before the GEC was posted by R. Gerbicz.)

At the GIMPS project level, LL first testing is deprecated, and LL DC is discouraged too; PRP with proof generation is faster and more reliable, given observed error rate of the best available code on the actual hardware fleet. Without a strong error detection method capable of >99% error detection rate on the fly during the test, LL is essentially wasted cycles in practice in production testing. No such high error detection rate check is known for the LL sequence.

That said, a great deal of programming related to improving performance of GP_LLT applies to either LL or PRP. Both need fast squaring, and a variety of fft lengths to do it.

A possible niche for LL-only GP_LLT is for new-prime verification only, not discovery, where computation rate disadvantage of two or slightly more could be tolerated.

Another small niche is for users who are irrational, preferring the LL theoretically definitive test that is only of order 98% reliable, over the PRP test that is essentially 100% reliable, and despite PRP offering several orders of magnitude faster verification. (Yes, literally, hundreds or thousands of times faster; optimal proof power 9 or 10 or 11, equate to ~512 or 1024 or 2048 times verification speed advantage, even assuming LL would only ever require a single retest.)

If GP_LLT were made equally as fast per test as prime95 PRP or LL, I would stay with prime95 PRP on CPU, including on the ECC-equipped systems, because of the >2:1 advantage per exponent completed for the project, due to PRP proof generation and verification at low cost relative to full retest and superior error detection rate. (Automatic work flow via PrimeNet API is a convenience, not a decisive requirement, although it might be required for such as curtisc.) Also I run PRP first tests on GPU with gpuowl, because of overall speed and far superior reliability. Haven't run CUDALucas in production for years.

To displace prime95 or gpuowl, an LL-only program would need to be more than twice as fast per completed primality test as the PRP&proof capable program it competes with.

Last fiddled with by kriesel on 2023-01-14 at 09:27
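For readers unfamiliar with the GEC mentioned in the post above, the following is a toy Python sketch of the idea behind the Gerbicz error check for a chain of modular squarings. It is an illustration under simplifying assumptions, not the code used by prime95 or gpuowl: the function name and parameters are invented, the check runs after every block (real programs check far less often, because each check costs about a block's worth of squarings, and they roll back to a saved state on failure), and the error is injected artificially.

```python
def squarings_with_gerbicz_check(n, base, num_blocks, block_len, corrupt_at=None):
    """Square `base` repeatedly mod n and verify each block of squarings.

    With u_i = base^(2^i) mod n and L = block_len, keep the running product
    d_t = u_0 * u_L * ... * u_(t*L).  An error-free run always satisfies
    d_(t+1) == u_0 * d_t^(2^L)  (mod n).
    Returns (last residue, True if every check passed).
    """
    u0 = base % n
    u = u0
    d = u0
    iteration = 0
    for _block in range(num_blocks):
        d_prev = d
        for _step in range(block_len):
            u = u * u % n
            iteration += 1
            if iteration == corrupt_at:
                u ^= 1                  # simulated hardware error: flip one bit
        d = d_prev * u % n
        expected = u0 * pow(d_prev, 1 << block_len, n) % n
        if d != expected:
            return u, False             # mismatch: an error occurred in this block
    return u, True


if __name__ == "__main__":
    n = (1 << 127) - 1                  # M127, a known Mersenne prime
    print(squarings_with_gerbicz_check(n, 3, 8, 16)[1])                  # clean run
    print(squarings_with_gerbicz_check(n, 3, 8, 16, corrupt_at=32)[1])   # injected error
```

The check works because the squaring chain has this multiplicative structure; as the post says, no comparably strong check is known for the LL recurrence with its "minus 2" step.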