#12
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts
(edits in quote are in bold)
Quote:
An RTX 2080 run indicates ~2143 GHD/day; a GTX 1650 at such levels is ~753 GHD/day; a GTX 1650 Super ~954 GHD/day. A GTX 1650 is estimated to complete TF from 88 to 92 bits in 12.8 months; a GTX 1650 Super in 10.1 months.

Mlucas stage 1 files for OBD are ~405 MB each, so for p & q pairs, ~810 MB per exponent; only two exponents' sets fit in the 2 GB of Dropbox free space. Google Drive at 15 GB is roomier, allowing 18 pairs.

It would be good to avoid a surplus of stage-1-completed exponents and a paucity of stage 2 attempts. Operationally, it would be simplest if stage 1 and stage 2 occur on the same system, or at least on the same LAN (avoiding the cloud-storage shuffle and multiuser coordination on a single exponent). There is some throughput advantage to doing stage 1 on 16 GiB systems and running exclusively stage 2 on big-ram systems.

WSL vs. native Linux performance should be investigated before committing to lengthy runs. (And I'm seeing substantial indications of a perhaps WSL-related memory leak on a test system running Mlucas v20.1 P-1 on 665M: of order 100 MB/hour in the first day or two after a system restart. I haven't attempted an mprime-on-WSL comparison on the same system yet.)

Last fiddled with by kriesel on 2021-10-30 at 14:27
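For reference, a small back-of-the-envelope sketch (Python, illustrative only) of the arithmetic behind the figures quoted above; the GHD/day rates, the 405 MB stage 1 file size, and the 12.8-month GTX 1650 estimate come from the quote, and everything else is simple division:

```python
# Back-of-the-envelope arithmetic for the figures in the quote above.
S1_FILE_MB = 405            # one Mlucas stage 1 residue file for an OBD exponent (from the post)
PAIR_MB = 2 * S1_FILE_MB    # p<exponent> + q<exponent> files, ~810 MB per exponent

for service, capacity_gb in [("Dropbox free tier", 2), ("Google Drive free tier", 15)]:
    pairs = int(capacity_gb * 1024 // PAIR_MB)
    print(f"{service}: ~{pairs} exponent pair(s) fit")

# TF 88-92 bit time estimates scale inversely with GHD/day throughput (rates from the post).
ghd_per_day = {"RTX 2080": 2143, "GTX 1650 Super": 954, "GTX 1650": 753}
months_gtx1650 = 12.8       # quoted estimate for a GTX 1650 to finish 88-92 bits
for gpu, rate in ghd_per_day.items():
    print(f"{gpu}: ~{months_gtx1650 * ghd_per_day['GTX 1650'] / rate:.1f} months for TF 88-92")
```

Run as written, this reproduces the 2-pair / 18-pair storage figures and the 10.1-month GTX 1650 Super estimate from the quote.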
#13
Jun 2003
The Computer
401 Posts
It also raises the question of whether P-1 should be done right after reaching 92 bits, or whether we should wait until one system can do both P-1 stages and PRP consecutively. Since gpuowl 7 requires this, and perhaps future CPU-based software will as well, I want to avoid any redundancy, or at least any failure to optimize, should it be decided that P-1 and PRP should run on the same system, in the same way Ken mentioned that P-1 stage 1 and stage 2 should be on the same system.
#14
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
17045₈ Posts
Quote:
CUDAPm1 is currently moot for OBD P-1. It has demonstrated in light testing an inability to perform both stages above ~430M exponent, or to resume from a save file above some higher exponent. It is limited by its source code to 2^31-1. It is slower than Gpuowl on the same hardware. There are numerous bugs. It is no longer being maintained. Etc.

Gpuowl is currently moot for OBD P-1. No existing version of Gpuowl supports OBD P-1. (So consecutive P-1/PRP as in gpuowl v6.x, and combined P-1 stage 1/PRP powering of 3 as in gpuowl v7.x, are both moot. Gpuowl development has ceased, at least for now, with the last GitHub commit occurring in late March 2021. Note that on the same hardware and exponent, on Windows, gpuowl v7.x combined P-1/PRP has been observed to be slower than gpuowl v6.11 standalone P-1 followed serially by PRP, for wavefront or 100Mdigit or ~gigabit exponents IIRC. Some versions of gpuowl nominally support OBD PRP, but they lack PRP proof, and the PRP time is ~5 years/exponent on a Radeon VII.)

I think mprime/prime95 will remain moot for OBD P-1 and PRP also. Only with AVX512 does prime95 support its maximum 64M fft, which is 1/3 the size needed for OBD P-1 or PRP. George may implement parallel P-1/PRP in prime95/mprime in the future, but is likely to support only up to ~64M fft length (~1.17G exponent). He's not interested in coding for cases that would take decades to run to completion, and there are lots of things competing for George's attention.

Mlucas recently added standalone P-1 capability. It might someday get combined P-1/PRP capability, but not until after it gets PRP proof generation capability. Mlucas now supports Mersenne exponents up to 2^33, but OBD PRP would take decades on affordable/available hardware in either case.

CUDALucas is moot for OBD PRP. It's LL only, has no Jacobi-symbol-based error check, per the source code maxes out at 2^31-1, and in practice crashes above ~1.4G exponent. It is not being maintained. If fixed, it would likely be slower than gpuowl on the same NVIDIA hardware and exponent. https://www.mersenneforum.org/showpo...6&postcount=14

Demonstrating feasibility of OBD standalone P-1 appeals to me. Usage of software informs future development, finds bugs, and independently confirms resolution of bugs. Suggesting running both P-1 stages on the same system was motivated by an operational-efficiency consideration, not a feasibility or primarily performance consideration. Consider the following alternate scenarios, all of which presume Mlucas v20.x or later:

1) Same system, standalone P-1: a system with plenty of ram runs stage 1 using little of the ram, then stage 2 using most of the ram. No user intervention or file transfer between systems or between participants. Simplest.

2) Same LAN, mixed systems, parallelism at the stage level: ~16 GiB systems run stage 1; ~64+ GiB systems run stage 2. Manually copy files across the LAN via a private server as needed, or have different systems' program instances run in the same private LAN server folders sequentially (which would in my case involve lots of fiddling with jobs.sh and mlucas.cfg at the stage 1/stage 2 switch because of disparate processor instruction sets, core counts, etc.). Less simple; somewhat better utilization of high-ram systems; greater total throughput by letting 16 GiB systems help.

3) Same LAN, mixed systems, parallelism at the stage level and more: ~16 GiB systems run stage 1. Perhaps two run the same first exponent and bounds, and then the resulting p<exponent> files are compared, to try to detect stage 1 error / crudely assess reliability. If no match, try a third afterward. ~64+ GiB systems run stage 2. Manually copy stage 1 files across the LAN via a private server to the stage 2 systems as needed. Split stage 2 across two or more large-ram systems, with B2 subranges chosen to roughly equalize run times. Even less simple; somewhat better utilization of high-ram systems; greater throughput by letting 16 GiB systems help; reduced latency for a given task.

4) Collaboration among participants via internet on a single exponent: One participant (A) runs stage 1 solo, on a ~16 GiB system. Stage 1 is not amenable to splitting or parallelism at the systems level. Participant A posts the stage 1 result files p<exponent>, and optionally q<exponent>, on a cloud drive accessible by others, such as Dropbox or Google Drive with sharing enabled. Multiple participants have previously prequalified hardware by run-time scaling tests and posted those results. The other participants must basically trust that A did stage 1 accurately, and did not just generate random bits. (Or a helpful or cautious participant could run stage 1 again to the same B1 and compare the resulting files.) By mutual agreement, participants A, B (, etc.?) subdivide the stage 2 bounds of the single exponent, for roughly equal stage 2 run time on probably disparate hardware. The other participants download A's stage 1 result files for their respective parallel stage 2 runs on different B2 subranges. They then work in parallel to complete the OBD P-1 stage 2 and report the results of each distinct B2 subrange. This is probably the fastest way to get one OBD P-1 completed, and the most complicated. It might be the only feasible way for some participants to constructively contribute in under a year or two of run time with the hardware they have.

The collaboration scenario is the same model Ernst has chosen for the F33 P-1 effort, which is ~an order of magnitude larger than one OBD P-1. Participation in that requires >120 GiB ram. It may also be how one of the earliest completed OBD P-1 runs gets done.

Last fiddled with by kriesel on 2021-11-04 at 16:23
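As an illustration of the B2 splitting idea in scenarios 3 and 4, here is a minimal sketch; the system names, relative speeds, and bounds are hypothetical, and only the notion of dividing [B1, B2] in rough proportion to each system's throughput follows from the post:

```python
# Hypothetical sketch: split a P-1 stage 2 range [B1, B2] across systems so run
# times roughly equalize. Stage 2 work is roughly proportional to the number of
# primes in a subrange; for subranges large relative to B1, splitting by length
# in proportion to throughput is a reasonable first approximation.

def split_b2(b1, b2, speeds):
    """speeds maps system name -> relative throughput (arbitrary units)."""
    total = sum(speeds.values())
    bounds, lo = {}, b1
    for name, s in speeds.items():
        hi = lo + (b2 - b1) * s / total
        bounds[name] = (int(lo), int(hi))
        lo = hi
    bounds[name] = (bounds[name][0], b2)   # pin the last subrange exactly at B2
    return bounds

# Illustrative bounds and relative speeds only; not the actual GPU72 OBD bounds.
print(split_b2(b1=8_000_000, b2=400_000_000, speeds={"systemA": 1.0, "systemB": 1.4}))
```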
#15
"Viliam Furík"
Jul 2018
Martin, Slovakia
3²×89 Posts
Quote:

#16
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts
Run time scaling completed twice, from 10M to 500M or higher, plus short stage 1 OBD interim timing, yields multiple extrapolations to OBD GPU72-bounds P-1 (both stages) completing within one year of continuous running, if not loaded with too much other work simultaneously. About 10-11 months solo looks feasible on this dual Xeon E5-2697v2 system with 128 GiB ram, on Ubuntu under WSL1 on Win10. The initial two attachments at the link below describe some possibilities for yet faster run time. (Run time scaling results for additional systems may be posted at the same link later.)
See https://www.mersenneforum.org/showpo...5&postcount=17 for more background.
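A schematic of the run-time scaling extrapolation described above: time per squaring is fitted to a power law t ≈ c·p^k from measurements at smaller exponents, then extrapolated to an OBD exponent. The sample timings and the B1 value are made up for illustration; only the approach follows the post:

```python
import math

# Hypothetical per-squaring timings (seconds) at smaller exponents; only the
# extrapolation approach follows the post, the numbers here are made up.
samples = {10_000_000: 0.0003, 100_000_000: 0.004, 500_000_000: 0.022}

# Fit t = c * p^k by least squares in log-log space.
xs = [math.log(p) for p in samples]
ys = [math.log(t) for t in samples.values()]
n = len(xs)
k = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (n * sum(x * x for x in xs) - sum(xs) ** 2)
c = math.exp((sum(ys) - k * sum(xs)) / n)

p_obd = 3_321_928_171                  # one of the OBD candidate exponents
iter_time = c * p_obd ** k             # extrapolated seconds per squaring
stage1_squarings = 1.44 * 60_000_000   # stage 1 takes ~1.44*B1 squarings; B1 = 60M is a placeholder
print(f"fitted scaling exponent k ~ {k:.2f}")
print(f"extrapolated stage 1 time ~ {iter_time * stage1_squarings / 86400:.0f} days")
```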
#17
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts
There is a November 28 patch for Mlucas v20.1.1.
Mersenne.ca is now accepting P-1 results from Mlucas and Gpuowl for p > 10^9, via bulk factor file upload. As far as I know, there is no reservation system for P-1 work for p > 10^9 at this time, other than posting in this thread.

Current TF status is shown at https://www.mersenne.ca/obd. Current indicated Level is 22.06.
3321928171: 30% complete from 91 to 92, ETA to 92 2022-01-31, by johnny_jack
3321928307: 83% complete from 90 to 91, ETA to 91 2021-12-04, & will continue immediately to 92, by kriesel
3321928319: 31% complete from 90 to 91, ETA to 91 2021-12-26, & will continue immediately to 92, by kriesel
3321938373: 1% complete from 89 to 90, ETA to 90 2022-01-21, by kriesel

There are 13 others in progress at bit levels 86-87 to 88-89 by kriesel, as part of an effort to go expeditiously to OBD Level 24 or higher. Faster GPUs are running the higher of the bit levels indicated above; slower GPUs and Colab sessions are running the lower of the bit levels mentioned above. All exponents 3321928171 to 3321929987 (currently 33 exponents) are being taken higher, with the expectation that some will have a factor discovered before reaching level 26 (TF to 92 bits completed on 26 surviving unfactored exponents) or level 28 (28 exponents TF'd to 92 bits and through stage 2 P-1 to good bounds without factors found). These P-1 survivors would be preparation for someone, someday, after considerable additional hardware advance, to go to level 29 (29 OBD PRP/GEC/proof-gen) and level 30 (30 OBD PRP/GEC/proof-gen & verified), using hardware that does not yet exist and won't for years, and software that does not yet exist either.

So at the moment:
OBD TF completed to 91, reserved to 92 bits: 1
OBD TF completed to 92 bits: 0
Systems with qualification(s) completed for OBD P-1: 1
Exponents reserved for P-1: 0
Exponents completed thru stage 1 P-1: 0
Exponents completed thru stage 2 P-1: 0

There's no expectation of being able to qualify less-than-supercomputer hardware for PRP, and PRP/GEC/proof software for OBD or higher does not exist, so, as expected, systems with qualifications completed for OBD PRP etc. = 0.

Last fiddled with by kriesel on 2021-11-29 at 18:31
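For a rough feel of the TF effort behind these ETAs: trial factoring one exponent from 2^b to 2^(b+1) costs roughly twice as much as the previous bit level, so remaining levels can be projected from one measured level. A hedged sketch, with a made-up reference timing:

```python
# Rough remaining-TF estimate: the bit level 2^b .. 2^(b+1) costs about twice as
# much as the level before it, so one measured level lets us project the rest.
# days_for_current_level is a hypothetical observed figure, not from the post.

def remaining_tf_days(current_bit, target_bit, days_for_current_level):
    """Days to take one exponent from current_bit up to target_bit, given the
    observed wall-clock days for the level that ended at current_bit."""
    return sum(days_for_current_level * 2 ** (b - current_bit + 1)
               for b in range(current_bit, target_bit))

# Example: a GPU that took ~30 days for 89->90 and is continuing to 92 bits.
print(f"~{remaining_tf_days(90, 92, 30):.0f} more days to reach 92 bits")
```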
#18
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1111000100101₂ Posts
Run time scaling performed, from 10M to 500M or higher, plus short stage 1 OBD interim timing, and experimentation with core counts, 1 vs. 2 instances, etc., yields multiple extrapolations to OBD GPU72-bounds P-1 (both stages) completing within one year of continuous running, if stage 2 is done in tandem with another system at least as fast. About 1.25-1.7 years solo looks feasible on this dual Xeon E5-2690v2 system with 64 GiB ECC ram, on Ubuntu under WSL1 on Win10. The third attachment at the link below describes some possibilities for yet faster run time. (Run time scaling results for additional systems may be posted at the same link later.)
See https://www.mersenneforum.org/showpo...5&postcount=17 for more background. Note the previously qualified system also contains ECC ram.

Last fiddled with by kriesel on 2022-02-02 at 15:51
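The reason 64 GiB vs. 128 GiB matters so much for tandem stage 2 is buffer count: at the ~192M FFT length needed near p ≈ 3.32G, one residue buffer is about 1.5 GiB. A rough sketch, with the FFT length and OS overhead as assumptions rather than Mlucas-reported values:

```python
# Rough stage 2 buffer-count arithmetic (assumptions, not Mlucas-reported values):
# near p ~ 3.32e9 the FFT length is roughly 192M doubles, so one residue buffer
# is about 192M * 8 bytes = 1.5 GiB; more RAM means more buffers and generally
# better stage 2 throughput.

FFT_LEN_DOUBLES = 192 * 2**20              # assumed FFT length for an OBD exponent
BUFFER_GIB = FFT_LEN_DOUBLES * 8 / 2**30   # ~1.5 GiB per residue buffer
OS_RESERVE_GIB = 8                         # assumed headroom for OS and working set

for ram_gib in (64, 128):
    buffers = int((ram_gib - OS_RESERVE_GIB) // BUFFER_GIB)
    print(f"{ram_gib} GiB system: room for ~{buffers} stage 2 buffers")
```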
#19
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1E25₁₆ Posts
Run time scaling performed, from 10M to 332M or higher, plus short stage 1 OBD interim timing, yields multiple extrapolations to OBD GPU72-bounds P-1 (both stages) completing within 1.4 years of continuous running, if stage 2 is done in tandem with another system at least as fast. About 2 years solo looks feasible on this i5-7600T system with 64 GiB non-ECC ram on CentOS 7.9. Because it lacks ECC ram, its use for OBD P-1 is likely to be brief. The fourth attachment at the link below describes tests and results. (Run time scaling results for additional systems may be posted at the same link later.)
See https://www.mersenneforum.org/showpo...5&postcount=17 for more background.
#20
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts
The following are known to be too slow for OBD P-1 (solo or tandem) to proper bounds within a year, lack the 64 GiB maximum ram capacity needed for running tandem stage 2 (at least in the systems I had available to test, if not intrinsically by CPU design), and lack support for ECC ram. (The screening criteria are sketched in code after the lists below.)
i7-7500U (my system 16 GiB max; Intel specs now say 32 GiB max)
i7-4790 or lesser model number in that series (4770 etc.), or alternate CPUs in the same motherboard models, e.g. i3-4170, Celeron G1840 (my systems 16 GiB max; Intel specs now say 32 GiB max)
i5-1035G1 (my systems 32 GiB max; Intel specs now say 64 GiB max)
i7-8750H (my system 32 GiB max; Intel specs now say 64 GiB max)

The following support adequate amounts of ECC ram, but would be too slow, judging by relative prime95 performance:
dual-Xeon X5650
dual-Xeon E5645
dual-Xeon E5520

The following do not support sufficient ram for even OBD stage 1 P-1, lack support for ECC ram, and would be too slow, judging by relative prime95 performance:
Core 2 Duo E8200
i3-M370
Pentium M750

The following are too inefficient and slow under WSL, and limited there to accessing 16 cores/64 hyperthreads:
Xeon Phi 7210
Xeon Phi 7250
These would probably require a native Linux boot to qualify.

Last fiddled with by kriesel on 2022-02-05 at 20:52
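The screening above boils down to three checks per system: enough installed ram for stage 2, ECC support, and adequate speed. A hypothetical illustration (entries, relative speeds, and thresholds are illustrative, not measurements):

```python
# Hypothetical screening of candidate systems against the criteria used above:
# enough usable RAM for OBD stage 2, ECC support, and adequate relative speed.
# The entries and the thresholds are illustrative, not measured values.

systems = [
    {"name": "dual Xeon E5-2697v2", "ram_gib": 128, "ecc": True,  "rel_speed": 1.00},
    {"name": "dual Xeon E5-2690v2", "ram_gib": 64,  "ecc": True,  "rel_speed": 0.85},
    {"name": "i5-7600T",            "ram_gib": 64,  "ecc": False, "rel_speed": 0.45},
    {"name": "i7-4790",             "ram_gib": 16,  "ecc": False, "rel_speed": 0.40},
]

MIN_RAM_GIB, MIN_SPEED = 64, 0.5   # illustrative thresholds

for s in systems:
    ok = s["ram_gib"] >= MIN_RAM_GIB and s["ecc"] and s["rel_speed"] >= MIN_SPEED
    print(f'{s["name"]:22s} {"qualifies" if ok else "does not qualify"}')
```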
#21
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts
OBD P-1 is now feasible. For now, Mlucas v20.x is the only known software capable of OBD Mersenne P-1 factoring, with sufficiently large fft lengths and several months of QA testing and revision (patching) accomplished. The latest version, v20.1.1 of 2021-12-02, is recommended. See http://www.mersenneforum.org/mayer/README.html and https://mersenneforum.org/showthread.php?t=27295
I propose the following:
Last fiddled with by kriesel on 2022-02-07 at 16:58
#22
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,717 Posts
Current TF status is shown at https://www.mersenne.ca/obd. Current indicated Level is 22.14.
3321928171: completed to the full 92 bits, 2022-02-05, by mersenne.ca user johnny_jack
3321928307: 78% complete from 91 to 92, ETA 2022-02-23, by kriesel
3321928319: 71% complete from 91 to 92, ETA 2022-02-26, by kriesel
3321938373: complete to 90, available for reservation to higher

There are 15 others in the exponent range up to 3321939987, in progress at bit levels 86-87 to 88-89 by kriesel, as part of an effort to go expeditiously to OBD Level 24 or higher. Anyone with a sufficiently fast GPU is welcome to help reach OBD level 24.

At the moment:
OBD TF completed to 92, ready for reservation or release to others for P-1: 1
OBD TF completed to 91, reserved to 92 bits: 2
Systems with qualification(s) completed & posted for OBD P-1: 3 (1 unconditionally)
Exponents reserved for P-1: 0
Exponents completed thru stage 1 P-1: 0
Exponents completed thru stage 2 P-1: 0

It's now up to the user who completed TF to 92 on 3321928171 to reserve it for P-1, reserve it as a joint effort, or release it for P-1 reservation by others, by 2022-03-05. No action results in delay until, by default, it becomes available to all for reservation after 2022-03-05.