-   Computer Science & Computational Number Theory
    -   Pépin tests of Fermat numbers beyond F24

ET_ 2016-02-05 09:40

[QUOTE=ewmayer]Luigi, you order your birthday present yet? My hope is that things will run sufficiently fast on your newer hardware that you might get throughput matching my 4-core speed using just 3 threads (or even 2, due to memory contention giving diminishing returns beyond 2 threads), which will make me feel less guilty about monopolizing your cycles. :)[/QUOTE]

I completed the to-do list and cleared up my doubts about which parts to pick: I will place my order next Monday and get the new toy before the end of the week.


LaurV 2016-09-14 05:08

Ernst, can we have a status on F29?
Just curious...

Xyzzy 2016-09-14 14:12

[QUOTE=LaurV;442492]Ernst, can we have a status on F29?[/QUOTE]It is probably still composite.


ewmayer 2016-09-15 00:50

[QUOTE=LaurV;442492]Ernst, can we have a status on F29?
Just curious...[/QUOTE]

Well, I'm doing both the 30M and 32M-FFT runs side-by-side on my Haswell quad, so each is running half as fast as it would alone, but it makes it very easy to cross-check residues, since 30M runs only a few % faster than 32M (.1810 sec/iter vs .1860 sec/iter). I'm running each job 4-threaded, even though it gets only the equivalent of 2 cores compute power. Why am I doing this? Because my Haswell system, despite running at stock, has never been completely stable (whereas the very same binary has been running rock-solidly on my slower Mike-built Broadwell NUC for over a year, with the only interruptions being power outages ... but note the NUC is doing 1st-time LL tests, not Fermat testing). I average around 1 week between mystery-fails; roughly half the time both jobs start emitting fatal roundoff errors (sign of data corruption) at the same time; ~30% of the fails involve just one run - this is the reason for making both 4-threaded, so the surviving run can continue at full 4-core speed - and ~20% of the fails are BSOD-style (no UPS on this system, not worth the hassle).

When a run fails due to fatal-ROE-style data corruption it will attempt to auto-restart from the most-recent savefile; this succeeds perhaps 1/3 of the time, suggesting that is the fraction of times the data corruption occurred in the main residue array as opposed to the auxiliary data tables needed for the FFT-mul. (Adding code logic to auto-reinit those tables is on my to-do list, but not currently a high priority.)

Whenever I reboot/restart it gives me a chance to allow whichever of the 2 runs has fallen behind to catch up before restarting the other one.

There have been at least 2 (perhaps 3 - too lazy to dig thru the savefiles just now) occasions where one run has started emitting non-matching residues - in such cases I have to halt both and restart from the nearest savefile preceding the start of mismatching residues, since I have no [i]a priori[/i] way of determining which one went off the rails.

Anyhow, both runs passed 130M iters a few days ago, which is a smidge more than 1/4 done. Total time for the side-by-side runs will be ~3 years, so only 27 months to go. :(

Since I plan to do F30 on a manycore system (I'm lookin' at you, Knight's Landing), it is possible those parallel runs will complete before the F29 ones, or not much later. I'll have a better sense of things by EOY, roughly the timeframe I estimate for a halfway-decent initial AVX-512 assembly-code deployment.

LaurV 2016-09-15 05:32

Nice, thanks for sharing that!

That is what I (used to) do when running two LLs in parallel on two Titans - stop both and resume from the last match, as there was no way to know which one failed. I say "used to" because for the moment I have stopped my LL testing altogether (no resources), not because I would do it differently now. The old way is still the best. It can actually save a lot of time (your time and others', compared with the scenario where one test fails without your knowing, you report a bad result, and others need to TC).

@Xyzzy: :yucky:
(As said before, this icon is improperly named: it has nothing to do with "yuck", I see a grumpy guy sticking out his tongue in defiance, or spite, of another. Long ago I wanted it for myself, but you gave me the lightning rabbit... hey, btw, now with Pokémon in vogue, maybe I can make some money if I sell it? Any takers?)


ET_ 2016-09-21 11:01

As you probably already noticed, "my" Skylake didn't come.

I had health issues, then had to sell my mom's home; my wife's home was too small, so we had to look for another one: sell the old one, buy a new one in a city 450 miles away from where I live, prepare to relocate, look for a job, and a few other things I could put in the unhappy-me thread.

I want to apologize to the forum and to Ernst for offering a helping hand and then silently retracting it. The new PC should finally arrive after our relocation is complete, hopefully by the end of October (you will know by then).

If help is still needed in November, I hope I will be ready.

ewmayer 2016-09-21 21:09

[QUOTE=ET_;443146]As you probably already noticed, "my" Skylake didn't come.

I had health issues, then had to sell my mom's home; my wife's home was too small, so we had to look for another one: sell the old one, buy a new one in a city 450 miles away from where I live, prepare to relocate, look for a job, and a few other things I could put in the unhappy-me thread.

I want to apologize to the forum and to Ernst for offering a helping hand and then silently retracting it. The new PC should finally arrive after our relocation is complete, hopefully by the end of October (you will know by then).

If help is still needed in November, I hope I will be ready.[/QUOTE]

No worries, Luigi, and I hope things have calmed down on the various fronts you mention.

In the meantime, the usefulness of having both runs executing locally on the same (or similar) hardware has become clear to me: if I had a 2nd *well quad, that would double my throughput, but I can't justify the expense. So I plan to just have my Haswell complete F29, during which time I will produce AVX-512 code which will hopefully bring F30 within reasonable reach, either on a KNL or one of the future desktop systems which will support that double-the-width-of-AVX2 instruction set.

Luigi, you go ahead and put your hardware, when it arrives, to use for GIMPS - you definitely have a better chance of discovering a prime that way. :)

ewmayer 2017-11-14 02:19

The 2 side-by-side runs at 30M and 32M FFT I mentioned in my Sep 2016 post both got to iter ~230M on my Haswell in late April with me doing near-daily Res64 cross-checks, but it was a major headache, the Haswell - since powered off except for occasional avx2 code tests and tweaks - being sufficiently unstable that it was often a challenge to get 24 hours of uptime. At that point I moved the respective runs to a pair of faster systems, both physically hosted by our own David Stanfill, a.k.a. user airsquirrels. One system is David's own GPU-filled 32-core Xeon/avx2, the other is the Intel Knights Landing box a bunch of us GIMPSers interested in doing/supporting AVX-512 code development crowdfunded this past spring. After doing the basic AVX-512 assembly code port and some experiments to figure out which FFT length should be run on each of these 2 servers, the 30M-FFT run was finished on the Xeon (best per-iter time there ~24 ms using 32 threads) and the 32M run on the 64-core KNL (best time ~35 ms/iter using 64 threads). The Xeon run completed in late July and the KNL run finished on 2 Sep.; both gave these final Selfridge-Hurwitz residues (as well as matching full-length final residue files):
F29 is not prime. Res64: BECE1E5FFBE5EC12. Program: E17.0
R29 mod 2^36 = 68650658834
R29 mod 2^35 - 1 = 17796269382
R29 mod 2^36 - 1 = 21246329964
Since my original Haswell runs were started from a common 32M-FFT savefile (as opposed to being run completely independently @30M and 32M) and had numerous restarts, as well as several residue mismatches needing backtracking to see which of the 2 runs (which were proceeding side-by-side on my quad Haswell at similar speeds, allowing for easier cross-checking) went "off the rails" data-integrity-wise, after the faster Xeon completed its run of the final ~300 Miters I further used it to rerun the first ~230 Miters, all @30M FFT length. That partial run just completed over the weekend, with the 230-Miter result matching those of the Haswell-begun runs.
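As an aside for anyone unfamiliar with the residue format quoted above (Res64 plus the three Selfridge-Hurwitz residues mod 2^36, 2^35-1 and 2^36-1), here is a minimal Python sketch running the same Pépin test on the tiny Fermat number F5 and printing residues in that style. The function name is hypothetical; Mlucas of course uses FFT-based multiprecision arithmetic rather than Python bignums, so this is illustrative only.

```python
# Pepin test on the small (composite) Fermat number F5 = 2^32 + 1,
# printing Res64 / Selfridge-Hurwitz style residues like those
# reported for F29. Python's 3-argument pow does the modular
# exponentiation; the residue-extraction step is the point here.

def pepin_residue(n):
    """Return (3^((F_n-1)/2) mod F_n, F_n); F_n is prime iff the
    residue equals F_n - 1, i.e. -1 mod F_n (Pepin's criterion)."""
    F = 2**(2**n) + 1
    return pow(3, (F - 1) // 2, F), F

R, F5 = pepin_residue(5)
print("F5 is prime" if R == F5 - 1 else "F5 is not prime")
print("Res64: %016X" % (R % 2**64))
print("R mod 2^36     =", R % 2**36)
print("R mod 2^35 - 1 =", R % (2**35 - 1))
print("R mod 2^36 - 1 =", R % (2**36 - 1))
```

The three small moduli give cheap cross-check fingerprints of the full 2^29-bit residue that are easy to compare between independent runs.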

Many thanks to David for the generous gift of Xeon cycles and the IT support.

I've posted a tarball of the interim-checkpoint status files and the full final residue file via a link added to the OP in this thread. I also have persistent full checkpoint files at 10-Miter intervals, but as each such file is 2^29/8 = 64 Mbyte, and there are 2^29/(10 million) ≈ 53 of them, the total archive is ~3.3 GB. The residue files are more or less completely random byte data, i.e. cannot effectively be compressed. I have burned a copy of that archive to a thumb drive as a backup for the one on my HD, and will make it available on request should anyone wish to do e.g. a parallel-run validation via rerun of said 10-Miter intervals.

ewmayer 2018-01-12 07:32

1 Attachment(s)
Since I am currently in the process of adding PRP (as opposed to LL) testing for Mersenne numbers, as well as fast Suyama-style cofactor-PRP testing ('fast' in the 'when starting with a PRP-test residue for the modulus in question' sense) for both Mersenne and Fermat numbers to my Mlucas code, this seems worth sharing here. While going through a box of old documents today, I came across a folder of overhead-projector transparencies for several talks I gave back in my 6-year stint in academia from '93-'99. One set of slides was for the following talk I gave at the 1997 West Coast NT conference at the Asilomar conference center in Pacific Grove, near Monterey. (My Christmas-gift-to-self last month was a $60 Epson flatbed scanner, which is paying for itself in all kinds of useful ways.)

There is an obvious small typo on slide 4, the 2nd bullet point should read (boldface highlights the needed fix) "...since (M_p-1)/2 is just a string of (p-1) binary ones, the Euler base-3 test amounts to a series of [b](p-2)[/b] squarings and multiplications by 3...":
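With that fix, the slide's claim can be checked with a minimal Python sketch (function name hypothetical, not from the slides or Mlucas): since (M_p-1)/2 = 2^(p-1)-1 is a string of p-1 binary ones, left-to-right binary exponentiation is exactly p-2 iterations of "square, then multiply by 3", and a prime M_p yields -1 by the Euler criterion (3 is a quadratic nonresidue mod M_p for odd p).

```python
# Euler base-3 test for M_p = 2^p - 1: compute 3^((M_p-1)/2) mod M_p
# via p-2 fused square-and-multiply-by-3 steps, as described on the
# slide. A prime M_p gives -1 (i.e. M_p - 1).

def euler_base3(p):
    M = 2**p - 1
    r = 3                      # handles the top bit of the all-ones exponent
    for _ in range(p - 2):     # the remaining p-2 one-bits
        r = (r * r * 3) % M    # square, then multiply by 3
    return r == M - 1          # Euler criterion: expect -1 if M_p prime

for p in (5, 7, 11, 13):
    print(p, euler_base3(p))   # p = 11 (M_p = 2047 = 23*89) fails
```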

CRGreathouse 2018-01-22 06:45

I never realized you were at Case!

ewmayer 2018-06-24 03:30

Update on the side-by-side Pépin tests of F30:

Run #1, using the AVX-512 build on all 64 processors of the 1.3 GHz "GIMPS KNL" @64M FFT and begun 25 Mar 2017, just passed the 1/3 mark at 360M iterations. At 67.7 ms/iter the ETC is 31 Dec 2019 (ugh.)

Run #2 using AVX2 build on all 32 processors of David Stanfill's 2.3 GHz Xeon server [E5-2698 v3] @60M FFT, begun 12 Nov 2017, is 20M iterations behind the KNL one. The best runtime I've gotten on this system is 48 ms/iter, but (for reasons not fully known, best guess is some combo of system load from other low-CPU processes and possibly temps) things more typically run 10-15% slower than that. The Xeon also tends to have more outages than the KNL, so let's add another 5% to that, say 57 ms/iter, ETC is mid-October 2019.

Unlike my F29 runs on my Haswell quad, there have been no interim-residue mismatches between the above 2 runs. Hooray for ECC memory!

One interesting issue I ran into recently with regard to future tests of larger Fermat numbers relates to residue shift, such as George's Prime95 code uses to effect different-data first-time and double-check runs of a given Mersenne number at the same FFT length for both runs: I recently finished adding support for this to my Mlucas code.

In the context of the LL test for Mersennes, the trickiest part, especially for highly optimized codes which sandwich the carry step between the final pass of the inverse FFT for one iteration and the initial pass of the forward FFT for the next (and, as in my case, combine all 3 steps into the body of a single processing loop), is efficiently figuring out where to inject the shifted -2 which in an unshifted-residue LL test simply gets subtracted from the 0-element of the residue array. With shift, one needs to first efficiently compute which residue word gets the shifted -2, and remember, this is in the context of the crazy mix of 2 different wordsizes (floor(p/n) and ceiling(p/n)) mandated by the Crandall-Fagin DWT. Then one is typically dealing with a multithreaded decomposition of the above fused FFT-pass/carry-step loop, so one needs to figure out which thread is processing said array word. Lots of fun. But the update of the residue shift count each iteration is simple: it's just a doubling (mod p), and the fact that p is prime means the odds are small that the resulting sequence of per-iteration shift counts will repeat in some kind of short cycle (or even worse, fall into a limit cycle which is independent of the initial shift, i.e. which would be the same for 2 runs starting out with different shift values).
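To make the "short cycles are unlikely" point concrete: with seed s0, the shift sequence is s0*2^k mod p, so its cycle length is exactly the multiplicative order of 2 mod p, a divisor of p-1 that is typically large. A minimal Python sketch (function name hypothetical, exponents are small illustrative primes rather than real test candidates):

```python
# Measure the cycle length of the per-iteration shift update
# s -> 2*s (mod p) by iterating until a value repeats. With s0 = 1
# the cycle length equals the multiplicative order of 2 mod p.

def shift_cycle_length(p, s0=1):
    seen, s, k = {}, s0, 0
    while s not in seen:
        seen[s] = k            # record first-visit iteration of each shift
        s = (2 * s) % p
        k += 1
    return k - seen[s]         # length of the cycle eventually reached

for p in (23, 89, 127):
    print(p, shift_cycle_length(p))
```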

For Fermat-mod arithmetic we have the opposite problem: There is no -2 carry needing to be injected each iteration as there is for the LL test, but the per-iteration shift-count doubling is problematic, since our modulus = 2^(2^n) + 1, i.e. our shift count gets doubled mod 2^n, which is guaranteed to yield (and then stay at) shift count s = 0 after at most n iterations. Updating the shift each iteration via s = 2*s+1 instead, which one can do by including a scalar multiplier equal to the initial seed of the Pépin test, is only a little better, e.g. for n = 4, consider the kinds of shift-count sequences that result from different values of the initial shift:


The most promising avenue found here is to update the shift count via s = 2*s + random[0,1].
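The three update rules above can be compared directly with a small sketch mod 2^n for n = 4 (helper name hypothetical): plain doubling pins the shift at 0, s = 2*s+1 pins it at the all-ones value 2^n - 1 = 15, while the random-bit injection keeps the sequence varied.

```python
# Compare shift-update rules mod 2^n, n = 4 (shift counts for F_n
# live mod 2^n): s -> 2s collapses to 0, s -> 2s+1 collapses to 15,
# s -> 2s + random-bit stays varied.
import random

def trajectory(update, s, steps=8, mod=16):
    out = [s]
    for _ in range(steps):
        s = update(s) % mod
        out.append(s)
    return out

print(trajectory(lambda s: 2 * s, 5))          # ends pinned at 0
print(trajectory(lambda s: 2 * s + 1, 5))      # ends pinned at 15
random.seed(1)                                 # reproducible random bits
print(trajectory(lambda s: 2 * s + random.getrandbits(1), 5))
```

Whatever the initial shift, the first two rules reach their fixed point within n iterations, which is why only the randomized update is viable for same-FFT-length dual runs.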

The reason this becomes relevant for Fermat numbers only a few indices above F30 is this: so far I've been able to use dual residue-shift-less runs at slightly different FFT lengths to ensure that all but the initial handful of iterations for each run involve different FFT data. For example, for F24 we can do one run at 896K and another at 960K or 1M. For F30, as noted above, the minimal FFT length has crept up relative to the next-higher power of 2 due to roundoff-error trends, and each doubling of the modulus length squeezes that lower bound that much closer to the power-of-2 FFT length. For F33 I have already put in place a special FFT length of the form 63*2^k, but that large-ish odd composite radix is relatively inefficient, so if it proves slower than the power of 2 just above it, it will be useful to be able to do both runs at the same power-of-2 FFT length, using the residue-shift scheme to ensure different-data-ness.
