mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   PFGW 4.0.3 (with gwnum v28.7) Released (https://www.mersenneforum.org/showthread.php?t=13969)

mdettweiler 2010-09-29 21:06

I've also tried comparing iteration timings with Prime95 v26.2 vs. 25.11 (both Windows 32-bit) to verify whether the problem is in gwnum, or just PFGW.

In both cases, I used the Advanced>Time option, and ran 1000 iterations of M38000000.

25.11: ~60 ms/iter.
26.2: ~50 ms/iter.

So there's a significant speedup going to version 26.2. Of course, this is a much bigger FFT than that used on the base 5 numbers, so I also tried 1000 iterations of M1100000.

25.11: ~1.2 ms/iter.
26.2: ~1.5 ms/iter.

It seems that version 26.2 is actually slower on this FFT.

Note that Prime95 v26.2 used the Pentium 4 type-3 56K FFT for this number, whereas the base 5 numbers tested earlier were done with a Core2 type-3 128K FFT. Still, the two cases have something in common: in both, the v26 gwnum code tested slower on my CPU at these small FFT sizes.

Prime95 2010-09-29 21:47

[QUOTE=mdettweiler;231968]I've also tried comparing iteration timings with Prime95 v26.2 vs. 25.11 (both Windows 32-bit) to verify whether the problem is in gwnum, or just PFGW.[/QUOTE]

You can add "PRP=289184,5,477336,-1" to worktodo.txt to time the exact numbers in question.
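For reference, the entry above appears to follow a k,b,n,c layout for the number k*b^n+c, which matches the 289184*5^477336-1 form used elsewhere in this thread. A minimal Python sketch of that mapping follows; the helper names `parse_prp_entry` and `as_expression` are made up for illustration, and real worktodo.txt lines may carry extra fields that this ignores.

```python
def parse_prp_entry(line):
    """Split a 'PRP=k,b,n,c' worktodo line into integers (k, b, n, c).

    Assumption: the first four comma-separated fields are k, b, n, c.
    Any further fields are ignored by this sketch.
    """
    keyword, _, fields = line.strip().partition("=")
    if keyword != "PRP":
        raise ValueError(f"not a PRP entry: {line!r}")
    k, b, n, c = (int(part) for part in fields.split(",")[:4])
    return k, b, n, c


def as_expression(k, b, n, c):
    """Render (k, b, n, c) in the k*b^n+c notation used in the thread."""
    # The '+d' format always emits the sign, so c=-1 gives "-1" and c=1 gives "+1".
    return f"{k}*{b}^{n}{c:+d}"


entry = "PRP=289184,5,477336,-1"
print(as_expression(*parse_prp_entry(entry)))
```

Going the other way (expression to worktodo line) is the same mapping in reverse, which makes it easy to time any of the other candidates discussed here.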

It is baffling to me why these smaller FFTs are slower on your Core 2 but not for anyone else. Maybe CPU-Z or one of the other programs that do a more thorough dump of CPU characteristics might shed some light.

mdettweiler 2010-09-30 02:26

[QUOTE=Prime95;231975]You can add "PRP=289184,5,477336,-1" to worktodo.txt to time the exact numbers in question.

It is baffling to me why these smaller FFTs are slower on your Core 2 but not for anyone else. Maybe CPU-Z or one of the other programs that do a more thorough dump of CPU characteristics might shed some light.[/QUOTE]
Okay, here are the iteration timings I got for that with Prime95 v25.11 and 26.2:

25.11: ~3.55 ms/iter.
26.2: ~3.45 ms/iter.

Wouldn't you know it--I get a speed boost with 26.2 after all. It would seem, then, that this is an issue in PFGW and not in gwnum.

rogue 2010-09-30 02:41

[QUOTE=mdettweiler;232001]Okay, here are the iteration timings I got for that with Prime95 v25.11 and 26.2:

25.11: ~3.55 ms/iter.
26.2: ~3.45 ms/iter.

Wouldn't you know it--I get a speed boost with 26.2 after all. It would seem, then, that this is an issue in PFGW and not in gwnum.[/QUOTE]

Not necessarily. I've already shown the timings on Windows for the same build. Such a vast proportion of time is spent in gwnum that it is unlikely that PFGW could cause such a significant slowdown. There is clearly something curious going on here though.

mdettweiler 2010-09-30 06:10

[QUOTE=rogue;232002]Not necessarily. I've already shown the timings on Windows for the same build. Such a vast proportion of time is spent in gwnum that it is unlikely that PFGW could cause such a significant slowdown. There is clearly something curious going on here though.[/QUOTE]
Here's what I get comparing iteration times on 289184*5^477336-1 for LLR 3.8.1 and 3.8.2:

3.8.1: ~3.45 ms/iter.
3.8.2: ~3.40 ms/iter.

Just like Prime95, LLR gets a speed increase with 3.8.2 as expected.

What I really should do, though, is run the test from start to finish on each version of both Prime95 and LLR. We've already seen that in such a test PFGW 3.3.6 is inexplicably faster than 3.4.0, but it would be interesting to see if the same holds true for Prime95 and LLR. Sure, the ms/iter. figures show the newer version to be faster in both such cases, but in each case the figures fluctuated rather wildly and I had to come up with a "gut estimate average" to post here. The potential for experimental error is, needless to say, rather large.
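One way to replace a "gut estimate average" with something repeatable is to write down a batch of the ms/iter. readings and summarize them. The Python sketch below does that; the sample values are placeholders, not measurements from any of the runs above, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(samples_ms):
    """Summarize a list of per-iteration timings given in milliseconds."""
    return {
        "median": statistics.median(samples_ms),  # robust to a few outliers
        "mean": statistics.mean(samples_ms),
        "spread": max(samples_ms) - min(samples_ms),  # crude noise indicator
    }


# Placeholder readings for illustration only; substitute the values
# actually printed by the program under test.
readings = [3.41, 3.52, 3.44, 3.47, 3.39]
summary = summarize(readings)
print(f"median {summary['median']:.2f} ms, spread {summary['spread']:.2f} ms")
```

If the spread is comparable to the difference between two versions, the per-iteration figures alone probably cannot settle which one is faster, which is the argument for timing full runs.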

George, quick question: is there a way to make Prime95 print the exact wall-clock runtime at the end of a test, like PFGW and LLR do? As it is now, there's not really an easy way to directly measure this with Prime95.

rogue 2010-09-30 12:39

[QUOTE=mdettweiler;232017]Here's what I get comparing iteration times on 289184*5^477336-1 for LLR 3.8.1 and 3.8.2:

3.8.1: ~3.45 ms/iter.
3.8.2: ~3.40 ms/iter.

Just like Prime95, LLR gets a speed increase with 3.8.2 as expected.

What I really should do, though, is run the test from start to finish on each version of both Prime95 and LLR. We've already seen that in such a test PFGW 3.3.6 is inexplicably faster than 3.4.0, but it would be interesting to see if the same holds true for Prime95 and LLR. Sure, the ms/iter. figures show the newer version to be faster in both such cases, but in each case the figures fluctuated rather wildly and I had to come up with a "gut estimate average" to post here. The potential for experimental error is, needless to say, rather large.

George, quick question: is there a way to make Prime95 print the exact wall-clock runtime at the end of a test, like PFGW and LLR do? As it is now, there's not really an easy way to directly measure this with Prime95.[/QUOTE]

I'm curious. Is 3.4.0 slower for other n and other bases?

Prime95 2010-09-30 14:28

[QUOTE=mdettweiler;232017]
George, quick question: is there a way to make Prime95 print the exact wall-clock runtime at the end of a test, like PFGW and LLR do? As it is now, there's not really an easy way to directly measure this with Prime95.[/QUOTE]

The date/time is displayed at the start of every line output to the screen.

mdettweiler 2010-09-30 14:58

[QUOTE=rogue;232047]I'm curious. Is 3.4.0 slower for other n and other bases?[/QUOTE]
For base 2, 3.4.0 is faster:
[code]
PFGW Version 3.3.6.20100908.Win_Stable [GWNUM 25.14]
2071*2^270307-1 is composite: RES64: [816B1DBFBFC67D09] (123.9523s+0.0004s)


PFGW Version 3.4.0.32BIT.20100925.Win_Dev [GWNUM 26.2]
2071*2^270307-1 is composite: RES64: [816B1DBFBFC67D09] (108.1168s+0.0003s)
[/code]
Ditto for base 3:
[code]
PFGW Version 3.3.6.20100908.Win_Stable [GWNUM 25.14]
170979002*3^50000+1 is composite: RES64: [CBA9FAA11257431A] (15.1078s+0.0014s)

PFGW Version 3.4.0.32BIT.20100925.Win_Dev [GWNUM 26.2]
170979002*3^50000+1 is composite: RES64: [CBA9FAA11257431A] (12.1903s+0.0017s)[/code]
And for another n on base 5, 3.4.0 is again faster:
[code]
PFGW Version 3.3.6.20100908.Win_Stable [GWNUM 25.14]
18656*5^65474-1 is composite: RES64: [BB2682E39AA9CB16] (42.5636s+0.0034s)

PFGW Version 3.4.0.32BIT.20100925.Win_Dev [GWNUM 26.2]
18656*5^65474-1 is composite: RES64: [BB2682E39AA9CB16] (37.9236s+0.0038s)[/code]
The problem, it would seem, is localized to this particular range of n on base 5 (possibly this particular FFT size).
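To put numbers on those three comparisons, here is a small Python sketch that pulls the timing out of each result line quoted above and reports how much less time 3.4.0 took. It assumes the two figures in parentheses can simply be added to get the total; the second one is tiny either way. `total_seconds` and `percent_faster` are names invented for this sketch.

```python
import re

# Matches the "(123.9523s+0.0004s)" timing suffix on a PFGW result line.
TIMING = re.compile(r"\((\d+(?:\.\d+)?)s\+(\d+(?:\.\d+)?)s\)")


def total_seconds(result_line):
    """Return the sum of the two timing figures on a result line."""
    match = TIMING.search(result_line)
    if match is None:
        raise ValueError(f"no timing found in: {result_line!r}")
    return float(match.group(1)) + float(match.group(2))


def percent_faster(old_line, new_line):
    """Percentage of run time saved by the newer version."""
    old, new = total_seconds(old_line), total_seconds(new_line)
    return (old - new) / old * 100.0


# Result lines copied from the runs above: (3.3.6 line, 3.4.0 line).
runs = {
    "base 2": (
        "2071*2^270307-1 is composite: RES64: [816B1DBFBFC67D09] (123.9523s+0.0004s)",
        "2071*2^270307-1 is composite: RES64: [816B1DBFBFC67D09] (108.1168s+0.0003s)",
    ),
    "base 3": (
        "170979002*3^50000+1 is composite: RES64: [CBA9FAA11257431A] (15.1078s+0.0014s)",
        "170979002*3^50000+1 is composite: RES64: [CBA9FAA11257431A] (12.1903s+0.0017s)",
    ),
    "base 5": (
        "18656*5^65474-1 is composite: RES64: [BB2682E39AA9CB16] (42.5636s+0.0034s)",
        "18656*5^65474-1 is composite: RES64: [BB2682E39AA9CB16] (37.9236s+0.0038s)",
    ),
}

for label, (old_line, new_line) in runs.items():
    print(f"{label}: {percent_faster(old_line, new_line):.1f}% less time with 3.4.0")
```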

Mark, did you by chance check which FFT 3.4.0 chose for the two (larger) base 5 tests on your CPU? Mine used "Core2 type-3 FFT length 128K"; maybe yours chose a different CPU architecture? (Grasping at straws here...)
[QUOTE=Prime95;232059]The date/time is displayed at the start of every line output to the screen.[/QUOTE]
But that only prints out at the [i]end[/i] of each test; what I'd need is for it to print the time at the beginning as well, so I could subtract and get the runtime.

I suppose what I could do is stick a minuscule test (n=100 or so) in the worktodo.txt file right before the base 5 test. That way, it prints out the time at the tiny test's completion (i.e., at the start of the base 5 test) and again at the end of the base 5 test. I'll try that later today.
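The subtraction itself is easy to script once the two timestamps are copied out. The Python sketch below assumes they have been retyped by hand in a "YYYY-MM-DD HH:MM:SS" form; that is not necessarily the format Prime95 prints, so adjust the format string to match. The timestamps shown are made-up placeholders and `elapsed_seconds` is a hypothetical helper.

```python
from datetime import datetime

# Assumed timestamp layout; change this to match what the program prints.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start_stamp, end_stamp):
    """Wall-clock seconds between two timestamps in STAMP_FORMAT."""
    start = datetime.strptime(start_stamp, STAMP_FORMAT)
    end = datetime.strptime(end_stamp, STAMP_FORMAT)
    return (end - start).total_seconds()


# Placeholder values: the line printed when the tiny test finishes marks
# the start of the base 5 test; the next result line marks its end.
tiny_test_done = "2010-09-30 15:00:00"
base5_test_done = "2010-09-30 15:28:07"
print(f"{elapsed_seconds(tiny_test_done, base5_test_done):.0f} s elapsed")
```

The resolution is only as good as the printed timestamps, so this suits runs lasting minutes rather than seconds.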

rogue 2010-09-30 15:15

This is what 3.4.0 chose on Win64:

Special modular reduction using zero-padded Core2 type-3 FFT length 128K, Pass1=128, Pass2=1K on 289184*5^477336-1

mdettweiler 2010-09-30 15:27

[QUOTE=rogue;232065]This is what 3.4.0 chose on Win64:

Special modular reduction using zero-padded Core2 type-3 FFT length 128K, Pass1=128, Pass2=1K on 289184*5^477336-1[/QUOTE]
That's the same as what I got:

Special modular reduction using zero-padded Core2 type-3 FFT length 128K, Pass1=128, Pass2=1K on 289184*5^477336-1

Does the 32-bit version by chance give you something different?

rogue 2010-09-30 15:45

[QUOTE=mdettweiler;232068]That's the same as what I got:

Special modular reduction using zero-padded Core2 type-3 FFT length 128K, Pass1=128, Pass2=1K on 289184*5^477336-1

Does the 32-bit version by chance give you something different?[/QUOTE]

Nope.


All times are UTC.