mersenneforum.org Sr2sieve on PPC/Linux

2006-12-05, 21:40   #45
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts
HASH_MAX_DENSITY

Quote:
 Originally Posted by BlisteringSheep .... MAX_HASH_DENSITY reduced. I haven't experimented with different values; I've just been using Geoff's recommendation to halve it. I'll try some others.
Nothing performed any better than 0.32. I tried 0.16, 0.24, 0.40, 0.48, 0.56 & 0.62.
• 0.16: 253571 p/sec
• 0.24: 265782 p/sec
• 0.32: 265975 p/sec
• 0.40: 265653 p/sec
• 0.48: 263292 p/sec
• 0.56: 263331 p/sec
• 0.62: 263521 p/sec

2006-12-06, 01:17   #46
geoff

Mar 2003
New Zealand

13·89 Posts

Quote:
 Originally Posted by BlisteringSheep Geoff, I got the new 1.4.7 tarball. Is it possible to create the cache file once and then link to it? I have a ton of machines all working off of an NFS tree. I'm currently hard-linking all of the executables and SoB.dat, and would like to link the sr2cache.bin file as well (I prefer hard links to symlinks for things on the same file system).
Yes that's what I do. All machines can link to the same SoB.dat, but only machines with the same register width and byte order can link to the same sr2cache.bin.
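
A rough way to check the two properties mentioned above before linking one sr2cache.bin between machines is sketched below. It assumes that the width of `unsigned long` matches the register width sr2sieve cares about, which the thread does not confirm. The helper names (`word_bits`, `is_little_endian`, `cache_compatible`) are hypothetical and not part of sr2sieve.

```c
#include <stdint.h>
#include <string.h>

/* Width of the native word in bits (assumed proxy for register width). */
static int word_bits(void)
{
    return (int)(sizeof(unsigned long) * 8);
}

/* 1 if the lowest-addressed byte of a multi-byte integer is the least
   significant one, 0 otherwise. */
static int is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Hypothetical rule taken from the post: two hosts may share a cache
   file only if both the word width and the byte order match. */
static int cache_compatible(int bits_a, int little_a, int bits_b, int little_b)
{
    return bits_a == bits_b && little_a == little_b;
}
```

Running `word_bits()` and `is_little_endian()` on each host and feeding the results to `cache_compatible()` gives a quick yes/no answer under these assumptions.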

Quote:
 Note that USE_INLINE_MULMOD makes it run really fast (over 461000 p/sec), but it finds no factors at all (duplicates or new).
OK, thanks for testing this. If it is returning wrong results then the speed can't be compared to a working version, I'll remove it from the next version.
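
One way to catch a mulmod that returns wrong results before running a long sieve is to compare it against a slow reference on random inputs. The sketch below is not from sr2sieve. It assumes a compiler with the GCC/Clang `unsigned __int128` extension, and the names `mulmod64_ref`, `mulmod_fn` and `mulmod_agrees` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Slow reference: (a * b) mod p through a 128-bit intermediate product.
   unsigned __int128 is a compiler extension on 64-bit targets. */
static uint64_t mulmod64_ref(uint64_t a, uint64_t b, uint64_t p)
{
    assert(p != 0);
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

typedef uint64_t (*mulmod_fn)(uint64_t, uint64_t, uint64_t);

/* Returns 1 if candidate matches the reference on 1000 pseudo-random
   pairs reduced mod p, 0 on the first mismatch. */
static int mulmod_agrees(mulmod_fn candidate, uint64_t p)
{
    uint64_t x = 88172645463325252ULL;  /* xorshift64 state */
    for (int i = 0; i < 1000; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        uint64_t a = x % p;
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        uint64_t b = x % p;
        if (candidate(a, b, p) != mulmod64_ref(a, b, p))
            return 0;
    }
    return 1;
}
```

Passing the function under test as `candidate` gives a quick sanity check. It does not replace the factor-finding test runs described in this thread.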

Quote:
 think it also shows that it's worthwhile to have MAX_HASH_DENSITY reduced. I haven't experimented with different values; I've just been using Geoff's recommendation to halve it. I'll try some others.
Halving the maximum density will double the hashtable size, which should generally give a small speedup so long as L1 cache size is not a constraint. The speedup was not dramatic so the default is probably OK.
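
The relationship between maximum density and table size can be sketched as follows. This assumes the table is sized to the smallest power of two that keeps elements/slots at or below the maximum density. That sizing rule is an assumption, the thread does not show how sr2sieve actually does it, and `hash_slots` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power-of-two slot count with elements / slots <= max_density.
   Assumed sizing rule, for illustration only. */
static size_t hash_slots(size_t elements, double max_density)
{
    assert(max_density > 0.0);
    size_t slots = 1;
    while ((double)elements > max_density * (double)slots)
        slots *= 2;
    return slots;
}
```

Under this rule, 1000 elements need 2048 slots at an illustrative density of 0.64 and 4096 slots at 0.32, so halving the density doubles the table. Whether the larger table still fits in L1 depends on the slot size, which is not given here.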

2006-12-06, 03:15   #47
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts

Quote:
 Originally Posted by geoff OK, thanks for testing this. If it is returning wrong results then the speed can't be compared to a working version, I'll remove it from the next version.
If there are ever any tests I can do for you, let me know. I've got at least five different types of PPC64 systems.
I will experiment some more with the BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK suggestions you made next.

2006-12-06, 05:18   #48
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

366₈ Posts

Quote:
 Originally Posted by BlisteringSheep I will experiment some more with the BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK suggestions you made next.
None of this panned out. I tried combinations with different changes to WORK and LIMIT_BASE, and nothing was better than the current values. In fact, they were all slightly slower, though not enough to be significant.

I promise, last post of the night
[Bet you're all going to be glad when I go away ]

2006-12-07, 01:34   #49
geoff

Mar 2003
New Zealand

13×89 Posts

Quote:
 Originally Posted by BlisteringSheep None of this panned out. I tried combinations with different changes to WORK and LIMIT_BASE, and nothing was better than the current values. In fact, they were all slightly slower, though not enough to be significant.
That is OK, hopefully it means that the default is optimal.

In version 1.4.8 I have made another attempt at the inline mulmod, no guarantees it will work, but if you want to test it just compile with -DUSE_INLINE_MULMOD added to CPPFLAGS.

Quote:
 [Bet you're all going to be glad when I go away ]
Not at all, your testing so far has been a great help :-)

2006-12-07, 04:54   #50
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts

Quote:
 Originally Posted by geoff In version 1.4.8 I have made another attempt at the inline mulmod, no guarantees it will work, but if you want to test it just compile with -DUSE_INLINE_MULMOD added to CPPFLAGS.
It did indeed work and is slightly faster. On my short test case (10 minutes with 1 factor), it found the factor with these speeds (all compiles with gcc 4.1.1) on a 2.0 GHz 970:
• 240975 p/sec with defaults
• 242375 p/sec with HASH_MAX_DENSITY to 0.32
• 245071 p/sec with INLINE plus HASH_MAX_DENSITY
I am currently doing a slightly longer test (little over an hour) with a range that has 2 factors.

Note that these tests are on a different machine with a slower CPU than the results at the top of this page, and the numbers can't be directly compared.

Last fiddled with by BlisteringSheep on 2006-12-07 at 04:56 Reason: speed disclaimer

2006-12-07, 15:40   #51
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts
1.4.8 USE_INLINE_MULMOD results

On the 2.5GHz 970MPs, the speedup is more significant, from about 308000 p/sec to 331000 (over 7%). There is a similar significant improvement on the 2.2 GHz 970FX, from about 266000 to 286000 (again over 7%).

One thing I did think to do in these tests vs. the one last night on the slower CPU was to remove mulmod-ppc64.o from the list of ASM_OBJS when using the USE_INLINE_MULMOD.

Last fiddled with by BlisteringSheep on 2006-12-07 at 16:00 Reason: added 2.2 GHz 970FX results

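The "over 7%" figures can be reproduced from the rounded rates quoted in post #51 with a one-line calculation. The helper name `speedup_percent` is made up for this sketch.

```c
/* Percentage gain going from old_rate to new_rate (both in p/sec). */
static double speedup_percent(double old_rate, double new_rate)
{
    return (new_rate - old_rate) / old_rate * 100.0;
}
```

With the rounded rates from the post, both `speedup_percent(308000, 331000)` and `speedup_percent(266000, 286000)` come out between 7% and 8%.
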
2006-12-08, 01:06   #52
geoff

Mar 2003
New Zealand

13×89 Posts

Quote:
 Originally Posted by BlisteringSheep On the 2.5GHz 970MPs, the speedup is more significant, from about 308000 p/sec to 331000 (over 7%). There is a similar significant improvement on the 2.2 GHz 970FX, from about 266000 to 286000 (again over 7%). One thing I did think to do in these tests vs. the one last night on the slower CPU was to remove mulmod-ppc64.o from the list of ASM_OBJS when using the USE_INLINE_MULMOD.
Great :-) The inline assembler linkage can probably be improved further, at the moment a whole lot of registers are hard coded and it might be better to allow GCC to choose which ones to use.

I don't really know enough about PPC assembler to guess what is likely to work best though, and trial and error will be a long process without a machine at hand to test it on.

If you have the patience to do this yourself, the basic idea is to replace a register in the clobber list with an entry in the output list associated with a temporary variable. The current code looks like this:

Code:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
  register uint64_t ret;

  asm ("li      %0, 64"          "\n\t"
       "sub     %0, %0, %5"      "\n\t"
       "mulld   r7, %1, %2"      "\n\t"
       "mulhdu  r8, %1, %2"      "\n\t"
       "mulld   r26, r7, %4"     "\n\t"
       "mulhdu  r27, r7, %4"     "\n\t"
       "mulld   r28, r8, %4"     "\n\t"
       "mulhdu  r29, r8, %4"     "\n\t"
       "srd     r9, r9, %5"      "\n\t"
       "sld     r10, r10, %0"    "\n\t"
       "or      r9, r9, r10"     "\n\t"
       "mulld   r9, r9, %3"      "\n\t"
       "sub     %0, r7, r9"      "\n\t"
       "cmpdi   cr6, %0, 0"      "\n\t"
       "bge+    cr6, 0f"         "\n\t"
       "0:"
       : "=&r" (ret)
       : "r" (a), "r" (b), "r" (p), "r" (pMagic), "r" (pShift)
       : "r7","r8","r9","r10","r26","r27","r28","r29","cr6" );

  return ret;
}
To allow GCC to choose two registers to use instead of r7,r8 you could change it to:
Code:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
  register uint64_t ret, tmp1, tmp2;

  asm ("li      %0, 64"          "\n\t"
       "sub     %0, %0, %7"      "\n\t"
       "mulld   %1, %3, %4"      "\n\t"
       "mulhdu  %2, %3, %4"      "\n\t"
       "mulld   r26, %1, %6"     "\n\t"
       "mulhdu  r27, %1, %6"     "\n\t"
       "mulld   r28, %2, %6"     "\n\t"
       "mulhdu  r29, %2, %6"     "\n\t"
       "srd     r9, r9, %7"      "\n\t"
       "sld     r10, r10, %0"    "\n\t"
       "or      r9, r9, r10"     "\n\t"
       "mulld   r9, r9, %5"      "\n\t"
       "sub     %0, %1, r9"      "\n\t"
       "cmpdi   cr6, %0, 0"      "\n\t"
       "bge+    cr6, 0f"         "\n\t"
       "0:"
       : "=&r" (ret), "=&r" (tmp1), "=&r" (tmp2)
       : "r" (a), "r" (b), "r" (p), "r" (pMagic), "r" (pShift)
       : "r9","r10","r26","r27","r28","r29","cr6" );

  return ret;
}
I wouldn't expect the gain from doing this to be very great however, and it might even be slower (GCC doesn't always make the best choice of register to use). You will also need to make analogous changes to the PRE2_MULMOD64() macro to see the real effect.
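
As a smaller illustration of the same idea, here is a single-instruction wrapper in which every register is chosen by GCC and the clobber list is empty. It is a sketch rather than code from sr2sieve. The name `mulhi64` is hypothetical, and the non-PPC branch is a fallback that assumes the `unsigned __int128` compiler extension so the function can be tried on other machines.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b. */
static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
#if defined(__powerpc64__)
    uint64_t hi;
    /* No hard-coded registers and no clobbers: the output and both
       inputs are bound to whatever registers GCC allocates. */
    __asm__ ("mulhdu %0, %1, %2" : "=r" (hi) : "r" (a), "r" (b));
    return hi;
#else
    /* Fallback for other targets, using the 128-bit extension. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#endif
}
```

Because the output is written only after both inputs have been read, a plain "=r" constraint is enough here. The longer sequences above write outputs early, which is why they use "=&r".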

2006-12-10, 23:05   #53
geoff

Mar 2003
New Zealand

10010000101₂ Posts

In version 1.4.9 I have made a small change to the inline PRE2_MULMOD64 macro that should allow GCC to recognise that the initial subtraction results in a loop invariant. You can try this change out by replacing the '#if 1' with '#if 0' in asm-ppc64.h.

Last fiddled with by geoff on 2006-12-10 at 23:06

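The loop-invariant idea can be sketched in plain C. In the asm posted earlier, the first two instructions compute 64 - pShift, and the srd/sld/or sequence then uses that value. When the subtraction sits inside the asm block the compiler cannot see that it never changes between iterations, so moving it out lets it be computed once. The functions below are hypothetical and only illustrate the hoisting, they are not the PRE2_MULMOD64 macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* (lo >> shift) | (hi << inv), where the caller supplies inv == 64 - shift.
   Mirrors the srd/sld/or sequence; requires 0 < shift < 64. */
static uint64_t shift_pair(uint64_t hi, uint64_t lo, unsigned shift, unsigned inv)
{
    return (lo >> shift) | (hi << inv);
}

/* Naive form: 64 - shift is recomputed on every iteration. */
static uint64_t xor_shifted_naive(const uint64_t *hi, const uint64_t *lo,
                                  size_t n, unsigned shift)
{
    assert(shift > 0 && shift < 64);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= shift_pair(hi[i], lo[i], shift, 64 - shift);
    return acc;
}

/* Hoisted form: the subtraction happens once, outside the loop. */
static uint64_t xor_shifted_hoisted(const uint64_t *hi, const uint64_t *lo,
                                    size_t n, unsigned shift)
{
    assert(shift > 0 && shift < 64);
    const unsigned inv = 64 - shift;
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= shift_pair(hi[i], lo[i], shift, inv);
    return acc;
}
```

In pure C an optimising compiler would normally do this hoisting by itself. The difference only matters when the subtraction is hidden inside an asm statement.
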
2006-12-11, 07:36   #54
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts

Quote:
 Originally Posted by geoff In version 1.4.9 I have made a small change to the inline PRE2_MULMOD64 macro that should allow GCC to recognise that the initial subtraction results in a loop invariant. You can try this change out by replacing the '#if 1' with '#if 0' in asm-ppc64.h.
WOW! That made a huge difference on my 2.0 GHz box; from ~245k to over 260k/sec. I'll try it out tomorrow on the faster machines and see what we can really get it cranking to. I'll also let it run to verify correctness.

2006-12-11, 16:29   #55
BlisteringSheep

Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts

On the faster chips, this hurt performance. With the #if 1 on a 2.5 970MP it knocked it from ~330kp/sec back to ~300kp/sec.
