mersenneforum.org  

Old 2006-12-05, 21:40   #45
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts
HASH_MAX_DENSITY

Quote:
Originally Posted by BlisteringSheep View Post
.... HASH_MAX_DENSITY reduced. I haven't experimented with different values; I've just been using Geoff's recommendation to halve it. I'll try some others.
Nothing performed any better than 0.32. I tried 0.16, 0.24, 0.40, 0.48 & 0.56.
  • 0.16: 253571 p/sec
  • 0.24: 265782 p/sec
  • 0.32: 265975 p/sec
  • 0.40: 265653 p/sec
  • 0.48: 263292 p/sec
  • 0.56: 263331 p/sec
  • 0.62: 263521 p/sec
Old 2006-12-06, 01:17   #46
geoff
 
 
Mar 2003
New Zealand

13·89 Posts

Quote:
Originally Posted by BlisteringSheep View Post
Geoff,
I got the new 1.4.7 tarball. Is it possible to create the cache file once and then link to it? I have a ton of machines all working off of an NFS tree. I'm currently hard-linking all of the executables and SoB.dat, and would like to link the sr2cache.bin file as well (I prefer hard links to symlinks for things on the same file system).
Yes, that's what I do. All machines can link to the same SoB.dat, but only machines with the same register width and byte order can link to the same sr2cache.bin.
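As a rough illustration of that rule, here is a minimal C sketch that checks whether two hosts could share one cache file. The names `host_layout`, `detect_layout` and `can_share_cache` are hypothetical, and using the pointer size as a stand-in for register width is an assumption; nothing here is taken from the sr2sieve source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of the properties that matter for the cache file. */
struct host_layout {
  unsigned word_bits;   /* assumed proxy for register width: pointer size in bits */
  int little_endian;    /* 1 if the least significant byte is stored first */
};

/* Probe the machine this code is running on. */
static struct host_layout detect_layout(void)
{
  struct host_layout h;
  uint32_t probe = 1;
  unsigned char first;

  memcpy(&first, &probe, 1);            /* look at the first byte in memory */
  h.word_bits = (unsigned)(sizeof(void *) * 8);
  h.little_endian = (first == 1);
  return h;
}

/* Two hosts may link to the same cache only if both properties match. */
static int can_share_cache(struct host_layout a, struct host_layout b)
{
  assert(a.word_bits != 0 && b.word_bits != 0);  /* layouts must be filled in */
  return a.word_bits == b.word_bits && a.little_endian == b.little_endian;
}
```

SoB.dat is not subject to this check, since all machines can share it.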

Quote:
Note that USE_INLINE_MULMOD makes it run really fast (over 461000 p/sec), but it finds no factors at all (duplicates or new).
OK, thanks for testing this. If it is returning wrong results then the speed can't be compared to a working version, so I'll remove it from the next version.

Quote:
think it also shows that it's worthwhile to have HASH_MAX_DENSITY reduced. I haven't experimented with different values; I've just been using Geoff's recommendation to halve it. I'll try some others.
Halving the maximum density will double the hashtable size, which should generally give a small speedup so long as L1 cache size is not a constraint. The speedup was not dramatic so the default is probably OK.
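To illustrate why halving the density doubles the table, here is a minimal C sketch. It assumes the table is sized as the smallest power of two that keeps the load at or below the maximum density; the helper name `hash_slots` and the power-of-two rounding are assumptions for illustration, not taken from the sr2sieve source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest power-of-two slot count such that
   elements / slots <= max_density. */
static uint64_t hash_slots(uint32_t elements, double max_density)
{
  uint64_t slots = 1;

  assert(max_density > 0.0 && max_density <= 1.0);

  /* Keep doubling until the load factor drops to the allowed density.
     The upper bound only guards against overflow for tiny densities. */
  while ((double)elements / (double)slots > max_density
         && slots < (UINT64_C(1) << 62))
    slots *= 2;

  return slots;
}
```

With 1000 elements, a density of 0.64 gives 2048 slots and a density of 0.32 gives 4096. Whether the larger table still fits in L1 cache depends on the slot size and the CPU, which is the constraint mentioned above.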
Old 2006-12-06, 03:15   #47
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts

Quote:
Originally Posted by geoff View Post
OK, thanks for testing this. If it is returning wrong results then the speed can't be compared to a working version, so I'll remove it from the next version.
If there are ever any tests I can do for you, let me know. I've got at least five different types of PPC64 systems.
I will experiment some more with the BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK suggestions you made next.
Old 2006-12-06, 05:18   #48
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

366₈ Posts

Quote:
Originally Posted by BlisteringSheep View Post
I will experiment some more with the BABY_WORK, GIANT_WORK, EXP_WORK, and SUBSEQ_WORK suggestions you made next.
None of this panned out. I tried combinations with different changes to WORK and LIMIT_BASE, and nothing was better than the current values. In fact, they were all slightly slower, though not enough to be significant.

I promise, last post of the night
[Bet you're all going to be glad when I go away ]
Old 2006-12-07, 01:34   #49
geoff
 
 
Mar 2003
New Zealand

13×89 Posts

Quote:
Originally Posted by BlisteringSheep View Post
None of this panned out. I tried combinations with different changes to WORK and LIMIT_BASE, and nothing was better than the current values. In fact, they were all slightly slower, though not enough to be significant.
That is OK, hopefully it means that the default is optimal.

In version 1.4.8 I have made another attempt at the inline mulmod. No guarantees it will work, but if you want to test it, just compile with -DUSE_INLINE_MULMOD added to CPPFLAGS.

Quote:
[Bet you're all going to be glad when I go away ]
Not at all, your testing so far has been a great help :-)
Old 2006-12-07, 04:54   #50
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts

Quote:
Originally Posted by geoff View Post
In version 1.4.8 I have made another attempt at the inline mulmod. No guarantees it will work, but if you want to test it, just compile with -DUSE_INLINE_MULMOD added to CPPFLAGS.
It did indeed work and is slightly faster. On my short test case (10 minutes with 1 factor), it found the factor with these speeds (all compiled with gcc 4.1.1) on a 2.0 GHz 970:
  • 240975 p/sec with defaults
  • 242375 p/sec with HASH_MAX_DENSITY set to 0.32
  • 245071 p/sec with USE_INLINE_MULMOD plus HASH_MAX_DENSITY set to 0.32
I am currently doing a slightly longer test (little over an hour) with a range that has 2 factors.

Note that these tests are on a different machine with a slower CPU than the results at the top of this page, and the numbers can't be directly compared.


Last fiddled with by BlisteringSheep on 2006-12-07 at 04:56 Reason: speed disclaimer
Old 2006-12-07, 15:40   #51
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts
1.4.8 USE_INLINE_MULMOD results

On the 2.5GHz 970MPs, the speedup is more significant, from about 308000 p/sec to 331000 (over 7%).

There is a similar significant improvement on the 2.2 GHz 970FX, from about 266000 to 286000 (again over 7%).

One thing I did in these tests, unlike last night's test on the slower CPU, was to remove mulmod-ppc64.o from the list of ASM_OBJS when using USE_INLINE_MULMOD.

Last fiddled with by BlisteringSheep on 2006-12-07 at 16:00 Reason: added 2.2 GHz 970FX results
Old 2006-12-08, 01:06   #52
geoff
 
 
Mar 2003
New Zealand

13×89 Posts

Quote:
Originally Posted by BlisteringSheep View Post
On the 2.5GHz 970MPs, the speedup is more significant, from about 308000 p/sec to 331000 (over 7%).

There is a similar significant improvement on the 2.2 GHz 970FX, from about 266000 to 286000 (again over 7%).

One thing I did in these tests, unlike last night's test on the slower CPU, was to remove mulmod-ppc64.o from the list of ASM_OBJS when using USE_INLINE_MULMOD.
Great :-) The inline assembler linkage can probably be improved further. At the moment a whole lot of registers are hard-coded, and it might be better to allow GCC to choose which ones to use.

I don't really know enough about PPC assembler to guess what is likely to work best though, and trial and error will be a long process without a machine at hand to test it on.

If you have the patience to do this yourself, the basic idea is to replace a register in the clobber list with an entry in the output list associated with a temporary variable. The current code looks like this:

Code:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
  register uint64_t ret;

  asm ("li      %0, 64"          "\n\t"
       "sub     %0, %0, %5"      "\n\t"
       "mulld   r7, %1, %2"      "\n\t"
       "mulhdu  r8, %1, %2"      "\n\t"
       "mulld   r26, r7, %4"     "\n\t"
       "mulhdu  r27, r7, %4"     "\n\t"
       "mulld   r28, r8, %4"     "\n\t"
       "mulhdu  r29, r8, %4"     "\n\t"
       "adde    r9, r27, r28"    "\n\t"
       "addze   r10, r29"        "\n\t"
       "srd     r9, r9, %5"      "\n\t"
       "sld     r10, r10, %0"    "\n\t"
       "or      r9, r9, r10"     "\n\t"
       "mulld   r9, r9, %3"      "\n\t"
       "sub     %0, r7, r9"      "\n\t"
       "cmpdi   cr6, %0, 0"      "\n\t"
       "bge+    cr6, 0f"         "\n\t"
       "add     %0, %0, %3"      "\n"
       "0:"
       : "=&r" (ret)
       : "r" (a), "r" (b), "r" (p), "r" (pMagic), "r" (pShift)
       : "r7","r8","r9","r10","r26","r27","r28","r29","cr6" );

  return ret;
}
To allow GCC to choose two registers to use instead of r7,r8 you could change it to:
Code:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t p)
{
  register uint64_t ret, tmp1, tmp2;

  asm ("li      %0, 64"          "\n\t"
       "sub     %0, %0, %7"      "\n\t"
       "mulld   %1, %3, %4"      "\n\t"
       "mulhdu  %2, %3, %4"      "\n\t"
       "mulld   r26, %1, %6"     "\n\t"
       "mulhdu  r27, %1, %6"     "\n\t"
       "mulld   r28, %2, %6"     "\n\t"
       "mulhdu  r29, %2, %6"     "\n\t"
       "adde    r9, r27, r28"    "\n\t"
       "addze   r10, r29"        "\n\t"
       "srd     r9, r9, %7"      "\n\t"
       "sld     r10, r10, %0"    "\n\t"
       "or      r9, r9, r10"     "\n\t"
       "mulld   r9, r9, %5"      "\n\t"
       "sub     %0, %1, r9"      "\n\t"
       "cmpdi   cr6, %0, 0"      "\n\t"
       "bge+    cr6, 0f"         "\n\t"
       "add     %0, %0, %5"      "\n"
       "0:"
       : "=&r" (ret), "=&r" (tmp1), "=&r" (tmp2)
       : "r" (a), "r" (b), "r" (p), "r" (pMagic), "r" (pShift)
       : "r9","r10","r26","r27","r28","r29","cr6" );

  return ret;
}
I wouldn't expect the gain from doing this to be very great however, and it might even be slower (GCC doesn't always make the best choice of register to use). You will also need to make analogous changes to the PRE2_MULMOD64() macro to see the real effect.
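Since a wrong mulmod can go unnoticed apart from finding no factors (as the earlier USE_INLINE_MULMOD build did), a slow but portable reference is handy for checking each register-allocation experiment. The following is a minimal sketch, not code from sr2sieve: `mulmod64_ref` is a hypothetical name, and it uses plain double-and-add rather than the pMagic/pShift method of the assembler version.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: (a * b) mod p using double-and-add.
   Every intermediate value stays below p, so nothing overflows 64 bits. */
static uint64_t mulmod64_ref(uint64_t a, uint64_t b, uint64_t p)
{
  uint64_t r = 0;

  assert(p != 0);
  a %= p;
  b %= p;

  while (b != 0) {
    if (b & 1)
      /* r = (r + a) mod p, written so that r + a is never formed */
      r = (r >= p - a) ? r - (p - a) : r + a;
    /* a = (2 * a) mod p, again without overflow */
    a = (a >= p - a) ? a - (p - a) : a + a;
    b >>= 1;
  }
  return r;
}
```

Comparing `mulmod64(a, b, p)` against `mulmod64_ref(a, b, p)` for a batch of random inputs below p is one way to catch a broken operand or clobber list before starting a long sieve run.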
Old 2006-12-10, 23:05   #53
geoff
 
 
Mar 2003
New Zealand

10010000101₂ Posts

In version 1.4.9 I have made a small change to the inline PRE2_MULMOD64 macro that should allow GCC to recognise that the initial subtraction results in a loop invariant. You can try this change out by replacing the `#if 1' with `#if 0' in asm-ppc64.h.
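For readers who have not looked at the header: in the mulmod64 code earlier in the thread, the initial subtraction is the `64 - pShift` computed by the `li`/`sub` pair, and the result is used by the `srd`/`sld`/`or` sequence. The sketch below shows the same idea in plain C, assuming the macro starts the same way. The names `shr128_lo` and `shift_all` are hypothetical and this is not the code in asm-ppc64.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Low 64 bits of the 128-bit value (hi:lo) shifted right by `shift`,
   given inv == 64 - shift. Mirrors the srd / sld / or sequence. */
static inline uint64_t shr128_lo(uint64_t hi, uint64_t lo,
                                 unsigned shift, unsigned inv)
{
  return (lo >> shift) | (hi << inv);
}

/* Hypothetical loop: the shift count does not change between iterations,
   so 64 - shift is computed once outside the loop instead of per element. */
static void shift_all(const uint64_t *hi, const uint64_t *lo,
                      uint64_t *out, size_t n, unsigned shift)
{
  assert(shift > 0 && shift < 64);       /* keeps both shifts well defined */
  const unsigned inv = 64 - shift;       /* the loop invariant */

  for (size_t i = 0; i < n; i++)
    out[i] = shr128_lo(hi[i], lo[i], shift, inv);
}
```

When the subtraction is hidden inside an asm statement, the compiler cannot hoist it on its own; exposing it as an ordinary C expression lets it do so.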

Last fiddled with by geoff on 2006-12-10 at 23:06
Old 2006-12-11, 07:36   #54
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

2·3·41 Posts

Quote:
Originally Posted by geoff View Post
In version 1.4.9 I have made a small change to the inline PRE2_MULMOD64 macro that should allow GCC to recognise that the initial subtraction results in a loop invariant. You can try this change out by replacing the `#if 1' with `#if 0' in asm-ppc64.h.
WOW! That made a huge difference on my 2.0 GHz box; from ~245k to over 260k/sec. I'll try it out tomorrow on the faster machines and see what we can really get it cranking to. I'll also let it run to verify correctness.
Old 2006-12-11, 16:29   #55
BlisteringSheep
 
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts

On the faster chips, this hurt performance. With the #if 1 on a 2.5 GHz 970MP it knocked it from ~330k p/sec back to ~300k p/sec.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
