mersenneforum.org  

Go Back   mersenneforum.org > Prime Search Projects > Sierpinski/Riesel Base 5

Reply
 
Thread Tools
Old 2007-06-18, 03:04   #122
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13×89 Posts
Default

Quote:
Originally Posted by BlisteringSheep View Post
geoff,
I just wanted to let you know that this did cause a good speedup, about 10% in my testing. That is enough to make the 1.5.x series faster than the later 1.4.x versions (without this addition 1.5.x runs slightly slower than 1.4.39 or 1.4.42).
Great, thanks for testing it. I'll add it as a default compile option in the next version (1.5.7).

Did you need to make any change to the timestamp() function, or does the default gettimeofday() have high enough resolution to choose the right method? It would be interesting to see the output from running it on riesel.dat with the -vv switch.
geoff is offline   Reply With Quote
Old 2007-06-20, 07:03   #123
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

F616 Posts
Default

Quote:
Originally Posted by geoff View Post
Great, thanks for testing it. I'll add it as a default compile option in the next version (1.5.7).

Did you need to make any change to the timestamp() function, or does the default gettimeofday() have high enough resolution to choose the right method? It would be interesting to see the output from running it on riesel.dat with the -vv switch.
Well, I don't know whether or not I "needed" to make a change to timestamp(), but I didn't. I'll post some results from two different CPUs, a 2.2 GHz 970FX and a 2.5 GHz 970MP (the 970MP is dual-core and has more cache).
BlisteringSheep is offline   Reply With Quote
Old 2007-06-20, 07:05   #124
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts
Default results from 2.2 GHz 970FX

Test with original asm-ppc64.h

sr2sieve 1.5.6 -- A sieve for the Sierpinski/Riesel Base 2 projects.
L1 data cache 32Kb (detected), L2 cache 512Kb (default).
Reading `riesel.dat' ...
Read 4293433 terms for 67 sequences from dat format file `riesel.dat'.
Split 67 base 2 sequences into 849 base 2^360 subsequences.
Opening version 1 cache file `sr2cache.bin'...
Loaded Legendre symbol lookup tables for 67 sequences from `sr2cache.bin'.
Using 32 Kb for the baby-steps giant-steps hashtable, maximum density 0.42.
Using ladder method gen/1.
Using 256Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 0.01 terms in this range.
sr2sieve started: 464 <= n <= 20000000, 1867875000000000 <= p <= 1867875100000000
p=1867875013500547, 224926 p/sec, 0 factors, 13.50% done, ETA 20 Jun 02:32
p=1867875027001231, 224696 p/sec, 0 factors, 27.00% done, ETA 20 Jun 02:32
p=1867875040501993, 225091 p/sec, 0 factors, 40.50% done, ETA 20 Jun 02:32
p=1867875054002329, 224960 p/sec, 0 factors, 54.00% done, ETA 20 Jun 02:32
p=1867875067502201, 225057 p/sec, 0 factors, 67.50% done, ETA 20 Jun 02:31
p=1867875081003013, 225017 p/sec, 0 factors, 81.00% done, ETA 20 Jun 02:31
p=1867875094503673, 224457 p/sec, 0 factors, 94.50% done, ETA 20 Jun 02:31

sr2sieve stopped: at p=1867875100000000 because range is complete.
Found factors for 0 terms in 448.851 cpu sec. (expected about 0.01)



Test with original asm-ppc64.h + vec-ppc64.txt

sr2sieve 1.5.6 -- A sieve for the Sierpinski/Riesel Base 2 projects.
L1 data cache 32Kb (detected), L2 cache 512Kb (default).
Reading `riesel.dat' ...
Read 4293433 terms for 67 sequences from dat format file `riesel.dat'.
Split 67 base 2 sequences into 849 base 2^360 subsequences.
Opening version 1 cache file `sr2cache.bin'...
Loaded Legendre symbol lookup tables for 67 sequences from `sr2cache.bin'.
Using 32 Kb for the baby-steps giant-steps hashtable, maximum density 0.42.
Best time for baby step method gen/2: 120.
Best time for baby step method gen/4: 107.
Best time for baby step method gen/8: 102.
Best time for baby step method gen/1: 112.
Best time for giant step method gen/2: 78.
Best time for giant step method gen/4: 76.
Best time for giant step method gen/8: 76.
Best time for giant step method gen/1: 83.
Best time for ladder method gen/2: 5.
Best time for ladder method gen/4: 4.
Best time for ladder method gen/8: 4.
Best time for ladder method gen/1: 9.
Best time for ladder method add/1: 7.
Using baby step method gen/8, giant step method gen/4, ladder method gen/4.
Using 256Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 0.01 terms in this range.
sr2sieve started: 464 <= n <= 20000000, 1867875000000000 <= p <= 1867875100000000
p=1867875014680547, 243753 p/sec, 0 factors, 14.68% done, ETA 20 Jun 02:38
p=1867875029230103, 243560 p/sec, 0 factors, 29.23% done, ETA 20 Jun 02:38
p=1867875043909729, 243884 p/sec, 0 factors, 43.91% done, ETA 20 Jun 02:38
p=1867875058590089, 243455 p/sec, 0 factors, 58.59% done, ETA 20 Jun 02:38
p=1867875073138559, 243790 p/sec, 0 factors, 73.14% done, ETA 20 Jun 02:38
p=1867875087819641, 243305 p/sec, 0 factors, 87.82% done, ETA 20 Jun 02:38

sr2sieve stopped: at p=1867875100000000 because range is complete.
Found factors for 0 terms in 414.741 cpu sec. (expected about 0.01)
BlisteringSheep is offline   Reply With Quote
Old 2007-06-20, 07:08   #125
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts
Default results from 2.5 GHz 970MP

Test with original asm-ppc64.h

sr2sieve 1.5.6 -- A sieve for the Sierpinski/Riesel Base 2 projects.
L1 data cache 32Kb (default), L2 cache 1024Kb (detected).
Reading `riesel.dat' ...
Read 4293433 terms for 67 sequences from dat format file `riesel.dat'.
Split 67 base 2 sequences into 849 base 2^360 subsequences.
Opening version 1 cache file `sr2cache.bin'...
Loaded Legendre symbol lookup tables for 67 sequences from `sr2cache.bin'.
Using 32 Kb for the baby-steps giant-steps hashtable, maximum density 0.42.
Using ladder method gen/1.
Using 512Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 0.01 terms in this range.
sr2sieve started: 464 <= n <= 20000000, 1867875000000000 <= p <= 1867875100000000
p=1867875015468277, 258874 p/sec, 0 factors, 15.47% done, ETA 19 Jun 20:40
p=1867875031065427, 258372 p/sec, 0 factors, 31.07% done, ETA 19 Jun 20:40
p=1867875046531853, 258631 p/sec, 0 factors, 46.53% done, ETA 19 Jun 20:40
p=1867875062128679, 258576 p/sec, 0 factors, 62.13% done, ETA 19 Jun 20:40
p=1867875077595349, 258553 p/sec, 0 factors, 77.60% done, ETA 19 Jun 20:40
p=1867875093061433, 258133 p/sec, 0 factors, 93.06% done, ETA 19 Jun 20:40

sr2sieve stopped: at p=1867875100000000 because range is complete.
Found factors for 0 terms in 390.175 cpu sec. (expected about 0.01)



Test with original asm-ppc64.h + vec-ppc64.txt

sr2sieve 1.5.6 -- A sieve for the Sierpinski/Riesel Base 2 projects.
L1 data cache 32Kb (default), L2 cache 1024Kb (detected).
Reading `riesel.dat' ...
Read 4293433 terms for 67 sequences from dat format file `riesel.dat'.
Split 67 base 2 sequences into 849 base 2^360 subsequences.
Opening version 1 cache file `sr2cache.bin'...
Loaded Legendre symbol lookup tables for 67 sequences from `sr2cache.bin'.
Using 32 Kb for the baby-steps giant-steps hashtable, maximum density 0.42.
Best time for baby step method gen/2: 105.
Best time for baby step method gen/4: 94.
Best time for baby step method gen/8: 89.
Best time for baby step method gen/1: 97.
Best time for giant step method gen/2: 68.
Best time for giant step method gen/4: 66.
Best time for giant step method gen/8: 65.
Best time for giant step method gen/1: 72.
Best time for ladder method gen/2: 4.
Best time for ladder method gen/4: 3.
Best time for ladder method gen/8: 3.
Best time for ladder method gen/1: 7.
Best time for ladder method add/1: 6.
Using baby step method gen/8, giant step method gen/8, ladder method gen/4.
Using 512Kb for the Sieve of Eratosthenes bitmap.
Expecting to find factors for about 0.01 terms in this range.
sr2sieve started: 464 <= n <= 20000000, 1867875000000000 <= p <= 1867875100000000
p=1867875016777271, 278335 p/sec, 0 factors, 16.78% done, ETA 19 Jun 20:46
p=1867875033425419, 278718 p/sec, 0 factors, 33.43% done, ETA 19 Jun 20:46
p=1867875050202569, 278750 p/sec, 0 factors, 50.20% done, ETA 19 Jun 20:46
p=1867875066848389, 278544 p/sec, 0 factors, 66.85% done, ETA 19 Jun 20:46
p=1867875083626081, 278435 p/sec, 0 factors, 83.63% done, ETA 19 Jun 20:46

sr2sieve stopped: at p=1867875100000000 because range is complete.
Found factors for 0 terms in 362.522 cpu sec. (expected about 0.01)
BlisteringSheep is offline   Reply With Quote
Old 2007-06-21, 00:28   #126
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

48516 Posts
Default

Quote:
Originally Posted by BlisteringSheep View Post
Best time for ladder method gen/2: 4.
Best time for ladder method gen/4: 3.
Best time for ladder method gen/8: 3.
Best time for ladder method gen/1: 7.
Best time for ladder method add/1: 6.
Using baby step method gen/8, giant step method gen/8, ladder method gen/4.
It looks like gettimeofday() has just enough resolution to make a reasonable choice, but not enough to always make the best choice. In this case it can't distinguish between gen/4 and gen/8 methods, and so chooses gen/4 just because it is first on the list.

Does anyone know how to use a more precise timer on the ppc64? If not then I could run each method a few times in a loop and take the total time.
geoff is offline   Reply With Quote
Old 2007-06-21, 00:50   #127
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

11001111101012 Posts
Default

Quote:
Originally Posted by geoff View Post
It looks like gettimeofday() has just enough resolution to make a reasonable choice, but not enough to always make the best choice. In this case it can't distinguish between gen/4 and gen/8 methods, and so chooses gen/4 just because it is first on the list.

Does anyone know how to use a more precise timer on the ppc64? If not then I could run each method a few times in a loop and take the total time.
Here are two functions you can use:

#include <time.h>
#include <stdio.h>
#include <sys/sysctl.h>

uin64_t get_cycles_per_clock(void)
{
uint64_t i, t[100];
uint64_t cpufrequency = 0, tbfrequency = 0, tbcycles;
size_t cpufrequencylen = sizeof(uint64_t), tbfrequencylen = sizeof(uint64_t);
double u;
static int cpumib[2] = { CTL_HW, HW_CPU_FREQ } ;
static int tbmib[2] = { CTL_HW, HW_TB_FREQ } ;

sysctl(cpumib,2,&cpufrequency,&cpufrequencylen,0,0);
sysctl(tbmib,2,&tbfrequency,&tbfrequencylen,0,0);

u = (double) cpufrequency / (double) tbfrequency;
tbcycles = (uint64_t) u;

while (tbcycles + 0.5 < u) tbcycles += 1;
while (tbcycles - 0.5 > u) tbcycles -= 1;

for (i=0; i<100; ++i)
t[i] = mach_absolute_time() * tbcycles;

for (i=1; i<100; ++i)
if (t[i] - t[i-1] > 0)
break;

return t[i] - t[i-1];
}


uint64_t get_clocks(void) {
unsigned long retval;
asm volatile("mftb %0": "=r"(retval));
return retval;
}

Call get_cycles_per_clock() once. Call get_clocks() before and after the function call you are timing. Multiple the difference by the returned value from get_cycles_per_clock(). BTW that value is 75 on a 2.5 GHz PPC and 60 on a 2.0 GHz PPC.
rogue is offline   Reply With Quote
Old 2007-06-21, 01:06   #128
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

13·89 Posts
Default

Quote:
Originally Posted by rogue View Post
uint64_t get_clocks(void) {
unsigned long retval;
asm volatile("mftb %0": "=r"(retval));
return retval;
}

Call get_cycles_per_clock() once. Call get_clocks() before and after the function call you are timing. Multiple the difference by the returned value from get_cycles_per_clock(). BTW that value is 75 on a 2.5 GHz PPC and 60 on a 2.0 GHz PPC.
Thanks very much :-) Since I only need to know which method is fastest and not the absolute time it takes, I assume I could just compare the results of get_clocks() without calling get_cycles_per_clock() first.
geoff is offline   Reply With Quote
Old 2007-06-21, 02:49   #129
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

3×5×443 Posts
Default

Quote:
Originally Posted by geoff View Post
Thanks very much :-) Since I only need to know which method is fastest and not the absolute time it takes, I assume I could just compare the results of get_clocks() without calling get_cycles_per_clock() first.
That would work.
rogue is offline   Reply With Quote
Old 2007-06-21, 04:36   #130
BlisteringSheep
 
BlisteringSheep's Avatar
 
Oct 2006
On a Suzuki Boulevard C90

2×3×41 Posts
Default

geoff, rogue, Just to let you know that I tested the get_clocks() routine and it works correctly under Linux.
The get_cycles_per_clock() routine does not compile under Linux, as I believe that it is using the OS-X sysctl() API.
BlisteringSheep is offline   Reply With Quote
Old 2007-06-22, 01:49   #131
geoff
 
geoff's Avatar
 
Mar 2003
New Zealand

115710 Posts
Default

I have added the vector methods code along with rogue's assembler for timing to the version 1.5.7 source.

I think there is still room to improve performance on the ppc64, in at least two ways:

1. The mulmod routine should be able to be re-arranged so that the multiplications are done in a different order and the intermediate product re-used in the next mulmod. At the moment we are effectively multiplying pMagic*b every time, although both pMagic and b are constant over many mulmods.

2. Assuming there are enough spare registers, it should be possible to interleave the instructions from two seperate mulmods, which should allow the CPU to get better throughput. e.g. y1 = x1*b (mod p) and y2 = x2*b (mod p) are independent computations so it should be possible to intermix the instructions from each. At the moment there are unnecessary dependenncies between them because they both use the same scratch registers.
geoff is offline   Reply With Quote
Old 2007-06-22, 02:59   #132
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

664510 Posts
Default

Quote:
Originally Posted by geoff View Post
I have added the vector methods code along with rogue's assembler for timing to the version 1.5.7 source.

I think there is still room to improve performance on the ppc64, in at least two ways:

1. The mulmod routine should be able to be re-arranged so that the multiplications are done in a different order and the intermediate product re-used in the next mulmod. At the moment we are effectively multiplying pMagic*b every time, although both pMagic and b are constant over many mulmods.
This multiply is basically free since it can be done at the same time as the other multiplies in the pipeline, so I don't think there would be any gain.

Quote:
Originally Posted by geoff View Post
2. Assuming there are enough spare registers, it should be possible to interleave the instructions from two seperate mulmods, which should allow the CPU to get better throughput. e.g. y1 = x1*b (mod p) and y2 = x2*b (mod p) are independent computations so it should be possible to intermix the instructions from each. At the moment there are unnecessary dependenncies between them because they both use the same scratch registers.
There are plenty of registers. It should be possible to do two (or possibly 3 or 4) at once, but it would require either changing the ASM or using gcc syntax for inline assembler, which I am not any good at.
rogue is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
srsieve/sr2sieve enhancements rogue Software 304 2021-11-06 13:51
32-bit of sr1sieve and sr2sieve for Win pepi37 Software 5 2013-08-09 22:31
sr2sieve question SaneMur Information & Answers 2 2011-08-21 22:04
sr2sieve client mgpower0 Prime Sierpinski Project 54 2008-07-15 16:50
How to use sr2sieve nuggetprime Riesel Prime Search 40 2007-12-03 06:01

All times are UTC. The time now is 13:31.


Thu Jun 30 13:31:58 UTC 2022 up 77 days, 11:33, 0 users, load averages: 1.51, 1.97, 2.07

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔