mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mtsieve enhancements (https://www.mersenneforum.org/showthread.php?t=25486)

rogue 2021-01-02 01:54

[QUOTE=pepi37;568000]Excuse me, but what is point of have MT sieve that six core CPU doesnot have "horsepower" to feel all workers?
So if I take 6*1 instance of your sieve then my CPU will have "enough horsepower" but if I use 1*6 then it wont?
I still think that is only cosmetic bug , not real one, since

Sieve speed is same , version from build 2037 and latest one.[/QUOTE]

I'm not certain what you mean by "cosmetic bug" in terms of what you highlighted in red?

When the application starts all of the workers are started and put themselves in a state called "waiting for work". The main thread works in a loop like this:

1) find thread in "waiting for work"
2) if none are found, sleep for 10 ms.
3) if one is found, get the next 1e6 primes then tell the worker to start processing that chunk
4) return to step 1

You have 6 threads, so the assumption is that this loop is iterated 6 times and thus all workers have work. That assumption is not correct.

At some point the worker started by step 3 will finish. It will then go back to the status of "waiting for work".

What you see happening with this sieve is that the worker started by the first iteration of the loop finishes its chunk of work before the main thread has iterated 6 times.

Because of this the first worker (one that already did a chunk of 1e6 primes) will get the next chunk of 1e6 primes instead of the 5th or 6th worker.

In essence getting 6 chunks of 1e6 primes takes more time than testing a single chunk of 1e6 primes.

This is going to happen with fncsieve, fkbnsieve, gfndsieve, and dmdsieve because of how fast their worker threads can process a single chunk of work. This is why I suggested using -w to bump from 1e6 to 1e7. This should give the main thread the opportunity to feed all of the workers before any individual worker completes it chunk of work.

The only solution to this problem would be to modify the framework such that the worker threads work on a range of primes, i.e. 100e9 to 101e9, 101e9 to 102e9, etc instead of a count of primes. The downside of such a change is that each worker will be testing a variable number of primes rather than a fixed number of primes. This would not be good for GPU workers as they are most efficient using a number of primes that is a multiple of the GPU work group size.

pepi37 2021-01-02 02:08

[QUOTE=rogue;568010]I'm not certain what you mean by "cosmetic bug" in terms of what you highlighted in red?
[/QUOTE]


To explain: when I start your[COLOR=Red][B] latest released sieve then p=0 [/B][/COLOR]( and should not be zero since it doesnot start from zero , it start from header value in sieve)
And since sieve in previous version works ok then it is cosmetic bug not flaw in program itself ( also sieve speed is same)


fbncsieve.exe -p[B][COLOR=Red] 347880187459691[/COLOR][/B] -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6[B][COLOR=SeaGreen] [COLOR=Blue]-w 1e7[/COLOR][/COLOR][/B] -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=red][B] p=0[/B][/COLOR], [COLOR=blue][B]46.94M p/sec[/B][/COLOR], no factors found


But in one part you were correct , using -w 1e7 it nearly doubles speed!!!!

Happy5214 2021-01-02 07:42

[QUOTE=pepi37;568012]fbncsieve.exe -p[B][COLOR=Red] 347880187459691[/COLOR][/B] -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6[B][COLOR=SeaGreen] [COLOR=Blue]-w 1e7[/COLOR][/COLOR][/B] -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=red][B] p=0[/B][/COLOR], [COLOR=blue][B]46.94M p/sec[/B][/COLOR], no factors found


But in one part you were correct , using -w 1e7 it nearly doubles speed!!!![/QUOTE]

What is the CPU utilization with -w 1e7? I have a feeling there may still be room to up that number even more, until that p=0 finally goes away (i.e. all workers are working).

rogue 2021-01-02 14:14

[QUOTE=pepi37;568012]To explain: when I start your[COLOR=Red][B] latest released sieve then p=0 [/B][/COLOR]( and should not be zero since it doesnot start from zero , it start from header value in sieve)
And since sieve in previous version works ok then it is cosmetic bug not flaw in program itself ( also sieve speed is same)


fbncsieve.exe -p[B][COLOR=Red] 347880187459691[/COLOR][/B] -P 6000000000000000 -i 1100.txt -o 1100.txt -f N -W6[B][COLOR=SeaGreen] [COLOR=Blue]-w 1e7[/COLOR][/COLOR][/B] -O s53factodes.txt
fbncsieve v1.4, a program to find factors of k*b^n+c numbers for fixed b, n, and c and variable k
Sieve started: 347880187459691 < p < 6e15 with 75104 terms (1014 < k < 1999948, k*10^1100000+1) (expecting 5887 factors)
[COLOR=red][B] p=0[/B][/COLOR], [COLOR=blue][B]46.94M p/sec[/B][/COLOR], no factors found


But in one part you were correct , using -w 1e7 it nearly doubles speed!!!![/QUOTE]

The "p=0" should be fixed in the next release of the framework. The problem is that it tries to determine the max prime that has been used for testing, but if you have workers that have done nothing, then their max prime is 0. The change to the framework is to ignore workers that are doing nothing when determining max prime.

Happy5214 2021-01-05 09:35

I'm not able to create new sieve files using srsieve2 (SVN rev 84):
[code]$ ./srsieve2 -W 4 -n 5e4 -N 145e3 -P "1e9" -o 't17_b2.prp' -f B -s "34767*2^n-1"
srsieve2 v1.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Fatal Error: Expected 95001 terms, but only set 0
[/code]

rogue 2021-01-05 13:07

[QUOTE=Happy5214;568456]I'm not able to create new sieve files using srsieve2 (SVN rev 84):
[code]$ ./srsieve2 -W 4 -n 5e4 -N 145e3 -P "1e9" -o 't17_b2.prp' -f B -s "34767*2^n-1"
srsieve2 v1.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Fatal Error: Expected 95001 terms, but only set 0
[/code][/QUOTE]

Hmm. I cannot reproduce this on Windows or OS X. In other words, it works fine on both OSes. I assume this is linux. Does it work if you compile with debug=yes or -O2 instead of -O3?

Here is the OS X output:

[code]
./srsieve2 -W 4 -n 5e4 -N 145e3 -P "1e9" -o 't17_b2.prp' -f B -s "34767*2^n-1"
srsieve2 v1.3, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Sieve started: 3 < p < 1e9 with 95001 terms (50000 < n < 145000, k*2^n+c) (expecting 89964 factors)
Sieving with generic logic
Split 1 base 2 sequence into 10 base 2^20 sequences.
Sieve completed at p=1000000009.
Processor time: 153.19 sec. (0.53 sieving) (3.14 cores)
8181 terms written to t17_b2.prp
Primes tested: 50843824. Factors found: 86820. Remaining terms: 8181. Time: 48.66 seconds.
[/code]

Happy5214 2021-01-06 07:24

[QUOTE=rogue;568469]Hmm. I cannot reproduce this on Windows or OS X. In other words, it works fine on both OSes. I assume this is linux. Does it work if you compile with debug=yes or -O2 instead of -O3?[/QUOTE]

Yes, it was Ubuntu 20.04. It worked for both of those settings, so I tried to compile again with -O3 after a [c]make clean[/c], and that worked. Maybe it was a leftover object file? Sorry about that. :redface:

rogue 2021-01-06 13:11

[QUOTE=Happy5214;568551]Yes, it was Ubuntu 20.04. It worked for both of those settings, so I tried to compile again with -O3 after a [c]make clean[/c], and that worked. Maybe it was a leftover object file? Sorry about that. :redface:[/QUOTE]

That's okay. I'm glad your problem is resolved.

rogue 2021-01-07 16:42

I ran range of Sierpinski/Riesel sequences thru the various programs that can sieve them. This can be used as a guide to determine which program to use for your sieving purposes. Note that all tested the same input file for the same range of P:

[code]
srsieve -P2002626803 -m1e10 b25_old.abcd
srsieve 1.1.0 -- A sieve for integer sequences in n of the form k*b^n+c.
Read 1532485 terms for 81 sequences from ABCD format file `b25_old.abcd'.
Split 81 base 25 sequences into 1432 base 25^90 subsequences.
srsieve started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
Sieving 1000950101 <= p <= 2002626803 eliminated 49708 terms, 1482777 remain.
Wrote 1482777 terms for 81 sequences to srsieve format file `srsieve.out'.
srsieve stopped: at p=2002626803 because --pmax was reached.

These two lines are from srsieve.log:
01/07/21 08:51:03 srsieve started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
01/07/21 09:33:45 srsieve stopped: at p=2002626803 because --pmax was reached. -- 2562 seconds

sr2sieve -ib25_old.abcd -P2002626803 -q
sr2sieve 1.9.1 -- A sieve for multiple sequences k*b^n+/-1 or b^n+/-k.
Read 1532485 terms for 81 sequences from ABCD format file `b25_old.abcd'.
Split 81 base 25 sequences into 1432 base 25^90 subsequences.
Expecting to find factors for about 49622.16 terms in this range.
sr2sieve 1.9.1 started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
p=1990805857, 2188973 p/sec, 49245 factors, 98.8% done, 0 sec/factor, ETA 07 Jan 08:47
sr2sieve 1.9.1 stopped: at p=2002626803 because range is complete.
Found factors for 49708 terms in 486.927 sec. (expected about 49622.16)

sr2sieve -ib25_old.abcd -P2002626803 -q -x
sr2sieve 1.9.1 -- A sieve for multiple sequences k*b^n+/-1 or b^n+/-k.
Read 1532485 terms for 81 sequences from ABCD format file `b25_old.abcd'.
Split 81 base 25 sequences into 1432 base 25^90 subsequences.
Expecting to find factors for about 49622.16 terms in this range.
sr2sieve 1.9.1 started: 300000 <= n <= 1000000, 1000950101 <= p <= 2002626803
p=1982941577, 1086937 p/sec, 48960 factors, 98.0% done, 0 sec/factor, ETA 07 Jan 10:20
sr2sieve 1.9.1 stopped: at p=2002626803 because range is complete.
Found factors for 49708 terms in 977.900 sec. (expected about 49622.16)

srsieve2 -ib25_old.abcd -P2002626803
srsieve2 v1.3.1, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Split 81 base 25 sequences into 1432 base 25^90 sequences.
Sieve started: 1000950101 < p < 2e9 with 1532485 terms (300000 < n < 1000000, k*25^n+c) (expecting 49531 factors)
p=1996357801, 19.78K p/sec, 49480 factors found at 15.61 f/sec (last 1 min), 99.6% done. ETC 2021-01-07 10:16
Sieve completed at p=2000000087.
CPU time: 2319.69 sec. (0.23 sieving) (1.00 cores)
1482777 terms written to b25_n.abcd
Primes tested: 47452160. Factors found: 49708. Remaining terms: 1482777. Time: 2325.74 seconds.

srsieve2cl -ib25_old.abcd -P2e9
srsieve2cl v1.3.1, a program to find factors of k*b^n+c numbers for fixed b and variable k and n
Sieving with generic logic
Split 81 base 25 sequences into 1432 base 25^90 sequences.
GPU primes per worker is 143360
Sieve started: 1000950101 < p < 2e9 with 1532485 terms (300000 < n < 1000000, k*25^n+c) (expecting 49531 factors)
p=1925947301, 65.81K p/sec, 46892 factors found at 40.83 f/sec (last 1 min), 92.6% done. ETC 2021-01-07 08:38
Sieve completed at p=2002626803.
CPU time: 189.30 sec. (0.12 sieving) (0.26 cores) GPU time: 713.29 sec.
1482777 terms written to b25_n.abcd
Primes tested: 47452160. Factors found: 49708. Remaining terms: 1482777. Time: 723.70 seconds.
[/code]

Summarized:
[list][*]srsieve - 2562 seconds[*]sr2sieve - 487 seconds (with Legendre tables)
{*]sr2sieve - 977 seconds (w/o Legendre tables)[*]srsieve2 - 2326 seconds[*]srsieve2cl - 723 seconds[/list]
So the conclusion is this:[list][*]srsieve - only use if you do not trust srsieve2/srsieve2cl[*]sr2sieve - use if Legendre tables can be built from the sequences[*]sr2sieve - use if Legendre tables cannot be built from the sequences and you have a slow GPU[*]srsieve2 - use to start sieving sequences if you have a slow GPU then switch to sr2sieve if Legendre tables can be built from the sequences.[*]srsieve2 - use to start sieving sequences if you have too many sequences for the GPU then switch to sr2sieve if Legendre tables can be built from the sequences.[*]srsieve2cl - use to start sieving sequences if you have a fast GPU then switch to sr2sieve if Legendre tables can be built from the sequences.[/list]
"too many sequences for the GPU" is going to be dependent on GPU memory. I do not know (at this time) how to determine beforehand if the GPU can handle all of the sequences thrown at it. What I do know is that my Quadro P3200 can handle more than 5000 sequences at a time (but less than 10000) and still be at least 3x faster than srsieve2.

If you have read between the lines you will see stats for srsieve2cl. Nobody here has seen it as it is in development. It should be ready for release this weekend.

After srsieve2cl is released, you will need to do some independent testing to compare srsieve2cl to sr2sieve to see which is faster.

Maybe someone the forum could package up a script that take an input file and range of P and tell the user which program to use. That same script could compare the outputs to verify that there are no missed factors.

One last thing, srsieve2cl does not end on the exact boundary specified on the command line. That is okay. The key is that I had to use the max p from that run to set the max p for srsieve2, srsieve, and sr2sieve to ensure that I was comparing apples to apples.

henryzz 2021-01-07 22:21

What is holding srsieve2 back from reaching sr2sieve's speed? How come the sieving % was only 23%? How does srsieve2cl compare when running on the cpu? I assume sr1sieve is still much faster than srsieve2 for one sequence.

rogue 2021-01-08 01:21

[QUOTE=henryzz;568685]What is holding srsieve2 back from reaching sr2sieve's speed? How come the sieving % was only 23%? How does srsieve2cl compare when running on the cpu? I assume sr1sieve is still much faster than srsieve2 for one sequence.[/QUOTE]

There are significant differences between sr2sieve and srsieve. srsieve2 only handles what srsieve can handle, which is fairly general. I will have to write and test a lot of code before srsieve2 can do what sr2sieve does. I work on it in bits and pieces, when I am motivated, but sr2sieve is a mess from the coding perspective. Until today I didn't think I could write code to compete (speed wise) with sr2sieve and I didn't just want to "copy and paste" its code into srsieve2 because it isn't quite that easy. I was given an ask to support p > 2^52 on sr1sieve for non-x86 CPUs. I know how to do that so I think I also have a change of pulling sr2sieve into srsieve2 and getting better performance without all of myriad code paths that sr2sieve supports.


All times are UTC. The time now is 21:08.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.