Register FAQ Search Today's Posts Mark Forums Read

 2017-09-28, 15:41 #1 IBethune   Nov 2010 52 Posts Multithreaded PFGW I have implemented a version of PFGW which supports multi-threading, essentially along the same lines as LLR i.e. adding a gwset_num_threads() call between gwinit() and gwsetup(). The code is working correctly i.e. giving the correct answers, but does not always give speedup. If anyone wants to play, the patch is here: https://pastebin.com/xw7ykJbM (note that the patch also works with the most recent gwnum, since it removes the use of the deprecated gwmap_to_fft_info() function). Multiple threads are enabled using the -T command line argument. For example, if I run a PRP test on a large factorial prime 208003!-1 I find that going from 1 to 2 threads I get a 1.6x speedup, which is comparable to what I get using the LLR program to test 387*2^3322763+1, which is a Proth of roughly the same digit length. Code: ../pfgw64 -q'208003!-1' -V -T2 PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2] Generic modular reduction using generic reduction FMA3 FFT length 336K, Pass1=448, Pass2=768, clm=4, 2 threads on A 3374558-bit number ./llr64.3.8.20 -q387*2^3322763+1 -d -t2 Starting Proth prime test of 387*2^3322763+1 Using all-complex FMA3 FFT length 256K, Pass1=128, Pass2=2K, 2 threads, a = 5 However, if I test the same Proth using PFGW (either a PRP or a primality test), the performance goes down when I switch from 1 to two threads. The main difference I see is that this test uses the "Special modular reduction": Code: ../pfgw64 -q'387*2^3322763+1' -V -T2 PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2] Special modular reduction using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads on 387*2^3322763+1 ../pfgw64 -q'387*2^3322763+1' -V -t -T2 PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2] Primality testing 387*2^3322763+1 [N-1, Brillhart-Lehmer-Selfridge] Running N-1 test using base 5 Special modular reduction using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads on 387*2^3322763+1 However, even if I hack the code to use the generic reduction for this candidate, I still observe the slowdown: Code: ../pfgw64 -q'387*2^3322763+1' -V -T2 PFGW Version 3.8.3.64BIT.20161203.Mac_Dev [GWNUM 29.2] Generic modular reduction using all-complex FMA3 FFT length 240K, Pass1=1280, Pass2=192, clm=2, 2 threads on 387*2^3322763+1 So question to George, Mark and other knowledgable people... is there anything which PFGW is doing differently to LLR in terms of how it uses gwnum that results in an configuration which is not efficient on multiple threads? - Iain
2017-09-28, 19:15   #2
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

3·7·479 Posts

Quote:
 Originally Posted by IBethune ... is there anything which PFGW is doing differently to LLR in terms of how it uses gwnum that results in an configuration which is not efficient on multiple threads? - Iain
As soon as you are in the 'Generic modular reduction using generic reduction' territory, by definition you have PFGW doing things differently to LLR. Neither LLR nor P95 have an equivalent use mode.

 2017-09-28, 20:16 #3 rogue     "Mark" Apr 2003 Between here and the 23·13·67 Posts The term "Special" or "Generic" txt is output depending upon the call used to gwnum to set up the modular reduction. gwsetup() vs gwsetup_general_mod()/gwsetup_general_mod_64(). The text following "using" is the output produced by a call to gwmodulo_as_string().
 2017-09-29, 08:27 #4 IBethune   Nov 2010 318 Posts That's why I'm confused - it doesn't seem to be as simple as generic/special reduction controlling whether the multithreaded FFT performs well or not e.g. million digit Factorial prime with PFGW -> generic reduction -> good thread scaling million digit Proth prime with LLR -> special reduction -> good thread scaling million digit Proth prime with PFGW -> special reduction -> poor thread scaling million digit Proth prime with PFGW, hacked to use generic reduction -> poor thread scaling. So there is some other effect at work here, which I don't quite understand at this point. I wonder if George has any insight?
 2017-09-29, 11:28 #5 axn     Jun 2003 2×2,719 Posts Iain, can you post the output (showing the FFT) from both single thread and 2-thread runs for the Proth number? Also the iteration timings and the CPU usage as well for both of these? What kind of CPU is it? If it is HT, are both the threads of the 2-thread run somehow running on the same physical core? Last fiddled with by axn on 2017-09-29 at 11:32 Reason: Clarified "output"
 2018-03-16, 17:52 #6 rogue     "Mark" Apr 2003 Between here and the 23·13·67 Posts I applied the change and didn't see any improvement either. Did you look at how llr implemented it?

 Similar Threads Thread Thread Starter Forum Replies Last Post houding Software 1 2016-06-20 12:11 tului Software 6 2015-11-28 21:59 Joe O Sierpinski/Riesel Base 5 5 2010-09-30 14:07 Andi47 Msieve 1 2010-02-20 01:16 nuggetprime Msieve 5 2008-08-11 07:48

All times are UTC. The time now is 22:04.

Mon Feb 6 22:04:16 UTC 2023 up 172 days, 19:32, 1 user, load averages: 1.31, 1.38, 1.28