mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2022-06-24, 17:42   #2773
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

6655₁₀ Posts

Quote:
 Originally Posted by Zhangrc why does GPUowl do some iterations beyond the exponent?
GEC block size is kept constant through the run on an exponent, and is often highly composite. The exponent is necessarily prime, or there is no point to the PRP test. So GEC block size > mod(exponent, GEC block size) > 0. The entire set of gpuowl PRP iterations is guarded by GEC, by computing additional iterations to complete the last GEC block, up to and past the exponent. (IIRC Preda has explained this before.)
For example, 77232917 / 1000 = 77232.917, so 77233 blocks of 1000 (77233000 iterations) would be used.
The overhead of GEC is small, but large enough that a larger-than-default block size, even though it computes a few more total iterations, can actually be more efficient if reliability is high. (See the end of this reference post.)
Code:
-block <value>     : PRP GEC block size, or LL iteration-block size. Must divide 10'000.
10,000 = 10^4 = 2^4 * 5^4, implying legal block sizes of 2 4 5 8 10 16 20 25 40 50 80 100 125 200 250 400 500 625 1000 1250 2000 2500 5000, and perhaps 1 and 10,000. For 113032800 = 400 * 282582 > 113032481, the overshoot is 113032800 - 113032481 = 319 < block size; apparently you used block size 400.
Block size is determined at the start, stored in the save file, and used unchanged throughout the gpuowl run of the exponent. Saving 0.3% on the whole run by using block size 1000 instead of 400 more than pays for a possible additional ~1000 - 400 = 600 iterations at the end; 113M * 0.3% ≈ 339,000 iterations saved.
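The block-size arithmetic above can be checked with a short Python sketch. This is only an illustration of the numbers in this post, not gpuowl's actual code; the function names are made up for the example:

```python
import math

def legal_block_sizes(limit=10_000):
    """Block sizes must divide 10'000, per gpuowl's -block help text."""
    return [d for d in range(1, limit + 1) if limit % d == 0]

def guarded_iterations(exponent, block):
    """gpuowl rounds the iteration count up to a whole number of GEC blocks."""
    return math.ceil(exponent / block) * block

# 10,000 = 2^4 * 5^4 has (4+1)*(4+1) = 25 divisors: the 23 listed plus 1 and 10,000
print(len(legal_block_sizes()))                            # -> 25
print(guarded_iterations(77_232_917, 1_000))               # -> 77233000
print(guarded_iterations(113_032_481, 400))                # -> 113032800
print(guarded_iterations(113_032_481, 400) - 113_032_481)  # -> 319 extra iterations
```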

Mprime/prime95 does it differently, leaving unguarded the last few iterations past the last whole block, up to the exponent. It also dynamically varies the GEC block size based on the observed error rate or the iterations left, IIRC.
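The difference between the two strategies comes down to which side of the last whole block gets special treatment. A hedged sketch of that arithmetic (illustrative only, not either program's code):

```python
def unguarded_tail(exponent, block):
    """mprime/prime95-style: iterations past the last whole GEC block, left unguarded."""
    return exponent % block

def overshoot(exponent, block):
    """gpuowl-style: extra iterations computed past the exponent to finish the last block."""
    return (-exponent) % block

# With the exponent and block size 400 from the example above:
print(unguarded_tail(113_032_481, 400))  # -> 81 iterations run without GEC protection
print(overshoot(113_032_481, 400))       # -> 319 iterations computed past the exponent
```

The two quantities always sum to the block size (when the exponent is not a multiple of it), which is why a larger block means a longer overshoot in gpuowl but also a longer unguarded tail in the mprime scheme.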

Last fiddled with by kriesel on 2022-06-24 at 18:21

2022-07-30, 15:24   #2774
SyauqiA

Jul 2022
Indonesia

10₂ Posts

Is this the expected performance of my GPU?

Hello, I am a beginner user of gpuowl and a real beginner to the technical aspects of Prime95 and GPU computing. I've been using Prime95 for a few months with my CPU, but now I want to use my GPU for contributing, too. My laptop's GPU is an AMD Radeon R7 M440. I've downloaded a build of version v7.2-93-ga5402c5-dirty from https://www.mersenneforum.org/showpo...4&postcount=30 . I tested it with M77936867 and it's performing at around 26800 us/iter:
Code:
20220730 21:36:28 GpuOwl VERSION v7.2-93-ga5402c5-dirty
20220730 21:36:28 GpuOwl VERSION v7.2-93-ga5402c5-dirty
20220730 21:36:28 config: -user SyauqiMA -cpu recomp-radeonr7-2-w2 -device 1 -log 20000 -save 5
20220730 21:36:28 config: -prp 77936867 -iters 60000 -proof 9 -use NO_ASM -log 20000
20220730 21:36:28 device 1, unique id ''
20220730 21:36:28 recomp-radeonr7-2-w2 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20220730 21:36:28 recomp-radeonr7-2-w2 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DMAX_ACCURACY=1 -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,} -DNO_ASM=1 -cl-std=CL2.0 -cl-finite-math-only "
20220730 21:36:31 recomp-radeonr7-2-w2 77936867 OpenCL compilation in 2.55 s
20220730 21:36:31 recomp-radeonr7-2-w2 77936867 maxAlloc: 0.0 GB
20220730 21:36:31 recomp-radeonr7-2-w2 77936867 You should use -maxAlloc if your GPU has more than 4GB memory. See help '-h'
20220730 21:36:31 recomp-radeonr7-2-w2 77936867 P1(0) 0 bits
20220730 21:36:31 recomp-radeonr7-2-w2 77936867 PRP starting from beginning
20220730 21:36:43 recomp-radeonr7-2-w2 77936867 OK 0 on-load: blockSize 400, 0000000000000003
20220730 21:36:43 recomp-radeonr7-2-w2 77936867 validating proof residues for power 9
20220730 21:36:43 recomp-radeonr7-2-w2 77936867 Proof using power 9
20220730 21:37:16 recomp-radeonr7-2-w2 77936867 OK 800 0.00% 1579c241dc63eca6 26859 us/it + check 10.83s + save 0.61s; ETA 24d 05:29
20220730 21:41:23 recomp-radeonr7-2-w2 77936867 10000 fc4f135f7cf4ad29 26843
20220730 21:46:03 recomp-radeonr7-2-w2 77936867 OK 20000 0.03% 3cd1bd9d5e09cbc5 26851 us/it + check 10.84s + save 0.63s; ETA 24d 05:09
20220730 21:50:32 recomp-radeonr7-2-w2 77936867 30000 c4e0ff35e3290d98 26844
20220730 21:52:08 recomp-radeonr7-2-w2 77936867 Stopping, please wait..
20220730 21:52:20 recomp-radeonr7-2-w2 77936867 OK 33600 0.04% dbfb036c7ae970f8 26854 us/it + check 10.84s + save 0.60s; ETA 24d 05:06
20220730 21:52:20 recomp-radeonr7-2-w2 Exiting because "stop requested"
20220730 21:52:20 recomp-radeonr7-2-w2 Bye
Is that us/iter really expected from my GPU? Or did I miss something in the setup, or is there something I did wrong? Because I think that, for a GPU, that speed is really slow. Any help will be appreciated! Oh, and another weird thing: GPU-Z shows 0% GPU load, but the GPU and memory clocks are maxed out during the PRP. Thanks for your time and your help, and sorry if my English is a bit weird!
Attached Thumbnails
2022-07-31, 11:16   #2775
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2·4,999 Posts

Quote:
 Originally Posted by tdulcet That is not what I observed recently when benchmarking the Colab GPUs
Related to that cudaLucas benchmarking: the A100 FFT files you distribute in the package are mistakenly named "Nvidia A100 blah blah....", which is not recognized by the cudaLucas guts. So, as soon as Gugu has the benevolence to give the user the A100 (I say "the" because I think they have only one, which they distribute to all users in turns, considering how often we get it ), a few hours will be wasted to make the "A100 blah blah" files, unless the user realizes, and renames the two files properly before starting the Colab notebook. The newly made files are 99% similar to the ones you distribute (which is normal; timings will differ here and there for different runs, but the resulting FFTs should be the same).

So, please edit the file names in the package and remove the "Nvidia" part.

Last fiddled with by LaurV on 2022-07-31 at 11:20

2022-07-31, 14:01   #2776
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5·11³ Posts

Quote:
 Originally Posted by SyauqiA Hello, I am a beginner user of gpuowl and a real beginner to the technical aspects of Prime95 and GPU computing. I've been using Prime95 for a few months with my CPU, but now I want to use my GPU for contributing, too. My laptop's GPU is an AMD Radeon R7 M440. I've downloaded a build of version v7.2-93-ga5402c5-dirty from https://www.mersenneforum.org/showpo...4&postcount=30 . I tested it with M77936867 and it's performing at around 26800 us/iter ... Is that us/iter really expected from my GPU? Or did I miss something in the setup, or is there something I did wrong? Because I think that, for a GPU, that speed is really slow. Any help will be appreciated! Oh, and another weird thing: GPU-Z shows 0% GPU load, but the GPU and memory clocks are maxed out during the PRP. Thanks for your time and your help, and sorry if my English is a bit weird!
Welcome to the forum, and to GPU GIMPS computing. Your English seems to me quite good in the quoted post, and that is appreciated. You may find some of the reference info I've assembled over the past 4 years useful.
Congrats on getting gpuowl up and running.

Given that the DP:SP ratio for your GPU model is only 1/16, it might be more productively used for TF with mfakto than for other computation types. https://www.techpowerup.com/gpu-spec...-r7-m440.c2851 shows it has about 1% of the DP performance of a Vega 20 Radeon VII GPU, and about 1.4% of the memory bandwidth. The datasheet indicates your GPU is an IGP (integrated graphics processor). There are GPUs slower than yours, and considerably faster ones. IGPs tend to have low power budgets and run slowly. IIRC they can also have clock linkages to the CPU, to the system RAM clock (since they share system RAM), and to fan speed (since they share a fan, being in the same chip package as the CPU).

To confirm gpuowl is running on your GPU, not some other OpenCL device in the system, pause the prime95 application, monitor CPU usage in Task Manager, and monitor any other GPUs in your system also with GPU-Z. If you're running Windows 10 or later, Task Manager may also show loading on the primary GPU. On an otherwise idle system, load increases you see after launching gpuowl and letting it load files and get going will show what system resources it's using.

Also run gpuowl-win -h >help.txt and review the part of its output that lists the OpenCL devices found on your system, checking that the R7 M440 is present and that its device number matches the one you specify on the gpuowl command line or in its config.txt. (That device listing by number is a feature I asked Mihai for, for scenarios like this.) Make sure you're running gpuowl on the intended OpenCL device, not a different GPU, IGP, or the CPU cores.
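That device check can be scripted. The Python sketch below pulls the numbered device list out of the saved help text; the layout ("-device <N> : ..." followed by "N : description" lines) is assumed from the listing SyauqiA posted later in this thread, and may differ between gpuowl versions:

```python
import re

def parse_devices(help_text):
    """Return {device_number: description} from the -device section of gpuowl -h output."""
    devices = {}
    in_section = False
    for line in help_text.splitlines():
        if line.lstrip().startswith("-device"):
            in_section = True          # the numbered list follows this flag's help line
            continue
        m = re.match(r"\s*(\d+)\s*:\s*(.+)", line)
        if in_section and m:
            devices[int(m.group(1))] = m.group(2).strip()
        elif in_section:
            break                      # first non-matching line ends the device block
    return devices

# Example using the listing from SyauqiA's help.txt:
sample = """-device <N>        : select a specific device:
0  : Intel(R) UHD Graphics 620- not-AMD
1  : Iceland-AMD Radeon R7 M440 AMD"""
print(parse_devices(sample))  # -> {0: 'Intel(R) UHD Graphics 620- not-AMD', 1: 'Iceland-AMD Radeon R7 M440 AMD'}
```

With the result in hand, confirming that the -device number in config.txt points at the R7 M440 is a one-line lookup.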

There are benchmark tables for many GPU models online for primality testing (DP dependent) and TF (SP or INT dependent). I don't see your GPU model listed there. Please contribute benchmark data as directed at those pages, after verifying device mapping and use as above.

Re odd or missing sensor values: I've seen the combined use of Windows 7, AMD drivers, Windows remote desktop, and GPU-Z result in some sensor readings being zeroed or blanked. I started reporting that issue to TechPowerUp at v2.7.0, and subsequently also to AMD, but there's been no resolution despite numerous GPU-Z and driver updates. Perhaps you are experiencing something similar. I've attached a GPU-Z capture illustrating that with a discrete RX550 GPU.

Good luck and have fun!
Attached Thumbnails

Last fiddled with by kriesel on 2022-07-31 at 14:03

2022-07-31, 16:19   #2777
SyauqiA

Jul 2022
Indonesia

2₁₆ Posts

Quote:

In my first post, I have already confirmed my gpuowl is running with the right GPU.
This is my device list from the help file :
Code:
-device <N>        : select a specific device:
0  : Intel(R) UHD Graphics 620- not-AMD
1  : Iceland-AMD Radeon R7 M440 AMD
and this is my config.txt file :
Code:
-user SyauqiMA -cpu recomp-radeonr7-2-w2 -device 1 -log 20000 -save 5
Just to clarify, I am using Windows 10.

But I forgot to include another important piece of information in my last post (I'm sorry!). So in addition to 0% GPU load in GPU-Z, the CPU and GPU usage for gpuowl in Task Manager is also mostly 0% (I've added screenshots of those), even though my fan is spinning hard. My overall CPU and GPU usage while running gpuowl (and no other demanding software) is still at idle level, just with Firefox running, but I have no problem with GPU load when opening a game. Is this perhaps an issue with my system's settings?

Thank you for your time, and in the meantime maybe I will try out mfakto
Attached Thumbnails

2022-08-04, 00:48   #2779
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5·11³ Posts

Windows 11 Task Manager sees mfaktc activity on a GPU, but not gpuowl on the same GPU at the same time. The attached image shows mfaktc running on a GTX 1050Ti, 3d at 100%. It drops to 90%, as does the overall GPU usage indicated, after gpuowl v6.11-380 is also launched on the same GPU and the OpenCL compile completes, so PRP iterations start.
Attached Thumbnails
2022-08-09, 12:26   #2780
tdulcet

"Teal Dulcet"
Jun 2018

2×5×7 Posts

Quote:
 Originally Posted by LaurV Related to that cudaLucas benchmarking: the A100 FFT files you distribute in the package are mistakenly named "Nvidia A100 blah blah....", which is not recognized by the cudaLucas guts. So, as soon as Gugu has the benevolence to give the user the A100 (I say "the" because I think they have only one, which they distribute to all users in turns, considering how often we get it ), a few hours will be wasted to make the "A100 blah blah" files, unless the user realizes, and renames the two files properly before starting the Colab notebook. The newly made files are 99% similar to the ones you distribute (which is normal; timings will differ here and there for different runs, but the resulting FFTs should be the same). So, please edit the file names in the package and remove the "Nvidia" part.
Sorry, I did not notice your post sooner, as I only occasionally check this thread. Next time please post in our dedicated thread, as this is unrelated to GpuOwl.

Interesting, thanks for letting us know! As you may have seen in this post, at the same time I did that benchmarking, I also regenerated all the optimization files using twice as many iterations, and added files for the A100 GPU, for which I used four times the default number of iterations. These optimization files also cover a much larger range of FFT lengths (1K to 32768K) compared to what our GPU notebook currently does by default when the files do not already exist (1024K to 8192K). I did this on Google Cloud, as we had some credits that were expiring. The other five GPUs had identical names on Google Cloud as on Colab, so it is odd that the A100's is different. Daniel and I have never been lucky enough to actually get an A100 GPU on Colab, so we have not been able to test this. Anyway, on Google Cloud it was called NVIDIA A100-SXM4-40GB, so I just pushed an update to rename those two files using just A100-SXM4-40GB. See here for the changes.

If you are interested, I attached a file with the timings on the A100 GPU for FFT lengths from 1M to 32M in GpuOwl for both the master and v6 branches, which could be compared to those respective CUDALucas optimization files.
Attached Files
 A100 bench.txt (18.9 KB, 0 views)

2022-08-09, 14:36   #2781
LaurV
Romulan Interpreter

"name field"
Jun 2011
Thailand

2·4,999 Posts

Thanks. At the time I posted that, I also figured out that those files had more work put into them, so I ended up renaming them by hand too (i.e. ignoring my files and keeping yours, with the "Nvidia" part deleted from the name). Currently I am getting an A100 every 3 or 4 days (with 4 sessions at the same time; one of the four gets the A100, the rest get either a V100 or a P100). But yeah, as I was telling Chris in the Colab thread, this is a paid account, and, I believe more importantly, this is the Singapore center, which seems not as "stressed" as the US ones. Gugu allocates resources geographically, and as I am in Thailand, I mostly get the Singapore clocks... Not many people want to "colab" in the area. This is how the A100 (currently running) looks, compared with the V100. (edit: of course, these posts can be moved to the Colab threads, sorry for the offtopic; I am a bit in a hurry now, I may move them myself later if nobody does it in the meantime)

Last fiddled with by LaurV on 2022-08-09 at 14:42

