#1 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1D79₁₆ Posts
Note: this material is kept only for historical purposes, or for use on hardware that can't run gpuowl but can run clLucas. Gpuowl is about twice as fast as clLucas on the same hardware and has other advantages, including superior error detection and superior gpu device support. clLucas is no longer being maintained, while gpuowl is frequently updated.
See the gpuowl reference thread at https://www.mersenneforum.org/showthread.php?t=23386 and the Available Software summary at http://www.mersenneforum.org/showpos...91&postcount=2

This thread is intended to hold only reference material specifically for clLucas, the OpenCL based Lucas Lehmer test program (not to be confused with CUDALucas).

(Suggestions are welcome. Discussion posts in this thread are not encouraged. Please use the reference material discussion thread http://www.mersenneforum.org/showthread.php?t=23383. Off-topic posts may be moved or removed, to keep the reference threads clean, tidy, and useful.)

Table of contents

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2021-11-15 at 16:12 Reason: added interim file sizes
#2 |
Learning how to run and benchmark clLucas took a while, since its user interface differed in a lot of details from CUDALucas and CUDAPm1, already familiar to me. Benchmarking convenience features in the CUDA code were not fully carried over to the clLucas code, so I supplemented them with a set of Windows batch files.
While fft benchmarking clLucas, I found its timings were not very reproducible. So I put the timings for the various settings choices and fft lengths into a large spreadsheet, and reran the shorter timing cases iteratively as the minimum per fft length moved about among the parameter choices.

After this extensive benchmarking of the various fft lengths, thread counts, and sixstepfft choice, I ran a single double-check using the 3670016 (3584K) fft length. I obtained a per-iteration timing of 18.41 msec, and found the fft benchmark output of clLucas understates the time to do an actual full iteration by a factor of 18.41/9.36 =~ 1.97:1. For comparison, gpuOwL v1.9-74f1a38 4M -legacy took 10.88 msec on the same gpu.

Non-power-of-two fft lengths in clLucas were plentiful, but many did not provide speed advantages over its power-of-two lengths, and none provided speed advantages over gpuOwL's small set of power-of-two lengths in their useful ranges. clLucas offers larger fft lengths than gpuOwL, so it can run exponents gpuOwL does not currently support.

I sliced and diced the clLucas benchmark a few different ways in plots. The first attachment shows all threads and sixstep choices plotted together, above 1M fft length. The second shows the per-fft-minimum timings versus fft length. The third shows, for each fft length, the ratio of max timing option / min timing option. The fourth shows the fft timing divided by the fft length in K, which flattens the plot. Note all clLucas values are over 2 microseconds per K fft timing, and remember to roughly double that to get the actual iteration timing scale of ~4 microseconds/K. The power-of-two ffts are the low points there. For comparison, gpuOwL's 5.01 msec/2048K is 2.44 microseconds/K; 10.88 msec/4096K is 2.66 microseconds/K iteration timing; 21.26 msec/8192K is 2.60 microseconds/K.

Last fiddled with by kriesel on 2019-11-18 at 14:28
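The arithmetic behind those comparisons can be re-run with a short Python sketch. It assumes 9.36 msec is the clLucas fft benchmark figure for the 3584K length, which is how the ratio above reads. The helper name `us_per_k` is made up for illustration and is not part of clLucas or gpuOwL.

```python
# Timings quoted in the post, in milliseconds.
BENCH_MS = 9.36    # assumed: clLucas fft benchmark figure for the 3584K fft
ACTUAL_MS = 18.41  # measured per-iteration time in the double-check run


def us_per_k(ms_per_iter: float, fft_k: int) -> float:
    """Convert msec per iteration into microseconds per K of fft length."""
    return ms_per_iter * 1000.0 / fft_k


# How much the fft benchmark understates a real iteration.
ratio = ACTUAL_MS / BENCH_MS
print(f"actual / benchmark = {ratio:.2f}")

# clLucas actual iteration timing, normalized by fft length.
print(f"clLucas 3584K: {us_per_k(ACTUAL_MS, 3584):.2f} microsec/K")

# gpuOwL figures from the post: (msec per iteration, fft length in K).
for ms, k in [(5.01, 2048), (10.88, 4096), (21.26, 8192)]:
    print(f"gpuOwL {k}K: {us_per_k(ms, k):.2f} microsec/K")
```

Normalizing by fft length is what makes different lengths comparable on one scale, which is why the fourth plot is the flat one.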
#3 |
The attachment contains some observations made while getting familiar with cllucas 1.04 for Windows. As always, it is shared in appreciation of the efforts of the code author and those who helped in the early development and testing. Where applicable I've included pointers to thread posts. Please PM me with any additions or corrections. This particular list is based on less usage than others I've made for other software (partly because I only recently acquired AMD gpus).
Last fiddled with by kriesel on 2019-11-18 at 14:28
#4 |
It was necessary to build some batch files to benchmark clLucas. Extending them provides a guided sequence for setting up and tuning clLucas after the program files have been placed in a working directory.
Unzip, read them, then proceed. The main batch file is cllstart. It will prompt for actions and wait. Ctrl-C will stop it. Use at your own risk.

Last fiddled with by kriesel on 2019-11-18 at 14:29
#5 |
Note, if output redirection is used, only the Platform line(s) are redirected; the rest is apparently sent to stderr.
Code:
    Platform 0 : Advanced Micro Devices, Inc.
    $ clLucas -h|-v
    $ clLucas [-d device_number] [-info] [-sixstepfft] [-i inifile] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-polite iteration] [-k] exponent|input_filename
    $ clLucas [-d device_number] [-info] [-sixstepfft] [-i inifile] [-polite iteration] -r
    $ clLucas [-d device_number] [-info] [-sixstepfft] -clfftbench start end distance

    -h            print this help message
    -v            print version number
    -info         print device information
    -sixstepfft   use Six Step FFT
    -i            set .ini file name (default = "clLucas.ini")
    -f            set fft length (if round off error then exit)
    -s            save all checkpoint files
    -t            check round off error all iterations
    -polite       GPU is polite every n iterations (default -polite 1) (-polite 0 = GPU aggressive)
    -clfftbench   exec clFFT benchmark (Ex. $ ./clLucas -d 1 -clfftbench 1048576 8388608 1048576 )
    -r            exec residue test.
    -k            enable keys (p change -polite, t disable -t, s change -s)

Last fiddled with by kriesel on 2019-11-18 at 14:29
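Since most of the output apparently goes to stderr, a wrapper that logs a benchmark run needs to capture both streams. The Python sketch below is one way to do that and is not taken from the batch files in this thread. The function names `build_bench_cmd` and `run_capture` are hypothetical, the executable name `clLucas` is assumed to be on the path, and only options listed in the help text above are used.

```python
import subprocess
from typing import List


def build_bench_cmd(device: int, start: int, end: int, distance: int,
                    sixstep: bool = False, exe: str = "clLucas") -> List[str]:
    """Assemble a -clfftbench command line in the order the help text shows."""
    cmd = [exe, "-d", str(device)]
    if sixstep:
        cmd.append("-sixstepfft")
    cmd += ["-clfftbench", str(start), str(end), str(distance)]
    return cmd


def run_capture(cmd: List[str]) -> str:
    """Run cmd and return stdout and stderr merged into a single string."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True, check=False)
    return result.stdout
```

Calling `run_capture(build_bench_cmd(1, 1048576, 8388608, 1048576))` would correspond to the example in the help text. Merging stderr into stdout keeps the Platform line and the timing lines together in one log.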
#6 |
Based on a brief test on a large exponent, 1143276383, requiring fft length 64Mi, each p or q file is 536870928 bytes, that is, fft_length x 8 bytes + 16 bytes. So a p file and a q file pair for that exponent together occupy just over 1 GiB.
It appears to store the interim residue in double precision float format. Some other programs store a much more compact packed binary representation that is independent of fft_length.

Last fiddled with by kriesel on 2021-11-15 at 16:11
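The size formula can be checked with a few lines of Python. This is a sketch of the arithmetic only. The contents of the extra 16 bytes are not known from the test above, and the packed-size estimate assumes a packed residue needs about one bit per bit of the exponent, which is not a measurement of any particular program.

```python
BYTES_PER_DOUBLE = 8
EXTRA_BYTES = 16  # fixed overhead observed in the test; its contents are not documented here


def checkpoint_bytes(fft_length: int) -> int:
    """Size of one clLucas p or q file, per the formula observed above."""
    return fft_length * BYTES_PER_DOUBLE + EXTRA_BYTES


exponent = 1143276383
fft_length = 64 * 2**20  # 64Mi

one_file = checkpoint_bytes(fft_length)
pair = 2 * one_file
print(f"one file : {one_file} bytes")
print(f"p+q pair : {pair} bytes = {pair / 2**30:.6f} GiB")

# Assumption: a packed residue mod 2^p - 1 fits in p bits, rounded up to whole bytes.
packed = (exponent + 7) // 8
print(f"packed   : {packed} bytes, about {one_file / packed:.1f}x smaller")
```

Because the file size scales with fft length and not with the exponent, the gap against a packed format grows whenever a longer fft is chosen for the same exponent.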