2017-04-19, 03:07   #2597
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13×373 Posts
fftlength fallback

Quote:
 Originally Posted by kriesel
 On a Quadro 2000, testing the limits of fft length crashes the program, without completing any benchmarks as far as I can tell. Observed repeatable on the only Quadro 2000 I've yet run it on.
 $ cudalucas2.05.1-cuda4.2-windows-x64 -cufftbench 1 65536 5 >>bigfftbench.txt
 CUDALucas.cu(1055) : cudaSafeCall() Runtime API error 2: out of memory.
 On a GeForce GTX480, it will run many fft lengths and output timings to stdout, then terminate before reaching 65536 or producing the fft lengths file. At least on mine. Then scaling back the maximum to what it reached on stdout produces a file. Quadro 2000 has 1GB VRAM, GTX480 has 1.5.
If the program cannot allocate sufficient card memory, have it fall back to the next smaller fftlength and state that it's doing so; recursively until sufficient memory can be allocated. On the GTX480, the maximum fftlength that can be benchmarked depends somewhat on which CUDA version the 2.05.1 executable was built for. On the Quadro 2000 it's pretty constant. Running the same executables on the two cards, the GTX480 outputs many fftlength timings to stdout before hitting the limit or finishing; on the Quadro 2000 the program doesn't output any fftlength timings before halting with the out of memory error if started with a reasonable lower fftlength and a too-large upper fftlength. GTX480 will do at least up to 58320; Quadro 2000 up to 38880. (Both do 38880/GB)
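
A rough Python sketch of the fallback being suggested, not CUDALucas code. Everything named here is hypothetical: `OutOfDeviceMemory` stands in for the "out of memory" runtime error, `try_allocate` for whatever allocates card memory for a given length, and `fake_allocator` simulates a card with a fixed limit. It uses a loop where the suggestion says "recursively"; the effect is the same.

```python
class OutOfDeviceMemory(Exception):
    """Hypothetical stand-in for a CUDA 'out of memory' allocation failure."""


def allocate_with_fallback(requested_k, candidate_lengths_k, try_allocate):
    """Return the largest candidate length (in K) <= requested_k that allocates."""
    usable = sorted((n for n in candidate_lengths_k if n <= requested_k), reverse=True)
    for length_k in usable:
        try:
            try_allocate(length_k)
        except OutOfDeviceMemory:
            # Say what is happening instead of halting the whole run.
            print(f"fft length {length_k}K: out of memory, falling back to next smaller length")
            continue
        return length_k
    raise OutOfDeviceMemory("no candidate fft length fits in card memory")


def fake_allocator(limit_k):
    """Simulated card (assumption): any length above limit_k fails to allocate."""
    def try_allocate(length_k):
        if length_k > limit_k:
            raise OutOfDeviceMemory(length_k)
    return try_allocate


if __name__ == "__main__":
    lengths = [16384, 32768, 38880, 58320, 65536]
    # Limit taken from the Quadro 2000 figure reported in this post.
    print(allocate_with_fallback(65536, lengths, fake_allocator(38880)))
```

With the simulated 38880K limit, a request for 65536K reports each failed length and settles on 38880K rather than stopping with an error.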

Last fiddled with by kriesel on 2017-04-19 at 03:15 Reason: add empirical data

2017-04-29, 16:19   #2598
TheJudger

"Oliver"
Mar 2005
Germany

10001010110₂ Posts

OK, got my hands on another set of P100, this time the higher clocked Tesla P100-SXM2-16GB. Compared to the Tesla P100-PCIE-16GB these babies have 300W TDP (instead of 250W) and a bit higher clock rates (base and boost clocks). Memory bandwidth is exactly the same... and CUDALucas performance too! So it seems like CUDALucas is completely memory bandwidth bound on P100!

Oliver

2017-05-08, 00:49   #2599
storm5510
Random Account

Aug 2009
U.S.A.

2×3×7×43 Posts

2.06 Beta runs very well on my hardware. The only difference I see is the round-off error values are a bit higher. With 2.05, they stayed in the 0.05 range. With the Beta, they are running 0.62 to 0.75. I don't know if this is worth mentioning, but I guess it won't hurt. Thanks!

2017-05-08, 00:49   #2599
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13·373 Posts
2.06 beta

Quote:
 Originally Posted by storm5510 2.06 Beta runs very well on my hardware. The only difference I see is the round-off error values are a bit higher. With 2.05, they stayed in the 0.05 range. With the Beta, they are running 0.62 to 0.75. I don't know if this is worth mentioning, but I guess it won't hurt. Thanks!
Hmm. What hardware? In my experience, on multiple card types, a round-off error >0.50 causes the program to terminate. I'm also seeing and documenting multiple other issues and notifying flashjh as they're found. How many other beta testers are out there?

2017-05-08, 23:45   #2601
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12F1₁₆ Posts

Quote:
 Originally Posted by kriesel Hmm. What hardware? In my experience, on multiple card types, a round-off error >0.50 causes the program to terminate. I'm also seeing and documenting multiple other issues and notifying flashjh as they're found. How many other beta testers are out there?
Correction: cudapm1 terminates if the round-off error exceeds 0.40; cudalucas changes fftlength and resumes from the last checkpoint if the round-off error exceeds 0.35.
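
For illustration only, the two thresholds as stated in this correction can be written as a small Python decision function. This is a sketch of the described behaviour, not code from either program; the function name and the returned strings are made up for the example.

```python
# Thresholds as stated in the post above; everything else is hypothetical.
CUDAPM1_TERMINATE_ABOVE = 0.40
CUDALUCAS_RETRY_ABOVE = 0.35


def roundoff_action(program, roundoff_error):
    """Describe what the named program is reported to do at this round-off error."""
    if program == "cudapm1":
        return "terminate" if roundoff_error > CUDAPM1_TERMINATE_ABOVE else "continue"
    if program == "cudalucas":
        if roundoff_error > CUDALUCAS_RETRY_ABOVE:
            return "change fftlength and resume from last checkpoint"
        return "continue"
    raise ValueError(f"unknown program: {program}")


if __name__ == "__main__":
    for err in (0.05, 0.62):
        print(err, roundoff_action("cudalucas", err))
```

By this rule a sustained round-off error of 0.62 could not simply keep running in cudalucas, which is why the reported values looked suspicious.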

2017-05-09, 04:11   #2602
storm5510
Random Account

Aug 2009
U.S.A.

2·3·7·43 Posts

Quote:
 Originally Posted by kriesel Hmm. What hardware? In my experience, on multiple card types, a round-off error >0.50 causes the program to terminate. I'm also seeing and documenting multiple other issues and notifying flashjh as they're found. How many other beta testers are out there?
HP workstation with an i5-3570 running Windows 10 Pro x64. NVidia GTX-750Ti running CUDA 8.

Edit: I just noticed a serious error in my post about the Beta. The round-off error values should be 0.062 to 0.075. I fudged the decimal point. Sorry!

Last fiddled with by storm5510 on 2017-05-09 at 04:20 Reason: Correction

2017-06-05, 17:28   #2604
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

13·373 Posts
recovery from illegal residues

Ugh. Overnight a CUDALucas V2.05.1 run goes from fine, to pointless, just over half way through. This was not the system or gpu that was overheating previously.

| Jun 04 23:44:45 | M80409907 42500000 0x1b919bf0d5dc5667 | 4320K 0.29297 37.1880 743.76s | 19:19:45:29 52.85% |
| Jun 04 23:57:09 | M80409907 42520000 0x26de355425603463 | 4320K 0.28125 37.2359 744.71s | 19:19:25:02 52.87% |
| Jun 05 00:09:28 | M80409907 42540000 0x0000000000000002 | 4320K 0.28125 36.9652 739.30s | 19:19:04:25 52.90% |
| Jun 05 00:20:54 | M80409907 42560000 0x0000000000000002 | 4320K 0.23906 34.3121 686.24s | 19:18:42:01 52.92% |
... until manual detection ~10am.

About three gpu-weeks lost. System backups failed. Now have savefiles ini parameter set to one, executable switched to 2.06beta, and launched.

V2.05 would continue to compute meaningless 0x02 residues at full fft length. The April 18 beta computes 0x02 residues until the next checkpoint, and terminates after detecting the illegal residue. Batch file wrapper relaunches it, and repeat. The May 5 build of 2.06beta does the same as the April 18 build. So I conclude that an illegal residue and bad c and t files produced by v2.05.1 are detected but not recovered from by v2.06beta through the April 18 or May 5 build.

I manually renamed the c and t files to prevent restarting from them. After a bit of running, the savefiles directory looks like:

 Directory of C:\Users\Ken\My Documents\cl-quadro2000-2\savefiles

06/05/2017 11:03 AM .
06/05/2017 11:03 AM ..
06/05/2017 10:38 AM          0 .empty.txt
06/05/2017 11:59 AM 10,051,280 s80409907.100000.fda890b7e00cf3cd.cls
06/05/2017 11:03 AM 10,051,280 s80409907.14136.73bcfcd608d670c6.cls
06/05/2017 10:38 AM 10,051,280 s80409907.43616986.0000000000000002.cls
06/05/2017 10:52 AM 10,051,280 s80409907.43618538.0000000000000002.cls
               5 File(s) 40,205,120 bytes

It looks like the naming convention for savefiles is s[exponent].[iteration].[hex 64-bit residue].cls.

What's the procedure for actually using a valid savefile if needed? Halt CUDALucas, delete or rename or move c and t files, move or copy an s file and rename it to be a c or t file, relaunch CUDALucas? The readme file covers creation of savefiles but seems to be silent about use of savefiles.
(end)

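
A small Python sketch for picking a candidate savefile by name, assuming the naming convention guessed at in this post (s[exponent].[iteration].[hex 64-bit residue].cls) is right. The function names are hypothetical, and it only filters on the 0x02 residue seen in the log above; a normal-looking residue in the file name does not prove the file contents are good, and this does not answer how a savefile should be put back into use.

```python
import re

# Assumed pattern: s<exponent>.<iteration>.<16 hex digits>.cls
SAVEFILE_RE = re.compile(r"^s(\d+)\.(\d+)\.([0-9a-fA-F]{16})\.cls$")

# 2 is a fixed point of x -> x*x - 2, so once a residue hits 2 it stays there.
ILLEGAL_RESIDUES = {0x2}


def parse_savefile(name):
    """Return (exponent, iteration, residue) or None if the name does not match."""
    m = SAVEFILE_RE.match(name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), int(m.group(3), 16)


def latest_plausible_savefile(names, exponent):
    """Pick the highest-iteration savefile for exponent whose residue is not 0x02."""
    best = None
    for name in names:
        parsed = parse_savefile(name)
        if parsed is None:
            continue
        exp, iteration, residue = parsed
        if exp != exponent or residue in ILLEGAL_RESIDUES:
            continue
        if best is None or iteration > best[0]:
            best = (iteration, name)
    return None if best is None else best[1]


if __name__ == "__main__":
    listing = [
        ".empty.txt",
        "s80409907.100000.fda890b7e00cf3cd.cls",
        "s80409907.14136.73bcfcd608d670c6.cls",
        "s80409907.43616986.0000000000000002.cls",
        "s80409907.43618538.0000000000000002.cls",
    ]
    print(latest_plausible_savefile(listing, 80409907))
```

Run against the directory listing above, this skips both 0x02 files and selects the iteration-100000 savefile from the restarted run.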
2017-06-24, 20:27   #2606
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4849₁₀ Posts
CUDALucas feature suggestion: run time estimation for entire worktodo file

Prime95 estimates completion date/time for multiple entries in its worktodo file. It's very handy for scheduling. CUDALucas computes the estimated time of completion (ETA) of the current exponent only, while in progress.

It would be useful to have an option in CUDALucas to read the entire worktodo file, and compute and display time span and completion date for each line of the worktodo file, at the outset of a run, following the usual header and optional info section. Option could be -w or -work. The output might look something like the following (for a GTX 1050 Ti or similar speed card):

Code:
Work Queue Status
Start-date start-time  exponent   current-iteration  run-length  iteration-time  total-time-est  completion-estimate  %-complete
Jun 22     18:05:07    M80443463  41200000           4608K       13.0346         12d 3:13:40     Jun 28 16:10:31      51.21%
Jun 28     16:10:31    M43161917  0                  2304K       6.0141          3d 0:06:20      Jul 01 16:16:51      0.00%
(any additional exponents queued would follow in list)
Total of 2 exponents queued, occupying estimated 15d 3:20:00 total, 8d 22:11:44 remaining.

Draft pseudocode (without having looked at the existing source code):

Code:
Get current date and time
Set start date and time for first work as current date and time
Zero exponent count, estimated run time total, estimated remaining time total
open worktodo file for read
Output header for work table
While (!EOF) {
    read a line of worktodo
    parse to obtain exponent
    increment count of valid lines containing work
    if there is a checkpoint file for the exponent in the working directory to resume from {
        read it to obtain fft length and saved iteration number
        (or error handling if read fails or values obtained are not valid)
    } else {
        determine fft length for the exponent
        assume zero iterations performed
    }
    perhaps, if it is the first valid work line, save exponent for resumption or start after work estimation
    look up the iteration time (in ms) for the fft length
    compute duration in seconds as (exponent-2) iterations times iteration-time divided by 1000
    compute percent-done as iterations done / (exponent-2) * 100
    compute remaining duration for this exponent as duration * (1 - percent-done/100)
    compute completion date and time as start date and time plus remaining duration
    output line
    set start date and time (for next worktodo line) as completion date and time for this line's exponent
    add this exponent's estimated time to total estimated run time
    add this exponent's estimated remaining duration to total remaining time estimate
}
close worktodo
output summary line with count of exponents, total estimated run time and total remaining time.
(continue; resume or start first exponent in worktodo, if there is at least one valid work line)

One could consider also re-estimating the worktodo each time after completing an exponent.

Comments?
(end)

Last fiddled with by kriesel on 2017-06-24 at 20:40 Reason: distinguish full run time versus remaining run time totals

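
The arithmetic in the draft pseudocode can be tried out with a short Python sketch. It is only a sketch under stated assumptions: reading worktodo and checkpoint files and looking up iteration times are left out, so each queue entry is handed in as a hypothetical tuple of (exponent, iterations already done, milliseconds per iteration), and `estimate_queue` is an invented name, not a CUDALucas function.

```python
from datetime import datetime, timedelta


def estimate_queue(entries, start):
    """Estimate per-exponent and total times for a work queue.

    entries: iterable of (exponent, iterations_done, ms_per_iteration) tuples
             (hypothetical input; real code would get these from worktodo,
             checkpoint files and the fft timing file).
    start:   datetime at which the first entry starts or resumes.
    Returns (rows, total, remaining_total); each row is
    (start, exponent, iterations_done, duration, finish, percent_done).
    """
    rows = []
    total = timedelta()
    remaining_total = timedelta()
    for exponent, done, ms_per_iter in entries:
        iterations = exponent - 2  # LL test length used in the pseudocode
        duration = timedelta(seconds=iterations * ms_per_iter / 1000)
        fraction_done = done / iterations
        remaining = duration * (1 - fraction_done)
        finish = start + remaining
        rows.append((start, exponent, done, duration, finish, 100 * fraction_done))
        start = finish  # the next queued exponent starts when this one ends
        total += duration
        remaining_total += remaining
    return rows, total, remaining_total


if __name__ == "__main__":
    # Exponents, iteration counts and timings taken from the sample table above.
    queue = [(80443463, 41200000, 13.0346), (43161917, 0, 6.0141)]
    rows, total, remaining = estimate_queue(queue, datetime(2017, 6, 22, 18, 5, 7))
    for row_start, exponent, done, duration, finish, pct in rows:
        print(f"{row_start:%b %d %H:%M:%S} M{exponent} {done} {duration} {finish:%b %d %H:%M:%S} {pct:.2f}%")
    print(f"Total of {len(rows)} exponents queued, {total} total, {remaining} remaining.")
```

Each entry's start time is chained from the previous entry's estimated finish, which is what lets the last row double as the completion estimate for the whole queue.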
2017-07-11, 03:12   #2607
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4849₁₀ Posts

Quote:
 Originally Posted by kriesel If the program cannot allocate sufficient card memory, have it fall back to the next smaller fftlength and state that it's doing so; recursively until sufficient memory can be allocated. On the GTX480, the maximum fftlength that can be benchmarked depends somewhat on which CUDA version the 2.05.1 executable was built for. On the Quadro 2000 it's pretty constant. Running the same executables on the two cards, the GTX480 outputs many fftlength timings to stdout before hitting the limit or finishing; on the Quadro 2000 the program doesn't output any fftlength timings before halting with the out of memory error if started with a reasonable lower fftlength and a too-large upper fftlength. GTX480 will do at least up to 58320; Quadro 2000 up to 38880. (Both do 38880/GB)
The above are for the extreme case of requesting an fftbench run from 1k to some maximum. I subsequently found that with separate runs, higher values could be benchmarked. Make a run 1-38880, then 38880 to a higher value, and repeat, and manually combine the separate runs into one fft file. This was done to explore the limits of the software and hardware. The higher fft lengths have little practical value on slower GPUs, since a full LL run would take several years, perhaps longer than the remaining reliable-function life of a GPU card. A line in an fft file like
16384 294471259 132.2527
corresponds to an LL test run taking more than a year on a Quadro 2000,
32768 580225813 282.8326
to one taking up to 5.2 years, and
38880 685923253 396.8878
to one taking up to 8.6 years.
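
The run times quoted above can be reproduced with a few lines of Python. This assumes the three columns of an fft file line are fft length in K, the largest exponent for that length, and milliseconds per iteration; that reading is inferred from the figures in this post, not from the CUDALucas documentation, and the function names are made up.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def parse_fft_line(line):
    """Split an fft file line into (length_k, max_exponent, ms_per_iteration).

    Column meanings are an assumption based on the arithmetic in the post.
    """
    length_k, max_exponent, ms_per_iter = line.split()
    return int(length_k), int(max_exponent), float(ms_per_iter)


def ll_years(max_exponent, ms_per_iteration):
    """Years for a full LL test: (exponent - 2) iterations at the given ms each."""
    return (max_exponent - 2) * ms_per_iteration / 1000 / SECONDS_PER_YEAR


if __name__ == "__main__":
    for line in ("16384 294471259 132.2527",
                 "32768 580225813 282.8326",
                 "38880 685923253 396.8878"):
        length_k, exponent, ms = parse_fft_line(line)
        print(f"{length_k}K: up to {ll_years(exponent, ms):.1f} years")
```

The three lines work out to roughly 1.2, 5.2 and 8.6 years, matching the figures given in the post.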

FFT or threads benchmarking over wide ranges of fft lengths has revealed some serious anomalies on certain GPU and CUDA-level combinations.

Last fiddled with by kriesel on 2017-07-11 at 03:14

