Old 2018-06-03, 18:18   #5
kriesel
 
Memory error control

Memory errors might occur in the GPU VRAM, in the system RAM, or, if particularly unlucky, in both. Either can affect the GIMPS calculation results of GPU applications. Ideally we would all use highly reliable hardware, with ECC present and turned on.

On the cpu side:
System RAM can be tested with memtest86 (https://www.memtest86.com/) or memtest86+ (http://www.memtest.org/).
Memtest86+ has the capability to prepare a table of bad physical locations.
System RAM is inexpensive, so bad modules can be detected and removed or replaced, and the system retested. Periodic retesting (annually?) is advisable.

For Linux systems, those badram tables from memtest86+ can be input to the Linux badram kernel patch. The patch allocates the bad physical locations and hangs on to them, so they never get allocated to an application we care about, such as a GIMPS computation, whose results could be ruined by memory errors.
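
To make the idea of a bad-location table concrete, here is a minimal Python sketch. It assumes, as I understand the BadRAM convention, that each table entry is an address/mask pair and that an address is considered bad when it agrees with the entry's address on every bit set in the mask. The function names and the example pattern are hypothetical, made up for illustration; this is not the kernel patch's code.

```python
def is_bad(addr, patterns):
    """True if addr matches any (base, mask) pair on all bits set in mask."""
    return any((addr & mask) == (base & mask) for base, mask in patterns)


def bad_pages(patterns, total_bytes, page_size=4096):
    """Page numbers that contain at least one matching address.

    Bits below the page size can take any value inside a page, so only
    the page-level bits of each pattern need to be compared.
    """
    page_bits = ~(page_size - 1)
    pages = []
    for page in range(total_bytes // page_size):
        start = page * page_size
        for base, mask in patterns:
            m = mask & page_bits
            if (start & m) == (base & m):
                pages.append(page)
                break
    return pages


# Hypothetical pattern: the mask ignores bit 13, so it covers two pages.
example = [(0x1000, 0xFFFFD000)]
print(bad_pages(example, 0x8000))  # pages to keep away from applications
```

A reservation scheme along these lines would then mark the listed pages as permanently in use.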

For Windows systems, there is no equivalent user-applicable patch available, to my knowledge. For at least some versions, there's a built-in alternative described in lots of detail at https://superuser.com/questions/4200...ive-ram#490522. Note the caution about possibly causing a boot failure if done incorrectly. This should be a temporary workaround while replacement RAM is on order.

For other OSes, there may be no alternative to RAM replacement or removal.

On the GPU side:
NVIDIA GPU memory can be tested with the -memtest option of CUDALucas.
AMD or NVIDIA GPU memory can be tested with gpumemtest: https://sourceforge.net/projects/cudagpumemtest/
Also https://www.raymond.cc/blog/having-p...st-its-memory/
Intel IGPs use system ram so that gets tested on the system side.
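
For readers unfamiliar with what such testers do, the basic write-then-read-back idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it exercises an ordinary Python buffer rather than physical RAM or VRAM, it is not how memtest86+, CUDALucas or gpumemtest are implemented, and the StuckBitBuffer class is a hypothetical stand-in for a faulty memory cell.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all zeros, all ones, alternating bits


def pattern_test(buf, patterns=PATTERNS):
    """Write each pattern to every byte, read it back, return bad offsets."""
    bad = set()
    for p in patterns:
        for i in range(len(buf)):
            buf[i] = p
        for i in range(len(buf)):
            if buf[i] != p:
                bad.add(i)
    return sorted(bad)


class StuckBitBuffer(bytearray):
    """Hypothetical faulty memory: bit 0 at one offset is stuck at 1."""

    stuck_offset = 3

    def __setitem__(self, i, value):
        if i == self.stuck_offset:
            value |= 0x01  # the stuck bit corrupts whatever is written
        super().__setitem__(i, value)


print(pattern_test(bytearray(8)))       # healthy buffer
print(pattern_test(StuckBitBuffer(8)))  # buffer with the simulated fault
```

Real testers use many more patterns and access orders, since some faults only show up under particular neighbouring data or timing.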

ECC is often not available on GPUs, and where it is present and enabled it reduces performance. (Only high-end, professional-grade card models include ECC in their design.)

Speculatively:
The GPU memory may or may not be subject to the virtual memory management of the host OS. It may be possible to develop code to do bad-GPU-memory lockout at the application level, or at the driver level. Whether that would fragment GPU memory enough to cause problems remains to be determined.
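
The fragmentation question can at least be explored on paper. The following Python sketch is purely a simulation under my own assumptions: memory is modelled as a flat run of equal-sized pages, bad pages are simply skipped, and the function names are hypothetical. It does not touch any real GPU or driver.

```python
def free_extents(total_pages, bad_pages):
    """Return (start, length) runs of usable pages once bad ones are locked out."""
    extents = []
    start = 0
    for bad in sorted(set(bad_pages)):
        if not 0 <= bad < total_pages:
            continue  # ignore entries outside the modelled memory
        if bad > start:
            extents.append((start, bad - start))
        start = bad + 1
    if start < total_pages:
        extents.append((start, total_pages - start))
    return extents


def largest_allocation(extents):
    """Largest contiguous allocation, in pages, that still fits."""
    return max((length for _, length in extents), default=0)


extents = free_extents(16, [5, 12])
print(extents)                      # usable runs around the bad pages
print(largest_allocation(extents))  # biggest single buffer still possible
```

In this toy case two bad pages out of 16 cost only two pages of capacity, but the largest contiguous buffer shrinks from 16 pages to 6. That is the kind of effect that could matter for applications wanting one large allocation.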


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-07-16 at 18:45