2021-06-05, 06:21   #10
axn

Jun 2003

11²·43 Posts

Quote:
Originally Posted by Prime95
It is on my list of things to look into for prime95. The one problem is the optimization takes a lot of memory. Thus, for many users there may be little benefit.

While the optimal memory is 2^13 temps for the current wavefront, we can make do with much less and still gain a lot (compared to running P-1 and PRP separately). Illustrative numbers:

The current wavefront is around 110m, which uses a 6M FFT, so each temp takes 48MB. Given 1GB of memory, we can get 1024/48 = 21 temps. Let's assume we're targeting B1=1.2m, which is about 1.73m bits of straight P-1.
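The arithmetic above can be checked with a short sketch. The function names are made up for illustration, and the 8 bytes per FFT element is an assumption that reproduces the 48MB per temp figure. The bit count is taken as log2 of the product of all maximal prime powers <= B1, which is the usual stage 1 exponent.

```python
import math

def temps_that_fit(memory_mb, fft_len_m, bytes_per_element=8):
    """Number of full-size temps that fit in memory_mb.

    Assumes one temp is fft_len_m * 2^20 elements of bytes_per_element each,
    so a 6M FFT with 8-byte elements is 48 MB per temp.
    """
    temp_mb = fft_len_m * bytes_per_element
    return memory_mb // temp_mb

def stage1_exponent_bits(b1):
    """Approximate bit length of the P-1 stage 1 exponent for bound b1.

    Sums log2(p^k) over every prime p <= b1, with p^k the largest power <= b1.
    """
    # Plain sieve of Eratosthenes up to b1.
    sieve = bytearray([1]) * (b1 + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(b1) + 1):
        if sieve[p]:
            sieve[p * p::p] = bytearray(len(range(p * p, b1 + 1, p)))
    bits = 0.0
    for p in range(2, b1 + 1):
        if sieve[p]:
            pk = p
            while pk * p <= b1:
                pk *= p
            bits += math.log2(pk)
    return bits

if __name__ == "__main__":
    print(temps_that_fit(1024, 6))                  # 21
    print(round(stage1_exponent_bits(1_200_000)))   # roughly 1.73 million
```

The bit count comes out close to B1/ln(2), i.e. about 1.44 bits per unit of B1.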

With 16 temps (largest power of 2 < 21), we can do the P-1 stage 1 with an additional ~290k multiplies. However, we aren't limited to a power-of-two number of temps (though that is the easiest to conceptualize). If we utilize all 21 temps (handling all 5-bit patterns and a few 6-bit patterns), this reduces the effort to ~275k multiplies. Compare this with the optimal 2^13 temps, which gets this done in ~132k multiplies. But also compare this with straight P-1, which gets this done in 1.73m multiplies!!
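For anyone who wants to see where the multiply counts come from, here is a toy sketch of one way to read the scheme. This is my reconstruction, not prime95 code: the PRP test already produces 3^(2^i) at every iteration, so stage 1 only has to multiply the right ones into temps, one temp per odd window value, and combine the temps at the end. The function name and the small modulus are purely illustrative.

```python
def stage1_via_prp_squarings(base, exponent, modulus, window_bits):
    """Return (base**exponent % modulus, extra_muls).

    The squarings are the ones a PRP test does anyway, so only the
    multiplies into the temps and the final combine count as extra.
    """
    n_temps = 1 << (window_bits - 1)     # one temp per odd window value
    temps = [1] * n_temps                # temps[j] collects window value 2*j + 1
    mask = (1 << window_bits) - 1
    s = base % modulus                   # s == base**(2**i) % modulus
    extra_muls = 0
    next_free = 0                        # first bit not covered by a window
    for i in range(exponent.bit_length()):
        if i >= next_free and (exponent >> i) & 1:
            w = (exponent >> i) & mask   # odd, because bit i is set
            temps[w >> 1] = temps[w >> 1] * s % modulus
            extra_muls += 1
            next_free = i + window_bits
        s = s * s % modulus              # the PRP squaring, not counted

    # Combine: result = product of temps[j] ** (2*j + 1), in about 2 muls per temp.
    acc = 1                              # running product of temps[j] for j >= 1
    res = 1                              # becomes product of temps[j] ** j
    for j in range(n_temps - 1, 0, -1):
        acc = acc * temps[j] % modulus
        res = res * acc % modulus
        extra_muls += 2
    result = res * res % modulus * acc % modulus * temps[0] % modulus
    extra_muls += 3
    return result, extra_muls

if __name__ == "__main__":
    n = 2**127 - 1
    e = 2**5 * 3**3 * 5**2 * 7 * 11 * 13     # small stand-in for the B1 exponent
    value, muls = stage1_via_prp_squarings(3, e, n, window_bits=3)
    print(value == pow(3, e, n), muls)
```

As a rough model, a window starts about every window_bits + 1 bits, so the extra cost is about bits/(window_bits + 1) + 2*temps. For 16 temps (5-bit windows) that is 1.73m/6, about 288k. For 2^13 temps (14-bit windows) it is 1.73m/15 + 2*8192, about 132k. Both agree with the figures quoted above.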

Also, there is a downside to using a large number of temps. All the temps are part of your state! So if you need to write a checkpoint, you will need to write all of them to disk. In the optimal case, that is ~110GB of IO per checkpoint (2^13 residues of about 110m bits each). Obviously, this is not good. You could reduce it by accumulating the temps into a single result and writing just that out, but that would mean about 2*temps muls before each checkpoint. Here also, having fewer temps is helpful.
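A back-of-the-envelope sketch of the checkpoint cost, assuming each temp is written as a packed residue of exponent/8 bytes (that assumption is what reproduces the ~110GB figure; an actual save file format may differ). The function names are illustrative only.

```python
def checkpoint_bytes(temps, exponent_bits):
    """Bytes written if every temp is saved as a packed residue."""
    residue_bytes = exponent_bits // 8
    return temps * residue_bytes

def accumulate_muls(temps):
    """Approximate multiplies needed to fold all temps into one result."""
    return 2 * temps

if __name__ == "__main__":
    wavefront = 110_000_000
    for temps in (21, 2**13):
        gb = checkpoint_bytes(temps, wavefront) / 1e9
        print(temps, f"{gb:.2f} GB", accumulate_muls(temps), "muls to accumulate")
```

With 21 temps the full state is under 0.3GB and folding it costs only about 42 muls, so either checkpoint strategy is cheap.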

In short, even with less memory there is still a major gain, and the limit might be a blessing in disguise.