Hi, new to the forum and interested in hardware implementation/acceleration as applied to prime searching. I am first trying to quantify the scale of the problem for testing exponents >100M. I made some naive calculations but these don't seem to accord with the incredible speed at which I see Prime95 iterating for these candidates.

My back-of-the-envelope calculation just for the transform stages goes like this:

- Start with exponent up to 127M for simplicity
- On a 64-bit machine, packing 64 bits per word gives an FFT of length ~2M (I understand the IBDWT is used, meaning the transform length doesn't need to be doubled ahead of the modulo reduction)
- This would require 1M radix-2 butterfly kernels per stage × 21 stages (2M = 2^21) ≈ 21M kernels
- Doubling for the inverse transform gives ≈ 42M kernels
- Each kernel operation requires a twiddle-factor multiplication and two additions, plus data movement, so the minimum I can see it taking is ~10 instruction cycles (a very rough guess)
- This gives ≈ 420M instruction cycles per iteration for the transforms alone, which would take ~110 ms/iter on a single modern core
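To make the arithmetic easy to check, here is the same estimate as a short Python sketch (the 10-cycles-per-kernel figure and the clock speed are my guesses, not measured values):

```python
# Back-of-the-envelope cost of the forward + inverse transforms alone.
# Assumptions (my guesses): 10 cycles per butterfly kernel, 3.8 GHz clock.
import math

exponent = 127_000_000     # candidate exponent, rounded up for simplicity
bits_per_word = 64         # naive packing: one 64-bit machine word per coefficient
fft_length = 2 ** math.ceil(math.log2(exponent / bits_per_word))  # ~2M (2^21)

stages = int(math.log2(fft_length))       # 21 stages for a 2M-point radix-2 FFT
butterflies_per_stage = fft_length // 2   # ~1M butterflies per stage
kernels_total = 2 * stages * butterflies_per_stage  # forward + inverse transform

cycles_per_kernel = 10                    # very rough guess
clock_hz = 3.8e9
ms_per_iter = kernels_total * cycles_per_kernel / clock_hz * 1e3
print(f"FFT length {fft_length:,}, {stages} stages, "
      f"{kernels_total:,} kernels, ~{ms_per_iter:.0f} ms/iter")
```

Running it reproduces the ~2M transform length and lands the transform-only estimate in the 110-120 ms/iter range.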

I then compare this with what I see on a single core of an 11th-gen Core i7 @ 3.8 GHz, which is ~17 ms/iter for the entire operation.
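Putting the measurement in the same units (nominal clock speed from my machine; turbo behaviour ignored), the cycle budget the real code actually fits into is:

```python
# Cycles per iteration implied by the observed timing.
observed_s_per_iter = 17e-3    # measured: ~17 ms/iter on one core
clock_hz = 3.8e9               # nominal clock of the 11th-gen Core i7
cycles_available = observed_s_per_iter * clock_hz
print(f"~{cycles_available / 1e6:.0f}M cycles/iter actually available")
```

So the entire operation completes in roughly 65M cycles, several times below my transform-only estimate.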

The estimate doesn't even account for the pointwise multiplication, twiddle-factor generation, modulo reduction, word-length adjustment, memory latency, etc., so I have clearly missed some major aspect of the fundamentals or of the efficiency improvements at play.

If anyone could point out where the discrepancy lies, it would be greatly appreciated. Also, do Prime95 and similar implementations actually spend the vast majority of their processing time resolving the FFT butterflies?