2020-08-03, 15:37   #40
Happy5214
Nov 2008
The Alamo City

389 Posts

Originally Posted by rogue
Cool. I suggest that you start with the fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. Next up would be the 4x version of an fpu routine, although I do not know what gains you can get on ARM by doing more than one mulmod concurrently, and I don't know how many is optimal. I suspect that ARM does not have an 80-bit FPU, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions such as SSE or AVX on x86. You will notice that Worker.h has some built-in checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.
Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is also a defined instruction set extension for larger vectors called the Scalable Vector Extension (SVE), which provides an interface for vectors from 128 to 2048 bits, with the hardware register size being any multiple of 128 bits in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU (phones and SoCs included), so it's probably not worth coding at this point.
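
To make the p < 2^52 limit concrete, here is a rough sketch of what a mulmod built on plain 64-bit doubles could look like in portable C. This is not the mtsieve fpu_mulmod routine (that one is x87 assembly); the name mulmod_double and the precomputed p_inv argument are made up for illustration. It assumes a, b < p < 2^52.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the mtsieve routine: returns (a * b) mod p
 * using only IEEE double arithmetic. Requires a, b < p < 2^52.
 * p_inv is 1.0 / p, precomputed once per prime by the caller. */
static uint64_t mulmod_double(uint64_t a, uint64_t b, uint64_t p, double p_inv)
{
    assert(p < (UINT64_C(1) << 52) && a < p && b < p);

    /* Estimate the quotient floor(a*b/p) in floating point. Rounding can
     * leave the estimate off by a small integer in either direction. */
    uint64_t q = (uint64_t)((double)a * (double)b * p_inv);

    /* a*b and q*p both wrap modulo 2^64, but their true difference is
     * only a few multiples of p, so the wrapped difference is exact. */
    int64_t r = (int64_t)(a * b - q * p);

    /* Fix up the small quotient error. */
    while (r < 0)
        r += (int64_t)p;
    while (r >= (int64_t)p)
        r -= (int64_t)p;
    return (uint64_t)r;
}
```

The 53-bit mantissa is what caps p here: once the quotient no longer fits in the mantissa with room to spare, the estimate can be off by more than a couple of multiples of p and the cheap fix-up stops being reliable.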

Originally Posted by henryzz
The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.
The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.
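
For anyone unfamiliar with the technique, here is a minimal sketch of 64-bit Montgomery multiplication in C. It is not taken from the x86_asm_ext routines; all function names are hypothetical, and it assumes an odd modulus p < 2^63 and a compiler with the unsigned __int128 extension (GCC and Clang provide it on both x86-64 and AArch64).

```c
#include <stdint.h>

typedef unsigned __int128 u128; /* GCC/Clang extension, assumed available */

/* Returns -p^(-1) mod 2^64 for odd p, using Newton iteration.
 * inv = p is correct to 3 bits because p*p == 1 (mod 8); each step
 * doubles the number of correct bits: 3, 6, 12, 24, 48, 96. */
static uint64_t mont_neg_inverse(uint64_t p)
{
    uint64_t inv = p;
    for (int i = 0; i < 5; i++)
        inv *= 2 - p * inv;
    return (uint64_t)0 - inv;
}

/* Montgomery product: returns a * b * 2^(-64) mod p.
 * Requires a, b < p, p odd, p < 2^63, np = mont_neg_inverse(p). */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t np)
{
    u128 t = (u128)a * b;
    uint64_t m = (uint64_t)t * np;  /* makes t + m*p divisible by 2^64 */
    uint64_t u = (uint64_t)((t + (u128)m * p) >> 64);
    return u >= p ? u - p : u;      /* u < 2p, so one subtraction suffices */
}

/* Convenience wrapper: plain (a * b) mod p via the Montgomery domain.
 * A sieve would convert once and stay in the domain across a powmod. */
static uint64_t mont_mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    uint64_t np = mont_neg_inverse(p);
    uint64_t am = (uint64_t)(((u128)a << 64) % p); /* a * 2^64 mod p */
    uint64_t bm = (uint64_t)(((u128)b << 64) % p); /* b * 2^64 mod p */
    uint64_t cm = mont_mul(am, bm, p, np);         /* a*b * 2^64 mod p */
    return mont_mul(cm, 1, p, np);                 /* leave the domain */
}
```

The conversions in and out of the Montgomery domain each cost a real reduction, which is why this pays off mainly for powmods, where many multiplications share one conversion.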

Last fiddled with by Happy5214 on 2020-08-03 at 15:40