2020-08-12, 17:44   #44
rogue

Quote:
Originally Posted by Happy5214
The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/p.
The advantage of the fpu_push() is that you need to compute 1/p only once; after that you just multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many instructions the FPU can execute concurrently (pipelining). If the fmul has to wait for the first ucvtf anyway, then having two or three more ucvtfs between the first ucvtf and the fmul probably isn't costing you anything in performance.
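To make the "compute 1/p once, then multiply" idea concrete, here is a rough C++ sketch. It is not the mtsieve fpu_mulmod code and not the attached ARM port; the names PrimeInverse, make_inverse and mulmod are made up for illustration. It also assumes p < 2^50 so that a plain 64-bit double (which is what the ARM FPU gives you, as opposed to the 80-bit x87 format) estimates the quotient to within one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: holds p together with 1/p, computed once per prime.
struct PrimeInverse {
    std::uint64_t p;
    double inv_p;
};

// Do the integer-to-double conversion and the division a single time.
// Assumption of this sketch: 1 < p < 2^50.
inline PrimeInverse make_inverse(std::uint64_t p) {
    assert(p > 1 && p < (1ULL << 50));
    return PrimeInverse{p, 1.0 / static_cast<double>(p)};
}

// Returns (a * b) mod p for a, b < p, using only multiplies by the stored 1/p.
inline std::uint64_t mulmod(std::uint64_t a, std::uint64_t b, const PrimeInverse &pi) {
    assert(a < pi.p && b < pi.p);
    // Estimate floor(a*b/p) in floating point; with p < 2^50 the truncated
    // estimate is off by at most one in either direction.
    const std::uint64_t q = static_cast<std::uint64_t>(
        static_cast<double>(a) * static_cast<double>(b) * pi.inv_p);
    // a*b and q*p each wrap modulo 2^64, but their true difference lies in
    // (-p, 2p), so the wrapped difference read as signed is the exact value.
    std::int64_t r = static_cast<std::int64_t>(a * b - q * pi.p);
    const std::int64_t sp = static_cast<std::int64_t>(pi.p);
    if (r < 0) {
        r += sp;   // quotient estimate was one too large
    } else if (r >= sp) {
        r -= sp;   // quotient estimate was one too small
    }
    return static_cast<std::uint64_t>(r);
}
```

In a sieve loop you would call make_inverse(p) once per prime and then mulmod(a, b, inv) as many times as needed, so the conversion and the divide are paid only once per p.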

What's great is that not having an FPU "stack" makes coding for the FPU simpler even if other benefits of pipelining are not available.

Getting mtsieve to build one version or the other will require placing the ARM sources in a new folder and modifying the makefile. It will also require source changes so that the AVX logic in the C++ source is not compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.
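As a sketch of what such a guard could look like: GCC and Clang predefine __aarch64__ on 64-bit ARM and __arm__ on 32-bit ARM, whereas a bare ARM macro is not predefined and would have to be passed in by the makefile (for example with -DARM). The macro SKETCH_TARGET_ARM and the function selected_mulmod_backend below are hypothetical names, not anything in the mtsieve sources.

```cpp
#include <string>

// Hypothetical project-level switch derived from compiler-predefined macros.
#if defined(__aarch64__) || defined(__arm__)
#define SKETCH_TARGET_ARM 1
#else
#define SKETCH_TARGET_ARM 0
#endif

#if !SKETCH_TARGET_ARM
// AVX-only includes and declarations would go here, so that they are
// never seen by the compiler on ARM builds.
#endif

// Reports which code path this translation unit was compiled for.
inline std::string selected_mulmod_backend() {
#if SKETCH_TARGET_ARM
    return "arm-fpu";
#else
    return "x86";
#endif
}
```

Deriving one project macro in a single header and testing that everywhere keeps the platform check in one place, which would make it easier to change later than scattering compiler-specific macros through the C++ source.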