mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mtsieve enhancements (https://www.mersenneforum.org/showthread.php?t=25486)

rogue 2020-08-02 18:44

[QUOTE=Happy5214;552329]I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?[/QUOTE]

Sorry about that. Too many things swimming in my brain.

henryzz 2020-08-03 13:50

Something worthwhile for mtsieve would be C versions of the asm functions so that it is portable to ARM CPUs. ARM asm versions would also be useful. One day I might do this, although finding the time is an issue.

rogue 2020-08-03 14:03

[QUOTE=henryzz;552423]Something worthwhile for mtsieve would be C versions of the asm functions so that it is portable to ARM CPUs. ARM asm versions would also be useful. One day I might do this, although finding the time is an issue.[/QUOTE]

Routines in C should be easy to write, as most of the asm routines are variants of powmod or mulmod. Those are fairly easy to code in C, but much harder to optimize.

It is my desire to buy an ARM based MacBook in the future and to write the asm routines for it at that time. Of course others are welcome to do that as well.

The downside of ARM is that the choice of programs to execute PRP tests is very limited.
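As an illustration of the point about C versions, here is a minimal sketch of a portable mulmod and a left-to-right powmod. It is not taken from mtsieve: the names `mulmod_c` and `powmod_c` are hypothetical, and it assumes a compiler that provides the `unsigned __int128` extension (GCC and Clang do on 64-bit targets). It aims at correctness, not speed.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod p using a 128-bit intermediate product.
   unsigned __int128 is a GCC/Clang extension, not standard C. */
static uint64_t mulmod_c(uint64_t a, uint64_t b, uint64_t p)
{
    assert(p != 0);
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* (base ^ exp) mod p by left-to-right binary exponentiation:
   scan the exponent from its top bit down, squaring at every step
   and multiplying by the base whenever the current bit is set. */
static uint64_t powmod_c(uint64_t base, uint64_t exp, uint64_t p)
{
    assert(p != 0);
    uint64_t result = 1 % p;   /* handles p == 1 */
    base %= p;
    for (int bit = 63; bit >= 0; bit--) {
        result = mulmod_c(result, result, p);
        if ((exp >> bit) & 1)
            result = mulmod_c(result, base, p);
    }
    return result;
}
```

The 128-by-64-bit `%` is typically compiled to a library call, which is one reason a straightforward C version like this is easy to write but hard to make competitive with hand-tuned asm.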

Happy5214 2020-08-03 14:37

[QUOTE=rogue;552426]Routines in C should be easy to write, as most of the asm routines are variants of powmod or mulmod. Those are fairly easy to code in C, but much harder to optimize.

It is my desire to buy an ARM based MacBook in the future and to write the asm routines for it at that time. Of course others are welcome to do that as well.

The downside of ARM is that the choice of programs to execute PRP tests is very limited.[/QUOTE]
I'm planning to buy an ODROID N2+ in the coming days for the express purpose of learning and testing ARM assembly (and as a replacement for my old RPi 2 B+ that hasn't worked in years). I'll let you know when it arrives if I feel up to porting the ASM routines.

rogue 2020-08-03 14:59

[QUOTE=Happy5214;552429]I'm planning to buy an ODROID N2+ in the coming days for the express purpose of learning and testing ARM assembly (and as a replacement for my old RPi 2 B+ that hasn't worked in years). I'll let you know when it arrives if I feel up to porting the ASM routines.[/QUOTE]

Cool. I suggest that you start with the fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. Next up would be the 4x version of an fpu routine, although I do not know what gains you can get on ARM by doing more than one mulmod concurrently, and I don't know how many is optimal. I suspect that ARM does not have an 80-bit FPU, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions such as SSE or AVX on x86. You will notice that Worker.h has some built-in checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.
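For readers unfamiliar with the floating-point approach, the following is a rough C sketch of a mulmod that uses a 64-bit double in place of the 80-bit x87 format, under the p < 2^52 limit mentioned above. It shows the general technique only, not the mtsieve routine: the name `fp_mulmod` and its signature are hypothetical, and the caller is assumed to pass `inv_p = 1.0 / (double)p`, computed once per prime.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod p for a, b < p < 2^52, using a double-precision
   estimate of the quotient. inv_p must be 1.0 / (double)p. */
static uint64_t fp_mulmod(uint64_t a, uint64_t b, uint64_t p, double inv_p)
{
    assert(p > 1 && p < (1ULL << 52) && a < p && b < p);

    /* Estimate floor(a*b/p). Rounding in the two multiplications and
       in inv_p means the estimate can miss by a small amount. */
    uint64_t q = (uint64_t)((double)a * (double)b * inv_p);

    /* a*b and q*p both wrap modulo 2^64, but their true difference is
       only a small multiple of p, so the wrapped difference is exact
       when read as a signed value (assumes two's complement). */
    int64_t r = (int64_t)(a * b - q * p);

    /* Pull the remainder into [0, p). */
    while (r < 0)
        r += (int64_t)p;
    while (r >= (int64_t)p)
        r -= (int64_t)p;
    return (uint64_t)r;
}
```

The correction loops run at most a couple of times; a tuned version would usually tighten the bound on p so that a single conditional correction is enough.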

henryzz 2020-08-03 15:00

The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.
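For context, a minimal C sketch of 64-bit Montgomery multiplication follows. It is a generic textbook formulation, not code from mtsieve; all function names are hypothetical. It assumes an odd modulus p < 2^63 and a compiler with the `unsigned __int128` extension.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;

/* Returns -p^(-1) mod 2^64 for odd p, via Newton iteration. */
static uint64_t mont_neg_inv(uint64_t p)
{
    assert(p & 1);
    uint64_t inv = p;             /* p*p == 1 (mod 8): 3 correct bits */
    for (int i = 0; i < 5; i++)   /* each pass doubles the correct bits */
        inv *= 2 - p * inv;
    return (uint64_t)0 - inv;
}

/* REDC: returns t * 2^(-64) mod p for t < p * 2^64, with p < 2^63. */
static uint64_t mont_reduce(u128 t, uint64_t p, uint64_t np)
{
    assert(p < (1ULL << 63));
    uint64_t m = (uint64_t)t * np;             /* makes t + m*p divisible by 2^64 */
    uint64_t u = (uint64_t)((t + (u128)m * p) >> 64);
    return u >= p ? u - p : u;                 /* u < 2p, one subtraction suffices */
}

/* Product of two values already in Montgomery form (both < p). */
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t np)
{
    return mont_reduce((u128)a * b, p, np);
}

/* Convert a to Montgomery form: a * 2^64 mod p. */
static uint64_t to_mont(uint64_t a, uint64_t p)
{
    return (uint64_t)(((u128)a << 64) % p);
}

/* Convert back from Montgomery form. */
static uint64_t from_mont(uint64_t a, uint64_t p, uint64_t np)
{
    return mont_reduce((u128)a, p, np);
}
```

The conversions in and out cost a division or an extra reduction each, so the approach pays off mainly when many multiplications share one modulus, as in a powmod.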

Happy5214 2020-08-03 15:37

[QUOTE=rogue;552431]Cool. I suggest that you start with the fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. Next up would be the 4x version of an fpu routine, although I do not know what gains you can get on ARM by doing more than one mulmod concurrently, and I don't know how many is optimal. I suspect that ARM does not have an 80-bit FPU, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions such as SSE or AVX on x86. You will notice that Worker.h has some built-in checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.[/QUOTE]

Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is a defined instruction set extension for larger vectors called Scalable Vector Extension (SVE), which provides an interface for vectors from 128-bit to 2048-bit, with the hardware register size being set at any 128-bit interval in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU as of ~2018 (phones and SOCs included), so it's probably not worth coding at this point.

[QUOTE=henryzz;552432]The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.[/QUOTE]

The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.

henryzz 2020-08-03 16:09

[QUOTE=Happy5214;552438]The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.[/QUOTE]

I checked an old version of the source, not a recent version.

rogue 2020-08-03 23:23

[QUOTE=Happy5214;552438]Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is a defined instruction set extension for larger vectors called Scalable Vector Extension (SVE), which provides an interface for vectors from 128-bit to 2048-bit, with the hardware register size being set at any 128-bit interval in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU as of ~2018 (phones and SOCs included), so it's probably not worth coding at this point.[/QUOTE]

I have to believe that the ARM chips Apple intends to use in the new MacBooks will support SVE, but I haven't found anything that explicitly states that. It is possible that they will reserve chips with that capability for their higher-end offerings when those are switched over.

Happy5214 2020-08-12 16:59

1 Attachment(s)
The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/[I]p[/I].

rogue 2020-08-12 17:44

[QUOTE=Happy5214;553447]The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/[I]p[/I].[/QUOTE]

The advantage of the fpu_push() is that you need to compute 1/p only once. Then you multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU (pipeline). So if the fmul is waiting on the first ucvtf even when there are two or three more ucvtfs between that first ucvtf and the fmul, then this probably isn't costing you anything in performance.

What's great is that not having an FPU "stack" makes coding for the FPU simpler even if other benefits of pipelining are not available.

To get mtsieve to build one vs the other will require the sources to be placed in a new folder and a modified makefile. It will also require source changes to not compile AVX logic in the C++ source when compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.
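A rough sketch of what such a guard could look like is below. The `SKETCH_*` macro names and the `sketch_code_path` function are hypothetical and are not the names mtsieve uses; only `__aarch64__`, `__arm__`, and `__AVX__` are real predefined macros, set by GCC and Clang for the respective targets.

```c
/* Pick a code path at compile time from the compiler's predefined
   target macros. The SKETCH_* names are illustrative only. */
#if defined(__aarch64__) || defined(__arm__)
  #define SKETCH_ARCH_NAME "arm"
  #define SKETCH_USE_AVX 0      /* never compile AVX logic on ARM */
#elif defined(__AVX__)
  #define SKETCH_ARCH_NAME "x86-avx"
  #define SKETCH_USE_AVX 1
#else
  #define SKETCH_ARCH_NAME "generic"
  #define SKETCH_USE_AVX 0
#endif

/* Reports which path was compiled in; a real build would call the
   AVX or scalar worker routine here instead of returning a label. */
static const char *sketch_code_path(void)
{
#if SKETCH_USE_AVX
    return "avx";
#else
    return "scalar";
#endif
}
```

Note that `__AVX__` is only defined when the compiler is told to target AVX (for example with `-mavx`), so a build that detects AVX at run time would need a different check on x86.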


All times are UTC.
