mersenneforum.org  

Old 2020-08-02, 18:44   #34
rogue
 
"Mark"
Apr 2003
Between here and the

17×347 Posts

Quote:
Originally Posted by Happy5214 View Post
I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?
Sorry about that. Too many things swimming in my brain.
Old 2020-08-03, 13:50   #35
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2·2,861 Posts

Something worthwhile for mtsieve would be C versions of the asm functions, so that it is portable to ARM CPUs. ARM asm versions would also be useful. One day I might do this, although finding the time is an issue.
Old 2020-08-03, 14:03   #36
rogue
 
"Mark"
Apr 2003
Between here and the

17·347 Posts

Quote:
Originally Posted by henryzz View Post
Something worthwhile for mtsieve would be C versions of the asm functions, so that it is portable to ARM CPUs. ARM asm versions would also be useful. One day I might do this, although finding the time is an issue.
Routines in C should be easy to write, as most of the asm routines are variants of powmod or mulmod. Those are fairly easy to code in C, but much harder to optimize.
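
As a rough illustration of how short the unoptimized C versions can be, here is a minimal sketch. The names c_mulmod and c_powmod are made up for this example and are not mtsieve functions. It assumes p < 2^63 so the doubling step cannot overflow, and it uses a plain double-and-add multiply, which is portable but far slower than the asm routines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not an mtsieve routine.
 * Returns (a * b) mod p by double-and-add. Requires a, b < p < 2^63. */
uint64_t c_mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    uint64_t r = 0;

    assert(p > 1 && p < (UINT64_C(1) << 63) && a < p && b < p);
    while (b != 0) {
        if (b & 1) {
            r += a;              /* r, a < p < 2^63, so no overflow */
            if (r >= p) r -= p;
        }
        a += a;                  /* a = 2a mod p */
        if (a >= p) a -= p;
        b >>= 1;
    }
    return r;
}

/* Hypothetical helper, not an mtsieve routine.
 * Left-to-right binary powmod: returns base^n mod p for 1 < p < 2^63. */
uint64_t c_powmod(uint64_t base, uint64_t n, uint64_t p)
{
    uint64_t r = 1;
    int i;

    assert(p > 1 && p < (UINT64_C(1) << 63));
    base %= p;
    for (i = 63; i >= 0; i--) {
        r = c_mulmod(r, r, p);             /* square for every exponent bit */
        if ((n >> i) & 1)
            r = c_mulmod(r, base, p);      /* multiply when the bit is set */
    }
    return r;
}
```

For example, c_powmod(2, 10, 1000000007) gives 1024. The hard part is the optimization: a tuned version would replace the bit-by-bit multiply with a wide multiply or a floating-point quotient estimate.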

It is my desire to buy an ARM based MacBook in the future and to write the asm routines for it at that time. Of course others are welcome to do that as well.

The downside of ARM is that the choice of programs to execute PRP tests is very limited.
Old 2020-08-03, 14:37   #37
Happy5214
 
"Alexander"
Nov 2008
The Alamo City

7×53 Posts

Quote:
Originally Posted by rogue View Post
Routines in C should be easy to write, as most of the asm routines are variants of powmod or mulmod. Those are fairly easy to code in C, but much harder to optimize.

It is my desire to buy an ARM based MacBook in the future and to write the asm routines for it at that time. Of course others are welcome to do that as well.

The downside of ARM is that the choice of programs to execute PRP tests is very limited.
I'm planning to buy an ODROID N2+ in the coming days for the express purpose of learning and testing ARM assembly (and as a replacement for my old RPi 2 B+ that hasn't worked in years). I'll let you know when it arrives if I feel up to porting the ASM routines.
Old 2020-08-03, 14:59   #38
rogue
 
"Mark"
Apr 2003
Between here and the

13413₈ Posts

Quote:
Originally Posted by Happy5214 View Post
I'm planning to buy an ODROID N2+ in the coming days for the express purpose of learning and testing ARM assembly (and as a replacement for my old RPi 2 B+ that hasn't worked in years). I'll let you know when it arrives if I feel up to porting the ASM routines.
Cool. I suggest that you start with the fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. Next up would be the 4x version of an fpu routine, although I do not know what gains you can get on ARM by doing more than one mulmod concurrently, and I don't know how many is optimal. I suspect that ARM does not have an 80-bit FPU, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions like SSE or AVX on x86. You will notice that Worker.h has some built-in checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.
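
For anyone who wants a C reference to test an ARM port against, the floating-point mulmod idea can be sketched roughly as below. This is not the mtsieve fpu_mulmod; the name fp_mulmod_sketch is made up. It assumes IEEE double precision with round-to-nearest, and it deliberately restricts p to below 2^50, which is more conservative than the bound mentioned above, so that the quotient estimate is off by at most one. The caller passes 1/p, computed once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name, not the mtsieve routine.
 * Returns (a * b) mod p using a double-precision estimate of the quotient.
 * Assumes a, b < p < 2^50 and inv_p == 1.0 / (double)p. */
uint64_t fp_mulmod_sketch(uint64_t a, uint64_t b, uint64_t p, double inv_p)
{
    uint64_t q;
    int64_t r;

    assert(p > 1 && p < (UINT64_C(1) << 50) && a < p && b < p);

    /* Estimate q = floor(a * b / p). Rounding may leave it off by one. */
    q = (uint64_t)((double)a * (double)b * inv_p);

    /* a * b overflows 64 bits, but the true remainder is small, so the
     * difference is still exact modulo 2^64. The cast to a signed type
     * relies on two's complement behaviour, as on mainstream compilers. */
    r = (int64_t)(a * b - q * p);

    /* Correct the estimate if q was too large or too small. */
    while (r < 0) r += (int64_t)p;
    while (r >= (int64_t)p) r -= (int64_t)p;
    return (uint64_t)r;
}
```

A caller would compute inv_p = 1.0 / p once per prime and reuse it for every multiplication with that p. Running several independent calls side by side is the idea behind a 4x version, since the multiplies do not depend on each other.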
Old 2020-08-03, 15:00   #39
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2×2,861 Posts

The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.
Old 2020-08-03, 15:37   #40
Happy5214
 
"Alexander"
Nov 2008
The Alamo City

371₁₀ Posts

Quote:
Originally Posted by rogue View Post
Cool. I suggest that you start with the fpu_mulmod function. That will likely be the easiest one to port. Most of the others can be built on top of that in one way or another. Next up would be the 4x version of an fpu routine, although I do not know what gains you can get on ARM by doing more than one mulmod concurrently, and I don't know how many is optimal. I suspect that ARM does not have an 80-bit FPU, so it will be limited to p < 2^52. I also do not know if ARM has any vector instructions like SSE or AVX on x86. You will notice that Worker.h has some built-in checks for AVX compatibility. You will likely need to add something similar to control ARM code paths.
Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is a defined instruction set extension for larger vectors called Scalable Vector Extension (SVE), which provides an interface for vectors from 128-bit to 2048-bit, with the hardware register size being any multiple of 128 bits in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU as of ~2018 (phones and SoCs included), so it's probably not worth coding at this point.

Quote:
Originally Posted by henryzz View Post
The issue will be moving beyond 53 bits on non-x86.
Has Montgomery multiplication been tried in mtsieve? It wouldn't be applicable in all sieves but it may be faster for powmods.
The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.
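
For readers unfamiliar with the technique, here is a minimal C sketch of a Montgomery powmod, which is one way to get past 53 bits without an 80-bit FPU. It is not taken from the x86_asm_ext routines. All names are hypothetical, it assumes the GCC/Clang unsigned __int128 extension, and it requires an odd modulus below 2^63.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension, assumed available */

/* Returns -p^(-1) mod 2^64 for odd p, by Newton iteration. */
static uint64_t neg_inverse64(uint64_t p)
{
    uint64_t inv = p;             /* p * p == 1 (mod 8): 3 bits correct */
    int i;

    for (i = 0; i < 5; i++)       /* correct bits: 3, 6, 12, 24, 48, 96 */
        inv *= 2 - p * inv;
    return 0 - inv;
}

/* REDC: returns t / 2^64 mod p for t < p * 2^64, with p < 2^63. */
static uint64_t redc(u128 t, uint64_t p, uint64_t np)
{
    uint64_t t_lo = (uint64_t)t;
    uint64_t m = t_lo * np;       /* chosen so t + m * p == 0 (mod 2^64) */
    u128 mp = (u128)m * p;
    /* The low words cancel; they produce a carry exactly when t_lo != 0. */
    uint64_t u = (uint64_t)(t >> 64) + (uint64_t)(mp >> 64) + (t_lo != 0);

    return u >= p ? u - p : u;
}

static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t np)
{
    return redc((u128)a * b, p, np);
}

/* Hypothetical name. Returns base^e mod p for odd p with 1 < p < 2^63. */
uint64_t mont_powmod_sketch(uint64_t base, uint64_t e, uint64_t p)
{
    uint64_t np, x, r;

    assert((p & 1) && p > 1 && p < (UINT64_C(1) << 63));
    np = neg_inverse64(p);
    x = (uint64_t)(((u128)(base % p) << 64) % p);   /* base in Montgomery form */
    r = (uint64_t)(((u128)1 << 64) % p);            /* 1 in Montgomery form */
    for (; e != 0; e >>= 1) {
        if (e & 1) r = mont_mul(r, x, p, np);
        x = mont_mul(x, x, p, np);
    }
    return redc(r, p, np);                          /* leave Montgomery form */
}
```

The conversions into and out of Montgomery form cost a division each, which is why the approach pays off mainly for powmods, where many multiplications share one modulus.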

Last fiddled with by Happy5214 on 2020-08-03 at 15:40
Old 2020-08-03, 16:09   #41
henryzz
Just call me Henry
 
"David"
Sep 2007
Cambridge (GMT/BST)

2×2,861 Posts

Quote:
Originally Posted by Happy5214 View Post
The x86_asm_ext folder is filled with Montgomery arithmetic routines inherited from the older sieve programs.
I checked an old version of the source, not a recent version.
Old 2020-08-03, 23:23   #42
rogue
 
"Mark"
Apr 2003
Between here and the

17·347 Posts

Quote:
Originally Posted by Happy5214 View Post
Yeah, no 80-bit floats on ARM. ARM does have NEON, which appears analogous to SSE and is available on all 64-bit ARM processors. There is a defined instruction set extension for larger vectors called Scalable Vector Extension (SVE), which provides an interface for vectors from 128-bit to 2048-bit, with the hardware register size being any multiple of 128 bits in that range. However, it doesn't appear that SVE is currently implemented in any commercially available general-purpose ARM CPU as of ~2018 (phones and SoCs included), so it's probably not worth coding at this point.
I have to believe that the ARM chips Apple intends to use in the new MacBooks will support SVE, but I haven't found anything that explicitly states that. It is possible that they will reserve chips with that capability for their higher-end offerings when those are switched over.
Old 2020-08-12, 16:59   #43
Happy5214
 
"Alexander"
Nov 2008
The Alamo City

7·53 Posts

The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/p.
Attached Files
File Type: txt fpu_mulmod.txt (1,015 Bytes, 7 views)
Old 2020-08-12, 17:44   #44
rogue
 
"Mark"
Apr 2003
Between here and the

17×347 Posts

Quote:
Originally Posted by Happy5214 View Post
The ODROID N2+ model sold out before I could get around to ordering it, so I bought the cheaper C4 instead. It hasn't shipped yet, so I'm still waiting for it. Meanwhile, I made an attempt at porting fpu_mulmod, and I think I came up with something. I've attached it in case anyone wants to test it for me. The ARM FPU registers don't form a stack like the x87 registers do, so I didn't sense a need to pre-load 1/p.
The advantage of fpu_push() is that you need to compute 1/p only once. Then you multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU (pipeline). So if the fmul has to wait for the first ucvtf anyway, even when you have two or three more ucvtfs between the first ucvtf and the fmul, then this probably isn't costing you anything in performance.

What's great is that not having an FPU "stack" makes coding for the FPU simpler even if other benefits of pipelining are not available.

Getting mtsieve to build one or the other will require the sources to be placed in a new folder and a modified makefile. It will also require source changes so that the AVX logic in the C++ source is not compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.
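
A rough sketch of what such a guard could look like, assuming GCC or Clang. Those compilers do not predefine a bare ARM macro, but they do predefine __aarch64__, __arm__, __ARM_NEON and __AVX__ for the target being built. Every SKETCH_ name and the function below are made up for illustration and are not part of mtsieve or Worker.h.

```c
#include <assert.h>

/* All SKETCH_* names are hypothetical, not mtsieve macros. The macros being
 * tested are predefined by GCC and Clang for the compilation target. */
#if defined(__aarch64__) || defined(__arm__)
#define SKETCH_ARM 1
#else
#define SKETCH_ARM 0
#endif

#if !SKETCH_ARM && defined(__AVX__)
#define SKETCH_USE_AVX 1       /* x86 build with AVX enabled */
#else
#define SKETCH_USE_AVX 0
#endif

#if SKETCH_ARM && defined(__ARM_NEON)
#define SKETCH_USE_NEON 1      /* ARM build with NEON available */
#else
#define SKETCH_USE_NEON 0
#endif

/* Reports which code path this build would take. */
const char *sketch_simd_path(void)
{
    /* The two vector paths are mutually exclusive by construction. */
    assert(!(SKETCH_USE_AVX && SKETCH_USE_NEON));
#if SKETCH_USE_AVX
    return "avx";
#elif SKETCH_USE_NEON
    return "neon";
#else
    return "scalar";
#endif
}
```

The same pattern would wrap the AVX-specific blocks in the C++ source, with the scalar branch acting as the fallback on any platform that has neither extension.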

All times are UTC. The time now is 13:24.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.