mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2020-08-14, 11:41   #45
Happy5214
 
Happy5214's Avatar
 
"Alexander"
Nov 2008
The Alamo City

5648 Posts
Default

Quote:
Originally Posted by rogue View Post
The advantage of the fpu_push() is that you only need to compute 1/p only once. Then you multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU (pipeline). So if the fmul is waiting for the first ucvtf even if you have two or three more ucvtfs between the first ucvtf and the fmul, then this probably isn't costing you anything in performance.

What's great is that not having an FPU "stack" makes coding for the FPU simpler even if other benefits of pipelining are not available.

To get mtsieve to build one vs the other will require the sources to be placed in a new folder and a modified makefile. It will also require source changes to not compile AVX logic in the C++ source when compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.
I think you're right. On the Cortex A55 that the ODROID C4 uses, FMUL has a latency of 4 cycles and a throughput of 2 instructions/cycle, while a double-precision FDIV requires 22 cycles and a throughput of 1/19. I'll revert to my original design and stuff it in d8, which is apparently callee-saved and thus should be saved across calls.

Last fiddled with by Happy5214 on 2020-08-14 at 11:43
Happy5214 is offline   Reply With Quote
Old 2020-08-19, 15:26   #46
Happy5214
 
Happy5214's Avatar
 
"Alexander"
Nov 2008
The Alamo City

22·3·31 Posts
Default

I finally got my hands on that ODROID C4. It arrived Sunday (a day early). I tried to boot it Monday, only to find that the spare wireless keyboard I had allocated to it no longer worked. Today, I managed to find a spare Ethernet cable and plug it into our router, and I was able to log in through ssh. After several first-timer coding errors, I finally have a working version of fpu_mulmod. I've attached a tarball with the new header, a remake of the push functions (which just copies the appropriate value to register d8), the two aforementioned version of fpu_mulmod (one using the 1/p value in d8 and the other using a simple fdiv by p every time), and a simple (grossly inefficient) Fermat test with a Mersenne prime using fpu_mulmod (should result in 1), along with a simple Makefile to build them.

The normal mulmod (with 1/p calculated once and stored in d8) runs in 1.52s on my ODROID, while the divp version (with the fdiv by p within mulmod on each iteration) takes 2.3s. Interestingly, the x87 version takes 1.43s on my Core 2 Quad, just showing how much progress processor technology has made.
Attached Files
File Type: gz fpu_mulmod.tar.gz (1.7 KB, 7 views)
Happy5214 is offline   Reply With Quote
Old 2020-08-19, 18:26   #47
Happy5214
 
Happy5214's Avatar
 
"Alexander"
Nov 2008
The Alamo City

17416 Posts
Default

Quote:
Originally Posted by Happy5214 View Post
[...], and a simple (grossly inefficient) Fermat test with a Mersenne prime exponent using fpu_mulmod (should result in 1), along with a simple Makefile to build them.
Needed correction added. I was too impatient to test M31, and smaller ones were too quick.
Happy5214 is offline   Reply With Quote
Old 2020-08-19, 20:06   #48
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

589910 Posts
Default

Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.
rogue is offline   Reply With Quote
Old 2020-08-20, 10:24   #49
Happy5214
 
Happy5214's Avatar
 
"Alexander"
Nov 2008
The Alamo City

1011101002 Posts
Default

Quote:
Originally Posted by rogue View Post
Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.
Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can.

I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?
Happy5214 is offline   Reply With Quote
Old 2020-08-20, 12:00   #50
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

10111000010112 Posts
Default

Quote:
Originally Posted by Happy5214 View Post
Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can.

I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?
For now that is okay. You can change add an asm function to store 1/p for 4 p and I could migrate the other calls to it where 4 values are put onto the stack in successive calls.
rogue is offline   Reply With Quote
Old 2020-08-22, 02:12   #51
Citrix
 
Citrix's Avatar
 
Jun 2003

157010 Posts
Default Calculating best Q for srsieve2.exe

Quote:
Originally Posted by rogue View Post
Do you know some sequences where Q > 720? It might be possible for srsieve2 to compute it, unless it would be extremely large.
Moving this from the other thread:

A faster automated way of finding the bestQ would be

1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left
For N1, N2, N3, N4, ... left in the file
Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...)
So all N can be represented as
N1=k1*BASE_MULTIPLE+c
N2=k2*BASE_MULTIPLE+c

2) Then LIMIT_BASE should 720*7*11=55440
NDIVISORS=LIMIT_BASE

3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1;
OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE)

4) Use the current code of srsieve2 to find bestq

5) Q=bestq*BASE_MULTIPLE
Citrix is offline   Reply With Quote
Old 2020-08-22, 02:55   #52
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

17×347 Posts
Default

Quote:
Originally Posted by Citrix View Post
Moving this from the other thread:

A faster automated way of finding the bestQ would be

1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left
For N1, N2, N3, N4, ... left in the file
Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...)
So all N can be represented as
N1=k1*BASE_MULTIPLE+c
N2=k2*BASE_MULTIPLE+c

2) Then LIMIT_BASE should 720*7*11=55440
NDIVISORS=LIMIT_BASE

3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1;
OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE)

4) Use the current code of srsieve2 to find bestq

5) Q=bestq*BASE_MULTIPLE
You are welcome to play around with the code, build and test. Speed really isn't an issue since it only has to execute that code once in a great while, but if a better Q can be found (thus allowing for faster sieving), then that would be great.
rogue is offline   Reply With Quote
Old 2020-08-22, 03:48   #53
Citrix
 
Citrix's Avatar
 
Jun 2003

30428 Posts
Default

I modified sr1sieve to read any Q from command line and got significant benefit for low weight series. I am trying to combine the benefit with your faster FPU/AVX code.

The problem modifying srsieve2 code is that it supports multiple k version and calculating the Q for that is very complicated.
Citrix is offline   Reply With Quote
Old 2020-08-25, 15:55   #54
Happy5214
 
Happy5214's Avatar
 
"Alexander"
Nov 2008
The Alamo City

22×3×31 Posts
Default

Quote:
Originally Posted by rogue View Post
One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.
Quote:
Originally Posted by Happy5214 View Post
I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?
The regular x86 fpu_powmod function does use right-to-left exponentiation, though the others use left-to-right. Should I rewrite the x86 version while I'm writing the ARM version?

Speaking of, the fpu_mulmod ARM ports have been finished and are attached. I managed to use the minimum amount of memory accesses (just transferring the array values to/from registers).

Here are the timings for my test programs on ARM Cortex-A55 (which do a poor man's Fermat test on the largest known Mersenne prime exponent):
Code:
time ./mulmod
3^82589932 mod 82589933 = 1

real    0m1.043s
user    0m1.040s
sys     0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real    0m0.870s
user    0m0.864s
sys     0m0.004s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real    0m1.737s
user    0m1.736s
sys     0m0.000s

time ./mulmod_4a_4b_4p 
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real    0m2.300s
user    0m2.296s
sys     0m0.000s
This compares to the x86 FPU test of the same calculation:
Code:
time ./mulmod
3^82589932 mod 82589933 = 1

real    0m1.260s
user    0m1.232s
sys     0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real    0m1.110s
user    0m1.074s
sys     0m0.005s

time ./mulmod_iter_4a 
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real    0m1.284s
user    0m1.252s
sys     0m0.004s

time ./mulmod_4a_4b_4p 
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real    0m1.543s
user    0m1.487s
sys     0m0.005s
Notably, the functions with a single mulmod are faster on ARM than on x86, while the multiple-mulmod functions scale much better on x86. The poor performance of multiple mulmods on ARM could be due to pipelining or slow memory accesses (the single mulmod functions are entirely contained in registers in my ARM implementation).
Attached Files
File Type: gz fpu_mulmod.tar.gz (2.0 KB, 4 views)
Happy5214 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mtsieve rogue Software 447 2020-09-19 05:50
srsieve/sr2sieve enhancements rogue Software 280 2020-09-17 16:10
LLRnet enhancements kar_bon No Prime Left Behind 10 2008-03-28 11:21
TODO list and suggestions/comments/enhancements Greenbank Octoproth Search 2 2006-12-03 17:28
Suggestions for future enhancements Reboot It Software 16 2003-10-17 01:31

All times are UTC. The time now is 09:07.

Sun Sep 20 09:07:33 UTC 2020 up 10 days, 6:18, 0 users, load averages: 1.23, 1.39, 1.40

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.