mersenneforum.org mtsieve enhancements

2020-08-14, 11:41   #45
Happy5214

"Alexander"
Nov 2008
The Alamo City

389 Posts

Quote:
 Originally Posted by rogue The advantage of the fpu_push() is that you only need to compute 1/p once. Then you multiply by 1/p. This would save calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU (pipeline). So if the fmul is waiting for the first ucvtf, even if you have two or three more ucvtfs between the first ucvtf and the fmul, then this probably isn't costing you anything in performance. What's great is that not having an FPU "stack" makes coding for the FPU simpler, even if other benefits of pipelining are not available. To get mtsieve to build one vs the other will require the sources to be placed in a new folder and a modified makefile. It will also require source changes to not compile AVX logic in the C++ source when compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.
I think you're right. On the Cortex-A55 that the ODROID C4 uses, FMUL has a latency of 4 cycles and a throughput of 2 instructions/cycle, while a double-precision FDIV has a latency of 22 cycles and a throughput of 1/19. I'll revert to my original design and stuff it in d8, which is apparently callee-saved and thus should be preserved across calls.
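The tradeoff being discussed can be sketched in plain C (a scalar model of the reciprocal trick with hypothetical names, not the actual mtsieve asm): precompute 1/p once as a double, and each subsequent mulmod costs only multiplies instead of a division.

```c
#include <stdint.h>

/* Sketch of the reciprocal mulmod idea: with inv_p = 1.0/p computed
 * once, each call needs only multiplies, no FDIV. Valid roughly for
 * p below 2^51, where the rounding error in q stays within +/-1. */
uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p, double inv_p)
{
    /* q approximates floor(a*b/p); may be off by one from rounding */
    uint64_t q = (uint64_t)((double)a * (double)b * inv_p);

    /* a*b and q*p each wrap mod 2^64, but their difference is exact
     * because the true remainder is small */
    int64_t r = (int64_t)(a * b - q * p);

    if (r < 0)
        r += (int64_t)p;              /* q was one too large */
    else if (r >= (int64_t)p)
        r -= (int64_t)p;              /* q was one too small */
    return (uint64_t)r;
}
```

This is why the latency numbers matter: the one-time FDIV (or a ucvtf + fdiv pair) is amortized over the whole exponentiation loop, and the loop body is pure FMUL traffic.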

Last fiddled with by Happy5214 on 2020-08-14 at 11:43

2020-08-19, 15:26   #46
Happy5214

"Alexander"
Nov 2008
The Alamo City

389 Posts

I finally got my hands on that ODROID C4. It arrived Sunday (a day early). I tried to boot it Monday, only to find that the spare wireless keyboard I had allocated to it no longer worked. Today, I managed to find a spare Ethernet cable and plug it into our router, and I was able to log in via SSH. After several first-timer coding errors, I finally have a working version of fpu_mulmod. I've attached a tarball with the new header, a remake of the push functions (which just copies the appropriate value to register d8), the two aforementioned versions of fpu_mulmod (one using the 1/p value in d8 and the other using a simple fdiv by p every time), and a simple (grossly inefficient) Fermat test with a Mersenne prime using fpu_mulmod (should result in 1), along with a simple Makefile to build them.

The normal mulmod (with 1/p calculated once and stored in d8) runs in 1.52s on my ODROID, while the divp version (with the fdiv by p within mulmod on each iteration) takes 2.3s. Interestingly, the x87 version takes 1.43s on my Core 2 Quad, just showing how much progress processor technology has made.
Attached Files
 fpu_mulmod.tar.gz (1.7 KB, 21 views)

2020-08-19, 18:26   #47
Happy5214

"Alexander"
Nov 2008
The Alamo City

389 Posts

Quote:
 Originally Posted by Happy5214 [...], and a simple (grossly inefficient) Fermat test with a Mersenne prime exponent using fpu_mulmod (should result in 1), along with a simple Makefile to build them.
Needed correction added. I was too impatient to test M31, and smaller ones were too quick.

2020-08-19, 20:06   #48
rogue

"Mark"
Apr 2003
Between here and the

5,953 Posts

Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.
2020-08-20, 10:24   #49
Happy5214

"Alexander"
Nov 2008
The Alamo City

389 Posts

Quote:
 Originally Posted by rogue Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have pre-computed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.
Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can.

I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?

2020-08-20, 12:00   #50
rogue

"Mark"
Apr 2003
Between here and the

5,953 Posts

Quote:
 Originally Posted by Happy5214 Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can. I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?
For now that is okay. You can add an asm function to store 1/p for 4 values of p, and I could migrate the callers that currently put 4 values onto the stack in successive calls.

2020-08-22, 02:12   #51
Citrix

Jun 2003

3047₈ Posts
Calculating best Q for srsieve2.exe

Quote:
 Originally Posted by rogue Do you know some sequences where Q > 720? It might be possible for srsieve2 to compute it, unless it would be extremely large.
Moving this from the other thread:

A faster automated way of finding the bestQ would be

1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left
For N1, N2, N3, N4, ... left in the file
Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...)
So all N can be represented as
N1=k1*BASE_MULTIPLE+c
N2=k2*BASE_MULTIPLE+c

2) Then LIMIT_BASE should be 720*7*11=55440
NDIVISORS=LIMIT_BASE

3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1;
OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE)

4) Use the current code of srsieve2 to find bestq

5) Q=bestq*BASE_MULTIPLE
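Step 1 above can be sketched in C (a hypothetical helper, assuming the remaining n values are available sorted in an array; not srsieve2 code):

```c
#include <stdint.h>
#include <stddef.h>

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* BASE_MULTIPLE = gcd of the differences of consecutive remaining n,
 * so every remaining term can be written n = k*BASE_MULTIPLE + c. */
uint64_t base_multiple(const uint64_t *n, size_t count)
{
    uint64_t g = 0;
    for (size_t i = 1; i < count; i++)
        g = gcd64(g, n[i] - n[i - 1]);
    return g;
}
```

For example, remaining n of 7, 19, 43, 103 have differences 12, 24, 60, giving BASE_MULTIPLE = 12 and c = 7; the k values 0, 1, 3, 8 are then what get indexed in step 3.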

2020-08-22, 02:55   #52
rogue

"Mark"
Apr 2003
Between here and the

1011101000001₂ Posts

Quote:
 Originally Posted by Citrix Moving this from the other thread: A faster automated way of finding the bestQ would be 1) Set BASE_MULTIPLE to the gcd of differences of all consecutive terms left For N1, N2, N3, N4, ... left in the file Set BASE_MULTIPLE=gcd (N2-N1, N3-N2, N4-N3, ...) So all N can be represented as N1=k1*BASE_MULTIPLE+c N2=k2*BASE_MULTIPLE+c 2) Then LIMIT_BASE should 720*7*11=55440 NDIVISORS=LIMIT_BASE 3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1; OR R[ (n/BASE_MULTIPLE) %LIMIT_BASE] = 1; where k=(n/BASE_MULTIPLE) 4) Use the current code of srsieve2 to find bestq 5) Q=bestq*BASE_MULTIPLE
You are welcome to play around with the code, build and test. Speed really isn't an issue since it only has to execute that code once in a great while, but if a better Q can be found (thus allowing for faster sieving), then that would be great.

2020-08-22, 03:48   #53
Citrix

Jun 2003

627₁₆ Posts

I modified sr1sieve to read any Q from the command line and got a significant benefit for low-weight series. I am trying to combine that benefit with your faster FPU/AVX code. The problem with modifying the srsieve2 code is that it supports multiple k's, and calculating the Q for that is very complicated.
2020-08-25, 15:55   #54
Happy5214

"Alexander"
Nov 2008
The Alamo City

389 Posts

Quote:
 Originally Posted by rogue One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.
Quote:
 Originally Posted by Happy5214 I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?
The regular x86 fpu_powmod function does use right-to-left exponentiation, though the others use left-to-right. Should I rewrite the x86 version while I'm writing the ARM version?
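For reference, left-to-right binary exponentiation scans the exponent bits from the most significant end, squaring at every step and multiplying in the base when a bit is set; one operand of the extra multiply is always the fixed base, which is what makes a precomputed value usable. A C sketch (using __int128 for an exact mulmod and the GCC/Clang __builtin_clzll builtin; purely illustrative, not the FPU code):

```c
#include <stdint.h>

/* Exact mulmod via 128-bit intermediate (stand-in for fpu_mulmod) */
static uint64_t mulmod_exact(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)((unsigned __int128)a * b % p);
}

/* Left-to-right binary exponentiation: square each step, multiply by
 * the base when the current exponent bit is set. */
uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
{
    if (e == 0)
        return 1 % p;

    uint64_t r = b % p;                       /* consume the top bit */
    for (int i = 62 - __builtin_clzll(e); i >= 0; --i) {
        r = mulmod_exact(r, r, p);            /* square */
        if ((e >> i) & 1)
            r = mulmod_exact(r, b, p);        /* multiply by base */
    }
    return r;
}
```

With 82589933 prime, Fermat gives powmod(3, 82589932, 82589933) == 1, matching the test output below.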

Speaking of, the fpu_mulmod ARM ports have been finished and are attached. I managed to use the minimum number of memory accesses (just transferring the array values to/from registers).

Here are the timings for my test programs on ARM Cortex-A55 (which do a poor man's Fermat test on the largest known Mersenne prime exponent):
Code:
time ./mulmod
3^82589932 mod 82589933 = 1

real    0m1.043s
user    0m1.040s
sys     0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real    0m0.870s
user    0m0.864s
sys     0m0.004s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real    0m1.737s
user    0m1.736s
sys     0m0.000s

time ./mulmod_4a_4b_4p
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real    0m2.300s
user    0m2.296s
sys     0m0.000s
This compares to the x86 FPU test of the same calculation:
Code:
time ./mulmod
3^82589932 mod 82589933 = 1

real    0m1.260s
user    0m1.232s
sys     0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real    0m1.110s
user    0m1.074s
sys     0m0.005s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real    0m1.284s
user    0m1.252s
sys     0m0.004s

time ./mulmod_4a_4b_4p
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real    0m1.543s
user    0m1.487s
sys     0m0.005s
Notably, the functions with a single mulmod are faster on ARM than on x86, while the multiple-mulmod functions scale much better on x86. The poor performance of multiple mulmods on ARM could be due to pipelining or slow memory accesses (the single mulmod functions are entirely contained in registers in my ARM implementation).
Attached Files
 fpu_mulmod.tar.gz (2.0 KB, 19 views)

2020-09-21, 17:26   #55
Citrix

Jun 2003

3²×5²×7 Posts

I am looking at the bestQ code:
Code:
uint32_t GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s)
{
   uint32_t baby, giant, work;

   ChooseSteps(&baby, &giant, Q, s);

   work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK;

   return work;
}
I understand the BABY_WORK, GIANT_WORK and SUBSEQ_WORK steps. Can you help me understand the Q*EXP_WORK term? I do not see anything corresponding to it in the discrete log function. For a large Q, if only one subsequence is left, it should be significantly faster than using a lower Q, but the Q*EXP_WORK term prevents the use of large Q. Thanks.
