[QUOTE=rogue;553455]The advantage of fpu_push() is that you only need to compute 1/p once. Then you multiply by 1/p, which saves calls to ucvtf d2/ARGp. It really comes down to how many concurrent instructions you can execute in the FPU pipeline. So if the fmul is waiting on the first ucvtf, then even if you have two or three more ucvtfs between that ucvtf and the fmul, this probably isn't costing you anything in performance.
What's great is that not having an FPU "stack" makes coding for the FPU simpler, even if other benefits of pipelining are not available. To get mtsieve to build one version vs. the other will require the sources to be placed in a new folder with a modified makefile. It will also require source changes so that the AVX logic in the C++ source is not compiled on ARM platforms. It might be as simple as an #ifdef ARM in those places.[/QUOTE] I think you're right. On the Cortex-A55 that the ODROID C4 uses, FMUL has a latency of 4 cycles and a throughput of 2 instructions/cycle, while a double-precision FDIV requires 22 cycles with a throughput of 1/19 instructions/cycle. I'll revert to my original design and stuff 1/p in d8, which is apparently callee-saved and thus should be preserved across calls. 
1 Attachment(s)
I finally got my hands on that ODROID C4. It arrived Sunday (a day early). I tried to boot it Monday, only to find that the spare wireless keyboard I had allocated to it no longer worked. Today, I managed to find a spare Ethernet cable, plug it into our router, and log in via ssh. After several first-timer coding errors, I finally have a working version of fpu_mulmod. I've attached a tarball with the new header, a remake of the push functions (which just copy the appropriate value to register d8), the two aforementioned versions of fpu_mulmod (one using the 1/p value in d8 and the other using a simple fdiv by p every time), and a simple (grossly inefficient) Fermat test with a Mersenne prime using fpu_mulmod (should result in 1), along with a simple Makefile to build them.
The normal mulmod (with 1/p calculated once and stored in d8) runs in 1.52s on my ODROID, while the div-p version (with the fdiv by p inside mulmod on each iteration) takes 2.3s. Interestingly, the x87 version takes 1.43s on my Core 2 Quad, just showing how much progress processor technology has made. 
[QUOTE=Happy5214;554253][...], and a simple (grossly inefficient) Fermat test with a Mersenne prime [B]exponent[/B] using fpu_mulmod (should result in 1), along with a simple Makefile to build them.[/QUOTE]
Needed correction added. :redface: I was too impatient to test M31, and smaller ones were too quick. 
Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have precomputed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.

[QUOTE=rogue;554288]Next up would be to port some of the others. Most are just an increment harder than fpu_mulmod. fpu_mulmod_4a_4b_4p() will be hardest, but it just means that you have precomputed 1/p for four distinct values of p. If you have enough registers that are retained between calls to the function, then that would be great.[/QUOTE]
Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can. I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those? 
[QUOTE=Happy5214;554362]Registers d8-d15 are all callee-saved and can be used. However, the disadvantage of not having an FPU stack is that saving the four 1/p values in d8-d11 would likely require a separate function, since it can't simply shift each value over like the x87 stack can.
I renamed the fpu_push functions to fpu_store, since they store the result in d8 rather than pushing it to a stack. Are there any issues with renaming those?[/QUOTE] For now that is okay. You could add an asm function to store 1/p for 4 p, and I could migrate to it the other calls where 4 values are put onto the stack in successive calls. 
Calculating best Q for srsieve2.exe
[QUOTE=rogue;554188] Do you know some sequences where Q > 720? It might be possible for srsieve2 to compute it, unless it would be extremely large.[/QUOTE]
Moving this from the other thread: A faster automated way of finding the bestQ would be:
1) Set BASE_MULTIPLE to the gcd of the differences of all consecutive terms left. For N1, N2, N3, N4, ... left in the file, set BASE_MULTIPLE = gcd(N2-N1, N3-N2, N4-N3, ...). All N can then be represented as N1 = k1*BASE_MULTIPLE + c, N2 = k2*BASE_MULTIPLE + c, etc.
2) Then LIMIT_BASE should be 720*7*11 = 55440, with NDIVISORS = LIMIT_BASE.
3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1; i.e. R[(n/BASE_MULTIPLE) % LIMIT_BASE] = 1; where k = n/BASE_MULTIPLE.
4) Use the current code of srsieve2 to find bestQ.
5) Q = bestQ*BASE_MULTIPLE 
[QUOTE=Citrix;554557]Moving this from the other thread:
A faster automated way of finding the bestQ would be:
1) Set BASE_MULTIPLE to the gcd of the differences of all consecutive terms left. For N1, N2, N3, N4, ... left in the file, set BASE_MULTIPLE = gcd(N2-N1, N3-N2, N4-N3, ...). All N can then be represented as N1 = k1*BASE_MULTIPLE + c, N2 = k2*BASE_MULTIPLE + c, etc.
2) Then LIMIT_BASE should be 720*7*11 = 55440, with NDIVISORS = LIMIT_BASE.
3) Instead of R[n%LIMIT_BASE] = 1; we use R[k%LIMIT_BASE] = 1; i.e. R[(n/BASE_MULTIPLE) % LIMIT_BASE] = 1; where k = n/BASE_MULTIPLE.
4) Use the current code of srsieve2 to find bestQ.
5) Q = bestQ*BASE_MULTIPLE[/QUOTE] You are welcome to play around with the code, build, and test. Speed really isn't an issue since it only has to execute that code once in a great while, but if a better Q can be found (thus allowing for faster sieving), then that would be great. 
I modified sr1sieve to read any Q from the command line and got a significant benefit for low-weight series. I am trying to combine that benefit with your faster FPU/AVX code.
The problem with modifying the srsieve2 code is that it supports multiple k's, and calculating the Q for that is very complicated. 
1 Attachment(s)
[QUOTE=rogue;550047]One place where the framework could benefit significantly is by switching from right-to-left exponentiation to left-to-right exponentiation. If anyone is interested in changing an ASM routine to use left-to-right exponentiation, please let me know.[/QUOTE]
[QUOTE=Happy5214;552329]I was thinking about doing this as an exercise in learning floating-point ASM, but the existing code appears to me to already be left-to-right, matching the algorithm in the source comments. Am I reading it wrong?[/QUOTE] The regular x86 fpu_powmod function [i]does[/i] use right-to-left exponentiation, though the others use left-to-right. Should I rewrite the x86 version while I'm writing the ARM version?

Speaking of which, the fpu_mulmod ARM ports are finished and attached. I managed to use the minimum number of memory accesses (just transferring the array values to/from registers). Here are the timings for my test programs on an ARM Cortex-A55 (which do a poor man's Fermat test on the largest known Mersenne prime exponent):

[code]time ./mulmod
3^82589932 mod 82589933 = 1

real    0m1.043s
user    0m1.040s
sys     0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real    0m0.870s
user    0m0.864s
sys     0m0.004s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real    0m1.737s
user    0m1.736s
sys     0m0.000s

time ./mulmod_4a_4b_4p
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real    0m2.300s
user    0m2.296s
sys     0m0.000s
[/code]

This compares to the x86 FPU test of the same calculation:

[code]time ./mulmod
3^82589932 mod 82589933 = 1

real    0m1.260s
user    0m1.232s
sys     0m0.000s

time ./mulmod_iter
3^82589932 mod 82589933 = 1

real    0m1.110s
user    0m1.074s
sys     0m0.005s

time ./mulmod_iter_4a
2^82589932 mod 82589933 = 1
2^82589932 mod 82589933 = 2
2^82589932 mod 82589933 = 4
2^82589932 mod 82589933 = 8

real    0m1.284s
user    0m1.252s
sys     0m0.004s

time ./mulmod_4a_4b_4p
2^82589932 mod 82589933 = 1
3^82589932 mod 82589933 = 1
4^82589932 mod 82589933 = 1
5^82589932 mod 82589933 = 1

real    0m1.543s
user    0m1.487s
sys     0m0.005s
[/code]

Notably, the functions with a single mulmod are faster on ARM than on x86, while the multiple-mulmod functions scale much better on x86. The poor performance of multiple mulmods on ARM could be due to pipelining or slow memory accesses (the single-mulmod functions are contained entirely in registers in my ARM implementation). 
I am looking at BestQ code
[code]uint32_t GenericSubsequenceHelper::RateQ(uint32_t Q, uint32_t s)
{
   uint32_t baby, giant, work;

   ChooseSteps(&baby, &giant, Q, s);

   work = baby*BABY_WORK + s*(giant-1)*GIANT_WORK + Q*EXP_WORK + s*SUBSEQ_WORK;

   return work;
}
[/code] I understand the BABY_WORK, GIANT_WORK, and SUBSEQ_WORK terms. Can you help me understand what the Q*EXP_WORK term is? I do not see anything corresponding to it in the discrete log function. For a large Q with only one subsequence left, it should be significantly faster than using a lower Q, but the Q*EXP_WORK term prevents the use of large Q. Thanks. 