20190825, 17:43  #199 
"Tilman Neumann"
Jan 2016
Germany
1000001000_{2} Posts 
You mean something like
Code:
long submod_v3(long x, long y, long n) { long r0 = xy; return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n: r0; } Code:
long submod_v3(long x, long y, long n) { long r0; r0 = xy; if (x+Long.MIN_VALUE < y+Long.MIN_VALUE) { r0 += n; } return r0; } The first variant is actually what I'm doing. Just kept the style you used before for recognition value. Last fiddled with by Till on 20190825 at 17:46 
20190825, 19:25  #200  
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2D3D_{16} Posts 
Quote:
I meant the versions without any branching code, explicit or otherwise. In C: Code:
__inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n) { uint64_t r0 = a  b; return r0 + (n & ((int64_t)r0 >> (int64_t)63)); } __inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n) { return (a  b) + (n & ((int64_t)(a  b) >> (int64_t)63)); } Last fiddled with by xilman on 20190825 at 19:29 

20190826, 00:51  #201  
"Ben"
Feb 2007
3×17×73 Posts 
Quote:


20190826, 13:31  #202  
"Ben"
Feb 2007
3×17×73 Posts 
Quote:
Code:
0000000000000050 <submod>: r0 = x  y; 50: 48 89 f8 mov %rdi,%rax 53: 48 29 f0 sub %rsi,%rax r0 += n; 56: 48 01 c2 add %rax,%rdx 59: 48 39 f7 cmp %rsi,%rdi 5c: 48 0f 42 c2 cmovb %rdx,%rax } 60: c3 retq Code:
size submod_1 submod_2 submod_3 submod_4 submod_5 submod_6 42 0.63 0.82 0.64 0.64 0.66 0.64 44 0.70 0.91 0.70 0.71 0.74 0.69 46 0.79 1.04 0.80 0.81 0.82 0.80 48 0.95 1.27 0.97 0.96 0.99 0.94 50 1.12 1.46 1.15 1.13 1.16 1.14 

20190826, 14:58  #203  
"Tilman Neumann"
Jan 2016
Germany
1000001000_{2} Posts 
Quote:
Quote:
Quote:
So maybe both r0 and the condition could be computed kind of parallelly? I am obviously no expert there, but maybe there is some space to improve the assembler code. 

20190826, 20:15  #204  
"Tilman Neumann"
Jan 2016
Germany
208_{16} Posts 
Quote:
Looking closer at your code: Doing signed operations on unsigned integers looks a bit adventurous. ;) But since Java does not have primitive unsigned types, testing was easy for me. In Java it boils down to Code:
long submod(long a, long b, long n) { return (a  b) + (n & ((long)(a  b) >> 63)); } But your proposition does not seem to perform any better than what we got before, i.e. Code:
long submod(long x, long y, long n) { return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n : r0; } 

20190827, 14:40  #205  
"Tilman Neumann"
Jan 2016
Germany
2^{3}×5×13 Posts 
Quote:
Quote:
When I was saying it does not perform any better I meant it is more or less equal to my last implementation. What we can say for sure is that the two variants are very close judged by the overall tinyecm performance. But I did more tests and it seems that your branchless implementation is slightly better for n>=52 bit. Not much, just 05% depending on the size of N. But I think I can comfirm that (which is not that easy having garbage collection and more strange things influencing timings in Java). I will switch to your version. Last fiddled with by Till on 20190827 at 14:59 Reason: inserted reference 

20191013, 14:59  #206 
"Tilman Neumann"
Jan 2016
Germany
2^{3}×5×13 Posts 
tinyecm map array
The map[] array of tinyecm.c is still a mystery to me.
It looks like a map from primes to prime indices (mod 60). But if that is true than the first line is wrong: 0, 1, 2, 0, 0, 0, 0, 3, 0, 0, contains one '0' too much between '2' and '3'. No matter what I tried to improve the array, performance is degrading. @Ben Did you trim some parameters depending on the map array? 
20191014, 13:49  #207  
"Ben"
Feb 2007
7213_{8} Posts 
Quote:
Code:
PA: 0 1 2 3 4 5 6 7 8 ... map: 0 1 2 0 0 0 0 3 0 ... Pb: 0 1 2 7 11 13 17 ... Tinyecm uses a small set of fixed B2/B1 parameters depending on the input size, and for those parameters I've precomputed exactly the values [PA] that produce the maximal primepairings for D=60 and U=1, using Montgomery's pair algorithm. These are the values stored in, for example, the "b1_70" array. So stage two amounts to looking up each Pb value in the "b1_xx" array using Pb[map[b1_xx[i]]] and multiplying its crossproduct into the accumulator, advancing Pa (the giant step value) each time a 0 is encountered. Last fiddled with by bsquared on 20191014 at 13:51 

20191014, 14:00  #208 
"Ben"
Feb 2007
3·17·73 Posts 
I suppose one slight optimization (and additional obfuscation) would be to preapply map[] to the b1_xx arrays to skip one level of indirection per loop iteration (so we'd just have Pb[b1_xx[i]]). But I'm sure that time is insignificant compared to the modular crossproduct and accumulation.

20191014, 14:35  #209 
"Tilman Neumann"
Jan 2016
Germany
520_{10} Posts 
Thanks for the explanation; in my ignorance I halfways expected that to be a bug.
And on the little optimization possibility I trust your assessment. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Factoring Mersenne numbers  paulunderwood  Miscellaneous Math  18  20170827 14:56 
Factoring Big Numbers In c#  ShridharRasal  Factoring  10  20080320 17:17 
Question about factoring code  Peter Nelson  Software  9  20050330 18:28 
Factoring Fermat Numbers  Axel Fox  Software  14  20030704 18:57 
Questions about SSE2 code and Factoring  Joe O  Software  2  20020913 23:39 