mersenneforum.org Some code for factoring numbers up to 2^63
 Register FAQ Search Today's Posts Mark Forums Read

 2019-08-25, 17:43 #199 Till     "Tilman Neumann" Jan 2016 Germany 10000010002 Posts You mean something like Code:  long submod_v3(long x, long y, long n) { long r0 = x-y; return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n: r0; } instead of Code:  long submod_v3(long x, long y, long n) { long r0; r0 = x-y; if (x+Long.MIN_VALUE < y+Long.MIN_VALUE) { r0 += n; } return r0; } ? The first variant is actually what I'm doing. Just kept the style you used before for recognition value. Last fiddled with by Till on 2019-08-25 at 17:46
2019-08-25, 19:25   #200
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2D3D16 Posts

Quote:
 Originally Posted by Till You mean something like Code:  long submod_v3(long x, long y, long n) { long r0 = x-y; return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n: r0; } instead of Code:  long submod_v3(long x, long y, long n) { long r0; r0 = x-y; if (x+Long.MIN_VALUE < y+Long.MIN_VALUE) { r0 += n; } return r0; } ? The first variant is actually what I'm doing. Just kept the style you used before for recognition value.
I don't know if "you" in the above means me.

I meant the versions without any branching code, explicit or otherwise. In C:
Code:
__inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t r0 = a - b;
return r0 + (n & ((int64_t)r0 >> (int64_t)63));
}

__inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n)
{
return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63));
}
I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.

Last fiddled with by xilman on 2019-08-25 at 19:29

2019-08-26, 00:51   #201
bsquared

"Ben"
Feb 2007

3×17×73 Posts

Quote:
 Originally Posted by Till My updated table looks like this now: Code: bits | Hart Lehman1 Lehman2 Squfof1 Squfof2 Rho1 Rho2 ECM(J) | ECM(C) --------+------------------------------------------------------------------------------------+------------ 42 | 1.01 1.01 1.37 2.29 3.85 1.82 1.98 2.53 | 1.24 44 | 1.60 1.60 2.18 3.23 5.28 2.35 2.57 2.82 | 1.34 46 | 2.56 2.48 3.48 4.84 7.34 3.09 3.45 3.20 | 1.50 48 | 3.87 3.81 5.46 6.92 10.14 4.31 4.68 3.82 | 1.82 50 | 6.33 6.38 8.67 11.82 16.49 5.64 6.42 4.50 | 2.09 52 | 11.01 10.31 14.28 18.78 21.51 7.59 8.48 5.71 | 2.52 54 | 16.00 16.20 22.90 29.37 10.50 11.98 8.15 | 3.27 56 | 25.74 25.65 36.89 40.69 14.29 15.90 9.44 | 3.81 58 | 40.84 41.20 59.31 56.01 21.56 12.26 | 4.86 60 | 63.28 77.57 30.40 14.90 | 5.90 62 | 98.75 106.21 40.55 19.34 | 7.20
Very nice, lots of great progress. Now fastest for you starting at 48 bits with this data set.

2019-08-26, 13:31   #202
bsquared

"Ben"
Feb 2007

3×17×73 Posts

Quote:
 Originally Posted by Till 3. Another improvement of the submod() method. This one may be interesting in C, too. Using Code: long submod_v2(long x, long y, long n) { long r0; r0 = x-y; if (x < y) { r0 += n; } return r0; } instead of Code: long submod_v1(long x, long y, long n) { long r0; r0 = x-y; if (r0 > x) { r0 += n; } return r0; } yields another overall performance improvement of about 10%.
Nice one. Compiles to something nearly identical to the inline assembler version (timings under submod_v6 in the table below):

Code:
0000000000000050 <submod>:
r0 = x - y;
50:	48 89 f8             	mov    %rdi,%rax
53:	48 29 f0             	sub    %rsi,%rax
r0 += n;
56:	48 01 c2             	add    %rax,%rdx
59:	48 39 f7             	cmp    %rsi,%rdi
5c:	48 0f 42 c2          	cmovb  %rdx,%rax
}
60:	c3                   	retq
Code:
size    submod_1  submod_2  submod_3  submod_4  submod_5  submod_6
42      0.63      0.82      0.64      0.64      0.66      0.64
44      0.70      0.91      0.70      0.71      0.74      0.69
46      0.79      1.04      0.80      0.81      0.82      0.80
48      0.95      1.27      0.97      0.96      0.99      0.94
50      1.12      1.46      1.15      1.13      1.16      1.14
Should really make a separate test for just the submod; the real differences between these become invisible once other functions in the ecm code start dominating.

2019-08-26, 14:58   #203
Till

"Tilman Neumann"
Jan 2016
Germany

10000010002 Posts

Quote:
 Originally Posted by xilman I don't know if "you" in the above means me. I meant the versions without any branching code, explicit or otherwise. In C: Code: __inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n) { uint64_t r0 = a - b; return r0 + (n & ((int64_t)r0 >> (int64_t)63)); } __inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n) { return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63)); } I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.
Sorry, I was preconditioned and just seeing the red name I thought it was Ben who answered ;-) I didn't try your proposition yet, but I will.

Quote:
 Originally Posted by bsquared Very nice, lots of great progress. Now fastest for you starting at 48 bits with this data set.
Yes, I am quite happy about the progress, too :-)

Quote:
 Originally Posted by bsquared Nice one. Compiles to something nearly identical to the inline assembler version (timings under submod_v6 in the table below): Code: 0000000000000050 : r0 = x - y; 50: 48 89 f8 mov %rdi,%rax 53: 48 29 f0 sub %rsi,%rax r0 += n; 56: 48 01 c2 add %rax,%rdx 59: 48 39 f7 cmp %rsi,%rdi 5c: 48 0f 42 c2 cmovb %rdx,%rax } 60: c3 retq Code: size submod_1 submod_2 submod_3 submod_4 submod_5 submod_6 42 0.63 0.82 0.64 0.64 0.66 0.64 44 0.70 0.91 0.70 0.71 0.74 0.69 46 0.79 1.04 0.80 0.81 0.82 0.80 48 0.95 1.27 0.97 0.96 0.99 0.94 50 1.12 1.46 1.15 1.13 1.16 1.14 Should really make a separate test for just the submod; the real differences between these become invisible once other functions in the ecm code start dominating.
In the new variant (submod_6), the condition does not depend on r0.
So maybe both r0 and the condition could be computed kind of parallelly? I am obviously no expert there, but maybe there is some space to improve the assembler code.

2019-08-26, 20:15   #204
Till

"Tilman Neumann"
Jan 2016
Germany

20816 Posts

Quote:
 Originally Posted by xilman I don't know if "you" in the above means me. I meant the versions without any branching code, explicit or otherwise. In C: Code: __inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n) { uint64_t r0 = a - b; return r0 + (n & ((int64_t)r0 >> (int64_t)63)); } __inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n) { return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63)); } I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.

Looking closer at your code: Doing signed operations on unsigned integers looks a bit adventurous. ;-) But since Java does not have primitive unsigned types, testing was easy for me. In Java it boils down to
Code:
    long submod(long a, long b, long n) {
return (a - b) + (n & ((long)(a - b) >> 63));
}
In Java it seems to be important to use the ">>" operator (shifting in the sign bit, too) instead of ">>>" (ignoring the sign bit).

But your proposition does not seem to perform any better than what we got before, i.e.
Code:
    long submod(long x, long y, long n) {
return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n : r0;
}

2019-08-27, 14:40   #205
Till

"Tilman Neumann"
Jan 2016
Germany

23×5×13 Posts

Quote:
 Originally Posted by xilman I don't know if "you" in the above means me. I meant the versions without any branching code, explicit or otherwise. In C: Code: __inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n) { uint64_t r0 = a - b; return r0 + (n & ((int64_t)r0 >> (int64_t)63)); } __inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n) { return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63)); } I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.

Quote:
 Originally Posted by Till But your proposition does not seem to perform any better than what we got before, i.e. Code:  long submod(long x, long y, long n) { return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n : r0; }

When I was saying it does not perform any better I meant it is more or less equal to my last implementation. What we can say for sure is that the two variants are very close judged by the overall tinyecm performance. But I did more tests and it seems that your branch-less implementation is slightly better for n>=52 bit. Not much, just 0-5% depending on the size of N. But I think I can comfirm that (which is not that easy having garbage collection and more strange things influencing timings in Java).

I will switch to your version.

Last fiddled with by Till on 2019-08-27 at 14:59 Reason: inserted reference

 2019-10-13, 14:59 #206 Till     "Tilman Neumann" Jan 2016 Germany 23×5×13 Posts tinyecm map array The map[] array of tinyecm.c is still a mystery to me. It looks like a map from primes to prime indices (mod 60). But if that is true than the first line is wrong: 0, 1, 2, 0, 0, 0, 0, 3, 0, 0, contains one '0' too much between '2' and '3'. No matter what I tried to improve the array, performance is degrading. @Ben Did you trim some parameters depending on the map array?
2019-10-14, 13:49   #207
bsquared

"Ben"
Feb 2007

72138 Posts

Quote:
 Originally Posted by Till The map[] array of tinyecm.c is still a mystery to me. It looks like a map from primes to prime indices (mod 60). But if that is true than the first line is wrong: 0, 1, 2, 0, 0, 0, 0, 3, 0, 0, contains one '0' too much between '2' and '3'. No matter what I tried to improve the array, performance is degrading. @Ben Did you trim some parameters depending on the map array?
It is a map from differences (P - A) to an array where the precomputed values [P-A]Q are stored, where P is the current stage 2 prime, A is the current giant step value, and Q is the stage 1 residue. Mostly (P-A) are prime, but I also need for example the value 1 and the value D/2, which are not prime. So the map is indexed from 0:

Code:
P-A: 0 1 2 3 4 5 6 7 8 ...
map: 0 1 2 0 0 0 0 3 0 ...
Pb:  0 1 2 7 11 13 17 ...
The precomputed [P-A]Q are stored in the "Pb" array.

Tinyecm uses a small set of fixed B2/B1 parameters depending on the input size, and for those parameters I've precomputed exactly the values [P-A] that produce the maximal prime-pairings for D=60 and U=1, using Montgomery's pair algorithm. These are the values stored in, for example, the "b1_70" array.

So stage two amounts to looking up each Pb value in the "b1_xx" array using Pb[map[b1_xx[i]]] and multiplying its cross-product into the accumulator, advancing Pa (the giant step value) each time a 0 is encountered.

Last fiddled with by bsquared on 2019-10-14 at 13:51

2019-10-14, 14:00   #208
bsquared

"Ben"
Feb 2007

3·17·73 Posts

Quote:
 Originally Posted by bsquared So stage two amounts to looking up each Pb value in the "b1_xx" array using Pb[map[b1_xx[i]]] and multiplying its cross-product into the accumulator, advancing Pa (the giant step value) each time a 0 is encountered.
I suppose one slight optimization (and additional obfuscation) would be to pre-apply map[] to the b1_xx arrays to skip one level of indirection per loop iteration (so we'd just have Pb[b1_xx[i]]). But I'm sure that time is insignificant compared to the modular cross-product and accumulation.

 2019-10-14, 14:35 #209 Till     "Tilman Neumann" Jan 2016 Germany 52010 Posts Thanks for the explanation; in my ignorance I half-ways expected that to be a bug. And on the little optimization possibility I trust your assessment.

 Similar Threads Thread Thread Starter Forum Replies Last Post paulunderwood Miscellaneous Math 18 2017-08-27 14:56 ShridharRasal Factoring 10 2008-03-20 17:17 Peter Nelson Software 9 2005-03-30 18:28 Axel Fox Software 14 2003-07-04 18:57 Joe O Software 2 2002-09-13 23:39

All times are UTC. The time now is 19:29.

Fri Dec 2 19:29:29 UTC 2022 up 106 days, 16:58, 0 users, load averages: 0.92, 1.01, 0.99