mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2019-08-25, 17:43   #199
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

10000010002 Posts
Default

You mean something like
Code:
 long submod_v3(long x, long y, long n) {
     long r0 = x-y;
     return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n: r0;
 }
instead of
Code:
 long submod_v3(long x, long y, long n) {
     long r0;
     r0 = x-y;
     if (x+Long.MIN_VALUE < y+Long.MIN_VALUE) {
         r0 += n;
     }
     return r0;
  }
?


The first variant is actually what I'm doing. Just kept the style you used before for recognition value.

Last fiddled with by Till on 2019-08-25 at 17:46
Till is offline   Reply With Quote
Old 2019-08-25, 19:25   #200
xilman
Bamboozled!
 
xilman's Avatar
 
"๐’‰บ๐’ŒŒ๐’‡ท๐’†ท๐’€ญ"
May 2003
Down not across

2D3D16 Posts
Default

Quote:
Originally Posted by Till View Post
You mean something like
Code:
 long submod_v3(long x, long y, long n) {
     long r0 = x-y;
     return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n: r0;
 }
instead of
Code:
 long submod_v3(long x, long y, long n) {
     long r0;
     r0 = x-y;
     if (x+Long.MIN_VALUE < y+Long.MIN_VALUE) {
         r0 += n;
     }
     return r0;
  }
?


The first variant is actually what I'm doing. Just kept the style you used before for recognition value.
I don't know if "you" in the above means me.

I meant the versions without any branching code, explicit or otherwise. In C:
Code:
__inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t r0 = a - b;
return r0 + (n & ((int64_t)r0 >> (int64_t)63));
}

__inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n)
{
return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63));
}
I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.

Last fiddled with by xilman on 2019-08-25 at 19:29
xilman is offline   Reply With Quote
Old 2019-08-26, 00:51   #201
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3×17×73 Posts
Default

Quote:
Originally Posted by Till View Post

My updated table looks like this now:
Code:
bits    | Hart    Lehman1    Lehman2    Squfof1    Squfof2    Rho1      Rho2     ECM(J)      | ECM(C)
--------+------------------------------------------------------------------------------------+------------
42      | 1.01     1.01       1.37       2.29        3.85      1.82     1.98     2.53        | 1.24
44      | 1.60     1.60       2.18       3.23        5.28      2.35     2.57     2.82        | 1.34
46      | 2.56     2.48       3.48       4.84        7.34      3.09     3.45     3.20        | 1.50
48      | 3.87     3.81       5.46       6.92       10.14      4.31     4.68     3.82        | 1.82
50      | 6.33     6.38       8.67      11.82       16.49      5.64     6.42     4.50        | 2.09
52      | 11.01   10.31       14.28     18.78       21.51      7.59     8.48     5.71        | 2.52
54      | 16.00   16.20       22.90                 29.37      10.50    11.98    8.15        | 3.27
56      | 25.74   25.65       36.89                 40.69      14.29    15.90    9.44        | 3.81
58      | 40.84   41.20       59.31                 56.01               21.56    12.26       | 4.86
60      | 63.28                                     77.57               30.40    14.90       | 5.90
62      | 98.75                                     106.21              40.55    19.34       | 7.20
Very nice, lots of great progress. Now fastest for you starting at 48 bits with this data set.
bsquared is offline   Reply With Quote
Old 2019-08-26, 13:31   #202
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3×17×73 Posts
Default

Quote:
Originally Posted by Till View Post
3. Another improvement of the submod() method.
This one may be interesting in C, too.

Using
Code:
long submod_v2(long x, long y, long n) {
    long r0;
    r0 = x-y;
    if (x < y) {
        r0 += n;
    }
    return r0;
}
instead of
Code:
long submod_v1(long x, long y, long n) {
    long r0;
    r0 = x-y;
    if (r0 > x) {
        r0 += n;
    }
    return r0;
}
yields another overall performance improvement of about 10%.
Nice one. Compiles to something nearly identical to the inline assembler version (timings under submod_v6 in the table below):

Code:
0000000000000050 <submod>:
    r0 = x - y;
      50:	48 89 f8             	mov    %rdi,%rax
      53:	48 29 f0             	sub    %rsi,%rax
        r0 += n;
      56:	48 01 c2             	add    %rax,%rdx
      59:	48 39 f7             	cmp    %rsi,%rdi
      5c:	48 0f 42 c2          	cmovb  %rdx,%rax
}
      60:	c3                   	retq
Code:
size    submod_1  submod_2  submod_3  submod_4  submod_5  submod_6
42      0.63      0.82      0.64      0.64      0.66      0.64
44      0.70      0.91      0.70      0.71      0.74      0.69
46      0.79      1.04      0.80      0.81      0.82      0.80
48      0.95      1.27      0.97      0.96      0.99      0.94
50      1.12      1.46      1.15      1.13      1.16      1.14
Should really make a separate test for just the submod; the real differences between these become invisible once other functions in the ecm code start dominating.
bsquared is offline   Reply With Quote
Old 2019-08-26, 14:58   #203
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

10000010002 Posts
Default

Quote:
Originally Posted by xilman View Post
I don't know if "you" in the above means me.

I meant the versions without any branching code, explicit or otherwise. In C:
Code:
__inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t r0 = a - b;
return r0 + (n & ((int64_t)r0 >> (int64_t)63));
}

__inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n)
{
return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63));
}
I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.
Sorry, I was preconditioned and just seeing the red name I thought it was Ben who answered ;-) I didn't try your proposition yet, but I will.


Quote:
Originally Posted by bsquared View Post
Very nice, lots of great progress. Now fastest for you starting at 48 bits with this data set.
Yes, I am quite happy about the progress, too :-)


Quote:
Originally Posted by bsquared View Post
Nice one. Compiles to something nearly identical to the inline assembler version (timings under submod_v6 in the table below):

Code:
0000000000000050 <submod>:
    r0 = x - y;
      50:    48 89 f8                 mov    %rdi,%rax
      53:    48 29 f0                 sub    %rsi,%rax
        r0 += n;
      56:    48 01 c2                 add    %rax,%rdx
      59:    48 39 f7                 cmp    %rsi,%rdi
      5c:    48 0f 42 c2              cmovb  %rdx,%rax
}
      60:    c3                       retq
Code:
size    submod_1  submod_2  submod_3  submod_4  submod_5  submod_6
42      0.63      0.82      0.64      0.64      0.66      0.64
44      0.70      0.91      0.70      0.71      0.74      0.69
46      0.79      1.04      0.80      0.81      0.82      0.80
48      0.95      1.27      0.97      0.96      0.99      0.94
50      1.12      1.46      1.15      1.13      1.16      1.14
Should really make a separate test for just the submod; the real differences between these become invisible once other functions in the ecm code start dominating.
In the new variant (submod_6), the condition does not depend on r0.
So maybe both r0 and the condition could be computed kind of parallelly? I am obviously no expert there, but maybe there is some space to improve the assembler code.
Till is offline   Reply With Quote
Old 2019-08-26, 20:15   #204
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

20816 Posts
Default

Quote:
Originally Posted by xilman View Post
I don't know if "you" in the above means me.

I meant the versions without any branching code, explicit or otherwise. In C:
Code:
__inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t r0 = a - b;
return r0 + (n & ((int64_t)r0 >> (int64_t)63));
}

__inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n)
{
return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63));
}
I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.

Looking closer at your code: Doing signed operations on unsigned integers looks a bit adventurous. ;-) But since Java does not have primitive unsigned types, testing was easy for me. In Java it boils down to
Code:
    long submod(long a, long b, long n) {
        return (a - b) + (n & ((long)(a - b) >> 63));
    }
In Java it seems to be important to use the ">>" operator (shifting in the sign bit, too) instead of ">>>" (ignoring the sign bit).


But your proposition does not seem to perform any better than what we got before, i.e.
Code:
    long submod(long x, long y, long n) {
        return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n : r0;
    }
Till is offline   Reply With Quote
Old 2019-08-27, 14:40   #205
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

23×5×13 Posts
Default

Quote:
Originally Posted by xilman View Post
I don't know if "you" in the above means me.

I meant the versions without any branching code, explicit or otherwise. In C:
Code:
__inline uint64_t submod_3(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t r0 = a - b;
return r0 + (n & ((int64_t)r0 >> (int64_t)63));
}

__inline uint64_t submod_4(uint64_t a, uint64_t b, uint64_t n)
{
return (a - b) + (n & ((int64_t)(a - b) >> (int64_t)63));
}
I don't speak Java but the C code uses arithmetic right shifts and bitwise-and, rather than conditional addition.


Quote:
Originally Posted by Till View Post
But your proposition does not seem to perform any better than what we got before, i.e.
Code:
    long submod(long x, long y, long n) {
        return (x+Long.MIN_VALUE < y+Long.MIN_VALUE) ? r0+n : r0;
    }



When I was saying it does not perform any better I meant it is more or less equal to my last implementation. What we can say for sure is that the two variants are very close judged by the overall tinyecm performance. But I did more tests and it seems that your branch-less implementation is slightly better for n>=52 bit. Not much, just 0-5% depending on the size of N. But I think I can comfirm that (which is not that easy having garbage collection and more strange things influencing timings in Java).


I will switch to your version.

Last fiddled with by Till on 2019-08-27 at 14:59 Reason: inserted reference
Till is offline   Reply With Quote
Old 2019-10-13, 14:59   #206
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

23×5×13 Posts
Default tinyecm map array

The map[] array of tinyecm.c is still a mystery to me.
It looks like a map from primes to prime indices (mod 60).
But if that is true than the first line is wrong:
0, 1, 2, 0, 0, 0, 0, 3, 0, 0,
contains one '0' too much between '2' and '3'.


No matter what I tried to improve the array, performance is degrading.


@Ben Did you trim some parameters depending on the map array?
Till is offline   Reply With Quote
Old 2019-10-14, 13:49   #207
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

72138 Posts
Default

Quote:
Originally Posted by Till View Post
The map[] array of tinyecm.c is still a mystery to me.
It looks like a map from primes to prime indices (mod 60).
But if that is true than the first line is wrong:
0, 1, 2, 0, 0, 0, 0, 3, 0, 0,
contains one '0' too much between '2' and '3'.


No matter what I tried to improve the array, performance is degrading.


@Ben Did you trim some parameters depending on the map array?
It is a map from differences (P - A) to an array where the precomputed values [P-A]Q are stored, where P is the current stage 2 prime, A is the current giant step value, and Q is the stage 1 residue. Mostly (P-A) are prime, but I also need for example the value 1 and the value D/2, which are not prime. So the map is indexed from 0:

Code:
P-A: 0 1 2 3 4 5 6 7 8 ...
map: 0 1 2 0 0 0 0 3 0 ...
Pb:  0 1 2 7 11 13 17 ...
The precomputed [P-A]Q are stored in the "Pb" array.

Tinyecm uses a small set of fixed B2/B1 parameters depending on the input size, and for those parameters I've precomputed exactly the values [P-A] that produce the maximal prime-pairings for D=60 and U=1, using Montgomery's pair algorithm. These are the values stored in, for example, the "b1_70" array.

So stage two amounts to looking up each Pb value in the "b1_xx" array using Pb[map[b1_xx[i]]] and multiplying its cross-product into the accumulator, advancing Pa (the giant step value) each time a 0 is encountered.

Last fiddled with by bsquared on 2019-10-14 at 13:51
bsquared is offline   Reply With Quote
Old 2019-10-14, 14:00   #208
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

3·17·73 Posts
Default

Quote:
Originally Posted by bsquared View Post

So stage two amounts to looking up each Pb value in the "b1_xx" array using Pb[map[b1_xx[i]]] and multiplying its cross-product into the accumulator, advancing Pa (the giant step value) each time a 0 is encountered.
I suppose one slight optimization (and additional obfuscation) would be to pre-apply map[] to the b1_xx arrays to skip one level of indirection per loop iteration (so we'd just have Pb[b1_xx[i]]). But I'm sure that time is insignificant compared to the modular cross-product and accumulation.
bsquared is offline   Reply With Quote
Old 2019-10-14, 14:35   #209
Till
 
Till's Avatar
 
"Tilman Neumann"
Jan 2016
Germany

52010 Posts
Default

Thanks for the explanation; in my ignorance I half-ways expected that to be a bug.


And on the little optimization possibility I trust your assessment.
Till is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Factoring Mersenne numbers paulunderwood Miscellaneous Math 18 2017-08-27 14:56
Factoring Big Numbers In c# ShridharRasal Factoring 10 2008-03-20 17:17
Question about factoring code Peter Nelson Software 9 2005-03-30 18:28
Factoring Fermat Numbers Axel Fox Software 14 2003-07-04 18:57
Questions about SSE2 code and Factoring Joe O Software 2 2002-09-13 23:39

All times are UTC. The time now is 19:29.


Fri Dec 2 19:29:29 UTC 2022 up 106 days, 16:58, 0 users, load averages: 0.92, 1.01, 0.99

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

โ‰  ยฑ โˆ“ รท ร— ยท โˆ’ โˆš โ€ฐ โŠ— โŠ• โŠ– โŠ˜ โŠ™ โ‰ค โ‰ฅ โ‰ฆ โ‰ง โ‰จ โ‰ฉ โ‰บ โ‰ป โ‰ผ โ‰ฝ โŠ โŠ โŠ‘ โŠ’ ยฒ ยณ ยฐ
โˆ  โˆŸ ยฐ โ‰… ~ โ€– โŸ‚ โซ›
โ‰ก โ‰œ โ‰ˆ โˆ โˆž โ‰ช โ‰ซ โŒŠโŒ‹ โŒˆโŒ‰ โˆ˜ โˆ โˆ โˆ‘ โˆง โˆจ โˆฉ โˆช โจ€ โŠ• โŠ— ๐–• ๐–– ๐–— โŠฒ โŠณ
โˆ… โˆ– โˆ โ†ฆ โ†ฃ โˆฉ โˆช โŠ† โŠ‚ โŠ„ โŠŠ โŠ‡ โŠƒ โŠ… โŠ‹ โŠ– โˆˆ โˆ‰ โˆ‹ โˆŒ โ„• โ„ค โ„š โ„ โ„‚ โ„ต โ„ถ โ„ท โ„ธ ๐“Ÿ
ยฌ โˆจ โˆง โŠ• โ†’ โ† โ‡’ โ‡ โ‡” โˆ€ โˆƒ โˆ„ โˆด โˆต โŠค โŠฅ โŠข โŠจ โซค โŠฃ โ€ฆ โ‹ฏ โ‹ฎ โ‹ฐ โ‹ฑ
โˆซ โˆฌ โˆญ โˆฎ โˆฏ โˆฐ โˆ‡ โˆ† ฮด โˆ‚ โ„ฑ โ„’ โ„“
๐›ข๐›ผ ๐›ฃ๐›ฝ ๐›ค๐›พ ๐›ฅ๐›ฟ ๐›ฆ๐œ€๐œ– ๐›ง๐œ ๐›จ๐œ‚ ๐›ฉ๐œƒ๐œ— ๐›ช๐œ„ ๐›ซ๐œ… ๐›ฌ๐œ† ๐›ญ๐œ‡ ๐›ฎ๐œˆ ๐›ฏ๐œ‰ ๐›ฐ๐œŠ ๐›ฑ๐œ‹ ๐›ฒ๐œŒ ๐›ด๐œŽ๐œ ๐›ต๐œ ๐›ถ๐œ ๐›ท๐œ™๐œ‘ ๐›ธ๐œ’ ๐›น๐œ“ ๐›บ๐œ”