 mersenneforum.org Coding Challenges
 Register FAQ Search Today's Posts Mark Forums Read  2005-08-05, 16:50 #1 R.D. Silverman   Nov 2003 11101001001002 Posts Coding Challenges Hello everybody, I want to extend my thanks to everyone who participated in my little "challenge". Some nice speedups were obtained. Kudos to those who helped . It is unfortunate that the fastest code produced had to come from going to assembler; this clearly makes it less portable. (but thanks to xenon for nice work and thanks to Alex for improvements to the C code  ) [Yes, we are all aware that going completely to assembler would yield faster code] I am a bit surprised that noone responded to the challenge to improve the lattice reduction code. It is not by coincidence that I presented both routines because the algorithms are * almost* identical. I reworked Arjen's conditional shift & subtract code for the lattice reduction into something closer to the modinverse code. I give that code here: /************************************************************************/ /* */ /* New routine to do lattice basis reduction via Euclidean algorithm */ /* */ /************************************************************************/ new_latred(q,r, v1, v2) int q,r, v1[], v2[]; { /* start of new_latred */ int a11, a12, a21, a22, t, rem; double norm_row1, tmp_norm, dot_prod; int accum1, accum2, t1,t2; int tmp_a11, tmp_a12; a11 = q; a21 = r; a12 = 0; a22 = 1; //printf("| %d %d |\n", a11, a12); //printf("| %d %d |\n\n", a21, a22); /* we take advantage of the fact that most partial quotients are small */ /* by doing the division by successive conditional subtraction. */ norm_row1 = (double)a11 * (double)a11 + (double)a12 * (double)a12; while(1) { accum1 = a21; accum2 = a22; rem = a11 - a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem -= a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem -= a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem -= a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem -= a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem -= a21; if (rem >= a21) { t = a11/a21; accum1 = t * a21; accum2 = t * a22; } } } } } } tmp_a11 = a11 - accum1; // update row 1 tmp_a12 = a12 - accum2; tmp_norm = (double)tmp_a11 * (double)tmp_a11 + (double)tmp_a12 * (double)tmp_a12; if (tmp_norm > norm_row1) break; // we are done t1 = tmp_a11; // swap rows t2 = tmp_a12; a11 = a21; a12 = a22; a21 = t1; a22 = t2; norm_row1 = tmp_norm; // printf("| %d %d |\n", a11, a12); // printf("| %d %d |\n\n", a21, a22); } /* Need to do a single Gram-Schmidt row reduction if possible */ dot_prod = (double)a11 * a21 + (double)a12 * a22; t = round_double(dot_prod/norm_row1); //printf("G-S t value = %d\n",t); v1 = a11 - t * a21;; v1 = a12 - t * a22; v2 = a21; v2 = a22; //printf("final result:\n"); //printf("| %d %d |\n", v1, v1); //printf("| %d %d |\n\n", v2, v2); return(0); } /* end of new_latred */ -------------------------------------------------------------------- I then compared the two routines for 3 different values of p as r varied over 10000 different, consecutive values. Here are the time RATIOS for the two routines (old routine/new routine) . I give values for 20 different r values for each prime. r: 2.153262 r: 2.536551 r: 2.333333 r: 2.467522 r: 2.265208 r: 2.054390 r: 2.488469 r: 2.240045 r: 2.071930 r: 2.285150 r: 2.020040 r: 2.409091 r: 2.093275 r: 2.271066 r: 2.068036 r: 2.264728 r: 2.444172 r: 2.147155 r: 2.220281 ----------------- r: 1.811325 r: 2.209565 r: 1.966776 r: 2.392782 r: 1.994787 r: 1.534354 r: 2.500306 r: 2.003802 r: 2.159395 r: 1.934477 r: 2.259038 r: 1.589700 r: 2.106504 r: 1.920269 r: 1.893444 r: 2.049665 r: 1.839205 r: 1.827455 r: 1.873463 ----------------- r: 2.010581 r: 2.088218 r: 2.262241 r: 2.306903 r: 2.110598 r: 1.971844 r: 2.464969 r: 1.773949 r: 2.348313 r: 2.564947 r: 2.322762 r: 2.154452 r: 2.184591 r: 2.193857 r: 2.186851 r: 1.842859 r: 2.255534 r: 2.199179 r: 3.877670 =============================================== The new routine is more than twice as fast as the old one!!!!!  I see some additional improvements that might be made. Rather than just proceeding with the final Gram-Schmidt reduction, we can (perhaps?) avoid the final floating point division by checking that t > 1/2 before the division simply by comparing the two norms. i.e. 2*dot_prod > norm_row1. Whether this check gains anything depends on how often t is non-zero. We can also do some more subtracts in the inner loop before succumbing to doing the division etc. Would anyone like to take a crack at the improved version?     2005-08-05, 18:42 #2 dsouza123   Sep 2002 12268 Posts For the lattice basis reduction: What values of p and range of r ? How can the results be verified ? My reasoning behind modinv versus latred: The mathematics behind modinv routine was very understandable, familar, easily verifiable. The code was understandable and short. The code for lattice basis reduction was harder to follow, and contained conditional compiler code and was twice as long as the modinv. You provided very helpful, detailed information on the range of the parameters for modinv. (I did redo the modiv routine in Delphi and MASM32, but your using C/C++ and xenon handled the x86 Assembly.) Is there any unoptimized code to look at ? The 2X+ speed increase is already quite good. Code in x86 assembly or C or both ?   2005-08-05, 19:14   #3
R.D. Silverman

Nov 2003

22·5·373 Posts Quote:
 Originally Posted by dsouza123 For the lattice basis reduction: What values of p and range of r ?

For the lattice basis reduction:
What values of p and range of r ?

Single precision. Same range of p and r as in modinv.

How can the results be verified ?

The matrix determinant is p. It remains invariant.
Each reduction step is equivalent to multiplying by a
uni-modular marix, and hence the determinant does not change.
The coefficients in the matrix get smaller, until the min is reached,
then further "reductions" result in one row becoming very small
while the other grows big again. The condition on the L2 norm
in the code says when to stop.

My reasoning behind modinv versus latred:
The mathematics behind modinv routine
was very understandable, familar, easily verifiable.

But the math for the lattice reduction is the same: Euclidean Algorithm
[reduction follws with a Gram-Schmidt orthogonalization at the end
if it is useful]

The code was understandable and short.

Thank you for the compliment.

Code in x86 assembly or C or both ?

C is strongly preferred, but of course both are acceptable.

Note: I already tried the test I suggested that avoids the final
floating div at the expense of an extra branch. It does give a slight
speedup.   2005-08-05, 20:43 #4 dsouza123   Sep 2002 10100101102 Posts The newer code is easier to follow. I'll try in assembly, have some implementation ideas but no algorithmic ones. modinv used a, modulus with modulus the prime new_latred has q , r so is r for a and q for modulus ?   2005-08-05, 20:52   #5
R.D. Silverman

Nov 2003

22×5×373 Posts Quote:
 Originally Posted by dsouza123 The newer code is easier to follow. I'll try in assembly, have some implementation ideas but no algorithmic ones. modinv used a, modulus with modulus the prime new_latred has q , r so is r for a and q for modulus ?
Sort of. p is a prime in both cases. a and r play similar roles
algorithmically, but their meaning is different. r stands for 'root'

It is the root of a polynomial (which itself is irrelevant to the
algorithm) taken mod p.

Arjen's old code basically tried to replace the division operation
that computes t at each iteration with a series of trial shifts and
subtracts... Better variable names would have aided the clarity......   2005-08-06, 05:04   #6
xenon

Jul 2004

5×7 Posts Quote:
 Originally Posted by R.D. Silverman Sort of. p is a prime in both cases. a and r play similar roles algorithmically, but their meaning is different. r stands for 'root'
Please give a few p values for benchmarking. Can you confirm that r is guaranteed to be in the interval [1, p-1]? Well, I see that your code uses "q" and we talk much of "p", I assume we mean the same thing.   2005-08-06, 12:52   #7
R.D. Silverman

Nov 2003

22×5×373 Posts Quote:
 Originally Posted by xenon Please give a few p values for benchmarking. Can you confirm that r is guaranteed to be in the interval [1, p-1]? Well, I see that your code uses "q" and we talk much of "p", I assume we mean the same thing.
'q' arises from its standard usage in NFS literature as 'special-q' prime.
r is indeed in [1,p-1]. q is typically 15 to 27 bits; choose some at random.

BTW, I have already improved the code with:

dot_prod = (double)a11 * (double)a21 + (double)a12 * (double)a22;

if ((dot_prod + dot_prod) < norm_row1)
{
v1 = a11;
v1 = a12;
v2 = a21;
v2 = a22;
return(0);
}

t = round_double(dot_prod/norm_row1); etc.

The if test avoids quite a few divisions at low cost.

I tried adding a couple of more conditional subtracts in the main loop,
but saw little difference.

Thanks again for the help!!!!   2005-08-06, 13:34 #8 xenon   Jul 2004 5×7 Posts The code doesn't work with r=1. Perhaps you mean r in [2, p-1]. Before looking at timings, I want to get the result correct. While modular inverse has one correct answer, lattice reduction can have more than one. You didn't specify exactly what you want. I assume shorter vectors are better. But I found that your code doesn't produce the shortest vectors when r is small. I get shorter vector by computing t = nearest_int(dot_product/norm) in every iteration (this is slow). About verifying the results, any other method besides computing determinant? Computing determinant isn't "strict" enough to catch all errors. i.e. if r changes, the determinant doesn't change.   2005-08-06, 18:05   #9
dsouza123

Sep 2002

2×331 Posts Quote:
 Originally Posted by R.D. Silverman I then compared the two routines for 3 different values of p as r varied over 10000 different, consecutive values. Here are the time RATIOS for the two routines (old routine/new routine) . I give values for 20 different r values for each prime.
What were the 3 different values of p ( aka q ) along with its starting r ?

If the same input for testing is used by everyone it wil be easier
to verify accuracy and compare the efficiency of improvements.

========================================================
A few results with somewhat random q and r ( each q is prime and r is not).
Are these results from Delphi conversion correct ? (a11 == q a21 == r)

| 134217689 0| a11 a12
| 92342327 1| a21 a22
No need for G-S t
| 92342327 1| v1 v1
| 41875362 -92342328| v2 v2

| 524309 0| a11 a12
| 10003 1| a21 a22
G-S t value = 2
| 1697 105| v1 v1
| 4153 -52| v2 v2

The MASM32 version gives completely different answers.

| 524309 0| a11 a12
| 10003 1| a21 a22
G-S t value = 0
| 759 -262| v1 v1
| 179 629| v2 v2

haven't added the if ((dot_prod + dot_prod) < norm_row1) code
to the MASM32 version yet.   2005-08-06, 20:38   #10
R.D. Silverman

Nov 2003

11101001001002 Posts Quote:
 Originally Posted by xenon The code doesn't work with r=1. Perhaps you mean r in [2, p-1]. Before looking at timings, I want to get the result correct. While modular inverse has one correct answer, lattice reduction can have more than one. You didn't specify exactly what you want. I assume shorter vectors are better. But I found that your code doesn't produce the shortest vectors when r is small. I get shorter vector by computing t = nearest_int(dot_product/norm) in every iteration (this is slow). About verifying the results, any other method besides computing determinant? Computing determinant isn't "strict" enough to catch all errors. i.e. if r changes, the determinant doesn't change.
If G-S is carried forward completely, it will always produce a basis
with the shortest possible vector. But this comes at the expense of
the other vector being quite large.

To determine if you have a correct final answer consider the original
and final row vectors (say) V1, V2 and V1F , V2F

if e * V1 + f * V2 = e1 * V1F + f1 * V2F has a solution (e1, f1) for
all (e,f), then the final basis is valid.   2005-08-07, 00:38   #11
xenon

Jul 2004

5·7 Posts Quote:
 Originally Posted by R.D. Silverman But this comes at the expense of the other vector being quite large.
Code:
Example: q=1299721, r=30

|  30          1   |      |1441     -43276  |
|   1     -43324   |      |  30         1   |

option 1                  option 2
The second one is shorter. But your code produces the first one. Am I correct?   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post jrafanelli Software 2 2018-01-11 15:16 kelzo Programming 3 2016-11-27 05:16 devarajkandadai Miscellaneous Math 1 2013-11-09 23:30 starrynte Programming 1 2008-12-30 22:31 BotXXX Factoring 16 2007-05-27 03:37

All times are UTC. The time now is 09:34.

Thu May 19 09:34:55 UTC 2022 up 35 days, 7:36, 0 users, load averages: 1.62, 1.64, 1.69