20050805, 16:50  #1 
Nov 2003
2^{6}×113 Posts 
Coding Challenges
Hello everybody,
I want to extend my thanks to everyone who participated in my little "challenge". Some nice speedups were obtained. Kudos to those who helped . It is unfortunate that the fastest code produced had to come from going to assembler; this clearly makes it less portable. (but thanks to xenon for nice work and thanks to Alex for improvements to the C code ) [Yes, we are all aware that going completely to assembler would yield faster code] I am a bit surprised that noone responded to the challenge to improve the lattice reduction code. It is not by coincidence that I presented both routines because the algorithms are * almost* identical. I reworked Arjen's conditional shift & subtract code for the lattice reduction into something closer to the modinverse code. I give that code here: /************************************************************************/ /* */ /* New routine to do lattice basis reduction via Euclidean algorithm */ /* */ /************************************************************************/ new_latred(q,r, v1, v2) int q,r, v1[], v2[]; { /* start of new_latred */ int a11, a12, a21, a22, t, rem; double norm_row1, tmp_norm, dot_prod; int accum1, accum2, t1,t2; int tmp_a11, tmp_a12; a11 = q; a21 = r; a12 = 0; a22 = 1; //printf(" %d %d \n", a11, a12); //printf(" %d %d \n\n", a21, a22); /* we take advantage of the fact that most partial quotients are small */ /* by doing the division by successive conditional subtraction. */ norm_row1 = (double)a11 * (double)a11 + (double)a12 * (double)a12; while(1) { accum1 = a21; accum2 = a22; rem = a11  a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem = a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem = a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem = a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem = a21; if (rem >= a21) { accum1 += a21; accum2 += a22; rem = a21; if (rem >= a21) { t = a11/a21; accum1 = t * a21; accum2 = t * a22; } } } } } } tmp_a11 = a11  accum1; // update row 1 tmp_a12 = a12  accum2; tmp_norm = (double)tmp_a11 * (double)tmp_a11 + (double)tmp_a12 * (double)tmp_a12; if (tmp_norm > norm_row1) break; // we are done t1 = tmp_a11; // swap rows t2 = tmp_a12; a11 = a21; a12 = a22; a21 = t1; a22 = t2; norm_row1 = tmp_norm; // printf(" %d %d \n", a11, a12); // printf(" %d %d \n\n", a21, a22); } /* Need to do a single GramSchmidt row reduction if possible */ dot_prod = (double)a11 * a21 + (double)a12 * a22; t = round_double(dot_prod/norm_row1); //printf("GS t value = %d\n",t); v1[0] = a11  t * a21;; v1[1] = a12  t * a22; v2[0] = a21; v2[1] = a22; //printf("final result:\n"); //printf(" %d %d \n", v1[0], v1[1]); //printf(" %d %d \n\n", v2[0], v2[1]); return(0); } /* end of new_latred */  I then compared the two routines for 3 different values of p as r varied over 10000 different, consecutive values. Here are the time RATIOS for the two routines (old routine/new routine) . I give values for 20 different r values for each prime. r: 2.153262 r: 2.536551 r: 2.333333 r: 2.467522 r: 2.265208 r: 2.054390 r: 2.488469 r: 2.240045 r: 2.071930 r: 2.285150 r: 2.020040 r: 2.409091 r: 2.093275 r: 2.271066 r: 2.068036 r: 2.264728 r: 2.444172 r: 2.147155 r: 2.220281  r: 1.811325 r: 2.209565 r: 1.966776 r: 2.392782 r: 1.994787 r: 1.534354 r: 2.500306 r: 2.003802 r: 2.159395 r: 1.934477 r: 2.259038 r: 1.589700 r: 2.106504 r: 1.920269 r: 1.893444 r: 2.049665 r: 1.839205 r: 1.827455 r: 1.873463  r: 2.010581 r: 2.088218 r: 2.262241 r: 2.306903 r: 2.110598 r: 1.971844 r: 2.464969 r: 1.773949 r: 2.348313 r: 2.564947 r: 2.322762 r: 2.154452 r: 2.184591 r: 2.193857 r: 2.186851 r: 1.842859 r: 2.255534 r: 2.199179 r: 3.877670 =============================================== The new routine is more than twice as fast as the old one!!!!! I see some additional improvements that might be made. Rather than just proceeding with the final GramSchmidt reduction, we can (perhaps?) avoid the final floating point division by checking that t > 1/2 before the division simply by comparing the two norms. i.e. 2*dot_prod > norm_row1. Whether this check gains anything depends on how often t is nonzero. We can also do some more subtracts in the inner loop before succumbing to doing the division etc. Would anyone like to take a crack at the improved version? 
20050805, 18:42  #2 
Sep 2002
2×331 Posts 
For the lattice basis reduction:
What values of p and range of r ? How can the results be verified ? My reasoning behind modinv versus latred: The mathematics behind modinv routine was very understandable, familar, easily verifiable. The code was understandable and short. The code for lattice basis reduction was harder to follow, and contained conditional compiler code and was twice as long as the modinv. You provided very helpful, detailed information on the range of the parameters for modinv. (I did redo the modiv routine in Delphi and MASM32, but your using C/C++ and xenon handled the x86 Assembly.) Is there any unoptimized code to look at ? The 2X+ speed increase is already quite good. Code in x86 assembly or C or both ? 
20050805, 19:14  #3  
Nov 2003
2^{6}×113 Posts 
Quote:
For the lattice basis reduction: What values of p and range of r ? Single precision. Same range of p and r as in modinv. How can the results be verified ? The matrix determinant is p. It remains invariant. Each reduction step is equivalent to multiplying by a unimodular marix, and hence the determinant does not change. The coefficients in the matrix get smaller, until the min is reached, then further "reductions" result in one row becoming very small while the other grows big again. The condition on the L2 norm in the code says when to stop. My reasoning behind modinv versus latred: The mathematics behind modinv routine was very understandable, familar, easily verifiable. But the math for the lattice reduction is the same: Euclidean Algorithm [reduction follws with a GramSchmidt orthogonalization at the end if it is useful] The code was understandable and short. Thank you for the compliment. Code in x86 assembly or C or both ? C is strongly preferred, but of course both are acceptable. Note: I already tried the test I suggested that avoids the final floating div at the expense of an extra branch. It does give a slight speedup. 

20050805, 20:43  #4 
Sep 2002
296_{16} Posts 
The newer code is easier to follow.
I'll try in assembly, have some implementation ideas but no algorithmic ones. modinv used a, modulus with modulus the prime new_latred has q , r so is r for a and q for modulus ? 
20050805, 20:52  #5  
Nov 2003
16100_{8} Posts 
Quote:
algorithmically, but their meaning is different. r stands for 'root' It is the root of a polynomial (which itself is irrelevant to the algorithm) taken mod p. Arjen's old code basically tried to replace the division operation that computes t at each iteration with a series of trial shifts and subtracts... Better variable names would have aided the clarity...... 

20050806, 05:04  #6  
Jul 2004
100011_{2} Posts 
Quote:


20050806, 12:52  #7  
Nov 2003
2^{6}×113 Posts 
Quote:
r is indeed in [1,p1]. q is typically 15 to 27 bits; choose some at random. BTW, I have already improved the code with: dot_prod = (double)a11 * (double)a21 + (double)a12 * (double)a22; if ((dot_prod + dot_prod) < norm_row1) { v1[0] = a11; v1[1] = a12; v2[0] = a21; v2[1] = a22; return(0); } t = round_double(dot_prod/norm_row1); etc. The if test avoids quite a few divisions at low cost. I tried adding a couple of more conditional subtracts in the main loop, but saw little difference. Thanks again for the help!!!! 

20050806, 13:34  #8 
Jul 2004
23_{16} Posts 
The code doesn't work with r=1. Perhaps you mean r in [2, p1].
Before looking at timings, I want to get the result correct. While modular inverse has one correct answer, lattice reduction can have more than one. You didn't specify exactly what you want. I assume shorter vectors are better. But I found that your code doesn't produce the shortest vectors when r is small. I get shorter vector by computing t = nearest_int(dot_product/norm) in every iteration (this is slow). About verifying the results, any other method besides computing determinant? Computing determinant isn't "strict" enough to catch all errors. i.e. if r changes, the determinant doesn't change. 
20050806, 18:05  #9  
Sep 2002
1226_{8} Posts 
Quote:
If the same input for testing is used by everyone it wil be easier to verify accuracy and compare the efficiency of improvements. ======================================================== A few results with somewhat random q and r ( each q is prime and r is not). Are these results from Delphi conversion correct ? (a11 == q a21 == r)  134217689 0 a11 a12  92342327 1 a21 a22 No need for GS t  92342327 1 v1[0] v1[1]  41875362 92342328 v2[0] v2[1]  524309 0 a11 a12  10003 1 a21 a22 GS t value = 2  1697 105 v1[0] v1[1]  4153 52 v2[0] v2[1] The MASM32 version gives completely different answers.  524309 0 a11 a12  10003 1 a21 a22 GS t value = 0  759 262 v1[0] v1[1]  179 629 v2[0] v2[1] haven't added the if ((dot_prod + dot_prod) < norm_row1) code to the MASM32 version yet. 

20050806, 20:38  #10  
Nov 2003
2^{6}×113 Posts 
Quote:
with the shortest possible vector. But this comes at the expense of the other vector being quite large. To determine if you have a correct final answer consider the original and final row vectors (say) V1, V2 and V1F , V2F if e * V1 + f * V2 = e1 * V1F + f1 * V2F has a solution (e1, f1) for all (e,f), then the final basis is valid. 

20050807, 00:38  #11  
Jul 2004
5·7 Posts 
Quote:
Code:
Example: q=1299721, r=30  30 1  1441 43276   1 43324   30 1  option 1 option 2 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Anyone know enough about coding to do this?  jrafanelli  Software  2  20180111 15:16 
Python Coding Help?  kelzo  Programming  3  20161127 05:16 
Two possible challenges  devarajkandadai  Miscellaneous Math  1  20131109 23:30 
coding midlet for TF  starrynte  Programming  1  20081230 22:31 
The RSA Challenges  RIP  BotXXX  Factoring  16  20070527 03:37 