20190509, 21:38  #23  
"Robert Gerbicz"
Oct 2005
Hungary
3·5·109 Posts 
Quote:
Code:
printf("Gmp version number=%s\n",gmp_version); Code:
gerbicz@gerbiczMS7972:~/gmp6.1.2$ gcc v Using builtin specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64linuxgnu/8/ltowrapper OFFLOAD_TARGET_NAMES=nvptxnone OFFLOAD_TARGET_DEFAULT=1 Target: x86_64linuxgnu Configured with: ../src/configure v withpkgversion='Ubuntu 8.3.06ubuntu1~18.10' withbugurl=file:///usr/share/doc/gcc8/README.Bugs enablelanguages=c,ada,c++,go,brig,d,fortran,objc,objc++ prefix=/usr withgccmajorversiononly programsuffix=8 programprefix=x86_64linuxgnu enableshared enablelinkerbuildid libexecdir=/usr/lib withoutincludedgettext enablethreads=posix libdir=/usr/lib enablenls enablebootstrap enableclocale=gnu enablelibstdcxxdebug enablelibstdcxxtime=yes withdefaultlibstdcxxabi=new enablegnuuniqueobject disablevtableverify enablelibmpx enableplugin enabledefaultpie withsystemzlib withtargetsystemzlib enableobjcgc=auto enablemultiarch disablewerror witharch32=i686 withabi=m64 withmultiliblist=m32,m64,mx32 enablemultilib withtune=generic enableoffloadtargets=nvptxnone withoutcudadriver enablechecking=release build=x86_64linuxgnu host=x86_64linuxgnu target=x86_64linuxgnu Thread model: posix gcc version 8.3.0 (Ubuntu 8.3.06ubuntu1~18.10) gerbicz@gerbiczMS7972:~/gmp6.1.2$ 

20190516, 02:09  #24 
May 2019
13 Posts 
Re all,
it's super late and I'm very tired after an amazing day at the Stata Center but... I updated the factordb DB with the two factors of LCS35's n. Secret message was simply: "!!! Happy Birthday LCS !!! (seed value b for p = 712238904468723561162000937465778229877598711342253664788091132335)" Here goes the factorization... Code:
n from LCS35: 6314466083072888893799357126131292332363298818330841375588990772701957128924885547308446055753206513618346628848948088663500368480396588171361987660521897267810162280557475393838308261759713218926668611776954526391570120690939973680089721274464666423319187806830552067951253070082020241246233982410737753705127344494169501180975241890667963858754856319805507273709904397119733614666701543905360152543373982524579313575317653646331989064651402133985265800341991903982192844710212464887459388853582070318084289023209710907032396934919962778995323320184064522476463966355937367009369212758092086293198727008292431243681 prime 1: 154562048657421937084682913138460711844778329290535727168950717733424202505601392625877205195325207546299861196433571846578805282932208177729566895723130005216848832881503977502006366130238044889916634521661721389848859842071584845041109033358607085246937424599502955074736164230571310927294787034883784001451 prime 2: 40853923313791905056390006230327028982596606945346318846549337657379367189284051739393623398120773967666443351405399733470781263258210328375697986182263865342952431430451931199880262185216184022204644820522254304557640038220504811993426511340296997627585129527426806995411578324937406976125239089383993333731 Last fiddled with by jasonp on 20190516 at 11:48 Reason: added code tags 
20190521, 00:03  #25 
"Kevin W. Hake"
May 2019
5_{16} Posts 
Hi! Forum newbie here :)
I also took a look at the problem late last year and a test program using MPIR made me realize it was solvable in only a few years with general purpose hardware. I had a guess that an FPGA could do it in months... never got an implementation off the ground tho (my HDL skills are not really existant!). To help answer TacticalCoder's question about why it was solvable so much faster: Some theories:
My own results with MPIR show that one iteration takes ~4,800 cycles. That's an improvement factor over 50x from Rivest's benchmark program in 1999. AVX2 alone doesn't account for all of that... maybe ~4x (I've compiled without AVX optimizations). Better pipelining and register space in the CPU (which was certainly predictable in 1999, but not taken into account in the puzzle) would further reduce the overhead of performing 2048bit calculations on 64bit hardware. Then maybe on top, the speed difference of java vs c, if that's where Rivest got his benchmark. But as we've already established, the puzzle was more doomed than the 5x50x improvement in the ability to do wide operations on 64bit hardware we've gotten in the past 20 years. Why not just make a 2048bit squaring circuit and 4096to2048 modulo and forget about the crazy algorithms we're using to make do with limited width hardware? Put another way: Solving the puzzle by squaring requires 79685186856218 iterations. If we want to solve it in one week, how fast does our machine need to crunch? 1 week = 604800 seconds. 79685186856218/604800 = 131754608 iterations/sec. Now imagine you could build a machine that only did this calculation, one iteration per cycle. That's 132 MHz. Doesn't sound so impossible now, does it? You can buy FPGAs today that clock to 500MHz. In 1999, 60 MHz FPGAs were available. I'm not sure what kind of area they had, but I suspect if someone at Altera or Xilinx had given the puzzle a serious look in 1999 it would've been solved quickly then, maybe only a year or two later! About the performance questions for the GMP solver: Since the solver is singlethreaded, multicore processors don't help at all. In fact, each added core only increases the power consumption and heat dissipated... besides that, each core needs to synchronize with some cache, ram, and other resources... all these factors mean more cores make it a little harder to operate the CPU at very high clock speeds. In other words, an i7 isn't necessary, you can go the budget route. For some years now, extra performance has become more about cores than clocks (the opposite optimization that we want for LCS35). My MPIR program seemed to run ~20% faster per clock on Intel cpus  likely because AMD's AVX2 instruction hardware isn't as optimized (that may change with Zen 2). From a bit of research, the fastest clocked AVX2 Intel cpu was made years ago (I wasn't sure if AVX512 was supported in GMP yet, and they are lowerclocked expensive parts anyhow), and should be able to finish the puzzle in 2 years and change. So either nobody was working on the puzzle, or it would be solved any day (which indeed happened! TacticalCoder was busy crunching the whole time!!). I was optimistic and bought a cheap i3 7350k anyway, which happily clocked to 4.85GHz stably at 60C, and with pretty low power consumption... it only has 2 cores. I never did try it in Linux with GMP (I was busy), but MPIR gave me a bit above 1 million iterations per second. From other comments sounds like GMP would've given me a pretty decent boost as well. Also would've saved me a bit of time arguing with the MPIR devs :) RAM speed should have no measurable impact, as your whole main loop should be small enough to reside completely in CPU cache. In my case, even CPU cache speed had no measurable impact, which surprised me. This was testable because I can independently over/underclock the CPU and the cache. The calculation speed was directly related to CPU clock, but seemed unchanged by cache clock speed. Which means the cache was barely being used; the CPU likely has enough register space to contain all of the data for every iteration (CPUs can contain hidden unnamed registers to speed things up). TacticalCoder: The automatic Turboboost limits itself based on the temperature of the CPU. If it gets too hot, it throttles down for a bit. If your machine is dustier, has a smaller heatsink, or the ambient air temperature is higher, your machine will spend more time cooling off at a lower clock speed between boosts. Also, you could have other processes stealing cpu time, or at least heating up the CPU and keeping your clocks lower. As far as protecting from errors... I'll be honest, Shamir's suggestion sounded very intuitive in the paper but when I sat down to implement it, I couldn't make heads or tails of it. But in practice you don't really need it; Initial decoding must be single threaded, but verification is parallelizable. No need to complicate the main decoder thread with any extra operations. Obviously your machine won't run forever... there are power outages, surges, etc... My decoder wrote progress to disk every ~5minutes or so. Those 5 minute blocks can be recalculated in another thread at any time, and in parallel, with the same code as the main thread. If you were really worried about verification you could rent 100 threads on Amazon for a couple hours a week. In any case, on my thrown together overclocked machine (fails Prime95 instantly by the way, tho presumably only floating point) I crunched something like 15% of LCS35 and never ran into errors (the progress was verified). So the odds of having an error are pretty darn low with modern hardware. 
20190521, 00:19  #26  
Sep 2002
Database er0rr
2·5·467 Posts 
Quote:


20190521, 00:49  #27  
"Kevin W. Hake"
May 2019
5 Posts 
Quote:
Somewhat related, I found in MPIR that performing a square operation followed by modulo was actually faster than the powmod function... but perhaps that's only because MPIR's implementation isn't good. I'm curious to try GMP's. I also didn't quite grasp Montgomery multiplication, and maybe I'm missing something that would've made it crunch much faster. 

20190521, 01:07  #28  
Sep 2002
Database er0rr
2·5·467 Posts 
Quote:
You might also find that aligning n (over which you are taking the mod) with a shift to left to the nearest 64 bits. So you are taking mod n*2^i in the loop and then doing a final mod n after the loop. If n is already aligned then this point is moot. This would make a difference of 2% if not aligned. HTH Last fiddled with by paulunderwood on 20190521 at 01:09 

20190521, 02:30  #29  
"Kevin W. Hake"
May 2019
5 Posts 
Quote:


20190522, 13:34  #30  
May 2019
2 Posts 
Quote:
Let me preface all this by saying I'm not a math person but I love puzzles and the solutions to puzzles. I first started looking at LCS 35 after I read that someone had solved it (I had never heard of it before then). Writing a pure java implementation of the solver was simple (even for me) but the portion about protecting for errors threw me. The thing I loved most about the solver is you could have it chugging away (I would save the progression every 100 million t's) while working on improving it's speed. Even though I was stuck on the error checking, I managed to enable GMP in my java implementation and it takes my 2 year old machine ~18.5 minutes to generate 100 million t's using GMP (I think that works out to almost 29 years to get a solution). Finally though, after about 3 weeks of searching and reading, I managed to get a working error check. Instead of 2^2^t mod n, you calculate 2^2^t mod cn where c is a 50bit prime. If you take your result mod n, you get the same answer as if you had been doing 2^2^t mod n all along. If you take your result mod c, you get one side if your verification number. To get the other side you need to perform the following (where k is the current value of t); 2^(2^k) mod c k' = 2^k mod c1 2^k' mod c Fermat's little theorem to the rescue. 

20190523, 06:31  #31  
May 2019
157_{8} Posts 
Quote:
The error check you proposed has a 25% overhead (my estimate) and has a 1/c chance of leaving an error undetected. However, there is an even stronger error check, called the Gerbicz error check. It has negligible overhead and has a 2/n probability or less of leaving an error undetected, much smaller than 1/c. (Is that right?) Link to the Gerbicz error check: https://www.mersenneforum.org/showpo...36&postcount=1. It works for any N since only squaring is involved. Correct me if I am wrong. 

20190523, 11:22  #32 
May 2019
2 Posts 
I can try to calculate the overhead later. Shouldn't be too difficult.
I had a quick look at your link and I'm not sure if I would be able to implement this check. Thanks, I will keep reading up on it though. 
20190524, 18:43  #33 
"Robert Gerbicz"
Oct 2005
Hungary
3·5·109 Posts 
Wikipedia:
Code:
The idea behind the challenge is that the only known way to find the value of w without knowing the factorisation of n is by t successive squarings. Skipping more iterations is the same, showing only the (r^4) mod m case. You can compute r^4 mod m without computing r^2 mod m (and then r^4 with one more squaring): compute (r^4) mod p for many primes, and with CRT reconstruct r^4: val=sum(c[i]*M/P[i]) for M=vecprod(P), and here comes the trick, we can skip the computation of val, and directly compute the mod m value: but the problem with this is that: Code:
r^4 mod m=(sum(c[i]*M/P[i]) mod M) mod m Code:
d*M<=val<(d+1)*M then r^4 mod m=v mod m, where v=sum(c[i]*M/P[i])d*M v==sum(c[i]*H[i])d*G mod m, where H[i]=(M/P[i])%m and G=M%m standard ideas: if you can compute c*H with 256 threads in T time, then the computation of sum(c[i]*H[i]) takes only log2(256)*T=8*T time. (the same applies to eff). a working example in PariGp: Code:
f(r,m)={P=[];pr=2^32;for(i=1,256,pr=precprime(pr1);P=concat(P,pr)); c=vector(256); M=vecprod(P); G=M%m; H=vector(256,i,(M/P[i])%m); inv=vector(256,i,lift(Mod(M/P[i],P[i])^(1))); eff=vector(256,i,1.0/P[i]); for(i=1,256,p=P[i]; res=lift(Mod(r,p)^4); c[i]=(res*inv[i])%p); d=floor(sum(i=1,256,c[i]*eff[i])); return((sum(i=1,256,c[i]*H[i])G*d)%m)} m=random(2^2047); r=random(m); f(r,m)==(r^4)%m r=random(10^20); f(r,m)==(r^4)%m %16 = 1 %18 = 0 Ofcourse you can precompute P,M,G,H,inv,eff; you don't need to store all of these in all threads. Last fiddled with by R. Gerbicz on 20190524 at 18:44 Reason: grammar 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Crypto News  Nick  Tales From the Crypt(o)  52  20201217 21:16 
I hate this time of year  davieddy  Lounge  4  20091018 04:39 
Crypto 2007  R.D. Silverman  Lounge  2  20070808 20:24 
crypto game  MrHappy  Lounge  0  20050119 16:27 
Is it time to change the CPU year measurement?  E_tron  Lounge  7  20040629 10:17 