mersenneforum.org  

Go Back   mersenneforum.org > Math Stuff > Tales From the Crypt(o)

Reply
 
Thread Tools
Old 2019-05-09, 21:38   #23
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

2×32×83 Posts
Default

Quote:
Originally Posted by TacticalCoder View Post
I'm also doing my own "revue de presse" (searching for the articles talking about that problem) and I stumbled upon the message saying that on the same hardware as I used it's actually doable in 2.16 years (I could speedup but couldn't reproduce the same speedup: maybe because of my RAM which is slower but that looks weird ?) so I created an account to ask questions about that :)
Sometimes there is a pre-installed gmp on the system, and it could mess up when you install more gmp version(s) which is the actually used gmp version, so insert this in the code to see it:
Code:
printf("Gmp version number=%s\n",gmp_version);
Also, update gcc to at least 8.3.0:
Code:
gerbicz@gerbicz-MS-7972:~/gmp-6.1.2$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/8/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 8.3.0-6ubuntu1~18.10' --with-bugurl=file:///usr/share/doc/gcc-8/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-8 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1~18.10) 
gerbicz@gerbicz-MS-7972:~/gmp-6.1.2$
Today tried mpir (http://mpir.org/index.html), but it was slower than gmp by roughly 15-20%, for this problem.
R. Gerbicz is offline   Reply With Quote
Old 2019-05-16, 02:09   #24
TacticalCoder
 
May 2019

13 Posts
Default

Re all,


it's super late and I'm very tired after an amazing day at the Stata Center but... I updated the factordb DB with the two factors of LCS35's n.


Secret message was simply: "!!! Happy Birthday LCS !!! (seed value b for p = 712238904468723561162000937465778229877598711342253664788091132335)"



Here goes the factorization...


Code:
n from LCS35:

6314466083072888893799357126131292332363298818330841375588990772701957128924885547308446055753206513618346628848948088663500368480396588171361987660521897267810162280557475393838308261759713218926668611776954526391570120690939973680089721274464666423319187806830552067951253070082020241246233982410737753705127344494169501180975241890667963858754856319805507273709904397119733614666701543905360152543373982524579313575317653646331989064651402133985265800341991903982192844710212464887459388853582070318084289023209710907032396934919962778995323320184064522476463966355937367009369212758092086293198727008292431243681

prime 1:

154562048657421937084682913138460711844778329290535727168950717733424202505601392625877205195325207546299861196433571846578805282932208177729566895723130005216848832881503977502006366130238044889916634521661721389848859842071584845041109033358607085246937424599502955074736164230571310927294787034883784001451

prime 2:

40853923313791905056390006230327028982596606945346318846549337657379367189284051739393623398120773967666443351405399733470781263258210328375697986182263865342952431430451931199880262185216184022204644820522254304557640038220504811993426511340296997627585129527426806995411578324937406976125239089383993333731

Last fiddled with by jasonp on 2019-05-16 at 11:48 Reason: added code tags
TacticalCoder is offline   Reply With Quote
Old 2019-05-21, 00:03   #25
kevinhake
 
"Kevin W. Hake"
May 2019

5 Posts
Default

Hi! Forum newbie here :)

I also took a look at the problem late last year and a test program using MPIR made me realize it was solvable in only a few years with general purpose hardware. I had a guess that an FPGA could do it in months... never got an implementation off the ground tho (my HDL skills are not really existant!).

To help answer TacticalCoder's question about why it was solvable so much faster:
Some theories:
  1. Rivest made a typo for his initial value. Perhaps squaringsPerSecond should have been 30000 and not 3000. It probably wasn't a typo tho.
  2. Rivest's initial value was generated with a slow program. It seems likely his 3000 iterations/s was calculated by a 1999 java program similar to the one he used to generate the puzzle. I haven't tested this, but I wouldn't be surprised if java was 10x slower than c in 1999.
  3. (The real reason) In 1999, the latest consumer hardware was running at 733MHz. The puzzle says this machine performs iterations 3000 times per second. That means they were expecting a 2048-bit multiply and modulo to take a whopping 244,333 cycles! This is the puzzle's fatal flaw. Even though the math is fairly sound that iterations must be done sequentially, there is no such limit preventing a machine from only requiring 1 (or a handful) of cycles for a simple multiply and divide. That number, 244333, is so large I doubt the writers of the puzzle even looked at it... I don't know the number of theoretically required 64-bit operations to do the 2048-bit ones but 244k just sounds absurd to me. 119 operations per bit??
For some reason in 1999 Rivest seemed very confident that the number of required clocks per iteration of the algorithm would remain constant for 35 years. Most computers only have hardware for 64-bit integer operations at most, but that's not because it's hard to make them wider, it's because almost nobody needs bigger numbers (or if so, only a tiny fraction of their computing time). General purpose cpus are optimized to give you the biggest bang for the buck for the average user. Bignum libraries make do with the narrow-width hardware by breaking apart the operations into pieces and then gluing them all back together. Vector instructions (like AVX, AVX-2, AVX-512) were designed for computing many calculations in parallel, they're not even optimized for wide-integer operations. In fact, even the new AVX-512 still only supports 64-bit multiplications with carry (in batches of 8). Though the integer operations are still limited to little 64 bit chunks, GMP can parallelize those chunks with AVX, reducing that number of clocks required per iteration.

My own results with MPIR show that one iteration takes ~4,800 cycles. That's an improvement factor over 50x from Rivest's benchmark program in 1999. AVX-2 alone doesn't account for all of that... maybe ~4x (I've compiled without AVX optimizations). Better pipelining and register space in the CPU (which was certainly predictable in 1999, but not taken into account in the puzzle) would further reduce the overhead of performing 2048-bit calculations on 64-bit hardware. Then maybe on top, the speed difference of java vs c, if that's where Rivest got his benchmark.
But as we've already established, the puzzle was more doomed than the 5x-50x improvement in the ability to do wide operations on 64-bit hardware we've gotten in the past 20 years. Why not just make a 2048-bit squaring circuit and 4096-to-2048 modulo and forget about the crazy algorithms we're using to make do with limited width hardware?

Put another way: Solving the puzzle by squaring requires 79685186856218 iterations. If we want to solve it in one week, how fast does our machine need to crunch? 1 week = 604800‬ seconds. 79685186856218/604800‬ = 131754608 iterations/sec. Now imagine you could build a machine that only did this calculation, one iteration per cycle. That's 132 MHz. Doesn't sound so impossible now, does it? You can buy FPGAs today that clock to 500MHz. In 1999, 60 MHz FPGAs were available. I'm not sure what kind of area they had, but I suspect if someone at Altera or Xilinx had given the puzzle a serious look in 1999 it would've been solved quickly then, maybe only a year or two later!


About the performance questions for the GMP solver:

Since the solver is single-threaded, multi-core processors don't help at all. In fact, each added core only increases the power consumption and heat dissipated... besides that, each core needs to synchronize with some cache, ram, and other resources... all these factors mean more cores make it a little harder to operate the CPU at very high clock speeds. In other words, an i7 isn't necessary, you can go the budget route. For some years now, extra performance has become more about cores than clocks (the opposite optimization that we want for LCS35).
My MPIR program seemed to run ~20% faster per clock on Intel cpus - likely because AMD's AVX2 instruction hardware isn't as optimized (that may change with Zen 2). From a bit of research, the fastest clocked AVX-2 Intel cpu was made years ago (I wasn't sure if AVX-512 was supported in GMP yet, and they are lower-clocked expensive parts anyhow), and should be able to finish the puzzle in 2 years and change. So either nobody was working on the puzzle, or it would be solved any day (which indeed happened! TacticalCoder was busy crunching the whole time!!). I was optimistic and bought a cheap i3 7350k anyway, which happily clocked to 4.85GHz stably at 60C, and with pretty low power consumption... it only has 2 cores. I never did try it in Linux with GMP (I was busy), but MPIR gave me a bit above 1 million iterations per second. From other comments sounds like GMP would've given me a pretty decent boost as well. Also would've saved me a bit of time arguing with the MPIR devs :)

RAM speed should have no measurable impact, as your whole main loop should be small enough to reside completely in CPU cache. In my case, even CPU cache speed had no measurable impact, which surprised me. This was testable because I can independently over/underclock the CPU and the cache. The calculation speed was directly related to CPU clock, but seemed unchanged by cache clock speed. Which means the cache was barely being used; the CPU likely has enough register space to contain all of the data for every iteration (CPUs can contain hidden unnamed registers to speed things up).

TacticalCoder: The automatic Turbo-boost limits itself based on the temperature of the CPU. If it gets too hot, it throttles down for a bit. If your machine is dustier, has a smaller heatsink, or the ambient air temperature is higher, your machine will spend more time cooling off at a lower clock speed between boosts. Also, you could have other processes stealing cpu time, or at least heating up the CPU and keeping your clocks lower.

As far as protecting from errors... I'll be honest, Shamir's suggestion sounded very intuitive in the paper but when I sat down to implement it, I couldn't make heads or tails of it.
But in practice you don't really need it; Initial decoding must be single threaded, but verification is parallelizable. No need to complicate the main decoder thread with any extra operations. Obviously your machine won't run forever... there are power outages, surges, etc... My decoder wrote progress to disk every ~5minutes or so. Those 5 minute blocks can be re-calculated in another thread at any time, and in parallel, with the same code as the main thread. If you were really worried about verification you could rent 100 threads on Amazon for a couple hours a week.
In any case, on my thrown together overclocked machine (fails Prime95 instantly by the way, tho presumably only floating point) I crunched something like 15% of LCS35 and never ran into errors (the progress was verified). So the odds of having an error are pretty darn low with modern hardware.
kevinhake is offline   Reply With Quote
Old 2019-05-21, 00:19   #26
paulunderwood
 
paulunderwood's Avatar
 
Sep 2002
Database er0rr

2·1,933 Posts
Default

Quote:
Originally Posted by kevinhake View Post

Put another way: Solving the puzzle by squaring requires 79685186856218 iterations. If we want to solve it in one week, how fast does our machine need to crunch? 1 week = 604800‬ seconds. 79685186856218/604800‬ = 131754608 iterations/sec. Now imagine you could build a machine that only did this calculation, one iteration per cycle. That's 132 MHz. Doesn't sound so impossible now, does it? You can buy FPGAs today that clock to 500MHz. In 1999, 60 MHz FPGAs were available. I'm not sure what kind of area they had, but I suspect if someone at Altera or Xilinx had given the puzzle a serious look in 1999 it would've been solved quickly then, maybe only a year or two later!
I have a very small basic Xilinx FPGA -- so I am under-qualified to say much about 2048x2048-->4096-->2048. One would need a hell of lot of mult units on a FPGA board to achieve it. I really don't know what is possible these days, nor the cost of such boards.
paulunderwood is offline   Reply With Quote
Old 2019-05-21, 00:49   #27
kevinhake
 
"Kevin W. Hake"
May 2019

1012 Posts
Default

Quote:
Originally Posted by paulunderwood View Post
I have a very small basic Xilinx FPGA -- so I am under-qualified to say much about 2048x2048-->4096-->2048. One would need a hell of lot of mult units on a FPGA board to achieve it. I really don't know what is possible these days, nor the cost of such boards.
I was going to try it on an Arty A7-35T, it cost less than $100 usd. I wasn't even going to use the built-in multiply units, as each one is not very wide... I think then you might end up with a hierarchy with longer delays, but I'm not sure. I was going to use only look up tables, using the squaring circuit from a paper I found, "An Improved Squaring Circuit for Binary Numbers" by Kabiraj Sethi and Rutuparna Panda (https://pdfs.semanticscholar.org/29f...e93286160d.pdf)

Somewhat related, I found in MPIR that performing a square operation followed by modulo was actually faster than the powmod function... but perhaps that's only because MPIR's implementation isn't good. I'm curious to try GMP's. I also didn't quite grasp Montgomery multiplication, and maybe I'm missing something that would've made it crunch much faster.
kevinhake is offline   Reply With Quote
Old 2019-05-21, 01:07   #28
paulunderwood
 
paulunderwood's Avatar
 
Sep 2002
Database er0rr

386610 Posts
Default

Quote:
Originally Posted by kevinhake View Post
I was going to try it on an Arty A7-35T, it cost less than $100 usd. I wasn't even going to use the built-in multiply units, as each one is not very wide... I think then you might end up with a hierarchy with longer delays, but I'm not sure. I was going to use only look up tables, using the squaring circuit from a paper I found, "An Improved Squaring Circuit for Binary Numbers" by Kabiraj Sethi and Rutuparna Panda (https://pdfs.semanticscholar.org/29f...e93286160d.pdf)

Somewhat related, I found in MPIR that performing a square operation followed by modulo was actually faster than the powmod function... but perhaps that's only because MPIR's implementation isn't good. I'm curious to try GMP's. I also didn't quite grasp Montgomery multiplication, and maybe I'm missing something that would've made it crunch much faster.
I reinvented the wheel by devising a mod reduction that had a the same run time as a multiplication in my own muti-precision JavaScript code. Apparently, the technique can be found in "Modern Computer Arithmetic" by Joachim von zur Gathen and Jürgen Gerhard, Cambridge University Press, 2nd edition, 2003. [118]. I have no idea whether this particular technique is applicable to FPGA.

You might also find that aligning n (over which you are taking the mod) with a shift to left to the nearest 64 bits. So you are taking mod n*2^i in the loop and then doing a final mod n after the loop. If n is already aligned then this point is moot. This would make a difference of 2% if not aligned.

HTH

Last fiddled with by paulunderwood on 2019-05-21 at 01:09
paulunderwood is offline   Reply With Quote
Old 2019-05-21, 02:30   #29
kevinhake
 
"Kevin W. Hake"
May 2019

5 Posts
Default

Quote:
Originally Posted by paulunderwood View Post
I reinvented the wheel by devising a mod reduction that had a the same run time as a multiplication in my own muti-precision JavaScript code. Apparently, the technique can be found in "Modern Computer Arithmetic" by Joachim von zur Gathen and Jürgen Gerhard, Cambridge University Press, 2nd edition, 2003. [118]. I have no idea whether this particular technique is applicable to FPGA.
HTH
Oh cool! I was thinking because we are dividing by a constant, there must be some shortcut for the division, especially in hardware. I barely got started figuring that out tho.
kevinhake is offline   Reply With Quote
Old 2019-05-22, 13:34   #30
theyowguy
 
May 2019

216 Posts
Default

Quote:
As far as protecting from errors... I'll be honest, Shamir's suggestion sounded very intuitive in the paper but when I sat down to implement it, I couldn't make heads or tails of it.

Let me preface all this by saying I'm not a math person but I love puzzles and the solutions to puzzles.


I first started looking at LCS 35 after I read that someone had solved it (I had never heard of it before then). Writing a pure java implementation of the solver was simple (even for me) but the portion about protecting for errors threw me. The thing I loved most about the solver is you could have it chugging away (I would save the progression every 100 million t's) while working on improving it's speed. Even though I was stuck on the error checking, I managed to enable GMP in my java implementation and it takes my 2 year old machine ~18.5 minutes to generate 100 million t's using GMP (I think that works out to almost 29 years to get a solution).


Finally though, after about 3 weeks of searching and reading, I managed to get a working error check.


Instead of 2^2^t mod n, you calculate 2^2^t mod cn where c is a 50-bit prime.


If you take your result mod n, you get the same answer as if you had been doing 2^2^t mod n all along.


If you take your result mod c, you get one side if your verification number. To get the other side you need to perform the following (where k is the current value of t);


2^(2^k) mod c
k' = 2^k mod c-1
2^k' mod c


Fermat's little theorem to the rescue.
theyowguy is offline   Reply With Quote
Old 2019-05-23, 06:31   #31
2M215856352p1
 
May 2019

3·37 Posts
Default

Quote:
Originally Posted by theyowguy View Post
Let me preface all this by saying I'm not a math person but I love puzzles and the solutions to puzzles.


I first started looking at LCS 35 after I read that someone had solved it (I had never heard of it before then). Writing a pure java implementation of the solver was simple (even for me) but the portion about protecting for errors threw me. The thing I loved most about the solver is you could have it chugging away (I would save the progression every 100 million t's) while working on improving it's speed. Even though I was stuck on the error checking, I managed to enable GMP in my java implementation and it takes my 2 year old machine ~18.5 minutes to generate 100 million t's using GMP (I think that works out to almost 29 years to get a solution).


Finally though, after about 3 weeks of searching and reading, I managed to get a working error check.


Instead of 2^2^t mod n, you calculate 2^2^t mod cn where c is a 50-bit prime.


If you take your result mod n, you get the same answer as if you had been doing 2^2^t mod n all along.


If you take your result mod c, you get one side if your verification number. To get the other side you need to perform the following (where k is the current value of t);


2^(2^k) mod c
k' = 2^k mod c-1
2^k' mod c


Fermat's little theorem to the rescue.
Hi. I am also new to the forum. I believe that you have gotten the theory correct.

The error check you proposed has a 2-5% overhead (my estimate) and has a 1/c chance of leaving an error undetected.

However, there is an even stronger error check, called the Gerbicz error check. It has negligible overhead and has a 2/n probability or less of leaving an error undetected, much smaller than 1/c. (Is that right?)

Link to the Gerbicz error check: https://www.mersenneforum.org/showpo...36&postcount=1. It works for any N since only squaring is involved.

Correct me if I am wrong.
2M215856352p1 is offline   Reply With Quote
Old 2019-05-23, 11:22   #32
theyowguy
 
May 2019

2 Posts
Default

I can try to calculate the overhead later. Shouldn't be too difficult.


I had a quick look at your link and I'm not sure if I would be able to implement this check.


Thanks, I will keep reading up on it though.
theyowguy is offline   Reply With Quote
Old 2019-05-24, 18:43   #33
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

2×32×83 Posts
Default

Wikipedia:
Code:
The idea behind the challenge is that the only known way to find the value
 of w without knowing the factorisation of n is by t successive squarings.
This is just wrong, a final step is tricky, but close to standard ideas.

Skipping more iterations is the same, showing only the (r^4) mod m case.
You can compute r^4 mod m without computing r^2 mod m (and then r^4 with one more squaring):

compute (r^4) mod p for many primes, and with CRT reconstruct r^4:
val=sum(c[i]*M/P[i]) for M=vecprod(P),
and here comes the trick, we can skip the computation of val, and directly compute the mod m value:
but the problem with this is that:
Code:
r^4 mod m=(sum(c[i]*M/P[i]) mod M) mod m
and here you just can't swap mod M and mod m (!), but if you know the range of val:

Code:
d*M<=val<(d+1)*M
then r^4 mod m=v mod m, where
v=sum(c[i]*M/P[i])-d*M
v==sum(c[i]*H[i])-d*G mod m, where
H[i]=(M/P[i])%m and G=M%m
and final observation: here abs(sum(c[i]*H[i]-d*G))<256*m, so mod m reduction is easy.
standard ideas: if you can compute c*H with 256 threads in T time, then the
computation of sum(c[i]*H[i]) takes only log2(256)*T=8*T time.
(the same applies to eff).

a working example in Pari-Gp:

Code:
f(r,m)={P=[];pr=2^32;for(i=1,256,pr=precprime(pr-1);P=concat(P,pr));
c=vector(256);
M=vecprod(P);
G=M%m;
H=vector(256,i,(M/P[i])%m);
inv=vector(256,i,lift(Mod(M/P[i],P[i])^(-1)));
eff=vector(256,i,1.0/P[i]);
for(i=1,256,p=P[i];
res=lift(Mod(r,p)^4);
c[i]=(res*inv[i])%p);
d=floor(sum(i=1,256,c[i]*eff[i]));
return((sum(i=1,256,c[i]*H[i])-G*d)%m)}

m=random(2^2047);
r=random(m);
f(r,m)==(r^4)%m
r=random(10^20);
f(r,m)==(r^4)%m
%16 = 1
%18 = 0
In the last example showing that there was a computation error in d, but you can handle that easily (if without floor it would be so close to an integer).
Ofcourse you can precompute P,M,G,H,inv,eff; you don't need to store all of these in all threads.

Last fiddled with by R. Gerbicz on 2019-05-24 at 18:44 Reason: grammar
R. Gerbicz is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Crypto News Nick Tales From the Crypt(o) 52 2020-12-17 21:16
I hate this time of year davieddy Lounge 4 2009-10-18 04:39
Crypto 2007 R.D. Silverman Lounge 2 2007-08-08 20:24
crypto game MrHappy Lounge 0 2005-01-19 16:27
Is it time to change the CPU year measurement? E_tron Lounge 7 2004-06-29 10:17

All times are UTC. The time now is 14:00.


Fri Oct 22 14:00:48 UTC 2021 up 91 days, 8:29, 1 user, load averages: 1.26, 1.24, 1.26

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.