20050727, 03:55  #1 
Jul 2005
101000_{2} Posts 
ECM/FPGA Implementation
Hi Folks,
Is anyone here interested in hardware level integer factorization? Or is there perhaps another subforum here somewhere that deals with it, please? I'm in the debugging stages of a Verilog/FPGA implementation of the ECM factoring method, and sure would like to know that I'm not reinventing the wheel. Thus far I have not found anyone else doing this other than a couple of much more ambitious research lab projects. Thanks, Ron 
20050727, 08:43  #2 
Oct 2004
23^{2} Posts 
ECM in FPGA
I use Xilinx fpgas (Altera are also good) and am also interested in this stuff. I was looking to put chips in parallel on a board and accelerate trial factoring or other projects like RSA solving.
It does seem ECM can be implemented in fpga or asic without too much difficulty. Part of my reading was this: http://www.crypto.ruhrunibochum.de...rences/ecm.pdf Although this is used as part of a bigger project, the ECM itself could work. They are using Xilinx Virtex series, but a lower cost implementation might work well on Spartan 3 or Virtex 4 chips which are the current generations with lower power consumption too. Peter 
20050727, 16:58  #3 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
Very interesting paper. Thanks for the link.
Alex 
20050728, 13:50  #4 
Jul 2005
2·193 Posts 
I have a Spartan3 Starter Kit and I'm trying to implement (VHDL) a naive algorithm to solve the discrete log for proth/riesel sieving.
I haven't run anything on an FPGA yet, but I've simulated a small run (p=101) using the IEEE1164 standard 32bit signed INTEGER types and the algorithm works. (Post should appear in comp.arch.fpga in a few hours. Just search that group for proth). Once I've got this all working and happy I'll pick a larger p and check for n up to 90M. (For a specific p where I know no factors exist) and time it. This is just the first stage. I've then got to make it handle 64bit numbers (luckily I only need a left shift, comparison and subtraction and an increment). If I can get enough sievers on the board to make it competative with the Klasson/Jobling proth_sieve then I'll be happy (see below). I'd also have to add PC > FPGA comms (serial) and a controlling program to get the next prime, eliminate k values based on quadratic residues etc. However, I doubt this is going to compete against proth_sieve. So... To do that I'd have to implement BabySteps GiantSteps in an FPGA. A bit more daunting as you've got to have a hash table (on board logic or external memory), and also be able to multiply 2 64bit numbers (so a possible 128 bit result) and then perform mod p (where p could be up to 64bit). Not impossible but beyond my novice FPGA/VHDL skills. I'll be a bit closer after the weekend... 
20050731, 06:53  #5 
Jul 2005
2^{3}×5 Posts 
How about a hardware design language subforum?
I was hoping there might be enough people interested that we could petition the moderators to start a new subgroup on hardware design languages, issues, etc., but that's probably unlikely with only four of us. Oh well, it sounds like everyone has interesting projects  and your project Greenbank sounds well beyond where I am now.
I'll have to take a quick look at proth/riesel sieving after I've posted this because I'm not familiar with it, but it sounds much more complicated than the Elliptic Curve Method. I agree that ECM isn't very "pleasing" or even desirable in most cases because of its probabilistic nature, but I'm hoping that for all practical purposes it works just fine  and it's about as simple as an algorithm of this nature can get. I'm going to have the same problems and choices about what type of development board to purchase, etc, fairly soon (I hope ;), so maybe we could share advice on the lowest price/performance development boards or something? I plan on using an Ethernet socket rather than serial RS232 if possible (and it isn't too much of a hit on the development boards logic). I've written (and partially tested) Verilog routines to do arbitrarily long multiplication and "mod p" as well as a few other routines needed by ECM (gcd, and igcdex for example). They're mostly just textbook examples that I've adapted for my own use, and they're written in Verilog rather than VHDL, but I'd be happy to share some of them if you like. Regards, Ron 
20050731, 08:53  #6 
Mar 2005
Poland
5×7 Posts 
Altera
I am also interested in FPGA programming, however I started studying it only half year ago.
Also, because of my University's agreements with Altera, I mostly programmed FPGAs using AHDL and MAX PLUS2, although I had to learn VHDL also. But, when we compered some implementations using VHDL and AHDL it turned that AHDL is much better (not really surprising, AHDL is optimised for Altera chips...), that's why I stayed with AHDL (and that was purpose of Altera's agreements ) Now I'm working on multiplying big integers using FFT (DFT rather). Please, if you can share some materials about your projects, I would be grateful. Regards, Washuu 
20050731, 11:42  #7  
Oct 2004
1000010001_{2} Posts 
Quote:
Also if looking for boards there is quite an extensive list here: http://www.fpgafaq.org/FPGA_Boards.shtml There is obviously some tradeoff between board cost and features. For spartan3, you could look at the boards from Buched (Tony Burch in Australia) and Enterpoint (UK). Thirdly as for interfacing, you could put an embedded processor in your FPGA so that other comms from chip to host is minimal. RS232 is "SO" legacy. Suggest people consider USB whether onchip or using an external microchip PIC (model containing USB) to breakout to parallel IO connected to FPGA IO pins. I like ethernet, but for a multiFPGA setup would be inclined to implement it ONCE per board to save logic, and use some other interconnect on board. 

20050731, 16:38  #8 
Jul 2005
2·193 Posts 
The maths for my algorithm is not that hard, I'm just rubbish at explaining it clearly. It's a very basic (read: slow) algorithm to solve the discrete log.
It only involves a left shift by 1 bit (multiply by 2), a comparison with p (and a possible subtraction), an increment of a counter and then a comparison with up to 9 target values (which would be solutions for that p). This is iterated many millions of times. I was only going to use RS232 because the Spartan3 Starter Kit comes with a 9Pin DSub already wired in. A parallel port interface would probably be better but would take effort on my part with a breakout box (which I don't have :). A serial interface will not be quick enough if I want to challenge the speed of the existing software sievers. My calculations show I'd need roughly 6MBit/sec throughput to achieve this. 10Mbit ethernet may just work (given there would be little/no contention on the lines). The only other option would be to move more of the precomputation (performed on a PC) over to the FPGA side, however this is way beyond my simple FPGA experience. That's the problem really, each block of work completes so quickly that you need high bandwidth to keep the FPGA sievers running. To reduce the bandwidth required I have to move some complicated maths over to the FPGA, but this takes up too much real estate on the FPGA and I lose performance. 
20050731, 18:48  #9  
Jul 2005
2^{3}·5 Posts 
Quote:
Washuu  I don't know what I have that might be useful to you since the algorithms we're using are so different. If you have anything specific I would be happy to try to help, but I only started learning Verilog recently myself. I'm using "linear iterative array" multipliers (See Knuth, "The Art of Computer Programming", Vol2, Seminumerical Algorithms, "How fast can we multiply". They are slow (one clock cycle per data bit), but can handle arbitrarily long integers. By the way, I purchased a copy of the Lattice ispLever Basic tools, so that and the free Icarus Simulator is what I'm mostly using for development. The Lattice tools seemed better than the offerings from either Altera or Lattice (better compiler error messages, etc), and at about $600 was HALF the price of the nearest competitor. Regards to all,  Ron 

20060325, 17:30  #10  
Jun 2005
17E_{16} Posts 
Quote:
Drew 

20060325, 23:19  #11 
Oct 2004
23^{2} Posts 
PC versus fpga implementation of LL test
Hi drew,
I can answer your question: http://www.ccm.ece.vt.edu/papers/cra...PLD04_mult.pdf See this paper by Craven et al, or other documents detailing the same work. They conclude: "In conclusion, a standalone hardware implementation of squaring 12million digit numbers in under compared to a fast Pentium. This design, with integers or convolve large sequences. While some in terms of dollars or silicon area, outweighs the indicate that FPGAs can now compete with Pentiums from highspeed floatingpoint multiplication." They were comparing Performance and cost of a Pentium implementation against a hardware design in a very large Xilinx Virtex 2 pro fpga. The conclusion was that the virtex was a bit (1.76x) faster than that (single core) Pentium. Unfortunately their hardware budget was quite a bit more expensive. They concluded that on price/performance ratio the Virtex implementation lost. For future work they suggest putting a multiple virtex into each of several pcs on a gigabit ethernet connected network. I would say that better than that is now available to use TEN gigabit ethernet (eg myrinet cards) in your cluster, or have the virtex boards interconnected using their own low latency custom interconnect fabric eg fast serial or multiple lvds type links. Now, considering further. We may actually be interested in a FAST result rather than cost eg to verify candidate primes. However with pc or unix box we can already throw multiple pcs at the problem as is used for current verification runs. However, taking that idea, if we threw MULTIPLE virtex chips at the problem we could produce an (expensive) verifier machine which produced a faster result. However, there is a substantial price premium for xilinx top of the range chips, so in my opinion the better price/performance could be achieved by using lower virtex models where the logic/price tradeoff is a sweet spot. Also virtex 2/pro is LAST generation xilinx hardware. VIRTEX FOUR is the current range and is faster and cheaper. Unfortunately PCs also got faster, eg little bit faster clockspeed, and dualcore or two dualcores on a motherboard. I even considered doing this in an array of (cheap) xilinx spartan chips, but they would need external ddr/ddr2 to store and manipulate the fft data and have fast bandwidth links between them. Yes it can be done but possibly still not the best achievable price/performance. In a way you can have very fast verification IF you are willing to invest substantially to get it (and write your own implementation). Craven team wrote some/most but did not finish or optimise totally, they estimated some parts of their design. On a positive not they were using older versions of the HDL synthesis tools. Current ISE 8.1 is quite a bit faster, and more importantly, the DESIGN is more optimised in silicon. Therefore using the tools I have, even their design would get a speedup which may invalidate their comparison, moving in favour of the Xilinx fpga solution. I have considered LL testing an exponent about 700 million digits, for which a pc would take ages. However, with an EXPENSIVE fpga implementation I would get an answer in a realistic time. If that was prime, I could win the EFF prize. However, for that particular number there is also a SMALL prize if you can characterize it as NONPRIME. So either way is a winner, but the cost would exceed the hardware/project cost. At the moment I have an array of Spartan 3 XC3400 and XC31500 chips which are interconnected using LVTTL, or LVDS or DDR signalling links and which I can talk to over ethernet, RS232 or the PCI bus with my own linux drivers. I can also gain or borrow a PCI express Virtex 4 board with SODIMM DDR2 memory modules. Such board could be populated with one, two or even four V4 of the LX or SX variety (of medium size). In future I may have V4 FX which are the ones with Powerpc processor, two gigabit macs, and very high speed serial links. FX are in short supply from Xilinx as yet. I have access to ddr controller IP and also xilinx have some coregen tools to make fft of various sizes, and complex fft are made out of that. Although not trivial I could replicate what the craven paper describes, and then have to finish off their partial design. I could then modify it to share the FFT over multiple V4 chips for greater performance. For now I am thinking that alternative uses may be ECM. Implementing regular bruteforce trial factoring is not great use of spartan3 array compared to a pc because of the limited clockspeed compared to a PC. Yes I can do 64x64=128 bit multiply but performance is limited. So in terms of contribution of a small fpga array, I am thinking ECM unless anyone has better ideas. I find it good at flashing many LEDs too ;) Last fiddled with by Peter Nelson on 20060325 at 23:34 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
FPGA based FFT  Could it work?  airsquirrels  Hardware  77  20210501 19:27 
Intel Xeon E5 with Arria 10 FPGA onpackage  BotXXX  Hardware  1  20161118 00:10 
FPGA based NFS sieving  wwf  Hardware  5  20130512 11:57 
Numbertheoretic FPGA function implementation suggestions?  rdotson  Hardware  18  20050925 13:04 
RSA implementation  flava  Programming  12  20041026 03:51 