mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Reply
 
Thread Tools
Old 2005-07-27, 03:55   #1
rdotson
 
rdotson's Avatar
 
Jul 2005

1010002 Posts
Lightbulb ECM/FPGA Implementation

Hi Folks,

Is anyone here interested in hardware level integer factorization? Or is there perhaps another sub-forum here somewhere that deals with it, please? I'm in the debugging stages of a Verilog/FPGA implementation of the ECM factoring method, and sure would like to know that I'm not reinventing the wheel. Thus far I have not found anyone else doing this other than a couple of much more ambitious research lab projects.

Thanks,

Ron
rdotson is offline   Reply With Quote
Old 2005-07-27, 08:43   #2
Peter Nelson
 
Peter Nelson's Avatar
 
Oct 2004

232 Posts
Default ECM in FPGA

I use Xilinx fpgas (Altera are also good) and am also interested in this stuff. I was looking to put chips in parallel on a board and accelerate trial factoring or other projects like RSA solving.

It does seem ECM can be implemented in fpga or asic without too much difficulty.

Part of my reading was this:

http://www.crypto.ruhr-uni-bochum.de...rences/ecm.pdf

Although this is used as part of a bigger project, the ECM itself could work.

They are using Xilinx Virtex series, but a lower cost implementation might work well on Spartan 3 or Virtex 4 chips which are the current generations with lower power consumption too.

Peter
Peter Nelson is offline   Reply With Quote
Old 2005-07-27, 16:58   #3
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

Very interesting paper. Thanks for the link.

Alex
akruppa is offline   Reply With Quote
Old 2005-07-28, 13:50   #4
Greenbank
 
Greenbank's Avatar
 
Jul 2005

2·193 Posts
Default

I have a Spartan-3 Starter Kit and I'm trying to implement (VHDL) a naive algorithm to solve the discrete log for proth/riesel sieving.

I haven't run anything on an FPGA yet, but I've simulated a small run (p=101) using the IEEE1164 standard 32bit signed INTEGER types and the algorithm works. (Post should appear in comp.arch.fpga in a few hours. Just search that group for proth).

Once I've got this all working and happy I'll pick a larger p and check for n up to 90M. (For a specific p where I know no factors exist) and time it.

This is just the first stage. I've then got to make it handle 64-bit numbers (luckily I only need a left shift, comparison and subtraction and an increment).

If I can get enough sievers on the board to make it competative with the Klasson/Jobling proth_sieve then I'll be happy (see below). I'd also have to add PC -> FPGA comms (serial) and a controlling program to get the next prime, eliminate k values based on quadratic residues etc.

However, I doubt this is going to compete against proth_sieve. So...

To do that I'd have to implement Baby-Steps Giant-Steps in an FPGA. A bit more daunting as you've got to have a hash table (on board logic or external memory), and also be able to multiply 2 64-bit numbers (so a possible 128 bit result) and then perform mod p (where p could be up to 64-bit). Not impossible but beyond my novice FPGA/VHDL skills.

I'll be a bit closer after the weekend...
Greenbank is offline   Reply With Quote
Old 2005-07-31, 06:53   #5
rdotson
 
rdotson's Avatar
 
Jul 2005

23×5 Posts
Default How about a hardware design language sub-forum?

I was hoping there might be enough people interested that we could petition the moderators to start a new sub-group on hardware design languages, issues, etc., but that's probably unlikely with only four of us. Oh well, it sounds like everyone has interesting projects - and your project Greenbank sounds well beyond where I am now.

I'll have to take a quick look at proth/riesel sieving after I've posted this because I'm not familiar with it, but it sounds much more complicated than the Elliptic Curve Method. I agree that ECM isn't very "pleasing" or even desirable in most cases because of its probabilistic nature, but I'm hoping that for all practical purposes it works just fine - and it's about as simple as an algorithm of this nature can get.

I'm going to have the same problems and choices about what type of development board to purchase, etc, fairly soon (I hope ;-), so maybe we could share advice on the lowest price/performance development boards or something? I plan on using an Ethernet socket rather than serial RS232 if possible (and it isn't too much of a hit on the development boards logic).

I've written (and partially tested) Verilog routines to do arbitrarily long multiplication and "mod p" as well as a few other routines needed by ECM (gcd, and igcdex for example). They're mostly just textbook examples that I've adapted for my own use, and they're written in Verilog rather than VHDL, but I'd be happy to share some of them if you like.

Regards,

Ron
rdotson is offline   Reply With Quote
Old 2005-07-31, 08:53   #6
Washuu
 
Mar 2005
Poland

5×7 Posts
Default Altera

I am also interested in FPGA programming, however I started studying it only half year ago.

Also, because of my University's agreements with Altera, I mostly programmed FPGAs using AHDL and MAX PLUS-2, although I had to learn VHDL also.

But, when we compered some implementations using VHDL and AHDL it turned that AHDL is much better (not really surprising, AHDL is optimised for Altera chips...), that's why I stayed with AHDL (and that was purpose of Altera's agreements )

Now I'm working on multiplying big integers using FFT (DFT rather).

Please, if you can share some materials about your projects, I would be grateful.

Regards,

Washuu
Washuu is offline   Reply With Quote
Old 2005-07-31, 11:42   #7
Peter Nelson
 
Peter Nelson's Avatar
 
Oct 2004

10000100012 Posts
Default

Quote:
Originally Posted by rdotson
I was hoping there might be enough people interested that we could petition the moderators to start a new sub-group on hardware design languages, issues, etc., but that's probably unlikely with only four of us. Oh well, it sounds like everyone has interesting projects - and your project Greenbank sounds well beyond where I am now.

.....

I'm going to have the same problems and choices about what type of development board to purchase, etc, fairly soon (I hope ;-), so maybe we could share advice on the lowest price/performance development boards or something? I plan on using an Ethernet socket rather than serial RS232 if possible (and it isn't too much of a hit on the development boards logic).

....

Regards,

Ron
AGREED it would be nice to have a space in the forum for FPGA/HDL.

Also if looking for boards there is quite an extensive list here:

http://www.fpga-faq.org/FPGA_Boards.shtml

There is obviously some tradeoff between board cost and features.

For spartan3, you could look at the boards from Buched (Tony Burch in Australia) and Enterpoint (UK).

Thirdly as for interfacing, you could put an embedded processor in your FPGA so that other comms from chip to host is minimal. RS232 is "SO" legacy. Suggest people consider USB whether on-chip or using an external microchip PIC (model containing USB) to breakout to parallel IO connected to FPGA IO pins.

I like ethernet, but for a multi-FPGA setup would be inclined to implement it ONCE per board to save logic, and use some other interconnect on board.
Peter Nelson is offline   Reply With Quote
Old 2005-07-31, 16:38   #8
Greenbank
 
Greenbank's Avatar
 
Jul 2005

2·193 Posts
Default

The maths for my algorithm is not that hard, I'm just rubbish at explaining it clearly. It's a very basic (read: slow) algorithm to solve the discrete log.

It only involves a left shift by 1 bit (multiply by 2), a comparison with p (and a possible subtraction), an increment of a counter and then a comparison with up to 9 target values (which would be solutions for that p). This is iterated many millions of times.

I was only going to use RS232 because the Spartan-3 Starter Kit comes with a 9-Pin D-Sub already wired in. A parallel port interface would probably be better but would take effort on my part with a breakout box (which I don't have :-).

A serial interface will not be quick enough if I want to challenge the speed of the existing software sievers. My calculations show I'd need roughly 6MBit/sec throughput to achieve this. 10Mbit ethernet may just work (given there would be little/no contention on the lines).

The only other option would be to move more of the pre-computation (performed on a PC) over to the FPGA side, however this is way beyond my simple FPGA experience.

That's the problem really, each block of work completes so quickly that you need high bandwidth to keep the FPGA sievers running. To reduce the bandwidth required I have to move some complicated maths over to the FPGA, but this takes up too much real estate on the FPGA and I lose performance.
Greenbank is offline   Reply With Quote
Old 2005-07-31, 18:48   #9
rdotson
 
rdotson's Avatar
 
Jul 2005

23·5 Posts
Default

Quote:
Originally Posted by Greenbank
...That's the problem really, each block of work completes so quickly that you need high bandwidth to keep the FPGA sievers running. To reduce the bandwidth required I have to move some complicated maths over to the FPGA, but this takes up too much real estate on the FPGA and I lose performance.
Yes, I arrived at pretty much the same conclusion before I started. That's part of the reason I decided to use ECM - because it's simple enough to fit the entire thing on the FPGA. I'll need a port to deliver numbers to be factored and to retrieve the results, but for my first tests I can simply hard-code those into the FPGA. :-)

Washuu - I don't know what I have that might be useful to you since the algorithms we're using are so different. If you have anything specific I would be happy to try to help, but I only started learning Verilog recently myself. I'm using "linear iterative array" multipliers (See Knuth, "The Art of Computer Programming", Vol2, Seminumerical Algorithms, "How fast can we multiply". They are slow (one clock cycle per data bit), but can handle arbitrarily long integers.

By the way, I purchased a copy of the Lattice ispLever Basic tools, so that and the free Icarus Simulator is what I'm mostly using for development. The Lattice tools seemed better than the offerings from either Altera or Lattice (better compiler error messages, etc), and at about $600 was HALF the price of the nearest competitor.

Regards to all,

-- Ron
rdotson is offline   Reply With Quote
Old 2006-03-25, 17:30   #10
drew
 
drew's Avatar
 
Jun 2005

17E16 Posts
Default

Quote:
Originally Posted by rdotson
Yes, I arrived at pretty much the same conclusion before I started. That's part of the reason I decided to use ECM - because it's simple enough to fit the entire thing on the FPGA. I'll need a port to deliver numbers to be factored and to retrieve the results, but for my first tests I can simply hard-code those into the FPGA. :-)

Washuu - I don't know what I have that might be useful to you since the algorithms we're using are so different. If you have anything specific I would be happy to try to help, but I only started learning Verilog recently myself. I'm using "linear iterative array" multipliers (See Knuth, "The Art of Computer Programming", Vol2, Seminumerical Algorithms, "How fast can we multiply". They are slow (one clock cycle per data bit), but can handle arbitrarily long integers.

By the way, I purchased a copy of the Lattice ispLever Basic tools, so that and the free Icarus Simulator is what I'm mostly using for development. The Lattice tools seemed better than the offerings from either Altera or Lattice (better compiler error messages, etc), and at about $600 was HALF the price of the nearest competitor.

Regards to all,

-- Ron
An old thread, I know, but I'm wondering...about how fast would one expect an FPGA to perform LL testing? Better than Prime95? And if so, by how much? I've only had very rudimentary exposure to FFT implementations, but I think this wold be a very interesting problem...my current job has me dabbling in FPGA development.

Drew
drew is offline   Reply With Quote
Old 2006-03-25, 23:19   #11
Peter Nelson
 
Peter Nelson's Avatar
 
Oct 2004

232 Posts
Default PC versus fpga implementation of LL test

Hi drew,

I can answer your question:

http://www.ccm.ece.vt.edu/papers/cra...PLD04_mult.pdf

See this paper by Craven et al, or other documents detailing the same work.

They conclude:
"In conclusion, a stand-alone hardware implementation
of squaring 12-million digit numbers in under
compared to a fast Pentium. This design, with
integers or convolve large sequences. While some
in terms of dollars or silicon area, outweighs the
indicate that FPGAs can now compete with Pentiums
from high-speed floating-point multiplication."

They were comparing Performance and cost of a Pentium implementation against a hardware design in a very large Xilinx Virtex 2 pro fpga.

The conclusion was that the virtex was a bit (1.76x) faster than that (single core) Pentium.

Unfortunately their hardware budget was quite a bit more expensive.

They concluded that on price/performance ratio the Virtex implementation lost.

For future work they suggest putting a multiple virtex into each of several pcs on a gigabit ethernet connected network.

I would say that better than that is now available to use TEN gigabit ethernet (eg myrinet cards) in your cluster, or have the virtex boards interconnected using their own low latency custom interconnect fabric eg fast serial or multiple lvds type links.

Now, considering further.

We may actually be interested in a FAST result rather than cost eg to verify candidate primes.

However with pc or unix box we can already throw multiple pcs at the problem as is used for current verification runs.

However, taking that idea, if we threw MULTIPLE virtex chips at the problem we could produce an (expensive) verifier machine which produced a faster result.

However, there is a substantial price premium for xilinx top of the range chips, so in my opinion the better price/performance could be achieved by using lower virtex models where the logic/price tradeoff is a sweet spot.

Also virtex 2/pro is LAST generation xilinx hardware. VIRTEX FOUR is the current range and is faster and cheaper.

Unfortunately PCs also got faster, eg little bit faster clockspeed, and dualcore or two dualcores on a motherboard.

I even considered doing this in an array of (cheap) xilinx spartan chips, but they would need external ddr/ddr2 to store and manipulate the fft data and have fast bandwidth links between them.

Yes it can be done but possibly still not the best achievable price/performance.

In a way you can have very fast verification IF you are willing to invest substantially to get it (and write your own implementation). Craven team wrote some/most but did not finish or optimise totally, they estimated some parts of their design.

On a positive not they were using older versions of the HDL synthesis tools. Current ISE 8.1 is quite a bit faster, and more importantly, the DESIGN is more optimised in silicon. Therefore using the tools I have, even their design would get a speedup which may invalidate their comparison, moving in favour of the Xilinx fpga solution.

I have considered LL testing an exponent about 700 million digits, for which a pc would take ages. However, with an EXPENSIVE fpga implementation I would get an answer in a realistic time.

If that was prime, I could win the EFF prize. However, for that particular number there is also a SMALL prize if you can characterize it as NON-PRIME.

So either way is a winner, but the cost would exceed the hardware/project cost.

At the moment I have an array of Spartan 3 XC3-400 and XC3-1500 chips which are interconnected using LVTTL, or LVDS or DDR signalling links and which I can talk to over ethernet, RS232 or the PCI bus with my own linux drivers.

I can also gain or borrow a PCI express Virtex 4 board with SODIMM DDR2 memory modules. Such board could be populated with one, two or even four V4 of the LX or SX variety (of medium size).

In future I may have V4 FX which are the ones with Powerpc processor, two gigabit macs, and very high speed serial links. FX are in short supply from Xilinx as yet.

I have access to ddr controller IP and also xilinx have some coregen tools to make fft of various sizes, and complex fft are made out of that. Although not trivial I could replicate what the craven paper describes, and then have to finish off their partial design.

I could then modify it to share the FFT over multiple V4 chips for greater performance.

For now I am thinking that alternative uses may be ECM. Implementing regular bruteforce trial factoring is not great use of spartan3 array compared to a pc because of the limited clockspeed compared to a PC. Yes I can do 64x64=128 bit multiply but performance is limited. So in terms of contribution of a small fpga array, I am thinking ECM unless anyone has better ideas.

I find it good at flashing many LEDs too ;-)

Last fiddled with by Peter Nelson on 2006-03-25 at 23:34
Peter Nelson is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
FPGA based FFT - Could it work? airsquirrels Hardware 77 2021-05-01 19:27
Intel Xeon E5 with Arria 10 FPGA on-package BotXXX Hardware 1 2016-11-18 00:10
FPGA based NFS sieving wwf Hardware 5 2013-05-12 11:57
Number-theoretic FPGA function implementation suggestions? rdotson Hardware 18 2005-09-25 13:04
RSA implementation flava Programming 12 2004-10-26 03:51

All times are UTC. The time now is 21:17.

Tue May 11 21:17:34 UTC 2021 up 33 days, 15:58, 0 users, load averages: 2.45, 2.52, 2.32

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.