mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2016-10-05, 00:43   #1
airsquirrels
 
 
"David"
Jul 2015
Ohio

1000000101₂ Posts
Default FPGA based FFT - Could it work?

The real answer is that of course it could work, but how close is it to being remotely feasible?

Searching around the forums didn't turn up any very serious discussion of the topic, but as a result of the KNL research and work I have been diving deeper into the many ways to implement FFTs and having some mental fun.

There are existing cores that can spit out 8-65536-point double-precision complex FFTs in very reasonable cycle counts, and FPGAs that are very fast and/or have many transceiver links, which could be grouped or tied via PCIe or another interconnect to a traditional CPU, definitely exist.

LL testing is a very specific process, but FFTs are not. Could we build an FPGA or PCIe card that simply performed a 4M- or 8M-point double-precision complex FFT in some number of microseconds and still see enough benefit after shuttling the data back and forth to a CPU for squaring and normalizing?
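As a sanity check on the idea of doing the transform on an accelerator and only the rounding/carry work on the host, here is a minimal pure-Python sketch. The recursive FFT stands in for the hypothetical FPGA core, and the base and sizes are arbitrary toy choices; real LL code would use balanced digits and an in-place real-data transform.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT (stand-in for the FPGA core)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 2.0 if invert else -2.0
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(1j * sign * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def square_via_fft(n, base_bits=8, size=64):
    """Square n: forward FFT, pointwise square, inverse FFT, round/carry.

    The two transforms model the accelerator; the rounding and carry
    recombination at the end is the 'CPU side' of the round trip.
    """
    base = 1 << base_bits
    digits = []
    t = n
    while t:
        digits.append(t % base)
        t //= base
    digits += [0] * (size - len(digits))   # zero-pad so products don't wrap
    spectrum = fft([complex(d, 0) for d in digits])
    conv = fft([v * v for v in spectrum], invert=True)
    # round back to integers and let Python's bignums absorb the carries
    result = 0
    for i, v in enumerate(conv):
        result += round(v.real / size) << (base_bits * i)
    return result

assert square_via_fft(123456789) == 123456789 ** 2
```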

If we are dreaming, why limit ourselves to double precision? Use scaling stages and a nicely sized fixed-point implementation, since we can build the hardware however we want; then we wouldn't need such large FFTs.

What about if we did even smaller passes (512K FFT, for example) on the very fast FPGA and the large pass on the CPU?

If we could run 4M FFTs at 10-100x the speed of current best-in-class PCs, could we just let a few of those cards churn through the whole 4M range (or double-check the whole 4M range), then take the next optimal FFT configuration and update the FPGAs to help out there?

What do smarter people think?

Last fiddled with by airsquirrels on 2016-10-05 at 00:45
Old 2016-10-05, 01:56   #2
jasonp
Tribal Bullet
 
 
Oct 2004

3489₁₀ Posts
Default

Double precision floating point requires a lot of hardware resources on an FPGA, and a top-of-the-line FPGA with huge on-chip resources usually costs about $20k. If the objective is high throughput of LL tests you can fill a room with PCs packed with GPUs for that kind of money.

Many years ago someone pointed to a paper that implemented an integer FFT with elements modulo 2^64-2^32+1; its performance on a $20k-at-the-time FPGA wasn't really competitive with Prime95.
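The modulus 2^64-2^32+1 is attractive in hardware precisely because 2^64 ≡ 2^32-1 and 2^96 ≡ -1 (mod p), so reducing a 128-bit product needs only 32-bit shifts and adds, never a general division. A sketch of that reduction (this shows only the arithmetic; the paper's actual pipeline is not something I have at hand):

```python
import random

P = 2**64 - 2**32 + 1   # the modulus from the paper

def reduce_128(x):
    """Reduce a value x < 2^128 mod P using only shifts and adds.

    Splits x = n0 + n1*2^64 + n2*2^96 and applies
    2^64 ≡ 2^32 - 1 (mod P) and 2^96 ≡ -1 (mod P).
    """
    n0 = x & (2**64 - 1)           # low 64 bits
    n1 = (x >> 64) & (2**32 - 1)   # bits 64..95
    n2 = x >> 96                   # bits 96..127
    r = n0 + n1 * (2**32 - 1) - n2
    while r >= P:                  # at most two conditional subtracts
        r -= P
    if r < 0:                      # or one conditional add
        r += P
    return r

# spot-check against plain modular reduction
for _ in range(1000):
    a, b = random.randrange(P), random.randrange(P)
    assert reduce_128(a * b) == (a * b) % P
```

The bounded correction loop is the point: in an FPGA datapath it becomes a couple of conditional subtractors rather than a divider.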

FFTs implemented in an ASIC are probably a different story, but do any of those work in double precision?
Old 2016-10-05, 08:31   #3
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

2·19·233 Posts
Default

Many "consumer" (i.e. "normal-price", not 25k USD) FPGAs have 24-bit integer or float multipliers, but their performance is lousy. You may also need a lower base, to avoid overflow, carries, whatever.

I think a better bang for the buck would be to take 975 STM32L Cortex MCUs at about $0.55 each, put them together on a PCB, and connect/firmware them in such a way that, for a given prime p and a given pair of start/stop q (or k), each will TF its own class of k for the given prime p between q-start and q-stop (this is easily calculable). So you only need to send the board 3 parameters, p, q-start, and q-stop, and it will give you back either a factor or a "no factor". The firmware can be written in plain C; it will do sieving (each MCU for its own class), squaring (they have 32-bit integer multipliers that run at over 100 MHz), etc., so they could do all 960 classes in parallel at once (for an mfaktc-equivalent 4620-class split).

Then, for about $600 you get a dedicated TF machine which can keep pace with a top GPU for a fraction of the energy (the whole assembly would only digest 25 to 35 watts, depending on your clock and interconnection). I may be willing to put all this together if I find the money... STM32/Cortex is what I do for my daily job, but I don't much like to invest the money and I can hardly find the time... Of course, the disadvantage compared to an ASIC is that it consumes much more power, but there are many advantages, like a low starting price and the possibility to re-program. An ASIC cannot be changed after it is made; it will always be dedicated to the job. Like bitcoin miners: if the mining is not profitable anymore, you can only scrap the hardware. This concept, by contrast, is totally recyclable; you just rewrite the firmware of all 975 ICs (which have flash memory that can be rewritten 10 to 20 thousand times) and you have another parallel machine, to which you can give another task. The only problem is that the transfer speed between the 960 "cores" (the other 15 are used for communication and task distribution only) is extremely slow, so you would need to find a task for it which doesn't need data transfer between cores (therefore no LL testing, no FFT).
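The 960-of-4620 class split mentioned above can be checked directly: candidate factors of M_p have the form q = 2kp+1 with q ≡ ±1 (mod 8), and a class of k mod 4620 (= 2²·3·5·7·11) can be skipped whenever q is forced to be divisible by 3, 5, 7 or 11, or lands in a bad residue mod 8. A sketch (the exponent 57885161 is an arbitrary choice; any exponent coprime to 2·3·5·7·11 gives the same count):

```python
def surviving_classes(p, n_classes=4620):
    """Count the classes of k (mod 4620) that can yield a factor q = 2kp+1.

    A class is dead if its q is always divisible by 3, 5, 7 or 11,
    or if q mod 8 is not 1 or 7 (factors of Mersenne numbers are ±1 mod 8).
    Since 4620 = 2^2 * 3 * 5 * 7 * 11, one representative per class decides.
    """
    alive = 0
    for k in range(n_classes):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):
            continue
        if any(q % s == 0 for s in (3, 5, 7, 11)):
            continue
        alive += 1
    return alive

print(surviving_classes(57885161))   # -> 960
```

The count factors as 2·2·4·6·10 = 960 over the residues mod 4, 3, 5, 7 and 11, matching mfaktc's split.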

Last fiddled with by LaurV on 2016-10-05 at 08:47 Reason: s/4920/4620/ Typo, also re-calculated the power
Old 2016-10-05, 12:57   #4
Gordon
 
 
Nov 2008

111110101₂ Posts
Default

Quote:
Originally Posted by LaurV

I think a better beat for the buck would be to take 975 pieces of STM32L Cortex MCUs for about $0.55 each and put them together on a PCB, and connect/firmware them in such a way ......
What I'd really like is mfaktc turned into an asic, you can but dream
Gordon is offline   Reply With Quote
Old 2016-10-05, 16:13   #5
chris2be8
 
 
Sep 2009

2³×239 Posts
Default

Many years ago I read the spec for an Inmos A100. It had a bank of ALUs connected so as to speed up multiple-precision multiplication: you loaded one operand (I'll call it A) into them, then fed the other operand (B) in one word at a time, and it multiplied that word of B by each word of A and summed with the previous results. You could chain chips together to cope with operands as long as you wanted.

It was basically schoolbook multiplication with as many ALUs as words of the operands (words were 16 bits).

You would want more modern technology today. But the design would be a starting point.

This assumes I've remembered it correctly. It was a long time ago.
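Taken at face value, the description above is a systolic schoolbook multiplier. A behavioral Python model of that scheme (not a cycle-accurate A100 model; the 16-bit word size is from the recollection above):

```python
def systolic_mul(a_words, b_words, word_bits=16):
    """Behavioral model of an A100-style chained-ALU multiplier.

    One cell per word of A holds its coefficient; B is streamed in one
    word per 'cycle', and every cell forms its partial product in
    parallel, accumulating into a diagonal of the result. This is
    schoolbook multiplication with as many ALUs as words of A.
    """
    mask = (1 << word_bits) - 1
    acc = [0] * (len(a_words) + len(b_words))
    for j, b in enumerate(b_words):        # one B word per cycle
        for i, a in enumerate(a_words):    # all cells fire in parallel
            acc[i + j] += a * b
    # final carry-propagation pass over the accumulated diagonals
    out, carry = [], 0
    for v in acc:
        v += carry
        out.append(v & mask)
        carry = v >> word_bits
    while carry:
        out.append(carry & mask)
        carry >>= word_bits
    return out

def to_words(n, word_bits=16):
    words = []
    while n:
        words.append(n & ((1 << word_bits) - 1))
        n >>= word_bits
    return words or [0]

def from_words(words, word_bits=16):
    return sum(w << (word_bits * i) for i, w in enumerate(words))

x, y = 123456789123456789, 987654321987654321
assert from_words(systolic_mul(to_words(x), to_words(y))) == x * y
```

The inner loop is the parallel hardware step; the serial carry pass at the end is the part a real chip would overlap with the streaming.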

Another search term would be hardware to speed up RSA cryptography. I've seen the term "bignum engine" used.

Chris
Old 2016-10-05, 16:42   #6
fivemack
(loop (#_fork))
 
 
Feb 2006
Cambridge, England

2×29×109 Posts
Default

The problem is that weird special-purpose devices aren't built in the newest available processes; for $340 for an Intel i7-6700K, you get four 4GHz pipelined 64x64->128 multipliers and 32 4GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with.

The biggest Virtex 7, which is a super-expensive piece of kit, has 3600 741MHz 25x18->43 multipliers; you need about nine of those to be a 53x53->106 multiplier, so maybe if you stretch things very carefully it might have three times the raw performance of the Skylake at thirty times the cost.
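The "about nine" figure checks out: ceil(53/25) × ceil(53/18) = 3 × 3 = 9 partial products. A sketch of the decomposition one would map onto those DSP slices (the 25x18 widths are from the post above; pipelining and the adder tree are ignored):

```python
import random

def mul53_with_dsps(x, y):
    """Build a 53x53->106 multiply out of 25x18-bit partial products,
    the way it would be decomposed onto Virtex-style DSP multipliers.
    Returns the product and the number of 25x18 multiplies consumed."""
    assert 0 <= x < 2**53 and 0 <= y < 2**53
    xs = [(x >> (25 * i)) & (2**25 - 1) for i in range(3)]   # 3 x 25-bit limbs
    ys = [(y >> (18 * j)) & (2**18 - 1) for j in range(3)]   # 3 x 18-bit limbs
    acc, dsps = 0, 0
    for i in range(3):
        for j in range(3):
            acc += (xs[i] * ys[j]) << (25 * i + 18 * j)
            dsps += 1
    return acc, dsps

x, y = random.randrange(2**53), random.randrange(2**53)
prod, dsps = mul53_with_dsps(x, y)
assert prod == x * y and dsps == 9
```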

There is custom crypto hardware, but (apart from the custom crypto hardware that runs AES operations at the raw cycle speed) it tends to be optimised for security. "openssl speed rsa" says 3300 private RSA1024 operations per second and 88 private RSA4096 operations per second on this not-very-exciting hardware.

Last fiddled with by fivemack on 2016-10-05 at 16:46
Old 2016-10-05, 21:01   #7
Gordon
 
 
Nov 2008

3·167 Posts
Default

Quote:
Originally Posted by fivemack
The problem is that weird special-purpose devices aren't built in the newest available processes; for $340 for an Intel i7-6700K, you get four 4GHz pipelined 64x64->128 multipliers and 32 4GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with.

The biggest Virtex 7, which is a super-expensive piece of kit, has 3600 741MHz 25x18->43 multipliers; you need about nine of those to be a 53x53->106 multiplier, so maybe if you stretch things very carefully it might have three times the raw performance of the Skylake at thirty times the cost.

There is custom crypto hardware, but (apart from the custom crypto hardware that runs AES operations at the raw cycle speed) it tends to be optimised for security. "openssl speed rsa" says 3300 private RSA1024 operations per second and 88 private RSA4096 operations per second on this not-very-exciting hardware.
I was thinking more along the lines of bitcoin h/w like this USB Miner.

$55, 8.5 gigahashes per second, and a whole 4 watts of power.

I don't know how to compare bitcoin gigahashes to what we need; it would be fun to try, though.
Old 2016-10-05, 21:16   #8
bsquared
 
 
"Ben"
Feb 2007

3298₁₀ Posts
Default

Quote:
Originally Posted by Gordon
I was thinking more along the lines of bitcoin h/w like this USB Miner.

$55, 8.5 gigahashes per second, and a whole 4 watts of power.

I don't know how to compare bitcoin gigahashes to what we need; it would be fun to try, though.
The price tag for custom ASIC fabrication probably starts in the seven-figure range for latest-generation technology. Plus, of course, a team of engineers for a year and another seven-figure budget for software/licenses. Oh, and another team of engineers and a seven-figure budget for package/PCB design and fab. Then we'd have a $100 part that does trial division at 10x the rate of a GPU.
Old 2016-10-05, 22:12   #9
airsquirrels
 
 
"David"
Jul 2015
Ohio

11×47 Posts
Default

Quote:
Originally Posted by bsquared
The price tag for custom ASIC fabrication probably starts in the seven-figure range for latest-generation technology. Plus, of course, a team of engineers for a year and another seven-figure budget for software/licenses. Oh, and another team of engineers and a seven-figure budget for package/PCB design and fab. Then we'd have a $100 part that does trial division at 10x the rate of a GPU.
There are a lot of lower-cost ways to get space on a die, and cheap ASICs can be made by piecing together commodity blocks, but we would still have quite a bit of PCB and interface development time. With TF, the logic for trial division on silicon is actually very simple. The sieving, on the other hand... that is where the complexity starts to build.

Is it true that Verilog/VHDL written for an FPGA can easily be used to create an initial, unoptimized ASIC design? Does anyone on here actually work with FPGAs or design ASICs? Or are we all just speculating from our related field experience?
Old 2016-10-05, 22:30   #10
bsquared
 
 
"Ben"
Feb 2007

2×17×97 Posts
Default

Quote:
Originally Posted by airsquirrels
There are a lot of lower-cost ways to get space on a die, and cheap ASICs can be made by piecing together commodity blocks, but we would still have quite a bit of PCB and interface development time. With TF, the logic for trial division on silicon is actually very simple. The sieving, on the other hand... that is where the complexity starts to build.

Is it true that Verilog/VHDL written for an FPGA can easily be used to create an initial, unoptimized ASIC design? Does anyone on here actually work with FPGAs or design ASICs? Or are we all just speculating from our related field experience?
Yeah, there are cheaper ways, but the broader point is that there is a very large up-front cost and time commitment for a probably marginal absolute performance improvement (unless you go all-in with the latest, largest silicon), though probably a significant performance-per-watt improvement. The reason custom bitcoin miners exist is that there is a return on investment for the companies that make them. Here, not so much.

I write Verilog for FPGAs but never have for ASICs; I'm not sure how well a design would translate. It likely depends on lots of details.

Last fiddled with by bsquared on 2016-10-05 at 22:32
Old 2016-10-06, 00:19   #11
ewmayer
2ω=0
 
 
Sep 2002
República de California

9,791 Posts
Default

Quote:
Originally Posted by fivemack
The problem is that weird special-purpose devices aren't built in the newest available processes; for $340 for an Intel i7-6700K, you get four 4GHz pipelined 64x64->128 multipliers and 32 4GHz pipelined 53x53->106 multipliers, which is really quite a challenge to compete with.
Not quite so fast with regard to the exact 100-bits-plus product using FMA, young Jedi. :)

I have modpow routines based on that approach as a build option in my TF code for both Intel CPUs (starting with Haswell) and nVidia GPUs, though note on the latter this approach runs 10-20x slower than one based on multiword 32-bit integer math.

The reason using the FMA hardware to compute 106-bit products is slower than you state above is twofold:

[1] Each such basic product needs 2 MULs (one can be just ordinary mul, the other must needs be an fma, though AFAIK both instructions use the same hardware on Intel and probably most every modern FMA-supporting processor);

[2] The two output "halves" will typically need to be properly normalized with respect to whatever base one is using for one's multiword math.

Here is the cost analysis from the 'latest new code added' comments in my TF code, from mid-2015 (DNINT is pseudocode for 'double-precision round to nearest integer'):
Code:
Added support for nVidia CUDA, geared toward 32-bit-int and floating-double (esp. FMA-arithmetic)
capability of Fermi+ families of GPUs. Also added AVX2-based modmul routines based on capability
of FMA-double math to exactly compute 106-bit product of 53-bit signed inputs via the sequence

	[input doubles x,y]
	double hi = x*y;
	double lo = FMA(x,y, -hi);
	double hh,cy;

This still needs normalization to properly split the resultant significant bits across lo,hi,
e.g. in a base-2^52 signed-int multiword context:

	hh  = DNINT(hi*TWO52FLINV);		// This part remains in hi...
	cy  = FMA(hh ,-TWO52FLOAT,hi);	// hi - hh*2^52 is 'backward carry' from hi into lo, needed for proper base-normalization
	lo += cy;
	hi  = hh;

Thus the cost of each 100+ - bit product is 1 ADD, 2 MUL, 2 FMA, 1 DNINT. (We assume/hope the negations are free).
I.e. 4 floating MULs per double-word product. Still very attractive on Haswell+, especially since the supported vector widths keep increasing, whereas the legacy 64x64 ==> 128-bit MUL (and the low-half-only variant, IMUL) remain stuck in scalar-only mode.
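The split described above can be verified exactly in software. The sketch below emulates the double-precision steps with exact integer arithmetic (so it runs without an FMA primitive): hi is the correctly rounded double product, lo the exact residual the FMA recovers, and the DNINT step moves the backward carry into the low limb. For simplicity it uses unsigned 52-bit inputs where the real code uses signed 53-bit balanced digits, and round-half-up for DNINT; the limb identity holds for any rounding choice, since cy compensates.

```python
import random

TWO52 = 2**52

def split_product(x, y):
    """Emulate hi = fl(x*y), lo = fma(x, y, -hi), then the base-2^52
    normalization from the quoted comments, all in exact integer math."""
    exact = x * y
    hi = int(float(exact))             # correctly rounded double product
    lo = exact - hi                    # exact residual (what the FMA yields)
    hh = (hi + TWO52 // 2) // TWO52    # DNINT(hi * TWO52FLINV), half-up here
    cy = hi - hh * TWO52               # 'backward carry' from hi into lo
    lo += cy
    return hh, lo                      # limbs: x*y == hh*2^52 + lo

for _ in range(1000):
    x = random.randrange(TWO52)
    y = random.randrange(TWO52)
    hh, lo = split_product(x, y)
    assert hh * TWO52 + lo == x * y    # the 104-bit product is exact
    assert abs(lo) < TWO52             # lo fits a signed base-2^52 limb
```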
