I haven't given it serious thought, but would the following idea work?
In Knuth Vol 2, I think Knuth states there is a hardware multiply algorithm that operates in linear time. If it is pipelineable, then would it be possible to use the standard recursive turnonebigmultiplyinto3halfsizemultiplies until you got down to say 10,000 bit chunks, then feed these 10,000 bit chunks to the pipelined FPGA multiply algorithm.
