In general this is a very hard job. See this thread for suggestions the last time someone wanted to implement the LL test in hardware. My personal opinion is that it's possible to significantly beat Prime95 running on even the fastest general-purpose hardware, but not by the 1000x that FPGAs can sometimes achieve. You can do 100x as many operations in parallel, but general-purpose hardware has 20-30x your clock rate, which eats most of that advantage.
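As a rough back-of-the-envelope check on those numbers, here is a minimal sketch in Python. It assumes the parallelism scales perfectly and ignores memory bandwidth and other overheads, so treat the result as an optimistic upper bound; the function name `net_speedup` is just for illustration.

```python
def net_speedup(parallel_factor, clock_ratio):
    """Idealized speedup: extra parallel operations divided by the clock-rate deficit.

    Assumes perfect parallel scaling (an assumption, not a measurement).
    """
    return parallel_factor / clock_ratio


PARALLEL_FACTOR = 100  # "100x as many operations in parallel"

# General-purpose hardware runs at 20-30x the FPGA clock rate.
for clock_ratio in (20, 30):
    speedup = net_speedup(PARALLEL_FACTOR, clock_ratio)
    print(f"clock ratio {clock_ratio}x -> net speedup ~{speedup:.1f}x")
```

Under those assumptions the net gain comes out to roughly 3-5x, which is a significant win over Prime95 but nowhere near 1000x.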
