CADO does not support GPUs; other researchers have substituted their own GPU components for the corresponding pieces in CADONFS (i.e. make the Block Weidemann code use a GPU for matrix multiplies).
It's an open question whether lowering the target norm makes it easier to find good hits; it certainly drastically reduces the number of hits found, and you would think that it would help the size optimization deal with the loworder coefficients if the top few polynomial coefficients were much smaller than average. But I don't know that, so far all the hits I've found have led to polynomials a lot worse than their current best one.
Delivering big batches of hits isn't a problem; I can take them by email or you can upload to Rapidshare or some other online storage service if the results are more than a few megabytes compressed.
Greg: I don't know how many hits we should look for. Nobody has ever done a massively parallel polynomial search, as in throwing BOINC at it for months. For RSA768 we spent three months and collected something like 15 million stage 1 hits, and the best resulting polynomial was just about as good as the one actually used for the sieving. For RSA896 the current code divides the search space for each leading coefficient into 46 pieces, so searching one of those only exhausts 2% of the available search space. A trillion hits would probably be too much to deal with, though it would be awesome :)
