mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Operation Kibibit (https://www.mersenneforum.org/forumdisplay.php?f=97)
-   -   Call for volunteers: RSA896 (https://www.mersenneforum.org/showthread.php?t=17460)

 jasonp 2012-11-17 14:12

Call for volunteers: RSA896

As of SVN823 the Msieve code is almost stable enough to consider a release of v1.51 (the only thing left is deciding what to do with the Windows build projects, which will need a lot of fixing after the recent GPU overhaul). [url="www.boo.net/~jasonp/msieve_svn823.zip"]A win32 binary[/url] is available for your perusal.

As a larger-scale test of the code, I propose we run polynomial selection on RSA896. Paul Zimmermann reports his group still has a lot of work left with this task, and using a GPU is much more effective on this size problem than using a CPU. So if you have a graphics card and at least version 4.2 of the Nvidia drivers, consider running a range of poly selection stage 1 for RSA896 with me (I've been at it for 2 weeks now).

More specifically, put the following number in a file named worktodo.ini:
[code]
412023436986659543855531365332575948179811699844327982845455626433876445565248426198098870423161841879261420247188869492560931776375033421130982397485150944909106910269861031862704114880866970564902903653658867433731720813104105190864254793282601391257624033946373269391
[/code]
and then run
[code]
msieve -v -np1 "stage1_norm=5e33 X,Y"
[/code]

where X and Y delimit a range of numbers with magnitude 10^12 to 10^15. You should pick a fairly large range, since the code will skip over many leading coefficients that are not smooth enough, and multiple machines (say, up to 5) can search the same range with a high probability of not duplicating any work. With a GTS450 card (which is pretty low-end) I get about 80000 hits per week, which is about 2MB zipped. So you will periodically want to interrupt a run in progress, move msieve.dat.m to another filename, and restart with X set just above the last coefficient searched.
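That interrupt/rotate/restart cycle can be scripted. Below is a minimal sketch (a hypothetical helper, not part of msieve): it moves msieve.dat.m aside under a unique name and builds the resume command, with the next starting coefficient read off from msieve's output and supplied by hand. Note that the range endpoints are written in plain decimal.

```python
import os
import time

def rotate_and_resume(next_x, y, datfile="msieve.dat.m"):
    """Move the hits file aside and build the command that resumes
    the stage 1 search just above the last coefficient searched."""
    if os.path.exists(datfile):
        # keep the old hits under a unique name so nothing is clobbered
        os.rename(datfile, "%s.%d" % (datfile, int(time.time())))
    return 'msieve -v -np1 "stage1_norm=5e33 %d,%d"' % (next_x, y)

# e.g. resume at 10^13 + 10^9 and search up to 10^14
print(rotate_and_resume(10_000_001_000_000, 100_000_000_000_000))
```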

The CPU utilization will be very low while a run is in progress, but interactivity in my WinXP GUI is noticeably worse, which you may find annoying. The binary can be run multithreaded, but that does not make a difference on problems this large.

Please post here if you have problems or questions. I've only had time to test the GPU-specific engine for Fermi cards; custom engines exist for older cards, but I have no idea if they work.

First, X and Y have to be written in decimal, not in the scientific notation that older versions allowed.

Also, if building from SVN, make sure to use the trunk and not the msieve_faststage1 branch; that branch is now obsolete.

 pinhodecarlos 2012-11-17 17:35

So you don't recommend using only a CPU?

 jasonp 2012-11-17 18:44

In my tests, the GPU code is perhaps 100x faster than the CPU code. The CPU code is not multithread-aware, though it could benefit from multithreading. Really, the CPU code needs a thread pool and high-performance radix sorting, just like the GPU code has now, after which the CPU and GPU code could probably be mostly unified. We're not there yet, but doing so could probably buy a 10x speedup in the CPU code.

Given a choice between a CPU run and nothing, a CPU run would obviously be better. But the current situation with stage 1 polynomial selection is exactly analogous to Mersenne trial factoring, vis-a-vis the GPU speedup.

 pinhodecarlos 2012-11-17 18:49

[QUOTE=jasonp;318747]In my tests, the GPU code is perhaps 100x faster than the CPU code. [/QUOTE]

How many cores does your CPU have?

 jasonp 2012-11-17 19:22

This is 1 core of a Core2 from 2007 vs. a low-end Fermi card from 2010. The GPU finds about one hit every 5-10 seconds, and I now see that the CPU code found two hits after 24 minutes. Neither codebase is optimal, neither set of hardware is very fast in absolute terms, and the search itself can find many times more hits if the cutoff for hit quality is loosened. The CPU code would certainly speed up linearly if you threw more cores at it, but currently GPUs have a huge performance advantage when running this algorithm.
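As a back-of-envelope check, those timings line up with the "perhaps 100x" figure (taking the midpoint of the 5-10 second GPU rate):

```python
# Seconds per hit from the measurements above.
gpu_seconds_per_hit = 7.5           # midpoint of "one hit every 5-10 seconds"
cpu_seconds_per_hit = 24 * 60 / 2   # two hits in 24 minutes

speedup = cpu_seconds_per_hit / gpu_seconds_per_hit
print(speedup)  # 96.0, consistent with "perhaps 100x"
```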

 debrouxl 2012-11-17 20:35

BOINC wrapper integration would let us unleash a huge amount of CPU+GPU power on polynomial selection for RSA-896. I should dig through my mailbox again for the discussions about that, though I don't have time to help with it right now...

 frmky 2012-11-19 20:08

[QUOTE=jasonp;318710]With a GTS450 card (which is pretty low-end) I get about 80000 hits per week, which is about 2MB zipped.[/QUOTE]
I started at 10^13. On the GTX 480, I got 229 hits in 8 minutes, which extrapolates to a bit under 300000 per week. How many are you wanting to collect?
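(The arithmetic behind that extrapolation:)

```python
# 229 hits in 8 minutes, scaled to a week
per_week = 229 / 8 * 60 * 24 * 7
print(per_week)  # 288540.0, i.e. a bit under 300000
```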

 Dubslow 2012-11-19 20:54

[QUOTE=frmky;318965]I started at 10^13. On the GTX 480, I got 229 hits in 8 minutes, which extrapolates to a bit under 300000 per week. How many are you wanting to collect?[/QUOTE]

Sheesh, that's a lot. :smile: Perhaps we should lower the norm to (e.g.) 5e32.

How will -nps and -npr be run? Should we run the former ourselves, as had previously been suggested for BOINCification?

frmky, how fast are you traversing leading-coeff ranges?

I plan to participate, once a) I'm back at school and b) I get my recently-RMA'd 460 back.

 jasonp 2012-11-20 01:25

I've computed about 65000 hits so far, and Paul reports computing 500k hits with CADO-NFS. I'm currently covering the 1e13 range as well, so we will probably have a little duplication. The code selects very few leading coefficients in this range; after two weeks I'm up to around (1e13+1e9).

I'm also being lazy, in that Paul's group can turn around stage 2 pretty quickly so I'm not bothering with it. A BOINC project would definitely need to do its own size optimization, since I'd expect thousands of good size-optimized polynomials out of tens of millions of stage 1 hits.

The upper limit is a good question; we would want millions of hits at least.

 frmky 2012-11-20 03:13

It's a big project, but if we only want millions of hits I'm not sure it's big or long-term enough for BOINC. With two GPUs I will be getting over half a million per week.

Starting at 10^13, the leading coefficient has advanced by 21 million in 6.5 hours, with a bit over 10000 hits. Extrapolating, if we want to cover the entire range, this will take 35000 GTX480-years and yield half a trillion hits totaling 12 terabytes. To put this in perspective, NFS@Home uses about 1.5 CPU-years each day.
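That extrapolation can be reproduced from the figures quoted in this thread (the bytes-per-hit ratio comes from jasonp's 80000-hits-is-about-2MB estimate):

```python
# Figures quoted in this thread.
coeffs_per_hour = 21e6 / 6.5     # leading-coefficient progress on a GTX 480
hits_per_coeff = 10_000 / 21e6   # a bit over 10000 hits over that same span
full_range = 1e15                # coefficients run up to ~10^15; the start is negligible
bytes_per_hit = 2e6 / 80_000     # jasonp: 80000 hits is about 2MB zipped

gpu_years = full_range / coeffs_per_hour / (24 * 365.25)
total_hits = full_range * hits_per_coeff
terabytes = total_hits * bytes_per_hit / 1e12

print(round(gpu_years))  # ~35000 GTX480-years
print(total_hits)        # ~4.8e11, half a trillion hits
print(round(terabytes))  # ~12 terabytes
```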

 Dubslow 2012-11-20 03:16

[QUOTE=jasonp;318995]I've computed about 65000 hits so far, and Paul reports computing 500k hits with CADO-NFS. I'm currently covering the 1e13 range as well, so we will probably have a little duplication. The code selects very few leading coefficients in this range, after two weeks I'm up to around (1e13+1e9).

I'm also being lazy, in that Paul's group can turn around stage 2 pretty quickly so I'm not bothering with it. A BOINC project would definitely need to do its own size optimization, since I'd expect thousands of good size-optimized polynomials out of tens of millions of stage 1 hits.

The upper limit is a good question; we would want millions of hits at least.[/QUOTE]

Does CADO support CUDA, or is yours the only such code?

How does lowering the norm affect how many hits you need? How likely would you be to miss an unusually-great poly with too large a norm?

It seems to me that running our own -nps would, at the very least, cut down on how much data we need to transfer. In either case, do we just attach results here and Paul will download them or something?
