2005-06-30, 17:42  #1 
Feb 2005
4_{10} Posts 
Improvements for the CWI NFS software package
Hey all,
Let me first introduce myself: I am a student at the University of Amsterdam, currently finishing the first year of a two-year Master's programme in Grid Computing within Computer Science. Besides computer science I have a great interest in theoretical mathematics; for my Bachelor's project last year I implemented several factorization methods (Fermat, Pollard rho, Pollard p-1, QS and MPQS) under the supervision of H.J.J. te Riele at the CWI. Next year I have almost a whole year for my Master's graduation project, in which I will work on the NFS software package from the CWI (almost certainly again under te Riele's supervision). My main goal is to speed up the implementation and to implement some new ideas from a perspective more oriented towards computer science.

I have been reading this forum for almost a year now. Most of the people here are experts in this field and have experience with an NFS software package; some of you have even worked with the software from the CWI, which is why I post this message here. I am interested in ideas or improvements you would like to see in an NFS implementation, and more specifically in what the people who have played with the CWI software suite liked or disliked about it. Of course I cannot promise to implement all your suggestions, but I will do my best.

I already have the code and am working on it alongside my normal school work. I am a big fan of the GMP library, and right now I am busy replacing the current big-number library (FreeLIP) with GMP (this replacement should result in a speedup). This is a very time-consuming task, and of course the new code must be tested extensively. For now I am doing this only for the sieving stage, since that is one of the more time-consuming tasks. 
My second goal is to make the sieving program able to run on big clusters using MPI, and to improve the sieving code with the bucket-sorting techniques which I encountered here and which are currently also used in, for instance, msieve. I have some other smaller ideas that could improve the code, but right now these are the three main things I would like to add to the software package. Of course I will (eventually) also work on the other stages, but first things first: for now I will concentrate mainly on the sieving stage.

I hope the people here have some great ideas I didn't think of, or complaints about this piece of software. I have read a lot about GNFS, but only in the last few months have I been able to play with this software, so a lot of people here are certainly more experienced with this package and could share that experience. I hope this discussion will result in an even faster NFS program.

Kind regards,
Joppe Bos 
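The bucket-sorting technique mentioned above can be illustrated in a few lines. The following is a toy C sketch, not code from the CWI suite or from msieve: the point is only that sieve updates for large primes are first sorted into one bucket per cache-sized block, then replayed block by block while the block is hot in cache, instead of scattering single-byte additions across the whole sieve array.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy bucket sieve: instead of adding log(p) at stride p across the whole
   interval (poor cache behaviour for large p), first sort the (offset,
   log p) updates into one bucket per cache-sized block, then replay each
   bucket sequentially. Real sievers only bucket the large primes
   (p > BLOCK_LEN) and line-sieve the small ones directly. */

#define SIEVE_LEN   (1 << 16)     /* whole sieve interval */
#define BLOCK_BITS  12
#define BLOCK_LEN   (1 << BLOCK_BITS)
#define NBLOCKS     (SIEVE_LEN / BLOCK_LEN)

struct update { uint16_t offset; uint8_t logp; };

void bucket_sieve(uint8_t *sieve, const uint32_t *primes,
                  const uint8_t *logs, int nprimes) {
    struct update *bkt[NBLOCKS];
    int fill[NBLOCKS] = {0};
    /* generous capacity for a toy; real code sizes buckets carefully */
    for (int b = 0; b < NBLOCKS; b++)
        bkt[b] = malloc(sizeof(struct update) * SIEVE_LEN);

    /* phase 1: bucket-sort the updates by block */
    for (int i = 0; i < nprimes; i++)
        for (uint32_t r = 0; r < SIEVE_LEN; r += primes[i]) {
            int b = r >> BLOCK_BITS;
            bkt[b][fill[b]++] = (struct update){
                (uint16_t)(r & (BLOCK_LEN - 1)), logs[i] };
        }

    /* phase 2: replay each bucket while its block is hot in cache */
    for (int b = 0; b < NBLOCKS; b++) {
        uint8_t *block = sieve + ((size_t)b << BLOCK_BITS);
        for (int j = 0; j < fill[b]; j++)
            block[bkt[b][j].offset] += bkt[b][j].logp;
        free(bkt[b]);
    }
}
```

In a real siever the bucketed primes also carry their index or value so the resieving/trial-division phase can recover which primes hit each report.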
2005-06-30, 19:14  #2  
Bamboozled!
May 2003
Down not across
5·13·163 Posts 
Quote:
I've been using the CWI package for many years now and have a source-code license myself; Herman will vouch for me if you ask him. Please contact me directly by email (paul AT leyland.vispa.com) and I'll discuss several possible enhancements. I may be able to help with some of the work, though I am extremely busy at the moment and cannot make any firm promises.

Strictly speaking, I think the CWI arithmetic package is not FreeLIP but was developed from a precursor to FreeLIP itself. Independent of that, moving to GMP would almost certainly improve the performance significantly, but PLEASE read the license terms of GMP very carefully: so far, CWI have not released their NFS suite under a license which is compatible with the LGPL that GMP uses.

Just to get things started: I believe we need factor-base primes to at least 2^32 and large primes to at least 2^40 in the near future (to support a kilobit SNFS, for example). The software should still run well on 32-bit machines. The postprocessing tools need rewriting to use much less memory, especially in the duplicate- and singleton-removal algorithms when filtering (Peter Montgomery, Don Leclair and I all have code which addresses parts of the problem). Likewise, the postprocessing tools need to be able to work with very large datasets on architectures which do not support very large files; I have code that addresses this problem.

There are other things we can discuss, but that will do for a start.

Paul 
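The memory pressure in singleton removal comes from the shape of the algorithm: ideals must be counted across all relations, and removals cascade. A toy C sketch of the idea (small integer arrays standing in for relation files; this is not the CWI filter code, and the names are invented for illustration):

```c
/* Toy singleton removal: repeatedly drop relations containing an ideal
   that occurs in only one surviving relation, until counts stabilise.
   Each relation holds up to 4 ideal indices, terminated by -1. */

#define MAX_IDEALS 64

/* returns the number of surviving relations; alive[i] set to 0/1 */
int remove_singletons(int rel[][4], int nrels, int alive[]) {
    for (int i = 0; i < nrels; i++) alive[i] = 1;
    int changed = 1, survivors = nrels;
    while (changed) {
        changed = 0;
        int count[MAX_IDEALS] = {0};
        /* pass 1: count ideal occurrences among surviving relations */
        for (int i = 0; i < nrels; i++)
            if (alive[i])
                for (int j = 0; j < 4 && rel[i][j] >= 0; j++)
                    count[rel[i][j]]++;
        /* pass 2: a relation with a once-occurring ideal is useless */
        for (int i = 0; i < nrels; i++) {
            if (!alive[i]) continue;
            for (int j = 0; j < 4 && rel[i][j] >= 0; j++)
                if (count[rel[i][j]] == 1) {
                    alive[i] = 0;
                    survivors--;
                    changed = 1;
                    break;
                }
        }
    }
    return survivors;
}
```

With millions of relations the counting pass is exactly where the hash table (and hence the memory) goes, which is why out-of-core variants matter on 32-bit machines.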

2005-06-30, 19:59  #3 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
Excellent news indeed!
If you discuss possible enhancements by private mail, may I ask you to post summaries of your ideas occasionally? This would be of great interest for me (and undoubtedly others) to follow.

A few things I noticed that should be easy to fix but can be quite annoying for the user (some are very similar to the ones Paul mentioned):

- Add large-file support on 32-bit architectures. Right now filter aborts when files get >2GB. (I think piping through gzip may avoid the problem, but I'm not sure about it, and it's pretty slow.) Are there any systems left in popular use that do not support large files at all?

- Allow doing duplicate removal and singleton pruning completely separately. Right now the maxrelsinp parameter to filter specifies up to how many hash-table entries should be used for duplicate removal, so setting that parameter to a low value helps conserve memory. Unfortunately the parameter doubles as the initial value for the size of the table of ideals, so setting it too low makes filter spend a lot of time growing that table (it dumps the old contents to disk and re-reads them). I think either mergelevel 1 should not do duplicate removal at all, or there should be a nodupe parameter.

- Herman once mentioned to me that the binary file format would be retired. That would be a shame, imo, as it allows for much faster file I/O. The format will need to be adapted to allow for ideal norms >2^32, though.

- Now that dual-core CPUs appear to be becoming commonplace, having gauss support dual- (maybe quad-) CPU machines without much hassle would be nice.

- sqrt should learn to deal with ideal norms >1e9 on 32-bit machines. Even increasing the limit to 2^32 or 2^31 would make a huge difference for everyday NFS work, i.e. factorisations for the Cunningham project.

- The siever could benefit from some modest architecture-specific optimisation. Especially AMD64, which appears to be quickly becoming the workhorse of choice for computational number theory, should offer much room for improvement, for example by using the 64x64->128 bit product.

This is all I can think of off the top of my head; when other ideas occur to me, I'll post more wish lists.

Alex

Last fiddled with by akruppa on 2005-06-30 at 20:01 
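The 64x64->128 bit product in that last item is worth spelling out. A hedged sketch in C (unsigned __int128 is a GCC/Clang extension, not standard C; nothing here is taken from the CWI code): on AMD64 the full product is one MUL instruction, whereas a 32-bit build must assemble it from four 32x32->64 partial products.

```c
#include <stdint.h>

/* High 64 bits of a*b, the schoolbook way a 32-bit build must do it:
   four partial products plus carry propagation. */
static uint64_t mulhi_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    uint64_t carry = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;
    return p3 + (p1 >> 32) + (p2 >> 32) + carry;
}

/* The same value via the hardware 64x64->128 product: on x86-64 this
   compiles to a single MUL. */
static uint64_t mulhi_native(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

The same one-instruction product also makes 64-bit modular multiplication (norm arithmetic, cofactor splitting) a single multiply-and-reduce instead of a multi-word routine.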
2005-06-30, 20:05  #4  
Bamboozled!
May 2003
Down not across
10100101100011_{2} Posts 
Quote:
Your point about Gauss (really Lanczos hiding behind a pseudonym) is a very good one. Personally, I would like to see the full MPI version released.

Paul

P.S. Apologies for typos, etc., in my posts today; I'm seriously short on sleep. 

2005-06-30, 20:30  #5 
Jun 2003
The Texas Hill Country
3^{2}×11^{2} Posts 
Welcome to the group.
Having used the CWI suite for NFSNet, I welcome any improvements that you might make. Of course, everyone wants things that make it run faster. If GMP, or some other multi-precision arithmetic package, will help, by all means consider it. However, I would strongly recommend that you try to be implementation-agnostic wherever possible and isolate the parts that are not.

I would hope that, even if you don't actually implement it, you can start to accommodate larger primes in the factor base and factors. When we eventually do a kilobit factorization, I think that we are going to need to exceed the present upper limits. Raising the limits on 64-bit machines should not be difficult, at least conceptually. Although 64-bit machines are becoming more readily available, there are still significantly more 32-bit machines around, and I think it will be a long time before we want to "write off" their ability to contribute usefully to the sieving stage.

I also have a number of suggestions with respect to portability and interoperability, or rather the lack thereof; I will send you more information on our experiences via email.

As for using MPI on big clusters: improvements there will be much less significant for the sieving than for other parts of the suite. Since the sieving is extremely loosely coupled, big long-running batch jobs are generally quite adequate.

One suggestion I would make is modularization: separate the sieving engine from the control parts, rather than having them woven together like spaghetti.

I look forward to further discussions.

Richard 
2005-06-30, 20:49  #6  
Tribal Bullet
Oct 2004
2·3·19·31 Posts 
Quote:
jasonp 

2005-06-30, 21:21  #7 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
In fact the siever already contains some code that aims at eventually turning it into a lattice siever: you can specify vectors for the transform of the sieving space, but at present you have to do it manually. I once played with them a little but ran into a bug (the siever exited with an error) which I reported to Peter Montgomery. He acknowledged the bug, but I don't know what became of the issue.

So some preparations for lattice sieving have already been made. This would definitely be worthwhile to pursue.

Alex 
2005-07-01, 13:16  #8  
Feb 2005
2^{2} Posts 
Wow! It is great to see so many replies in such a short time. I posted my request at this point because the summer vacation is right around the corner; I am going on vacation and was planning to investigate a lot of the ideas presented here (I will bring my laptop with me).
I will make a large list of all the things that could be implemented or changed, then give some sort of priority to every point on the list, and hopefully I will have enough time to work down the whole list (while simultaneously doing some research to speed up certain parts of the NFS). So far I have only taken a close look at the polynomial selection and the sieving stage (and am still reading the code), so improvements to the other stages are more than welcome but will be incorporated at a later stage. I will contact the people here with a source-code license so we can discuss some things in detail; my mail address is jwbos <at> science <dot> uva <dot> nl. But first let me respond to some of the suggestions and ideas. Quote:
Quote:
Quote:
Quote:
Quote:
About the Lanczos algorithm: the current version already has multiprocessor support. I haven't looked into this part of the code yet, but as far as I know this isn't done with MPI, and converting the code to MPI would make it much more scalable to, for instance, big clusters. Quote:
Your second suggestion is also a very good one! The program would have a much better design if all the 'real' sieving code were totally separated from the rest of the code in the sieving stage. As akruppa already mentioned, there actually is lattice-sieve support in the program. I only have experience with the normal line siever, but will (of course) investigate the advantages and disadvantages of this sieving method. I am particularly interested in that bug; I also don't know whether it has been fixed, but if I knew what caused it I could take a closer look.

Again, thanks for all the fast replies. I will take a closer look at all of the suggested ideas during the summer vacation, since right now I am quite busy with the last things I have to finish for my study. Sorry for this rather large reply.

Kind regards,
Joppe 

2005-07-01, 14:06  #9  
Nov 2003
2^{2}×5×373 Posts 
Quote:
line siever to a lattice siever. When I reworked my line siever into a lattice siever I got (roughly) a factor of 5 speed improvement. I'd like to improve it further, but I have a real job that requires my time, and weekends are spent being a single parent.

If the OP likes, I can provide a source copy of my lattice siever. It is much more readable than the Franke siever. No disrespect intended: Franke et al. did a good job on their siever, but the coding style and lack of comments make it unreadable. 

2005-07-01, 14:30  #10  
Nov 2003
2^{2}×5×373 Posts 
Quote:
is from 2,791+, which is in progress. I am using factor bases of 1.5 million primes for each polynomial, and the lattice being sieved is 6K x 12K. The sieving is being done on a 1.6GHz Pentium 4 laptop.

qindex is an index into the factor base showing which special q is being used. lattice is the reduced lattice for this special q. I first sieve from (0 to 6K) x (0 to 6K), then from (0 to 6K) x (-6K to 0). The "after scan" lines show how many *potential* successes survived the algebraic sieve, first for the top half plane, then for the bottom half. num_total_success is the number of *potential* successes surviving both the algebraic and integer sieve. Many (most) are false hits with a too-large prime cofactor. A lot are also successes where the (c,d) grid coordinates are not coprime.

Siever built on Feb 11 2005 12:59:58
Clock rate = 1600000.000000
We try to factor:
538979174260030195774806654998460878107494491770558596271026789309120102762361871555156107665792312371183237710356585532466537782307442433670499269977421434908027478823657788916992881425413513
qindex, special_q, root, log = 226440 3144173 896787 33
lattice = (101,2321), (1342,291)
after scan, num_alg_success = 267026
num_total_success = 10115
after scan, num_alg_success = 257476
num_total_success = 9885
Int LP bad split: 2
Alg LP bad split: 89
Int cofactor prime: 724
Alg cofactor prime: 8136
Int big cofactor: 20
Alg big cofactor: 5531
Total algebraic hits: 257476
Total combined hits: 20000
Not coprime: 5069
Relations found: 97
==========================================
Total time to process 226440 : 30.290281
Total sieve time = 14.508987
Int line: 0.302529
Alg line: 0.297463
Int vector: 6.974603
Alg vector: 6.934392
Total resieve time = 9.550174
Int resieve: 4.791307
Alg resieve: 4.758867
Trial Int time: 0.097341
Trial Alg time: 1.854874
Find startpts time: 3.034763
Alg scan time: 0.161952
Lattice reduce time: 1.393615
QS/Squfof time: 1.068258
Prepare regions time: 2.366303
Inverse time = 1.141454
Prime Test time = 0.303485

I line-sieve the smallest primes. Resieve time refers to the time to do the resieving while factoring the potential candidates. Find startpts shows the time to compute the sieve start points for all 3 million primes for this special q.

As one can see, 80% of the run time is spent sieving. 4.6% is spent doing lattice reductions on the factor-base primes, and 7.6% is spent preparing the sublattice sieve-region boundaries. Less than 8% is spent doing 64-bit arithmetic. I could cut the time to prepare the sieve boundaries in half by storing, then reusing, some computations, but it would *greatly* increase memory requirements.

Even if we could cut the 64-bit arithmetic time to ZERO, it would not speed up the code by very much. Thus, I have doubts about whether 64 x 64 bit arithmetic on the (say) Opteron would help very much. Comments are solicited.

I will provide the source to those who ask via email. 
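The "Lattice reduce time" entry above covers the per-special-q basis reduction. As an illustration only (this is not Bob's code), the textbook 2D Gauss reduction can be sketched in C: for special q with root r, the lattice { (c,d) : c = r*d mod q } has the obvious basis u = (q, 0), v = (r, 1), and reduction yields the short vectors that define the sieve region's sublattice grid.

```c
#include <stdint.h>

/* Gauss reduction of a 2D lattice basis (u, v): the lattice analogue of
   Euclid's algorithm. Repeatedly subtract the nearest-integer multiple
   of the shorter vector from the longer one, swapping as needed, until
   no subtraction shortens the basis further. */
struct vec2 { int64_t x, y; };

static int64_t dot(struct vec2 a, struct vec2 b) {
    return a.x * b.x + a.y * b.y;
}

void gauss_reduce(struct vec2 *u, struct vec2 *v) {
    if (dot(*u, *u) < dot(*v, *v)) {          /* ensure |u| >= |v| */
        struct vec2 t = *u; *u = *v; *v = t;
    }
    for (;;) {
        int64_t n = dot(*u, *v), d = dot(*v, *v);
        int64_t k = (2*n + (n >= 0 ? d : -d)) / (2*d);   /* round(n/d) */
        u->x -= k * v->x;                     /* u := u - k*v */
        u->y -= k * v->y;
        if (dot(*u, *u) >= dot(*v, *v))       /* no further progress */
            break;
        struct vec2 t = *u; *u = *v; *v = t;  /* swap and continue */
    }
}
```

For q = 3144173, r = 896787 this is the kind of computation that produces a reduced basis like the lattice = (101,2321), (1342,291) line in the log above (those exact vectors come from Bob's siever, not from this sketch).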

2005-07-01, 16:02  #11 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
But faster modular arithmetic would change the break-even point of how much sieving vs. factoring of residuals should be done. To take the point to extremes: if we could do 64-bit arithmetic in ZERO time, we wouldn't sieve at all, because trial division would provide an arbitrary number of smooth relations in zero time.

How much sieving work can we save (by a smaller factor base or a smaller sieve region) if we can handle larger large primes inexpensively?

Alex 
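Alex's trade-off can be made concrete by looking at the accept/reject decision applied to each sieve report's leftover cofactor. A toy C sketch (not from any real siever; it assumes a two-large-prime bound and L < 2^32 so L*L fits in 64 bits): raising L lets more reports survive this test, which is what buys the smaller factor base or sieve region.

```c
#include <stdint.h>

/* Naive primality test; fine for a sketch, far too slow for real use. */
static int is_prime64(uint64_t n) {
    if (n < 2) return 0;
    for (uint64_t p = 2; p * p <= n; p++)
        if (n % p == 0) return 0;
    return 1;
}

/* After sieving, a report's cofactor c is the part of the norm not split
   by factor-base primes. Accept c iff it is 1, one prime <= L, or two
   primes <= L (the two-large-prime variation). Assumes L < 2^32. */
int cofactor_ok(uint64_t c, uint64_t L) {
    if (c == 1) return 1;                 /* fully smooth */
    if (is_prime64(c)) return c <= L;     /* one large prime */
    if (c > L * L) return 0;              /* cannot be two primes <= L */
    /* composite and <= L^2: split it and check both parts */
    for (uint64_t p = 2; p * p <= c; p++)
        if (c % p == 0)
            return p <= L && c / p <= L && is_prime64(c / p);
    return 0;
}
```

The "bad split", "cofactor prime" and "big cofactor" counters in Bob's log above are exactly the reject branches of a test like this; cheap 64-bit arithmetic makes the splitting step (QS/SQUFOF in his timings) cheaper, shifting the optimum toward larger L and less sieving.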