View Single Post
Old 2004-04-20, 09:41   #8
xilman's Avatar
May 2003
Down not across

2·72·109 Posts

Originally Posted by Bob Silverman
The result for 2,1642M should have been HARDER to achieve. It is 10 bits
larger than M811. I do not understand why their matrix was only 7+Million
rows, while that for M811 was 14+M.
The explanation for the huge matrix is both simple and depressing.

Converting raw relations into a matrix involves a process whereby useless relations are discarded and useful ones are merged into relation-sets. Precisely what "merging" means is not particularly relevant to this story so I won't go into details. Relation sets are, as the name suggests, collections of relations each member of which shares a property in common. The property is that at least two members of a set contain the same prime value but that, again, isn't relevant here. What is relevant is that a relation set behaves as if it were a single relation and that, in general, it is "heavier" than a raw relation. "Heavy" relations are bad news for the matrix step and so we make a point of collecting plenty of extra relations and so have an excess which allows us to throw away the very heaviest relation sets and yet still have enough data to enable us to solve the matrix.

The above paragraph is an extremely simplified overview of the filtering stage. Much more detail can be obtained relatively easily but I don't think it's necessary for present purposes.

Now to specific numbers. For the M811 project we collected something over 87 million different raw relations. Without any filtering at all, these would generate a matrix with over 87 million rows and columns. Throwing out the useless relations reduced the data to around 63 million relations. This is still a rather large data set.

A simple merging operation reduced the data to 30.8 million relation sets and indicated that we had an excess of 6.6 million, allowing us to throw away up to that number of the heaviest relation sets. Note that we've already reduced the matrix from 87M to 30.8M by this stage. This point had been reached by 4th March, the day after sieving finished. Progress was so quick because relatively little data was being manipulated and relatively little computation was being performed on it. Nonetheless, the data files were between 4Gb and 7Gb in size and the processing took several hours on a 2.5GHz machine fitted with a gigabyte of RAM.

The next stage was much more memory and cpu intensive, and required much more care in the selection of the parameters fed into the filtering program. I ran several copies, each with slightly different parameters, on machines which have 2Gb RAM each. One run crashed with an out-of-memory error after about a day or so, another did the same after running for almost three days. I restarted each of these with different parameters chosen to reduce memory usage. One of the original programs finished successfully on March 10th after five days running. I took this data and built a matrix out of it which had 10.19M rows and 11.61M columns. Unfortunately, it was useless because there were 6798 duplicate rows. This should not be possible and there is clearly a bug, either in the filter program or the bulidmatrix program or both. Note the date: 10th March. In the wee hours of 12th March I was due to head off to the airport to start a week's vacation.

The next day and a half was unbelievably hectic. Another attempt at filtering produced another useless matrix. I went back a stage and took the data which would generate the 30.8M matrix and ran a very gentle filter on it. This produced a usable but large matrix which has 14,572,496 rows and 15,607,468 columns; it was ready by around noon on the 11th. Other attempts to get a better matrix continued well into the evening without success. At around 9pm on the 11th, I gave up and set the 14.57M matrix going on the cluster. At 3am the next morning I got up to go on vacation.

First day back in the office was 22nd March and, from the progress made so far, I worked out that the matrix would finish in early May. I returned to the attempt to produce a much smaller matrix which could be processed in much less time and so finish before the end of April. No luck. Every attempt produced a useless matrix. I couldn't work out why and describing the symptoms of the failure to various colleagues didn't produce any good leads. On March 31, I put something over 7 gigabytes of data onto the MSR anon-ftp site, from where Don Leclair downloaded it and started his investigation. After more than a week, he found a problem in the MS C runtime I/O library. It turns out that if a program reads more than four gigabytes of data in text mode, something gets confused over how many CRLF pairs it has read and two or more lines can become concatenated. This behaviour does not apparently occur when files are read in binary mode. Needless to say, the relations and matrix files are all humungous text files. The matrix currently running is under 2 gigabytes and the relations fie from which it is built is 3.78Gb and so just squeaks under the threshhold.

Don's result came far too late to build a smaller matrix, which would still have been around 9M in size, so we decided to continue with the one we have running.

I hope this give some insight into the efforts that we have to go to behind the scenes. I should also make clear that although I've concentrated on my experiences, because those are the ones I know in greatest detail, the other members of the NFSNET team also put in heroic efforts. Richard Wackerbarth in partcular spent a great deal of time and resources and his effort has been largely unsung thus far.

xilman is offline