mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

2007-09-14, 11:36   #1
ixfd64

so what GIMPS work can single precision do?

As we know, single precision sucks, because Lucas-Lehmer tests require double precision. While it's possible to emulate double precision on single-precision hardware, it is terribly inefficient: the emulated arithmetic typically delivers only about 10% of the hardware's single-precision throughput.

However, I've heard that trial factoring can be done in single precision. Is this true? If so, it might be a good idea to add GPU support to Prime95 for users who choose to do trial factoring.
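For what it's worth, a toy sketch of why trial factoring needs so little precision: any factor of M_p = 2^p - 1 has the form q = 2kp + 1 with q = +/-1 (mod 8), and testing a candidate is just a modular powering with numbers the size of q. The Python below is only an illustration (the function name and the k bound are made up for the example), not how Prime95 or any GPU code actually does it:

Code:
def mersenne_trial_factor(p, max_k=10**6):
    # Any factor of M_p = 2^p - 1 is of the form q = 2*k*p + 1 with q = +/-1 (mod 8).
    # pow(2, p, q) only ever handles numbers the size of q, so the precision needed
    # is set by the factor bound, not by the multi-million-bit M_p itself.
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):
            continue
        if pow(2, p, q) == 1:          # q divides 2^p - 1
            return q
    return None

print(mersenne_trial_factor(11))       # 23  (2^11 - 1 = 23 * 89)
print(mersenne_trial_factor(23))       # 47  (2^23 - 1 = 47 * 178481)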

2007-09-14, 16:40   #2
Mr. P-1

In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software using the operations they do provide. Single-precision hardware could be programmed in much the same way. There would be no need to emulate double-precision instructions.

The performance would be abysmal.

The same is true for trial factoring, except that multi-million-bit precision isn't necessary. Double precision is enough.

I've no doubt that a GPU could be programmed to do useful work for GIMPS. The question is: is it a worthwhile project for the programmers to expend their scarce time on?
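To make "using the operations they do provide" concrete, here is a minimal sketch (the textbook "double-single" building block, not anything Prime95 contains; two_sum_f32 is a made-up helper name) of recovering the rounding error of a single-precision add. Chaining primitives like this is how extra precision gets emulated, and it also shows why the overhead is so large: one emulated add costs several hardware adds.

Code:
import numpy as np

def two_sum_f32(a, b):
    # Knuth's error-free transformation: all operands are float32, so every
    # intermediate rounds in float32. Returns (s, e) with s + e == a + b exactly.
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

a = np.float32(1.0)
b = np.float32(2.0 ** -30)     # too small to survive a plain float32 add
s, e = two_sum_f32(a, b)
print(np.float32(a + b))       # 1.0 -- the 2^-30 contribution is rounded away
print(float(s) + float(e))     # 1.0000000009313226 -- recovered from the (s, e) pair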

2007-10-11, 17:03   #3
ShiningArcanine

Quote:
Originally Posted by Mr. P-1
In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software using the operations they do provide. Single-precision hardware could be programmed in much the same way. There would be no need to emulate double-precision instructions.

The performance would be abysmal.

The same is true for trial factoring, except that multi-million-bit precision isn't necessary. Double precision is enough.

I've no doubt that a GPU could be programmed to do useful work for GIMPS. The question is: is it a worthwhile project for the programmers to expend their scarce time on?
Given the cost of a Lucas-Lehmer test and the fact that the CPUs currently doing trial factoring would be freed to do Lucas-Lehmer tests, I think this would be a worthwhile project.

For example, say half of the CPUs in PrimeNet are running Lucas-Lehmer tests and half are trial factoring. Trial factoring eliminates many composite Mersenne numbers from consideration, which results in an immense reduction in the processing time needed to handle a given segment of the Mersenne number line compared with the case where there is no trial factoring. If GPUs were recruited for trial factoring and the CPUs that had been trial factoring were reassigned to Lucas-Lehmer tests, then the time needed to handle a given segment of the Mersenne number line would be halved, as the trial factoring would be done for "free."

I make a few assumptions in that example, such as the CPUs in PrimeNet being evenly divided between Lucas-Lehmer tests and trial factoring, and either all of the CPUs being single-core with the same specifications or the CPUs being evenly divided along lines of processing power. I also assume that the GPUs recruited for PrimeNet would be able to trial factor numbers at a rate sufficient to produce more trial-factored Mersenne candidates than the CPUs would be able to test; this is the most important assumption, as it is my basis for claiming that trial factoring would be done for "free."

Last fiddled with by ShiningArcanine on 2007-10-11 at 17:04

2007-10-11, 18:17   #4
davieddy

Quote:
Originally Posted by Mr. P-1
In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software using the operations they do provide. Single-precision hardware could be programmed in much the same way. There would be no need to emulate double-precision instructions.

The performance would be abysmal.
I would like to present my "understanding" of this topic,
and welcome correction where appropriate.
In a typical LL test, each iteration needs to
square a 40M bit number.
In a 2048K FFT the number is first split into 2^21
"elements" each containing ~20 bits.
At the midway stage of the FFT these elements are squared.
These (by now) irrational quantities are represented to sufficient
accuracy by the standard 53-bit "double precision".
If we were to try this in single precision, most of the bits would be
needed for the "precision", leaving few bits (if any) to be the
"significant" bits in each element.

If precision greater than "double" (=53 bit) were available, how
much help would this be? I can see that a Law of Diminishing
Returns applies here.

David

PS I have read that the mod(2^exponent-1) step comes "free".
At what point does it come in? Does it mean that we don't
need 40 significant bits at any stage?

Last fiddled with by davieddy on 2007-10-11 at 19:17

2007-10-11, 22:57   #5
Mr. P-1

Quote:
Originally Posted by davieddy
I would like to present my "understanding" of this topic,
and welcome correction where appropriate.
In a typical LL test, each iteration needs to
square a 40M bit number.
Yes, for 40M exponents. The size of the number in bits is exactly the exponent.

Quote:
In a 2048K FFT the number is first split into 2^21
"elements" each containing ~20 bits.
I don't know how many bits there are per element.

Quote:
At the midway stage of the FFT these elements are squared.
It's not clear whether you are saying that the result of performing an FFT to "the midway stage" is that the elements are squared, or that squaring is an operation that is performed on elements "at the midway stage".

In fact, what happens is that at the midway stage of the entire process, i.e., when the forward transform is complete, the transformed elements have to be explicitly squared. Then an inverse FFT is performed. At no point are the untransformed elements squared.

The entire process is FFT, element-by-element squaring, inverse-FFT.

This may be what you meant.
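For concreteness, a toy Python/numpy sketch of that pipeline: split into chunks, forward transform, element-by-element square, inverse transform, round, recombine. It uses a plain zero-padded FFT rather than the irrational-base weighted transform Prime95 really uses, so treat it purely as an illustration of the FFT / pointwise-square / inverse-FFT structure:

Code:
import numpy as np

def square_via_fft(n, chunk_bits=8):
    # Split n into little-endian chunks of chunk_bits bits.
    base = 1 << chunk_bits
    digits = []
    while n:
        digits.append(n & (base - 1))
        n >>= chunk_bits
    # Zero-pad so the cyclic convolution equals the plain convolution.
    size = 1
    while size < 2 * max(len(digits), 1):
        size *= 2
    a = np.zeros(size)
    a[:len(digits)] = digits
    fa = np.fft.fft(a)                     # forward transform
    sq = np.fft.ifft(fa * fa).real         # element-by-element square, inverse transform
    coeffs = np.rint(sq).astype(np.int64)  # round away the floating-point noise
    # Recombine (a real implementation would propagate carries chunk by chunk).
    return sum(int(c) << (chunk_bits * i) for i, c in enumerate(coeffs))

x = 123456789
assert square_via_fft(x) == x * x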

Quote:
These (by now) irrational quantities are represented to sufficient
accuracy by the standard 53-bit "double precision".
The transformed quantities are complex, i.e., have real and imaginary components, not merely irrational. I don't know whether the imaginary part automagically disappears in the inverse transform, or if it has to be folded into the real part somehow.

Quote:
If we were to try this in single precision, most of the bits would be
needed for the "precision", leaving few bits (if any) to be the
"significant" bits in each element.
I'm not sure that "most" would be. I would expect the number of bits available to be about half that of a double, meaning that the total number of FLOPs would be multiplied by about 4.

Quote:
If precision greater than "double" (=53 bit) were available, how
much help would this be? I can see that a Law of Diminishing
Returns applies here.
I don't think that law does, at least not at the level of the software implementation. (It probably does at the hardware/microcode level.) The limiting case is where the entire number can be represented within a single huge-precision operand. Clearly this would require only a few FLOPs to transform, square, and inverse-transform (and of course we could just square it directly).

Quote:
PS I have read that the mod(2^exponent-1) step comes "free".
At what point does it come in? Does it mean that we don't
need 40 significant bits at any stage?
My understanding is that the result at the end of the inverse transform is the square of the operand (mod Mp), rather than just the square of the operand. That's how I interpret the claim that this step comes "free" .

2007-10-11, 23:03   #6
jasonp

Quote:
Originally Posted by davieddy
At the midway stage of the FFT these elements are squared.
These (by now) irrational quantities are represented to sufficient
accuracy by the standard 53-bit "double precision".
If we were to try this in single precision, most of the bits would be
needed for the "precision", leaving few bits (if any) to be the
"significant" bits in each element.
This is correct, except that there is an inverse FFT performed to get the results of the squaring. If the numbers were split into 2^k chunks of B bits each, then each chunk of the product would need 2B+k bits when the squaring finishes. If 2B+k exceeds 53-bit double precision, then at least some of the results get truncated and the computation is spoiled.

It's actually more complicated than this, because we can use balanced representation, where chunks take values up to +-2^(B-1). In this case the squaring results need at most 2B+k-2 bits, but the average value is zero so this expression can exceed 53 bits in size and still recover the bits of the product with overwhelming probability.
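A quick scaled-down experiment (random chunks, nothing taken from any real FFT code) shows the effect: unsigned chunks push the largest product coefficient up near the 2B+k worst case, while balanced chunks leave it many bits smaller.

Code:
import numpy as np

rng = np.random.default_rng(0)
k, B = 12, 10                      # 2^12 chunks of 10 bits (scaled down from 2^21 x ~20)
n = 1 << k

unsigned = rng.integers(0, 1 << B, n)                       # chunks in [0, 2^B)
balanced = rng.integers(-(1 << (B - 1)), 1 << (B - 1), n)   # chunks in [-2^(B-1), 2^(B-1))

def max_coeff_bits(digits):
    conv = np.convolve(digits, digits)     # exact integer convolution (fits in int64 here)
    return int(np.abs(conv).max()).bit_length()

print("worst-case bound 2B+k :", 2 * B + k)
print("unsigned chunks       :", max_coeff_bits(unsigned))   # close to the bound
print("balanced chunks       :", max_coeff_bits(balanced))   # typically many bits smaller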
Quote:
If precision greater than "double" (=53 bit) were available, how
much help would this be? I can see that a Law of Diminishing
Returns applies here.

David

PS I have read that the mod(2^exponent-1) step comes "free".
At what point does it come in? Does it mean that we don't
need 40 significant bits at any stage?
To a certain extent this is idle speculation, because no one except Intel/AMD can build an extended-precision FPU that's as fast as the ordinary FPU on an Intel/AMD processor, and there is no business case to add such an expensive feature to the processors. Nonetheless, I doubt that it would help all that much.

The mod is free in that you do not need 2^(k+1) chunks to hold the squaring result, but each chunk is still going to need extra bits to be represented exactly.
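A toy way to see the "free" mod: since 2^p = 1 (mod 2^p - 1), the high half of the product wraps onto the low half, and that wraparound is exactly what a length-n cyclic convolution of the chunks computes. The check below only covers the easy case where p is a multiple of the chunk size; the irrational-base weighted transform is what handles arbitrary p in real code.

Code:
import numpy as np

p, B = 64, 8                  # toy parameters: 8 chunks of 8 bits, p = 64
n = p // B
x = (1 << p) - 2              # an arbitrary residue mod 2^p - 1
digits = [(x >> (B * i)) & ((1 << B) - 1) for i in range(n)]

fa = np.fft.fft(digits)
# Length-n cyclic convolution: no zero padding, so the "high half" wraps around.
coeffs = np.rint(np.fft.ifft(fa * fa).real).astype(np.int64)

y = sum(int(c) << (B * i) for i, c in enumerate(coeffs)) % ((1 << p) - 1)
assert y == (x * x) % ((1 << p) - 1)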

2007-10-12, 00:30   #7
davieddy

Quote:
Originally Posted by jasonp
This is correct, except that there is an inverse FFT performed to get the results of the squaring. If the numbers were split into 2^k chunks of B bits each, then each chunk of the product would need 2B+k bits when the squaring finishes. If 2B+k exceeds 53-bit double precision, then at least some of the results get truncated and the computation is spoiled.
I can't see where the "+k" comes into it, and since B ~ 20 and k = 21,
2B+k = 61, so it is just as well.
All that is clear is that the exact square of a 40M-bit number
contains 80M bits, suggesting that the representation of each element
must hold a 2B-bit number plus sufficient precision for its accurate
derivation to the nearest integer.

2007-10-12, 00:40   #8
davieddy

Quote:
Originally Posted by Mr. P-1
The entire process is FFT, element-by-element squaring, inverse-FFT.

This may be what you meant.
It was.
BTW 40M/2^21 ~ 19

Last fiddled with by davieddy on 2007-10-12 at 00:42

2007-10-12, 03:24   #9
jasonp

Quote:
Originally Posted by davieddy
I can't see where "+k" comes into it, and since B~20 and k=21
2B+k = 61 so it is just as well.
All that is clear is that the exact square of a 40M bit number
contains 80M bits, suggesting that the representation of each element
must represent a 2B bit number + sufficient precision for its accurate
derivation to the nearest integer.
Multiplying two numbers, each split into 2^k elements of B bits, produces a 'multiplication pyramid' where each element of the product is the sum of at most 2^k numbers, each of 2*B bits. That sum can be up to 2*B+k bits in size. For the Mersenne-mod case, the high half of the product wraps around to the low half, which means every element of the product is the sum of exactly 2^k 2B-bit numbers.

You are correct that if all the elements are positive, then double precision is not enough to represent the result without fatal errors. However, we use balanced representation: each input element is a positive or negative (B-1)-bit integer. The multiplication behaves the same way, in that you still have to add up 2^k of these 2*(B-1)-bit numbers for each element of the product, but now approximately half the numbers are positive and half are negative, so their sum is going to be zero on average. Of course it won't be precisely zero, but the exact value of the sum is most likely to be a small number near zero, with values farther from zero being exponentially less likely. Sums so large in absolute value that they exceed a 53-bit integer are considered so unlikely that they are assumed never to happen, even over all the squarings that a single LL test performs.

To answer the original question, if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating point multiplies, then it's possible to use number theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen.
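For what an "integer FFT" looks like: the same transform / pointwise-square / inverse-transform structure, but over integers mod a prime, so there is no rounding error at all. This is a deliberately naive O(n^2) toy; the prime 998244353 = 119*2^23 + 1 and primitive root 3 are standard textbook NTT choices, not anything GIMPS uses.

Code:
P = 998244353            # 119*2^23 + 1, a prime with a large power-of-two factor in P-1
G = 3                    # a primitive root of P

def ntt(a, root):
    # Naive O(n^2) number theoretic transform of a modulo P.
    n = len(a)
    return [sum(a[j] * pow(root, i * j, P) for j in range(n)) % P for i in range(n)]

def square_via_ntt(x, chunk_bits=8):
    base = 1 << chunk_bits
    digits = []
    while x:
        digits.append(x & (base - 1))
        x >>= chunk_bits
    size = 1
    while size < 2 * len(digits):       # pad so cyclic convolution = plain convolution
        size *= 2
    digits += [0] * (size - len(digits))
    w = pow(G, (P - 1) // size, P)      # a primitive size-th root of unity mod P
    fa = ntt(digits, w)
    sq = ntt([v * v % P for v in fa], pow(w, P - 2, P))   # inverse = transform with w^-1...
    inv_size = pow(size, P - 2, P)                        # ...times 1/size
    coeffs = [v * inv_size % P for v in sq]               # exact convolution coefficients
    return sum(c << (chunk_bits * i) for i, c in enumerate(coeffs))

x = 123456789
assert square_via_ntt(x) == x * x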

2007-10-12, 09:11   #10
davieddy

Quote:
Originally Posted by jasonp
To a certain extent this is idle speculation, because no one except Intel/AMD can build an extended-precision FPU that's as fast as the ordinary FPU on an Intel/AMD processor, and there is no business case to add such an expensive feature to the processors. Nonetheless, I doubt that it would help all that much.
I thought this speculation was highly relevant to the original
question: if 53-bit precision is needed to handle elements whose
starting length is ~19 bits, one can envisage that single precision
can't do the task at all. It is natural to speculate that 106-bit
precision might enable elements with > 38 bits to be handled.
Maybe my suggested "Law of diminishing returns" has already
set in at 53-bit precision.

2007-10-12, 17:50   #11
ewmayer

Quote:
Originally Posted by davieddy
if 53-bit precision is needed to handle elements whose starting length is ~19 bits, one can envisage that single precision can't do the task at all. It is natural to speculate that 106-bit precision might enable elements with > 38 bits to be handled. Maybe my suggested "Law of diminishing returns" has already set in at 53-bit precision.
Actually, in terms of allowable bits per input word as a function of FPU mantissa precision, you in fact get *accelerating* returns [though things asymptote quite quickly to linear once you get to DP and beyond]. This is because with increasing precision, the average fraction of bits of each float which is "noise" becomes relatively smaller with respect to the fraction that is "signal", for a suitably larger input word size. [And there is also a slight secondary error-reduction effect which comes from larger input words --> smaller transform length --> less round-off error accumulation.]

But as I and others have noted, the main gain in this respect is in going from SP to DP, and there is no market case for chip makers to add dedicated high-speed 128-bit-float support - the market for it is simply too small to justify the silicon needed. And the folks who really, really do need it can just use 128-bit-float emulation, and throw more CPUs at the problem if they need more throughput.
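To put rough numbers on that, using the pessimistic worst-case 2B + k bound from earlier in the thread (the 4-bit safety margin below is an arbitrary illustrative choice, and the balanced-representation argument is what lets real code get ~19-20 bits per word out of a double at 2048K):

Code:
# Back-of-envelope: with 2^k chunks of B bits each, chunk products accumulate to
# roughly 2B + k bits, which must fit in the FPU mantissa with some safety margin.
def max_chunk_bits(mantissa_bits, k, safety=4):
    return (mantissa_bits - k - safety) // 2

k = 21                                  # 2048K-point transform
for name, mant in [("single (24-bit mantissa)", 24),
                   ("double (53)", 53),
                   ("hypothetical 113-bit quad", 113)]:
    b = max(max_chunk_bits(mant, k), 0)
    print(f"{name}: ~{b} bits/chunk -> exponents up to ~{b * (1 << k):,}")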

Last fiddled with by ewmayer on 2007-10-12 at 17:58