mersenneforum.org  

2007-07-21, 18:56   #12
nibble4bits (Nov 2005)

Dividing by a power of the base is the same as right-shifting the number in that base, and has no real cost savings over a multiply. Of course, the methods I'd prefer use multiplication, but you could use an if-then statement to handle the exceptional values small enough to make left-shifting inefficient. :)

You would be counting the number of leftmost zero bits in a row. For example, if the number is represented as an unsigned 64-bit binary value, then for finding the base-2 log rounded down to the nearest integer, one could just use repeated multiplication by 2 (equivalently, shifting).
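
A minimal sketch of that shift-and-count idea (my own illustration, not code from the thread; floorLog2 is an invented name):

Code:
#include <cstdint>

// Count how many times x can be halved. That position of the top set
// bit is exactly floor(log2(x)). Illustration only.
int floorLog2(uint64_t x) // requires x > 0
{
    int log = 0;
    while (x >>= 1) // shift right until the top set bit is consumed
        ++log;      // each shift accounts for one power of 2
    return log;
}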

Another method is a binary search over the powers of 2 from 0 to 63.
Compare X to 2^32, then 2^(32±16), then 2^(32±16±8), and so on.
This seems overly complicated compared to a tight Branch on Less Than/Branch on Carry loop combined with a repeated left shift. However, for much larger powers of 2, the savings can start to add up. Do the cycle counts for a 64-bit word versus a 65536-bit word, ignoring overhead from arbitrary precision; even a rough estimate gives an obvious result. Of course, a whole book could be written on implementing efficient arbitrary-precision versions of algorithms and still only cover a fraction.
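
The binary-search variant in the same hypothetical style: only 6 comparisons for a 64-bit word instead of up to 63 shifts (again my own sketch, name invented):

Code:
#include <cstdint>

// Probe the exponent range 0..63 by halving it: the 2^(32 +/- 16 +/- 8 ...)
// walk described above, done with shifts instead of precomputed powers.
int floorLog2BinarySearch(uint64_t x) // requires x > 0
{
    int log = 0;
    for (int step = 32; step >= 1; step >>= 1)
        if (x >> (log + step)) // is x >= 2^(log + step)?
            log += step;       // keep the upper half of the range
    return log;
}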

Of course, CPUs often implement a fast method for this in at least one of their opcodes. It's still nice to know how it works, even if they're doing it for you at that level.
2007-07-21, 23:38   #13
fenderbender (Jul 2007)
Fast determination of squares

I spent several days experimenting with making an efficient square detection routine. There's never going to be "one best way", especially when you get down to actual CPU behavior, but I've optimized the function about as far as I can.

The goal is to have a simple "isSquare(n)" routine which returns true or false. We are NOT making assumptions about n itself; there would be better routines if you knew some properties of n, or were computing squaritude of many sequential or evenly spaced values.



Some discoveries and observations:
  • The computation of an integer square root (and then comparing its square to the original number) is the practical way to prove squaritude. This can be quite efficient, but is dominated by about log2(n) big-integer divisions. There are lots of variations and improvements to this algorithm (a one-bit divider is especially interesting), but overall you can't get rid of that O(log(n)) factor, so it becomes dominant quickly; see the sketch after this list. Optimizing isSquare() is all about quickly screening out as many non-squares as possible before you finally fall back on the big (and comparatively slow) intSqrt() gun.
  • As expected, computing n mod m with BigInt numbers is also slow compared to native 32-bit CPU integer math. It depends on n's magnitude, but even for small n there's a lot of overhead: you can do dozens of native 32-bit integer mods in the time it takes to take the mod of a 50-digit number. The bigint mod speed is not hugely sensitive to the value of the modulus itself, though. We can use this and try to keep most of our math in the 32-bit realm. [Using 64-bit native CPU math is also a good idea to try! But I only dealt with a 32-bit version to start with.]
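
For reference, here is the kind of Newton-iteration integer square root the screening tries to avoid. This is a 64-bit toy of my own, not the BigInt routine used in the timings below; note that every pass through the loop costs one division:

Code:
#include <cstdint>

// Classic Newton integer square root: x -> (x + n/x) / 2, iterated
// until it stops decreasing. On big integers each division is expensive,
// and there are about log2(n) of them. Illustration only.
uint64_t intSqrt64(uint64_t n)
{
    if (n < 2) return n;
    uint64_t x = n, y = (n + 1) / 2;
    while (y < x) {
        x = y;
        y = (x + n / x) / 2;
    }
    return x; // floor(sqrt(n))
}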

So, the basic idea is similar to the one outlined in the original post: square numbers always have a limited number of residues for a given modulus. For example, a^2 mod 32 only has 7 possible residues. You can immediately reject the nonsquare residues and avoid having to use the big and slow intSqrt(). We use this idea for screening.
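
The residue counts quoted below are easy to reproduce with a few lines (my own checker, name invented):

Code:
// Count how many residues mod m a square can actually land on.
// countSquareResidues(32) == 7, (64) == 12, (128) == 23, (256) == 44.
int countSquareResidues(unsigned m) // m <= 256 for this table size
{
    bool seen[256] = {false};
    for (unsigned k = 0; k < m; ++k)
        seen[(k * k) % m] = true;  // mark every residue a square can take
    int count = 0;
    for (unsigned r = 0; r < m; ++r)
        count += seen[r];
    return count;
}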

Two decisions are 1) which moduli are best to use, and 2) how you can efficiently determine whether a residue is acceptable or not.

One answer to #2 is a simple lookup table. You can trade off the size of the table against rejection efficiency. I won't go into huge detail, but tables of less than 64 bytes are REALLY fast, and tables of less than about 1000 bytes are pretty darned fast. [This likely has to do with CPU L1 cache sizes and cache line sizes.] Lookups into tables of more than about 5K bytes degrade noticeably, likely because of memory latency when fetching from the deeper L2 cache. So tables are good, but they can't be too big!

Storing a table as bits is space efficient, but it has real performance penalties, because you need to do some math to figure out which bit to access and build masks to select just that bit. This is slow compared to the actual mod computation, so it's better to use one byte per residue.
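
A byte-per-residue table for a single modulus looks like this (my sketch, names invented; at 63 bytes it lands in the "REALLY fast" bucket above):

Code:
static unsigned char isSquareResidueMod63[63]; // 1 = possible square residue

// One-time setup: mark every residue a square can take mod 63.
void initResidueTable()
{
    for (unsigned k = 0; k < 63; ++k)
        isSquareResidueMod63[(k * k) % 63] = 1;
}

// One byte load per test: no bit math, no masks.
bool maybeSquareMod63(unsigned long m) // m = n % 63
{
    return isSquareResidueMod63[m] != 0;
}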

So are tables the right answer? Jumping to my conclusion (after lots of experimentation and analysis): if you're doing extensive rejection, the sizes start to build quickly, and it's better to do multiple sequential tests with smaller tables. But if the tables get small enough, you can even use a Bloom filter instead.. and if you can compress the Bloom filter table into about 64 bits.. it's scary-fast! I'll show the results.


Easiest initial 2^s residues

So if the basic plan is to test modulii sequentially, which modulii should we use?

One initial step is to start with a 2^s modulus, simply because the internal storage of big integers lets us read off the 2^s residues trivially.

If we use mod 32, we know a square will have a residue of 0, 1, 4, 9, 16, 17, or 25. If we have any of the other 25 residues, we can immediately return "nonsquare." This happens 25/32 = 78% of the time.

If we use mod 64, we can reject 52 residues, a slightly higher proportion of 81%.

Mod 128 lets us reject 105 residues, which is 82% rejection. Mod 256 gives 82.8% rejection. Obviously, higher powers give quickly diminishing returns.

Which power of 2 should we use? The answer is whatever is most efficient at rejecting. It turns out that mod 128 is the largest size that fits in a very small inline Bloom filter, and it's not worth using a more complex filter to get the (small, fractional-percentage) extra rejection of higher powers.


Bloom Filters

A Bloom filter is a kind of hash table used to determine whether an element is in one of two predefined sets. Depending on the size of the table, this discrimination can be perfect (every input is properly classified) or partial (all members of set A are properly identified as being in set A, but there may be some false positives from set B). This works very well for our residue problem, where we MUST identify all possible squares.. we can't reject them in error. But it's OK to accept a few nonsquares in a rejection step; we'll just try to keep the false acceptance as low as possible. Most residue tables will have perfect discrimination, though.

We can encode the 105 bad residues for mod 128 in a perfect discriminator which looks like:

Code:
bool mod128Discriminator(unsigned long n) // 0 <= n < 128
{
   // nonzero (true) means n is NOT a possible square residue mod 128
   return (n*0x8bc40d7d) & (n*0xa1e2f5d1) & 0x14020a;
}
It works by making a hashed mask which has about 1/4 of the bits set. Here the hash is just a super-simple multiply! The first "&" combines two hashes to reduce the bit population to about 1/4; different inputs n give different masks. All 23 acceptable residues have their masks ORed together. We take the bitwise NOT of that mask, determining the bits that are never set for square residues. So if our result has a "bad" bit set (tested by the second "&"), we reject the residue. Picking the right hash multipliers makes this function accept all 23 good residues and reject all 105 bad ones.

The compactness of this filter makes it (surprisingly) faster than a table. The problem is that as the residue list gets bigger, the function has to grow, so (as expected) there's a size/speed tradeoff.

We were lucky with mod 128; it encoded very efficiently. We can perfectly encode most moduli up to around 60 with a function in this format. Higher values will start leaking in bad residues (which isn't fatal, but it does reduce our rejection efficiency). Tradeoffs, tradeoffs.

Creating these filters involves picking the hash function and then just varying the settings until one passes with an acceptable number of false positives. The computer can churn through a billion candidates (just use a random number generator) until it finds one that succeeds; a sketch of the search is below. Of course, some moduli may need more complex hashes, but all of our moduli of interest fit very smoothly into this small, compact style.
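
Here's what one trial of that random search could look like, assuming the two-multiplier form used above (my reconstruction; the struct and function names are invented):

Code:
#include <cstdint>
#include <cstdlib>

struct Filter { uint32_t h1, h2, badBits; };

// Try one random pair of hash multipliers. OR together the masks of all
// good residues; the complement is the set of bits never set by a good
// residue. The filter is perfect if every bad residue lights up at least
// one of those bits. Loop on this until it returns true.
bool tryRandomFilter(const bool isGood[], unsigned mod, Filter &out)
{
    uint32_t h1 = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
    uint32_t h2 = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

    uint32_t goodBits = 0;
    for (unsigned r = 0; r < mod; ++r)
        if (isGood[r]) goodBits |= (r * h1) & (r * h2);

    uint32_t badBits = ~goodBits;
    for (unsigned r = 0; r < mod; ++r)
        if (!isGood[r] && !((r * h1) & (r * h2) & badBits))
            return false; // this bad residue slips through; try new multipliers

    out = { h1, h2, badBits };
    return true;
}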


Picking Moduli

So we're going to check mod 128 first; what about other moduli? First, remember that bigint mods are slow, so we want to take as few as possible. But we can minimize this expense by using only one bigint modulus that's < 2^32, then using CPU 32-bit math to find sub-moduli. It's grabbing a big chunk of data with the slow algorithm, then using fast math to peel off subchunks a bite at a time.

Examining prime moduli, we see that each independent prime p has (p-1)/2 "nonsquare" residues. Larger primes are slightly better at rejection than small primes, but they have more residues and are harder to fit into a small discrimination function.

More importantly, we're limited by that 32-bit word size: our "big chunk" of data needs to fit into 2^32, so larger values mean fewer moduli. That bigint modulus is expensive, so we want to squeeze as much as we can out of a single call.

What about composite moduli? Well, for powers of a prime you get better and better rejection (as we saw with 2^n), but with diminishing returns. You get better rejection using both 3 and 5 than you do using 25 or 625 or even 5^50.

A composite modulus of two (or more) primes will reject the same residues that either prime would reject by itself, so it's like testing both primes at once. [But large numbers of residues won't fit into that efficient Bloom filter, so we can't push this too far.]

So, we should use the largest number of primes that will fit into 2^32.
That's 3*5*7*11*13*17*19*23*29 = 3234846615, and we should do 9 tests, right?

Almost.. there's a little more efficiency we can pull out. First, while the powers of primes have less "bang for the buck", the low primes 3 and 5 provide a lot more rejection-per-bit than higher primes.. enough that it's better to use 9 as a modulus rather than 3, even though it means we can fit in fewer primes. (It takes a lot of combinatorial testing to figure out whether 27 is worthwhile, or even 81, or 5 versus 25, or 49, etc.) The best rejection ends up being 9*25*7*11*17*19*23*31 = 3989930175. Note that 13 is missing; you get slightly better rejection by skipping it and shifting the remaining primes higher. I'm skipping a lot of experimentation steps here to just present the results.

We can make an efficient (and perfect) Bloom filter for each component.. in fact we can also find a perfect filter for 63 = 7*9, combining two of the tests into one.

What order should we do the tests in? They're independent, so we order them from best rejection to least. Mod 63 gives a rejection of about 75% by itself; 25 gives 56%; the remaining primes each provide just under 50%. So we test in the order 63, 25, 31, 23, 19, 17, 11.



The Code

So, combining these, we get a routine which uses a mod 128 test to reject 82% of inputs without doing any big-integer math at all. We then take one big-integer modulus and use fast 32-bit math to test residues of 7 different independent bases. The net rejection rate is 99.9+%, so we'll only need to compute the intSqrt for fewer than 1 in 1000 non-squares.

This is C++ code but it's very straightforward.

Code:
bool isSquare(const BigInt &n)
{
  unsigned long m;

  // Start with mod 128 rejection: 82% rejection rate.
  // VERY fast, can read the bits directly.
  m = n.lowWord() & 127; // n mod 128
  if ((m*0x8bc40d7d) & (m*0xa1e2f5d1) & 0x14020a) return false;

  // All other moduli share one BigInt modulus.
  unsigned long largeMod = n % (63UL*25*11*17*19*23*31); // SLOW, bigint modulus

  // residues mod 63: 75% rejection
  m = largeMod % 63; // fast, all 32-bit math
  if ((m*0x3d491df7) & (m*0xc824a9f9) & 0x10f14008) return false;

  // residues mod 25: 56% rejection
  m = largeMod % 25;
  if ((m*0x1929fc1b) & (m*0x4c9ea3b2) & 0x51001005) return false;

  // residues mod 31: 48.4% rejection
  // (this Bloom filter has a slightly different form to keep it perfect)
  m = 0xd10d829a*(largeMod % 31);
  if (m & (m+0x672a5354) & 0x21025115) return false;

  // residues mod 23: 47.8% rejection
  m = largeMod % 23;
  if ((m*0x7bd28629) & (m*0xe7180889) & 0xf8300) return false;

  // residues mod 19: 47.4% rejection
  m = largeMod % 19;
  if ((m*0x1b8bead3) & (m*0x4d75a124) & 0x4280082b) return false;

  // residues mod 17: 47.1% rejection
  m = largeMod % 17;
  if ((m*0x6736f323) & (m*0x9b1d499) & 0xc0000300) return false;

  // residues mod 11: 45.5% rejection
  m = largeMod % 11;
  if ((m*0xabf1a3a7) & (m*0x2612bf93) & 0x45854000) return false;

  // Net nonsquare rejection rate: 99.92%

  // We COULD extend to another round, doing another BigInt modulus and
  // then followup rejections here using the primes 13 29 37 41 43 53.
  // That'd give 98% further rejection. Empirical timing shows this
  // second round would be useful for n > 10^100 or so.

  // VERY expensive final definitive test
  BigInt root(intSqrt(n));
  return root*root == n; // true if and only if n is a square
}

Results:


As expected, the new discrimination function is very fast. Below are some tests which are clearly CPU-, compiler-, and especially BigInt-class-dependent, but still representative. Each times 2^25 evaluations for various n. The efficiency gain grows with the magnitude of n, due to the O(1) nature of the screening tests; larger numbers show an even more dramatic disparity.


(Hmm, no table support on this forum? I'll use code view, so it's ugly but readable.)


Code:
                                            n ~= 10^40       n ~= 10^80
Using brute force intSqrt() tests:          49.95 secs      147.69 secs
Using mod 128 discrimination first:          9.65 secs       27.73 secs
Using full discrimination tests first:       0.65 secs        0.84 secs

Profiling shows that the n ~= 10^40 tests spend about 40% of their time computing the "largeMod", about 35% doing all the individual residue tests, and about 25% on the integer square root. Larger n increases the largeMod fraction, but eventually the intSqrt begins to dominate overall.
2007-07-22, 02:51   #14
alpertron (Aug 2002, Buenos Aires, Argentina)

fenderbender, could you please adapt the following UBASIC code I wrote for intsqrt and run it again? Notice that there are no divisions.

Code:
 10   input N:N=N^2
 20   X=1
 30   M=2
 40   for T=1 to 6
 50   M=M*M
 60   X=X*(3-N*X*X)\2@M
 70   next T
 80   X=X*(3-N*X*X)\2@M
 90   X=X*N@M
100   if X>=2^63 then print (-X)@2^64 else print X
110   goto 10
Notice that the input number must be odd, so you will need to divide the original number by a power of 4 so that the result is odd. Of course, you will need to erase lines 10 and 110. In this program the input number must have fewer than 64 bits; you can increase the allowed size of the input by increasing the stop value on line 40.
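
For readers who don't run UBASIC, here is a 64-bit C++ transcription (my own sketch, not alpertron's code; the function names are invented, and the 128-bit intermediates stand in for UBASIC's exact arithmetic):

Code:
#include <cstdint>

// Newton iteration for the inverse square root, done 2-adically:
// x -> x*(3 - n*x*x)/2 roughly doubles the precision each pass
// (the "/2" is a shift, so there are still no real divisions).
// For an odd perfect square n this returns the exact integer root;
// for anything else the result is garbage, so the caller must verify
// by squaring. Uses GCC/Clang's unsigned __int128. Sketch only.
uint64_t rootOfOddSquare(uint64_t n) // n odd
{
    unsigned __int128 x = 1, N = n;
    for (int t = 0; t < 7; ++t)          // 6 loop passes + line 80's extra pass
        x = x * ((3 - N * x * x) / 2);
    x *= N;                              // x = n * n^(-1/2) = +/- sqrt(n)
    uint64_t r = (uint64_t)x;
    return (r >> 63) ? (uint64_t)(0 - r) : r; // line 100's normalization
}

// Division-free squareness test in the spirit of the thread.
bool isSquare64(uint64_t n)
{
    if (n == 0) return true;
    while ((n & 3) == 0) n >>= 2;        // divide out powers of 4 (see above)
    if ((n & 7) != 1) return false;      // odd squares are always 1 mod 8
    uint64_t r = rootOfOddSquare(n);
    return r < (1ULL << 32) && r * r == n; // exact verification
}

For example, isSquare64(9409) finds the root 97 and returns true; a non-square falls out either at the mod-8 check or at the final comparison.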
2007-07-28, 23:24   #15
fenderbender (Jul 2007)

Alpertron and I had some discussion offline. His algorithm is terrific because, as he says, it uses only multiplications, and those will scale very nicely with large n. The algorithm isn't general purpose, though, since it only successfully computes the square roots of perfect odd squares. If you feed in anything else, the output is garbage... so it's not useful for computing isqrt() in general. (A very interesting question is how to modify it into a general isqrt().) But for the specific case of testing for squares, this limitation is no problem at all, since we don't need the output except to verify that its input was a square.

Thanks much to alpertron for the boost! I had actually used the idea extensively in my optimized floating-point library years ago, but I did not think it would be applicable to integer math.. I'm learning a lot!

I've implemented his algorithm, though using simple O(n^2) multiplication for now. I haven't optimized the implementation (i.e., doing the first startup loops in 32-bit math), so in practice it's actually slower than my tuned intSqrt() routine even for large N, like 2^1024. However, with tuning (and especially a better multiplier implementation) his algorithm will clearly scale better.

For tests of N up to 2^512, the full rejection algorithm above is dominated by the single large modulus.. that's the slowest step. The intSqrt/alpertron computation is rare enough not to make a big contribution to the runtime. Past 2^512, the intSqrt/alpertron component starts to make a noticeable contribution.. by 2^2048, it dominates with 95% of the computation. (But the prescreening code in my previous post still boosts speed by 1000x, and another round of screening would do more.)

So the best "isSquare()" code is slightly dependent on the magnitude of N. For N up to roughly 2^1024, the above code probably can't be beat, and the intSqrt() efficiency is not too important. For N up to roughly 2^(2^20), the above code should use the alpertron-style final test instead of intSqrt(). For greater N, it would be useful to extend the prescreening with another round of rejection.