I spent several days experimenting with making an efficient square detection routine. There's never going to be "one best way", especially once you get down to actual CPU behavior, but I've optimized the function about as far as I can.

The goal is to have a simple "isSquare(n)" routine which returns true or false. We are NOT making assumptions about n itself; there would be better routines if you knew some properties of n, or were computing squaritude of many sequential or evenly spaced values.

Some discoveries and observations:

- The computation of an integer square root (and then comparing its square to the original number) is the practical way to
**prove** squaritude. This can be quite efficient, but it is dominated by about log2(n) big-integer divisions. There are lots of variations and improvements to this algorithm (a one-bit divider is especially interesting), but overall you can't get rid of that O(log(n)), so it quickly becomes the dominant cost. Optimizing isSquare() is all about quickly screening out as many non-squares as possible before you finally reach for the big (and comparatively slow) intSqrt() gun.

- As expected, computation of n mod m with BigInt numbers is also slow compared to native CPU 32-bit integer math. It depends on n's magnitude, but even for small n there's a lot of overhead: you can do dozens of native 32-bit integer mods in the time it takes to take the mod of a 50-digit number. The bigint mod speed is not hugely sensitive to the value of the modulus itself, though. We can use this and try to keep most of our math in the 32-bit realm. [Using 64-bit native CPU math is also a good idea to try! But I only dealt with a 32-bit version to start with.]

So, the basic idea is similar to the one outlined in the original post.. square numbers have only a limited set of residues for any given modulus. For example, a^2 mod 32 takes only 7 possible values. You can immediately reject the nonsquare residues and avoid having to use the Big and Slow intSqrt(). We use this idea for screening.

Two decisions are 1) which moduli are best to use, and 2) how you can efficiently determine whether a residue is acceptable or not.

One answer to #2 is a simple lookup table. You can trade off the size of the table for rejection efficiency. I won't go into huge detail, but tables of less than 64 bytes are REALLY fast and tables of less than about 1000 bytes are pretty darned fast. [This likely has to do with CPU L1 cache sizes and cache line sizes.] Lookups into tables of more than about 5K bytes degrade noticeably, likely because of memory latency when fetching from the deeper L2 cache. So tables are good, but they can't be too big!

Storing a table as bits is space efficient, but it carries real performance penalties: you need extra math to figure out which word holds the bit, plus a mask to select just that bit. This is slow compared to the actual mod computation, so it's better to use one byte per residue.

So are tables the right answer? Jumping to my conclusion (after lots of experimentation and analysis), if you're doing extensive rejection, the sizes start to build quickly. It's better to do multiple sequential tests with smaller tables. But if the tables get small enough, you can even use a Bloom filter instead.. and if you can compress the Bloom filter table into about 64 bits.. it's scary-fast! I'll show the results.

**Easiest initial 2^s residues**
So if the basic plan is to test moduli sequentially, which moduli should we use?

One initial step is to start with a 2^s modulus, simply because the internal storage of big integers lets us read off the 2^s residues trivially.

If we used mod 32, we know a square will have a residue of 0, 1, 4, 9, 16, 17, or 25. If we have any of the other 25 residues, we can immediately return "nonsquare." This happens 25/32= 78% of the time.

If we used mod 64, we could reject 52 residues, a slightly higher proportion of 81%.

Mod 128 lets us reject 105 residues, which is 82% rejection. Mod 256 gives 82.8% rejection. Obviously higher powers give quickly diminishing returns.

Which power of 2 should we use? The answer is whatever is most efficient to reject. It turns out that mod 128 is the largest size that fits in a very small inline Bloom filter table, and it's not worth using a more complex filter to get the (small, fractional percentage) extra rejection of higher powers.

**Bloom Filters**
A Bloom filter is a kind of hash table used to determine whether an element belongs to a predefined set. Depending on the size of the table, this discrimination can be perfect (every input is properly classified) or partial (all members of set A are properly identified as being in set A, but there may be some false positives from set B). This works very well for our residue problem, where we MUST identify all possible squares.. we can't reject them in error. But it's OK to accept a few nonsquares in a rejection step; we'll just try to keep the false acceptance as low as possible. Most residue tables will have perfect discrimination though..

We can encode the 105 residues for mod 128 in a perfect discriminator which looks like:

Code:

bool mod128Discriminator(unsigned long n) // 0 <= n < 128
{
    // nonzero result => n cannot be the residue of a square mod 128
    return (n*0x8bc40d7d) & (n*0xa1e2f5d1) & 0x14020a;
}

It works by making a hashed mask which has about 1/4 of its bits set. Here the hash is just a super-simple multiply! The first "&" combines two hashes to reduce the bit population to about 1/4; different inputs n give different masks. The masks of all 23 acceptable residues are "or"ed together, and the bitwise NOT of that union gives the bits that are **never** set for square residues. So if our result has a "bad" bit set (tested by the second "&"), we reject the residue. Picking the right hash multipliers makes this function accept all 23 good residues and reject all 105 bad ones.

The compactness of this filter makes it (surprisingly) faster than a table. The problem is that as the residue list gets bigger, the function has to grow, and therefore (as expected) there's a size/speed tradeoff.

We were lucky with mod 128; it encoded very efficiently. We can perfectly encode most moduli up to around 60 with a function in this format. Higher values will start leaking in bad residues (which isn't fatal, but it does reduce our rejection efficiency). Tradeoffs, tradeoffs.

Creating these filters involves picking the hash form and then just varying the constants until one passes with an acceptable number of false positives. The computer can churn through a billion candidate constants (just use a random number generator) until it finds one that succeeds. Of course, some moduli may need more complex hashes, but all of our moduli of interest fit very smoothly in this small, compact style.

**Picking Moduli**
So we're going to check mod 128 first; what about other moduli? First, remember that bigint mods are **slow**, so we want to take as few as possible. We can minimize this expense by using only **one** bigint modulus that's < 2^32, then using CPU 32-bit math to find the sub-moduli. It's grabbing a big chunk of data with the slow algorithm, then using fast math to peel off subchunks a bite at a time.

Examining prime moduli, we can see that each independent prime p has (p-1)/2 "nonsquare" residues. Larger primes are **slightly** better at rejection than small primes, but they'll have more residues and be harder to fit into a small discrimination function.

More importantly, we're limited by that 32-bit word size: our "big chunk" of data needs to fit into 2^32, so larger primes mean fewer moduli. That bigint modulus is expensive, so we want to squeeze as much as we can out of a single call.

What about composite moduli? Well, for powers of a prime you get better and better rejection (as we saw with 2^s), but with diminishing returns. You get better rejection using both 3 and 5 than you do using 25 or 625 or even 5^50.

Composite moduli of two (or more) primes will reject the same residues that either prime would reject by itself, so it's kind of like testing both primes at once. [But large numbers of residues won't fit into that efficient Bloom filter, so we can't take advantage of this too much.]

So, we should use as many primes as will fit into 2^32.

That's 3*5*7*11*13*17*19*23*29 = 3234846615, and we should do 9 tests, right?

Almost.. there's a little more efficiency we can pull out. First, while the powers of primes have less "bang for the buck", the low primes 3 and 5 are small and provide a lot more rejection-per-bit than higher primes.. enough that it's better to use 9 as a modulus and not 3, even though it means we can fit in fewer primes. (It takes a lot of combinatorial testing to figure out if 27 is worthwhile, or even 81, or 5 versus 25, or 49, etc.) So the **best** rejection ends up being: 9*25*7*11*17*19*23*31 = 3989930175. Note that 13 is missing; you get slightly better rejection by skipping it and shifting the remaining primes higher. I'm skipping a lot of experimentation steps here and just presenting the results.

We can make an efficient (and perfect) Bloom filter for each component.. in fact we can also find a perfect filter for 63 = 7*9, combining two of the tests into one.

What order should we do the tests? They're independent so we order them from best rejection to least. 63 gives a rejection of about 75% by itself. 25 gives 56% rejection. The remaining primes each provide just under 50% each. So we test in the order: 63, 25, 31, 23, 19, 17, 11.

**The Code**
So, combining these, we get a routine which uses a mod 128 test to reject 82% of inputs without doing any big-integer math at all. We then take **one** big-integer modulus and use fast 32-bit math to test residues of 7 different independent bases. The net rejection rate is 0.999+, so we only need to compute the intSqrt of fewer than 1 in 1000 non-squares.

This is C++ code but it's very straightforward.

Code:

bool isSquare(const BigInt &n)
{
    unsigned long m;
    // Start with mod 128 rejection. 82% rejection rate.
    // VERY fast, can read bits directly.
    m = n.lowWord() & 127; // n mod 128
    if ((m*0x8bc40d7d) & (m*0xa1e2f5d1) & 0x14020a) return false;
    // Other moduli share one BigInt modulus.
    unsigned long largeMod = n % (63UL*25*11*17*19*23*31); // SLOW, bigint modulus
    // residues mod 63. 75% rejection
    m = largeMod % 63; // fast, all 32-bit math
    if ((m*0x3d491df7) & (m*0xc824a9f9) & 0x10f14008) return false;
    // residues mod 25. 56% rejection
    m = largeMod % 25;
    if ((m*0x1929fc1b) & (m*0x4c9ea3b2) & 0x51001005) return false;
    // residues mod 31. 48.4% rejection
    // Bloom filter has a slightly different form to keep it perfect
    m = 0xd10d829a*(largeMod % 31);
    if (m & (m+0x672a5354) & 0x21025115) return false;
    // residues mod 23. 47.8% rejection
    m = largeMod % 23;
    if ((m*0x7bd28629) & (m*0xe7180889) & 0xf8300) return false;
    // residues mod 19. 47.3% rejection
    m = largeMod % 19;
    if ((m*0x1b8bead3) & (m*0x4d75a124) & 0x4280082b) return false;
    // residues mod 17. 47.1% rejection
    m = largeMod % 17;
    if ((m*0x6736f323) & (m*0x9b1d499) & 0xc0000300) return false;
    // residues mod 11. 45.5% rejection
    m = largeMod % 11;
    if ((m*0xabf1a3a7) & (m*0x2612bf93) & 0x45854000) return false;
    // Net nonsquare rejection rate: 99.92%
    // We COULD extend to another round, doing another BigInt modulus and
    // then followup rejections here, using primes of 13 29 37 41 43 53.
    // That'd give 98% further rejection. Empirical timing shows this
    // second round would be useful for n > 10^100 or so.
    // VERY expensive final definitive test
    BigInt root(intSqrt(n));
    return root*root == n; // true if and only if n is a square
}

**Results:**
As expected, the new discrimination function is very fast. Below are some tests, which are clearly CPU-, compiler-, and especially BigInt-class-dependent, but still representative. Each times 2^25 evaluations of various n. Efficiency gains change with the magnitude of n, since the screening tests are essentially O(1) while intSqrt is not. Larger numbers show an even more dramatic disparity.

(Hmm, no table support on this forum? I'll use code view, so it's ugly but readable.)

Code:

                                        n~=10^40     n~=10^80
Using brute force intSqrt() tests:      49.95 secs   147.69 secs
Using mod 128 discrimination first:      9.65 secs    27.73 secs
Using full discrimination tests first:   0.65 secs     0.84 secs

Profiling shows that the n~=10^40 tests spend about 40% of their time computing the "largeMod", about 35% doing the individual residue tests, and about 25% in the integer square root. Larger n increases the largeMod fraction, but the intSqrt will eventually dominate overall.