I'm going to give away the backstory for my question about low-hamming-weight multiples.

"It's all for P-1"

You may know, GpuOwl can do both PRP and P-1. PRP has the advantage of being protected with the famous "G" error check (AKA "GEC"). P-1 (as implemented) OTOH is "naked", with no error check.

But there's a better way!

P-1 (first stage) can be run together with PRP, for better efficiency and a partial error check.

P-1 first stage works like this:

1. compute the exponent E=PowerSmooth(B1),

2. compute X=3^PowerSmooth(B1) (mod M)

3. do GCD(X-1, M)

(where PowerSmooth(B) computes the product of all primes raised to a maximal power such that they're under B, i.e. product(over p prime with p < B, of p^floor(log(B)/log(p)))).

in pari-gp:

Code:

powersmooth(b)= pr=1;forprime(p=2,b,pr*=p^floor(log(b)/log(p)));pr

Concerning the step 2,

Code:

X=3^PowerSmooth(B1)

This modular exponentiation can be done in two ways, funnily named "left-to-right" and "right-to-left" binary exponentiation.. (BTW I always mix them up)

L-to-R: MSB to LSB

R-to-L: LSB to MSB

The fastest is L-to-R binary exponentiation, as it requires only squaring and multiplication-by-3, which (being mul by a small constant) is free. So the cost for 3^PS(B1) is N squarings, where N is the number of bits of PS(B1).

The R-to-L binary exponentiation requires N squarings plus HammingWeight(PS(B1)) multiplications, so seems like a worse choice.

BUT the squarings needed in the R-to-L exponentiation are exactly the same squarings that take place in the initial part of the PRP test! If we assume one does P-1 followed by PRP, the squarings are "free" in the P-1 (i.e. counted towards the PRP), and in addition the squarings are protected by the GEC.

So the cost of the R-to-L exponentiation is HW(PS(B1)), i.e. the number of pops of the PowerSmooth(B1), which is about 1/2 of the bitlen of PS(B1), and this is significantly cheaper than the cost of L-to-R exponentiation. The drawback is that the P-1 must be followed by PRP in order to reap the cost benefit.

At this point it should be clear why the quest for a low-hamming-weight multiple. Instead of PS(B1) we can use just as well any multiple of it, without even needing to know which multiple it is. A reduction in the HW results in a proportional reduction in the cost of P-1 (first stage).

Unfortunately I could find no practical way at all for reducing the HW of the PS(B1). Let's enumerate the approaches tried:

1. low-HW multiple of a small number.

(let's say "small number" being up to 32bits or maybe even 64bits, and low-HW being <7 bits)

The key here is to think of the contribution of each bit that is set modulo the "base", and target a combination of bits that sum to 0 (mod "base"), i.e. represent a multiple of the base. And the problem has similarities to subset-sum.

This can be solved using either dynamic programming, or a "brute search" that takes advantage of the extremely-low HW that is targeted (HW < 7).

(not all the small numbers have a very-low HW multiple)

The problem with this approach is that it works for small numbers only. The solution can't be extended to the huge numbers needed for PS(B1): While PS(B1) can easily be split into a product of (many) 32-bit factors, the lowHW solutions for those factors can't be assembled back into a lowHW solution to PS(B1).

(below, PS is without the powers of "2"):

2. For B1=1M, bitlen(PS(1M))==1442080. So we have a number of about 1.4M bits. The starting HW is: HW(PS(1M))==720737.

I tried various approaches to search among the multiples of PS(1M), and the lowest HW I found was around 718K, let's say a HW reduction of about 0.4%! Impressivelly unremarkable. Feel free to try yourself, doing such experiments is not too difficult in pari-gp.

So for some reason, the search for low-HW multiples of PS(1M) is much harder than it might seem.

At this point I think an approch that might work a bit better would be to look at the base number as a pattern of bits, that can be shifted and added back (this is what multiplication does). This shift-and-add can both create and destroy 1-bits. Maybe an algorithm could be constructed there.

Anyway in the meantime I consider the problem "impossible" myself. But if somebody comes up with a multiple of PS(1M) that is.. let's say 1% improvement (HW<714K), I'd be impressed!

Code:

ps(b)=pr=1;forprime(p=3,b,pr*=p^floor(log(b)/log(p)));pr
hw(x)=hammingweight(x)
base=ps(1000000)
hw(base)