Assuming the ~p^{2.1} scaling also applies to GCD operations, and you're doing ~106M P1, there's a factor of ~4.2 unexplained difference in GCD speed in your favor. Maybe faster cores giving faster GCDs, and correspondingly faster stages too.
The timing I gave for the large exponent was using ~10GB in stage 2, prime95 V30.6b4.
edit: chalsall's small exponent ~27.4M more than explains the rest of the speed ratio. 5.05 sec x 2 / 2 hr 29 min = 0.11% potential speedup for him. Except, the i3-9100 is 4-core, no hyperthreading. Gpuowl's parallelism came about because Mihai took pity on my multi-Radeon VII / slow-CPU-for-GCD P1 factory, which spent ~5 minutes of a 40 minute wavefront P1 factoring in single-CPU-core GCD with the GPU idle and waiting. The system didn't have enough max RAM to support dual-instance P1 on its GPUs to mitigate it. 40/35 = ~14% P1 speedup via speculative parallelism. As always, it's George's call what is worth George's time, and what is not worthwhile.
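For anyone who wants to redo the arithmetic above with their own timings, here is a rough Python sketch. The function names are made up for illustration, the "x 2" is taken to mean two GCDs per P1 run, and the p^2.1 scaling is the assumption already stated above, not a measured law for GCD.

```python
# Rough sketch only: helper names are hypothetical, numbers are from the post.

def gcd_overhead_fraction(gcd_seconds, gcd_count, total_seconds):
    """Fraction of the total run time spent in GCDs (upper bound on the gain)."""
    return gcd_seconds * gcd_count / total_seconds

def speedup_from_overlap(total_minutes, hidden_minutes):
    """Relative speedup if hidden_minutes of a total_minutes run are overlapped."""
    return total_minutes / (total_minutes - hidden_minutes) - 1.0

def scaling_ratio(p_large, p_small, exponent=2.1):
    """Expected time ratio between two exponents, ASSUMING time ~ p^exponent."""
    return (p_large / p_small) ** exponent

# CPU case: 5.05 sec per GCD, assumed 2 GCDs, 2 hr 29 min total
cpu_case = gcd_overhead_fraction(5.05, 2, 2 * 3600 + 29 * 60)

# GPU case: ~5 of 40 minutes in single-core GCD with the GPU idle
gpu_case = speedup_from_overlap(40, 5)

# Exponent size effect: ~106M versus ~27.4M under the assumed p^2.1 scaling
ratio = scaling_ratio(106e6, 27.4e6)

print(f"CPU potential speedup: {cpu_case:.2%}")   # 0.11%
print(f"GPU potential speedup: {gpu_case:.1%}")   # 14.3%
print(f"p^2.1 ratio 106M vs 27.4M: {ratio:.1f}")
```

The last figure comes out well above the ~4.2 gap, which is what "more than explains" refers to, provided the p^2.1 assumption holds for GCD.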
