prime95 idles lots of cores on a worker during P1 GCD
See attachment. Same will apply to dualmanycoreXeons and some other configurations.
P1 during GCD idles all but 1 core of a worker. On a Xeon Phi this may be 64/n 1 or 68/n 1 cores, where n is number of workers. Depending on exponent the duration may be considerable.
Gpuowl handled this situation by running parallel threads, speculatively executing the next P1 stage or the following PRP while the GCD ran in a separate thread. The factor finding probabilities with normal bounds are such that ~98% of the time the same exponent is involved, it pays off. And if it is a succession of P1 on different exponents, or following is other type work on different exponents, whatever payoff there is occurs 100% of the time. Perhaps this approach would be productive on hyperthreaded CPUs in prime95 / mprime also. (And Mlucas?)
In the M880M case I was running, it took an hour for a GCD.
Similarly, during preallocation of disk space for the proof residues of an M500M PRP, it took a few minutes while nearly all cores of the worker were idle.
