To spare me going through the build process to find out for myself: does GWNUM provide a speedup even for "small" inputs like 2^1277-1? How about 2^941-1? I'm wondering where the cutoff is relative to normal mulmod methods.
(Small, of course, relative to where prime95 usually operates.)

Attempting to get something useful out of this thread, here are some timings on a 2 GHz Skylake VM. Without GWNUM:
$ echo "2^1277-1" | ./ecm -c 5 11e6
GMP-ECM 7.0.4 [configured with GMP 6.2.0, --enable-asm-redc] [ECM]
Input number is 2^1277-1 (385 digits)
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:15382913345576557885
Step 1 took 63126ms
Step 2 took 28310ms
Run 2 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:2472923556820086104
Step 1 took 63697ms
Step 2 took 28330ms
Run 3 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:13772778279637613489
Step 1 took 63532ms
Step 2 took 28339ms
Run 4 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:4149983287909296098
Step 1 took 65040ms
Step 2 took 28394ms
Run 5 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:3089621796835281869
Step 1 took 65407ms
Step 2 took 28430ms
With GWNUM linked:
$ echo "2^1277-1" | ./ecm -c 5 11e6
GMP-ECM 7.0.4 [configured with GMP 6.2.0, GWNUM 29.8, --enable-asm-redc] [ECM]
Due to incompatible licenses, this binary file must not be distributed.
Input number is 2^1277-1 (385 digits)
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:1133017334638497343
Step 1 took 52939ms
Step 2 took 28574ms
Run 2 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:16936162798099609109
Step 1 took 52935ms
Step 2 took 28488ms
Run 3 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:4745769159583183902
Step 1 took 52719ms
Step 2 took 28706ms
Run 4 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:14575819613407288447
Step 1 took 52688ms
Step 2 took 28512ms
Run 5 out of 5:
Using B1=11000000, B2=35133391030, polynomial Dickson(12), sigma=0:7117923393329265041
Step 1 took 52795ms
Step 2 took 28585ms
So, a decent speedup in stage 1: about 64 s per curve down to about 53 s. Stage 2 is essentially unchanged at roughly 28 s.
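For anyone who would rather see averages than eyeball the log, here is a quick Python sketch that computes the per-stage means and the plain/GWNUM ratio. The timings are copied by hand from the two runs above; the names `plain`, `gwnum` and `summarize` are only for illustration, and five curves per configuration on a VM is a small sample, so treat the result as a rough indication.

```python
from statistics import mean

# Step 1 / Step 2 times in milliseconds, copied from the two logs above.
plain = {
    "step1": [63126, 63697, 63532, 65040, 65407],
    "step2": [28310, 28330, 28339, 28394, 28430],
}
gwnum = {
    "step1": [52939, 52935, 52719, 52688, 52795],
    "step2": [28574, 28488, 28706, 28512, 28585],
}


def summarize(stage):
    """Return (mean without GWNUM, mean with GWNUM, ratio plain/gwnum)."""
    without = mean(plain[stage])
    with_gw = mean(gwnum[stage])
    return without, with_gw, without / with_gw


if __name__ == "__main__":
    for stage in ("step1", "step2"):
        without, with_gw, ratio = summarize(stage)
        print(f"{stage}: {without:.0f} ms -> {with_gw:.0f} ms, ratio {ratio:.3f}")
```

On these numbers the stage 1 ratio comes out at about 1.21 and the stage 2 ratio at about 0.99, so the gain is confined to stage 1.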