Improved assembly code
As Mark suggested, I've updated the GMP 4.1.4 addmul_1.c, mul_1.c, and submul_1.c files with improved inline assembly blocks. The full package of files and patches is attached.
Here are the new GMP-ECM timings for the test numbers given in the original post above:
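For anyone unfamiliar with what these three files do, here is a rough portable C sketch of the operations that GMP's mpn_mul_1, mpn_addmul_1, and mpn_submul_1 perform: multiply an n-limb number by a single limb and store, add, or subtract the product, returning the carry or borrow limb. This is not the attached assembly and not GMP's own source. The ref_* names and the limb_t typedef are made up for illustration, and I'm assuming 32-bit limbs with a 64-bit intermediate type.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs, least significant limb first. */
typedef uint32_t limb_t;

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out (high) limb. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 32x32 -> 64 bit product plus incoming carry cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Worst case is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t fits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* rp[0..n-1] -= up[0..n-1] * v; returns the borrow-out limb. */
limb_t ref_submul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + borrow;
        limb_t lo = (limb_t)t;
        limb_t hi = (limb_t)(t >> 32);
        hi += (rp[i] < lo);   /* extra borrow if the low subtraction wraps */
        rp[i] -= lo;
        borrow = hi;
    }
    return borrow;
}
```

The inner loop is a multiply plus a carry chain per limb, which is why hand-written assembly for these routines can have a noticeable effect on overall timings.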
GMPECM 6.0.1 [powered by GMP 4.1.4] [ECM]
Input number is 65030090232295...7611097719 (150 digits)
Using B1=3000000, B2=4016636513, polynomial Dickson(6), sigma=2091664324
Step 1 took 28605ms
Step 2 took 15620ms
Input number is 11846804646723...1005685163 (200 digits)
Using B1=3000000, B2=4016636513, polynomial Dickson(6), sigma=2433697906
Step 1 took 36737ms
Step 2 took 20987ms
Input number is 741533978684...40036509963 (250 digits)
Using B1=3000000, B2=4016636513, polynomial Dickson(6), sigma=232452003
Step 1 took 53508ms
Step 2 took 25151ms
Input number is 453061695333...528133509 (300 digits)
Using B1=3000000, B2=4016636513, polynomial Dickson(6), sigma=2709846699
Step 1 took 78669ms
Step 2 took 32485ms
Input number is 68760637088...09358157449 (348 digits)
Using B1=3000000, B2=4016636513, polynomial Dickson(6), sigma=2374092456
Step 1 took 104431ms
Step 2 took 39437ms
