Here are a couple of other things to keep in mind:
1) For lower-end Skylake processors the speedups are not as dramatic, with or without special Mersenne math.
I ran on a 7800X and only saw about a 2.1x speedup for 2^1277-1.
More testing/benchmarking is needed on a variety of AVX512-capable processors.
The good news is that this situation will only improve for avx-ecm as time goes on. I plan to implement enhancements as soon as I can get my hands on an Ice Lake CPU, or whatever CPU supports AVX512IFMA.
2) avx-ecm uses fixed-time arithmetic in steps of 208 bits. So a curve at, say, 416 bits takes just as long as a curve at 623 bits.
gmp-ecm, on the other hand, will adjust the size of its arithmetic in 64-bit steps. This could mean fine-tuning adjustments to any crossover math that is performed for given input sizes.
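
To make point 2 concrete, here is a rough Python sketch of the two granularities. It is not taken from either program: the function names are made up for illustration, and the avx-ecm rounding rule (the smallest multiple of 208 strictly greater than the input size) is an assumption chosen only because it reproduces the 416-vs-623 example above. The 64-bit side is simply rounding up to whole 64-bit words.

```python
def avx_ecm_bits(input_bits: int, step: int = 208) -> int:
    """Hypothetical: arithmetic width for a 208-bit step size.

    Assumed rule: smallest multiple of `step` strictly greater than
    input_bits, so 416..623 bits all land on the same width.
    """
    return (input_bits // step + 1) * step


def gmp_ecm_bits(input_bits: int, word: int = 64) -> int:
    """Hypothetical: arithmetic width when sizes move in 64-bit steps
    (round up to a whole number of 64-bit words)."""
    return -(-input_bits // word) * word


if __name__ == "__main__":
    print("input  208-bit steps  64-bit steps")
    for bits in (416, 500, 623, 624):
        print(f"{bits:5d}  {avx_ecm_bits(bits):13d}  {gmp_ecm_bits(bits):12d}")
```

Under these assumptions every input between 416 and 623 bits gets the same width on the 208-bit side, while the 64-bit side changes width several times over that range, which is why a crossover measured at one input size may not hold at a nearby one.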
