More updates:

1) Fixed some bugs in the fast stage 2, after finally bearing down and understanding all of the nuances of Montgomery's PAIR algorithm. I had it mostly right before, but not 100%, which caused some factors to be missed. Thank <deity> for a gold reference implementation in gmp-ecm. As a bonus, fixing the bugs also improved the speed a bit.

2) Got a shiny new laptop with a Tiger Lake i7-1165G7 in order to get AVX512IFMA up and running. It helped more than I thought it would. Going from my 52-bit integer multipliers, which were built on double-precision floating point, to IFMA speeds up all of the vector math by about 40%!
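
For anyone wondering what the IFMA instructions actually do, here is a rough scalar model in Python of a single 64-bit lane. This is only a sketch and not the avx-ecm code: the function names are borrowed from the intrinsics _mm512_madd52lo_epu64 and _mm512_madd52hi_epu64, while mul_limbs, to_limbs and the 25-limb size are made up for illustration. It does a schoolbook multiply in radix 2^52, keeping low and high partial products in separate accumulators and resolving carries once at the end.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def madd52lo(acc, b, c):
    # One lane: acc + low 52 bits of the 104-bit product of the 52-bit inputs.
    prod = (b & MASK52) * (c & MASK52)
    return (acc + (prod & MASK52)) & MASK64

def madd52hi(acc, b, c):
    # One lane: acc + high 52 bits (bits 52..103) of the same product.
    prod = (b & MASK52) * (c & MASK52)
    return (acc + (prod >> 52)) & MASK64

def to_limbs(x, nlimbs):
    # Split a non-negative integer into nlimbs limbs of 52 bits, least significant first.
    return [(x >> (52 * i)) & MASK52 for i in range(nlimbs)]

def mul_limbs(x, y, nlimbs):
    # Schoolbook product in radix 2^52 (illustrative helper, not from avx-ecm).
    # Each 64-bit accumulator has 12 spare bits, so a column can absorb
    # up to 4096 52-bit terms before it would wrap; nlimbs = 25 is far below that.
    xs, ys = to_limbs(x, nlimbs), to_limbs(y, nlimbs)
    lo = [0] * (2 * nlimbs)
    hi = [0] * (2 * nlimbs)
    for i in range(nlimbs):
        for j in range(nlimbs):
            lo[i + j] = madd52lo(lo[i + j], xs[i], ys[j])
            hi[i + j + 1] = madd52hi(hi[i + j + 1], xs[i], ys[j])
    # Deferred carry propagation: combine the columns and normalize to 52 bits.
    result, carry = 0, 0
    for k in range(2 * nlimbs):
        t = lo[k] + hi[k] + carry
        result |= (t & MASK52) << (52 * k)
        carry = t >> 52
    return result

if __name__ == "__main__":
    x = 2**1277 - 1          # fits in 25 limbs of 52 bits (1300 bits)
    y = 3**800
    print(mul_limbs(x, y, 25) == x * y)
```

The real thing does this on 8 lanes at once, one curve per lane, and the multiply and the add happen in a single instruction. That fused operation is what replaces the double-precision trick.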

On what has become the de facto quick benchmark, this CPU gets:
B1=7M on 2^1277-1:
avx-ecm stage 1: 47.3 sec for 8 curves, 5.9 sec/stg1-curve
avx-ecm stage 2: 18.3 sec for 8 curves, 2.3 sec/stg2-curve

The next goals are usability-focused. Get a proper set of command line flags in there, some options to control memory use and when to exit, improved logging, and reading in savefiles.
