"Carlos Pinho"
HT helps a lot on LA, at least for me.

"Curtis"
VBITS=128 on an otherwise idle machine. ETA after 1% of job:
6 threads: 14 hr 34 min
12 threads: 8 hr 26 min
18 threads: 9 hr 15 min
24 threads: 8 hr 27 min
These times look rather slow; I just installed the extra 32GB of memory today, so perhaps filling all 8 slots slows memory access a bunch. Sometime I'll remove the original 16GB and see if 4 sticks is faster than 8.

"Mike"
Note that we only count the LA phase in our calculations. 

"Mike"
12 = 8h04m50s
20 = 8h32m31s
24 = 7h59m33s

"Mike"
CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 8
LA = 47884s
"Mike"
CPU = i7-8565U
RAM = 2×16GB DDR4-2400
CMD = ./msieve -v -nc -t 4
LA = 51662s
"Mike"
CPU = 3950X
RAM = 2×8GB DDR4-3666
CMD = ./msieve -v -nc -t 16
LA = 27180s
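For easier comparison with the h/m/s figures posted earlier in the thread, the three LA times above can be converted from seconds. This is only a small sketch; the helper name `fmt_hms` and the dictionary labels are made up for illustration and are not part of msieve.

```python
# Convert the LA times reported above (in seconds) to h/m/s.
# fmt_hms and the labels are illustrative names, not msieve output.

def fmt_hms(seconds: int) -> str:
    """Format a whole number of seconds as e.g. '7h33m00s'."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"

la_seconds = {
    "i7-8565U, 8 threads": 47884,
    "i7-8565U, 4 threads": 51662,
    "3950X, 16 threads": 27180,
}

for label, secs in la_seconds.items():
    print(f"{label}: {fmt_hms(secs)}")
```

That works out to 13h18m04s, 14h21m02s and 7h33m00s respectively.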
"Beschorner Kurt"
In my experience, the throughput depends on whether another program (e.g. gmp-ecm) is running at the same time.
machine: i7-7820X, 8 cores + HT
matrix: 49M * 49M
memory: 64 GB
msieve .... t16 solo ~ 55% (power according to Task Manager)
msieve .... t16 and gmp-ecm (priority: low) ~ 78%
With msieve + mprime/Prime95 the effectiveness is a little lower.
Kurt
mpirun -np 2 msieve -nc2 1,2 -v -t 20

Here's a bench using compute nodes, each with one Xeon E5-2650 v4 Broadwell CPU with 12 cores, 24 threads.
1 node: 7h 40m
2 nodes: 2h 45m
4 nodes: 1h 35m
8 nodes: 1h 10m
Not sure why the time for one node is so high compared to the others. Perhaps something is fitting into the cache with the smaller matrices on each node?
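To put numbers on how odd the one-node time is, here is a quick sketch that computes speedup and per-node efficiency from the four timings above, taking the one-node run as the baseline. The function names are mine, and the times are simply the posted figures converted to minutes.

```python
# Speedup and parallel efficiency relative to the 1-node run.
# Times are the posted figures in minutes; names are illustrative only.

def speedup(baseline_min: float, time_min: float) -> float:
    """How many times faster than the baseline run."""
    return baseline_min / time_min

def efficiency(baseline_min: float, time_min: float, nodes: int) -> float:
    """Speedup divided by node count; 1.0 means perfect linear scaling."""
    return speedup(baseline_min, time_min) / nodes

times_min = {1: 7 * 60 + 40, 2: 2 * 60 + 45, 4: 1 * 60 + 35, 8: 1 * 60 + 10}
baseline = times_min[1]

for nodes, minutes in times_min.items():
    s = speedup(baseline, minutes)
    e = efficiency(baseline, minutes, nodes)
    print(f"{nodes} node(s): speedup {s:.2f}x, efficiency {e:.2f}")
```

With these figures the speedup is about 2.79x on 2 nodes and 4.84x on 4 nodes, both more than the node count, and only drops below linear at 8 nodes (about 6.57x). Scaling better than linear is what you would expect to see if the per-node working set starts fitting in cache, though this arithmetic alone can't confirm the cause.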
"Curtis"
