2021-06-26, 19:46   #1
fivemack
Feb 2006
Cambridge, England
Profile-driven optimisation for lanczos

Has anyone got thoughts about the NxB times BxB multiply once VBITS grows big?

My profile at the moment looks like this (VBITS=256 build on Broadwell, then VBITS=128 build on Haswell):

  26.39%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_BxN_NxB
  26.09%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_trans_packed_core
  21.01%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_packed_core
  17.99%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] core_NxB_BxB_acc
   5.58%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_packed_small_core

  37.72%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_trans_packed_core
  30.40%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_packed_core
  14.74%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_BxN_NxB
   7.40%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] core_NxB_BxB_acc
   6.50%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_packed_small_core
At the moment we make VBITS/8 tables of 256*VBITS bits each; the whole set barely fits in L2 for VBITS=256 (256 kB) and is a whole megabyte for VBITS=512.
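The byte-indexed table scheme can be sketched as follows. This is a scalar toy with VBITS=64 and hypothetical names (`build_tables`, `mul_word`); the real msieve kernels work on 128/256-bit SIMD words, but the structure is the same: VBITS/8 tables of 256 entries, filled incrementally so each entry costs one XOR.

```c
#include <stdint.h>

#define VBITS 64  /* toy size; msieve uses 128/256-bit vector words */

/* Precompute VBITS/8 tables of 256 entries: tab[t][b] is the XOR of the
   rows of the BxB matrix M selected by the bits of byte b, for byte
   position t.  Total footprint: (VBITS/8) * 256 * VBITS bits. */
static void build_tables(const uint64_t M[VBITS], uint64_t tab[VBITS / 8][256])
{
    for (int t = 0; t < VBITS / 8; t++) {
        tab[t][0] = 0;
        for (int b = 1; b < 256; b++) {
            /* b with its lowest set bit cleared is already filled in */
            int row = 8 * t + __builtin_ctz(b);
            tab[t][b] = tab[t][b & (b - 1)] ^ M[row];
        }
    }
}

/* One row of the NxB * BxB product: replace the VBITS-bit word x by
   x * M, consuming eight bits of x per table lookup. */
static uint64_t mul_word(uint64_t x, const uint64_t tab[VBITS / 8][256])
{
    uint64_t r = 0;
    for (int t = 0; t < VBITS / 8; t++)
        r ^= tab[t][(x >> (8 * t)) & 0xff];
    return r;
}
```

With M the identity (row i = bit i), `mul_word` returns its input unchanged, which makes a convenient sanity check.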

The matrix itself is VBITS^2 bits, which is 8 kB and fits nicely in L1 for VBITS=256, but at 32 kB is already getting inconvenient for VBITS=512. I wonder how much slower it is to do eight times as many accesses, in a perfectly uniform pattern so no address computation, to a table 1/32 the size?
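The one-bit-at-a-time variant contemplated above amounts to using the matrix itself as the lookup table. A scalar VBITS=64 sketch with a hypothetical name (`mul_word_direct`); the point is VBITS sequential accesses per word instead of VBITS/8, against a working set of only VBITS^2 bits:

```c
#include <stdint.h>

#define VBITS 64  /* toy size; the real kernels use SIMD words */

/* Table-free BxB multiply step: walk the matrix M one bit of x at a
   time.  Eight times as many accesses as the byte-table scheme, but the
   only memory touched is the VBITS*VBITS-bit matrix (1/32 the footprint)
   and the access pattern is perfectly sequential. */
static uint64_t mul_word_direct(uint64_t x, const uint64_t M[VBITS])
{
    uint64_t r = 0;
    for (int i = 0; i < VBITS; i++)
        r ^= M[i] & (uint64_t)0 - ((x >> i) & 1);  /* branch-free select */
    return r;
}
```

The conditional XOR is done branch-free (mask is all-ones or all-zeros), so the loop body is pure ALU work with no data-dependent control flow.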

I suppose I should also try VBITS/2 tables of 4*VBITS bits and VBITS/4 tables of 16*VBITS bits, which total 16 kB and 32 kB for VBITS=256 (64 kB and 128 kB for VBITS=512).
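The three variants above are the same algorithm parameterised by the number of bits consumed per lookup. A sketch with a hypothetical compile-time parameter K (K=8 is the current scheme, K=2 and K=4 the alternatives; total table footprint is (VBITS/K) * 2^K * VBITS bits):

```c
#include <stdint.h>

#define VBITS 64        /* toy size */
#define K     4         /* bits consumed per table lookup */
#define NTAB  (VBITS / K)
#define TSIZE (1 << K)

/* Build VBITS/K tables of 2^K entries each; entry b of table t is the
   XOR of rows K*t .. K*t+K-1 of M selected by the bits of b. */
static void build_tables_k(const uint64_t M[VBITS], uint64_t tab[NTAB][TSIZE])
{
    for (int t = 0; t < NTAB; t++) {
        tab[t][0] = 0;
        for (int b = 1; b < TSIZE; b++)
            tab[t][b] = tab[t][b & (b - 1)] ^ M[K * t + __builtin_ctz(b)];
    }
}

/* x * M using VBITS/K lookups per word */
static uint64_t mul_word_k(uint64_t x, const uint64_t tab[NTAB][TSIZE])
{
    uint64_t r = 0;
    for (int t = 0; t < NTAB; t++)
        r ^= tab[t][(x >> (K * t)) & (TSIZE - 1)];
    return r;
}
```

Shrinking K trades lookups (VBITS/K per word) against footprint (the 2^K factor), which is exactly the L1-versus-L2 question being asked.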

Last fiddled with by fivemack on 2021-06-27 at 19:16