2021-06-26, 19:46   #1
fivemack

Profile-driven optimisation for lanczos

Has anyone got thoughts about the NxB times BxB multiply once VBITS grows big?

My profile at the moment looks like
Code:
  26.39%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_BxN_NxB
  26.09%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_trans_packed_core
  21.01%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_packed_core
  17.99%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] core_NxB_BxB_acc
   5.58%  msieve-MP-V256-  msieve-MP-V256-BDW  [.] mul_packed_small_core
versus

Code:
  37.72%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_trans_packed_core
  30.40%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_packed_core
  14.74%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_BxN_NxB
   7.40%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] core_NxB_BxB_acc
   6.50%  msieve-MP-V128-  msieve-MP-V128-HSW  [.] mul_packed_small_core
At the moment we make VBITS/8 tables of 256*VBITS bits each; the whole set barely fits in L2 for VBITS=256 (256 kB) and is a whole megabyte for VBITS=512.

The matrix itself is VBITS^2 bits, which is 8 kB and fits nicely in L1 for VBITS=256, but is already getting inconvenient (32 kB) for VBITS=512. I wonder how much slower it is to do eight times as many accesses, in a perfectly uniform pattern so there is no address computation, to a table 1/32 the size (i.e. working straight from the matrix one bit at a time rather than from the 8-bit-chunk tables)?

I suppose I should also try VBITS/2 tables of 4*VBITS bits and VBITS/4 tables of 16*VBITS bits, which come to 16 kB and 32 kB at VBITS=256.
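For concreteness, here is a minimal sketch in plain C of the table-driven NxB x BxB product with a parameterisable chunk size (this is not msieve's actual code; the type and function names are invented for illustration). CHUNK=8 corresponds to the VBITS/8 tables of 256*VBITS bits described above, and CHUNK=2 or CHUNK=4 to the smaller variants:
Code:
#include <stdint.h>

#define VBITS   256
#define VWORDS  (VBITS / 64)
#define CHUNK   2                        /* bits consumed per table lookup */
#define NTABLES (VBITS / CHUNK)
#define ENTRIES (1u << CHUNK)

typedef struct { uint64_t w[VWORDS]; } vec_t;

/* tab[pos][idx] = XOR of the rows of M selected by the CHUNK-bit value idx
   sitting at bit offset pos*CHUNK of an input vector */
static void build_tables(const vec_t M[VBITS], vec_t tab[NTABLES][ENTRIES])
{
    for (int pos = 0; pos < NTABLES; pos++) {
        for (unsigned idx = 0; idx < ENTRIES; idx++) {
            vec_t acc = {{0}};
            for (int b = 0; b < CHUNK; b++)
                if (idx & (1u << b))
                    for (int w = 0; w < VWORDS; w++)
                        acc.w[w] ^= M[pos * CHUNK + b].w[w];
            tab[pos][idx] = acc;
        }
    }
}

/* y[i] ^= v[i] * M over GF(2), for i = 0..n-1 */
static void mul_NxB_BxB(uint64_t n, const vec_t *v, const vec_t M[VBITS], vec_t *y)
{
    vec_t tab[NTABLES][ENTRIES];         /* 16 kB at CHUNK=2; use static or heap
                                            storage for CHUNK=8 (256 kB) */
    build_tables(M, tab);
    for (uint64_t i = 0; i < n; i++) {
        vec_t acc = {{0}};
        int pos = 0;
        for (int j = 0; j < VWORDS; j++) {
            uint64_t g = v[i].w[j];
            for (int k = 0; k < 64; k += CHUNK, pos++, g >>= CHUNK)
                for (int w = 0; w < VWORDS; w++)
                    acc.w[w] ^= tab[pos][g & (ENTRIES - 1)].w[w];
        }
        for (int w = 0; w < VWORDS; w++)
            y[i].w[w] ^= acc.w[w];
    }
}
Shrinking CHUNK trades more (but perfectly predictable) lookups for smaller tables: CHUNK = 8, 4, 2 give roughly 256 kB, 32 kB and 16 kB of tables at VBITS=256.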

2021-06-26, 22:23   #2
fivemack
This is driving me slightly mad: I made the obvious tweaks to use a 2-bit rather than an 8-bit table, and it slowed down immensely; this turned out to be because the compiler wasn't using YMM operations at all, but was instead carrying the vector around in four 64-bit registers.

When I recoded to use absolutely explicit YMM operations:

Code:
/* c[] holds the 2-bit lookup tables laid out consecutively: four VBITS-bit
   entries per 2-bit chunk of the input vector; y[] is the accumulator */
zero = _mm256_xor_si256(zero, zero);
_mm256_zeroupper();
for (i = 0; i < n; i++) {
    accum = zero;
    jig = 0;                    /* offset of the current 4-entry table in c[] */
    for (j = 0; j < VWORDS; j++) {
        uint64 g = v[i].w[j];
        for (k = 0; k < 64; k += 2) {
            /* XOR in the table entry selected by this 2-bit chunk of g */
            accum = _mm256_xor_si256(accum, *(__m256i*)(&(c[jig + (g & 3)])));
            jig += 4;
            g >>= 2;
        }
    }
    __m256i *yy = (__m256i*)(&(y[i]));
    *yy = _mm256_xor_si256(*yy, accum);
}
gcc-9.3 is generating

VPXOR xmm0,xmm0,xmm0

(note the "x")

from the intrinsic

Code:
accum = _mm256_xor_si256(accum,accum);
and I'm then unsurprisingly getting horrific slowdown from mixing XMM and YMM code.
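(As an aside, and not a claim about what msieve itself does: the way to sidestep XORing an uninitialised variable with itself is the dedicated zeroing intrinsic, which compilers are happy to turn into a register-clearing idiom.)
Code:
#include <immintrin.h>

/* Idiomatic all-zero 256-bit value; compilers typically emit a VEX-encoded
   vpxor for this, which (see the next post) already clears bits 255:128
   of the ymm register. */
static inline __m256i zero_vec(void)
{
    return _mm256_setzero_si256();
}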

2021-06-27, 09:35   #3
fivemack

I am just downloading gcc-11.1 in the hope that it will be equipped with a sufficiently heavy-duty yak razor.
2021-06-27, 13:05   #4
fivemack

Quote:
Originally Posted by fivemack
and I'm then unsurprisingly getting horrific slowdown from mixing XMM and YMM code.
Well, that's not actually the cause; the VEX-encoded vpxor xmm0,xmm0,xmm0 instruction is in fact defined as clearing the top half of the ymm register (whilst the superficially similar legacy-encoded pxor xmm0,xmm0 is defined as leaving the top half intact, and it is that which causes the expensive SSE/AVX transition stalls).

2021-07-02, 17:05   #5
jasonp

If the working set is too large when biting off 8-bit chunks then another option is to use a larger number of 6-bit chunks. With word size W bits and chunk size C bits the table will have (W/8) * (W/C) * 2^C bytes.
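As a worked check of that formula (a throwaway sketch, not part of msieve; W/C is rounded up when C does not divide W):
Code:
#include <stdio.h>

/* table footprint in bytes for word size W bits and chunk size C bits:
   (W/8) bytes per entry, 2^C entries per chunk, ceil(W/C) chunks */
static unsigned long table_bytes(unsigned W, unsigned C)
{
    return (unsigned long)(W / 8) * ((W + C - 1) / C) * (1ul << C);
}

int main(void)
{
    printf("W=256 C=8: %lu\n", table_bytes(256, 8));   /* 262144  = 256 kB */
    printf("W=256 C=6: %lu\n", table_bytes(256, 6));   /*  88064 ~=  86 kB */
    printf("W=256 C=2: %lu\n", table_bytes(256, 2));   /*  16384  =  16 kB */
    printf("W=512 C=8: %lu\n", table_bytes(512, 8));   /* 1048576 =   1 MB */
    return 0;
}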

The msieve-lacuda branch uses C=6 and it does make GPU runs faster, but the stock version uses W=64 only. Greg is spending time generalizing to larger word sizes and folding in MPI so that multiple GPUs can work together.

We are also going very carefully through the BxN * NxB vector-vector operation, which is a playground of fun to implement in CUDA.
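For readers who haven't met it, the BxN * NxB product combines two N x VBITS blocks v and w into a single VBITS x VBITS matrix over GF(2): entry (r,c) is the parity, over all rows i, of (bit r of v[i]) AND (bit c of w[i]). A hedged scalar sketch follows (plain C, not the lacuda code; __builtin_ctzll is a GCC/Clang builtin):
Code:
#include <stdint.h>

#define VBITS  256
#define VWORDS (VBITS / 64)

typedef struct { uint64_t w[VWORDS]; } vec_t;

/* out = v^T * w over GF(2): for every set bit b of v[i], XOR the whole
   vector w[i] into row b of the VBITS x VBITS result */
static void mul_BxN_NxB_sketch(uint64_t n, const vec_t *v, const vec_t *w,
                               vec_t out[VBITS])
{
    for (int r = 0; r < VBITS; r++)
        for (int j = 0; j < VWORDS; j++)
            out[r].w[j] = 0;

    for (uint64_t i = 0; i < n; i++) {
        for (int j = 0; j < VWORDS; j++) {
            uint64_t g = v[i].w[j];
            while (g) {                       /* visit each set bit of v[i] */
                int b = __builtin_ctzll(g);
                for (int k = 0; k < VWORDS; k++)
                    out[j * 64 + b].w[k] ^= w[i].w[k];
                g &= g - 1;
            }
        }
    }
}
On a GPU the interesting part is that many threads end up accumulating into the same rows of out at once, which is presumably where the "very careful" treatment comes in.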

2021-07-02, 19:29   #6
frmky

I reimplemented the NxB_BxB CUDA kernel to bite off 2 bits at a time and build 4 (or actually 3, since the 00 table isn't needed) arrays in GPU shared memory, rather than doing this on the CPU and uploading the result to the GPU. This gave a 3-5x speedup depending on the GPU.

https://github.com/gchilders/msieve_...czos_kernel.cu

Other than merging the non-lacuda branch changes back into the lacuda branch and testing the CUDA-aware MPI stuff once I have access to a cluster that supports it, I think I'm mostly done. CUDA and CUDA+MPI both work. As a test I used two (now ancient) Tesla K20 GPUs to solve a 3.36M x 3.36M matrix in 2h22m. Once I have access to them again, probably by Tuesday, I'll test it out on a couple of V100s.

In the CPU code, the explicit unrolling generally needs to be removed. GCC 10+ doesn't auto-vectorize the unrolled loops, but it is good about detecting and vectorizing the rolled versions; this really helps for ARM SVE. Also, on ARM SVE, adding the option -msve-vector-bits=512 helps significantly. I'm not sure if there's an equivalent option for AVX2 or AVX-512.
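A minimal illustration of the rolled form that recent GCC auto-vectorises well (just an illustration, not msieve code). On x86 the nearest analogue to -msve-vector-bits that I know of is GCC/Clang's -mprefer-vector-width=256 or =512, though that only expresses a preference rather than fixing the width.
Code:
#include <stdint.h>

/* Plain rolled loop: with -O3 and a suitable -march (e.g. -march=native,
   or -march=armv8-a+sve -msve-vector-bits=512 on SVE hardware) the compiler
   picks the vector width itself; hand-unrolled copies of the body tend to
   defeat the vectoriser. */
void xor_acc(uint64_t *restrict y, const uint64_t *restrict x, uint64_t nwords)
{
    for (uint64_t i = 0; i < nwords; i++)
        y[i] ^= x[i];
}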