mersenneforum.org Profile-driven optimisation for lanczos
 Register FAQ Search Today's Posts Mark Forums Read

 2021-06-26, 19:46 #1 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 22·1,613 Posts Profile-driven optimisation for lanczos Has anyone got thoughts about the NxB times BxB multiply once VBITS grows big? My profile at the moment looks like Code:  26.39% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_BxN_NxB 26.09% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_trans_packed_core 21.01% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_packed_core 17.99% msieve-MP-V256- msieve-MP-V256-BDW [.] core_NxB_BxB_acc 5.58% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_packed_small_core versus Code:  37.72% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_trans_packed_core 30.40% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_packed_core 14.74% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_BxN_NxB 7.40% msieve-MP-V128- msieve-MP-V128-HSW [.] core_NxB_BxB_acc 6.50% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_packed_small_core At the moment we make VBITS/8 tables of 256*VBITS bits each, which barely fits in L2 for VBITS=256 and is a whole megabyte long for VBITS=512 The matrix itself is VBITS^2 bits, which is 8kB and fits nicely in L1 for VBITS=256, but is already getting inconvenient for VBITS=512; I wonder how much slower it is to do eight times as many accesses, in a perfectly uniform pattern so no address computation, to a table of 1/32 the size? I suppose I should also try VBITS/2 tables of 4*VBITS and VBITS/4 tables of 16*VBITS, which are 16k and 64k. Last fiddled with by fivemack on 2021-06-27 at 19:16
 2021-06-26, 22:23 #2 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 193416 Posts This is driving me slightly mad: I made the obvious tweaks to use a 2-bit rather than 8-bit table, it slowed down immensely, and this was because it wasn't using YMM operations at all, rather carrying around the vector in four 64-bit registers. When I recoded to use absolutely explicit YMM operations: Code:  zero=_mm256_xor_si256(zero,zero); _mm256_zeroupper(); for (i = 0; i < n; i++) { accum = zero; jig = 0; for (j=0; j>=2; } } __m256i* yy = (__m256i*)(&(y[i])); *yy = _mm256_xor_si256(*yy,accum); } gcc-9.3 is generating VPXOR xmm0,xmm0,xmm0 (note the "x") from the intrinsic Code: accum = _mm256_xor_si256(accum,accum); and I'm then unsurprisingly getting horrific slowdown from mixing XMM and YMM code. Last fiddled with by fivemack on 2021-06-26 at 22:23
 2021-06-27, 09:35 #3 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 11001001101002 Posts I am just downloading gcc-11.1 in the hope that it will be equipped with a sufficiently heavy-duty yak razor
2021-06-27, 13:05   #4
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

22·1,613 Posts

Quote:
 Originally Posted by fivemack and I'm then unsurprisingly getting horrific slowdown from mixing XMM and YMM code.
Well, that's not actually the cause; the vpxor xmm0,xmm0,xmm0 instruction is in fact defined as clearing the top half of the register (whilst the superficially similar pxor xmm0,xmm0 is defined as leaving the top half intact and causing false-sharing issues).

Last fiddled with by fivemack on 2021-06-27 at 13:06

 2021-07-02, 17:05 #5 jasonp Tribal Bullet     Oct 2004 3×1,181 Posts If the working set is too large when biting off 8-bit chunks then another option is to use a larger number of 6-bit chunks. With word size W bits and chunk size C bits the table will have (W/8) * (W/C) * 2^C bytes. The msieve-lacuda branch uses C=6 and it does make GPU runs faster, but the stock version uses W=64 only. Greg is spending time generalizing to larger word size and folding in MPI so that multiple GPUs can combine together. We are also going very carefully through the BxN * NxB vector-vector operation, which is a playground of fun to implement in CUDA. Last fiddled with by jasonp on 2021-07-02 at 17:07
 2021-07-02, 19:29 #6 frmky     Jul 2003 So Cal 225510 Posts I reimplemented the NxB_BxB CUDA kernel to bite off 2 bits at a time and make 4 (or actually 3 since the 00 table isn't needed) arrays in GPU shared memory rather than doing this on the cpu and uploading the result to the gpu. This gave a 3-5x speedup depending on the GPU. https://github.com/gchilders/msieve_...czos_kernel.cu Other than merging the non-lacuda branch changes back into the lacuda branch and testing the CUDA-aware MPI stuff once I have access to a cluster that supports it, I think I'm mostly done. CUDA and CUDA+MPI both work. As a test I used two (now ancient) Tesla K20 gpus to solve a 3.36M x 3.36M matrix in 2h22m. Once I have access to them again, probably by Tuesday, I'll test it out on a couple of V100's. In the CPU code, the explicit unrolling generally needs to be removed. GCC 10+ don't auto-vectorize the unrolled loops but are good about detecting and vectorizing the rolled versions. This really helps for ARM SVE. Also, on ARM SVE, adding the option -msve-vector-bits=512 helps significantly. I'm not sure if there's an equivalent option for AVX2 or AVX 512.

 Similar Threads Thread Thread Starter Forum Replies Last Post mickfrancis Math 9 2016-03-30 10:20 Oddball Lounge 43 2011-03-14 00:26 storm5510 Hardware 6 2009-08-19 13:05 Xyzzy Lounge 3 2005-07-12 23:12 sdbardwick Software 5 2002-09-22 19:49

All times are UTC. The time now is 14:08.

Fri Jan 21 14:08:13 UTC 2022 up 182 days, 8:37, 0 users, load averages: 1.25, 1.22, 1.19