mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Msieve (https://www.mersenneforum.org/forumdisplay.php?f=83)
-   -   Profile-driven optimisation for Lanczos (https://www.mersenneforum.org/showthread.php?t=26954)

fivemack 2021-06-26 19:46

Profile-driven optimisation for Lanczos
 
Has anyone got thoughts about the NxB times BxB multiply once VBITS grows big?

My profile at the moment looks like
[code]
26.39% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_BxN_NxB
26.09% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_trans_packed_core
21.01% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_packed_core
17.99% msieve-MP-V256- msieve-MP-V256-BDW [.] core_NxB_BxB_acc
5.58% msieve-MP-V256- msieve-MP-V256-BDW [.] mul_packed_small_core
[/code]

versus

[code]
37.72% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_trans_packed_core
30.40% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_packed_core
14.74% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_BxN_NxB
7.40% msieve-MP-V128- msieve-MP-V128-HSW [.] core_NxB_BxB_acc
6.50% msieve-MP-V128- msieve-MP-V128-HSW [.] mul_packed_small_core
[/code]

At the moment we make VBITS/8 tables of 256*VBITS bits each; the whole set barely fits in L2 for VBITS=256 and is a whole megabyte for VBITS=512.

The matrix itself is VBITS^2 bits, which is 8 kB and fits nicely in L1 for VBITS=256, but at 32 kB is already getting inconvenient for VBITS=512. I wonder how much slower it is to do eight times as many accesses, in a perfectly uniform pattern so there's no address computation, to a table 1/32 of the size, i.e. just the matrix itself?

I suppose I should also try 2-bit chunks (VBITS/2 tables of 4*VBITS bits, 16 kB in total for VBITS=256 and 64 kB for VBITS=512) and 4-bit chunks (VBITS/4 tables of 16*VBITS bits, 32 kB and 128 kB).
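
For reference, here is a throwaway sketch (plain C, not msieve code; the function name is made up) that tabulates the total table footprint for the chunk sizes above, i.e. (VBITS/chunk) tables of 2^chunk entries of VBITS bits each:

[code]
#include <stdio.h>

/* Total lookup-table footprint in bytes for word size w bits and chunk
   size c bits: (w/c) tables, each with 2^c entries of w bits. */
static unsigned long long table_bytes(unsigned w, unsigned c)
{
    return (unsigned long long)(w / c) * (1ULL << c) * (w / 8);
}

int main(void)
{
    unsigned vbits[] = { 256, 512 };
    unsigned chunk[] = { 1, 2, 4, 8 };
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            printf("VBITS=%u, %u-bit chunks: %llu bytes of tables\n",
                   vbits[i], chunk[j], table_bytes(vbits[i], chunk[j]));
    return 0;
}
[/code]

(The 1-bit case counts a trivial zero entry per table; dropping it leaves just the VBITS^2-bit matrix itself.)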

fivemack 2021-06-26 22:23

This is driving me slightly mad: I made the obvious tweaks to use a 2-bit rather than an 8-bit table, and it slowed down immensely because the compiler wasn't using YMM operations at all, instead carrying the vector around in four 64-bit registers.

When I recoded to use absolutely explicit YMM operations:

[code]
/* Multiply the NxB vector v by the BxB matrix, 2 bits at a time:
   for each 2-bit chunk of each input word, XOR the matching entry of
   the precomputed tables c[] into a YMM accumulator, then fold the
   accumulator into y[i]. */
zero = _mm256_xor_si256(zero, zero);
_mm256_zeroupper();
for (i = 0; i < n; i++) {
    accum = zero;
    jig = 0;                    /* index of the current 4-entry table */
    for (j = 0; j < VWORDS; j++) {
        uint64 g = v[i].w[j];
        for (k = 0; k < 64; k += 2) {
            /* look up the entry selected by the low two bits of g */
            accum = _mm256_xor_si256(accum, *(__m256i *)(&c[jig + (g & 3)]));
            jig += 4;
            g >>= 2;
        }
    }
    __m256i *yy = (__m256i *)(&y[i]);
    *yy = _mm256_xor_si256(*yy, accum);
}
[/code]
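
(Aside: a hedged sketch of how the 2-bit tables c[] above might be built, assuming VBITS=256, that m[] holds the 256x256 matrix as an array of __m256i rows, and that c[] is an array of 2*VBITS __m256i entries; this is illustrative, not the actual msieve table-build code, and the row-to-bit mapping is an assumption.)

[code]
/* For each pair of matrix rows, precompute the four possible XOR
   combinations so a 2-bit chunk of the input word is one table lookup:
   entry 0 = 0, entry 1 = row 2p, entry 2 = row 2p+1, entry 3 = their XOR. */
for (p = 0; p < VBITS / 2; p++) {
    __m256i r0 = m[2 * p];
    __m256i r1 = m[2 * p + 1];
    c[4 * p + 0] = _mm256_setzero_si256();
    c[4 * p + 1] = r0;                        /* low bit of chunk set  */
    c[4 * p + 2] = r1;                        /* high bit of chunk set */
    c[4 * p + 3] = _mm256_xor_si256(r0, r1);  /* both bits set         */
}
[/code]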

gcc-9.3 is generating

VPXOR xmm0,xmm0,xmm0

(note the "x")

from the intrinsic

[code]
accum = _mm256_xor_si256(accum,accum);
[/code]

and I'm then unsurprisingly getting horrific slowdown from mixing XMM and YMM code.

fivemack 2021-06-27 09:35

I am just downloading gcc-11.1 in the hope that it will be equipped with a sufficiently heavy-duty yak razor.

fivemack 2021-06-27 13:05

[QUOTE=fivemack;582011]
and I'm then unsurprisingly getting horrific slowdown from mixing XMM and YMM code.[/QUOTE]

Well, that's not actually the cause: the VEX-encoded vpxor xmm0,xmm0,xmm0 is in fact defined to clear the upper 128 bits of the corresponding YMM register (whilst the superficially similar legacy-SSE pxor xmm0,xmm0 is defined to leave the upper bits intact, which is what causes false-dependency and SSE/AVX transition penalties).
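
To see this in isolation (a minimal standalone sketch, not msieve code): under gcc with -mavx2, both of the functions below typically compile to that same VEX-encoded vpxor xmm0,xmm0,xmm0, and since the VEX encoding zeroes bits 255:128 of the destination, the "x" in the disassembly is harmless.

[code]
#include <immintrin.h>

/* gcc recognises x ^ x as zero and may emit "vpxor xmm0, xmm0, xmm0". */
__m256i zero_by_xor(__m256i x)
{
    return _mm256_xor_si256(x, x);
}

/* The conventional way to ask for a zeroed YMM register; it usually
   produces exactly the same instruction. */
__m256i zero_by_setzero(void)
{
    return _mm256_setzero_si256();
}
[/code]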

jasonp 2021-07-02 17:05

If the working set is too large when biting off 8-bit chunks, then another option is to use a larger number of 6-bit chunks. With word size W bits and chunk size C bits, the tables total (W/8) * (W/C) * 2^C bytes.
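
Plugging numbers into that formula (rounding W/C up when C doesn't divide W):

[code]
C=8, W=256:  32 * 32 * 256 =  262144 bytes (256 kB)
C=6, W=256:  32 * 43 *  64 =   88064 bytes (~86 kB)
C=8, W=512:  64 * 64 * 256 = 1048576 bytes (1 MB)
C=6, W=512:  64 * 86 *  64 =  352256 bytes (~344 kB)
[/code]

so 6-bit chunks cut the table footprint by roughly 3x.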

The msieve-lacuda branch uses C=6 and it does make GPU runs faster, but the stock version uses W=64 only. Greg is spending time generalizing to larger word sizes and folding in MPI so that multiple GPUs can work together.

We are also going very carefully through the BxN * NxB vector-vector operation, which is a playground of fun to implement in CUDA.

frmky 2021-07-02 19:29

I reimplemented the NxB_BxB CUDA kernel to bite off 2 bits at a time and build 4 arrays (actually 3, since the 00 table isn't needed) in GPU shared memory, rather than doing this on the CPU and uploading the result to the GPU. This gave a 3-5x speedup depending on the GPU.

[URL="https://github.com/gchilders/msieve_nfsathome/blob/msieve-lacuda-nfsathome/common/lanczos/gpu/lanczos_kernel.cu"]https://github.com/gchilders/msieve_nfsathome/blob/msieve-lacuda-nfsathome/common/lanczos/gpu/lanczos_kernel.cu[/URL]

Other than merging the non-lacuda branch changes back into the lacuda branch and testing the CUDA-aware MPI stuff once I have access to a cluster that supports it, I think I'm mostly done. CUDA and CUDA+MPI both work. As a test I used two (now ancient) Tesla K20 GPUs to solve a 3.36M x 3.36M matrix in 2h22m. Once I have access to them again, probably by Tuesday, I'll test it out on a couple of V100s.

In the CPU code, the explicit unrolling generally needs to be removed. GCC 10 and later don't auto-vectorize the unrolled loops but are good at detecting and vectorizing the rolled versions. This really helps for ARM SVE. Also, on ARM SVE, adding the option -msve-vector-bits=512 helps significantly. I'm not sure whether there's an equivalent option for AVX2 or AVX-512.
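
As an illustration of the rolled form (a generic sketch with made-up names, not the actual msieve loops): with a compile-time VWORDS, a plain word-by-word XOR loop like the one below is something GCC 10+ can usually turn into full-width vector XORs at -O3 with a suitable -march (or -msve-vector-bits) setting, whereas a hand-unrolled copy of the same loop often defeats the auto-vectoriser.

[code]
#include <stdint.h>

#define VWORDS 8                                /* e.g. VBITS = 512 */

typedef struct { uint64_t w[VWORDS]; } vec_t;   /* illustrative name */

/* Rolled form: XOR-accumulate a into y word by word and let the
   compiler pick the vector width (AVX2, AVX-512, SVE, ...). */
static void vec_xor(vec_t *y, const vec_t *a)
{
    for (int j = 0; j < VWORDS; j++)
        y->w[j] ^= a->w[j];
}
[/code]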

