Old 2021-07-02, 19:29   #6
frmky
Jul 2003
So Cal

100010110000₂ Posts

I reimplemented the NxB_BxB CUDA kernel to bite off 2 bits at a time and build the 4 lookup arrays (actually 3, since the 00 table isn't needed) in GPU shared memory, rather than building them on the CPU and uploading the result to the GPU. This gave a 3-5x speedup, depending on the GPU.
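To make the 2-bit table idea concrete, here is a minimal CPU-side sketch in plain C, assuming B = 64 so that each row of the N x B matrix and each row of the B x B matrix fits in one uint64_t. It is not the actual CUDA kernel: the function names and table layout are hypothetical, and the tables live on the stack here where the kernel would keep them in shared memory. It only shows why three tables are enough, since the 00 pattern contributes nothing to the XOR.

```c
#include <stdint.h>
#include <stddef.h>

/* Reference version: examine one bit of each row of x at a time.
   out[i] is the XOR of every y[j] for which bit j of x[i] is set,
   i.e. a GF(2) product of an N x 64 matrix with a 64 x 64 matrix. */
static void mul_nx64_64x64_naive(const uint64_t *x, const uint64_t *y,
                                 uint64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t acc = 0;
        for (int j = 0; j < 64; j++) {
            if ((x[i] >> j) & 1)
                acc ^= y[j];
        }
        out[i] = acc;
    }
}

/* Two bits at a time (hypothetical layout): for each of the 32 bit
   pairs, precompute the result for patterns 01, 10 and 11.
   Pattern 00 always contributes zero, so it needs no table. */
static void mul_nx64_64x64_2bit(const uint64_t *x, const uint64_t *y,
                                uint64_t *out, size_t n)
{
    uint64_t tab[3][32];

    for (int k = 0; k < 32; k++) {
        tab[0][k] = y[2 * k];                 /* pattern 01: low bit only  */
        tab[1][k] = y[2 * k + 1];             /* pattern 10: high bit only */
        tab[2][k] = y[2 * k] ^ y[2 * k + 1];  /* pattern 11: both bits     */
    }

    for (size_t i = 0; i < n; i++) {
        uint64_t acc = 0;
        uint64_t row = x[i];
        for (int k = 0; k < 32; k++) {
            unsigned p = (unsigned)(row & 3);  /* next 2-bit pattern */
            if (p != 0)
                acc ^= tab[p - 1][k];
            row >>= 2;
        }
        out[i] = acc;
    }
}
```

The tables are tiny (3 x 32 words), which is what makes building them on the device cheap compared with preparing larger tables on the host and copying them over.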

Other than merging the non-lacuda branch changes back into the lacuda branch and testing the CUDA-aware MPI support once I have access to a cluster that supports it, I think I'm mostly done. CUDA and CUDA+MPI both work. As a test, I used two (now ancient) Tesla K20 GPUs to solve a 3.36M x 3.36M matrix in 2h22m. Once I have access to them again, probably by Tuesday, I'll test it on a couple of V100s.

In the CPU code, the explicit unrolling generally needs to be removed. GCC 10 and later doesn't auto-vectorize the unrolled loops but is good at detecting and vectorizing the rolled versions. This really helps on ARM SVE. Also on ARM SVE, adding the option -msve-vector-bits=512 helps significantly. I'm not sure if there's an equivalent option for AVX2 or AVX-512.
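As an illustration of the rolled versus unrolled point, here is a small C sketch with two hypothetical functions that compute the same XOR accumulation. Neither is taken from the actual code base. Whether a given loop is vectorized depends on the compiler version and flags, so it is worth checking the vectorizer report instead of assuming.

```c
#include <stdint.h>
#include <stddef.h>

/* Rolled form: a single simple loop. The restrict qualifiers tell the
   compiler that dst and src do not overlap, which helps the vectorizer. */
void xor_accumulate(uint64_t *restrict dst, const uint64_t *restrict src,
                    size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Hand-unrolled by 4: the same result, written the older way.
   This is the style that the post suggests removing. */
void xor_accumulate_unrolled4(uint64_t *restrict dst,
                              const uint64_t *restrict src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     ^= src[i];
        dst[i + 1] ^= src[i + 1];
        dst[i + 2] ^= src[i + 2];
        dst[i + 3] ^= src[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        dst[i] ^= src[i];
}

/* One possible way to compare the two on an SVE target (an example
   command line, not a tested recommendation):
     gcc -O3 -march=armv8.2-a+sve -msve-vector-bits=512 -fopt-info-vec -c file.c
   -fopt-info-vec makes GCC report which loops it vectorized. */
```

On x86, GCC does have -mprefer-vector-width=512, which asks the auto-vectorizer to prefer 512-bit vectors when AVX-512 is enabled. It is only loosely related to -msve-vector-bits, since it expresses a preference and does not fix an assumed hardware vector length, so whether it gives a similar benefit here would need measuring.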