mersenneforum.org 64x64 integer multiplication GCC
 User Name Remember Me? Password
 Register FAQ Search Today's Posts Mark Forums Read

 2020-07-22, 06:30 #1 diep     Sep 2006 The Netherlands 757 Posts 64x64 integer multiplication GCC Good Morning! Looking for way of multiplying fast in GCC 64x 64 bits unsigned 64 bits integers and obtaining both highres as well as lowres result. Was busy reviving CUDA sieving code i wrote in 2016 and to verify the results on the CPU my oldie code there for GCC is duck slow. Under windows it's fast enough there as one has intrinsics: reslo = _umul128(x,y,&reshi); However what i have for under linux where my GPU runs under i have only something duck slow for GCC. Right now under GCC i'm using the duckslow next thing (typing over by hand): uint64 mul_limbs(uint64 *reslo,uint64 x,uint64 y) { __uint128_t prod; prod = (__uint128_t) x * y; *reslo = (uint64)prod; return(prod>>64); } This duckslow - anyone know faster solution - and i just need multiply a single x * y for the BSGS algorithm verification. Kind Regards, Vincent
 2020-07-22, 13:25 #2 bsquared     "Ben" Feb 2007 2×1,789 Posts Try: Code: __inline uint64_t mul64(uint64_t x, uint64_t y, uint64_t* hi) { __asm__( "mulq %3 \n\t" : "=&a"(x), "=&d"(y) : "0"(x), "1"(y) : "cc" ); *hi = y; return x; } If you have a processor with mulx then this might be slightly faster: Code: __inline uint64_t mulx64(uint64_t x, uint64_t y, uint64_t* hi) { __asm__( "mulx %3, %0, %1 \n\t" : "=&d"(x), "=&a"(y) : "0"(x), "1"(y) ); *hi = y; return x; } Last fiddled with by bsquared on 2020-07-22 at 13:47 Reason: slight change to mul64
 2020-07-22, 14:14 #3 jasonp Tribal Bullet     Oct 2004 3·1,181 Posts As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory: Code:  #define umul_ppmm(w1, w0, u, v) \ __asm__ ("mulq %3" \ : "=a" (w0), "=d" (w1) \ : "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))
2020-07-22, 14:56   #4
bsquared

"Ben"
Feb 2007

2·1,789 Posts

Quote:
 Originally Posted by jasonp As a slight modification, the following macro (from an old version of GMP's longlong.h) can avoid a potential trip through memory: Code:  #define umul_ppmm(w1, w0, u, v) \ __asm__ ("mulq %3" \ : "=a" (w0), "=d" (w1) \ : "%0" ((UDItype)(u)), "rm" ((UDItype)(v)))
Yeah, macros may integrate better into the rest of the code.

However, for this simple benchmark loop:

Code:
for (i = 0; i < 1000000000; i++)
{
c = mulx64(a, b, &d);
a = c;
b = d;
}
all of these are the same using gcc 7.3.0 (including diep's original code)

Code:
mulx took 0.2715 sec
mul took 0.2714 sec
mul_limbs took 0.2715 sec
umul_ppmm took 0.2714 sec

 2020-07-22, 15:08 #5 kruoli     "Oliver" Sep 2017 Porta Westfalica, DE 64510 Posts Is GCC optimizing it to the same target assembly code? Last fiddled with by kruoli on 2020-07-22 at 15:08 Reason: Spelling.
 2020-07-22, 15:22 #6 diep     Sep 2006 The Netherlands 75710 Posts i wrote that mul_limbs code like 15 years ago. maybe GCC got a lot more clever lately :) Back then you can bet i cried it out loud like a wounded Bear that GCC was duckslow optimizing it back then in my NTT code versus oldie visual c++ compiler making fast clean code with its intrinsics (back then not so old). I don't know which GCC version gets called by Nvidia's Cuda compiler driver. nvcc --version gives 'built on sun_sep__4_22:14:01_cdt_2016 gcc --version gives 4.9.2 Let's see whether NVCC knows the -S option to print assembler. If NVCC is not clever enough and this GCC version is - i can still consider making a library out of the math functions and just simply link it.
2020-07-22, 15:48   #7
bsquared

"Ben"
Feb 2007

357810 Posts

Quote:
 Originally Posted by kruoli Is GCC optimizing it to the same target assembly code?
As far as I can see from objdump, yes (except the one usage of mulx).

2020-07-22, 23:04   #8
ewmayer
2ω=0

Sep 2002
República de California

1165810 Posts

Quote:
 Originally Posted by bsquared Yeah, macros may integrate better into the rest of the code. However, for this simple benchmark loop: Code: for (i = 0; i < 1000000000; i++) { c = mulx64(a, b, &d); a = c; b = d; } all of these are the same using gcc 7.3.0 (including diep's original code) Code: mulx took 0.2715 sec mul took 0.2714 sec mul_limbs took 0.2715 sec umul_ppmm took 0.2714 sec
That kind of single-MUL test loop is a poor way to gauge throughput - you want to have multiple MULs - at least 4, IMO - overlapping so as to better hide latency. OTOH if the compiler is unrolling the loop for you, it might be OK - is that the case?

Last fiddled with by ewmayer on 2020-07-22 at 23:06

2020-07-23, 02:42   #9
bsquared

"Ben"
Feb 2007

2×1,789 Posts

Quote:
 Originally Posted by ewmayer That kind of single-MUL test loop is a poor way to gauge throughput - you want to have multiple MULs - at least 4, IMO - overlapping so as to better hide latency. OTOH if the compiler is unrolling the loop for you, it might be OK - is that the case?
Agreed, it's not a good benchmark. The OP asked for a way to multiply 64-bit unsigned ints and he now has several things to try. I assumed he will use them in his application to figure out which is best... the loop was just a first cut to see if there were glaring differences.

 2020-07-23, 08:24 #10 retina Undefined     "The unspeakable one" Jun 2006 My evil lair 3×7×13×23 Posts I am still amazed at how much C code it takes just to coax it into using a single native instruction. You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it the worst possible syntax: AT&T.
2020-07-23, 08:34   #11
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10,949 Posts

Quote:
 Originally Posted by retina I am still amazed at how much C code it takes just to coax it into using a single native instruction. You have to throw out all the "portability" that C is supposed to provide. And you have to provide a lot of ugly boilerplate lines to coerce it into assembling just one line. And you have to write it the worst possible syntax: AT&T.

C is an old language designed to be efficient on a computer, the PDP-11, which was discontinued years before its manufacturer, DEC, went out of business --- itself many years ago.

AFAICT essentially all other languages (assembly excepted of course) make it difficult to exploit all the instructions provided by any CISC architecture.

 Similar Threads Thread Thread Starter Forum Replies Last Post paulunderwood Computer Science & Computational Number Theory 17 2020-05-21 19:51 bhelmes Math 4 2016-10-06 13:33 jasong Miscellaneous Math 5 2016-04-24 03:40 vector Miscellaneous Math 10 2007-12-20 18:16 clowns789 Miscellaneous Math 5 2005-03-11 00:23

All times are UTC. The time now is 07:39.

Sun Oct 17 07:39:25 UTC 2021 up 86 days, 2:08, 0 users, load averages: 1.12, 1.28, 1.26