mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   NTT faster than FFT? (https://www.mersenneforum.org/showthread.php?t=27012)

moytrage 2021-07-17 16:22

NTT faster than FFT?
 
As far as I know Prime95 uses Fast Fourier Transform to multiply (square) large numbers, right?

Just for fun, recently I decided to implement from scratch both Fast Fourier Transform (FFT) and Number Theoretical Transform (NTT) for numbers multiplication. In C++ (and using SIMD instructions).

According to my time measurements NTT appeared 10-15 times faster than FFT for doing multiplications of numbers around 16 MegaBytes in size.

Both NTT and FFT code I tried to optimize as much as possible. Also tested that both algorithms gave correct results, by doing naive multiplication.

FFT was tweaked in such a way that error is around 0.07-0.1. NTT of cause gives no error as it is precise. Chinese Reminder Theorem was used at final step of NTT multiplication.

Does anyone know if there was any trial to use NTT in Prime95 engine instead of FFT? Was it successful and/or gave any speed or other benefits?

kriesel 2021-07-17 18:01

Welcome.
If you're feeling brave, show some source code. The accomplished code smiths may have some helpful comments.

How do your multiplication times compare to GWNUM /prime95 timings for same number size, processor, number of cores, etc.?

I assume George did some sort of analysis if not actual experimentation with alternate code for prime95. His code is much faster than GMP multiple precision routines that avoid use of floating point because of error concerns. Prime95 uses the IBDWT and multiple instruction-set-tailored code paths.
Preda experimented with DP, SP, M31, and M61 back in gpuowl V1.9. M61 was substantially slower than DP on AMD GPUs but gave slightly higher maximum exponent for the same fft length. SP and M31 were not useful; the usable bits/word were too small. See [URL]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/URL] including the first attachment.

There's a lot of GIMPS computing reference info at [URL]https://www.mersenneforum.org/showthread.php?t=24607[/URL].

Viliam Furik 2021-07-17 18:13

[QUOTE=moytrage;583352]As far as I know Prime95 uses Fast Fourier Transform to multiply (square) large numbers, right?[/QUOTE]

Well yes, but actually no. It uses something called IBDWT - Irrational Base Discrete Weighted Transform. But it is similar.

[QUOTE=moytrage;583352]
According to my time measurements NTT appeared 10-15 times faster than FFT for doing multiplications of numbers around 16 MegaBytes in size.
[/QUOTE]

16 MB in size, does that mean that a similar-sized Mersenne number is 2^134217728-1 (if Megabyte = 1024*1024 bytes, sometimes called Mebibyte) or 2^128000000-1 (if Megabyte is 1000*1000 bytes)?

What times did you get with your code, and what times do you get with Prime95 at similar exponents?

moytrage 2021-07-17 18:20

@kriesel Thanks for quick reply!

Right now my code is quite dirty draft. Is not very well suited for usage or reading.

Also didn't do comarison to current's Prime95 code. Sure Prime95's FFT is much faster than mine.

I just wanted to create a Proof of Concept about comparison of FFT and NTT for myself.

Actually I quite accidentally found out that NTT was much faster. Before implementing I didn't even expect this.

Just to make comparison of 2 my algos fair I made as much optimizations as possible and used SIMD instructions. Actually only FFT is currently using AVX1 SIMD in my code. So FFT is even in better position. Still NTT is 10-15 times faster than FFT.

Actually by posting this Thread first of all I wanted to find out if anyone knows if there were any efforts or comparisons to NTT implementations? Maybe there were already such exepriments and they didn't show such good improvement of NTT compared to FFT.

I can do speed comparisons if somebody can point out to me how to do that inside gwnum. If there is a ready made function that does multiplication of two large numbers. I.e. if there is some imaginary function `mult(char const * a, char const * b, int size, char * c)` that is when given two arrays of bytes computes resulting multiplication of those two. I can call such function to compare to my NTT.

Just of curiosity I compared my NTT implementation to some random optimized NTT inside GitHub like this one [url]https://github.com/Mysticial/ProtoNTT/tree/master/Binaries[/url] , this implementation is said to be a part of Y-Cruncher, a famous one. So my implementation is 5-7% slower than this, hence I can conclude that mine is quite fast.

Viliam Furik 2021-07-17 18:27

I think the easiest way to do a speed comparison is to find a similar size exponent of a Mersenne number, and run its PRP test in Prime95 for a few iterations. The reported time per iteration is your result for Prime95, and thus gwnum.

Make sure your code is using the same number of cores as Prime95.

moytrage 2021-07-17 18:32

[QUOTE=Viliam Furik;583359]
What times did you get with your code, and what times do you get with Prime95 at similar exponents?
[/QUOTE]

Right now I re-measured timings of my NTT multiplying two large integers of size 2^25 bytes each (2^22 uint-64 words each). For my measurement to be fair I also measured this NTT [url]https://github.com/Mysticial/ProtoNTT/tree/master/Binaries[/url] which is a part of Y-Cruncher.

So for this two numbers each 2^25 bytes I get 9.8 seconds on my NTT and 9.1 seconds on Y-Cruncher's NTT on my old 1Ghz laptop, these two a measured single-threaded.

If somebody can point me to how I can do multiply using `gwnum` lib or how to use Prime95 to test one multiplication of given size, then I can do measurements on my laptop.

But just to tell in advance - I actually wanted to tell not that my implementation is fast but wanted to stress out that NTT is maybe faster compared to FFT.

Viliam Furik 2021-07-17 18:52

[QUOTE=moytrage;583362]If somebody can point me to how I can do multiply using `gwnum` lib or how to use Prime95 to test one multiplication of given size, then I can do measurements on my laptop.
[/QUOTE]

If "two large integers of size 2^25 bytes each (2^22 uint-64 words each)" are each about the same size as 2^268435399-1, you can use Prime95 with this worktodo line:

[CODE]PRP=N/A,1,2,268435399,-1,85,0[/CODE]

Let it run until you get a few print-outs in the worker window, and tell us the average of the numbers it says after "ms/iter:". 1 iteration is a little bit more than 1 multiplication.

If you are testing your NTT and y-cruncher's NTT on one core, one thread, set it in the Prime95 in the upper menu Test -> Worker Windows (number of workers = 1, CPU cores to use = 1).

paulunderwood 2021-07-17 18:59

[QUOTE=Viliam Furik;583363] 1 iteration is 1 multiplication.

[/QUOTE]

Not quite true. 1 iteration is 1 multiplication plus 1 mod reduction. Of course mod reduction is fast for Mersenne numbers but not entirely insignificant. :smile:

Viliam Furik 2021-07-17 19:01

[QUOTE=paulunderwood;583365]Not quite true. 1 iteration is 1 multiplication plus 1 mod reduction. Of course mod reduction is fast for Mersenne numbers but not entirely insignificant. :smile:[/QUOTE]

Corrected, thanks!

moytrage 2021-07-17 19:14

[QUOTE=Viliam Furik;583363]Let it run until you get a few print-outs in the worker window.[/QUOTE]

Started your Prime95 speed test, all as you suggested. But already running for 15 minutes and still there is no print-out in worker's window.

Can you tell approximately how often print out to worker window is done for exponent of this size at 1 thread (1 core of 1Ghz)?

kriesel 2021-07-17 19:44

[QUOTE=moytrage;583362]Right now I re-measured timings of my NTT multiplying two large integers of size 2^25 bytes each (2^22 uint-64 words each). For my measurement to be fair I also measured this NTT [URL]https://github.com/Mysticial/ProtoNTT/tree/master/Binaries[/URL] which is a part of Y-Cruncher.

So for this two numbers each 2^25 bytes I get 9.8 seconds on my NTT and 9.1 seconds on Y-Cruncher's NTT on my old 1Ghz laptop, these two a measured single-threaded.

If somebody can point me to how I can do multiply using `gwnum` lib or how to use Prime95 to test one multiplication of given size, then I can do measurements on my laptop.

But just to tell in advance - I actually wanted to tell not that my implementation is fast but wanted to stress out that NTT is maybe faster compared to FFT.[/QUOTE]Based on the conclusions of several other researchers and excellent coders in the area, "NTT is maybe faster compared to FFT" seems an inexplicable result. For a summary of the current state of the art see [URL]https://www.mersenneforum.org/showpost.php?p=510721&postcount=7[/URL].

To run prime95 for comparison timings: In the reference info link I gave you earlier, scroll down to the thread that is specific to prime95. Read and follow the how to get started info in that thread's first post. Create worktodo entries manually. Set workers to one, cores to one. Set prime95 to print out interim residues and timing every 100 iterations. This timing will include the error detection's slight overhead. Prime95 can be configured for fraction of a second resolution on time stamp of screen output also. There are several good documentation files included in the package, recommended reading. Try to replicate in prime95, with timing, while nothing else runs, the 100-iteration residues for larger known Mersenne primes in the attachment of [URL]https://www.mersenneforum.org/showpost.php?p=506283&postcount=5[/URL].
Then attempt to compute the same [URL="https://www.mersenneforum.org/showpost.php?p=510732&postcount=8"]type 1 PRP sequence with a=3[/URL] using your FFT or NTT code, with output of timing and least significant 64 bits of the 100-iteration results, on the same exponents, iterations, and hardware.

It's not clear to me how you avoid issues with carry propagation or loss, apparently using no guard bits. Perhaps I misunderstood your reference to 2[SUP]25[/SUP] byte operands (32MB each) in 2[SUP]22[/SUP] 8-byte words each.

As a point of reference, from one of the benchmark attachments in that prime95 reference thread, on a dual-core E8200 Core2Duo, in older prime95 v29.4b8, 32M 64-bit words fft length, max ~596M exponent, I get ~2.53 iter/sec. If I pessimistically scale that down 2:1 for core count and 2.66:1 for clock rate, I extrapolate 1/2.53 * 2 *2.66 = 2.1 seconds*core-GHz/iter. Same hardware, scaled pessimistically to 4M DPwords fft length, 1 core, 1 GHz, max ~79M exponent, ~23.5 iter/sec; 1/23.5*2*2.66 ~0.226 sec*core-GHz/iter.

The E8200 is SSE2 at best as I recall; [URL]https://ark.intel.com/content/www/us/en/ark/products/33909/intel-core-2-duo-processor-e8200-6m-cache-2-66-ghz-1333-mhz-fsb.html[/URL] does not say. (Prime95 is usually more memory bandwidth limited than cpu limited.)


All times are UTC. The time now is 09:33.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.