mersenneforum.org GPU Prime Gap searching
 Register FAQ Search Today's Posts Mark Forums Read

2021-09-07, 09:06   #23
danaj

"Dana Jacobsen"
Feb 2011
Bangkok, TH

11100011012 Posts

Quote:
 Originally Posted by MJansen I have been looking at Dana's Perl code, but not sure this is the correct 64 bit code: https://metacpan.org/dist/Math-Prime...ime/Util/PP.pm
No, that is fall-back code if using big numbers without GMP. It works but it's both ugly and slow. I admit some of the C code mis ugly as well, and the whole thing kind of grew organically from a tiny project.

Quote:
 And it seems like he uses a small wheel (30) and a Miller_Rabin base 2 test followed by a full Lucas, so in effect a full BPSW test? Is this a correct assumption?
That's mostly true, but not for inputs > 120 bits. The part about BPSW is true. For 64-bit inputs, the results are deterministic (like most math libraries/packages). For larger results, they have passed extra-strong BPSW.

64-bit:

64-bit next_prime: https://github.com/danaj/Math-Prime-...b8/util.c#L127

64-bit prev_prime: https://github.com/danaj/Math-Prime-...b8/util.c#L155

The algorithm is:
1. If the value is tiny, use a fixed data set.
2. If the value is in our small saved sieve, use that.
3. Skip forward in a mod-30 wheel doing a primality test.

Steps 1 and 2 are useful for the many calls done with small values.

The primality test: https://github.com/danaj/Math-Prime-...mality.c#L1274

That's hard-coded tiny trial division followed by "AES" BPSW, which is deterministic for 64-bit inputs. The MR base 2 test followed by a fast Lucas test.

On an x86 for native inputs, there seems to be no benefit in running a Fermat test rather than a Miller-Rabin test. For that matter, even with GMP I wasn't able to measure any advantage.

For the full test, a hashed Miller-Rabin set (2 or 3 total tests needed) is slightly faster for base-2 pseudoprimes, but not obviously faster for primes. On platforms without fast Montgomery math, it may be different.

Larger than 64-bit:

next_prime: https://github.com/danaj/Math-Prime-...mp_main.c#L332

prev_prime: https://github.com/danaj/Math-Prime-...mp_main.c#L360

surround_primes: https://github.com/danaj/Math-Prime-...mp_main.c#L389

For next and prev prime, with values less than 121 bits we use a very similar simple algorithm. Skip forward using a mod-30 wheel, do trivial trial division, then call the primality test.

primality test: https://github.com/danaj/Math-Prime-...mality.c#L1471

which is pretests (trial division), then ES BPSW (base-2 Miller-Rabin then extra-strong Lucas test).

pretests: https://github.com/danaj/Math-Prime-...mp_main.c#L108

which is overly complicated amounts of trial division. This includes cached large GCDs (because GMP does those very quickly), and also a choice between simple trivial division and a treesieve routine (D Bernstein describes the concept at a high level, Jens K Andersen has a nice detailed description of the concept, my implementation may or may not be particularly efficient, but it's much faster than one-at-a-time trial division for large enough inputs).

This leaves numbers larger than 120 bits, as well as surround_primes, which is meant for prime gaps. Here we do a partial sieve with a width of about 30 merits and a convoluted "educated guess" as to a depth that will give us the best performance. After the sieve, the ES BPSW test is done on candidates in order. It repeats the process if no result is found of course, though at 30 merits this isn't common.

surround_primes sieves 20 merits on either side of the given input, then tests candidates. If both candidates are not found, it repeats with 40, 80 merits, etc. to find whichever one (or both) haven't been found.

I have had on my todo list for a long time, going over the method in https://arxiv.org/abs/2012.03771, which is claimed to be faster.

2021-09-07, 22:10   #24
SethTro

"Seth"
Apr 2019

19·23 Posts

Quote:
 Originally Posted by danaj I have had on my todo list for a long time, going over the method in https://arxiv.org/abs/2012.03771, which is claimed to be faster.
That's my paper :)

Nothing in it helps with very small primorials or single calls to surround_primes. But it has a number of useful speedups for running many sequential surround_primes.

I did recently add a GPU tester to my repository that uses CGBN which I've found to be very useful for doing math on 256-2048 bit numbers. the CGBN repository had a pre-written miller-rabin implementation I made use, on my 1080ti it's much faster using all the 12 cores of my Ryzen 3900x

2021-09-08, 06:35   #25
CraigLo

Mar 2021

59 Posts

Quote:
 Originally Posted by MJansen I bought a laptop with a GPU (Geforce 1650 mobile, 1024 cuda cores, 16 SM's of 64 cores each) for testing.
Nice. I started with an old laptop that had a 940mx gpu (384 cores I think).

Quote:
 Originally Posted by SethTro I did recently add a GPU tester to my repository that uses CGBN which I've found to be very useful for doing math on 256-2048 bit numbers.
I haven't had a chance to look at the CGBN library yet. It wasn't available when I started a few years ago. One of the big improvements I made to my 65-bit code was a 65-bit multiplier. Instead of representing the 65-bit number as 3 32-bit numbers which requires 9 multiplies

Code:
(x1 x2 x3) * (y1 y2 y3) = x1y1 x1y2+x2y1 x1y3+x2y2+x3y1 x2y3+x3y2 x3y3
I use

Code:
(1 x2 x3) * (1 y2 y3) = 1 x2+y2 x2y2+x3+y3 x2y3+x3y2 x3y3
which requires 4 multiplies and 4 additions. For squaring, 6 multiplies becomes 3 multiplies. This gives me about a factor of 2 improvement. Do you know if CGBN does this or if it would be possible to add it?

2021-09-08, 17:13   #26
SethTro

"Seth"
Apr 2019

19×23 Posts

Quote:
 Originally Posted by CraigLo which requires 4 multiplies and 4 additions. For squaring, 6 multiplies becomes 3 multiplies. This gives me about a factor of 2 improvement. Do you know if CGBN does this or if it would be possible to add it?
CGBN doesn't have BITS+1 optimizations or and I doubt the author will want to add them. It's much more oriented around large input numbers.

2021-09-09, 07:37   #27
MJansen

Jan 2018

2·5·11 Posts

I have not been looking at the forum too regularly lately, lots of posts:

Quote:
 Originally Posted by danaj No, that is fall-back code if using big numbers without GMP. It works but it's both ugly and slow. I admit some of the C code mis ugly as well, and the whole thing kind of grew organically from a tiny project. ....
Thanks Dana, appreciate the answer and the pointers, will look into them!

@Dana/@Craig I have a question regarding the bignum library in C++ under windows, let me elaborate a little first:
The fastest library for bignum calculations is GMP and GMP is native under Unix/Linux, but not under windows. Perl has a GMP library incorporated in the strawberry core files. So under Perl it is relatively easy to call on GMP functions.

C++ itself does not seem to have a bignum library incorporated (that I could find, maybe I did not search in the right place?). @Craig: how did you solve this? What bignum library are you using in cuda C++?

If one would like to use GMP under C++ windows, google will show some links about MinGw and using GMP from that instance. Does anybody have any experience with setting that up? Help would be appreciated!

Kind regards
Michiel

 2021-09-09, 17:44 #28 CraigLo   Mar 2021 59 Posts I use GMP for my CPU code but I've only used linux. I have my laptop set up so I can boot into either linux or Windows. I'm not sure if that will work for you. For the GPU stuff I have written all my own bignum routines. It looks like CGBN is a good option now but it didn't exist when I started. I haven't tried it yet. There might be some instances where a specialized routine will be faster (e.g. 65-bit numbers).
2021-09-09, 17:49   #29
robert44444uk

Jun 2003
Suva, Fiji

37678 Posts

Quote:
 Originally Posted by MJansen I have not been looking at the forum too regularly lately, lots of posts: Thanks Dana, appreciate the answer and the pointers, will look into them! @Dana/@Craig I have a question regarding the bignum library in C++ under windows, let me elaborate a little first: The fastest library for bignum calculations is GMP and GMP is native under Unix/Linux, but not under windows. Perl has a GMP library incorporated in the strawberry core files. So under Perl it is relatively easy to call on GMP functions. C++ itself does not seem to have a bignum library incorporated (that I could find, maybe I did not search in the right place?). @Craig: how did you solve this? What bignum library are you using in cuda C++? If one would like to use GMP under C++ windows, google will show some links about MinGw and using GMP from that instance. Does anybody have any experience with setting that up? Help would be appreciated! Kind regards Michiel
I made the migration to Linux under Windows 10 operating system using Ubuntu 20.04 LTS. I don't really understand what I'm doing, but with help from others I can now run Linux codes.

2021-09-09, 21:14   #30
MJansen

Jan 2018

2×5×11 Posts

Thnx Graig, I was wondering, after installing Strawberry Perl, a gmp.h file is installed at the laptop in the directory C:\Strawberry\c\include. The C++ code seems to accept a reference to gmp.h (include gmp.h). Could this work? I will try later with some test code, but what are your thoughts? Worth a try? Or does the GPU not work welll with GMP in your experience?

I spotted this thread online: https://stackoverflow.com/questions/...ng-gmp-library

If GMP and Cuda do not mix, I am interested in your homebrew code ;-)

I found some references to other libraries for GPU:
CUMP https://github.com/skystar0227/CUMP
XMP https://github.com/NVlabs/xmp
Campary https://homepages.laas.fr/mmjoldes/campary/

Any thoughts on those?

And an article regarding GMP implementation:
http://individual.utoronto.ca/haojun...724_Report.pdf

I get a feeling programming for a GPU will require a lot of programming and research ...

Kind regards
Michiel

Quote:
 Originally Posted by CraigLo I use GMP for my CPU code but I've only used linux. I have my laptop set up so I can boot into either linux or Windows. I'm not sure if that will work for you. For the GPU stuff I have written all my own bignum routines. It looks like CGBN is a good option now but it didn't exist when I started. I haven't tried it yet. There might be some instances where a specialized routine will be faster (e.g. 65-bit numbers).

 2021-09-10, 06:39 #31 CraigLo   Mar 2021 59 Posts I think you will be better off with a library that is designed to work well with the GPU architecture instead of trying to run GMP in the GPU. CUMP looks like it is for floating point numbers. XMP looks like an old version of CGBN (XMP 2.0) Campary doesn't have any documentation that I could find. I would start with CGBN. It is written by NVIDIA and well tested. It will handle larger numbers than my code at the moment and it is a more complete library. You will probably still want GMP for your CPU code. I haven't used MinGW or run GMP/CUDA on Windows so I can't help there.
2021-09-10, 06:47   #32
CraigLo

Mar 2021

59 Posts

Quote:
 Originally Posted by SethTro CGBN doesn't have BITS+1 optimizations or and I doubt the author will want to add them. It's much more oriented around large input numbers.
I was reading through some of the CGBN documentation.

It says CGBN currently requires 4, 8, 16 or 32 thread per CGBN group. What is a CGBN group? Is it a single bignum?

Also, the size must be evenly divisible by 32. Does this mean it won't work for a 65 bit number? Or can you treat it as a 96 bit number with leading 0s and everything will work correctly?

2021-09-10, 07:46   #33
SethTro

"Seth"
Apr 2019

6658 Posts

Quote:
 Originally Posted by CraigLo It says CGBN currently requires 4, 8, 16 or 32 thread per CGBN group. What is a CGBN group? Is it a single bignum?
It's a single bignum (made up of 32 bit limbs)

Quote:
 Originally Posted by CraigLo Also, the size must be evenly divisible by 32. Does this mean it won't work for a 65 bit number? Or can you treat it as a 96 bit number with leading 0s and everything will work correctly?
IMO CGBN isn't going to be of much use for 65 bit numbers, maybe 96 and defiantly 128 bit numbers but not 65 bits.

 Similar Threads Thread Thread Starter Forum Replies Last Post Dale Mahalko Software 7 2015-03-21 19:55 hhh Math 7 2014-03-18 14:37 Trilo Riesel Prime Search 33 2013-09-19 21:21 Mini-Geek Twin Prime Search 8 2011-01-30 00:18 ltd Prime Sierpinski Project 3 2006-03-23 19:27

All times are UTC. The time now is 04:19.

Wed Aug 10 04:19:41 UTC 2022 up 33 days, 23:07, 1 user, load averages: 1.60, 1.63, 1.45