mersenneforum.org  

Old 2020-06-21, 03:40   #1
makis
 
How is something like prime95 developed?

I have a question regarding the methodology behind developing high-performance code such as prime95 and mprime. As far as I understand, mprime uses GMP, the GNU Multiple Precision library, which can handle arbitrary-precision arithmetic. Just out of curiosity, I've written a trivially simple implementation of the LL test in C++ using GMP as well, since I am interested in the project and in high-performance code generally. Obviously my implementation was terrible compared to prime95, but that was to be expected.

My question basically boils down to this: how does one start from a simple toy example like my implementation and move on to develop the highly optimized prime95 or mprime code? What are the high-level steps involved? I know that optimized x86 assembly code is used, but where? How does one start? Is there a resource that explains good practices, or generally the methodology one should use to approach these problems?

I am not asking which parts should be optimized, since that depends on the specific problem at hand. What I am asking is: assuming the parts to be optimized are known (i.e. a specific loop or operation), how does one go about doing it? Were modifications to the GMP library necessary? Was the FFT implemented as an add-on to the GMP library, or from scratch? If from scratch, how are arbitrarily large numbers handled? Was that implemented from scratch as well?
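For reference, here is roughly what such a toy GMP-based Lucas-Lehmer test looks like (a minimal sketch for illustration, not the poster's actual code; the function names are made up):

Code:
// Toy Lucas-Lehmer test with GMP's C++ interface. Build with:
//   g++ -O2 ll.cpp -lgmpxx -lgmp
// The modular reduction here is a plain division; prime95 instead folds
// the reduction into an FFT-based weighted squaring, which is where all
// the cleverness (and assembly) lives.
#include <gmpxx.h>
#include <iostream>

bool lucas_lehmer(unsigned long p) {
    if (p == 2) return true;                  // M2 = 3 is prime
    mpz_class M = (mpz_class(1) << p) - 1;    // Mersenne number 2^p - 1
    mpz_class s = 4;
    for (unsigned long i = 0; i < p - 2; ++i)
        s = (s * s - 2) % M;                  // essentially all the time goes here
    return s == 0;
}

int main() {
    const unsigned long ps[] = {3, 5, 7, 11, 13, 17, 19, 23, 31};
    for (unsigned long p : ps)
        std::cout << "M" << p << (lucas_lehmer(p) ? " is prime\n" : " is composite\n");
}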
Old 2020-06-21, 07:52   #2
M344587487
 

AFAIK GMP is only used in non-core functions and checks (or is that Mlucas? P95 may not use it); the magic is done in hand-coded assembly. Learning ASM from scratch is a big undertaking: you will end up with less hair than you started with, and probably a drinking problem. A more accessible approach is to use intrinsics; I'd at least start there to get your feet wet ( https://software.intel.com/sites/lan...rinsicsGuide/# ). Prime95 uses gwnum, a custom library built for the purpose; other software can use it for related tasks. Check out the source: https://www.mersenne.org/download/
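For a concrete (and deliberately trivial) taste of the intrinsics style suggested above, something like the following; this is only an illustration of what intrinsics look like, not anything gwnum actually does:

Code:
// Add two arrays of doubles four at a time with AVX intrinsics.
// Build with e.g. g++ -O2 -mavx example.cpp
#include <immintrin.h>
#include <cstddef>

void add_doubles_avx(const double* a, const double* b, double* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);               // load 4 doubles (unaligned is fine)
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb));  // 4 additions in one instruction
    }
    for (; i < n; ++i)                                     // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}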
Old 2020-06-21, 08:45   #3
makis
 

Thank you for the info. So as far as I understand the high level steps are:
  1. Write Bignum Library in C
  2. Implement FFT within this library
  3. Optimize mission critical parts of the code in assembly
  4. Use AVX (??)

Am I getting this right?
Old 2020-06-21, 08:52   #4
retina

Quote:
Originally Posted by makis
Thank you for the info. So as far as I understand the high level steps are:
  1. Write Bignum Library in C
  2. Implement FFT within this library
  3. Optimize mission critical parts of the code in assembly
  4. Use AVX (??)

Am I getting this right?
No.

If you want fast code then forget about any high level languages.

Use assembly for all the heavy lifting. Don't believe all the scare tactics by the HLL people about how hard assembly is. It isn't hard at all, it just "helps" you less and makes you code all the steps. Make sure you avoid the deplorable AT&T syntax, and the ugly intrinsics, and everything will be quite clear.

Old 2020-06-21, 08:54   #5
axn
 

Quote:
Originally Posted by makis
Thank you for the info. So as far as I understand the high level steps are:
  1. Write Bignum Library in C
  2. Implement FFT within this library
  3. Optimize mission critical parts of the code in assembly
  4. Use AVX (??)

Am I getting this right?
Not quite.
For the purpose of primality testing, the only "bignum" operation you need is multiplication.
For the fastest multiplication, you need an FFT using double-precision numbers. Your all-integer bignum multiplication isn't going to cut it.
For Mersenne numbers, on top of the FFT, you implement another algorithm called the Irrational Base Discrete Weighted Transform (IBDWT).
And since FFT is the "mission critical" part, you do that in assembly for the highest performance.
And then you use specific instruction sets like SSE2, AVX, FMA, AVX-512, etc. on supported processors to gain even more performance.

But most of the gain is in implementing the IBDWT. You can do that in plain C as well. Or even Java.
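As an aside on why the multiplication really is the only expensive step: for M = 2^p - 1 the reduction itself is nearly free, because 2^p ≡ 1 (mod M), so the high half of a product simply folds back onto the low half. A small GMP sketch of that classical reduction (this is not the IBDWT, which builds the reduction into the transform itself; the names here are illustrative):

Code:
// Reduce t modulo M = 2^p - 1 without any division, assuming t < M^2
// (e.g. t is the square computed in one LL iteration).
#include <gmpxx.h>

mpz_class mersenne_mod(const mpz_class& t, unsigned long p, const mpz_class& M) {
    mpz_class r = (t >> p) + (t & M);   // fold the bits above position p onto the low p bits
    while (r >= M) r -= M;              // at most two small corrections
    return r;
}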
Old 2020-06-21, 09:04   #6
retina

And if you want world class code, instead of just fast code, then you need to incorporate good management of the TLBs and caches and similar things to ensure good data flow between the CPU and RAM. Because CPUs can't process data while it is still in the RAM.
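A standard textbook illustration of that kind of cache awareness (nothing to do with prime95's actual code) is loop tiling, where you arrange the work so the data being touched fits in cache:

Code:
// Transpose an n x n matrix in TILE x TILE blocks. The naive version
// strides through memory with step n and misses cache on nearly every
// access once n is large; the tiled version keeps each block hot.
#include <cstddef>

constexpr std::size_t TILE = 64;   // tuning parameter, not a magic value

void transpose_tiled(const double* in, double* out, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += TILE)
        for (std::size_t jj = 0; jj < n; jj += TILE)
            for (std::size_t i = ii; i < ii + TILE && i < n; ++i)
                for (std::size_t j = jj; j < jj + TILE && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}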
Old 2020-06-21, 09:06   #7
makis
 

Quote:
Originally Posted by retina
No.

If you want fast code then forget about any high level languages.

Use assembly for all the heavy lifting. Don't believe all the scare tactics by the HLL people about how hard assembly is. It isn't hard at all, it just "helps" you less and makes you code all the steps. Make sure you avoid the deplorable AT&T syntax, and the ugly intrinsics, and everything will be quite clear.
I am slightly confused. I received two partially contradictory responses. Both you and axn agree that assembly should be used for at least the heavy-lifting part of the code. But what about intrinsics? Why should I not use them? Thanks again for your responses!
Old 2020-06-21, 09:07   #8
makis
 

I also wanted to ask. Let's say that I won't use any assembly. How much am I leaving on the table in terms of performance? Is that even quantifiable?
Old 2020-06-21, 09:08   #9
retina

Quote:
Originally Posted by makis
I am slightly confused. I received two partially contradictory responses. Both you and axn agree that assembly should be used for at least the heavy-lifting part of the code. But what about intrinsics? Why should I not use them? Thanks again for your responses!
Intrinsics ARE assembly, but in a terribly ugly form designed to meld with HLL code. If you want your code to be readable just use pure assembly.

Old 2020-06-21, 12:33   #10
xilman

Responses so far have all dived in at the deep end, stepping over the first three stages of optimization.

0) First get it right, then get it fast. That is, develop a gold-standard implementation which guarantees to give correct answers. Subsequent code must then agree with those answers. It sounds as if you have already started on this.

1) Make sure you are using fast algorithms. This will take some literature search, an activity worthwhile in its own right. A standard example is to choose quicksort instead of bubble sort if the data to be sorted is large. Sorting may not be important for the case in hand, but the principle surely is. You have made a start on this one too, but there will surely be more to learn.

2) Instrument your code to find out where it is spending all of its time. There is absolutely no point in tripling the speed of a segment which is using 0.1% of the time. You can get a much bigger win increasing by 50% the speed of code which takes 70% of the time.
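For a quick back-of-the-envelope check of that 0.1%-versus-70% point, Amdahl's law gives the overall speedup when a fraction f of the runtime is made s times faster (the percentages are from the post above; the code is only an illustration):

Code:
// Overall speedup = 1 / ((1 - f) + f / s)   (Amdahl's law)
#include <cstdio>

double overall_speedup(double f, double s) { return 1.0 / ((1.0 - f) + f / s); }

int main() {
    std::printf("3x faster on 0.1%% of the time : %.4fx overall\n", overall_speedup(0.001, 3.0));
    std::printf("1.5x faster on 70%% of the time: %.4fx overall\n", overall_speedup(0.70, 1.5));
    // prints roughly 1.0007x and 1.3043x respectively
}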

Good luck, and keep at it!

Old 2020-06-21, 13:44   #11
a1call
 

I find this a very informative thread and the posters' responses extremely valuable. However, even the most efficient programming on a PC carries the overhead of a general-purpose architecture. That will always fall short of an architecture with a dedicated hardware module for modular exponentiation of large-enough (rather than arbitrarily large) numbers, which is the bottleneck operation despite all the clever algorithms. Develop a module that performs this operation in a greatly expedited manner (which I believe is possible) and you no longer need to worry about the rest of the hardware/software used; you will outperform any highly efficient general-purpose supercomputer.
The bottom line is that addressing the single most time-consuming operation is more effective than all the other software/hardware optimizations (since, even at its best, it accounts for weeks of delay at current number ranges).
Just my 2 cents worth.
