mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Math

Reply
 
Thread Tools
Old 2016-10-12, 00:31   #1
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11×47 Posts
Default Musings of someone learning

I am working through grasping the FFT based multiplication process in enough detail to hopefully contribute some optimizations to i.e. clLucas. I expect to say a lot of incorrect or silly things as I try to wrap my head around this in detail, help is appreciated. For some reason my brain is struggling more with this than many programming optimization problems I've worked on in the past.


LL test basics = start with x=4, do x=(x^2 - 2) mod M for p-2 iterations, if x=zero you win


As I understand it, there are five steps here for the inner loop

b=log2(M)

n = FFT size, which is roughly b/(useful bits of a double).

1. Convert our X to a polynomial P. Zero pad so it has 2n coefficients. Set complex parts to zero What is the base of the polynomial?

2. Run an FFT on P. Because we are squaring we only need P because P=Q

3. Square the 2n resulting values. These are complex, so we are squaring the complex number? (a + bi)^2 = a^2 + 2abi + bi^2 ? And we do that for all 2n?

4. Run IFFT on the 2n values and get back coefficients in base B

5. Some values might be bigger than B, do carry, round to integers, and subtract 2 while we are at it. Mod M is easy because of the base? I've read we get the mod for free but I haven't exactly pinpointed where. Is this the step?

Rinse wash and repeat. Am I on the right track?

Bonus Question: If I wanted to cube a huge number, could I just cube the values in step 3?

Last fiddled with by airsquirrels on 2016-10-12 at 00:32
airsquirrels is offline   Reply With Quote
Old 2016-10-12, 01:30   #2
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

2·3·19·67 Posts
Default

You are on the right track. Very good.

As you might expect, there are optimizations available. The biggest is Richard Crandall's Irrational Base Discrete Weighted Transform. This saves us the padding with zeros (halving the size of an FFT) and gives us the modulo 2^P-1 for free.

The other biggie is Hermetian symmetry. When doing a forward FFT of n values, you can discard half of the complex values. That is, the end result of the FFT is (mostly) complex values of the form a + bi and a - bi -- no need to remember both complex values.

On the bonus question, yes you can cube the transformed values. Alas, the more arithmetic you do on transformed values, the fewer bits you can stuff in each FFT word at the beginning -- no free lunch here.
Prime95 is offline   Reply With Quote
Old 2016-10-13, 03:09   #3
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11·47 Posts
Default

So I've been reading Crandall's paper.... http://www.ams.org/journals/mcom/199...-1185244-1.pdf

Working through the example of section six, a simple squaring of 78314567209 mod (2^37-1) using an FFT of size 4....

1. Choose a digit representation... The example computes bit sizes {10, 9, 9, 9} which leads to digits {553, 93, 381, 291} and further weights {1, 2.75, 2.5, 2.25}. This part is all laid out in the paper and I follow.

2. Compute X = DWT(N, a)x and Y is the same as X because we are squaring. Based on the initial section, DWT(4, a)x is equal to a DFT(4)ax where ax is component wise multiplication with the signal weights a. This leads to {553 + 0i, 2730.75 + 0i, 952.5 + 0i, 654.75 + 0i} as the complex input values to the DFT, which I compute with the FFT.

I tried doing this with wolfram alpha, for reasons unknown. The output there is {2445.5+0. i, -199.75+1038. i, -940.+0. i, -199.75-1038. i}. The output from this FFT calculator (http://scistatcalc.blogspot.com/2013...alculator.html) is {4891 + 0i, -399.5 - 2076i, -1880 + 0i, -399.5 + 2076i}. I have no idea why we have the discrepancy.

Never the less, I continue...

3. Compute Z=XY (Z=XX in this case). I do this by squaring the complex signal values point-wise, using the second FFT result that gives me: Z={23921881 + 0i, -4150175.75 + 1658724i, 3534400 + 0i, -4150175.75 - 1658724i}

4. I compute the inverse DWT(N, a) of Z, which is a-1DFT-1(4)[Z].
DFT-1[Z] Yields {4788982.375,4267508.25,8939158.125,5926232.25} which I component wise multiply with my inverted weight signal a-1={1/1, 1/2.75, 1/2.5, 1/2.25} yielding z={4788982.375,1551821.1818...,3575663.25,5926232.25,2633881}. At this point my values do not match those shown in 6.14 (z={704383,324600,523365,463577} and so I have failed.

5. Rounding is trivial...

6. Perform carry and make it an integer again. Again trivial...

Where did I go wrong?
airsquirrels is offline   Reply With Quote
Old 2016-10-13, 03:58   #4
axn
 
axn's Avatar
 
Jun 2003

3×17×101 Posts
Default

Quote:
Originally Posted by airsquirrels View Post
and further weights {1, 2.75, 2.5, 2.25}.
The weights are {1, 23/4, 21/2, 21/4} = {1, 1.682, 1.414, 1.189}
axn is online now   Reply With Quote
Old 2016-10-13, 08:53   #5
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Cambridge (GMT/BST)

31·191 Posts
Default

If you haven't already read this thread read it http://mersenneforum.org/showthread.php?t=120
henryzz is offline   Reply With Quote
Old 2016-10-13, 12:43   #6
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11·47 Posts
Default

Quote:
Originally Posted by axn View Post
The weights are {1, 23/4, 21/2, 21/4} = {1, 1.682, 1.414, 1.189}
Thanks! I missed the superscript...

So that leave me {553., 156.407, 538.815, 346.059}

Forward DWT:
1594.281000
14.185000 + 189.652000i
589.349000
14.185000 - 189.652000i

Squaring:
2541731.906961
-35766.667 + 5380.4272 i
347332.243801
-35766.667 - 5380.4272 i

IFFT:
704382.704191
545909.702190
740149.371190
551290.129390

Inverse weights:
704382.704191
324599.851
523364.639459
463577.9

Rounding, digit recovery, carry propagation LSB first
704383 +
324600 * 2^10 +
523365 * 2^19 +
463578 * 2^28

That looks better! Thanks for the note on the weights.

Now to write some code....
airsquirrels is offline   Reply With Quote
Old 2016-10-13, 23:29   #7
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11×47 Posts
Default

So, now I have code that does work, however I'm confused looking at clFFT / cuFFT code. In all cases they actually initialize FFT sizes that are / 2.

Launching clLucas with a 4096K FFT actually uses a 2048K complex FFT internally. The same is true of CUDALucas

I found one forum post that alluded to mprime etc. actually counting the number of total double precision values vs. the number of complex elements? Is that accurate information?

Timings using a Fury X GPU at stock clocks.

Existing clLucas: 3.8-3.9 ms/Iteration using the 2048K Complex Double Precision SixStep FFT (4096K setting)

My code currently runs about 3.16 ms/Iteration using the same Complex Interleaved format and baking the complex squaring, rounding, and carry-save into the FFT kernels using clFFTs post and pre callbacks, although I may need to add some more overhead for the carry/base normalizing I don't expect that to add much more than 10% overhead. I used Andrew Thall's strategy regarding keeping the Carry/Save separate - http://www.ece.neu.edu/groups/nucar/...iles/thall.pdf.

clFFT also supports using REAL->HERMITIAN_INTERLEAVED transforms and the inverse, which would provide the afore-mentioned benefits of Hermitian symmetry and I believe I can perform the squaring directly in the HERMITIAN_INTERLEAVED space. Doing the transforms in this fashion at the "4096K" FFT size yields a timing of 2.52ms/iteration at 4096k on a Fury X, or about the same performance as a Titan Black using CUDALucas.
airsquirrels is offline   Reply With Quote
Old 2016-10-16, 00:24   #8
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

51710 Posts
Default

I am having some trouble understanding how to properly restore the digit representation in step 6 of Crandall's algorithm.


(Reminder that the problem is to compute 78314567209^2 modulo 2^37-1 )

For the example problem, I have Real result values (Call them vector X) of:

704383, 324600, 523365, and 463578

These can be computed into a sum based on each digit value

704383 * 2^0 + 324600 * 2^10 + 523365 * 2^19 + 463578 * 2^28 = 124715498905471

78314567209^2 = 6133171437132978049681

78314567209^2 mod 2^37-1 = 58368107274

124715498905471 mod 2^37-1 also = 58368107274

I was led to believe that restoring the digit values would give us the "mod 2^37 -1" for free, however when I compute the digit values of 58368107274 by splitting the bits =
778, 168, 224, 217 I do not see how those values are obtained from X.

I attempted to add and carry the MSB from X and computed instead 895, 167, 194, and 217. Since there are no operations performed on the least significant 10 bits I fail to see how I get 778 instead of 895. Taking 124715498905471 & 1023 = 895. Crandall's section 6.14 just gives a vague "Adjustment for proper constraints on digit sizes are to follow..." however I do not find those adjustments in the paper.

Any direction is appreciated.

Edit: Is this where I've gone wrong? http://mersenneforum.org/showpost.ph...9&postcount=59

So I should take 704383 / (2^10) -> 687 Carry and 704383 % (2^10) = my 895. Add 687 to 324600 and continue with 2^9th. Answering my own questions...

Last fiddled with by airsquirrels on 2016-10-16 at 00:40
airsquirrels is offline   Reply With Quote
Old 2016-10-16, 01:26   #9
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11·47 Posts
Default

Quote:
Originally Posted by airsquirrels View Post
Edit: Is this where I've gone wrong? http://mersenneforum.org/showpost.ph...9&postcount=59

So I should take 704383 / (2^10) -> 687 Carry and 704383 % (2^10) = my 895. Add 687 to 324600 and continue with 2^9th. Answering my own questions...
Sorry for so many replies to myself. So my hand-carry method is the same as what ewmayer posted in that previous thread, however I am still getting 895 for the first 10 bit word, instead of the 778 that I am looking for. What am I missing now?
airsquirrels is offline   Reply With Quote
Old 2016-10-16, 16:30   #10
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11·47 Posts
Default

I will keep journalling my progress to hopefully help someone else trying to learn and understand this later.

For anyone trying to understand LL testing, I highly recommend taking a look at the gpuLucas code. It is very clean and well commented and lacks much of the deep complexity/programmed by a mathematition effects of CUDALucas, and since the FFT is handled by CUDA much of the intense complexity of prime95 is not present.

https://github.com/Almajester/gpuLucas

What I critically missed in the papers was that the carry needs to be treated cyclically in order to preserve the mod 2^37-1.

In this case my values of 895, 167, 194, and 217 from the example problem are correct, but there is a 5th carry value of (905 in the results + 2 from previous carries.)

This leads to 895 + 907 = 1802. 1802 mod 1024 = 778 with a carry of 1, bringing my 167 to 168.
airsquirrels is offline   Reply With Quote
Old 2016-10-18, 13:03   #11
airsquirrels
 
airsquirrels's Avatar
 
"David"
Jul 2015
Ohio

11×47 Posts
Default

Can anyone help me understand or point me to a paper on how you would perform IBDWT stuffing your words into both the real and imaginary components of a complex FFT? I've tried searching for this but perhaps I am not using the right vocabulary to describe it.

Both CUDALucas and clLucas seem to employee this technique, stemming from their origins as MacLucasFFTW. This is allowing them to use 1/2 the FFT size that is actually specified.

Crandall's paper makes a mention of this option without details but recommends the Real->Hemitian solution instead

Last fiddled with by airsquirrels on 2016-10-18 at 13:04
airsquirrels is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Learning languages is the new fashion... LaurV LaurV 3 2019-11-05 10:25
Online language-learning course kladner Lounge 8 2013-04-18 03:08
Learning Python - Course from Google Jeff Gilchrist Programming 3 2012-01-15 00:29
flowcharts, self-learning jasong jasong 6 2007-12-07 14:06
Learning About RAM the Hard Way Longshot Hardware 5 2005-05-21 16:40

All times are UTC. The time now is 02:32.


Mon Oct 25 02:32:26 UTC 2021 up 93 days, 21:01, 0 users, load averages: 1.29, 1.33, 1.41

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.