mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Reply
 
Thread Tools
Old 2004-04-21, 20:20   #1
wfgarnett3
 
wfgarnett3's Avatar
 
"William Garnett III"
Oct 2002
Bensalem, PA

10101102 Posts
Default improvement needed in SSE2 for Pentium 4 for non-IBDWT case for LLR/PRP

Hello,

This is a pretty long post, but I wanted to bring this up to try to
get some SSE2 programs to fix LLR/PRP, if possible. This "problem" has persisted for years and
if it can be fixed I wanted to get some people working on it. To help explain what I mean,
let me first show some benchmarks.

The following were run on a computer in a computer lab, specifically, a Dell Pentium 4 2.8Ghz, 800Mhz system bus,
dual channel DDR 333 memory (I think) with Hyperthreading disabled.

For the Mersenne number, 2^7137769-1, the latest version of Prime95
in natural SSE2 mode yields a per iteration time of:
14.085ms
For Prime95 with SSE2 disabled, it yields
42.129ms

For the Riesel number 3*2^7137768-1, using the latest version of LLR which takes
advantage of IBDWT for small k, first for SSE2 enabled:
[Mon Apr 19 02:33:05 2004]
Using IBDWT : Mersenne fftlen = 393216, Proth fftlen = 786432, Used fftlen = 458752
19.087ms
For SSE2 disabled:
[Mon Apr 19 02:49:03 2004]
Using IBDWT : Mersenne fftlen = 393216, Proth fftlen = 786432, Used fftlen = 458752
50.364ms

For the Riesel number 513*2^7137768-1 in regular mode (no IBDWT),
For SSE2 enabled:
67.562ms
For SSE2 disabled:
105.331

For the Proth number 513*2^7137768+1, using LLR (it uses PRP mode) (no IBDWT):
with SSE2
67.482ms
without SSE2
105.351ms

Finally, on my personal computer, a Pentium III 600 Mhz (133 mhz system bus, PC-133) (which of course
doesn't have SSE2):
For the Riesel number 513*2^7137768-1 (no IBDWT).
284.437ms
For Mersenne number 2^7137769-1 using Prime95:
109.612ms
For 3*2^7137768-1 using LLR (IBDWT):
128.921ms

Few aside notes:
I am assuming Jean is going to implement the IBDWT for the Proth case too??? and not limit it to Riesel case?
This is all based on Colin's algorithm:
http://www.ams.org/mcom/2003-72-241/...19-9/home.html
Thank you Jean for implementing this algorithm!! Excellent work so far!! No SSE2 problems here for IBWDT (since modulo step
done for free :) )

OK, back to the main focus. Let's look at the benchmarks. It is known that Mersenne's are roughly
3 times faster than Riesels/Proths(non-IBDWT). Let's have a look at Pentium 3:
Mersenne: 109.612ms
Riesel: 284.437ms
Ratio: 2.59

No problems here. Let's look at Pentium 4 SSE2 disabed mode(non-IBDWT):
Mersenne:42.129ms
Riesel:105.331
Ratio: 2.50

No problems. So we see with non SSE2 processor like Pentium 3, and SSE2 disabled mode for P4,
everything happens as planned.

Now let's look at SSE2. Remember when Prime95 got SSE2 optimized George said it was 3 times faster:
Mersenne(SSE2):14.082ms
Mersenne(non-SSE2):42.129ms
ratio:2.99

Yeap that's what happened. Let's look even at this new IBDWT mode Jean included in LLR for small k:
SSE2 Riesel IBDWT: 19.087ms
non-IBDWT Riesel: 50.364ms
ratio: 2.64

That's happening to plan too. Makes sense, since this "Colin algorithm" uses IBDWT like Prime95 does
for Mersenne's so iteration times should be close. Nice job for a first release. I imagine they will be
improving even more as time goes along. These numbers get the modulo step done for free.

OK, now for the bad news. Using LLR, and k > 512, Proth's and Riesel's are roughly same time.
Riesel's do a deterministic primality test for k*2^n-1, while PRP does a probable prime test for
+1. I ran both benchmarks in LLR, since LLR automatically defaults to PRP mode for +1. Here are the times:

SSE2 (no IBDWT): 67.482ms
SSE2 disabled (no IBDWT):105.351ms
ratio:1.56 OUCH!!!

This has been like this for years. This is the one problem that exists. It is that only for the SSE2 processor Pentium 4, LLR/PRP
is poorly optimized for the non IBDWT case.

So my request to any SSE2 experts out there, can this be fixed? I know very little about SSE2 so don't
know about feasibility of fixing this problem. Many of us are running projects
that test Proth and Riesel numbers with big k, and we are not taking full advantage of the SSE2 instructions in the Pentium 4 like Prime95 is.

Here are some old emails I have that may shed light on helping you guys (I hope these guys don't
get upset for me cutting and pasting conversations):

George Woltman:
Date: Mon, 11 Mar 2002 17:42:47 -0500
To: "William F. Garnett III" <vmrf@grove.iup.edu>
"The P4 speedup in prp is not as dramatic as the P4 speedup in prime95.
This is because the mod k*2^n-1 step is not SSE2 optimized. Thus expect only
a 2x speedup instead of a 3x speedup."

Date: Tue, 12 Mar 2002 10:05:45 -0500
To: "William F. Garnett III" <vmrf@grove.iup.edu>
"Thanks.


Just one more thing. For Proth primes (k*2^n+1), is PRP2 partially SSE2
optimized, or not at all?


The multiplication is SSE2 optimized.


I know below you said the mod step is not SSE2
optimized. Is this as far as PRP2 will probably be SSE2 optimized, or will
you maybe be able to optimize it further?


Its not high on my priority list. I'm really hoping the PFGW project puts
PRP out of business. Maybe someone in that group can an write SSE2
optimized modulo routine.

Regards,
George
"
"
From: George Woltman <woltman@alum.mit.edu>
Subject: Re: SSE2
Date: Sat, 16 Mar 2002 22:24:51 -0500
To: "William F. Garnett III" <vmrf@grove.iup.edu>



Hi,

At 09:43 PM 3/16/2002 -0500, you wrote:

Just some questions about PRP2 and SSE2. I know you said multiplication is
SSE2 optimized but not modulo routine. You mentioned you hopefully want PFGW
to put PRP2 out of business. Anyway, I just like to know, would it be hard to
optimized the modulo routine for SSE2?


It would be somewhat difficult. The multiplication routines use a funny
memory layout for good cache utilitization. This layout makes a good
modulo implementation hard. The good news is the routine is quite localized.


Are the prime95 and PRP2 modulo
routines similar, or is somehow the modulo routine harder to do for PRP2 and
thus we get only the 2x improvement instead of 3x?


Prime95 does not do a modulo. It happens for free thanks to Richard Crandall
irrational discrete weighted transform.

Regards,
George

"

"
From: "Brian J. Beesley" <xx@xx.net>
Subject: Re: benchmarks for Proth primes
Date: Tue, 09 Apr 2002 19:30:33 +0000
To: "William F. Garnett III" <vmrf@grove.iup.edu>, woltman@alum.mit.edu
Cc: vmrf@iup.edu

Why is there only a minor speedup, unlike Prime
> 95, which gets a HUGE boost with regards to SSE2.

Without bothering to check the code, I suspect that the modulo operation
isn't taking advantage of SSE2 (maybe there's no easy way to do this) and
ends up dominating the execution time on a P4 system.

There may be a workaround. If the convolution is broken down into a^2 (mod p)
and a^2 (mod q), where p and q are different prime-exponent Mersenne numbers
(therefore coprime) and p.q > k.2^n+1, then you can use "free modulo" code
and recover the final result using the Chinese Remainder Theorem. If this
enables full use to be made of SSE2, this method might well be superior on a
P4, even if it's not efficient on a processor which doesn't have SSE2 logic.

BTW George could you please put up a copy of "PRP v2.1" (the one that reports
final residuals) for linux? The version on ftp://www.mersenne.org/gimps still
doesn't log residuals.

Regards
Brian Beesley
"


From: George Woltman <woltman@alum.mit.edu>
Subject: Re: benchmarks for Proth primes
Date: Tue, 09 Apr 2002 17:22:08 -0400
My P4 shows about 3.2 ms per iteration vs. 5.8 ms per iteration if I make
PRP not use the SSE2 code. That is a big gain although not 2x.

I created a special prp that did not time the proth mod step. This ran at
1.7 ms per iteration using SSE2 and 4.8 ms for non-SSE2 code.
This means the multiply step did get nearly 3 times faster (4.8 vs. 1.7)
but the proth mod step got slower (1.5 vs. 1.0). The SSE2 code uses
a different memory layout that is apparently less cache friendly in the
proth mod step.

Hope that helps,
George
"

Mon, 11 Nov 2002 18:25:17 -0500
No.


Reason I'm asking is for searches like mine (searching for the 5th largest prime), every bit of SSE2 optimization can help, and we can really use it. PFGW is the same speed as your PRP, so they don't have any extra SSE2 optimization I suppose. Let me know if you are optimizing the mod step.


The mod step actually uses SSE2 instructions. Its problem is that it accesses
cache lines poorly. It uses only 16 bytes of each 128-byte cache line resulting
in main memory being read 8 times more often than the ideal. This all stems
from the memory layout that was chosen to make the Mersenne multiplication
as fast as possible. This same layout is poor for the mod k*2^n-1 step.

Is there a memory layout that could satisfy both goals? Maybe, but even if it
exists it means a major rewrite of all the multiplication code. Is there a different
way to do the mod step that is more memory efficient? It is possible. For
example it might be faster to transpose memory, do the mod, transpose back
faster than the current scheme.

Unfortunately, it is a low priority item right now.

Regards,
George
"


That's basically it. Is there anyway guys to improve the mod step for proths and riesels using sSE2 (non-IBDWT)? Any speed improvement would
help us guys running projects that are not using small ks. I am not knowledgeable in SSE2 so don't know if it's feasible. If it is, I would like some
people to work on fixing it if possible and doesn't interfere with any other obligations you have.

Thanks for all the help everyone and for all the contributions!!

regards,
william garnett
wfgarnett3 is offline   Reply With Quote
Old 2004-04-22, 03:40   #2
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

167268 Posts
Default

Your best hope is Jean Penne. He is the first person to successfully wade in and improve the assembly code. Very impressive.

I now know how to fix the problem without changing the memory model. It can be done in the same routines that Jean has already changed.
Prime95 is offline   Reply With Quote
Old 2004-06-29, 00:07   #3
Citrix
 
Citrix's Avatar
 
Jun 2003

1,583 Posts
Default

Any chance for improvement on P4 HT's. Under win XP, one processor is completely doing nothing, just sitting idle.

can't we make it do anything?

Citrix
Citrix is offline   Reply With Quote
Old 2004-06-29, 03:03   #4
ColdFury
 
ColdFury's Avatar
 
Aug 2002

26×5 Posts
Default

Quote:
Originally Posted by Citrix
Any chance for improvement on P4 HT's. Under win XP, one processor is completely doing nothing, just sitting idle.

can't we make it do anything?

Citrix
It's still just one processor. Prime95 is maximizing the execution units of the processor, so adding a second thread wouldn't add anything.
ColdFury is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
performance improvement with assembly bsquared Software 15 2010-09-28 19:00
Possible improvement of quadratic sieve Random Poster Factoring 4 2010-02-12 03:09
Pentium 90 // Pentium ][ 400 years ValerieVonck Programming 4 2006-12-12 17:06
Pentium 4 not using SSE2 with Windows NT4 Unregistered Software 10 2004-05-20 12:14
Possible idea for improvement Kevin Software 1 2002-08-26 07:57

All times are UTC. The time now is 12:44.


Tue Oct 26 12:44:16 UTC 2021 up 95 days, 7:13, 0 users, load averages: 2.86, 2.96, 2.82

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.