mersenneforum.org  

Old 2014-09-20, 00:10   #23
ewmayer
2ω=0
 
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by ldesnogu View Post
It looks like roundpd is an SSE4.1 instruction which your Opteron 6124 doesn't seem to support (it's not part of SSE4a; see Wikipedia). I guess Ernst will have to explain why he claims that Mlucas is an SSE2 program.
It's been too long since I visited here, but note that the use of roundpd has been purged from the Mersenne-mod carry macros in "SSE2" build mode in all recent releases (the Fermat-mod macros still use it, but I'm the only one using those), and this will remain so. AVX mode of course makes free use of vroundpd, since there is no "which flavor of AVX do you have?" issue w.r.t. that instruction.
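For readers wondering how a carry step can round to the nearest integer on plain SSE2 without roundpd: one widely used approach is to add and then subtract a large "magic" constant, so that the hardware's default round-to-nearest-even mode does the rounding. The sketch below illustrates that trick in Python, whose floats are IEEE-754 doubles. It is an assumption that the Mlucas carry macros do something along these lines, since the post does not say how roundpd was replaced. The names RND_MAGIC, round_via_magic and roundoff_error are made up for illustration.

```python
# Sketch only: round-to-nearest without a dedicated rounding instruction.
# Assumption: IEEE-754 doubles in the default round-to-nearest-even mode.
RND_MAGIC = 1.5 * 2.0**52  # hypothetical name; doubles near this value are spaced exactly 1 apart


def round_via_magic(x: float) -> float:
    """Round x to the nearest integer (ties go to even); valid for |x| < 2**51."""
    # The addition forces the fractional bits of x to be rounded away,
    # and the subtraction recovers the rounded value exactly.
    return (x + RND_MAGIC) - RND_MAGIC


def roundoff_error(x: float) -> float:
    """Distance from x to its nearest integer."""
    return abs(x - round_via_magic(x))


if __name__ == "__main__":
    for x in (2.5, 3.5, -1.25, 1234567.75):
        print(x, round_via_magic(x), roundoff_error(x))
```

The appeal of this approach on SSE2 is that it needs only an add and a subtract, both of which exist in packed-double form on every SSE2-capable CPU.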
Old 2014-12-13, 05:05   #24
ewmayer
2ω=0
 
Sep 2002
República de California

10110011111110₂ Posts

V14.1 is available - details via the readme-file link in the opening post.
Old 2014-12-13, 06:02   #25
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

2220₁₆ Posts

How does the newer version compare with P95? I mean, I have read your "less than two times slower" stuff there, but I assume that is a figure of speech...

(hey, I am the guy who DC-ed Mike's work, remember?)
Old 2014-12-13, 07:39   #26
ewmayer
2ω=0
 
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by LaurV View Post
How does the newer version compare with P95? I mean, I have read your "less than two times slower" stuff there, but I assume that is a figure of speech...
Here are 4-thread per-iteration timings on my Haswell 4670K/ddr3-2400, all running at stock. These are all ~2% pessimistic due to startup/shutdown time (e.g. I get 8.2 msec/iter running @ 3072K in production mode):
Code:
FFT(K)	msec/iter (4-threaded)
----	---------
1024	 2.65
1152	 3.15
1280	 3.43
1408	 4.01
1536	 4.19
1664	 4.61
1792	 4.81
1920	 5.29
2048	 5.35
2304	 6.07
2560	 6.51
2816	 7.54
3072	 8.40
3328	 8.74
3584	 9.13
3840	10.16
4096	10.54
4608	11.98
5120	13.80
5632	15.92
6144	17.54
6656	18.62
7168	19.69
7680	22.00
Old 2014-12-13, 12:32   #27
ldesnogu
 
Jan 2008
France

2⁴×3×11 Posts

For comparison, http://mersenneforum.org/showpost.ph...&postcount=633
i5-4670K @ 3.8 GHz, Dual DDR3 1600
Code:
Best time for 1024K FFT length: 1.336 ms., avg: 1.374 ms. 
Best time for 1280K FFT length: 1.839 ms., avg: 1.865 ms. 
Best time for 1536K FFT length: 2.333 ms., avg: 2.370 ms. 
Best time for 1792K FFT length: 2.833 ms., avg: 3.277 ms. 
Best time for 2048K FFT length: 3.350 ms., avg: 3.374 ms. 
Best time for 2560K FFT length: 4.239 ms., avg: 4.276 ms. 
Best time for 3072K FFT length: 5.124 ms., avg: 5.155 ms. 
Best time for 3584K FFT length: 6.006 ms., avg: 6.042 ms. 
Best time for 4096K FFT length: 6.970 ms., avg: 7.000 ms. 
Best time for 5120K FFT length: 8.705 ms., avg: 8.745 ms. 
Best time for 6144K FFT length: 10.496 ms., avg: 10.543 ms. 
Best time for 7168K FFT length: 12.371 ms., avg: 12.451 ms. 
Best time for 8192K FFT length: 14.673 ms., avg: 14.735 ms.
Nice result for Mlucas, congratulations :)
Old 2014-12-13, 22:12   #28
ewmayer
2ω=0
 
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by ldesnogu View Post
For comparison, http://mersenneforum.org/showpost.ph...&postcount=633
i5-4670K @ 3.8 GHz, Dual DDR3 1600
Nice result for Mlucas, congratulations :)
Thanks - a lot of work went into that "within a factor of 2x". My system runs @3.3GHz (slower than above) but with ddr3-2400 (faster), so not sure how those 2 differences net out. I've been using George's early Haswell results as my guide, since we bought identical hardware (Mobo, CPU, RAM) and those timings were before George OCed his system. I apply a 15% reduction to his timings, since he says that's roughly what he gained from use of FMA.

BTW, if anyone has access to a Broadwell system running Linux (or MinGW64 under Windoze), I'd very much appreciate timings on such, and have some special preprocessor-flags-to-try-for-Broadwell, as well.
Old 2014-12-14, 11:03   #29
ldesnogu
 
Jan 2008
France

528₁₀ Posts

Quote:
Originally Posted by ewmayer View Post
My system runs @3.3GHz (slower than above) but with ddr3-2400 (faster), so not sure how those 2 differences net out.
Do you mean your system is underclocked? The 4670K is supposed to run at a base clock of 3.4 GHz with turbo at 3.8 GHz (I supposed the benchmark poster just stated the turbo speed, which might be a wrong assumption...).

Quote:
I've been using George's early Haswell results as my guide, since we bought identical hardware (Mobo, CPU, RAM) and those timings were before George OCed his system. I apply a 15% reduction to his timings, since he says that's roughly what he gained from use of FMA.
Silly question: why don't you run the latest Prime95 benchmark on your system?
Old 2014-12-14, 12:05   #30
ldesnogu
 
Jan 2008
France

2⁴×3×11 Posts

I gave Mlucas a try on my i7-4770K.
Code:
gcc -c -Os -m64 -DUSE_AVX2 -DUSE_THREADS *.c
rm -f rng*.o util.o qfloat.o
gcc -c -O1 -m64 -DUSE_AVX2 -DUSE_THREADS rng*.c util.c qfloat.c
gcc -o Mlucas *.o -lm -lpthread -lrt
./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2
...
100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.274916295. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36     =          67881076658
Res mod 2^35 - 1 =          21674900403
Res mod 2^36 - 1 =          42893438228
The README page says this should be output:
Code:
This particular testcase should produce the following 100-iteration  residues,
with some platform-dependent variability in the roundoff  errors :

100 iterations of M3888509 with FFT length 196608 = 192 K
Res64: 71E61322CCFB396C. AvgMaxErr = 0.226967076. MaxErr = 0.281250000. Program: E3.0x
Res mod 2^36     =          12028950892
Res mod 2^35 - 1 =          29259839105
Res mod 2^36 - 1 =          50741070790
I guess the README should be updated.

How do you get an output similar to Prime95 benchmark?
Old 2014-12-17, 02:30   #31
ewmayer
2ω=0
 
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by ldesnogu View Post
I gave Mlucas a try on my i7-4770K.
Code:
gcc -c -Os -m64 -DUSE_AVX2 -DUSE_THREADS *.c
rm -f rng*.o util.o qfloat.o
gcc -c -O1 -m64 -DUSE_AVX2 -DUSE_THREADS rng*.c util.c qfloat.c
gcc -o Mlucas *.o -lm -lpthread -lrt
./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2
...
100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.274916295. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36     =          67881076658
Res mod 2^35 - 1 =          21674900403
Res mod 2^36 - 1 =          42893438228
The README page says this should be output:
Code:
This particular testcase should produce the following 100-iteration  residues,
with some platform-dependent variability in the roundoff  errors :

100 iterations of M3888509 with FFT length 196608 = 192 K
Res64: 71E61322CCFB396C. AvgMaxErr = 0.226967076. MaxErr = 0.281250000. Program: E3.0x
Res mod 2^36     =          12028950892
Res mod 2^35 - 1 =          29259839105
Res mod 2^36 - 1 =          50741070790
I guess the README should be updated.
Ah, good catch - if you look closely you see the 2 exponents are slightly different (3888517 is the next-larger prime above 3888509). I believe I must have changed the self-test exponent computation formula sometime in the last year or so to take p as the smallest prime >= the number given by my continuous-function max_p(FFT length) formula, rather than the largest prime <= same. If you force a non-default self-test p via

./Mlucas -m 3888509 -fftlen 192 -iters 100 -radset 0 -nthread 2

you will see the result indicated on the webpage (which I have since corrected). Thanks for the catch.
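To make the old and new selection rules concrete, here is a small Python sketch of the two choices: the largest prime at or below a limit versus the smallest prime at or above it. The actual max_p(FFT length) formula is not given in this thread, so the limit used below is a hypothetical stand-in, and the function names are invented for illustration rather than taken from the Mlucas source.

```python
# Sketch only: two ways to turn an exponent limit into a prime exponent.
def is_prime(n: int) -> bool:
    """Trial division by odd numbers; fast enough for exponents of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def smallest_prime_at_or_above(x: int) -> int:
    """The newer rule described above: round the limit up to a prime."""
    p = max(x, 2)
    while not is_prime(p):
        p += 1
    return p


def largest_prime_at_or_below(x: int) -> int:
    """The older rule described above: round the limit down to a prime."""
    p = x
    while p >= 2 and not is_prime(p):
        p -= 1
    if p < 2:
        raise ValueError("no prime at or below %d" % x)
    return p


if __name__ == "__main__":
    limit = 3888512  # hypothetical stand-in for the integer part of max_p(192K)
    print(largest_prime_at_or_below(limit), smallest_prime_at_or_above(limit))
```

Going by the two exponents quoted in this thread, any limit strictly between 3888509 and 3888517 should reproduce the old and new self-test exponents respectively.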

Quote:
How do you get an output similar to Prime95 benchmark?
George and I do our self-tests differently ... If you want best-FFT-params timings (as determined by these self-tests) for the range of FFT lengths relevant to current GIMPS 'wavefront' and DC work, pause any other CPU-heavy tasks on your system and run the 'medium' self-test range:

./Mlucas -s m -iters 1000

1000 iters gives cleaner timings (and better roundoff testing) than the "quick look" 100-iter tests. With no #threads specified the code will use all the physical cores on your system. The README page discusses all this stuff.
Old 2014-12-21, 07:34   #32
ewmayer
2ω=0
 
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by ldesnogu View Post
Do you mean your system is underclocked? The 4670K is supposed to run at a base clock of 3.4 GHz with turbo at 3.8 GHz (I supposed the benchmark poster just stated the turbo speed, which might be a wrong assumption...).
Ah, I mis-wrote - clock is indeed 3.40 GHz. Perusing the BIOS boot-menu info, I have Turbo Boost enabled (and Enhanced Turbo = [Auto], whatever that means). As I had not recently tried toggling Turbo Boost, I tried disabling it - the current Mlucas build runs at the same speed (within the usual noise-based error bars) that way, so it seems to make no difference for my code, at least on my setup.

Quote:
Silly question: why don't you run the latest Prime95 benchmark on your system?
Like you say, 'tis a silly question. :)

Here are the 4-threaded results for my Haswell system:

[Worker #1 Dec 19 16:21] Timing FFTs using 4 threads.
[Worker #1 Dec 19 16:21] Timing 39 iterations of 1024K FFT length. Best time: 1.293 ms., avg time: 1.344 ms.
[Worker #1 Dec 19 16:21] Timing 31 iterations of 1280K FFT length. Best time: 1.825 ms., avg time: 1.850 ms.
[Worker #1 Dec 19 16:21] Timing 26 iterations of 1536K FFT length. Best time: 1.993 ms., avg time: 2.305 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 1792K FFT length. Best time: 2.317 ms., avg time: 2.356 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 2048K FFT length. Best time: 2.766 ms., avg time: 2.785 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 2560K FFT length. Best time: 3.462 ms., avg time: 3.500 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 3072K FFT length. Best time: 4.141 ms., avg time: 4.190 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 3584K FFT length. Best time: 4.957 ms., avg time: 5.009 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 4096K FFT length. Best time: 5.639 ms., avg time: 5.722 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 5120K FFT length. Best time: 7.151 ms., avg time: 7.202 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 6144K FFT length. Best time: 8.471 ms., avg time: 8.639 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 7168K FFT length. Best time: 10.197 ms., avg time: 10.272 ms.
[Worker #1 Dec 19 16:21] Timing 25 iterations of 8192K FFT length. Best time: 11.917 ms., avg time: 11.952 ms.
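If you want to pull numbers like these into a table without retyping them, a few lines of Python will do. This is only a sketch: the regular expression is written against the exact line format quoted above and may need adjusting for other Prime95 versions, and parse_prime95_timings is an invented helper name.

```python
import re

# Matches the tail of each benchmark line as it appears in the log above.
LINE_RE = re.compile(
    r"of (\d+)K FFT length\. Best time: ([\d.]+) ms\., avg time: ([\d.]+) ms\."
)


def parse_prime95_timings(log_text: str) -> dict:
    """Return {FFT length in K: (best_ms, avg_ms)} for every matching line."""
    timings = {}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            fft_k = int(match.group(1))
            timings[fft_k] = (float(match.group(2)), float(match.group(3)))
    return timings


if __name__ == "__main__":
    sample = (
        "[Worker #1 Dec 19 16:21] Timing FFTs using 4 threads.\n"
        "[Worker #1 Dec 19 16:21] Timing 39 iterations of 1024K FFT length. "
        "Best time: 1.293 ms., avg time: 1.344 ms.\n"
    )
    print(parse_prime95_timings(sample))
```

Lines that do not match, such as the "Timing FFTs using 4 threads." header, are simply skipped.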

Now assembling the average times for 4-threaded Prime95 and Mlucas at the above FFT lengths, plus the intermediate radix-9/11/13/15-based ones supported by Mlucas, and supplementing with the resulting [Mlucas/Prime95] timing ratio. This is an update of the previous table, now using 10000-iter timings run after a reboot, right after which I ran the above Prime95 timing test. For cases where the FFT length in question is not supported by Prime95, its timing at the next-higher length is used as the denominator:
Code:
FFTlen     Prime95      Mlucas     Timing Ratio
(Kdbl)    msec/iter    msec/iter   [Mlucas/P95]
------    ---------    ---------   ------------
  1024       1.344       2.60          1.93
  1152                   3.13          1.69
  1280       1.850       3.56          1.92
  1408                   3.98          1.73
  1536       2.305       4.02          1.74
  1664                   4.63          1.97
  1792       2.356       4.70          1.99
  1920                   5.29          1.90
  2048       2.785       5.29          1.90
  2304                   6.00          1.71
  2560       3.500       6.44          1.84
  2816                   7.47          1.78
  3072       4.190       8.25          1.97
  3328                   8.84          1.76
  3584       5.009       9.02          1.80
  3840                  10.06          1.76
  4096       5.722      10.46          1.83
  4608                  11.78          1.64
  5120       7.202      13.47          1.87
  5632                  15.52          1.80
  6144       8.639      17.40          2.01
  6656                  18.48          1.80
  7168      10.272      19.02          1.85
  7680                  21.49          1.80
  8192      11.952      22.33          1.87
So George still kicks my butt, but now maybe with just one leg, rather than both. :)
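For anyone wanting to reproduce the ratio column in the table above from their own timings, here is a small Python sketch of the rule described there: divide the Mlucas time by the Prime95 time at the same FFT length, or at the next-higher Prime95 length when there is no exact match. Only the first few table rows are included as sample data, and timing_ratio is an invented name.

```python
from bisect import bisect_left

# Average msec/iter, copied from the first rows of the Haswell table above.
PRIME95_MS = {1024: 1.344, 1280: 1.850, 1536: 2.305, 1792: 2.356, 2048: 2.785}
MLUCAS_MS = {1024: 2.60, 1152: 3.13, 1280: 3.56, 1408: 3.98, 1536: 4.02}


def timing_ratio(fft_k: int, mlucas_ms: float, prime95_ms: dict) -> float:
    """Mlucas/Prime95 ratio, using the next-higher Prime95 length as a fallback."""
    lengths = sorted(prime95_ms)
    i = bisect_left(lengths, fft_k)  # index of the first length >= fft_k
    if i == len(lengths):
        raise ValueError("no Prime95 timing at or above %dK" % fft_k)
    return mlucas_ms / prime95_ms[lengths[i]]


if __name__ == "__main__":
    for fft_k, ms in sorted(MLUCAS_MS.items()):
        print("%6d  %6.2f  %5.2f" % (fft_k, ms, timing_ratio(fft_k, ms, PRIME95_MS)))
```

Note that the fallback makes the ratios at the in-between lengths look somewhat better for Mlucas, since Prime95 is being charged for a longer FFT than the one Mlucas actually ran.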
Old 2015-05-22, 06:40   #33
ewmayer
2ω=0
 
Sep 2002
República de California

2×13×443 Posts

Here is the head-to-head comparison on my new Xyzzy-built Broadwell (i3) NUC, both programs run 4-threaded on the 2 physical cores of the system (that setup gives best per-iteration timing for both on this system) - these timings and ratios can be compared to the Haswell ones in the above post:
Code:
FFTlen     Prime95      Mlucas     Timing Ratio
(Kdbl)    msec/iter    msec/iter   [Mlucas/P95]    Comments
------    ---------    ---------   ------------    ------------
  1024       3.894       6.869        1.76        
  1152       4.634       8.294        1.79        
  1280       4.990       8.702        1.74        
  1408       5.502      10.118        1.84        [Prime95 1440K]
  1536       6.203      10.298        1.66        
  1664       6.506      11.562        1.78        [Prime95: average of the 1600K and 1728K timings]
  1792       7.473      11.904        1.59        
  1920       7.843      13.186        1.68        
  2048       7.898      13.946        1.77        
  2304       8.889      15.846        1.78        
  2560       9.930      17.281        1.74        
  2816      11.369      19.931        1.75        [Prime95 2880K]
  3072      12.465      22.373        1.79        
  3328      13.688      23.541        1.72        [Prime95 3360K]
  3584      14.567      25.318        1.74        
  3840      16.079      27.987        1.74        
  4096      16.917      29.488        1.74        
  4608      19.762      34.077        1.72        
  5120      21.736      37.573        1.73        
  5632      25.657      43.197        1.68        [Prime95 5760K]
  6144      26.867      50.179        1.87        
  6656      30.958      51.091        1.65        [Prime95 6720K]
  7168      32.399      54.929        1.70        
  7680      34.025      60.411        1.78        
  8192      34.791      65.911        1.89        
                                 Avg: 1.75
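As a quick sanity check on the summary line, the ratio column above can be averaged in a couple of lines of Python. The values are copied straight from the table; the variable names are only for illustration.

```python
from statistics import mean

# [Mlucas/P95] ratios from the Broadwell table above, in FFT-length order.
RATIOS = [
    1.76, 1.79, 1.74, 1.84, 1.66, 1.78, 1.59, 1.68, 1.77, 1.78,
    1.74, 1.75, 1.79, 1.72, 1.74, 1.74, 1.74, 1.72, 1.73, 1.68,
    1.87, 1.65, 1.70, 1.78, 1.89,
]

average_ratio = mean(RATIOS)

if __name__ == "__main__":
    print("Avg: %.2f  (min %.2f, max %.2f)" % (average_ratio, min(RATIOS), max(RATIOS)))
```

The spread between the smallest and largest ratio is a reminder that the average hides some length-to-length variation.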

Last fiddled with by ewmayer on 2015-05-22 at 06:41
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.