mersenneforum.org Skylake AVX-512: Google Cloud has announced general availability

2017-06-03, 21:56   #1
GP2

Sep 2003

5033₈ Posts

Skylake AVX-512: Google Cloud has announced general availability

https://cloudplatform.googleblog.com...exibility.html

Skylake is available in Western US, Western Europe and Eastern Asia Pacific regions (i.e., that is where the servers themselves are located; customers can live anywhere). Looks like Google beat Amazon to the punch here, but presumably the c5 instances will be available on AWS fairly soon.
2017-06-03, 22:51   #2
ewmayer
2ω=0

Sep 2002
República de California

2×3²×653 Posts

Quote:
 Originally Posted by GP2 https://cloudplatform.googleblog.com...exibility.html Skylake is available in Western US, Western Europe and Eastern Asia Pacific regions (i.e., that is where the servers themselves are located; customer can live anywhere ) Looks like Google beat Amazon to the punch here, but presumably the c5 instances will be available on AWS fairly soon.
Anyone who wants to try out a beta of Mlucas v17 on Skylake Xeon, xzipped tarball attached (md5 = 9a97be71d623a7315ef360bf1ba2674b). After validating the checksum, xz -d *xz to uncompress, and tar xvf *tar to unpack the tarchive.

AFAICT the code is quite solid, but I'm still working on updating the linux auto-installer, so you'll want to instead follow the simple manual build procedure - create an obj_avx512 (or whatever) dir inside the src-dir of the unzipped tarball, then

gcc -c -O3 -march=knl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[if the above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt

Before building, you might also check whether your version of gcc supports -mavx512 -- I use -mavx2 for AVX2 builds, but I didn't see -mavx512 supported in the versions of gcc I use, only the above Knights-Landing-specific flag.

Note that in v17 the -nthread flag is deprecated in favor of the new -cpu flag; absent either flag, the default is to run 1-thread on logical core 0. Best throughput for a multicore will likely be 1 thread per core, i.e. from within the various run-dirs do

./Mlucas -cpu 0
./Mlucas -cpu 1

etc, using Intel core-numbering scheme. But you're welcome to also try multiple threads, as described in the Performance-Tuning section of the Mlucas README.
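In case it helps, those per-core launches can be scripted; here's a minimal sketch (the run-dir names run0..run3 and the 4-core count are assumptions, adjust to taste) that just prints the commands it would issue, so you can eyeball them before piping the output to sh:

```shell
# Emit one pinned, single-threaded Mlucas invocation per physical core.
# run0..run3 are hypothetical run-dir names; nothing is executed here.
for c in 0 1 2 3; do
  echo "cd run$c && ./Mlucas -cpu $c &"
done
```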

v17 also has a simple Python script primenet.py for automated Primenet assignments management in the src-dir, but not sure whether that will work from a cloud setup. Easy to try, though, after you create however many run-subdirs you want to run jobs from and copy the relevant mlucas.cfg file to each, just cd into one such rundir and run

python primenet.py -d -t 0 -T 100 -u [primenet uid] -p [pwd]

The -t 0 means run a single-shot get-work-to-do, and -d enables debug diagnostics. -T 100 means 'get smallest available first-time LL'; just grep the .py for 'worktype' to see the other options.

If that works for you, you'll want to do the same from each rundir before launching Mlucas within it. Note that without the '-t 0', the default is to check for 'results to submit / work to get' every 6 hours.
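The rundir setup itself can be scripted, e.g. something like the following sketch (the run0/run1 names and 2-dir count are placeholders, and the echo'd mlucas.cfg is a stand-in for the real file produced by the self-tests):

```shell
# Work in a scratch dir; in real use you'd start from the dir holding the
# mlucas.cfg written by the './Mlucas -s m' self-tests.
cd "$(mktemp -d)"
echo "placeholder" > mlucas.cfg      # stand-in for the real self-test output
for i in 0 1; do
  mkdir -p run$i
  cp mlucas.cfg run$i/               # each rundir needs its own copy
done
ls run0/mlucas.cfg run1/mlucas.cfg   # both copies should now exist
```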
Attached Files
 mlucas_v17.tar.xz (1.42 MB, 324 views)

Last fiddled with by ewmayer on 2017-07-03 at 00:34 Reason: New attachment with correct version of primenet.py script

2017-06-03, 22:59   #3
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

2²×11×227 Posts

Well, if you signed up for a "month of free trial" some time ago, ...there's good news:
Attached Thumbnails
2017-06-04, 02:05   #4
CRGreathouse

Aug 2006

5,987 Posts

Quote:
 Originally Posted by Batalov Well, if you signed up for a "month of free trial" some time ago, ...there's good news:
Apparently they extended all the old free trial accounts by 305 days.

2017-06-04, 05:07   #5
GP2

Sep 2003

13·199 Posts

Quote:
 Originally Posted by ewmayer Anyone who wants to try out a beta of Mlucas v17 on Skylake Xeon, xzipped tarball attached (md5 = ac27027656e8fdfe7e7dc52177faffc8). After validating the checksum, xz -d *xz to uncompress, and tar xvf *tar to unpack the tarchive. AFAICT the code is quite solid, but I'm still working on updating the linux auto-installer, so you'll want to instead follow the simple manual build procedure - create an obj_avx512 (or whatever) dir inside the src-dir of the unzipped tarball, then gcc -c -O3 -march=knl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log grep -i error build.log [if the above grep comes up empty] gcc -o Mlucas *.o -lm -lpthread -lrt Before building, you might also check whether your version of gcc supports -mavx512 -- I use -mavx2 for AVX2 builds, but I didn't see -mavx512 supported in the versions of gcc I use, only the above Knights-Landing-specific flag.
As a shortcut, you can say tar xvJf *.tar.xz to extract in a single step.

I selected Ubuntu 17.04 as the OS, and gcc on it does indeed support AVX-512.

However, instead of a single -mavx512 there are various separate flags; see below.

You need to select zones us-west1-a or us-west1-b (not us-west1-c; see the Regions and Zones page), and you do need to specify that you want Skylake when starting your virtual instances, otherwise you might get Broadwell instead.

After you start the virtual instance, run less /proc/cpuinfo to make sure various avx512 flags are listed, so you know it really is Skylake. The flags listed there are avx512f avx512dq avx512cd avx512bw avx512vl
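A quick way to pull just those names out of the flags line (the flags string below is a shortened sample; on a live box you'd feed in the output of grep '^flags' /proc/cpuinfo instead):

```shell
# Filter the AVX-512 feature names out of a cpuinfo-style flags string.
# $flags is deliberately unquoted so the shell splits it into one word per line.
flags="fpu sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl"
printf '%s\n' $flags | grep '^avx512' | sort
# prints avx512bw, avx512cd, avx512dq, avx512f, avx512vl (one per line)
```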

So I ran the command

Quote:
 gcc -c -O3 -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
Grepping for "error" as suggested gave:

Quote:
 ../factor.c:658:4: error: #error USE_AVX512 only meaningful if 64-bit GCC (or GCC-compatible) build and USE_FLOAT also defined at compile time!
I assume that doesn't really matter since factor.c is presumably factoring code?

I am using n1-highcpu-2, which has 2 vCPUs (i.e., one actual core).

Anyways, the program is running and here is some output:

Code:
INFO: no restart file found...starting run from scratch.
M45676327: using FFT length 2560K = 2621440 8-byte floats.
this gives an average   17.424135971069337 bits per digit
Using complex FFT radices       160        16        16        32
[Jun 04 04:42:41] M45676327 Iter# = 10000 [ 0.02% complete] clocks = 00:04:53.152 [  0.0293 sec/iter] Res64: B092FAA91F90CCC1. AvgMaxErr = 0.048224542. MaxErr = 0.070312500.
[Jun 04 04:47:36] M45676327 Iter# = 20000 [ 0.04% complete] clocks = 00:04:54.746 [  0.0295 sec/iter] Res64: 9B83E49764CD807E. AvgMaxErr = 0.048445035. MaxErr = 0.070312500.
[Jun 04 04:52:31] M45676327 Iter# = 30000 [ 0.07% complete] clocks = 00:04:55.124 [  0.0295 sec/iter] Res64: BF44E167F14222DA. AvgMaxErr = 0.048418895. MaxErr = 0.070312500.
I assume it's using the AVX-512 instructions, any way to check? Should I have attempted to use flags corresponding to AVX512PF, AVX512ER even though /proc/cpuinfo didn't list these?

By comparison, mprime benchmark has:

Code:
Best time for 2560K FFT length: 17.573 ms., avg: 17.625 ms.

PS,
Skylake on Google Cloud is only 2.0 GHz.

In general, Google Cloud has lower clock speeds than AWS; for instance, it has 2.3 GHz Haswell instances versus 2.9 GHz Haswell for AWS. Also Google's preemptible instances cost 1.5 cents per hour (fixed price) versus AWS's spot market (fluctuating price) which varies greatly from region to region but is currently very steady at around 1.3 cents per hour in us-east-2 (Ohio). So Google Cloud is not competitive with AWS for running mprime at the present time. Since mprime doesn't yet make use of AVX-512 for LL testing, it will just run relatively slowly on the Skylake box.

Last fiddled with by GP2 on 2017-06-04 at 05:13

2017-06-04, 06:05   #6
ewmayer
2ω=0

Sep 2002
República de California

2·3²·653 Posts

Quote:
 Originally Posted by GP2 As a shortcut, you can say tar xvJf *.tar.xz to extract in a single step.
My version of tar (under MacOS) is older and only supports lowercase 'j' (bzip2) - good to know that 'J' works the same way for xz under newer versions of tar.

Quote:
After you start the virtual instance, run less /proc/cpuinfo to make sure various avx512 flags are listed, so you know it really is Skylake. The flags listed there are avx512f avx512dq avx512cd avx512bw avx512vl

So I ran the command
Quote:
 gcc -c -O3 -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vl -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
AVX-512F (512-bit vector foundation instructions) is the only instruction-subset Mlucas uses currently (and all that is on offer on the KNL, where I did my code-dev). OTOH, I wonder whether adding suitable arguments to -march and -mtune might help?

Quote:
Grepping for "error" as suggested gave:
Quote:
 ../factor.c:658:4: error: #error USE_AVX512 only meaningful if 64-bit GCC (or GCC-compatible) build and USE_FLOAT also defined at compile time!
I assume that doesn't really matter since factor.c is presumably factoring code?
You assume correctly.

Quote:
 Anyways, the program is running and here is some output: Code: INFO: no restart file found...starting run from scratch. M45676327: using FFT length 2560K = 2621440 8-byte floats. this gives an average 17.424135971069337 bits per digit Using complex FFT radices 160 16 16 32 [Jun 04 04:42:41] M45676327 Iter# = 10000 [ 0.02% complete] clocks = 00:04:53.152 [ 0.0293 sec/iter] Res64: B092FAA91F9 0CCC1. AvgMaxErr = 0.048224542. MaxErr = 0.070312500. [Jun 04 04:47:36] M45676327 Iter# = 20000 [ 0.04% complete] clocks = 00:04:54.746 [ 0.0295 sec/iter] Res64: 9B83E49764C D807E. AvgMaxErr = 0.048445035. MaxErr = 0.070312500. [Jun 04 04:52:31] M45676327 Iter# = 30000 [ 0.07% complete] clocks = 00:04:55.124 [ 0.0295 sec/iter] Res64: BF44E167F14 222DA. AvgMaxErr = 0.048418895. MaxErr = 0.070312500. I assume it's using the AVX-512 instructions, any way to check? Should I have attempted to use flags corresponding to AVX512PF, AVX512ER even though /proc/cpuinfo didn't list these?
By way of comparison, I get 54 msec/iter @2560K on one core of the 1.3GHz KNL. Re. AVX-512 usage, I doubt those extra flags would make a difference - but see my above note re. -march,-mtune - and on program launch you should look for the bolded line below:

Mlucas 17.0

INFO: testing qfloat routines...
CPU Family = x86_64, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0.
INFO: Build uses AVX512 instruction set.

Quote:
 By comparison, mprime benchmark has: Code: Best time for 2560K FFT length: 17.573 ms., avg: 17.625 ms.
If that's on the same Skylake Xeon hardware, it beats me, obviously.

Quote:
 PS, Skylake on Google Cloud is only 2.0 GHz.
Factoring in the clock speed difference between that and the 1.3 GHz of the KNL we have

Skylake Xeon: (29.5 msec/iter * 2.0 mcycles/msec) = 59 mcycles/iter
Knights Landing: (54.0 msec/iter * 1.3 mcycles/msec) = 70 mcycles/iter ,

hence ~1.2x greater per-cycle throughput for Skylake Xeon vs KNL, running the same code. That seems low, though perhaps I was expecting too much?
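Spelled out as a throwaway calculation (nothing here beyond the timings quoted above; GHz = Mcycles/msec):

```shell
awk 'BEGIN {
  skx = 29.5 * 2.0   # msec/iter x Mcycles/msec = Mcycles/iter, Skylake Xeon
  knl = 54.0 * 1.3   # same for Knights Landing
  printf "SKX: %.0f Mcycles/iter\n", skx    # -> 59
  printf "KNL: %.0f Mcycles/iter\n", knl    # -> 70
  printf "KNL/SKX: %.2f\n", knl / skx       # -> 1.19
}'
```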

One last thing - would appreciate if you would be so kind as to also do an AVX2-build (just use -mavx2 for now, same as I use for AVX2 builds on both Haswell and KNL) and, with any running jobs paused, rerun the self-tests. I want to compare the AVX2/AVX512 timing ratios you get to the ~1.6x I see on the KNL. Thanks!

2017-06-04, 06:42   #7
ewmayer
2ω=0

Sep 2002
República de California

2·3²·653 Posts

Quote:
 Originally Posted by ewmayer One last thing - would appreciate if you would be so kind as to also do an AVX2-build (just use -mavx2 for now, same as I use for AVX2 builds on both Haswell and KNL) and, with any running jobs paused, rerun the self-tests. I want to compare the AVX2/AVX512 timing ratios you get to the ~1.6x I see on the KNL. Thanks!
One more (i.e. last++) thing to try - for both avx2 and avx512 builds, check the effect of running on one versus both logical cores attached to your single-physical-core instance, using just a single FFT length and radix set for now. Say the same params being used in the production run you excerpted above:

./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0
and
./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0:1

(Yes, I know I'm very demanding. :)

2017-06-04, 10:30   #8
GP2

Sep 2003

A1B₁₆ Posts

Quote:
 Originally Posted by ewmayer AVX-512F (512-bit vector foundation instructions) is the only instruction-subset Mlucas uses currently (and all that is on offer on the KNL, where I did my code-dev). OTOH, I wonder whether adding suitable arguments to -march and -mtune might help?
I recompiled with

Code:
gcc -c -O3 -march=skylake-avx512 -mtune=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
but after re-running ./Mlucas -s m the mlucas.cfg file looked very similar to the earlier one (there's always a bit of noise with a virtual machine sharing the same physical hardware with other users). Running Mlucas on an exponent confirmed the timings are identical, at 29.3 to 29.4 ms/iter for 2560K FFT.

Quote:
 on program launch you should look for the bolded line below: INFO: Build uses AVX512 instruction set.
Yes it's there, I should have noticed it earlier.

Quote:
 If that's on the same Skylake Xeon hardware, it beats me, obviously.
Yes, it was. And on an AWS c4.large instance (2.9 GHz), 2560K FFT on mprime benchmarks at about 12 ms/iter.
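Folding in the preemptible/spot prices mentioned earlier in the thread and those 2560K mprime timings, the rough cost picture looks like this (simple arithmetic; spot and preemptible prices fluctuate, so treat these as ballpark figures):

```shell
awk 'BEGIN {
  # cents/hour divided by iters/hour, scaled to cents per million iterations
  gce = 1.5 / (3600e3 / 17.6) * 1e6   # GCE preemptible Skylake, 17.6 ms/iter
  aws = 1.3 / (3600e3 / 12.0) * 1e6   # AWS c4 spot Haswell, 12 ms/iter
  printf "GCE: %.1f cents per Miter\n", gce   # -> 7.3
  printf "AWS: %.1f cents per Miter\n", aws   # -> 4.3
}'
```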

Google doesn't provide any information about the specific Skylake model that they use, they only specify the frequency of 2.0 GHz. However, when mprime 29.1 runs it reports an L2 cache of 256 KB (this is on one core). I'm not sure what method it uses to detect that, but it must be detecting it dynamically because it reports the architecture as "Unknown Intel". I think I read that the higher-end Skylakes (the 7xxx series) are supposed to have 1 MB / core of L2 cache, so this must be in the 6xxx series.
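Incidentally, if the guest kernel exposes cache geometry, glibc's getconf offers one more data point (value in bytes; 262144 would match the 256 KB mprime reports, though a VM may report the host's true size or nothing at all):

```shell
# Reported per-core L2 cache size in bytes (0 or blank if unavailable).
getconf LEVEL2_CACHE_SIZE
```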

gcc actually provides a flag to specify the L2 cache size. I wonder if it would be worthwhile to try -l2-cache-size=256

I'll try the AVX2 build and the other stuff a little later today.

Last fiddled with by GP2 on 2017-06-04 at 13:18 Reason: only one leading - (not two) in the -l2-cache-size option

2017-06-04, 11:59   #9
GP2

Sep 2003

101000011011₂ Posts

Quote:
 Originally Posted by ewmayer One last thing - would appreciate if you would be so kind as to also do an AVX2-build (just use -mavx2 for now, same as I use for AVX2 builds on both Haswell and KNL) and, with any running jobs paused, rerun the self-tests. I want to compare the AVX2/AVX512 timing ratios you get to the ~1.6x I see on the KNL. Thanks!
I compiled with

Code:
gcc -c -O3 -mavx2 -DUSE_THREADS ../*.c >& build.log
removing the -DUSE_AVX512 flag since it's presumably no longer applicable.

The self-test actually crashed in the middle of the FFT = 5120K section, with the error:

Code:
N = 5242880, radix_set = 6 : product of complex radices 0 != (FFT length/2)
ERROR: at line 2818 of file ../get_fft_radices.c
Assertion failed: 0
The timings were about four times slower:

Code:
17.0
1024  msec/iter =   43.18  ROE[avg,max] = [0.255357143, 0.343750000]  radices =   8 16 16 16 16  0  0  0  0  0
1152  msec/iter =   49.72  ROE[avg,max] = [0.227158901, 0.256347656]  radices =  18  8 16 16 16  0  0  0  0  0
1280  msec/iter =   55.10  ROE[avg,max] = [0.249909319, 0.281250000]  radices =  10 16 16 16 16  0  0  0  0  0
1408  msec/iter =   63.56  ROE[avg,max] = [0.231169782, 0.265625000]  radices =  22  8 16 16 16  0  0  0  0  0
1536  msec/iter =   66.41  ROE[avg,max] = [0.226067243, 0.253906250]  radices =  12 16 16 16 16  0  0  0  0  0
1664  msec/iter =   74.32  ROE[avg,max] = [0.255719866, 0.281250000]  radices =  26  8 16 16 16  0  0  0  0  0
1792  msec/iter =   79.85  ROE[avg,max] = [0.232812500, 0.312500000]  radices =  14 16 16 16 16  0  0  0  0  0
1920  msec/iter =  106.87  ROE[avg,max] = [0.244818987, 0.281250000]  radices =  60 32 32 16  0  0  0  0  0  0
2048  msec/iter =   91.61  ROE[avg,max] = [0.255859375, 0.312500000]  radices =   8 16 16 16 32  0  0  0  0  0
2304  msec/iter =  103.68  ROE[avg,max] = [0.231054687, 0.281250000]  radices =  18 16 16 16 16  0  0  0  0  0
2560  msec/iter =  115.50  ROE[avg,max] = [0.256919643, 0.312500000]  radices =  10 16 16 32 16  0  0  0  0  0
2816  msec/iter =  134.34  ROE[avg,max] = [0.226226153, 0.253906250]  radices =  22 16 16 16 16  0  0  0  0  0
3072  msec/iter =  139.39  ROE[avg,max] = [0.229003906, 0.281250000]  radices =  12 16 16 32 16  0  0  0  0  0
3328  msec/iter =  156.65  ROE[avg,max] = [0.255078125, 0.281250000]  radices =  26 16 16 16 16  0  0  0  0  0
3584  msec/iter =  167.19  ROE[avg,max] = [0.234919085, 0.281250000]  radices =  14 16 16 32 16  0  0  0  0  0
3840  msec/iter =  232.40  ROE[avg,max] = [0.260686384, 0.312500000]  radices = 240  8  8  8 16  0  0  0  0  0
4096  msec/iter =  191.62  ROE[avg,max] = [0.242801339, 0.312500000]  radices =   8 16 16 32 32  0  0  0  0  0
4608  msec/iter =  217.14  ROE[avg,max] = [0.226663644, 0.265625000]  radices =  18 16 16 32 16  0  0  0  0  0
compared to the AVX-512 version:

Code:
17.0
1024  msec/iter =   10.27  ROE[avg,max] = [0.234430804, 0.281250000]  radices =  32 16 32 32  0  0  0  0  0  0
1152  msec/iter =   13.07  ROE[avg,max] = [0.274553571, 0.343750000]  radices =  36 16 32 32  0  0  0  0  0  0
1280  msec/iter =   14.90  ROE[avg,max] = [0.290569196, 0.343750000]  radices = 160 16 16 16  0  0  0  0  0  0
1408  msec/iter =   17.20  ROE[avg,max] = [0.262848772, 0.281250000]  radices = 176 16 16 16  0  0  0  0  0  0
1536  msec/iter =   19.33  ROE[avg,max] = [0.250020926, 0.281250000]  radices = 192 16 16 16  0  0  0  0  0  0
1664  msec/iter =   20.86  ROE[avg,max] = [0.264160156, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
1792  msec/iter =   19.47  ROE[avg,max] = [0.282254464, 0.312500000]  radices =  56 16 32 32  0  0  0  0  0  0
1920  msec/iter =   23.27  ROE[avg,max] = [0.256640625, 0.312500000]  radices = 240 16 16 16  0  0  0  0  0  0
2048  msec/iter =   21.54  ROE[avg,max] = [0.238113839, 0.281250000]  radices =  32 32 32 32  0  0  0  0  0  0
2304  msec/iter =   26.63  ROE[avg,max] = [0.266880580, 0.312500000]  radices = 144 16 16 32  0  0  0  0  0  0
2560  msec/iter =   29.84  ROE[avg,max] = [0.257589286, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
2816  msec/iter =   34.61  ROE[avg,max] = [0.245047433, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
3072  msec/iter =   38.85  ROE[avg,max] = [0.275613839, 0.375000000]  radices = 192 16 16 32  0  0  0  0  0  0
3328  msec/iter =   42.24  ROE[avg,max] = [0.270535714, 0.312500000]  radices = 208 16 16 32  0  0  0  0  0  0
3584  msec/iter =   43.80  ROE[avg,max] = [0.269921875, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
3840  msec/iter =   48.09  ROE[avg,max] = [0.252887835, 0.312500000]  radices = 240 16 16 32  0  0  0  0  0  0
4096  msec/iter =   49.58  ROE[avg,max] = [0.245026507, 0.281250000]  radices =  32 16 16 16 16  0  0  0  0  0
4608  msec/iter =   59.24  ROE[avg,max] = [0.236941964, 0.281250000]  radices = 144 16 32 32  0  0  0  0  0  0
5120  msec/iter =   65.80  ROE[avg,max] = [0.297656250, 0.375000000]  radices = 160 16 32 32  0  0  0  0  0  0
5632  msec/iter =   75.59  ROE[avg,max] = [0.234268624, 0.281250000]  radices = 176 16 32 32  0  0  0  0  0  0
6144  msec/iter =   85.82  ROE[avg,max] = [0.258161272, 0.281250000]  radices = 192 16 32 32  0  0  0  0  0  0
6656  msec/iter =   92.11  ROE[avg,max] = [0.250704738, 0.312500000]  radices = 208 16 32 32  0  0  0  0  0  0
7168  msec/iter =   94.87  ROE[avg,max] = [0.264208984, 0.312500000]  radices = 224 16 32 32  0  0  0  0  0  0
7680  msec/iter =  103.35  ROE[avg,max] = [0.266294643, 0.312500000]  radices = 240 16 32 32  0  0  0  0  0  0
I double-checked to make sure there were no running jobs in the background.

2017-06-04, 12:22   #10
GP2

Sep 2003

13·199 Posts

Quote:
 Originally Posted by ewmayer ./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0
Code:
NTHREADS = 1
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average   18.693951034545897 bits per digit
Using complex FFT radices       160        16        16        32
mers_mod_square: Complex-roots arrays have 1024, 1280 elements.
Using 1 threads in carry step
100 iterations of M49005071 with FFT length 2621440 = 2560 K
Res64: 07EFE3EF1F78E763. AvgMaxErr = 0.257589286. MaxErr = 0.312500000. Program: E17.0
Res mod 2^36     =          64952526691
Res mod 2^35 - 1 =          22407816581
Res mod 2^36 - 1 =          54111649274
Clocks = 00:00:02.357
Done ...
Quote:
 ./Mlucas -fftlen 2560 -iters 100 -radset 0 -cpu 0:1
Code:
NTHREADS = 2
M49005071: using FFT length 2560K = 2621440 8-byte floats.
this gives an average   18.693951034545897 bits per digit
Using complex FFT radices       160        16        16        32
mers_mod_square: Complex-roots arrays have 1024, 1280 elements.
Using 2 threads in carry step
100 iterations of M49005071 with FFT length 2621440 = 2560 K
Res64: 07EFE3EF1F78E763. AvgMaxErr = 0.257589286. MaxErr = 0.312500000. Program: E17.0
Res mod 2^36     =          64952526691
Res mod 2^35 - 1 =          22407816581
Res mod 2^36 - 1 =          54111649274
Clocks = 00:00:01.216
Done ...

Hmmmmmm....

That was for the AVX-512, the version with -march and -mtune.

I retried running worktodo.ini with ./Mlucas -cpu 0:1 versus the version with no option flag, and instead of 29.3–29.4 ms/iter it's down to 18.8–18.9 ms/iter. Wow.
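That works out to a tidy speedup from the second logical core (using the midpoints of the two ranges above):

```shell
awk 'BEGIN {
  one = 29.35; two = 18.85   # ms/iter with 1 vs 2 threads on the same core
  printf "2-thread speedup: %.2fx\n", one / two   # -> 1.56x
}'
```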

I'm a little confused here because I'm pretty sure that on Google Cloud, 2 vCPUs = 1 actual core. Looking at /proc/cpuinfo, I think these lines confirm it:

Code:
processor       : 0
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1

processor       : 1
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
I don't think it's worth repeating the exercise with AVX2 since it seems to perform so much worse than AVX-512, but let me know if it still matters.

2017-06-04, 13:11   #11
GP2

Sep 2003

13×199 Posts

Meanwhile, mprime 29.1 on the same box:

EDIT: oops, this is a misleading comparison; the figures below are for a 2400K FFT, not 2560K. As mentioned earlier, a 2560K FFT in the benchmarks takes about 17.6 ms/iter.

With CoresPerTest=1:

Code:
[Work thread Jun 4 12:32] Iteration: 230000 / 45106307 [0.50%], ms/iter: 16.682, ETA: 8d 15:56
[Work thread Jun 4 12:35] Iteration: 240000 / 45106307 [0.53%], ms/iter: 16.697, ETA: 8d 16:05
[Work thread Jun 4 12:37] Iteration: 250000 / 45106307 [0.55%], ms/iter: 16.676, ETA: 8d 15:47
[Work thread Jun 4 12:40] Iteration: 260000 / 45106307 [0.57%], ms/iter: 16.684, ETA: 8d 15:49

and with CoresPerTest=2:

Code:
[Work thread Jun 4 12:46] Iteration: 280000 / 45106307 [0.62%], ms/iter: 16.678, ETA: 8d 15:39
[Work thread Jun 4 12:49] Iteration: 290000 / 45106307 [0.64%], ms/iter: 16.667, ETA: 8d 15:29
[Work thread Jun 4 12:52] Iteration: 300000 / 45106307 [0.66%], ms/iter: 16.697, ETA: 8d 15:48
[Work thread Jun 4 12:54] Iteration: 310000 / 45106307 [0.68%], ms/iter: 16.660, ETA: 8d 15:18

There is very little if any difference between the two. I didn't try HyperthreadLL=1 since this seems to be more or less deprecated in version 29.

The slight variability from one set of 10000 to the next is explained by the fact that on a virtual machine, other users are sharing the same physical server, and among other things, they're competing for the L3 cache.

In any case, this confirms that the n1-highcpu-2 virtual machine type with "2 vCPUs" really is only one core. In other words, the same obfuscation as on AWS.

But the main revelation here is that Mlucas is competitive with mprime on this platform. The difference between 18.8 ms/iter and 16.8 ms/iter is only about 11%.

And playing with compiler flags or tinkering further with the code might yield more improvements, perhaps more readily than George can tinker with assembler to implement AVX-512 for mprime LL testing (he's already said that it will take some time). So it might be worthwhile to try out Mlucas on Google Cloud on Skylake. It's only 1.5 cents per hour for a preemptible instance. If there's interest, I might try to write a how-to guide.

And hopefully real soon Amazon AWS will be ready with their c5 instances (also Skylake).

Now to wait for the exponents to run to completion; hopefully they will yield good verified results.

Last fiddled with by GP2 on 2017-06-04 at 18:46 Reason: exponent 45106307 uses a 2400K FFT with mprime, not 2560K. Bad comparison.

