mersenneforum.org ARM builds and SIMD-assembler prospects

2017-03-13, 20:08   #34
ldesnogu

Jan 2008
France

1020₈ Posts

Quote:
 Originally Posted by ewmayer I'm not familiar enough with ARM to understand why -m64 is unsupported in GCC, but correctly handling aarch64 in platform.h will cause the build to be in 64-bit mode. (I had assumed -m64 was needed to trigger the aarch64-related predefs, but your output from [1] will settle that.)
ARM gcc comes in 2 flavors: one targets 64-bit code (aarch64), the other targets 32-bit code, so there's no need for -m64 or -m32.

2017-03-13, 20:22   #35
ldesnogu

Jan 2008
France

2⁴·3·11 Posts

Quote:
 Originally Posted by ewmayer gcc -c -Os -m64 -DUSE_THREADS ../Mlucas.c
For that to succeed, you need this:
Code:
$ diff platform.h~ platform.h
714a715,728
> #elif defined(__AARCH64EL__)
> #ifndef OS_BITS
> #define OS_BITS 32
> #endif
> #define CPU_TYPE
> #define CPU_IS_ARM_EABI
> #if(defined(__GNUC__) || defined(__GNUG__))
> #define COMPILER_TYPE
> #define COMPILER_TYPE_GCC
> #else
> #define COMPILER_TYPE
> #define COMPILER_TYPE_UNKNOWN
> #endif
>
And it compiles:
Code:
$ aarch64-none-linux-gnu-gcc -Os -DUSE_THREADS -c *.c
$ aarch64-none-linux-gnu-gcc -Os -DUSE_THREADS *.o -o mlucas64 -lm -lpthread
$ file mlucas64
mlucas64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-aarch64.so.1, for GNU/Linux 3.7.0, not stripped
Tested with QEMU, it starts but I have no clue how I should launch the binary to do something sensible that doesn't take forever :)

Last fiddled with by ldesnogu on 2017-03-13 at 20:38

2017-03-13, 21:41   #36
ewmayer
∂2ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by Lorenzo Ok! I have done!
Code:
ubuntu@pine64:~/Solaris2/mlucas-14.1$ gcc -dM -E - < /dev/null
[snip]
Thanks! The key predefine there is __aarch64__, which is also the trigger in the .h file I posted ... so the latter should allow you to build. So I don't understand the raft of 'stray character' errors you get with that one - here are lines 88-90 of that header:
Code:
#elif(defined(_AIX))
#define OS_TYPE
#define OS_TYPE_AIX
Can you open both the original and new .h in an editor and compare the file encodings? If those are the same, can you diff your local copies of those 2 file versions? Maybe that will reveal something relevant to the stray-octal errors you are getting.

Quote:
 Originally Posted by ldesnogu For that to succeed, you need this: [snip] ... And it compiles ... Tested with QEMU, it starts but I have no clue how I should launch the binary to do something sensible that doesn't take forever :)
That sets the wrong value of OS_BITS - for basic C-code Mlucas builds that won't matter much, except for various utility functions which make heavy use of 64-bit-int math (e.g. the quad-float library used for high-precision inits of double constants), but for future asm-code builds we need the right bitness to be set.

The predef section beginning at line 792 in the .h I posted should work just fine for Lorenzo, and for you as well - did you try building with that, or did you just make your mod above and use it? Please try the unmodified .h file - the one with the __aarch64__ predef stuff at line 792 - and let me know if you get the same unrecognized-char errors as Lorenzo.

You can quick-test the binary by trying some timing runs at a specific FFT length; for example
Code:
./Mlucas -fftlen 1024 -nthread 1
will try all radix combos available @1024K and write the best-timing one to the mlucas.cfg file. You can also play with the thread count - note the default there is to try to use all available cores.

2017-03-13, 22:35   #37
ldesnogu

Jan 2008
France

2⁴·3·11 Posts

Quote:
 Originally Posted by ewmayer The predef section beginning at line 792 in the .h I posted should work just fine for Lorenzo, and for you as well - did you try building with that, or did you just make your mod above and use it? [snip]
Silly me, I had missed your attachment. It compiles fine with it. So Lorenzo's error comes from somewhere else.

Quote:
 Originally Posted by ewmayer You can quick-test the binary by trying some timing runs at a specific FFT length ... [snip]
Code:
/work/qemu/qemu/aarch64-linux-user/qemu-aarch64 -L /work/Cross/fsf-6.169/aarch64-none-linux-gnu/libc ./mlucas64 -fftlen 1024 -nthread 1 -iters 1

    Mlucas 14.1

    http://hogranch.com/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 6.3.1 20170118.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
INFO: testing IMUL routines...
INFO: System has 4 available processor cores.
INFO: testing FFT radix tables...
All MaxErr are at 0.

2017-03-13, 22:46   #38
ewmayer
∂2ω=0

Sep 2002
República de California

2×13×443 Posts

Thanks, Laurent - so I suspect a file-encoding issue with Lorenzo's .h file downloaded from my post, or perhaps his unzip utility inserted a bunch of garbage chars.

2017-03-14, 06:56   #39
Lorenzo

Aug 2010
Republic of Belarus

AA₁₆ Posts

Quote:
 Originally Posted by ewmayer Thanks, Laurent - so I suspect a file-encoding issue with Lorenzo's .h file downloaded from my post, or perhaps his unzip utility inserted a bunch of garbage chars.
Right! Sorry, found the issue. It's working nicely! So without SIMD optimization it looks like:
Code:
ubuntu@pine64:~/Solaris2/mlucas-14.1$ cat mlucas.cfg
14.1
1024  msec/iter =  114.57  ROE[avg,max] = [0.250000000, 0.250000000]  radices =  32 32 16 32  0  0  0  0  0  0
1152  msec/iter =  109.04  ROE[avg,max] = [0.206808036, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
1280  msec/iter =  133.03  ROE[avg,max] = [0.236600167, 0.281250000]  radices = 160 16 16 16  0  0  0  0  0  0
1408  msec/iter =  140.47  ROE[avg,max] = [0.273688616, 0.343750000]  radices = 176 16 16 16  0  0  0  0  0  0
1536  msec/iter =  161.30  ROE[avg,max] = [0.223493304, 0.281250000]  radices = 192 16 16 16  0  0  0  0  0  0
1664  msec/iter =  166.09  ROE[avg,max] = [0.246149554, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
1792  msec/iter =  180.60  ROE[avg,max] = [0.220703125, 0.281250000]  radices = 224 16 16 16  0  0  0  0  0  0
1920  msec/iter =  198.81  ROE[avg,max] = [0.222460938, 0.250000000]  radices = 240 16 16 16  0  0  0  0  0  0
2048  msec/iter =  206.38  ROE[avg,max] = [0.278125000, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
2304  msec/iter =  242.52  ROE[avg,max] = [0.208269392, 0.250000000]  radices = 288 16 16 16  0  0  0  0  0  0
2560  msec/iter =  308.94  ROE[avg,max] = [0.243164062, 0.281250000]  radices = 160 16 16 32  0  0  0  0  0  0
2816  msec/iter =  329.54  ROE[avg,max] = [0.272896903, 0.343750000]  radices = 176 16 16 32  0  0  0  0  0  0
3072  msec/iter =  371.71  ROE[avg,max] = [0.225892857, 0.281250000]  radices = 192 16 16 32  0  0  0  0  0  0
3328  msec/iter =  388.66  ROE[avg,max] = [0.241322545, 0.281250000]  radices = 208 16 16 32  0  0  0  0  0  0
3584  msec/iter =  414.33  ROE[avg,max] = [0.220870536, 0.250000000]  radices = 224 16 16 32  0  0  0  0  0  0
3840  msec/iter =  453.97  ROE[avg,max] = [0.213636998, 0.265625000]  radices = 240 16 16 32  0  0  0  0  0  0
4096  msec/iter =  472.52  ROE[avg,max] = [0.247321429, 0.250000000]  radices = 256 16 16 32  0  0  0  0  0  0
4608  msec/iter =  544.08  ROE[avg,max] = [0.201870292, 0.222656250]  radices = 288 16 16 32  0  0  0  0  0  0
5120  msec/iter =  673.79  ROE[avg,max] = [0.239508929, 0.312500000]  radices = 160 16 32 32  0  0  0  0  0  0
5632  msec/iter =  693.38  ROE[avg,max] = [0.278264509, 0.343750000]  radices = 176 16 32 32  0  0  0  0  0  0
6144  msec/iter =  776.30  ROE[avg,max] = [0.213504464, 0.250000000]  radices = 192 16 32 32  0  0  0  0  0  0
6656  msec/iter =  814.97  ROE[avg,max] = [0.242299107, 0.281250000]  radices = 208 16 32 32  0  0  0  0  0  0
7168  msec/iter =  870.94  ROE[avg,max] = [0.219768415, 0.312500000]  radices = 224 16 32 32  0  0  0  0  0  0
7680  msec/iter =  955.79  ROE[avg,max] = [0.222209821, 0.250000000]  radices = 240 16 32 32  0  0  0  0  0  0
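A side note for anyone post-processing listings like the one above: the cfg rows are easy to parse mechanically. A small Python sketch (field layout assumed from the rows above; `parse_cfg` is a hypothetical helper, not part of Mlucas):

```python
import re

# Three rows copied from the mlucas.cfg listing above.
sample = """\
1024  msec/iter =  114.57  ROE[avg,max] = [0.250000000, 0.250000000]  radices =  32 32 16 32  0  0  0  0  0  0
1152  msec/iter =  109.04  ROE[avg,max] = [0.206808036, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
1280  msec/iter =  133.03  ROE[avg,max] = [0.236600167, 0.281250000]  radices = 160 16 16 16  0  0  0  0  0  0
"""

def parse_cfg(text):
    """Extract (fft_length_K, msec_per_iter) pairs from cfg-style rows."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)", line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2))))
    return rows

rows = parse_cfg(sample)
best = min(rows, key=lambda r: r[1])
print(best)  # the 1152K timing is the odd one out: faster than 1024K
```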

2017-03-14, 07:22   #40
ewmayer
∂2ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by Lorenzo Right! Sorry, found the issue. It's working nicely! So without SIMD optimization it looks like:
Code:
ubuntu@pine64:~/Solaris2/mlucas-14.1$ cat mlucas.cfg
14.1
1024  msec/iter =  114.57  ROE[avg,max] = [0.250000000, 0.250000000]  radices =  32 32 16 32  0  0  0  0  0  0
1152  msec/iter =  109.04  ROE[avg,max] = [0.206808036, 0.250000000]  radices = 288  8 16 16  0  0  0  0  0  0
1280  msec/iter =  133.03  ROE[avg,max] = [0.236600167, 0.281250000]  radices = 160 16 16 16  0  0  0  0  0  0
[snip]
Glad to hear it - what was the issue with the updated .h file? I'd like to know in case another user hits something similar in future.

The only timing that really pops out is the anomalously low one @1152K ... but SIMD timings will be the ones of real interest. How many threads did you run your self-test with? (Your screen output will indicate that, e.g. NTHREADS = {some value >= 1}.)

2017-03-14, 07:46   #41
Lorenzo

Aug 2010
Republic of Belarus

2·5·17 Posts

The issue was that the file was unzipped incorrectly by me, so in general it's OK.

I ran ./mlucas -s m. It looks like Mlucas used 4 cores (threads) correctly; I haven't played with thread counts yet. In general: very slow.

Last fiddled with by Lorenzo on 2017-03-14 at 08:19

2017-03-14, 09:12   #42
ewmayer
∂2ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by Lorenzo I ran ./mlucas -s m. It looks like Mlucas used 4 cores (threads) correctly; I haven't played with thread counts yet. In general: very slow.
Yes - even with a 2-3x speedup from use of SIMD, the ARM will be more about performance per watt (and per hardware $) than speed-per-core.

2017-03-14, 10:19   #43
ET_
Banned

"Luigi"
Aug 2002
Team Italia

2·2,383 Posts

Quote:
 Originally Posted by ewmayer Yes - even with a 2-3x speedup from use of SIMD, the ARM will be more about performance per watt (and per hardware $) than speed-per-core.
The following mlucas.cfg file was generated on a 2.8 GHz AMD Opteron running RedHat 64-bit linux.
Code:
        2048  sec/iter =    0.134  ROE[min,max] = [0.281250000, 0.343750000]  radices =  32 32 32 32  0  0  0  0  0  0  [Any text offset from the list-ending 0 by whitespace is ignored]
2304  sec/iter =    0.148  ROE[min,max] = [0.242187500, 0.281250000]  radices =  36  8 16 16 16  0  0  0  0  0
2560  sec/iter =    0.166  ROE[min,max] = [0.281250000, 0.312500000]  radices =  40  8 16 16 16  0  0  0  0  0
2816  sec/iter =    0.188  ROE[min,max] = [0.328125000, 0.343750000]  radices =  44  8 16 16 16  0  0  0  0  0
3072  sec/iter =    0.222  ROE[min,max] = [0.250000000, 0.250000000]  radices =  24 16 16 16 16  0  0  0  0  0
3584  sec/iter =    0.264  ROE[min,max] = [0.281250000, 0.281250000]  radices =  28 16 16 16 16  0  0  0  0  0
4096  sec/iter =    0.300  ROE[min,max] = [0.250000000, 0.312500000]  radices =  16 16 16 16 32  0  0  0  0  0
The following mlucas.cfg file was generated on a 1.4 GHz ARM running 64-bit linux.
Code:
      2048  msec/iter =  206.38  ROE[avg,max] = [0.278125000, 0.281250000]  radices = 256 16 16 16  0  0  0  0  0  0
2304  msec/iter =  242.52  ROE[avg,max] = [0.208269392, 0.250000000]  radices = 288 16 16 16  0  0  0  0  0  0
2560  msec/iter =  308.94  ROE[avg,max] = [0.243164062, 0.281250000]  radices = 160 16 16 32  0  0  0  0  0  0
2816  msec/iter =  329.54  ROE[avg,max] = [0.272896903, 0.343750000]  radices = 176 16 16 32  0  0  0  0  0  0
3072  msec/iter =  371.71  ROE[avg,max] = [0.225892857, 0.281250000]  radices = 192 16 16 32  0  0  0  0  0  0
3328  msec/iter =  388.66  ROE[avg,max] = [0.241322545, 0.281250000]  radices = 208 16 16 32  0  0  0  0  0  0
3584  msec/iter =  414.33  ROE[avg,max] = [0.220870536, 0.250000000]  radices = 224 16 16 32  0  0  0  0  0  0
3840  msec/iter =  453.97  ROE[avg,max] = [0.213636998, 0.265625000]  radices = 240 16 16 32  0  0  0  0  0  0
4096  msec/iter =  472.52  ROE[avg,max] = [0.247321429, 0.250000000]  radices = 256 16 16 32  0  0  0  0  0  0
In other words, a 4-threaded ARM is about 1.5x slower than one core of a 2.8 GHz Opteron.
With a 3x SIMD speedup, its efficiency would be 0.5x on a per-core comparison, and 1:1 on a per-core-and-GHz comparison with the Opteron.

That is to say, a minicluster of 20 ARM cores would be 20x faster on a per-GHz measurement and 10x faster on a per-core measurement - and about as cheap as the single Opteron system. Not to speak of the energy savings...
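The ratios above can be reproduced from the two tables, taking the 2048K rows as the reference point. A quick sketch of the arithmetic (the 3x SIMD speedup is the hypothetical figure from the discussion, not a measured one):

```python
# Numbers from the 2048K rows above.
opteron_ms_per_iter = 134.0   # 0.134 sec/iter, 1 core @ 2.8 GHz
arm_ms_per_iter     = 206.38  # msec/iter, 4 threads @ 1.4 GHz

# 4-threaded ARM vs. one Opteron core:
slowdown = arm_ms_per_iter / opteron_ms_per_iter      # ~1.54x slower

# Assume the hypothetical 3x SIMD speedup for the ARM build:
arm_simd = arm_ms_per_iter / 3.0                      # ~68.8 ms with 4 cores
per_core_eff = opteron_ms_per_iter / (arm_simd * 4)   # ~0.49x per ARM core
per_core_per_ghz = per_core_eff * (2.8 / 1.4)         # ~0.97, i.e. roughly 1:1

print(round(slowdown, 2), round(per_core_eff, 2), round(per_core_per_ghz, 2))
```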

Last fiddled with by ET_ on 2017-03-14 at 10:20

2017-03-14, 18:42   #44
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2³·3·7² Posts

You got it working, nice! That is a Pine64 with 4x ARM Cortex-A53 cores (@1.4GHz), right?

I'm a little bit surprised it is about as fast as my Odroid-U2 (4x ARM Cortex-A9 cores @1.7GHz), which is only 32-bit and a much older architecture.
http://mersenneforum.org/showpost.ph...5&postcount=94
Code:
1024  msec/iter =  121.70  ROE[avg,max] = [0.298214286, 0.312500000]  radices = 128 16 16 16  0  0  0  0  0  0
1152  msec/iter =  142.69  ROE[avg,max] = [0.225310407, 0.250000000]  radices = 144 16 16 16  0  0  0  0  0  0
1280  msec/iter =  161.44  ROE[avg,max] = [0.251618304, 0.312500000]  radices = 160 16 16 16  0  0  0  0  0  0
1408  msec/iter =  185.52  ROE[avg,max] = [0.297056362, 0.375000000]  radices = 176 16 16 16  0  0  0  0  0  0
1536  msec/iter =  195.56  ROE[avg,max] = [0.234742955, 0.312500000]  radices = 192 16 16 16  0  0  0  0  0  0
1664  msec/iter =  208.36  ROE[avg,max] = [0.254631696, 0.312500000]  radices = 208 16 16 16  0  0  0  0  0  0
1792  msec/iter =  222.32  ROE[avg,max] = [0.234012277, 0.250000000]  radices = 224 16 16 16  0  0  0  0  0  0
1920  msec/iter =  243.65  ROE[avg,max] = [0.235016741, 0.281250000]  radices = 240 16 16 16  0  0  0  0  0  0
2048  msec/iter =  255.25  ROE[avg,max] = [0.310714286, 0.312500000]  radices = 256 16 16 16  0  0  0  0  0  0
2304  msec/iter =  297.26  ROE[avg,max] = [0.228341239, 0.281250000]  radices = 288 16 16 16  0  0  0  0  0  0
2560  msec/iter =  339.70  ROE[avg,max] = [0.256682478, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
2816  msec/iter =  384.56  ROE[avg,max] = [0.296219308, 0.375000000]  radices = 176 16 16 32  0  0  0  0  0  0
3072  msec/iter =  413.85  ROE[avg,max] = [0.239704241, 0.281250000]  radices = 192 16 16 32  0  0  0  0  0  0
3584  msec/iter =  370.28  ROE[avg,max] = [0.231487165, 0.281250000]  radices = 224 16 16 32  0  0  0  0  0  0
4096  msec/iter =  455.10  ROE[avg,max] = [0.282142857, 0.312500000]  radices = 128 16 32 32  0  0  0  0  0  0
In that post I also made the comparison with an Intel Core2Duo E7400 @2.8GHz running Mprime 28.7. Looking back at it, that comparison might not have been entirely fair (Mlucas vs. Mprime), so I dusted off the machine and also ran Mlucas:

Intel Core2Duo E7400 @2.8GHz, NTHREADS = 1
Code:
14.1
1024  msec/iter =   33.76  ROE[avg,max] = [0.264564732, 0.265625000]  radices =  32 32 16 32  0  0  0  0  0  0
1152  msec/iter =   40.30  ROE[avg,max] = [0.237220982, 0.273437500]  radices =  36 16 32 32  0  0  0  0  0  0
1280  msec/iter =   45.42  ROE[avg,max] = [0.251841518, 0.296875000]  radices =  40 16 32 32  0  0  0  0  0  0
1408  msec/iter =   52.31  ROE[avg,max] = [0.285110910, 0.375000000]  radices =  44 16 32 32  0  0  0  0  0  0
1536  msec/iter =   53.31  ROE[avg,max] = [0.239299665, 0.281250000]  radices =  24 32 32 32  0  0  0  0  0  0
1664  msec/iter =   61.81  ROE[avg,max] = [0.261802455, 0.312500000]  radices =  52 16 32 32  0  0  0  0  0  0
1792  msec/iter =   65.81  ROE[avg,max] = [0.267229353, 0.312500000]  radices =  28 32 32 32  0  0  0  0  0  0
1920  msec/iter =   70.98  ROE[avg,max] = [0.243638393, 0.281250000]  radices =  60 16 32 32  0  0  0  0  0  0
2048  msec/iter =   71.88  ROE[avg,max] = [0.257366071, 0.257812500]  radices =  32 32 32 32  0  0  0  0  0  0
2304  msec/iter =   81.60  ROE[avg,max] = [0.236948940, 0.281250000]  radices =  36 32 32 32  0  0  0  0  0  0
2560  msec/iter =   90.96  ROE[avg,max] = [0.255691964, 0.312500000]  radices =  40 32 32 32  0  0  0  0  0  0
2816  msec/iter =  102.69  ROE[avg,max] = [0.283956473, 0.343750000]  radices =  44 32 32 32  0  0  0  0  0  0
3072  msec/iter =  112.85  ROE[avg,max] = [0.233879743, 0.265625000]  radices =  48 32 32 32  0  0  0  0  0  0
3328  msec/iter =  123.71  ROE[avg,max] = [0.267947824, 0.312500000]  radices =  52 32 32 32  0  0  0  0  0  0
3584  msec/iter =  135.08  ROE[avg,max] = [0.267689732, 0.301757812]  radices =  56 32 32 32  0  0  0  0  0  0
3840  msec/iter =  144.52  ROE[avg,max] = [0.242107282, 0.281250000]  radices =  60 32 32 32  0  0  0  0  0  0
4096  msec/iter =  154.69  ROE[avg,max] = [0.263169643, 0.281250000]  radices =  64 32 32 32  0  0  0  0  0  0
4608  msec/iter =  177.26  ROE[avg,max] = [0.236798968, 0.281250000]  radices =  36 16 16 16 16  0  0  0  0  0
5120  msec/iter =  201.17  ROE[avg,max] = [0.257240513, 0.312500000]  radices =  40 16 16 16 16  0  0  0  0  0
5632  msec/iter =  224.76  ROE[avg,max] = [0.291057478, 0.375000000]  radices =  44 16 16 16 16  0  0  0  0  0
6144  msec/iter =  244.47  ROE[avg,max] = [0.233741978, 0.265625000]  radices =  48 16 16 16 16  0  0  0  0  0
6656  msec/iter =  271.08  ROE[avg,max] = [0.264965820, 0.312500000]  radices =  52 16 16 16 16  0  0  0  0  0
7168  msec/iter =  292.72  ROE[avg,max] = [0.274094936, 0.312500000]  radices =  56 16 16 16 16  0  0  0  0  0
7680  msec/iter =  312.74  ROE[avg,max] = [0.249065290, 0.290039062]  radices =  60 16 16 16 16  0  0  0  0  0
NTHREADS = 2
Code:
14.1
1024  msec/iter =   21.01  ROE[avg,max] = [0.273214286, 0.281250000]  radices =  32 16 32 32  0  0  0  0  0  0
1152  msec/iter =   25.43  ROE[avg,max] = [0.237220982, 0.273437500]  radices =  36 16 32 32  0  0  0  0  0  0
1280  msec/iter =   28.85  ROE[avg,max] = [0.259319196, 0.312500000]  radices =  20 32 32 32  0  0  0  0  0  0
1408  msec/iter =   35.14  ROE[avg,max] = [0.280566406, 0.343750000]  radices = 176 16 16 16  0  0  0  0  0  0
1536  msec/iter =   33.98  ROE[avg,max] = [0.239299665, 0.281250000]  radices =  24 32 32 32  0  0  0  0  0  0
1664  msec/iter =   38.98  ROE[avg,max] = [0.261802455, 0.312500000]  radices =  52 16 32 32  0  0  0  0  0  0
1792  msec/iter =   40.84  ROE[avg,max] = [0.267229353, 0.312500000]  radices =  28 32 32 32  0  0  0  0  0  0
1920  msec/iter =   45.63  ROE[avg,max] = [0.243638393, 0.281250000]  radices =  60 16 32 32  0  0  0  0  0  0
2048  msec/iter =   45.92  ROE[avg,max] = [0.257366071, 0.257812500]  radices =  32 32 32 32  0  0  0  0  0  0
2304  msec/iter =   54.36  ROE[avg,max] = [0.236948940, 0.281250000]  radices =  36 32 32 32  0  0  0  0  0  0
2560  msec/iter =   54.64  ROE[avg,max] = [0.255691964, 0.312500000]  radices =  40 32 32 32  0  0  0  0  0  0
2816  msec/iter =   63.06  ROE[avg,max] = [0.283956473, 0.343750000]  radices =  44 32 32 32  0  0  0  0  0  0
3072  msec/iter =   67.77  ROE[avg,max] = [0.233879743, 0.265625000]  radices =  48 32 32 32  0  0  0  0  0  0
3328  msec/iter =   74.36  ROE[avg,max] = [0.267947824, 0.312500000]  radices =  52 32 32 32  0  0  0  0  0  0
3584  msec/iter =   79.71  ROE[avg,max] = [0.267689732, 0.301757812]  radices =  56 32 32 32  0  0  0  0  0  0
3840  msec/iter =   87.04  ROE[avg,max] = [0.242107282, 0.281250000]  radices =  60 32 32 32  0  0  0  0  0  0
4096  msec/iter =   92.87  ROE[avg,max] = [0.263169643, 0.281250000]  radices =  64 32 32 32  0  0  0  0  0  0
4608  msec/iter =  106.31  ROE[avg,max] = [0.238187081, 0.281250000]  radices = 288 16 16 32  0  0  0  0  0  0
5120  msec/iter =  116.95  ROE[avg,max] = [0.241458566, 0.312500000]  radices = 160 16 32 32  0  0  0  0  0  0
5632  msec/iter =  147.80  ROE[avg,max] = [0.278641183, 0.312500000]  radices = 176 16 32 32  0  0  0  0  0  0
6144  msec/iter =  150.32  ROE[avg,max] = [0.247349330, 0.281250000]  radices = 192 16 32 32  0  0  0  0  0  0
6656  msec/iter =  164.51  ROE[avg,max] = [0.250781250, 0.289062500]  radices = 208 16 32 32  0  0  0  0  0  0
7168  msec/iter =  172.77  ROE[avg,max] = [0.277169364, 0.343750000]  radices = 224 16 32 32  0  0  0  0  0  0
7680  msec/iter =  191.50  ROE[avg,max] = [0.253627232, 0.281250000]  radices = 240 16 32 32  0  0  0  0  0  0
I also reran the Mprime 28.7 benchmark:
Code:
[Tue Mar 14 19:28:48 2017]
Compare your results to other computers at http://www.mersenne.org/report_benchmarks
Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
CPU speed: 2800.02 MHz, 2 cores
CPU features: Prefetch, SSE, SSE2, SSE4
L1 cache size: 32 KB
L2 cache size: 3 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
TLBS: 256
Prime95 64-bit version 28.7, RdtscTiming=1
Best time for 1024K FFT length: 16.199 ms., avg: 16.704 ms.
Best time for 1280K FFT length: 20.961 ms., avg: 21.575 ms.
Best time for 1536K FFT length: 26.163 ms., avg: 27.718 ms.
Best time for 1792K FFT length: 30.755 ms., avg: 32.141 ms.
Best time for 2048K FFT length: 34.946 ms., avg: 38.731 ms.
Best time for 2560K FFT length: 43.191 ms., avg: 46.909 ms.
Best time for 3072K FFT length: 53.965 ms., avg: 59.120 ms.
Best time for 3584K FFT length: 69.864 ms., avg: 83.959 ms.
Best time for 4096K FFT length: 71.973 ms., avg: 72.495 ms.
Best time for 5120K FFT length: 87.800 ms., avg: 88.870 ms.
Best time for 6144K FFT length: 110.473 ms., avg: 111.362 ms.
Best time for 7168K FFT length: 131.831 ms., avg: 132.743 ms.
Best time for 8192K FFT length: 146.812 ms., avg: 147.631 ms.
Timing FFTs using 2 threads.
Best time for 1024K FFT length: 15.401 ms., avg: 15.644 ms.
Best time for 1280K FFT length: 18.143 ms., avg: 19.026 ms.
Best time for 1536K FFT length: 21.927 ms., avg: 22.995 ms.
Best time for 1792K FFT length: 26.605 ms., avg: 27.481 ms.
Best time for 2048K FFT length: 30.460 ms., avg: 31.351 ms.
Best time for 2560K FFT length: 38.699 ms., avg: 39.689 ms.
Best time for 3072K FFT length: 47.988 ms., avg: 49.353 ms.
Best time for 3584K FFT length: 85.181 ms., avg: 85.865 ms.
Best time for 4096K FFT length: 62.209 ms., avg: 66.705 ms.
Best time for 5120K FFT length: 79.554 ms., avg: 80.260 ms.
Best time for 6144K FFT length: 92.489 ms., avg: 94.000 ms.
Best time for 7168K FFT length: 116.309 ms., avg: 119.709 ms.
Best time for 8192K FFT length: 125.236 ms., avg: 128.261 ms.
Timings for 1024K FFT length (1 cpu, 1 worker): 16.37 ms. Throughput: 61.08 iter/sec.
Timings for 1024K FFT length (2 cpus, 2 workers): 30.59, 31.69 ms. Throughput: 64.25 iter/sec.
Timings for 1280K FFT length (1 cpu, 1 worker): 21.24 ms. Throughput: 47.07 iter/sec.
Timings for 1280K FFT length (2 cpus, 2 workers): 37.86, 39.14 ms. Throughput: 51.96 iter/sec.
Timings for 1536K FFT length (1 cpu, 1 worker): 26.08 ms. Throughput: 38.34 iter/sec.
Timings for 1536K FFT length (2 cpus, 2 workers): 45.43, 47.68 ms. Throughput: 42.99 iter/sec.
Timings for 1792K FFT length (1 cpu, 1 worker): 31.05 ms. Throughput: 32.21 iter/sec.
Timings for 1792K FFT length (2 cpus, 2 workers): 52.50, 53.32 ms. Throughput: 37.81 iter/sec.
Timings for 2048K FFT length (1 cpu, 1 worker): 35.05 ms. Throughput: 28.53 iter/sec.
Timings for 2048K FFT length (2 cpus, 2 workers): 61.40, 63.17 ms. Throughput: 32.12 iter/sec.
Timings for 2560K FFT length (1 cpu, 1 worker): 43.36 ms. Throughput: 23.06 iter/sec.
Timings for 2560K FFT length (2 cpus, 2 workers): 77.50, 79.16 ms. Throughput: 25.54 iter/sec.
Timings for 3072K FFT length (1 cpu, 1 worker): 53.71 ms. Throughput: 18.62 iter/sec.
Timings for 3072K FFT length (2 cpus, 2 workers): 96.11, 97.25 ms. Throughput: 20.69 iter/sec.
Timings for 3584K FFT length (1 cpu, 1 worker): 67.86 ms. Throughput: 14.74 iter/sec.
Timings for 3584K FFT length (2 cpus, 2 workers): 164.50, 169.02 ms. Throughput: 12.00 iter/sec.
Timings for 4096K FFT length (1 cpu, 1 worker): 71.87 ms. Throughput: 13.91 iter/sec.
[Tue Mar 14 19:33:59 2017]
Timings for 4096K FFT length (2 cpus, 2 workers): 127.57, 128.14 ms. Throughput: 15.64 iter/sec.
Timings for 5120K FFT length (1 cpu, 1 worker): 87.87 ms. Throughput: 11.38 iter/sec.
Timings for 5120K FFT length (2 cpus, 2 workers): 153.62, 158.10 ms. Throughput: 12.83 iter/sec.
Timings for 6144K FFT length (1 cpu, 1 worker): 110.52 ms. Throughput: 9.05 iter/sec.
Timings for 6144K FFT length (2 cpus, 2 workers): 187.40, 186.73 ms. Throughput: 10.69 iter/sec.
Timings for 7168K FFT length (1 cpu, 1 worker): 132.18 ms. Throughput: 7.57 iter/sec.
Timings for 7168K FFT length (2 cpus, 2 workers): 236.89, 243.20 ms. Throughput: 8.33 iter/sec.
Timings for 8192K FFT length (1 cpu, 1 worker): 151.83 ms. Throughput: 6.59 iter/sec.
Timings for 8192K FFT length (2 cpus, 2 workers): 263.17, 260.16 ms. Throughput: 7.64 iter/sec.
BTW: Is it possible to compile/run Mlucas on Windows 7/10? If so, I could try to run benchmarks on my i5 2500K and/or i7 3770K.

Last fiddled with by VictordeHolland on 2017-03-14 at 18:50 Reason: Mlucas on Windows????
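For anyone comparing the two tools' numbers: Mprime's "Throughput" figures are just the reciprocal of the per-iteration time, so Mlucas msec/iter values can be converted directly. A tiny sketch using the 1024K single-worker line from the listing above:

```python
# Throughput (iter/sec) = 1000 / (ms per iteration).
# The 1024K single-worker line above reads 16.37 ms, listed as
# "Throughput: 61.08 iter/sec" - 1000/16.37 agrees to ~0.01.
ms_per_iter = 16.37
throughput = 1000.0 / ms_per_iter
print(f"{throughput:.2f} iter/sec")
```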

