mersenneforum.org  

Old 2019-03-29, 10:07   #23
Lorenzo
Aug 2010
Republic of Belarus

170₁₀ Posts

Hello! Benchmark for v18 on Ampere eMAG 32-Core @ 3.3GHz using pre-built Mlucas_v18_c2simd.

Code:
root@lorenzoArm:~/mersenne/arm8# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    32
Socket(s):             1
NUMA node(s):          1
CPU max MHz:           3300.0000
CPU min MHz:           363.9700
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
NUMA node0 CPU(s):     0-31
Code:
root@lorenzoArm:~/mersenne/arm8# cat  /proc/cpuinfo
processor       : 0
BogoMIPS        : 90.00
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x50
CPU architecture: 8
CPU variant     : 0x3
CPU part        : 0x000
CPU revision    : 2
root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -s m -cpu 0:31:
Code:
root@lorenzoArm:~/mersenne/arm8# cat mlucas.cfg 
18.0
      2048  msec/iter =   19.63  ROE[avg,max] = [0.000307249, 0.375000000]  radices = 128 32 16 16  0  0  0  0  0  0
      2304  msec/iter =   19.88  ROE[avg,max] = [0.000272423, 0.375000000]  radices = 144 32 16 16  0  0  0  0  0  0
      2560  msec/iter =   22.07  ROE[avg,max] = [0.000281943, 0.375000000]  radices = 160  8  8  8 16  0  0  0  0  0
      2816  msec/iter =   22.07  ROE[avg,max] = [0.000260572, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =   22.24  ROE[avg,max] = [0.000265834, 0.375000000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =   23.63  ROE[avg,max] = [0.000281118, 0.375000000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =   25.02  ROE[avg,max] = [0.000250660, 0.343750000]  radices = 224 32 16 16  0  0  0  0  0  0
      3840  msec/iter =   26.60  ROE[avg,max] = [0.000222911, 0.312500000]  radices =  60 32 32 32  0  0  0  0  0  0
      4096  msec/iter =   25.42  ROE[avg,max] = [0.000244299, 0.312500000]  radices =  64 32 32 32  0  0  0  0  0  0
      4608  msec/iter =   30.73  ROE[avg,max] = [0.000298148, 0.375000000]  radices = 144  8  8 16 16  0  0  0  0  0
      5120  msec/iter =   31.50  ROE[avg,max] = [0.000235369, 0.312500000]  radices = 160 32 32 16  0  0  0  0  0  0
      5632  msec/iter =   33.74  ROE[avg,max] = [0.000257523, 0.343750000]  radices = 176 32 32 16  0  0  0  0  0  0
      6144  msec/iter =   36.94  ROE[avg,max] = [0.000247058, 0.312500000]  radices = 192 32 32 16  0  0  0  0  0  0
      6656  msec/iter =   36.74  ROE[avg,max] = [0.000313628, 0.406250000]  radices = 208  8  8 16 16  0  0  0  0  0
      7168  msec/iter =   36.94  ROE[avg,max] = [0.000233152, 0.312500000]  radices = 224  8  8 16 16  0  0  0  0  0
      7680  msec/iter =   36.94  ROE[avg,max] = [0.000246354, 0.312500000]  radices = 240  8  8 16 16  0  0  0  0  0
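For anyone who wants to post-process a cfg file like the one above, here is a minimal Python sketch that pulls the FFT length, timing, roundoff errors and radix set out of each line. The names `parse_cfg` and `LINE_RE` and the dictionary keys are my own, not part of Mlucas. The check that twice the product of the nonzero radices equals the FFT length in words is an observation that holds for the lines shown here, not something taken from the Mlucas documentation.

```python
import re
from math import prod

# Two timing lines copied from the mlucas.cfg listing above.
SAMPLE = """\
      2048  msec/iter =   19.63  ROE[avg,max] = [0.000307249, 0.375000000]  radices = 128 32 16 16  0  0  0  0  0  0
      7680  msec/iter =   36.94  ROE[avg,max] = [0.000246354, 0.312500000]  radices = 240  8  8 16 16  0  0  0  0  0
"""

# Pattern for one timing line; anything else (e.g. the "18.0" version line) will not match.
LINE_RE = re.compile(
    r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[avg,max\]\s*=\s*\[([\d.]+),\s*([\d.]+)\]\s+"
    r"radices\s*=\s*([\d\s]+)$"
)


def parse_cfg(text):
    """Return one dict per timing line, skipping lines of any other shape."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        # Trailing zeros are padding, so keep only the nonzero radices.
        radices = [int(r) for r in m.group(5).split() if int(r) != 0]
        rows.append({
            "fft_k": int(m.group(1)),            # FFT length in K (1024) words
            "msec_per_iter": float(m.group(2)),
            "roe_avg": float(m.group(3)),
            "roe_max": float(m.group(4)),
            "radices": radices,
        })
    return rows


for row in parse_cfg(SAMPLE):
    # Observed relation: 2 * product of radices == FFT length in words.
    consistent = 2 * prod(row["radices"]) == row["fft_k"] * 1024
    print(row["fft_k"], row["msec_per_iter"], row["radices"], consistent)
```

The same function can be pointed at a whole file with `parse_cfg(open("mlucas.cfg").read())`.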
Old 2019-03-29, 10:10   #24
Lorenzo
Aug 2010
Republic of Belarus

170₁₀ Posts

Just FYI
Code:
root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -cpu 0:7

    Mlucas 18.0

    http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0 20160609.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation. 
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 8 cores: 0.1.2.3.4.5.6.7.

           Mlucas selftest running.....

/****************************************************************************/

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 49407158
 this gives an average   17.887500180138481 bits per digit
Using complex FFT radices       288        32        32        32
mers_mod_square: Init threadpool of 8 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 321038982
Res64: 69FF742497F16902. AvgMaxErr = 0.003191964. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36     =          19729049858
Res mod 2^35 - 1 =          20161851329
Res mod 2^36 - 1 =           1044285462
Clocks = 00:00:21.067

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 321038982
 this gives an average   17.887500180138481 bits per digit
Using complex FFT radices       144        16        16        16        16
mers_mod_square: Init threadpool of 8 threads
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 171176556
Res64: 2258A7342961B652. AvgMaxErr = 0.002428013. MaxErr = 0.281250000. Program: E18.0
Res mod 2^36     =          17874138706
Res mod 2^35 - 1 =          28069471175
Res mod 2^36 - 1 =          53816329185
Clocks = 00:00:21.009
NTHREADS = 8
ERROR: at line 1540 of file ../src/Mlucas.c
Assertion failed: Return value of shift_word(): unpadded-array-index out of range!
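As a side note on the "average bits per digit" figure in the log above: it appears to be simply the exponent divided by the number of 8-byte words in the FFT array. A small Python sketch of that arithmetic follows; the function name `avg_bits_per_digit` is mine, and the formula is inferred from the numbers printed in the log rather than taken from the Mlucas source.

```python
def avg_bits_per_digit(exponent, fft_k):
    """Exponent bits spread evenly over fft_k * 1024 floating-point words."""
    n_words = fft_k * 1024
    return exponent / n_words


# Values from the self-test log: M337615261 at FFT length 18432K.
assert 18432 * 1024 == 18874368          # matches "18874368 8-byte floats"
print(avg_bits_per_digit(337615261, 18432))
```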
Old 2019-03-29, 19:27   #25
ewmayer
2ω=0
Sep 2002
República de California

11518₁₀ Posts

Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.

Edit: I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.

Last fiddled with by ewmayer on 2019-03-29 at 19:29
Old 2019-03-30, 07:51   #26
Lorenzo
Aug 2010
Republic of Belarus

2×5×17 Posts

Quote:
Originally Posted by ewmayer View Post
Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.

Edit: I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.
Hello, ewmayer! Sorry, but unfortunately I no longer have access to this machine.
Old 2019-03-30, 19:18   #27
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×631 Posts

Quote:
Originally Posted by Lorenzo View Post
Hello! Benchmark for v18 on Ampere eMAG 32-Core @ 3.3GHz using pre-built Mlucas_v18_c2simd.
Code:
       4608  msec/iter =   30.73  ROE[avg,max] = [0.000298148, 0.375000000]  radices =
Yikes, 717 hours. At the nominal $1/hour rate at https://www.packet.com/cloud/servers/ that works out to over $700 per 84M primality test. It's triple the speed of Ernst's Samsung S7 phone, at far higher cost (~83x) there. I've bought whole used workstations capable of 10+ times the 30.73 ms/iter speed for the price of one exponent at packet.com at that rate. (The spot rate of $0.25/hr helps, but not nearly enough.)
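The 717-hour figure can be reproduced with a few lines of Python. This is a rough sketch that assumes about one iteration per bit of the exponent (a Lucas-Lehmer test of M(p) needs p-2 squarings) and a constant 30.73 msec/iter for the whole run; the function names are mine.

```python
def ll_test_hours(exponent, msec_per_iter):
    """Approximate wall-clock hours, assuming ~one iteration per exponent bit."""
    return exponent * msec_per_iter / 1000.0 / 3600.0


def rental_cost(hours, dollars_per_hour):
    """Rental cost for a run of the given length at a flat hourly rate."""
    return hours * dollars_per_hour


hours = ll_test_hours(84_000_000, 30.73)   # 84M exponent, 4608K timing from the cfg file
print(round(hours, 1), "hours")
print(round(rental_cost(hours, 1.00), 2), "USD at $1/hour")
print(round(rental_cost(hours, 0.25), 2), "USD at the $0.25/hour spot rate")
```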

Last fiddled with by kriesel on 2019-03-30 at 19:22
Old 2019-03-30, 19:29   #28
ewmayer
2ω=0
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by Lorenzo View Post
Hello, ewmayer! Sorry, but unfortunately I no longer have access to this machine.
OK - for future reference, on a 'typical' multisocket system with 1 or more sockets and each socket holding a 4-core CPU, I like to see the following timing tests:

1. All 4 cores on 1 socket: '-s m -iters 100 -cpu 0:3'

2. If there are differences between the CPUs on various sockets (use /proc/cpuinfo as your guide here), run the same self-tests on each distinct-CPU-type socket. If it's e.g. a BIG socket with a high-perf CPU having just 2 cores, fiddle the -cpu args to use just those 2 cores;

3. All cores across all sockets: Like the 32-core test you did above;

4. One program instance per socket: This can get tricky in self-test mode if the runspeed varies appreciably between sockets. It is better to create a rundir for each socket, e.g. run0-run7 on an 8-socket 32-core system, copy the mlucas.cfg file from your -cpu 0:3 self-test to each rundir, create a worktodo.ini file containing one exponent in the size range of interest (you can use a single-shot invocation of the primenet.py script to grab such an assignment), and copy that to each rundir as well. Then cd to run0, fire up a production run using -cpu 0:3, and let it get to the first 10000-iter checkpoint (you will see a pair of p|q-named binary savefiles get created, and the p*.stat file updated with a checkpoint entry); that gives you a production-run timing for 1 socket. At that point cd to each of the other rundirs in turn and start up an instance in each. Let those runs get through a couple of checkpoints, average the last-line-of-statfile timings, and compare that average to the 1-socket-used timing.
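The rundir setup in step 4 could be scripted roughly as follows. This is only a sketch under stated assumptions: `make_rundirs` is a hypothetical helper, the core numbering (socket k owning cores 4k through 4k+3) is an assumption that should be checked against /proc/cpuinfo or lscpu on the actual machine, and the binary name in the printed commands is a placeholder. The script only prepares the directories and prints the launch commands, so each instance can still be started by hand at the point described above.

```python
import shutil
import tempfile
from pathlib import Path


def make_rundirs(base, cfg_file, worktodo_file, n_sockets, cores_per_socket=4):
    """Create run0..run{n-1} under base, copy both files into each,
    and return a list of (rundir, cpu_arg) pairs."""
    plan = []
    for s in range(n_sockets):
        rundir = Path(base) / f"run{s}"
        rundir.mkdir(parents=True, exist_ok=True)
        shutil.copy(cfg_file, rundir / "mlucas.cfg")
        shutil.copy(worktodo_file, rundir / "worktodo.ini")
        # Assumed numbering: socket s owns a contiguous block of cores.
        lo = s * cores_per_socket
        hi = lo + cores_per_socket - 1
        plan.append((rundir, f"{lo}:{hi}"))
    return plan


if __name__ == "__main__":
    # Demo in a throwaway directory with placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "mlucas.cfg").write_text("18.0\n")
        (tmp / "worktodo.ini").write_text("")  # placeholder, real assignment goes here
        for rundir, cpu in make_rundirs(tmp, tmp / "mlucas.cfg", tmp / "worktodo.ini", 8):
            # "./Mlucas" is a placeholder for whichever binary you built or downloaded.
            print(f"cd {rundir} && ./Mlucas -cpu {cpu}")
```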

I have found and fixed the bug your 18432K self-test exposed, and will post an update on that once I finish creating new ARM binaries from the updated source tarball and uploading them to the server.
Old 2019-03-30, 19:39   #29
ewmayer
2ω=0
Sep 2002
República de California

10110011111110₂ Posts

Quote:
Originally Posted by kriesel View Post
Yikes, 717 hours. At the nominal $1/hour rate at https://www.packet.com/cloud/servers/ that works out to over $700 per 84M primality test. It's triple the speed of Ernst's Samsung S7 phone, at far higher cost (~83x) there. I've bought whole used workstations capable of 10+ times the 30.73 ms/iter speed for the price of one exponent at packet.com at that rate. (The spot rate of $0.25/hr helps, but not nearly enough.)
I suspect the total throughput for that system would be several times greater using one instance per 4-core socket, that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig.

But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.
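The "less than 2x" remark can be checked directly against the 2048K and 7680K lines of the posted cfg file. The Python sketch below compares the measured timing ratio with the FFT length ratio and with a textbook N log N operation-count ratio; the latter is my own yardstick for what a well-scaling run might look like, not a figure from Mlucas, and the function name is hypothetical.

```python
from math import log2


def scaling_summary(fft_k_small, ms_small, fft_k_large, ms_large):
    """Return (length_ratio, time_ratio, nlogn_ratio) for two cfg-file timings."""
    n_small = fft_k_small * 1024
    n_large = fft_k_large * 1024
    length_ratio = n_large / n_small
    time_ratio = ms_large / ms_small
    # Textbook FFT work estimate, used here only as a rough reference point.
    nlogn_ratio = (n_large * log2(n_large)) / (n_small * log2(n_small))
    return length_ratio, time_ratio, nlogn_ratio


# 32-thread timings from the cfg file: 2048K at 19.63 and 7680K at 36.94 msec/iter.
length_ratio, time_ratio, nlogn_ratio = scaling_summary(2048, 19.63, 7680, 36.94)
print(f"FFT length ratio : {length_ratio:.2f}")
print(f"timing ratio     : {time_ratio:.2f}")
print(f"N log N ratio    : {nlogn_ratio:.2f}")
```

A timing ratio well below the length ratio is the pattern described above: at 32 threads the smaller FFT is not keeping all the cores busy.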
Old 2019-03-31, 00:21   #30
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·631 Posts

Quote:
Originally Posted by ewmayer View Post
I suspect the total throughput for that system would be several times greater using one instance per 4-core socket, that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig.

But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.
Thought experiment: suppose one instance per 4-core socket was the same speed as his 32-core test, so 8 instances, 8-fold more throughput. It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.
Another way to go at it would be to make 1-core, 4-core, and 8-core benchmark runs and compare to the 32.

Last fiddled with by kriesel on 2019-03-31 at 00:24
Old 2019-03-31, 17:39   #31
ldesnogu
Jan 2008
France

2⁴·3·11 Posts

Quote:
Originally Posted by kriesel View Post
It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.
How much power does your system consume? How much will that cost you?

For the record, the CPU from Ampere is not that great from a performance point of view; in particular, its FP performance is lower than that of Amazon's Cortex-A72 chip despite running at 3.3 GHz vs 2.3 GHz: http://browser.geekbench.com/v4/cpu/...eline=11678329

It's not even that much faster than an S7: http://browser.geekbench.com/v4/cpu/...eline=12621230

BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.
Old 2019-03-31, 19:21   #32
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·631 Posts

Quote:
Originally Posted by ldesnogu View Post
How much power does your system consume? How much will that cost you?
~US$3 per 85M exponent total cost, including equipment amortization, utilities and taxes. Details at https://www.mersenneforum.org/showpo...8&postcount=20
I have no reason to expect that figure to be optimal among cpu choices. It's just one of the better among my little fleet. (Then there's curtisc's and others' $0/exponent, when the participant is using someone else's hardware and electricity.)
Old 2019-03-31, 19:34   #33
ewmayer
2ω=0
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by ldesnogu View Post
BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.
Ah, I didn't look into the details of that kind of system; I assumed it was a single-mobo cluster of 2- or 4-core Cortex CPUs.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.