mersenneforum.org Mlucas v18 available

2019-03-29, 10:07   #23
Lorenzo

Aug 2010
Republic of Belarus

170₁₀ Posts

Hello! Benchmark for v18 on Ampere eMAG 32-Core @ 3.3GHz using pre-built Mlucas_v18_c2simd.

Code:
root@lorenzoArm:~/mersenne/arm8# lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
CPU max MHz: 3300.0000
CPU min MHz: 363.9700
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
NUMA node0 CPU(s): 0-31

Code:
root@lorenzoArm:~/mersenne/arm8# cat /proc/cpuinfo
processor : 0
BogoMIPS : 90.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x50
CPU architecture: 8
CPU variant : 0x3
CPU part : 0x000
CPU revision : 2

root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -s m -cpu 0:31:

Code:
root@lorenzoArm:~/mersenne/arm8# cat mlucas.cfg
18.0
2048 msec/iter = 19.63 ROE[avg,max] = [0.000307249, 0.375000000] radices = 128 32 16 16 0 0 0 0 0 0
2304 msec/iter = 19.88 ROE[avg,max] = [0.000272423, 0.375000000] radices = 144 32 16 16 0 0 0 0 0 0
2560 msec/iter = 22.07 ROE[avg,max] = [0.000281943, 0.375000000] radices = 160 8 8 8 16 0 0 0 0 0
2816 msec/iter = 22.07 ROE[avg,max] = [0.000260572, 0.312500000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 22.24 ROE[avg,max] = [0.000265834, 0.375000000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 23.63 ROE[avg,max] = [0.000281118, 0.375000000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 25.02 ROE[avg,max] = [0.000250660, 0.343750000] radices = 224 32 16 16 0 0 0 0 0 0
3840 msec/iter = 26.60 ROE[avg,max] = [0.000222911, 0.312500000] radices = 60 32 32 32 0 0 0 0 0 0
4096 msec/iter = 25.42 ROE[avg,max] = [0.000244299, 0.312500000] radices = 64 32 32 32 0 0 0 0 0 0
4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices = 144 8 8 16 16 0 0 0 0 0
5120 msec/iter = 31.50 ROE[avg,max] = [0.000235369, 0.312500000] radices = 160 32 32 16 0 0 0 0 0 0
5632 msec/iter = 33.74 ROE[avg,max] = [0.000257523, 0.343750000] radices = 176 32 32 16 0 0 0 0 0 0
6144 msec/iter = 36.94 ROE[avg,max] = [0.000247058, 0.312500000] radices = 192 32 32 16 0 0 0 0 0 0
6656 msec/iter = 36.74 ROE[avg,max] = [0.000313628, 0.406250000] radices = 208 8 8 16 16 0 0 0 0 0
7168 msec/iter = 36.94 ROE[avg,max] = [0.000233152, 0.312500000] radices = 224 8 8 16 16 0 0 0 0 0
7680 msec/iter = 36.94 ROE[avg,max] = [0.000246354, 0.312500000] radices = 240 8 8 16 16 0 0 0 0 0
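For readers who want to compare timings across FFT lengths, here is a minimal Python sketch that reads the msec/iter column out of cfg-style lines like those above. The helper name `parse_timings` and the regular expression are illustrative assumptions, not part of Mlucas; only the two sample lines are taken from the listing.

```python
import re

# Two entries copied from the mlucas.cfg listing above (32 threads, -cpu 0:31).
CFG_TEXT = """\
2048 msec/iter = 19.63 ROE[avg,max] = [0.000307249, 0.375000000] radices = 128 32 16 16 0 0 0 0 0 0
7680 msec/iter = 36.94 ROE[avg,max] = [0.000246354, 0.312500000] radices = 240 8 8 16 16 0 0 0 0 0
"""

# Assumed line shape: "<FFT length in K> msec/iter = <time> ...".
LINE_RE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([0-9.]+)")


def parse_timings(text):
    """Hypothetical helper: map FFT length (in K) to msec/iter."""
    timings = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            timings[int(match.group(1))] = float(match.group(2))
    return timings


timings = parse_timings(CFG_TEXT)
length_ratio = 7680 / 2048
time_ratio = timings[7680] / timings[2048]
print(f"FFT length grew {length_ratio:.2f}x, msec/iter grew {time_ratio:.2f}x")
```

With these two entries the FFT length grows 3.75x while the per-iteration time grows by a bit under 2x.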
2019-03-29, 10:10   #24
Lorenzo

Aug 2010
Republic of Belarus

170₁₀ Posts

Just FYI

Code:
root@lorenzoArm:~/mersenne/arm8# ./Mlucas_v18_c2simd -fftlen 18432 -iters 100 -cpu 0:7

Mlucas 18.0

http://www.mersenneforum.org/mayer/README.html

INFO: testing qfloat routines...
CPU Family = ARM Embedded ABI, OS = Linux, 64-bit Version, compiled with Gnu C [or other compatible], Version 5.4.0 20160609.
INFO: Build uses ARMv8 advanced-SIMD instruction set.
INFO: Using inline-macro form of MUL_LOHI64.
INFO: MLUCAS_PATH is set to ""
INFO: using 53-bit-significand form of floating-double rounding constant for scalar-mode DNINT emulation.
Setting DAT_BITS = 10, PAD_BITS = 2
INFO: testing IMUL routines...
INFO: System has 32 available processor cores.
INFO: testing FFT radix tables...
Set affinity for the following 8 cores: 0.1.2.3.4.5.6.7.

Mlucas selftest running.....

/****************************************************************************/

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 49407158
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 288 32 32 32
mers_mod_square: Init threadpool of 8 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 321038982
Res64: 69FF742497F16902. AvgMaxErr = 0.003191964. MaxErr = 0.375000000. Program: E18.0
Res mod 2^36 = 19729049858
Res mod 2^35 - 1 = 20161851329
Res mod 2^36 - 1 = 1044285462
Clocks = 00:00:21.067

NTHREADS = 8
M337615261: using FFT length 18432K = 18874368 8-byte floats, initial residue shift count = 321038982
this gives an average 17.887500180138481 bits per digit
Using complex FFT radices 144 16 16 16 16
mers_mod_square: Init threadpool of 8 threads
Using 8 threads in carry step
100 iterations of M337615261 with FFT length 18874368 = 18432 K, final residue shift count = 171176556
Res64: 2258A7342961B652. AvgMaxErr = 0.002428013. MaxErr = 0.281250000. Program: E18.0
Res mod 2^36 = 17874138706
Res mod 2^35 - 1 = 28069471175
Res mod 2^36 - 1 = 53816329185
Clocks = 00:00:21.009

NTHREADS = 8
ERROR: at line 1540 of file ../src/Mlucas.c
Assertion failed: Return value of shift_word(): unpadded-array-index out of range!
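The "bits per digit" figure in the self-test output can be checked by hand: it is the exponent divided by the number of 8-byte floats in the FFT. The short Python sketch below redoes that arithmetic with the values printed above; the variable names are illustrative only.

```python
# Values copied from the self-test output above.
exponent = 337615261   # p in M(p) = 2^p - 1
fft_len_k = 18432      # FFT length in K, where 1K = 1024 floats

n_floats = fft_len_k * 1024           # number of 8-byte floats
bits_per_digit = exponent / n_floats  # average bits stored per float

print(n_floats)
print(f"{bits_per_digit:.9f}")
```

This reproduces the 18874368 floats and the average of roughly 17.8875 bits per digit reported by the program.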
2019-03-29, 19:27   #25
ewmayer
∂²ω=0

Sep 2002
República de California

11518₁₀ Posts

Thanks, Lorenzo:

Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system.

I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K.

Edit: I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.

Last fiddled with by ewmayer on 2019-03-29 at 19:29

2019-03-30, 07:51   #26
Lorenzo

Aug 2010
Republic of Belarus

2×5×17 Posts

Quote:
 Originally Posted by ewmayer Thanks, Lorenzo: Could you also try the '-s m' tests using just -cpu 0:3 on that 32-core system and post the resulting cfg-file here? I'd like to see what kind of degradation of parallelism results from using more than one socket on that system. I will look into the residue-shift assertion issue you hit in your 18432K FFT length test - first I need to see if I can reproduce it on any of the hardware I have. By way of a workaround, '-shift 0' should bypass all the new residue-shift code and allow you to get a best-radix-set timing at 18432K. Edit: I was able to reproduce the assertion on my 2-core Macbook using an x86 SSE2 build, so the bug appears to be code-logic-related rather than anything platform specific.
Hello, ewmayer! Sorry but unfortunately I haven't access to this machine any more.

2019-03-30, 19:18   #27
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7×631 Posts

Quote:
 Originally Posted by Lorenzo Hello! Benchmark for v18 on Ampere eMAG 32-Core @ 3.3GHz using pre-built Mlucas_v18_c2simd. Code:  4608 msec/iter = 30.73 ROE[avg,max] = [0.000298148, 0.375000000] radices =
Yikes, 717 hours, so at nominal $1/hour, that works out to over $700/84M primality test at https://www.packet.com/cloud/servers/ It's triple the speed of Ernst's Samsung S7 phone, at far higher cost (~83x) there. I've bought whole used workstations capable of 10+ times the 30.73ms/it speed, for the price of one exponent at packet.com at that rate. (Spot rate $0.25/hr helps but not nearly enough.)

Last fiddled with by kriesel on 2019-03-30 at 19:22

2019-03-30, 19:29   #28
ewmayer
∂²ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by Lorenzo Hello, ewmayer! Sorry but unfortunately I haven't access to this machine any more.
OK - for future reference, on a 'typical' multisocket system with 1 or more sockets and each socket holding a 4-core CPU, I like to see the following timing tests:

1. All 4 cores on 1 socket: '-s m -iters 100 -cpu 0:3'

2. If there are differences between the CPUs on various sockets (use /proc/cpuinfo as your guide here), run the same self-tests on each distinct-CPU-type socket. If it's e.g. a BIG socket with a high-perf CPU having just 2 cores, fiddle the -cpu args to use just those 2 cores;

3. All cores across all sockets: Like the 32-core test you did above;

4. One program instance per socket: This can get tricky in self-test mode if the runspeed varies appreciably between sockets. Better is to create a rundir for each socket, e.g. run0-run7 on an 8-socket 32-core system, copy the mlucas.cfg files from your -cpu 0:3 self-test to each rundir, create a worktodo.ini file containing one exponent of the size range of interest (you can use a single-shot invocation of the primenet.py script to grab such an assignment), then copy that to each rundir. Then cd to run0 and fire up a production run using -cpu 0:3, let that get to the first 10000-iter checkpoint (you will see a pair of p|q-named binary savefiles get created, and the p*.stat file updated with a checkpoint entry), that gives you a production-run timing for 1 socket. At that point cd to each of the other rundirs in turn and start up an instance in each. Let those runs get through a couple checkpoints and average the last-line-of-statfile timings, compare that average to the 1-socket-used timing.

I have found and fixed the bug your 18432K self-test exposed, will post update on that once I finish creating new ARM binaries from the updated source tarball and uploading to the server.

2019-03-30, 19:39   #29
ewmayer
∂²ω=0

Sep 2002
República de California

10110011111110₂ Posts

Quote:
 Originally Posted by kriesel Yikes, 717 hours, so at nominal $1/hour, that works out to over $700/84M primality test at https://www.packet.com/cloud/servers/ It's triple the speed of Ernst's Samsung S7 phone, at far higher cost (~83x) there. I've bought whole used workstations capable of 10+ times the 30.73ms/it speed, for the price of one exponent at packet.com at that rate. (Spot rate $0.25/hr helps but not nearly enough.)
I suspect the total throughput for that system would be several times greater using one instance per 4-core socket, that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern.

But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig. But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.

2019-03-31, 00:21   #30
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·631 Posts

Quote:
 Originally Posted by ewmayer I suspect the total throughput for that system would be several times greater using one instance per 4-core socket, that's why I asked Lorenzo if he could provide that timing. You get a hint of throughput loss due to too-many-threads-for-one-job from the cfg-file timings he posted: The 32-thread timing @7680K is less than 2x greater than that for 2048K. Larger FFT lengths tend to be more parallelizable than smaller ones because at the same threadcount the work units done by each thread are proportionally larger, resulting in that sort of timing pattern. But I'm sure even in optimum-usage mode such a system would be a lot more expensive than a cellphone compute node - reminiscent of the difference between the big-iron AWS-instance runs we use to verify new prime discoveries, as compared to a $/FLOP-optimized low-end retail Intel rig. But manycore tests are always interesting because we hope to see signs of one manufacturer or another achieving a breakthrough in parallelism. Though in that regard, nearly all the action the past 5 years has been on the GPU side of the ledger.
Thought experiment: suppose one instance per 4-core socket was the same speed as his 32-core test, so 8 instances, 8-fold more throughput. It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.
Another way to go at it would be to make 1-core, 4-core, and 8-core benchmark runs and compare to the 32.

Last fiddled with by kriesel on 2019-03-31 at 00:24
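
The one-instance-per-socket setup ewmayer describes in post #28 (a rundir per socket, each holding copies of mlucas.cfg and worktodo.ini) might be scripted roughly as follows. This is a sketch under stated assumptions: the helper `make_rundirs` is hypothetical, the placeholder files stand in for real self-test and assignment files, and the contiguous core numbering (0:3, 4:7, ...) is an assumption that should be checked against /proc/cpuinfo on the actual machine.

```python
import shutil
import tempfile
from pathlib import Path


def make_rundirs(base, cfg_file, worktodo_file, n_sockets=8, cores_per_socket=4):
    """Hypothetical helper: create run0..run{n-1} and pair each with a -cpu range."""
    plan = []
    for socket in range(n_sockets):
        rundir = Path(base) / f"run{socket}"
        rundir.mkdir(parents=True, exist_ok=True)
        shutil.copy(cfg_file, rundir / "mlucas.cfg")
        shutil.copy(worktodo_file, rundir / "worktodo.ini")
        # Assumption: cores on a socket are numbered contiguously.
        lo = socket * cores_per_socket
        hi = lo + cores_per_socket - 1
        plan.append((rundir, f"-cpu {lo}:{hi}"))
    return plan


# Demo in a throwaway directory using placeholder files.
demo_base = Path(tempfile.mkdtemp())
cfg = demo_base / "mlucas.cfg"
cfg.write_text("18.0\n")   # placeholder, not a real self-test result
todo = demo_base / "worktodo.ini"
todo.write_text("")        # placeholder; a real file holds one assignment

plan = make_rundirs(demo_base, cfg, todo)
for rundir, cpu_arg in plan:
    print(rundir.name, cpu_arg)
```

Each printed pair names a rundir and the -cpu argument one would pass when starting the instance in that directory; launching the runs and averaging the stat-file timings is left out of the sketch.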

2019-03-31, 17:39   #31
ldesnogu

Jan 2008
France

2⁴·3·11 Posts

Quote:
 Originally Posted by kriesel It still loses to the dual-e5-2670 that I bought for a month's rent of the 32-arm-core system.
How much power does your system consume? How much will that cost you?

For the record, the CPU from Ampere is not that great from a performance point of view, in particular its FP performance is less than Amazon Cortex-A72 chip despite running at 3.3 GHz vs 2.3 GHz: http://browser.geekbench.com/v4/cpu/...eline=11678329

It's not even that much faster than an S7: http://browser.geekbench.com/v4/cpu/...eline=12621230

BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.

2019-03-31, 19:21   #32
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

7·631 Posts

Quote:
 Originally Posted by ldesnogu How much power does your system consume? How much will that cost you?
~US$3 / 85M exponent total cost, equipment amortization and utilities and taxes. Details at https://www.mersenneforum.org/showpo...8&postcount=20

I have no reason to expect that figure to be optimal among cpu choices. It's just one of the better among my little fleet. (Then there's curtisc's and others' $0/exponent, when the participant is using someone else's hardware and electricity.)
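
For reference, the 717-hour figure from post #27 can be reproduced from the 30.73 msec/iter timing at 4608K. The sketch below assumes roughly 84 million iterations for an exponent near 84M (about one squaring per bit of the exponent); the hourly rates are the ones quoted in that post, and the function name is illustrative.

```python
MSEC_PER_ITER = 30.73     # 4608K timing from the posted mlucas.cfg
ITERATIONS = 84_000_000   # assumption: about one iteration per bit of an 84M exponent

hours = ITERATIONS * MSEC_PER_ITER / 1000 / 3600


def rental_cost(run_hours, dollars_per_hour):
    """Cost of renting the machine for the whole test at a flat hourly rate."""
    return run_hours * dollars_per_hour


print(f"{hours:.0f} hours")
print(f"${rental_cost(hours, 1.00):.0f} at $1.00/hr")
print(f"${rental_cost(hours, 0.25):.0f} at $0.25/hr")
```

At the nominal rate this gives a little over $700 per test, and about a quarter of that at the spot rate.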

2019-03-31, 19:34   #33
ewmayer
∂²ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by ldesnogu BTW Ernst, I'm afraid I don't get why you're talking about multiple sockets. That system has a single socket.
Ah, I didn't look into the details of that kind of system, assumed it was a single-mobo cluster of 2 or 4-core cortex CPUs.
