mersenneforum.org  

2019-01-15, 14:44   #177
M344587487
"Composite as Heck"
Oct 2017

23·79 Posts
NanoPi NEO4 RK3399

Paper specs:
Quote:
Model: Rockchip RK3399; Number of Cores: big.LITTLE, 64-bit Dual Core Cortex-A72 + Quad Core Cortex-A53; Frequency: Cortex-A72(up to 2.0GHz), Cortex-A53(up to 1.5GHz)
1GB DDR3-1866
Using rk3399-sd-friendlycore-bionic-4.4-arm64-20181219.img (headless Ubuntu 18.04). Mlucas compiled on it, but the resulting binary segfaulted at runtime, so I used the C2 build you gave the N1 tester.

Core frequencies under load (values in kHz; cpu0-3 are the A53 cores, cpu4-5 the A72 cores):
Code:
pi@NanoPi-NEO4:/sys/devices/system/cpu$ sudo cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
1416000
1416000
1416000
1416000
1800000
1800000
Solo benchmarks weren't very accurate or interesting, so I've omitted them. Simultaneous benchmarks at key FFT lengths, with one instance on the A72 cluster (big/) and one on the A53 cluster (little/):

1024K:
Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p20000047.stat
INFO: no restart file found...starting run from scratch.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        16         8        16        16        16
[Jan 14 15:07:24] M20000047 Iter# = 10000 [ 0.05% complete] clocks = 00:09:50.612 [  0.0591 sec/iter] Res64: 9A2AF744DE060296. AvgMaxErr = 0.220881351. MaxErr = 0.312500000.
[Jan 14 15:17:14] M20000047 Iter# = 20000 [ 0.10% complete] clocks = 00:09:49.950 [  0.0590 sec/iter] Res64: D99B4D255F5C0C74. AvgMaxErr = 0.221486410. MaxErr = 0.312500000.

pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p20000047.stat
INFO: no restart file found...starting run from scratch.
M20000047: using FFT length 1024K = 1048576 8-byte floats.
 this gives an average   19.073531150817871 bits per digit
Using complex FFT radices        64        32        16        16
[Jan 14 15:09:47] M20000047 Iter# = 10000 [ 0.05% complete] clocks = 00:12:14.866 [  0.0735 sec/iter] Res64: 9A2AF744DE060296. AvgMaxErr = 0.230923241. MaxErr = 0.312500000.
[Jan 14 15:22:02] M20000047 Iter# = 20000 [ 0.10% complete] clocks = 00:12:14.665 [  0.0735 sec/iter] Res64: D99B4D255F5C0C74. AvgMaxErr = 0.231118560. MaxErr = 0.343750000.
2560K:
Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
 this gives an average   18.693951034545897 bits per digit
Using complex FFT radices        40        32        32        32
[Jan 14 15:52:39] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:23:46.485 [  0.1426 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.245389789. MaxErr = 0.375000000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
 this gives an average   18.693951034545897 bits per digit
Using complex FFT radices       160         8         8         8        16
M49005071 Roundoff warning on iteration      181, maxerr =   0.437500000000
M49005071 Roundoff warning on iteration     4240, maxerr =   0.437500000000
M49005071 Roundoff warning on iteration     8110, maxerr =   0.437500000000
[Jan 14 15:58:12] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:29:18.200 [  0.1758 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.282188452. MaxErr = 0.437500000.
4608K:
Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
 this gives an average   18.452321582370335 bits per digit
Using complex FFT radices       144        32        32        16
[Jan 14 16:48:14] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:36:37.950 [  0.2198 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.256021777. MaxErr = 0.375000000.
[Jan 14 17:24:52] M87068977 Iter# = 20000 [ 0.02% complete] clocks = 00:36:37.133 [  0.2197 sec/iter] Res64: C43069A17478EF46. AvgMaxErr = 0.256765161. MaxErr = 0.343750000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
 this gives an average   18.452321582370335 bits per digit
Using complex FFT radices       288        32        16        16
[Jan 14 17:04:38] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:53:00.131 [  0.3180 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.249227525. MaxErr = 0.375000000.
[Jan 14 17:57:39] M87068977 Iter# = 20000 [ 0.02% complete] clocks = 00:52:59.939 [  0.3180 sec/iter] Res64: C43069A17478EF46. AvgMaxErr = 0.249523122. MaxErr = 0.343750000.
7680K:
Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p143472073.stat
INFO: no restart file found...starting run from scratch.
M143472073: using FFT length 7680K = 7864320 8-byte floats.
 this gives an average   18.243417485555014 bits per digit
Using complex FFT radices       240        16        32        32
[Jan 14 19:06:56] M143472073 Iter# = 10000 [ 0.01% complete] clocks = 01:02:23.065 [  0.3743 sec/iter] Res64: C7B182C990710B46. AvgMaxErr = 0.241344566. MaxErr = 0.343750000.
[Jan 14 20:09:15] M143472073 Iter# = 20000 [ 0.01% complete] clocks = 01:02:18.024 [  0.3738 sec/iter] Res64: 181335759D5BB711. AvgMaxErr = 0.241826627. MaxErr = 0.375000000.
[Jan 14 21:11:38] M143472073 Iter# = 30000 [ 0.02% complete] clocks = 01:02:21.381 [  0.3741 sec/iter] Res64: 126EDB1E9B6580C4. AvgMaxErr = 0.241919198. MaxErr = 0.343750000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p143472073.stat
INFO: no restart file found...starting run from scratch.
M143472073: using FFT length 7680K = 7864320 8-byte floats.
 this gives an average   18.243417485555014 bits per digit
Using complex FFT radices       240        32        32        16
[Jan 14 19:44:16] M143472073 Iter# = 10000 [ 0.01% complete] clocks = 01:39:44.262 [  0.5984 sec/iter] Res64: C7B182C990710B46. AvgMaxErr = 0.235340244. MaxErr = 0.343750000.
[Jan 14 21:23:58] M143472073 Iter# = 20000 [ 0.01% complete] clocks = 01:39:39.004 [  0.5979 sec/iter] Res64: 181335759D5BB711. AvgMaxErr = 0.236132050. MaxErr = 0.375000000.
18432K:
Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p332220523.stat
INFO: no restart file found...starting run from scratch.
M332220523: using FFT length 18432K = 18874368 8-byte floats.
 this gives an average   17.601676676008438 bits per digit
Using complex FFT radices       288        32        32        32
[Jan 15 00:29:12] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 02:57:36.013 [  1.0656 sec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.186972266. MaxErr = 0.250000000.
M332220523 Roundoff warning on iteration    11467, maxerr =   0.500000000000
 Retrying iteration interval to see if roundoff error is reproducible.
Restarting M332220523 at iteration = 10000. Res64: 1A313D709BFA6663
M332220523: using FFT length 18432K = 18874368 8-byte floats.
 this gives an average   17.601676676008438 bits per digit
Retry of iteration interval with fatal roundoff error was successful.
[Jan 15 03:52:50] M332220523 Iter# = 20000 [ 0.01% complete] clocks = 02:57:28.763 [  1.0649 sec/iter] Res64: 73DC7A5C8B839081. AvgMaxErr = 0.187356934. MaxErr = 0.250000000.
[Jan 15 06:50:22] M332220523 Iter# = 30000 [ 0.01% complete] clocks = 02:57:28.523 [  1.0649 sec/iter] Res64: B928CD22434EEC7C. AvgMaxErr = 0.187289062. MaxErr = 0.281250000.
[Jan 15 09:47:49] M332220523 Iter# = 40000 [ 0.01% complete] clocks = 02:57:24.003 [  1.0644 sec/iter] Res64: 307ECB47139AEB31. AvgMaxErr = 0.187450000. MaxErr = 0.250000000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p332220523.stat
INFO: no restart file found...starting run from scratch.
M332220523: using FFT length 18432K = 18874368 8-byte floats.
 this gives an average   17.601676676008438 bits per digit
Using complex FFT radices       288        32        32        32
[Jan 15 03:04:59] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 05:33:22.437 [  2.0002 sec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.186969141. MaxErr = 0.250000000.
[Jan 15 08:38:04] M332220523 Iter# = 20000 [ 0.01% complete] clocks = 05:32:58.179 [  1.9978 sec/iter] Res64: 73DC7A5C8B839081. AvgMaxErr = 0.187339746. MaxErr = 0.250000000.
Giving combined synthetic timings of:
Code:
 1024K    32.73 ms/it
 2560K    78.73 ms/it
 4608K   129.93 ms/it
 7680K   230.12 ms/it
18432K   694.90 ms/it
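The combined figures above are consistent with summing the two clusters' throughputs (iterations per unit time) and inverting the result. A minimal Python sketch of that arithmetic follows. The per-cluster values are read off the .stat excerpts above; which checkpoint line to use for each run is an assumption, and the function names are illustrative only. It also prints the A53/A72 time ratio at each FFT length.

```python
# Per-cluster timings in ms/iter, read from the .stat excerpts above.
# One checkpoint line per run was picked by hand (an assumption).
TIMINGS_MS = {
    "1024K": (59.0, 73.5),       # (A72 "big", A53 "little")
    "2560K": (142.6, 175.8),
    "4608K": (219.7, 318.0),
    "7680K": (374.1, 597.9),
    "18432K": (1064.9, 2000.2),
}


def combined_ms_per_iter(big_ms, little_ms):
    """Effective ms/iter when the throughputs of both clusters are summed."""
    throughput = 1.0 / big_ms + 1.0 / little_ms  # iterations per ms
    return 1.0 / throughput


def little_to_big_ratio(big_ms, little_ms):
    """How many times slower the A53 cluster is than the A72 cluster."""
    return little_ms / big_ms


if __name__ == "__main__":
    for fft, (big, little) in TIMINGS_MS.items():
        print(f"{fft:>6}  {combined_ms_per_iter(big, little):7.2f} ms/it"
              f"  A53/A72 = {little_to_big_ratio(big, little):.2f}")
```

With these inputs the sketch lands within a few hundredths of a millisecond of the posted table; the small differences depend on which checkpoint lines are used.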
As FFT length increases, the A53 performs progressively worse relative to the A72. Running the A53 at a lower FFT length and the A72 at a higher one could be closer to optimal, so let's test with the A72 on 4608K and the A53 on 2560K:

Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
 this gives an average   18.452321582370335 bits per digit
Using complex FFT radices       144        32        32        16
[Jan 15 11:18:17] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:37:42.276 [  0.2262 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.256021777. MaxErr = 0.375000000.
[Jan 15 11:55:59] M87068977 Iter# = 20000 [ 0.02% complete] clocks = 00:37:41.853 [  0.2262 sec/iter] Res64: C43069A17478EF46. AvgMaxErr = 0.256765161. MaxErr = 0.343750000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
 this gives an average   18.693951034545897 bits per digit
Using complex FFT radices       160         8         8         8        16
M49005071 Roundoff warning on iteration      181, maxerr =   0.437500000000
M49005071 Roundoff warning on iteration     4240, maxerr =   0.437500000000
M49005071 Roundoff warning on iteration     8110, maxerr =   0.437500000000
[Jan 15 11:07:39] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:27:03.713 [  0.1624 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.282188452. MaxErr = 0.437500000.
[Jan 15 11:34:45] M49005071 Iter# = 20000 [ 0.04% complete] clocks = 00:27:05.668 [  0.1626 sec/iter] Res64: 6CD0428337CA1430. AvgMaxErr = 0.282933594. MaxErr = 0.406250000.
M49005071 Roundoff warning on iteration    20522, maxerr =   0.437500000000
M49005071 Roundoff warning on iteration    24876, maxerr =   0.437500000000
M49005071 Roundoff warning on iteration    25658, maxerr =   0.437500000000
[Jan 15 12:01:52] M49005071 Iter# = 30000 [ 0.06% complete] clocks = 00:27:05.741 [  0.1626 sec/iter] Res64: 106C93EFA0800D81. AvgMaxErr = 0.282969043. MaxErr = 0.437500000.
It's interesting that the A53 sped up (0.1758 → 0.1624 sec/iter) and the A72 slowed down (0.2197 → 0.2262 sec/iter) relative to their performance when the other cluster was at the same FFT length. I expected the A72 to be slightly faster than its previous result, because the A53 should be working in its own cache more and using system memory less. Maybe, because the A53 is less starved and doing more work, it's using more power or some other shared resource. Reversing the allocation to see what would happen, A53 on 4608K and A72 on 2560K:
Code:
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat big/p49005071.stat
INFO: no restart file found...starting run from scratch.
M49005071: using FFT length 2560K = 2621440 8-byte floats.
 this gives an average   18.693951034545897 bits per digit
Using complex FFT radices        40        32        32        32
[Jan 15 12:54:52] M49005071 Iter# = 10000 [ 0.02% complete] clocks = 00:23:16.719 [  0.1397 sec/iter] Res64: 8E7E56F23C735CF2. AvgMaxErr = 0.245389789. MaxErr = 0.375000000.
[Jan 15 13:18:09] M49005071 Iter# = 20000 [ 0.04% complete] clocks = 00:23:16.597 [  0.1397 sec/iter] Res64: 6CD0428337CA1430. AvgMaxErr = 0.246327235. MaxErr = 0.375000000.
[Jan 15 13:41:25] M49005071 Iter# = 30000 [ 0.06% complete] clocks = 00:23:16.016 [  0.1396 sec/iter] Res64: 106C93EFA0800D81. AvgMaxErr = 0.246085634. MaxErr = 0.375000000.
pi@NanoPi-NEO4:~/mlucas_v17.1$ cat little/p87068977.stat
INFO: no restart file found...starting run from scratch.
M87068977: using FFT length 4608K = 4718592 8-byte floats.
 this gives an average   18.452321582370335 bits per digit
Using complex FFT radices       288        32        16        16
[Jan 15 13:28:13] M87068977 Iter# = 10000 [ 0.01% complete] clocks = 00:56:36.492 [  0.3396 sec/iter] Res64: 13BB5C9DDF0CD3D6. AvgMaxErr = 0.249227525. MaxErr = 0.375000000.
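The same pattern holds with the roles reversed: the cluster on the smaller FFT speeds up (A72 at 2560K: 0.1426 → 0.1397 sec/iter) and the one on the larger FFT slows down (A53 at 4608K: 0.3180 → 0.3396 sec/iter). I don't know enough to analyse further, but in any case the difference is pretty small. For an easy life it's probably best to stick both clusters on DC work and maybe switch the A72 to first-time PRP when it's implemented, just for fun.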

This board is tiny (60x45mm), the SoC is on the underside and the heatsink covers the entire underside. It might not make sense from a power or hardware cost perspective to use these boards for GIMPS, but creating a DIY radiator to heat your house with a cluster of these is tempting.

Using the 2200G and 8100 numbers from above, we need ~15.5 NEO4s to match a 2200G and ~21.5 to match an 8100. I don't have a wattmeter handy, but online benchmarks indicate power usage is probably around 11 W per NEO4, so I think x86 wins by some margin.
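As a rough cross-check of the power comparison, the node counts above can be multiplied by the estimated per-board draw. A minimal Python sketch, where the 11 W figure is the poster's estimate from online benchmarks rather than a measurement, and the function name is illustrative:

```python
# Estimated wall power per NEO4 board, in watts. This is the rough figure
# quoted in the post (from online benchmarks), not a wattmeter reading.
NEO4_WATTS = 11.0

# Approximate number of NEO4 boards needed to match each x86 chip (from the post).
NODES_TO_MATCH = {"2200G": 15.5, "8100": 21.5}


def cluster_watts(nodes, watts_per_node=NEO4_WATTS):
    """Total estimated draw of a NEO4 cluster with the given node count."""
    return nodes * watts_per_node


if __name__ == "__main__":
    for chip, nodes in NODES_TO_MATCH.items():
        print(f"{chip}: ~{nodes} boards, ~{cluster_watts(nodes):.1f} W")
```

Under these assumptions the equivalent clusters come to about 170.5 W and 236.5 W respectively, before any power-supply losses.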

2019-01-15, 19:28   #178
ewmayer
2ω=0
Sep 2002
República de California

2·13·443 Posts

Thanks for the data, M344587487! That is indeed a dramatic falling-off of the A53 throughput once you get above 4M FFT length - on my A53-quad-based C2 I see falloff from the strictly-arithmetic-opcount-based O(n log n) scaling, but nowhere near what you see in your big+little combined tests.
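One rough way to set the posted timings against pure O(n log n) scaling is to compare the observed slowdown between the smallest and largest FFT lengths in post #177 with the ratio of n·log(n). The Python sketch below takes n to be the FFT length in 8-byte floats as printed in the .stat files. The timings are single checkpoint values picked by hand (an assumption), and the comparison ignores the differing radix sets and the fact that both clusters were running at once.

```python
import math

# FFT lengths (in 8-byte floats) and per-cluster timings in ms/iter, taken
# from the .stat excerpts in post #177. Checkpoint choice is an assumption.
SMALL = {"n": 1048576, "a72_ms": 59.0, "a53_ms": 73.5}        # 1024K
LARGE = {"n": 18874368, "a72_ms": 1064.9, "a53_ms": 2000.2}   # 18432K


def nlogn_ratio(n_small, n_large):
    """Expected cost ratio if run time scaled purely as n*log(n)."""
    return (n_large * math.log(n_large)) / (n_small * math.log(n_small))


def observed_ratio(ms_small, ms_large):
    """Measured slowdown going from the small FFT to the large one."""
    return ms_large / ms_small


if __name__ == "__main__":
    print(f"n log n ratio: {nlogn_ratio(SMALL['n'], LARGE['n']):.2f}")
    print(f"A72 observed:  {observed_ratio(SMALL['a72_ms'], LARGE['a72_ms']):.2f}")
    print(f"A53 observed:  {observed_ratio(SMALL['a53_ms'], LARGE['a53_ms']):.2f}")
```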

How much did the full kit cost you? And do you have anything in mind to reduce per-node cost for the possible homebuilt cluster you describe?

2019-01-15, 19:48   #179
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013

2×1,433 Posts

I'd be interested in building a small 4 or 7 node cluster of devices like this, if it's more cost effective for DC than running mprime on x86-64.

2019-01-15, 21:12   #180
M344587487
"Composite as Heck"
Oct 2017

23·79 Posts

Quote:
Originally Posted by ewmayer
Thanks for the data, M344587487! That is indeed a dramatic falling-off of the A53 throughput once you get above 4M FFT length - on my A53-quad-based C2 I see falloff from the strictly-arithmetic-opcount-based O(n log n) scaling, but nowhere near what you see in your big+little combined tests.

How much did the full kit cost you? And do you have anything in mind to reduce per-node cost for the possible homebuilt cluster you describe?
You buy these directly and they ship worldwide: https://www.friendlyarm.com/index.ph...product_id=241

$45 for the board, $6 for the heatsink, $5 postage, £3 for a USB-C cable and £3 for an SD card; I already had a USB-C power source. You can probably get a 10-port USB power hub on eBay for £10-£20, but they're likely to be terribly inefficient. I'm not certain, but I have a feeling that the most efficient solution would be to mod an ATX PSU or maybe a laptop transformer, though that would be a massive headache.

You absolutely need the board, heatsink and USB cable per node. You might be able to DIY a shared heatsink on the cheap if you're building a wall of these; you'd just need some thermal pads on the SoCs, and the backside of the heatsink can be flat. No need for a switch as there's wifi. I don't think you can network boot these, but even if you could you'd need a switch and would gain nothing (except the reliability of not using SD cards). It may be possible to eliminate the SD card if the OS can be made small enough to run fully in RAM, but it's a bit of an admin nightmare on initial boot and if there's a crash or, god forbid, a power cut.

Quote:
Originally Posted by Mark Rose
I'd be interested in building a small 4 or 7 node cluster of devices like this, if it's more cost effective for DC than running mprime on x86-64.
Unfortunately it's probably not, but when I find the wattmeter I'll disable all non-critical hardware and see how low it can go. I have hope that newer chips will make ARM more power competitive, but by the time they get cheap enough to make sense, x86 will also have made progress. AMD 7nm is out this year with AVX2 parity with Intel, which promises to be a massive leap forward in power efficiency for x86.

I didn't buy a NEO4 heatsink and instead used one salvaged from an old computer. It's pretty ridiculous as it dwarfs the board in all three dimensions.
Attached image: neo.jpg

2019-01-15, 21:37   #181
ewmayer
2ω=0
Sep 2002
República de California

2×13×443 Posts

Hmm, so let's think along extreme cost-cutting lines, how low can we go?

o Was hoping that (like Odroid) one might be able to get a bulk discount (say 10%) on these - might be worth e-mailing the mfr to ask. (I just did so.)

o Heat sink: One should be able to get a properly-sized set of these in either a cheap bulk-pack (I see a 10-pack of 25x25x5mm ones on eBay for ~$10) or in the form of some cut-to-desired-length extruded Alu. finned stock. A total per-unit cost < $50 is getting pretty close to "worth a try as a feasibility study" range. I'll be very interested in seeing accurate TPD numbers for these. When one factors in the wattage of the entire package (CPU + rest-of-mobo + PSU + SSD + case fans), what is the typical TPD at the wall outlet for a typical high-bang-per-watt Intel or AMD multicore system?

Last fiddled with by ewmayer on 2019-01-15 at 21:43

2019-01-16, 00:13   #182
Mark Rose
"/X\(‘-‘)/X\"
Jan 2013

5462₈ Posts

Quote:
Originally Posted by ewmayer
When one factors in the wattage of the entire package (CPU + rest-of-mobo + PSU + SSD + case fans), what is the typical TPD at the wall outlet for a typical high-bang-per-watt Intel or AMD multicore system?
With a single Gold-rated power supply feeding four 4-core boards, underclocked and undervolted to match memory: about 67.5 watts at the wall each, giving 186 iter/sec for a 4k FFT (143 iter/sec for 5k, 295 iter/sec for 2.5k).
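For a rough sense of scale, wall power divided by throughput gives energy per iteration. The Python sketch below assumes the 67.5 W and 186 iter/sec figures both refer to a single board, and sets them beside the NEO4's combined 129.93 ms/it at 4608K from post #177, using that post's estimated (not measured) 11 W. The FFT lengths differ, so this is an order-of-magnitude comparison only.

```python
def joules_per_iter(watts, iters_per_sec):
    """Energy per iteration in joules: power (W) divided by throughput (iter/s)."""
    return watts / iters_per_sec


# x86 board: figures from the post above (assumed to be per board).
x86_j = joules_per_iter(67.5, 186.0)

# NEO4: combined 129.93 ms/it at 4608K from post #177; 11 W is an estimate.
neo4_iters_per_sec = 1000.0 / 129.93
neo4_j = joules_per_iter(11.0, neo4_iters_per_sec)

if __name__ == "__main__":
    print(f"x86 board: {x86_j:.2f} J/iter")
    print(f"NEO4:      {neo4_j:.2f} J/iter")
```

On these numbers the x86 boards would use roughly a quarter of the energy per iteration, subject to the caveats above.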

2019-01-17, 20:36   #183
ewmayer
2ω=0
Sep 2002
República de California

2·13·443 Posts

I started a thread about the RockPi4 on the Odroid forums. Local expert tkaiser has some useful insights there re. suitable OS images.

2019-02-25, 08:17   #184
nomead
"Sam Laur"
Dec 2018
Turku, Finland

148₁₆ Posts
Raspberry Pi 3A+ power consumption

Just a quick note as a reference.

I've seen many tables and measurements for Raspberry Pi power consumption on the interwebs, but they are somewhat misleading because apparently their "full load" isn't anywhere near what Mlucas with ASIMD is capable of achieving while running.

So, Mlucas (still v17.1, sorry) - on Raspberry Pi 3A+ at stock 1.4 GHz. 64-bit Gentoo "sakaki" build, fresher image from December 2018 so that it can run on the 3A+ as well. There is some slight difference in the firmware, and the old image from June 2018 wouldn't start on the 3A+. X disabled, of course. On 1 GB it makes only a small difference in the running time, but 512 MB seems to be too small for a graphical environment, even doing nothing.

Idle 220 mA (from 5V)
Proper full load 840 mA with 880 mA spikes, perhaps more; my current meter isn't that fast.
What I've seen on the net has generally been in the 400-500 mA range "full load" so take those figures with a grain of salt...
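For comparison with wattage figures quoted elsewhere in the thread, the current readings convert to power as P = V × I. A minimal Python sketch, assuming the supply rail sits at exactly the nominal 5 V (the function name is illustrative):

```python
SUPPLY_VOLTS = 5.0  # nominal supply voltage stated in the post


def watts(milliamps, volts=SUPPLY_VOLTS):
    """Convert a current reading in mA to power in W at the given voltage."""
    return volts * milliamps / 1000.0


# Current readings from the post, in mA.
READINGS_MA = {"idle": 220, "full load": 840, "spikes": 880}

if __name__ == "__main__":
    for label, ma in READINGS_MA.items():
        print(f"{label}: {watts(ma):.1f} W")
```

That works out to about 1.1 W idle and 4.2 W under full load, with spikes around 4.4 W.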

As a side note, the 3A+ "should" be as fast as the 3B+, but for whatever reason, is actually a few percent slower. Maybe the smaller memory chip makes that difference? The Elpida chip is marked -1D-F on the end of the device code which means 533 MHz, and the default speed should be 500 MHz, so no difference there. (The memory on my 3B+ cards is -8D-F, by the way, which is 400 MHz so the default setting is already overclocking it a bit!)
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.