mersenneforum.org  

Old 2018-02-07, 01:56   #1
ewmayer
2ω=0
 
Sep 2002
República de California

2·13·443 Posts
Next-gen Odroid announcement

So I finally got round to registering for the Odroid forum and posting re. the Mlucas-for-ARMv8 SIMD-code port yesterday. In my post I also loudly yearned for an Odroid update based on the newer/faster Cortex A57. Today I got a "Wish granted" reply from user 'rooted' pointing to this thread, posted (not sure if coincidentally) just a few hours after I started mine:

https://forum.odroid.com/viewtopic.php?t=29932

Old 2018-02-07, 11:52   #2
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands

2³·3·7² Posts

Mooaarrr power, can't go wrong with that.

Old 2018-02-07, 14:37   #3
rogue
"Mark"
Apr 2003
Between here and the

2²×1,481 Posts

Where's the PS2 port?

Old 2018-02-07, 17:50   #4
M344587487
"Composite as Heck"
Oct 2017

2³×79 Posts

That's interesting. I have some open questions:
  • What can we expect out of the dual core A72 in terms of performance/power draw/efficiency?
  • Can the A72 and A53 run full bore together?
  • I assume the best setup is 2 workers, 1 per Cortex.

I like that it's 12V instead of 5V: 12V is the main rail on a PC PSU, so a PSU works better as the power source. I'm augmenting an x86 system with some Pi/Pi clones, which use 5V; it would be pretty cool to be able to power some 12V boards from the same molex connectors.

I tried to find some benchmarks for comparison.
Mediatek-MT8176 (2.1GHz 2 core A72, 4 core A53): https://www.notebookcheck.net/Mediat....187985.0.html
Geekbench 4.1/4.2 64 bit single-core score: 1541
Geekbench 4.1/4.2 64 bit multi-core score: 2489

Mediatek-MT6735 (4 core 1.5GHz A53): https://www.notebookcheck.net/Mediat....147799.0.html
Geekbench 4.1/4.2 64 bit single-core score: 519
Geekbench 4.1/4.2 64 bit multi-core score: 1430

I know it's very rule-of-thumb, but an A72 core having triple the bench score of an A53 core may mean the board can do 2.5x the throughput of a 4 core A53 SoC (2 A72 cores at ~3x each plus 4 A53 cores is roughly 10 A53-equivalents, versus 4). The multi-core score backs this up if the A53 cores were idle during its run, or maybe the benchmarks can't be compared in this way and this is all fluff.
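
To make that rule of thumb concrete, here is a minimal Python sketch of the arithmetic, using only the two single-core scores quoted above. It assumes throughput scales linearly with core count and single-core score, which ignores memory bandwidth, clock differences and throttling; the function name `estimated_speedup` is made up for illustration.

```python
# Geekbench 4 single-core scores quoted above (notebookcheck figures).
A72_SINGLE = 1541   # Mediatek MT8176, 2.1GHz A72
A53_SINGLE = 519    # Mediatek MT6735, 1.5GHz A53


def estimated_speedup(n_a72: int, n_a53: int, baseline_a53: int = 4) -> float:
    """Throughput of an n_a72 + n_a53 board relative to a baseline_a53-core A53 SoC.

    Assumption: throughput is linear in core count and single-core score.
    """
    per_core_ratio = A72_SINGLE / A53_SINGLE      # one A72 expressed in "A53 units"
    board_units = n_a72 * per_core_ratio + n_a53  # whole board in "A53 units"
    return board_units / baseline_a53


ratio = A72_SINGLE / A53_SINGLE
full_board = estimated_speedup(2, 4)   # both clusters busy
a72_only = estimated_speedup(2, 0)     # A53 cores idle

print(f"A72/A53 single-core ratio: {ratio:.2f}")
print(f"2xA72 + 4xA53 vs 4xA53:    {full_board:.2f}x")
print(f"2xA72 alone vs 4xA53:      {a72_only:.2f}x")
```

Under those assumptions the full board comes out just under 2.5x, and the A72 pair alone just under 1.5x, of a 4-core A53 SoC.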

Old 2018-02-07, 18:17   #5
ldesnogu
Jan 2008
France

528₁₀ Posts

Quote:
Originally Posted by M344587487
I like that it's 12V instead of 5V: 12V is the main rail on a PC PSU, so a PSU works better as the power source. I'm augmenting an x86 system with some Pi/Pi clones, which use 5V; it would be pretty cool to be able to power some 12V boards from the same molex connectors.

The molex connector on that board can only be used to power an external drive, not vice versa.

Old 2018-02-07, 18:24   #6
M344587487
"Composite as Heck"
Oct 2017

2³·79 Posts

Quote:
Originally Posted by ldesnogu
The molex connector on that board can only be used to power an external drive, not vice versa.

I didn't mean to use the molex connector on the board; I meant to do the 12V equivalent of this: https://www.ebay.co.uk/itm/USB-To-4-...UAAOSw0kNXg2vy

Old 2018-02-07, 18:45   #7
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

According to ARM, the A72 core should perform about 26% better than the A57 in floating point (same frequency, process and memory subsystem).

However, looking at the bottom of this:
https://www.anandtech.com/show/11088...ce-and-power/2
It looks like the A57 and A72 are much closer in performance/MHz (A72 maybe 10% better than A57). An A53 core is slightly less than half of an A57 core.
[Attached image: ARM_A72-A57_IPC.PNG]

Old 2018-02-08, 00:46   #8
ewmayer
2ω=0
Sep 2002
República de California

2·13·443 Posts

Quote:
Originally Posted by M344587487
That's interesting. I have some open questions:
  • What can we expect out of the dual core A72 in terms of performance/power draw/efficiency?
  • Can the A72 and A53 run full bore together?
  • I assume the best setup is 2 workers, 1 per Cortex.

Good questions. You're probably right re. one worker on the A72 and another on the A53, but it will be interesting to see if and to what extent the two different CPUs can work together on tasks. The multithreading in my code breaks stuff into lots of separate work chunks which can be done independently by the respective threads in a master pool, and which only need be synchronized in the "all work chunks done, let's move on to the next phase" sense, so e.g. having the A53 cores complete their work units at 1/3rd (or whatever) the rate of the A72 cores should be no problem. I would expect the memory/cache-locality bandwidth between the 2 CPUs to be the bigger gating factor.
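
The chunk-pool pattern described above can be sketched in a few lines of Python. This is an illustration only, not Mlucas's actual C code: it assumes chunks are handed out dynamically from a shared queue, the function name `run_phases` is made up, and `time.sleep` stands in for real work, with two "fast" workers and four workers at 1/3 the speed.

```python
import queue
import threading
import time


def run_phases(num_phases, chunks_per_phase, worker_delays):
    """Run phases of independent chunks on a worker pool.

    Returns how many chunks each worker completed. Workers only
    synchronize at the end of each phase, never per chunk.
    """
    work = queue.Queue()
    done = [0] * len(worker_delays)

    def worker(idx, delay):
        while True:
            chunk = work.get()
            if chunk is None:        # sentinel: shut this worker down
                work.task_done()
                return
            time.sleep(delay)        # stand-in for the real work on this chunk
            done[idx] += 1           # each worker writes only its own slot
            work.task_done()

    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(worker_delays)]
    for t in threads:
        t.start()

    for phase in range(num_phases):
        for chunk in range(chunks_per_phase):
            work.put((phase, chunk))
        work.join()                  # "all work chunks done, move on to next phase"

    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return done


# Illustrative speeds: 2 fast workers, 4 slow ones at 1/3 the rate.
counts = run_phases(num_phases=3, chunks_per_phase=40,
                    worker_delays=[0.001, 0.001, 0.003, 0.003, 0.003, 0.003])
print(counts, sum(counts))
```

With dynamic hand-out the fast workers simply end up taking more chunks per phase, so the slow ones do not stall the pool. If chunks were instead split evenly among threads up front, each phase would run at the pace of the slowest worker.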

Anyhow, the quickest way to find out is to get hold of one of the dev-boards they're gifting to a small subset of Odroiders. I asked to be included in the list of potential grantees, so we can hope. In the meantime, though, if someone has access to one of the currently available (and pricier) big.LITTLE dev-boards with two different flavors of Cortex CPU, they could play with this aspect.

Old 2018-02-08, 01:38   #9
GP2
Sep 2003

2²×3×5×43 Posts

Looks like the compiler flags to use with gcc for this announced product would be -march=armv8-a -mtune=cortex-a72.cortex-a53

Old 2018-02-08, 02:59   #10
ewmayer
2ω=0
Sep 2002
República de California

2CFE₁₆ Posts

Quote:
Originally Posted by GP2
Looks like the compiler flags to use with gcc for this announced product would be -march=armv8-a -mtune=cortex-a72.cortex-a53

Note that in my C2 SIMD builds I found a slight negative timing impact from using the A53 arch-flags, so I eschew them. YMMV, but AFAICT the only good reason to invoke such flags is if your platform requires them, which is sometimes not easy to tell: e.g. I had one builder whose build segfaulted at runtime sans the arch-flags for his CPU; the issue was cured by invoking them on rebuild.

Old 2018-03-03, 02:10   #11
ewmayer
2ω=0
Sep 2002
República de California

2CFE₁₆ Posts

Thanks to a well-placed Odroider who was one of the selected recipients of a pre-release N1 system and was kind enough to try out my code on it, we have N1 timings to mull over. A couple of notes:

1. His Debian build bonked with segfaults, likely a miscompilation issue similar to the one TomW hit (but I haven't bothered to do the deeper digging needed to precisely localize the cause). But my C2 build (under the standard Ubuntu distro Hardkernel ships with that unit) worked for him. Since that same build appears to run in drop-in mode on a surprising variety of ARMv8 platforms (including the Raspberry Pi 3), I've posted it to the Mlucas ftp site and added a corresponding link/verbiage to the Mlucas readme page.

2. I'm still waiting for more data re. running code on both CPUs at once (i.e. 6 total cores/threads), but preliminarily it looks like running separate jobs on the A72 and the A53 is best, as I surmised would be the case.

A53, 4-cores (I've snipped the ROE stats column from both mlucas.cfg files' data for the sake of readability):
Code:
1024  msec/iter =   43.98  radices = 256  8 16 16  0
1152  msec/iter =   47.97  radices = 144 16 16 16  0
1280  msec/iter =   51.81  radices = 160 16 16 16  0
1408  msec/iter =   60.31  radices = 176 16 16 16  0
1536  msec/iter =   65.26  radices = 192 16 16 16  0
1664  msec/iter =   71.93  radices = 208 16 16 16  0
1792  msec/iter =   78.33  radices = 224 16 16 16  0
1920  msec/iter =   85.62  radices = 240 16 16 16  0
2048  msec/iter =   91.51  radices = 256 16 16 16  0
2304  msec/iter =  108.23  radices = 288 16 16 16  0
2560  msec/iter =  121.80  radices = 160 32 16 16  0
2816  msec/iter =  140.07  radices = 176 32 16 16  0
3072  msec/iter =  149.53  radices = 192 32 16 16  0
3328  msec/iter =  165.62  radices = 208 32 16 16  0
3584  msec/iter =  180.50  radices = 224 32 16 16  0
3840  msec/iter =  195.86  radices = 240 32 16 16  0
4096  msec/iter =  212.20  radices = 256 32 16 16  0
4608  msec/iter =  249.22  radices = 288 32 16 16  0
5120  msec/iter =  278.39  radices = 160 32 32 16  0
5632  msec/iter =  316.39  radices = 176 32 32 16  0
6144  msec/iter =  339.48  radices = 192 32 32 16  0
6656  msec/iter =  376.19  radices = 208 32 32 16  0
7168  msec/iter =  407.32  radices = 224 32 32 16  0
7680  msec/iter =  446.24  radices = 240 32 32 16  0
A72, 2-cores:
Code:
							Speedup vs A53x4:
1024  msec/iter =   31.18  radices =  64  8  8  8 16	1.41
1152  msec/iter =   35.09  radices = 288  8 16 16  0	1.37
1280  msec/iter =   41.47  radices = 160 16 16 16  0	1.25
1408  msec/iter =   48.19  radices = 176 16 16 16  0	1.25
1536  msec/iter =   51.76  radices =  48 32 32 16  0	1.26
1664  msec/iter =   57.13  radices = 208 16 16 16  0	1.26
1792  msec/iter =   60.23  radices = 224 16 16 16  0	1.30
1920  msec/iter =   65.86  radices = 240 16 16 16  0	1.30
2048  msec/iter =   66.59  radices = 128 16 16 32  0	1.37
2304  msec/iter =   75.49  radices = 144 16 16 32  0	1.43
2560  msec/iter =   80.48  radices = 160  8  8  8 16	1.51
2816  msec/iter =   94.42  radices = 176  8  8  8 16	1.48
3072  msec/iter =  102.73  radices = 192  8  8  8 16	1.46
3328  msec/iter =  110.71  radices = 208  8  8  8 16	1.50
3584  msec/iter =  115.94  radices = 224  8  8  8 16	1.56
3840  msec/iter =  125.06  radices = 240  8  8  8 16	1.57
4096  msec/iter =  134.47  radices = 256  8  8  8 16	1.58
4608  msec/iter =  150.90  radices = 288  8  8  8 16	1.66
5120  msec/iter =  181.31  radices = 160  8  8 16 16	1.54
5632  msec/iter =  210.01  radices = 176  8  8 16 16	1.51
6144  msec/iter =  227.63  radices = 192  8  8 16 16	1.49
6656  msec/iter =  248.11  radices = 208  8  8 16 16	1.52
7168  msec/iter =  261.58  radices = 224  8  8 16 16	1.56
7680  msec/iter =  284.01  radices = 240  8  8 16 16	1.57
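With the two A72 cores running at roughly 1.5x the speed of the four A53 cores at the larger FFT lengths, and a separate job on each, total system throughput is thus ~2.5x that of my A53x4 Odroid C2. As such it appreciably exceeds that of an Mlucas SSE2 build running 2-threaded on my 2GHz Core2Duo MacBook. It likely still lags Prime95 running on the latter hardware, but I expect not by all that much.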
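
For anyone who wants to check the speedup column or the ~2.5x figure, here is a small Python sketch using four rows copied from the two tables above (first column is the FFT length in K, as in mlucas.cfg). The helper names are made up, and the combined figure assumes the two jobs do not slow each other down through shared memory bandwidth, which these timings do not confirm.

```python
# msec/iter by FFT length (K), copied from the two tables above.
A53X4 = {1024: 43.98, 2048: 91.51, 4096: 212.20, 7680: 446.24}
A72X2 = {1024: 31.18, 2048: 66.59, 4096: 134.47, 7680: 284.01}


def speedup(fft_k):
    """A72x2 speedup over A53x4 at one FFT length (ratio of per-iteration times)."""
    return A53X4[fft_k] / A72X2[fft_k]


def combined_throughput(fft_k):
    """Total throughput in units of one A53x4 job, one job per CPU.

    Assumption: the A72 and A53 jobs run concurrently without interfering.
    """
    return 1.0 + speedup(fft_k)


for k in sorted(A53X4):
    print(f"{k:5d}K  speedup = {speedup(k):.2f}  "
          f"combined = {combined_throughput(k):.2f}x")
```

The combined figure ranges from about 2.4x at the smaller FFT lengths to just under 2.6x at the larger ones for these four rows.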

Last fiddled with by ewmayer on 2018-03-03 at 02:13

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.