mersenneforum.org Next-gen Odroid announcement

2018-02-07, 01:56   #1
ewmayer
∂²ω=0

Sep 2002
República de California

11570₁₀ Posts

Next-gen Odroid announcement

So I finally got round to registering for the Odroid forum and posting re. the Mlucas-for-ARMv8 SIMD-code port yesterday. In my post I also loudly yearned for an Odroid update based on the newer/faster Cortex A57, and today I got a "Wish granted" reply from user 'rooted' pointing to this thread, posted (not sure if coincidentally) just a few hours after I started mine: https://forum.odroid.com/viewtopic.php?t=29932

2018-02-07, 11:52   #2
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

Mooaarrr power, can't go wrong with that.

2018-02-07, 14:37   #3
rogue

"Mark"
Apr 2003
Between here and the

1011110010100₂ Posts

Where's the PS2 port?

2018-02-07, 17:50   #4
M344587487

"Composite as Heck"
Oct 2017

3×233 Posts

That's interesting, I have some open questions. What can we expect out of the dual core A72 in terms of performance/power draw/efficiency? Can the A72 and A53 run full bore together? I assume the best setup is 2 workers, 1 per cortex.

I like that it's 12V instead of 5V, it works better to use a PSU as the power source as it's the main rail. I'm augmenting an x86 system with some pi/pi clones which use 5V, it would be pretty cool to be able to power some 12V boards from the same molex connectors.

I tried to find some benchmarks for comparison.

Mediatek-MT8176 (2.1GHz 2 core A72, 4 core A53): https://www.notebookcheck.net/Mediat....187985.0.html
Geekbench 4.1/4.2 64 bit single-core score: 1541
Geekbench 4.1/4.2 64 bit multi-core score: 2489

Mediatek-MT6735 (4 core 1.5GHz A53): https://www.notebookcheck.net/Mediat....147799.0.html
Geekbench 4.1/4.2 64 bit single-core score: 519
Geekbench 4.1/4.2 64 bit multi-core score: 1430

I know it's very rule-of-thumb as it is, but an A72 core having triple the bench score of an A53 core may mean the board can do 2.5x the throughput of a 4 core A53 SoC. The multi-core score backs this up if the A53 cores were idle during its run, or maybe the benchmarks can't be compared in this way and this is all fluff.
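The rule-of-thumb estimate in post #4 can be reproduced with a few lines of arithmetic. This is only a sketch: it uses the Geekbench scores quoted in that post, and the assumption that throughput scales linearly with single-core score and core count is the poster's rule of thumb, not a measured result. The function and variable names are made up for illustration.

```python
# Geekbench 4.1/4.2 64-bit scores quoted in post #4.
A72_SINGLE = 1541     # MT8176 single-core score (runs on an A72 core)
A53_SINGLE = 519      # MT6735 single-core score (runs on an A53 core)
MT8176_MULTI = 2489   # 2x A72 + 4x A53
MT6735_MULTI = 1430   # 4x A53


def a53_equivalents(n_a72, n_a53, per_core_ratio):
    """Express a mixed A72/A53 chip as a count of 'A53-equivalent' cores.

    Assumption: throughput is linear in single-core score and core count.
    """
    return n_a72 * per_core_ratio + n_a53


per_core_ratio = A72_SINGLE / A53_SINGLE            # roughly 3
hexacore = a53_equivalents(2, 4, per_core_ratio)    # 2x A72 + 4x A53
quadcore = a53_equivalents(0, 4, per_core_ratio)    # 4x A53 only
estimate = hexacore / quadcore

multi_ratio = MT8176_MULTI / MT6735_MULTI

print(f"A72/A53 single-core ratio:      {per_core_ratio:.2f}")
print(f"Estimated 2xA72+4xA53 vs 4xA53: {estimate:.2f}x")
print(f"Quoted multi-core score ratio:  {multi_ratio:.2f}x")
```

Note that the two chips run at different clock speeds (2.1GHz vs 1.5GHz), so the per-core ratio mixes clock and microarchitecture effects, which is one more reason to treat the estimate as rough.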
2018-02-07, 18:17   #5
ldesnogu

Jan 2008
France

3×179 Posts

Quote:
 Originally Posted by M344587487 I like that it's 12V instead of 5V, it works better to use a PSU as the power source as it's the main rail. I'm augmenting an x86 system with some pi/pi clones which use 5V, it would be pretty cool to be able to power some 12V boards from the same molex connectors.
The molex connector on that board can only be used to power an external drive, not vice versa.

2018-02-07, 18:24   #6
M344587487

"Composite as Heck"
Oct 2017

3·233 Posts

Quote:
 Originally Posted by ldesnogu The molex connector on that board can only be used to power an external drive, not vice versa
I didn't mean to use the molex connector on the board, I meant to do the 12V equivalent of this: https://www.ebay.co.uk/itm/USB-To-4-...UAAOSw0kNXg2vy

2018-02-07, 18:45   #7
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2³·3·7² Posts

According to ARM, the A72 core should perform about 26% better than the A57 in floating point (same frequency, process and memory subsystem). However, looking at the bottom of this: https://www.anandtech.com/show/11088...ce-and-power/2 it looks like the A57 and A72 are much closer in performance/MHz (A72 maybe 10% better than A57), with an A53 core at slightly less than half of an A57 core.

2018-02-08, 00:46   #8
ewmayer
∂²ω=0

Sep 2002
República de California

10110100110010₂ Posts

Quote:
 Originally Posted by M344587487 That's interesting, I have some open questions. What can we expect out of the dual core A72 in terms of performance/power draw/efficiency? Can the A72 and A53 run full bore together? I assume the best setup is 2 workers, 1 per cortex.
Good questions - you're probably right re. one worker on the A72 and another on the A53, but it will be interesting to see if and to what extent the two different CPUs can work together on tasks. The multithreading in my code breaks stuff into lots of separate work chunks which can be done independently by the respective threads in a master pool, and which only need to be synchronized in the "all work chunks done, let's move on to the next phase" sense, so e.g. having the A53 cores complete their work units at 1/3rd (or whatever) the rate of the A72 cores should be no problem. I would expect the memory/cache-locality bandwidth between the 2 CPUs to be the bigger gating factor.

Anyhow, the quickest way to find out is to get hold of one of the dev-boards they're gifting to a small subset of Odroiders. I asked to be included in the list of potential grantees, so we can hope. In the meantime, though, if someone has access to one of the currently available (and pricier) big.LITTLE dev-boards with dual Cortex CPUs of different flavors, they could play with this aspect.
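The work-chunk scheme described in post #8 can be illustrated with a toy thread pool. This is a minimal sketch, not Mlucas's actual threading code: the shared queue, the `run_phase` function and the sleep-based "work" are all assumptions made for illustration. It shows only the general idea that workers of unequal speed can share one pool when the sole synchronization point is "all chunks of this phase are done".

```python
import queue
import threading
import time


def run_phase(chunks, worker_delays):
    """Process all chunks of one phase with a pool of unequal-speed workers.

    Each worker repeatedly pulls the next chunk from a shared queue, so a
    faster worker simply ends up handling more chunks. Returns a dict that
    maps worker index to the list of chunks that worker processed.
    """
    todo = queue.Queue()
    for c in chunks:
        todo.put(c)

    done = {i: [] for i in range(len(worker_delays))}

    def worker(idx, delay):
        while True:
            try:
                chunk = todo.get_nowait()
            except queue.Empty:
                return               # no work left in this phase
            time.sleep(delay)        # stand-in for the real computation
            done[idx].append(chunk)  # each worker writes only its own list

    threads = [threading.Thread(target=worker, args=(i, d))
               for i, d in enumerate(worker_delays)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # the only sync point: "all chunks done"
    return done


# Two "fast" workers and four "slow" ones at one third of the speed
# (hypothetical delays chosen only to mimic the 1/3 figure in the post).
delays = [0.001, 0.001, 0.003, 0.003, 0.003, 0.003]
for phase in range(2):
    result = run_phase(list(range(60)), delays)
    counts = [len(v) for v in result.values()]
    print(f"phase {phase}: chunks per worker = {counts}")
```

A sketch like this says nothing about the memory/cache bandwidth between the two CPU clusters, which the post expects to be the bigger gating factor on real hardware.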

2018-02-08, 01:38   #9
GP2

Sep 2003

2·1,291 Posts

Looks like the compiler flags to use with gcc for this announced product would be -march=armv8-a -mtune=cortex-a72.cortex-a53

2018-02-08, 02:59   #10
ewmayer
∂²ω=0

Sep 2002
República de California

11570₁₀ Posts

Quote:
 Originally Posted by GP2 Looks like the compiler flags to use with gcc for this announced product would be -march=armv8-a -mtune=cortex-a72.cortex-a53
Note that in my C2 SIMD builds I found a slight negative timing impact from using the A53 arch-flags, so I eschew them. YMMV, but AFAICT the only good reason to invoke such flags is if your platform requires them, which is sometimes not easy to tell - e.g. I had one builder whose build segfaulted at runtime sans the arch-flags for his CPU; said issue was cured by invoking them on rebuild.

2018-03-03, 02:10   #11
ewmayer
∂²ω=0

Sep 2002
República de California

11570₁₀ Posts

Thanks to a well-placed Odroider who was one of the selected recipients of a pre-release N1 system and was kind enough to try out my code on it, we have N1 timings to mull over. Couple of notes:

1. His Debian build bonked with segfaults, likely a similar miscompilation issue to the one TomW hit (but I haven't bothered to do the deeper digging needed to precisely localize the cause). But my C2-build (under the standard Ubuntu distro Hardkernel ships with that unit) worked for him. Since that same build appears to run in drop-in mode on a surprising variety of ARMv8 platforms (including Raspberry Pi3), I've posted it to the Mlucas ftp site and added corresponding link/verbiage to the Mlucas readme page.

2. I'm still waiting for more data re. running code on both sockets (i.e. 6 total cores/threads), but preliminarily it looks like running separate jobs on the A72 and the A53 is best, as I surmised would be the case.

A53, 4-cores - I've snipped the ROE stats column from both mlucas.cfg files' data for the sake of readability:

Code:
1024  msec/iter =   43.98  radices = 256  8 16 16  0
1152  msec/iter =   47.97  radices = 144 16 16 16  0
1280  msec/iter =   51.81  radices = 160 16 16 16  0
1408  msec/iter =   60.31  radices = 176 16 16 16  0
1536  msec/iter =   65.26  radices = 192 16 16 16  0
1664  msec/iter =   71.93  radices = 208 16 16 16  0
1792  msec/iter =   78.33  radices = 224 16 16 16  0
1920  msec/iter =   85.62  radices = 240 16 16 16  0
2048  msec/iter =   91.51  radices = 256 16 16 16  0
2304  msec/iter =  108.23  radices = 288 16 16 16  0
2560  msec/iter =  121.80  radices = 160 32 16 16  0
2816  msec/iter =  140.07  radices = 176 32 16 16  0
3072  msec/iter =  149.53  radices = 192 32 16 16  0
3328  msec/iter =  165.62  radices = 208 32 16 16  0
3584  msec/iter =  180.50  radices = 224 32 16 16  0
3840  msec/iter =  195.86  radices = 240 32 16 16  0
4096  msec/iter =  212.20  radices = 256 32 16 16  0
4608  msec/iter =  249.22  radices = 288 32 16 16  0
5120  msec/iter =  278.39  radices = 160 32 32 16  0
5632  msec/iter =  316.39  radices = 176 32 32 16  0
6144  msec/iter =  339.48  radices = 192 32 32 16  0
6656  msec/iter =  376.19  radices = 208 32 32 16  0
7168  msec/iter =  407.32  radices = 224 32 32 16  0
7680  msec/iter =  446.24  radices = 240 32 32 16  0

A72, 2-cores:

Code:
                                                      Speedup vs A53x4:
1024  msec/iter =   31.18  radices =  64  8  8  8 16    1.41
1152  msec/iter =   35.09  radices = 288  8 16 16  0    1.37
1280  msec/iter =   41.47  radices = 160 16 16 16  0    1.25
1408  msec/iter =   48.19  radices = 176 16 16 16  0    1.25
1536  msec/iter =   51.76  radices =  48 32 32 16  0    1.26
1664  msec/iter =   57.13  radices = 208 16 16 16  0    1.26
1792  msec/iter =   60.23  radices = 224 16 16 16  0    1.30
1920  msec/iter =   65.86  radices = 240 16 16 16  0    1.30
2048  msec/iter =   66.59  radices = 128 16 16 32  0    1.37
2304  msec/iter =   75.49  radices = 144 16 16 32  0    1.43
2560  msec/iter =   80.48  radices = 160  8  8  8 16    1.51
2816  msec/iter =   94.42  radices = 176  8  8  8 16    1.48
3072  msec/iter =  102.73  radices = 192  8  8  8 16    1.46
3328  msec/iter =  110.71  radices = 208  8  8  8 16    1.50
3584  msec/iter =  115.94  radices = 224  8  8  8 16    1.56
3840  msec/iter =  125.06  radices = 240  8  8  8 16    1.57
4096  msec/iter =  134.47  radices = 256  8  8  8 16    1.58
4608  msec/iter =  150.90  radices = 288  8  8  8 16    1.66
5120  msec/iter =  181.31  radices = 160  8  8 16 16    1.54
5632  msec/iter =  210.01  radices = 176  8  8 16 16    1.51
6144  msec/iter =  227.63  radices = 192  8  8 16 16    1.49
6656  msec/iter =  248.11  radices = 208  8  8 16 16    1.52
7168  msec/iter =  261.58  radices = 224  8  8 16 16    1.56
7680  msec/iter =  284.01  radices = 240  8  8 16 16    1.57

Total system throughput is thus ~2.5x that of my A53x4 Odroid C2 and, as such, appreciably exceeds that of an Mlucas SSE2 build running 2-threaded on my 2GHz Core2Duo macbook. It likely still lags Prime95 running on the latter hardware, but I expect not by all that much.

Last fiddled with by ewmayer on 2018-03-03 at 02:13
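The "Speedup vs A53x4" column and the ~2.5x total-throughput figure in post #11 follow from simple ratios of the msec/iter values. The sketch below recomputes them for a few FFT lengths copied from the two timing tables (the first, slower set taken as the A53x4 baseline, the second as the A72x2 run). It assumes, as the post's summary implies, that the A53x4 baseline is comparable to an Odroid C2, and that two independent jobs simply add their throughputs; the variable names are illustrative only.

```python
# msec/iter at a few FFT lengths (in K), copied from the two timing tables.
a53x4 = {1024: 43.98, 2048: 91.51, 4096: 212.20, 7680: 446.24}
a72x2 = {1024: 31.18, 2048: 66.59, 4096: 134.47, 7680: 284.01}

speedups = {}
combined = {}
for fft_k in sorted(a53x4):
    # Speedup of the A72x2 job relative to the A53x4 job at this FFT length.
    speedups[fft_k] = a53x4[fft_k] / a72x2[fft_k]
    # Throughput is iterations per unit time, i.e. 1 / (msec/iter).
    # Running both jobs side by side, normalised to the A53x4 job alone:
    # (1/t_a53 + 1/t_a72) / (1/t_a53) = 1 + t_a53/t_a72.
    combined[fft_k] = 1.0 + speedups[fft_k]
    print(f"{fft_k:5d}K  A72x2 speedup = {speedups[fft_k]:.2f}  "
          f"both jobs vs A53x4 alone = {combined[fft_k]:.2f}x")
```

This ignores any slowdown from the two clusters contending for memory bandwidth when both jobs run at once, so the combined figure is an upper-end estimate.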
