mersenneforum.org  

2017-03-14, 20:02   #45
Lorenzo
Aug 2010
Republic of Belarus

2×5×17 Posts

Quote:
Originally Posted by VictordeHolland View Post
You got it working, nice!
That is a Pine64 with 4x ARM Cortex A53 cores (@1.4GHz) right?

I'm a little bit surprised it is about as fast as my
Odroid-U2 (4x ARM Cortex A9 cores @1.7GHz),
which is only 32-bit and a much older architecture.
http://mersenneforum.org/showpost.ph...5&postcount=94
Right. This is a PINE64 board.
Anyway, the results for the PINE64 are a bit better, and your device has a higher clock frequency (+300 MHz on each core). So the PINE64 would be much faster than your device at the same frequency :)

Also, I ran the benchmark with 1-3 threads at a more or less current FFT size of 2048K:
./mlucas -fftlen 2048 -nthread N -iters 10
Code:
      2048  msec/iter =  707.58  ROE[avg,max] = [0.000000000, 0.000091553]  radices = 256 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  371.82  ROE[avg,max] = [0.000000000, 0.000091553]  radices = 256 16 16 16  0  0  0  0  0  0
      2048  msec/iter =  241.66  ROE[avg,max] = [0.000000000, 0.000091553]  radices = 256 16 16 16  0  0  0  0  0  0
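
The thread scaling in the three timings above can be summarized as speedup and parallel efficiency relative to the 1-thread run. A minimal Python sketch, assuming the three rows correspond to N = 1, 2, 3 in that order; the function name `scaling` is only illustrative and not part of Mlucas:

```python
def scaling(msec_per_iter):
    """Return (threads, speedup, efficiency) tuples relative to the first timing.

    Assumes the list is ordered by thread count, starting at 1 thread.
    """
    base = msec_per_iter[0]
    rows = []
    for n, t in enumerate(msec_per_iter, start=1):
        speedup = base / t              # how many times faster than 1 thread
        rows.append((n, speedup, speedup / n))  # efficiency = speedup per thread
    return rows


pine64 = [707.58, 371.82, 241.66]  # msec/iter values quoted in the post
for n, s, e in scaling(pine64):
    print(f"{n} thread(s): speedup {s:.2f}x, efficiency {e:.0%}")
```

For these numbers the speedup comes out at roughly 1.9x on two threads and 2.9x on three, i.e. close to linear scaling.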

2017-03-14, 20:21   #46
ewmayer
2ω=0
Sep 2002
República de California

2×13×443 Posts

Quote:
Originally Posted by VictordeHolland View Post
You got it working, nice!
That is a Pine64 with 4x ARM Cortex A53 cores (@1.4GHz) right?

I'm a little bit surprised it is about as fast as my
Odroid-U2 (4x ARM Cortex A9 cores @1.7GHz),
which is only 32-bit and a much older architecture.
...
BTW: Is it possible to compile and run Mlucas on Windows 7/10? If so, I could try to run benchmarks on my i5 2500k and/or i7 3770k
Thanks for the timings! 32 vs 64-bit speed for LL testing is overwhelmingly a matter of the float-double capability - how do those 2 versions of the ARM compare in that regard?

I used to have Win-buildability in the 32-bit days for the x86, but MSFT delayed supporting 64-bit inline asm by at least 4-5 years (relative to when x86_64 started shipping), so I dropped Win support years ago. To build/run under Win you'll need a Linux emulator.

2017-03-14, 20:37   #47
wombatman
I moo ablest echo power!
May 2013

6C9₁₆ Posts

Under Windows in the Ubuntu shell with an i7-6700 @ 3.4GHz, using:
Code:
./mlucas -fftlen 2048 -nthread N -iters 10
with N=1 to 8 (4-core machine with hyperthreading):

Code:
      2048  msec/iter =   21.03  ROE[avg,max] = [0.000000000, 0.000091553]  radices =  32 32 32 32  0  0  0  0  0  0
      2048  msec/iter =   13.90  ROE[avg,max] = [0.000000000, 0.000091553]  radices =  32  8 16 16 16  0  0  0  0  0
      2048  msec/iter =   11.43  ROE[avg,max] = [0.000000000, 0.000091553]  radices =  64 16 32 32  0  0  0  0  0  0
      2048  msec/iter =   10.52  ROE[avg,max] = [0.000000000, 0.000091553]  radices =  32  8 16 16 16  0  0  0  0  0
      2048  msec/iter =   10.79  ROE[avg,max] = [0.000000000, 0.000091553]  radices =  32  8 16 16 16  0  0  0  0  0
      2048  msec/iter =   11.39  ROE[avg,max] = [0.000000000, 0.000091553]  radices = 128 16 16 32  0  0  0  0  0  0
      2048  msec/iter =   11.06  ROE[avg,max] = [0.000000000, 0.000091553]  radices = 256 16 16 16  0  0  0  0  0  0
      2048  msec/iter =   11.43  ROE[avg,max] = [0.000000000, 0.000091553]  radices = 256 16 16 16  0  0  0  0  0  0
With 100 iterations:

Code:
      2048  msec/iter =   18.29  ROE[avg,max] = [0.247767857, 0.250000000]  radices =  32 32 32 32  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =   11.17  ROE[avg,max] = [0.341964286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =    8.36  ROE[avg,max] = [0.312165179, 0.375000000]  radices = 128 16 16 32  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =    7.84  ROE[avg,max] = [0.341964286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =    7.67  ROE[avg,max] = [0.312165179, 0.375000000]  radices = 128 16 16 32  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =    7.64  ROE[avg,max] = [0.341964286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =    7.70  ROE[avg,max] = [0.341964286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
      2048  msec/iter =    7.69  ROE[avg,max] = [0.341964286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388
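
For anyone who wants to tabulate timing lines like the ones above rather than read them by eye, here is a small Python sketch that pulls out the FFT length, msec/iter, ROE values and radix list. It is written only against the line layout visible in these posts, not against any documented Mlucas output format, and it assumes the trailing zeros merely pad the radix list. `parse_timing_line` is a made-up helper name.

```python
import re

# Matches the fixed leading part of a timing line; whatever follows
# "radices =" is captured as-is and tokenized separately below.
LINE_RE = re.compile(
    r"^\s*(?P<fft_k>\d+)\s+msec/iter\s*=\s*(?P<msec>\d+(?:\.\d+)?)\s+"
    r"ROE\[avg,max\]\s*=\s*\[\s*(?P<roe_avg>[\d.]+),\s*(?P<roe_max>[\d.]+)\]\s+"
    r"radices\s*=\s*(?P<rest>.*)$"
)


def parse_timing_line(line):
    """Parse one timing line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    radices = []
    for token in m.group("rest").split():
        if not token.isdigit():   # e.g. "100-iteration" starts the residue part
            break
        if token != "0":          # assumption: zeros only pad the radix list
            radices.append(int(token))
    return {
        "fft_k": int(m.group("fft_k")),
        "msec_per_iter": float(m.group("msec")),
        "roe_avg": float(m.group("roe_avg")),
        "roe_max": float(m.group("roe_max")),
        "radices": radices,
    }


samples = [
    "      2048  msec/iter =   21.03  ROE[avg,max] = [0.000000000, 0.000091553]  radices =  32 32 32 32  0  0  0  0  0  0",
    "      2048  msec/iter =   11.17  ROE[avg,max] = [0.341964286, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274,  8060072069, 29249383388",
]
for line in samples:
    print(parse_timing_line(line))
```

Lines that do not look like timing output (for example the "Code:" labels) simply return `None`, so a whole post can be fed through line by line.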

Last fiddled with by wombatman on 2017-03-14 at 20:42

2017-03-14, 20:48   #48
ewmayer
2ω=0
Sep 2002
República de California

2×13×443 Posts

@wombatman: Suggest you use 1000-iter for your multithread-scaling tests, to minimize init-overhead effects. (More precisely, one would do 1000*(t_1000-t_100)/900.)
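
The correction in parentheses can be sketched in a few lines of Python. This assumes `t_100` and `t_1000` are total wall-clock times in seconds for the 100- and 1000-iteration runs (the post does not spell out the units), so that the one-time init cost cancels in the difference and the factor of 1000 converts to msec/iter. The function name is illustrative only.

```python
def steady_state_msec_per_iter(t_short_s, t_long_s, n_short=100, n_long=1000):
    """Per-iteration time in msec with the one-time init cost subtracted out.

    t_short_s, t_long_s: total run times in seconds (assumed units) for
    n_short and n_long iterations. With the defaults this is exactly
    1000 * (t_1000 - t_100) / 900.
    """
    if n_long <= n_short:
        raise ValueError("n_long must be larger than n_short")
    return 1000.0 * (t_long_s - t_short_s) / (n_long - n_short)


# Hypothetical run: 2.0 s of one-time setup plus 20 msec per iteration.
init_s, per_iter_s = 2.0, 0.020
t_100 = init_s + 100 * per_iter_s
t_1000 = init_s + 1000 * per_iter_s

print(f"naive 1000-iter average: {1000.0 * t_1000 / 1000:.2f} msec/iter")
print(f"init-corrected estimate: {steady_state_msec_per_iter(t_100, t_1000):.2f} msec/iter")
```

In this made-up case the naive average still carries the 2 s of setup (22 msec/iter), while the differenced estimate recovers the underlying 20 msec/iter.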

2017-03-14, 23:53   #49
wombatman
I moo ablest echo power!
May 2013

3²·193 Posts

No problem. I can do that tomorrow when I'm back at work.

2017-03-15, 13:29   #50
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

Quote:
Originally Posted by ewmayer View Post
Thanks for the timings! 32 vs 64-bit speed for LL testing is overwhelmingly a matter of the float-double capability - how do those 2 versions of the ARM compare in that regard?
The ARM Cortex A9 was announced in October 2007, the Cortex A53 in October 2012. But "5 years newer" doesn't tell the whole story. The design choices were different.

A9
The ARM Cortex A9 was designed as a 'performance' core (within a power budget) and is dual-issue, out-of-order. In other words, it can decode and send two instructions per clock to the execution units and reorder them if necessary to extract extra performance. A Vector Floating Point (VFP) execution unit is not mandatory in the A9, though most A9s have the optional Vector Floating Point v3 (VFPv3) for handling FP. It has 32 64-bit registers with NEON capability; NEON is ARM's name for its SIMD extension. If I understand it all correctly, the A9 is limited to 1 DP float per clock.

A53
ARM designed the A53 with (high) power efficiency in mind, as it is supposed to fill the role of the 'little' cores in their big.LITTLE philosophy. So in many high-end devices (mostly phones) they are coupled with more powerful A57 or A72 cores. When maximum responsiveness is needed (loading websites/apps, games, etc.) the A57/A72 cores are used, while the A53s handle background tasks with their greater efficiency in order to extend battery life. Anandtech tested them in the Samsung Exynos 7420 and 5433, taking into account overhead and different frequencies, and concluded that a Cortex A53 core consumes ~200mW/core @1.4GHz (see attached graph). The A53 is a dual-issue, in-order design with mandatory VFPv4 + (advanced) NEON. The VFPv4 has 32 128-bit registers, theoretically allowing it to process 2 DP floats per clock.

Other differences which could impact performance of the ODROID-U2 vs. PINE64
ODROID-U2 has 1MB L2 cache (shared amongst the cores), PINE64 512KB L2 (shared amongst the cores).
Fab: 32nm (U2) vs. 40nm (PINE64) which might explain/allow the U2 to clock slightly higher (1.7GHz vs 1.4GHz).
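
To put the per-clock figures above side by side, here is a back-of-the-envelope Python sketch of the paper peak for each board. It uses only the numbers quoted in this post (4 cores each, 1.7GHz vs 1.4GHz, 1 vs 2 DP floats per clock) and treats them as assumptions, not measured or vendor-confirmed values; real LL-test throughput also depends on memory bandwidth, cache size and issue restrictions, none of which are modelled here.

```python
def peak_dp_per_second(cores, clock_ghz, dp_per_clock):
    """Upper bound on double-precision results per second, in units of 1e9.

    Simply cores * clock * results-per-clock; ignores memory, cache and
    pipeline effects entirely.
    """
    return cores * clock_ghz * dp_per_clock


# (cores, clock in GHz, DP floats per clock) as stated in the post
boards = {
    "ODROID-U2 (4x A9 @ 1.7GHz)": (4, 1.7, 1),
    "PINE64 (4x A53 @ 1.4GHz)": (4, 1.4, 2),
}
for name, args in boards.items():
    print(f"{name}: {peak_dp_per_second(*args):.1f} G DP results/s (paper peak)")
```

On these assumptions the A53 board has the higher paper peak despite the lower clock, which is at least consistent with the two boards benchmarking in the same range.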

If anybody has a board with A57s or A72s, please share your benchmarks, we're curious how they perform :).

I've also attached a graph comparing the Dhrystone benchmark performance (DMIPS/MHz) of different ARM architectures. Keep in mind that Dhrystone is an old integer benchmark, but it gives a rough idea.

PINE64 Allwinner A64:
http://linux-sunxi.org/A64
ODROID-U2:
http://www.hardkernel.com/main/produ...0451&tab_idx=2

Useful pages for comparison between cores:
https://en.wikipedia.org/wiki/Compar..._ARMv7-A_cores
https://en.wikipedia.org/wiki/Compar..._ARMv8-A_cores
Attached Thumbnails:
A53-power-curve-frequency.png (20.4 KB)
ARMv7_vs_ARMv8_DIPS_Performance.png (9.4 KB)

2017-03-15, 14:57   #51
wombatman
I moo ablest echo power!
May 2013

3311₈ Posts

Quote:
Originally Posted by ewmayer View Post
@wombatman: Suggest you use 1000-iter for your multithread-scaling tests, to minimize init-overhead effects. (More precisely, one would do 1000*(t_1000-t_100)/900.)
As requested, the 1000-iteration tests for nthread = 1-4:
Code:
      2048  msec/iter =   18.94  ROE[avg,max] = [0.370465528, 0.375000000]  radices = 128 16 16 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
      2048  msec/iter =   11.20  ROE[avg,max] = [0.370465528, 0.375000000]  radices = 128 16 16 32  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
      2048  msec/iter =    8.48  ROE[avg,max] = [0.372615979, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
      2048  msec/iter =    8.01  ROE[avg,max] = [0.372615979, 0.375000000]  radices = 256 16 16 16  0  0  0  0  0  0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
For anyone else wanting to run mlucas on Windows 10, the Ubuntu shell works well, and mlucas compiles straight away.

2017-03-15, 15:17   #52
ldesnogu
Jan 2008
France

210₁₆ Posts

IIRC Cortex-A9 can only issue one DP mul every other cycle. But that was so long ago, that I might be wrong...

2017-03-15, 22:02   #53
ewmayer
2ω=0
Sep 2002
República de California

2×13×443 Posts

Many thanks for the details, VdH - one key point, though, needs clarification: in accordance with my earlier post re. the number of 128-bit registers, I believe your "VFPv4 has 32 such" is 2x too large. From Wikipedia (underlines mine):

VFPv4 or VFPv4-D32
Implemented on the Cortex-A12 and A15 ARMv7 processors, Cortex-A7 optionally has VFPv4-D32 in the case of an FPU with NEON. VFPv4 has 32 64-bit FPU registers as standard, adds both half-precision support as a storage format and fused multiply-accumulate instructions to the features of VFPv3.

The same wikipage says AArch64 has 31 64-bit GPRs - just confirming, those are distinct from the FPRs, yes?

Wombatman, thanks for the timings - so no appreciable difference vs your simple 100-iter ones here. (This varies a lot by CPU, thus always better safe than sorry.)

Last fiddled with by ewmayer on 2017-03-15 at 22:03

2017-03-15, 22:49   #54
ldesnogu
Jan 2008
France

2⁴×3×11 Posts

AArch64 has 32 128-bit SIMD/FP regs on top of 3x 64-bit int regs.

2017-03-15, 22:51   #55
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England

6361₁₀ Posts

AArch64 has 32 integer registers (but X31 reads as zero and throws away anything written to it, so basically that's 31 registers), and also 32 128-bit-wide "SIMD and floating-point" registers.

Code looks like

FADD V3.2D, V5.2D, V7.2D (which adds the doubles in V5[127:64] and V7[127:64] and puts the result in V3[127:64], and also adds the doubles in V5[63:0] and V7[63:0] and puts the result in V3[63:0])

or FADD S3, S7, S2 (which adds the bottom floats of V7 and V2, puts the result in the bottom float of V3, and sets the other three floats of V3 to zero)

It has fused multiply-add (FMA) support, but in the form Vd = Vd + Vm*Vn because there isn't space to pass four five-bit register names in a 32-bit opcode (there is also an FMLS instruction that does the Vd = Vd - Vm*Vn form).
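
The lane behaviour described above can be mimicked with a toy Python model. This is only a sketch of the data flow, not an emulator: a 128-bit register is modelled as a plain list (two doubles for the `.2D` view, four floats for the `S` view, index 0 being the lowest bits), and the function names are made up for illustration. Note that Python rounds the product and the sum separately, whereas the real fused instruction rounds once.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fadd_2d(vn, vm):
    """Model of FADD Vd.2D, Vn.2D, Vm.2D: independent add in each 64-bit lane."""
    return [vn[0] + vm[0], vn[1] + vm[1]]


def fadd_s(vn, vm):
    """Model of FADD Sd, Sn, Sm: add the bottom floats, zero the other three lanes."""
    return [to_f32(vn[0] + vm[0]), 0.0, 0.0, 0.0]


def fmla_2d(vd, vn, vm):
    """Model of the accumulate form Vd = Vd + Vn*Vm: the destination is also an input."""
    return [d + n * m for d, n, m in zip(vd, vn, vm)]


def fmls_2d(vd, vn, vm):
    """Model of the subtract form Vd = Vd - Vn*Vm."""
    return [d - n * m for d, n, m in zip(vd, vn, vm)]


v5, v7 = [1.0, 2.0], [10.0, 20.0]
print("FADD V3.2D, V5.2D, V7.2D ->", fadd_2d(v5, v7))
print("FADD S3, S7, S2          ->", fadd_s([1.5, 9.0, 9.0, 9.0], [2.25, 8.0, 8.0, 8.0]))
print("accumulate form          ->", fmla_2d([1.0, 1.0], [2.0, 3.0], [4.0, 5.0]))
print("subtract form            ->", fmls_2d([1.0, 1.0], [2.0, 3.0], [4.0, 5.0]))
```

Because the accumulator is overwritten, code that needs the old value of Vd after the multiply-add has to copy it to another register first; that is the practical cost of the three-operand form.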
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.