2017-03-14, 20:02  #45
Aug 2010
Republic of Belarus
2×5×17 Posts 
Quote:
Anyway, the results for the PINE64 are a bit better, and your device has a higher clock (+300 MHz per core). So the PINE64 would be much better at the same frequency as your device :) I also ran a benchmark with 1-3 threads at the more-or-less current FFT size of 2048K:
Code:
./mlucas -fftlen 2048 -nthread N -iters 10
Code:
2048 msec/iter = 707.58 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
2048 msec/iter = 371.82 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
2048 msec/iter = 241.66 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0

2017-03-14, 20:21  #46
∂^{2}ω=0
Sep 2002
República de California
2×13×443 Posts 
Quote:
I used to have Win-buildability in the 32-bit days for x86, but MSFT delayed supporting 64-bit inline asm by at least 4-5 years (w.r.to when x86_64 started shipping), so I dropped Win support years ago. To build/run under Win you'll need a Linux emulator.

2017-03-14, 20:37  #47
I moo ablest echo power!
May 2013
6C9_{16} Posts 
Under Windows in the Ubuntu shell with an i7-6700 @ 3.4 GHz, using:
Code:
./mlucas -fftlen 2048 -nthread N -iters 10
Code:
2048 msec/iter = 21.03 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 32 32 32 0 0 0 0 0 0
2048 msec/iter = 13.90 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 8 16 16 16 0 0 0 0 0
2048 msec/iter = 11.43 ROE[avg,max] = [0.000000000, 0.000091553] radices = 64 16 32 32 0 0 0 0 0 0
2048 msec/iter = 10.52 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 8 16 16 16 0 0 0 0 0
2048 msec/iter = 10.79 ROE[avg,max] = [0.000000000, 0.000091553] radices = 32 8 16 16 16 0 0 0 0 0
2048 msec/iter = 11.39 ROE[avg,max] = [0.000000000, 0.000091553] radices = 128 16 16 32 0 0 0 0 0 0
2048 msec/iter = 11.06 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
2048 msec/iter = 11.43 ROE[avg,max] = [0.000000000, 0.000091553] radices = 256 16 16 16 0 0 0 0 0 0
Code:
2048 msec/iter = 18.29 ROE[avg,max] = [0.247767857, 0.250000000] radices = 32 32 32 32 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 11.17 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 8.36 ROE[avg,max] = [0.312165179, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.84 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.67 ROE[avg,max] = [0.312165179, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.64 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.70 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
2048 msec/iter = 7.69 ROE[avg,max] = [0.341964286, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 6179CD26EC3B3274, 8060072069, 29249383388
Last fiddled with by wombatman on 2017-03-14 at 20:42
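For anyone who wants to compare runs without copying numbers by hand, here is a minimal Python sketch that pulls the fields out of timing lines like the ones above. It is not part of Mlucas; the names `TimingLine` and `parse_timings` are made up for illustration, and it assumes each line carries exactly ten radix slots with the zeros acting as padding, which is what the lines posted in this thread show.

```python
import re
from typing import NamedTuple


class TimingLine(NamedTuple):
    """One parsed timing line (hypothetical container, not an Mlucas type)."""
    fft_len_k: int
    msec_per_iter: float
    roe_avg: float
    roe_max: float
    radices: tuple


# Assumes exactly ten radix slots per line, as in the lines posted in this thread.
_PATTERN = re.compile(
    r"(\d+)\s+msec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[avg,max\]\s*=\s*\[\s*([\d.]+),\s*([\d.]+)\]\s+"
    r"radices\s*=\s*(\d+(?:\s+\d+){9})"
)


def parse_timings(text):
    """Return every timing line found in text, one per line or run together."""
    rows = []
    for m in _PATTERN.finditer(text):
        # Drop the zero entries, treated here as padding (an assumption).
        radices = tuple(int(r) for r in m.group(5).split() if int(r) != 0)
        rows.append(TimingLine(int(m.group(1)), float(m.group(2)),
                               float(m.group(3)), float(m.group(4)), radices))
    return rows


# Two of the lines posted above, run together on purpose.
sample = (
    "2048 msec/iter = 21.03 ROE[avg,max] = [0.000000000, 0.000091553] "
    "radices = 32 32 32 32 0 0 0 0 0 0 "
    "2048 msec/iter = 13.90 ROE[avg,max] = [0.000000000, 0.000091553] "
    "radices = 32 8 16 16 16 0 0 0 0 0"
)

for row in parse_timings(sample):
    print(row.fft_len_k, row.msec_per_iter, row.radices)
```

The pattern uses `finditer`, so it also skips over the residue lines that appear between timing lines in the 100-iteration output.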
2017-03-14, 20:48  #48
∂^{2}ω=0
Sep 2002
República de California
2×13×443 Posts 
@wombatman: Suggest you use 1000-iter for your multithread-scaling tests, to minimize init-overhead effects. (More precisely, one would do 1000*(t_1000 - t_100)/900.)
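A minimal Python sketch of that correction, assuming t_1000 and t_100 are total wall-clock times in seconds for the 1000- and 100-iteration runs (so the factor of 1000 converts to msec/iter). The function name and the example timings are invented for illustration and are not taken from any run in this thread.

```python
def steady_state_msec_per_iter(t_long_s, t_short_s, n_long=1000, n_short=100):
    """Per-iteration time in msec with the fixed init cost cancelled out.

    Both runs pay the same one-time startup cost, so subtracting the short
    run from the long one leaves only (n_long - n_short) iterations of work.
    """
    if n_long <= n_short:
        raise ValueError("n_long must be greater than n_short")
    return 1000.0 * (t_long_s - t_short_s) / (n_long - n_short)


# Illustrative numbers only: 0.5 s of init plus 8 msec per iteration.
t_100 = 0.5 + 100 * 0.008     # 1.3 s total
t_1000 = 0.5 + 1000 * 0.008   # 8.5 s total

print(f"{steady_state_msec_per_iter(t_1000, t_100):.3f} msec/iter")
print(f"naive 100-iter figure: {1000.0 * t_100 / 100:.3f} msec/iter")
```

With these made-up inputs the naive 100-iteration figure is inflated by the init time, while the differenced figure recovers the per-iteration cost.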

2017-03-14, 23:53  #49
I moo ablest echo power!
May 2013
3^{2}·193 Posts 
No problem. I can do that tomorrow when I'm back at work

2017-03-15, 13:29  #50
"Victor de Hollander"
Aug 2011
the Netherlands
2^{3}×3×7^{2} Posts 
Quote:
A9
The ARM Cortex-A9 was designed as a 'performance' core (with a power budget) and is dual-issue, out-of-order. In other words, it can decode and send two instructions per clock to the execution units and reorder them if necessary to extract extra performance. A Vector Floating Point (VFP) execution unit is not mandatory in the A9, but most A9s have the optional VFPv3 for handling FP. It has 32 registers of 64 bits with NEON capability; NEON is ARM's name for its SIMD extension. If I understand it all correctly, the A9 is limited to 1 DP float per clock.

A53
ARM designed the A53 with (high) power efficiency in mind, as it is supposed to fill the role of the 'little' cores in their big.LITTLE philosophy. So in many high-end devices (mostly phones) they are coupled with more powerful A57 or A72 cores. When maximum responsiveness is needed (loading websites/apps, games, etc.) the A57/A72 cores are used, while the A53s handle background tasks with their greater efficiency in order to extend battery life. Anandtech tested them in the Samsung Exynos 7420 and 5433, taking into account overhead and different frequencies, and concluded that a Cortex-A53 core consumes ~200 mW/core @ 1.4 GHz (see attached graph). The A53 is a dual-issue, in-order design with VFPv4 + (advanced) NEON mandatory. The VFPv4 has 32 registers of 128 bits, theoretically allowing it to process 2 DP floats per clock.

Other differences which could impact performance of the ODROID-U2 vs. PINE64
ODROID-U2 has 1 MB L2 cache (shared amongst the cores), PINE64 512 KB L2 (shared amongst the cores).
Fab: 32 nm (U2) vs. 40 nm (PINE64), which might explain/allow the U2 to clock slightly higher (1.7 GHz vs 1.4 GHz).

If anybody has a board with A57s or A72s, please share your benchmarks, we're curious how they perform :).
I've also attached a graph with a comparison of Dhrystone benchmark performance (DMIPS/MHz) of different ARM architectures. Keep in mind Dhrystone is an old integer benchmark, but it gives a rough idea.
PINE64 Allwinner A64: http://linuxsunxi.org/A64
ODROID-U2: http://www.hardkernel.com/main/produ...0451&tab_idx=2
Useful pages for comparison between cores:
https://en.wikipedia.org/wiki/Compar..._ARMv7A_cores
https://en.wikipedia.org/wiki/Compar..._ARMv8A_cores

2017-03-15, 14:57  #51
I moo ablest echo power!
May 2013
3311_{8} Posts 
Quote:
Code:
2048 msec/iter = 18.94 ROE[avg,max] = [0.370465528, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0
1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
2048 msec/iter = 11.20 ROE[avg,max] = [0.370465528, 0.375000000] radices = 128 16 16 32 0 0 0 0 0 0
1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
2048 msec/iter = 8.48 ROE[avg,max] = [0.372615979, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
2048 msec/iter = 8.01 ROE[avg,max] = [0.372615979, 0.375000000] radices = 256 16 16 16 0 0 0 0 0 0
1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 81AEAC0C7E6089BB, 25132671466, 41950605021
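To read thread scaling off these 1000-iteration timings, a small Python sketch follows. It assumes the four lines correspond to 1, 2, 3 and 4 threads in that order; the post does not label them, so that mapping is a guess, and the helper name `scaling_table` is invented here.

```python
# msec/iter from the four 1000-iteration lines above.
# ASSUMPTION: the lines are for 1, 2, 3 and 4 threads, in that order.
timings_msec = {1: 18.94, 2: 11.20, 3: 8.48, 4: 8.01}


def scaling_table(timings):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run."""
    base = timings[1]
    rows = []
    for n in sorted(timings):
        speedup = base / timings[n]
        rows.append((n, speedup, speedup / n))
    return rows


for n, speedup, efficiency in scaling_table(timings_msec):
    print(f"{n} thread(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under that thread-count assumption the speedup comes out at roughly 1.7x for two threads and about 2.4x for four, so the later threads add progressively less.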

2017-03-15, 15:17  #52
Jan 2008
France
210_{16} Posts 
IIRC the Cortex-A9 can only issue one DP mul every other cycle. But that was so long ago that I might be wrong...

2017-03-15, 22:02  #53
∂^{2}ω=0
Sep 2002
República de California
2×13×443 Posts 
Many thanks for the details, VdH. One key point needs clarification, though: in accordance with my earlier post re. the number of 128-bit registers, I believe your "VFPv4 has 32 such" is 2x too large. From Wikipedia (underlines mine):
VFPv4 or VFPv4-D32: Implemented on the Cortex-A12 and A15 ARMv7 processors; Cortex-A7 optionally has VFPv4-D32 in the case of an FPU with NEON.[81] VFPv4 has 32 64-bit FPU registers as standard, adds both half-precision support as a storage format and fused multiply-accumulate instructions to the features of VFPv3.
The same wiki page says AArch64 has 31 64-bit GPRs. Just confirming: those are distinct from the FPRs, yes?
Wombatman, thanks for the timings. So no appreciable difference vs your simple 100-iter ones here. (This varies a lot by CPU, thus always better safe than sorry.)
Last fiddled with by ewmayer on 2017-03-15 at 22:03
2017-03-15, 22:49  #54
Jan 2008
France
2^{4}×3×11 Posts 
AArch64 has 32 128-bit SIMD/FP regs on top of 3x 64-bit int regs.

2017-03-15, 22:51  #55
(loop (#_fork))
Feb 2006
Cambridge, England
6361_{10} Posts 
AArch64 has 32 integer registers (but X31 reads as zero and throws away anything written to it, so basically that's 31 registers), and also 32 128-bit-wide "SIMD and floating-point" registers.
Code looks like
FADD V3.2D, V5.2D, V7.2D
(which adds the doubles in V5[127:64] and V7[127:64] and puts the result in V3[127:64], and also adds the doubles in V5[63:0] and V7[63:0] and puts the result in V3[63:0])
or
FADD S3, S7, S2
(which adds the bottom floats of V7 and V2, puts the result in the bottom float of V3, and sets the other three floats of V3 to zero).
It has fused FMA support, but in the form Vd = Vd + Vm*Vn, because there isn't space to pass four five-bit register names in a 32-bit opcode (there is also an FMLS instruction that does the Vd = Vd - Vm*Vn form).