mersenneforum.org  

2017-02-01, 02:24   #1
ewmayer

ARM builds and SIMD-assembler prospects

Forumite and ARM Odroid user David Willmore and I hacked together a small amount of predefine code in the Mlucas platform.h file to enable him to get a build of the code on that platform (he is using this Odroid hardware implementation). As the late baseball great Yogi Berra famously quipped, "Predictions are hard, especially about the future", but over the weekend I had a gander at the ARM architecture, especially its SIMD support. Here are some e-mail ruminations that this spawned:

Me:
Quote:
Had a look at the Wikipedia article on ARM - looks like the Neon has pretty nice SIMD support, though the page is a tad confusing re. the single-vs-double-precision-float aspect. First we see this - underlines mine:

Advanced SIMD (NEON)

The Advanced SIMD extension (aka NEON or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal processing applications. NEON is included in all Cortex-A8 devices but is optional in Cortex-A9 devices.[83] NEON can execute MP3 audio decoding on CPUs running at 10 MHz and can run the GSM adaptive multi-rate (AMR) speech codec at no more than 13 MHz. It features a comprehensive instruction set, separate register files and independent execution hardware.[84] NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In NEON, the SIMD supports up to 16 operations at the same time. The NEON hardware shares the same floating-point registers as used in VFP. Devices such as the ARM Cortex-A8 and Cortex-A9 support 128-bit vectors but will execute with 64 bits at a time,[80] whereas newer Cortex-A15 devices can execute 128 bits at a time.

ProjectNe10 is ARM's first open source project (from its inception). The Ne10 library is a set of common, useful functions written in both NEON and C (for compatibility). The library was created to allow developers to use NEON optimisations without learning NEON but it also serves as a set of highly optimised NEON intrinsic and assembly code examples for common DSP, arithmetic and image processing routines. The code is available on GitHub.

Later on we have this:

AArch64 features

New instruction set, A64
  Has 31 general-purpose 64-bit registers.
  Has dedicated SP or zero register.
  The program counter (PC) is no longer directly accessible as a register.
  Instructions are still 32 bits long and mostly the same as A32 (with LDM/STM instructions and most conditional execution dropped).
  Has paired loads/stores (in place of LDM/STM).
  No predication for most instructions (except branches).
  Most instructions can take 32-bit or 64-bit arguments.
  Addresses assumed to be 64-bit.
Advanced SIMD (NEON) enhanced
  Has 32× 128-bit registers (up from 16), also accessible via VFPv4.
  Supports double-precision floating point.

Intel only added 32 SIMD registers with their AVX-512, i.e. that won't come to the PC space until late this year. I'm doing a lot of 32-register-using code streamlining in my KNL dev work ... using the eventual inline-asm macros which will result from that as the basis for an ARM asm-translation seems quite doable, especially since access to the Intel KNL has allowed me to start the AVX-512 code upgrades a full year in advance of that arch hitting consumer PCs. That means that once I have a decent first cut at AVX-512 - looking like summer based on work so far - I would have time for other things.

If we had good DP support, we could probably expect a 3-5x gain over a generic-C build from a SIMD-using enhancement. (ARM SIMD is 128-bit-wide, i.e. pairs-of-doubles, so one expects 2x at the very least, plus more on top of that from hand-tuned register and FMA usage.) If that opened up a realistic prospect of gaining some reasonable fraction of current GIMPS throughput - say 10% - from thousands of ARM users, it would be worth doing. David, any sense of what kinds of ARM-using devices would be available for the kind of 24/7 crunching needed here?
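
To make the pairs-of-doubles point concrete, here is a minimal sketch (not Mlucas code) of a fused multiply-add loop that handles two doubles per 128-bit NEON register on AArch64 and falls back to plain C everywhere else. The function name `fma_pairs` is made up for illustration; the intrinsics are the standard AArch64 ones from `arm_neon.h`.

```c
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>   /* AArch64 Advanced SIMD intrinsics */
#endif

/* Hypothetical helper, not part of Mlucas: y[i] += a[i] * b[i]. */
static void fma_pairs(double *y, const double *a, const double *b, size_t n)
{
    size_t i = 0;
#if defined(__aarch64__)
    /* Each 128-bit register holds two doubles, so step by 2. */
    for (; i + 2 <= n; i += 2) {
        float64x2_t vy = vld1q_f64(y + i);
        float64x2_t va = vld1q_f64(a + i);
        float64x2_t vb = vld1q_f64(b + i);
        vy = vfmaq_f64(vy, va, vb);   /* vy + va * vb, fused */
        vst1q_f64(y + i, vy);
    }
#endif
    /* Scalar tail, and the whole loop on non-AArch64 builds. */
    for (; i < n; i++)
        y[i] += a[i] * b[i];
}
```

Halving the loop trip count is where the "2x at the very least" comes from; anything beyond that would have to come from register scheduling and FMA use that a sketch like this does not attempt.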

Just got [Mlucas self-test-produced] cfg-file timings from David's 1-core self-tests ... those indicate ~1/20th the per-cycle throughput of a single Intel Haswell core running 256-bit vector-inline-assembly ... that's not bad at all for a generic-C build.
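
A per-cycle comparison of this kind can be reproduced from any two cfg-file timings by multiplying each per-iteration time by the core clock. A rough sketch follows; the function names are invented here, and the figures in the note after the code are placeholders, not measurements from this thread.

```c
/* Hypothetical helpers for comparing per-iteration timings across CPUs. */

/* Clock cycles spent per iteration: (ms/iter) * (GHz) * 1e6. */
static double cycles_per_iter(double msec_per_iter, double clock_ghz)
{
    return msec_per_iter * clock_ghz * 1.0e6;
}

/* Per-cycle throughput of CPU A relative to CPU B at the same FFT length.
   A value of 0.05 means A gets 1/20th as much done per clock as B. */
static double per_cycle_ratio(double msec_a, double ghz_a,
                              double msec_b, double ghz_b)
{
    return cycles_per_iter(msec_b, ghz_b) / cycles_per_iter(msec_a, ghz_a);
}
```

For instance, made-up timings of 100 ms/iter at 1.5 GHz against 2.5 ms/iter at 3.0 GHz work out to a ratio of 0.05, i.e. 1/20th per cycle.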
David:
Quote:
Yes, the Cortex-A53 does one 128 bit NEON instruction/clock--which
could be two doubles. The higher end models can dual issue NEON, but
I don't know the details. I know a few people who might.
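
As a back-of-the-envelope check on figures like these, theoretical peak double-precision throughput is cores × clock × SIMD lanes × 2 (counting an FMA as two flops) × vector instructions issued per clock. The sketch below only does that arithmetic; the struct and function names are invented, and real LL-test throughput will sit well below this bound because of memory traffic and non-FMA instructions.

```c
/* Hypothetical description of a CPU's SIMD capability. */
struct simd_core {
    int    cores;            /* number of CPU cores            */
    double clock_ghz;        /* core clock in GHz              */
    int    vector_bits;      /* SIMD register width, e.g. 128  */
    int    issue_per_clock;  /* vector FMAs issued per clock   */
};

/* Upper bound on double-precision GFLOP/s, counting an FMA as 2 flops. */
static double peak_dp_gflops(const struct simd_core *c)
{
    int lanes = c->vector_bits / 64;   /* doubles per register */
    return c->cores * c->clock_ghz * lanes * 2.0 * c->issue_per_clock;
}
```

Taking the one-128-bit-instruction-per-clock figure above at face value, with an assumed 4 cores at 1.5 GHz, the bound comes out at 24 GFLOP/s.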

There are hundreds of millions of these devices, but they tend to be
in power and memory limited applications. The one use of them that
would have no power limit and little memory limit would be in set top
boxes as they are mains powered. Other than that, they tend to be in
phones and other battery powered devices--or in things so small that
thermal issues would become significant. Even those devices that
could run this code effectively tend to be 'black boxes' that come as
set from the manufacturer and leave the user very little ability to
add programs.

I think any effort you put into this should bear in mind that very
little direct benefit will accrue to GIMPS. I would suggest that
second order benefits would be higher--people learn about GIMPS
because of the usefulness of this code to them in a different
context. Because of that knowledge, they choose to support GIMPS with
other hardware of theirs.

Because of that, I would suggest that any work you do to optimize for
AARCH64 be as generic as you can make it--use the Ne10 project code
if/when possible.

I don't want to sidetrack you from your valuable work on GIMPS with my
silly little side project.

Last fiddled with by ewmayer on 2017-02-04 at 23:39 Reason: Added Odroid link

2017-02-02, 12:28   #2
ET_

Quote:
Originally Posted by ewmayer
Forumite and ARM Odroid user David Willmore and I hacked together a small amount of predefine-code in the Mlucas platform.h file to enable him to get a build of the code on that platform.
David:
Following the thread...

2017-02-03, 11:47   #3
Lorenzo

Yesterday I ordered this device https://www.pine64.org/?product=pine-a64-board-1gb for only $19 + shipping!

It is based on a 1.2 GHz quad-core 64-bit ARM Cortex-A53 processor.

It will be very interesting to test Mlucas on it once I have the device in my hands. Unfortunately, shipment to my country might take up to two months.

2017-02-03, 14:05   #4
ET_

Quote:
Originally Posted by Lorenzo
Yesterday I ordered this device https://www.pine64.org/?product=pine-a64-board-1gb for only $19 + shipping!

It is based on a 1.2 GHz quad-core 64-bit ARM Cortex-A53 processor.

It will be very interesting to test Mlucas on it once I have the device in my hands. Unfortunately, shipment to my country might take up to two months.
[commercial mode on]
Hi Lorenzo. While you are at it, you could easily get an Odroid-C2 clocked at 1.5 GHz, 64-bit ready with Ubuntu, with its own heatsink and a lot of room for overclocking, 2 GB of memory and nearly no memory bandwidth limitations... for $40
[commercial mode off]

If I get it right, you are still in the process of building a microfarm...

2017-02-03, 14:33   #5
Lorenzo

Quote:
Originally Posted by ET_
[commercial mode on]
Hi Lorenzo. While you are at it, you could easily get an Odroid-C2 clocked at 1.5 GHz, 64-bit ready with Ubuntu, with its own heatsink and a lot of room for overclocking, 2 GB of memory and nearly no memory bandwidth limitations... for $40
[commercial mode off]

If I get it right, you are still in the process of building a microfarm...
I ordered it because my country limits duty-free imports: I can receive goods worth up to 22 USD per month without paying any customs duty. Above that limit the process gets very complicated, and I must pay at least the customs duty (30%) plus some extra taxes. That is why I bought the PINE device, and only the version with 1 GB of memory.

2017-02-03, 22:30   #6
kladner

This thread has gotten me much more intrigued with these devices.
ET: I see why you recommend Odroid. That's an amazing package. I would need a PSU, the storage module, etc. As I remember, Android has a larger memory footprint, so Ubuntu would seem to be the obvious choice.
Thinking more about it, anyway.
EDIT: Lorenzo: That is really restrictive! But who knows? The US might be going that way, too.
Tariffs on Mexican goods? Picking fights with AUSTRALIA?!? I note that He Who Shall Not Be Named does not use the word "tariff", but something like "border tax".

Trade Wars, Anyone?

Last fiddled with by kladner on 2017-02-03 at 22:36

2017-03-10, 08:54   #7
ET_

Update.
I bought one of these PicoCubes with five Odroid-C2 (20 nodes) and am ready to test/benchmark Mlucas on it as soon as I receive the package and get it ready to work.
Please refer to this thread whenever you have news or hints.

Luigi
[Attached image: pico.png]

Last fiddled with by ET_ on 2017-03-10 at 08:55

2017-03-10, 09:26   #8
ewmayer

Quote:
Originally Posted by ET_
Update.
I bought one of these PicoCubes with five Odroid-C2 (20 nodes) and am ready to test/benchmark Mlucas on it as soon as I receive the package and get it ready to work.
Please refer to this thread whenever you have news or hints.

Luigi
You're at least a couple of months ahead of me - that's how long it'll take me to finish a first-cut AVX-512 upgrade to all the Mlucas code, at which point I plan to get a low-cost NEON dev-board to play with. What software (I really just care about gcc/gdb and the associated libraries) did you install, or what came preinstalled on your system? And is that a Cortex-A15, i.e. a true 128-bit NEON?

2017-03-10, 10:16   #9
ET_

Quote:
Originally Posted by ewmayer
You're at least a couple of months ahead of me - that's how long it'll take me to finish a first-cut AVX-512 upgrade to all the Mlucas code, at which point I plan to get a low-cost NEON dev-board to play with. What software (I really just care about gcc/gdb and the associated libraries) did you install, or what came preinstalled on your system? And is that a Cortex-A15, i.e. a true 128-bit NEON?
Don't feel pressed, I will need some time as well for the delivery and to get acquainted with the management software (and my own projects).

You are doing wonderful work with AVX-512, and nobody here would like to see a slowdown on that front.

Unfortunately (?) the processor is a Cortex-A58 (ARMv8), a 64-bit processor like the one used on the Raspberry Pi, but fully supported by a 64-bit OS with 2 GB of memory and clocked at 1.5 GHz. It is the same one you said David Willmore is using (http://www.hardkernel.com/main/products/prdt_info.php).

gcc is the version that runs on Ubuntu Mate (I will have more info as soon as I get the package delivered). The specifications say:
Quote:
You can use this cluster to run almost any kind of distributed or parallel software. Run your own LAMP cluster, Docker, Kubernetes, Hadoop, ElasticSearch, Cassandra and many others. Also learn languages like Javascript, Java, Python, R, and so on. Use for Development, QA, DevOps, or Education.

The PicoCluster Application Image Set is a basic cluster setup designed to get the PicoCluster user up and running quickly. It is the Image Set that will be pre-configured with any PicoCluster Cube or Kit that is ordered with memory cards. All other Application or Cluster Images Sets are based upon this one.

You can either use PicoCluster as a desktop cluster by plugging in a mouse, keyboard, and monitor, or use it as a network cluster by connecting via SSH.
Just let me know if I can be of any help.

Luigi

2017-03-10, 23:15   #10
ewmayer

Hi, Luigi:

Cortex-A58 ... so 128-bit vector instructions OK, but they actually get executed 64-bits at a time?

Thanks for the kind offer of help - a remote-access account would be great, but no biggie since the fewer-core dev-boards are cheap. If you could LMK which precise dev-board I should get to get true 128-bit exec capability, that would be helpful.

Post a pic of your rig once it's set up!

2017-03-11, 00:48   #11
VictordeHolland

I think you both mean ARM Cortex-A53 (or A57). A58 doesn't exist (yet) ;).
List of ARM Cortex A:
http://www.arm.com/products/processors/cortex-a
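
On Linux, one way to settle which core a board really has is to read the "CPU part" field in /proc/cpuinfo and match it against ARM's part numbers. A small sketch, assuming that field is present on the board's kernel and covering only a handful of the cores mentioned in this thread; the function name is made up:

```c
#include <stddef.h>

/* Hypothetical lookup from an ARM "CPU part" number (as printed in
   /proc/cpuinfo on Linux) to a core name. Only a few parts are listed. */
static const char *arm_part_name(unsigned part)
{
    switch (part) {
    case 0xc08: return "Cortex-A8";
    case 0xc09: return "Cortex-A9";
    case 0xc0f: return "Cortex-A15";
    case 0xd03: return "Cortex-A53";
    case 0xd07: return "Cortex-A57";
    default:    return NULL;   /* unknown to this table */
    }
}
```

A line such as `CPU part : 0xd03` would therefore point to a Cortex-A53.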

Last fiddled with by VictordeHolland on 2017-03-11 at 00:51 Reason: Source
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.