Old 2017-06-14, 04:37   #1
Romulan Interpreter
LaurV
Jun 2011

Welcome to Blogorrhea!

We are the proud owners of a Blogorrhea subforum (or is it subfora? anyhow, both are underlined in red, as if their blogblood went out through their soles, so we decided that both of them are wrong...)

So, let's give it a test. We are now waiting for comments from you, which we will remove and destroy, only because we can, to show you that we have all the rights here.

And we use the royal we to show you that we are the god... (the lowercase one, we do not want to anger the uppercase Gods of the forum...)

We are starting this blog with a (test) post about ARM computing, but we hope that the blog will not be about this. In fact, we have no idea yet what the blog will be about...

Someone in a parallel thread said that a Raspberry Pi is about the best we can expect from porting some LL/TF code to today's ARMs.

We do not really agree. Today's ARMs have strong and flexible SIMD units inside, like NEON, which can be used for number crunching. It would be not-so-hard, and not-so-slow, to use NEON for LL testing, because the beast can execute two double-precision float (FP64) multiplications, or two 64-bit integer multiplications, simultaneously and in a very short time (even compared with a "standard" x64 CPU), and all the libraries (FFT included) are available and well maintained.

But our concern and competency is not with beasts like NEON, but with small silicon grains, like the Cortex Mx, as our daily job includes a lot of STM32xxxx controllers. We write the software for them, and program them, sometimes with a programmer, sometimes with a hammer, depending on our mood and the mood of our bosses...

Well, we think we can do much better than the Pi just by putting a gang of these grains together.

We are still thinking about that old idea of ours to put 960 pieces of Cortex Mx on a board and ask them politely to do TF for us. We invested a lot of time in thinking through the interface, and in writing 96-bit integer code that plays with Montgomery and Barrett reductions. Everything is very slow, running at only 48MHz, and only as a "single core" on the devboard. And you need a lot of operations (see below) just to get a single modular multiplication done... No talk about exponentiation yet... But the consumption is only ~250uA (micro-amperes) per megahertz, at 3.3 volts, and it scales very linearly.

The "decision problem" we face is whether to use Cortex M0 or M3. The advantage of the M3 is the internal multiplier. The M0 has none: you rely on the standard instruction, multiplying 32-bit registers with a 32-bit result, which means you can only use 16-bit operands ("digits") to avoid overflow. If you hunt for factors below 80 bits, you must do "5-digit" multiplication, and if you go for higher bit levels, below-96-bit factors come next, which means 6 "digits" of 16 bits each. In schoolbook multiplication this makes 25, respectively 36, multiplications, which could be reduced a lot with some clever Karatsuba or Toom-3/Toom-2.5 splits.

However, here a 16-bit multiplication is not much slower than a 32-bit addition: they all go "RISC-like", one operation per clock, and the STM32 series all implement a single-cycle 32-bit hardware multiplier. But there is no carry and no way to get the most significant 32 bits of the result, so you have to live with 16-bit operands, and in that case Karatsuba and Toom-Cook don't make much sense either. Or... we do not know a better way to do this, and we need a lot of 16-bit multiplications (with some 32-bit additions) to do our 96-bit modular multiplications. That is kinda hell.

With the M3 everything is much faster, almost double speed, due to the availability of 32-bit multiplies with the result on 64 bits. The disadvantage is the price: a suitable M0 costs 25-30 cents per piece, with all the memory inside, but a comparable M3 costs more than double, 60-70 cents per piece. They may also consume more, like 360uA/MHz, but they also run faster, 72MHz and higher.
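To make the digit count concrete, here is a sketch (plain C, hypothetical, not our actual firmware) of the 6x6-digit schoolbook multiply under the M0 constraint that every product must fit in 32 bits:

```c
#include <stdint.h>

/* Schoolbook multiply of two 96-bit operands held as six 16-bit "digits"
 * (little-endian). Only 16x16->32 products and 32-bit additions are used,
 * so every operation maps onto what Cortex-M0's MULS can deliver (it keeps
 * just the low 32 bits of a product). This is the 36-multiplications case. */
void mul96(const uint16_t a[6], const uint16_t b[6], uint16_t r[12])
{
    uint32_t acc[12] = {0};
    for (int i = 0; i < 6; i++) {
        for (int j = 0; j < 6; j++) {
            uint32_t p = (uint32_t)a[i] * b[j];   /* 16x16 -> fits in 32 bits */
            acc[i + j]     += p & 0xFFFF;         /* low digit of the product  */
            acc[i + j + 1] += p >> 16;            /* high digit of the product */
        }
    }
    uint32_t carry = 0;                           /* one final carry sweep */
    for (int k = 0; k < 12; k++) {
        uint32_t t = acc[k] + carry;
        r[k]  = (uint16_t)t;
        carry = t >> 16;
    }
}
```

The 32-bit accumulators cannot overflow because each of the 12 slots receives at most a dozen 16-bit values, so a single carry sweep at the end suffices.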

A 960-piece board with M3s would marginally beat a GTX 1080 Ti at TF performance, and it would totally beat it at performance per watt (about 65-70 GHzCores, consuming about 85 watts, without the power supply/conversion circuits), but it would cost close to 1000 bucks.

A 960-piece board with M0s would consume half as much and cost half as much, but would have a quarter of the performance (about 40-45 GHzCores, taking around 38 watts, for about $450).

They won't be good for LL at all (no fast connection between "cores", and the memory of each "core" would be too small to allow running an LL test alone), but they could make some waves in the TF market.

Just grains of sand when you compare them with Intel's i7, but lots of them, you know...

Last fiddled with by LaurV on 2017-06-14 at 05:07
Old 2017-06-14, 05:03   #2
retina
"The unspeakable one"
Jun 2006
My evil lair


I think you neglected to include the development costs. You actually have to design a 960-CPU board, and the software to run on it. That takes time and effort. Perhaps your time is worth nothing, in which case you are definitely winning in terms of TCO. But if, in the more likely case, your time does cost something, then you'll be in a hole for a long time until those electricity savings start to kick in.

And when the newer gen embedded chips come out with the 64-bit instruction set do you do it all again?
Old 2017-06-14, 07:24   #3
Nick
Dec 2012
The Netherlands


Originally Posted by retina
You actually have to design a 960 CPU board, and the software to run on it. That takes time and effort.
It sounds to me like that's the part LaurV enjoys most!
Old 2017-06-14, 09:03   #4
ET_
Aug 2002
Team Italia


You could crowdfund your job, asking people interested in the project to advance part of the money for their board.
Old 2017-06-14, 10:36   #5
Romulan Interpreter
LaurV
Jun 2011


Originally Posted by ET_
You could crowdfund your job, asking people interested in the project to advance part of the money for their board.
I was thinking about that, but you see, I actually have nothing besides some crazy ideas, a bunch of paper-and-pencil (well... Excel, but ignore it) calculations, and some trial pieces of code squeezed in between the daily job. Developing the board would not be difficult, I am sure my company will support me with production resources, and buying a reel of (say) 2000 pieces of STM32F030F4 (or even F3) for a few hundred dollars is something I can afford from my own pocket. If the attempt fails completely, I have a use for them anyhow in other products we make here. And of course, I would love to play on the software side, as Nick said. The problem is not there, but in the fact that I don't know if and how this will work, and whether it will be worth all the effort and time. As I said, besides some wild guesses based on some trials and extrapolated calculations, I have nothing else. For crowdfunding, or whatever, I would have to come up with at least a proof of concept...

I hope I can pull myself together sometime in the future and come up with something more palpable...

But well, this was just a trial post

Last fiddled with by LaurV on 2017-06-14 at 10:37
Old 2017-06-14, 10:43   #6
firejuggler
Apr 2010
Over the rainbow


Start small, scale up? Start with 4, and if it works, scale up?
Old 2017-06-14, 11:37   #7
Tribal Bullet
jasonp
Oct 2004


You can definitely start small; this sounds like one of those projects where you won't know what part needs the most effort until you try to build something end-to-end. E.g. routing 960 cores on a bus sounds pretty painful. How do you control such a board? Serial port to a master M3 in the corner? What about delivering data to each core, and collecting results?

You have more choices than 16-bit and 32-bit words; what about 28-bit words? GMP used to have a feature called 'nails', where every word held fewer bits than the machine word allowed, precisely because having extra headroom in every word made a lot of multiple-precision arithmetic faster on machines that didn't have explicit carries. 28-bit words cost only a marginal loss of efficiency if you are implementing multiplication via sequences of adds, and carries are still accessible after a long sequence of 28-bit multiply-adds. Definitely use Montgomery multiplication; it will turn a modular multiply into about 2.5 regular multiplies plus a little overhead.
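For illustration, here is a minimal single-word Montgomery multiply in C (our sketch: 64-bit words and a 128-bit temporary stand in for the limb-by-limb version one would actually write over 16- or 28-bit digits):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* n' = -n^-1 mod 2^64 for odd n, via Newton iteration on the 2-adic
 * inverse: x <- x*(2 - n*x) doubles the number of correct low bits. */
static uint64_t mont_nprime(uint64_t n)
{
    uint64_t x = n;              /* correct to 3 bits: n*n = 1 mod 8 for odd n */
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;          /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
    return (uint64_t)0 - x;      /* x is n^-1 mod 2^64; we want -n^-1 */
}

/* Returns a*b*2^-64 mod n (requires a, b < n, n odd, n < 2^63 so the
 * 128-bit sum cannot overflow). One full multiply, one low-half multiply,
 * one full multiply-add: roughly the "2.5 multiplies" mentioned above. */
static uint64_t montmul(uint64_t a, uint64_t b, uint64_t n, uint64_t nprime)
{
    u128 t = (u128)a * b;
    uint64_t m = (uint64_t)t * nprime;            /* low half only */
    uint64_t r = (uint64_t)((t + (u128)m * n) >> 64);
    return r >= n ? r - n : r;
}
```

For TF one would keep residues in Montgomery form throughout the exponentiation (converting once on entry), so the stray 2^-64 factors cancel and only the final result needs converting back.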

There have been many efforts to develop specialized coprocessors. See for example COPACOBANA, or the little Parallella boards xilman is playing with. An older effort (before 2000) put 8 StrongARM processors on a PCI board. Deep learning is hot right now, and low-precision integer-only architectures like yours have a lot of promise for training neural networks.

The money to get started on this is pretty trivial, especially if you are used to having a feral teenager. Your judgment call is how far to go when it starts to get tough and you have already sunk a lot of your time into it.
Old 2017-06-15, 09:07   #8
Romulan Interpreter
LaurV
Jun 2011


Thanks Jason, input from "msieve's daddy" is very much appreciated! You are one of the guys here who have really smoked all these things (at least at the software level), and our newly started blog feels quite honored by the attention you pay to it.

However, I have an issue here: I don't have any idea how 28 bits can help me on M0. Can you elaborate?

Because when I multiply two 28-bit numbers, I may get a 56-bit result, of which the most significant 24 bits will be lost (there is no way to get them: there is no instruction that returns the result on 64 bits, and no "split" instructions like other MCUs have, one to fetch "the most significant 32" and another to fetch "the least significant 32"). There are only 32-bit registers, which you multiply with the MULS instruction in a single clock cycle (which is good!), and you get a 32-bit result containing the least significant 32 bits of the product. The most significant bits are lost. See for example page 55 of the programming manual.
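The standard workaround (a hypothetical C sketch; on a real M0 the register pressure is the painful part) is to synthesize the missing widening multiply from 16-bit halves, so that every product fits in the 32 bits MULS keeps:

```c
#include <stdint.h>

/* Widening 32x32->64 multiply built only from multiplies that keep the
 * low 32 bits of the product (the Cortex-M0 MULS constraint). Splitting
 * the operands into 16-bit halves keeps every partial product in range. */
void mul32_wide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;

    uint32_t ll = al * bl;                        /* bits  0..31 */
    uint32_t lh = al * bh;                        /* bits 16..47 */
    uint32_t hl = ah * bl;                        /* bits 16..47 */
    uint32_t hh = ah * bh;                        /* bits 32..63 */

    uint32_t mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);
    *lo = (mid << 16) | (ll & 0xFFFF);
    *hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
}
```

Four multiplies plus a handful of adds and shifts per widening product, which is roughly why everything comes out near double the cost of the M3's single long multiply.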

With the M3 things are different, as there is a long multiplier available, which multiplies two 32-bit registers and gives the result in another two registers (32+32 bits). You also have the possibility to ask for just the highest or the lowest 32 bits of the result. There is also a hardware divider, etc. (manual here, pages 83-85).

About the interface: well, kinda serial. Nothing fancy, which is why I say the "contraption" cannot be used for LL. There can be 80 "biscuits" in two rows of 40, or 4 rows of 20, on the "mobo". Each "biscuit" has 12 MCUs ("cores") on it, connected somehow together (this is already well defined, and it has practical reasons), and it connects to the external world with 8 wires, of which two are power and the other 6 are the "serial interface" (paired somehow, with data in/out and clock in/out; they mainly work like the shift registers you have in LED strips). Each "core" has a number (that can be set independently), and that number decides which "class" it will try to TF. This way, you only need to communicate to the "mobo" the exponent and the start/stop k for TF. Each "core" will pick its own class from the 960 available (in a 4620 scheme) or the 96 available (in a 420 scheme) and crunch it independently, sieving and all.
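Our reading of the class scheme (stated as an assumption, in the style mfaktc uses): candidate factors have the form q = 2kp+1, k is classified mod 4620 = 4·3·5·7·11 (or mod 420 = 4·3·5·7), and a class survives only if q can be ±1 mod 8 and is not forced to be divisible by one of the small primes. A quick C sketch of the sieve reproduces the 960 and 96 figures:

```c
#include <stdint.h>

/* Count surviving factor classes for exponent p. k is classified mod
 * nclasses; q = 2*k*p + 1 is class-invariant mod 8 and mod each small
 * prime dividing nclasses, so a whole class can be killed at once when
 * q is not +-1 mod 8 or is always divisible by 3, 5, 7 (or 11). */
int count_classes(uint64_t p, uint32_t nclasses)
{
    const uint32_t primes[] = {3, 5, 7, 11};
    int alive = 0;
    for (uint64_t k = 0; k < nclasses; k++) {
        uint64_t q = 2 * k * p + 1;
        if (q % 8 != 1 && q % 8 != 7)
            continue;                 /* factors of M_p are +-1 mod 8 */
        int ok = 1;
        for (int i = 0; i < 4; i++)
            if (nclasses % primes[i] == 0 && q % primes[i] == 0)
                ok = 0;               /* whole class divisible -> dead */
        if (ok) alive++;
    }
    return alive;
}
```

So a core numbered 0..959 (or 0..95) would map its number onto one of the surviving residues and then walk k in steps of 4620 (or 420), sieving within its own class.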

The advantage of the approach is that one doesn't need all 80 biscuits on the mobo; in fact you don't need a "mobo" at all. You can start with a single "core" (any of the Nucleo or Discovery boards, mostly provided for free by ST) and do your software tweaks, then make a biscuit, then make two biscuits... hehe... you get the point. When you have 96 "cores" (8 biscuits) you can already "crunch"... And if you do have a "mobo", it is mainly some rows of 8-pin connectors and a couple of FTDI interface chips.

Right now, however, this is only dreaming... we will see what the time brings...

Last fiddled with by LaurV on 2017-06-15 at 09:21
Old 2017-06-15, 09:30   #9
xilman
May 2003
Down not across


LaurV: I've long thought about, and have discussed, the design and implementation of a system such as you describe, though my ideas have been more along the lines of 128 or 256 moderately chunky machines on a single board. Think "Raspberry Pi" for an idea of what I call moderately chunky. Each machine would have adequate memory --- 512M and up --- and each board would have a FEP (front-end processor) responsible for talking to the outside world, which would see the system as a /24 or /25 IP network. Note that the internal comms need not be IP, only that it looks that way to the rest of the world.

I am willing to pontificate at greater length here, or we could switch to email if you are interested and no-one else is.
Old 2017-06-15, 10:14   #10
Romulan Interpreter
LaurV
Jun 2011


Thanks. That would be something much too complicated to take on by my "one-man army", but thanks anyhow. In that case, indeed, we may need crowdfunding and all the stuff already floated in this thread. What I am talking about is something which costs peanuts, consumes peanuts, and can be done by a determined guy on a low budget in his free/fun time. We fenced for a while with things like the bitcoin mining rigs you describe, and we decided they are too complicated for our modest knowledge...

Last fiddled with by LaurV on 2017-06-15 at 10:17
Old 2017-06-15, 13:33   #11
kladner
Jul 2011
In My Own Galaxy!


Originally Posted by xilman
I am willing to pontificate at greater length here, or we could switch to email if you are interested and no-one else is.
I value such discussions, even if I understand next-to-nothing of what is being said.
