In a moment of lunacy, I ran linear algebra on the Jetson board.

The right options seem to be la_block=8192 la_superblock=98304. The board has 32k-per-core L1 caches and a 2M L2 cache shared across four cores. With la_superblock=196608 it runs at about the same speed and with 393216 a good deal slower; la_block=16384 is a lot slower.
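
To put those numbers next to the cache sizes, here is a rough sketch of the arithmetic. It assumes the la_block and la_superblock values are byte counts, which I have not confirmed, and that "32k" and "2M" mean KiB and MiB. The helper name cache_fraction is made up for illustration.

```python
# Cache sizes quoted above, taken here as KiB / MiB (an assumption).
L1_PER_CORE = 32 * 1024          # 32k L1 per core
L2_SHARED = 2 * 1024 * 1024      # 2M L2 shared by all cores
CORES = 4
L2_PER_CORE = L2_SHARED / CORES  # even split of L2, for comparison only


def cache_fraction(setting, cache_bytes):
    """Return setting / cache_bytes, assuming the setting is in bytes."""
    return setting / cache_bytes


# la_block values tried, compared against one core's L1.
for block in (8192, 16384):
    print("la_block", block, "L1 fraction", cache_fraction(block, L1_PER_CORE))

# la_superblock values tried, compared against a per-core share of L2.
for superblock in (98304, 196608, 393216):
    print("la_superblock", superblock,
          "L2 share fraction", cache_fraction(superblock, L2_PER_CORE))
```

Under those assumptions the fast la_block is a quarter of L1 and the slow one is half of it, while the slowest la_superblock is three quarters of a per-core share of L2. That fits the timings, but it is only a guess at why.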

It takes just under ten hours on four threads for a 1.98M matrix, compared to just over two hours on four threads on an i7-4770. Not bad for a machine that fits under an iPad Mini.
