mersenneforum.org mpi Two Machines Take longer than One for LA

2016-12-09, 11:53   #2
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

2⁴·3·7·19 Posts

Quote:
 Originally Posted by EdH But, for a recent c129, the single quad core machine did the LA in about 1hr 45m. When I task the two machines via mpi to run the same LA, I can't get them to do it in under 4 hours!
My suspicion is that they're spending all the time waiting for data to move over the proportionally very slow gigabit ethernet. If you have £120 to spend at ebay on second-hand Infiniband QDR cards and second-hand Infiniband cable, you can make the network forty times faster and see if that helps.

I've had situations where adding more processors via MPI *on the same motherboard* slows the task down.

Probably worth checking with iperf that you are actually getting 900Mbps or so over the gigabit.
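For reference, the test looks like this (iperf 2 syntax; 'nodeB' stands in for whatever the second machine is called, and the sample output line below is just an illustration):

```shell
# On the second machine, start an iperf server:
#   iperf -s
# On the first, run a 10-second TCP test against it:
#   iperf -c nodeB -t 10
# A healthy gigabit link reports around 930-940 Mbit/s.
#
# The bandwidth figure is the second-to-last field of the summary line,
# so it is easy to pull out with awk if you want to log repeated runs:
line='[  3]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec'
bw=$(echo "$line" | awk '{print $(NF-1)}')
echo "$bw Mbit/s"
```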

Last fiddled with by fivemack on 2016-12-09 at 11:54

2016-12-09, 22:27   #3
EdH

"Ed Hall"
Dec 2009

61² Posts

Quote:
 Originally Posted by fivemack My suspicion is that they're spending all the time waiting for data to move over the proportionally very slow gigabit ethernet. If you have £120 to spend at ebay on second-hand Infiniband QDR cards and second-hand Infiniband cable, you can make the network forty times faster and see if that helps. I've had situations where adding more processors via MPI *on the same motherboard* slows the task down. Probably worth checking with iperf that you are actually getting 900Mbps or so over the gigabit.
Well, I went to install iperf and discovered that even though the second machine appears to be communicating fine within the LAN, it can't see any further, and the network tools it does have seem written to be as useless as possible - they don't do anything. Debian is looking more and more like a useless POS OS. Too bad I have so many machines running it. And, right there, thinking about it: the working cluster is using Ubuntu, and the one that doesn't work is trying to use two Debian systems.

Sorry for the rant!

On the brighter side, pings between the machines over the gigabit connection are much quicker than via the 10/100 switch/router connection. I also found one possible setting that may have been in error. After the current c127 finishes NFS, I'll test again.

Meanwhile, I think I can use a stand-alone version of iperf and see if that works. Otherwise, I'll beat on it a bit here and there to see if I can find out why it won't talk to the web.

I might end up just moving to Ubuntu to see if that takes care of all the troubles. The only reason I'm using Debian on so many machines is that, at the time I installed the earlier systems, I couldn't get Ubuntu to run headless. Now I can.

Sorry I got long winded (fingered). Thanks for all the help...

2016-12-09, 23:00   #4
EdH

"Ed Hall"
Dec 2009
Adirondack Mtns

3721₁₀ Posts

Just a short follow-on: iperf was a no-go so far, but there is a definite transfer difference for normal file movement. Between the two Debian machines across 10/100, I get 11.2 MB/s. Via the gigabit, I get 33.7 MB/s. But between the Ubuntu machines, via a switch, I get 45.2 MB/s. And there I have it...
2016-12-09, 23:34   #5
frmky

Jul 2003
So Cal

2×1,049 Posts

For msieve LA, gigabit ethernet is slow and has very high latency, which bottlenecks the calculation. As Tom suggests, QDR Infiniband works much better, but it's relatively expensive.
2016-12-10, 02:14   #6
retina
Undefined

"The unspeakable one"
Jun 2006
My evil lair

6,143 Posts

Quote:
 Originally Posted by frmky ... high latency.
I suspect this is the real killer.

To the OP: you could try removing the switch and directly connecting the two machines. You might need a crossover cable, but most Ethernet chips nowadays support auto MDI-X, so a normal straight-through cable might also work. You might also need to assign static IPs, or run a DHCP server on one of the boxes.
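Something like this, for concreteness (eth1 and the 192.168.2.x addresses are placeholders; substitute whatever your hardware uses):

```shell
# Back-to-back link: give each end an address in the same private subnet.
# (These commands need root and a real second NIC, so they are shown
# as comments rather than run directly.)
#
#   machineA# ip addr add 192.168.2.1/24 dev eth1 && ip link set eth1 up
#   machineB# ip addr add 192.168.2.2/24 dev eth1 && ip link set eth1 up
#   machineA# ping -c 3 192.168.2.2    # confirm the direct path works
#
# Sanity check that the two ends really share a /24:
a=192.168.2.1; b=192.168.2.2
[ "${a%.*}" = "${b%.*}" ] && echo "same /24: ${a%.*}.0/24"
```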

Last fiddled with by retina on 2016-12-10 at 02:14

2016-12-10, 15:05   #7
EdH

"Ed Hall"
Dec 2009

61² Posts

Quote:
 Originally Posted by retina I suspect this is the real killer. To the OP: You could try removing the switch and directly connect the two machines. You might need a crossed cable, but most Ethernet chips nowadays can auto switch so a normal straight cable might also work. Also you might need to assign static IPs, or run a DHCP server on one of the boxes.
Thanks, Retina. That was all done for the dual-machine setup, as explained earlier, including the direct connection. But the three-machine setup is using a switch. All my machines are static, so I can easily run them headless. The three machines through a switch have a considerably faster connection than the two without. Unless there's a speed loss using a standard cable instead of a crossover, I don't understand that one (except as ranted above - Debian vs. Ubuntu?).

I'm stumped about the one machine not talking to "mommy" anymore, so I can't get repository packages. Because of dependencies, I can't easily install iperf and haven't found a stand-alone version, but normal file transfer is slower across these two, so I don't know that I need iperf to prove it. I think the above transfer at 33.7 MB/s is less than 1/3 of the gigabit rating (125 MB/s), if my math is correct.

Thanks for all suggestions.

2016-12-10, 18:38   #8
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

2⁴×3×7×19 Posts

Quote:
 Originally Posted by EdH I'm stumped about the one machine not talking to "mommy" anymore, so I can't get repository packages. Because of dependencies, I can't easily install iperf and haven't found a stand-a-lone version, but normal file transfer is slower across these two, so I don't know that I need to prove that with iperf. I think the above transfer at 33.7MB/s is less than 1/3 of the gigabit rating, if my math is correct.
I find file transfer is often surprisingly slow, even between SSDs:

Code:
% scp wheat@wheat:pingle .
pingle                                        100% 1024MB  85.3MB/s   00:12
[  4]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
so the measured file-transfer rate is something like 73% of the bandwidth iperf measures.

I too have a problem with machines not talking to the net-at-large. I ended up using 'apt-get download iperf' on one that worked, then 'dpkg -i iperf_2.0.5-3_amd64.deb' after scp'ing the file over. I wonder whether a domestic edge-router is unhappy to have twenty distinct machines behind it.
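Spelled out, the workaround is (hostname 'cutoffbox' is a placeholder; the .deb name is the one apt fetched for my release):

```shell
# On a machine that CAN still reach the repos:
#   apt-get download iperf             # drops the .deb into the cwd
#   scp iperf_2.0.5-3_amd64.deb cutoffbox:
# On the cut-off machine (dependencies must already be installed):
#   dpkg -i iperf_2.0.5-3_amd64.deb
pkg="iperf_2.0.5-3_amd64.deb"
echo "transfer and install: $pkg"
```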

Last fiddled with by fivemack on 2016-12-10 at 18:39

2016-12-10, 18:41   #9
xilman
Bamboozled!

"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

2⁶×167 Posts

Quote:
 Originally Posted by EdH I think the above transfer at 33.7MB/s is less than 1/3 of the gigabit rating, if my math is correct.

There are communication overheads such that you will never reach 1000 megabits per second of application data. Two of my systems connected through a gigabit switch usually transfer data at around 75 MB/s over sftp. That's 60% of the notional 1 Gbps and a fair indication of the sort of overheads you can expect. More efficient protocols than sftp will do better, but I doubt you'll ever get much more than 80% (100 MB/s) in practice.
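The arithmetic, for reference (divide the line rate by 8 to get bytes, then take the observed fraction; the 33.7 MB/s figure is the one quoted above):

```shell
# 1 Gbit/s line rate = 125 MB/s before any protocol overhead.
raw_mbs=$(( 1000 / 8 ))                                            # 125
# 75 MB/s over sftp is 60% of line rate:
awk -v r="$raw_mbs" 'BEGIN { printf "sftp: %.0f%%\n", 100 * 75 / r }'
# and 33.7 MB/s is roughly 27%:
awk -v r="$raw_mbs" 'BEGIN { printf "scp:  %.0f%%\n", 100 * 33.7 / r }'
```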

2016-12-10, 20:53   #10
EdH

"Ed Hall"
Dec 2009

61² Posts

Quote:
 Originally Posted by fivemack ... I too have a problem with machines not talking to the net-at-large, I ended up using 'apt-get download iperf' on one that worked and 'dpkg -i iperf_2.0.5-3_amd64.deb' after scping the file. Wondering whether a domestic edge-router is unhappy to have twenty distinct machines behind it.
Thank you for the new info (to me) about downloading from the repo and copying it to the distant end. I had gotten an iperf...deb from iperf.fr, but I couldn't handle the dependencies.

iperf appears to show all is well with the connection:
via the 10/100 path:
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   116 MBytes  96.0 Mbits/sec
via the 1Gb path:
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec
These are two machines on the Gb switch (two of the ones that work so much better):
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec
I'm leaning even more toward blaming Debian. And, yes, I too have over twenty distinct physical machines, several with extra accounts, like the ones for mpi.

I do have one limiting issue I'm aware of: I haven't been able to figure out network sharing for the mpi drive without using sshfs, which I'm sure is digging into the bandwidth. But that's in use with the other, three-machine setup, too. Is there some network drive sharing I could maybe get working that would do better than my sshfs?
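For instance, from what I've read, a plain kernel NFS export would avoid the ssh encryption that sshfs pays for - something like this, if I have it right (the paths and the 192.168.2.x subnet are just guesses for illustration):

```shell
# On the machine that owns the disk (the NFS server):
#   apt-get install nfs-kernel-server
#   echo '/home/mpiuser 192.168.2.0/24(rw,sync,no_subtree_check)' >> /etc/exports
#   exportfs -ra
# On the other machine (the client):
#   apt-get install nfs-common
#   mount -t nfs 192.168.2.1:/home/mpiuser /home/mpiuser
export_line='/home/mpiuser 192.168.2.0/24(rw,sync,no_subtree_check)'
echo "$export_line"
```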

Thanks to all for the help. I'm learning more stuff...

