mersenneforum.org  

2016-12-09, 04:19   #1
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

mpi: Two Machines Take Longer than One for LA

OK, I've been playing with mpi processing and msieve LA for a while now with some success, but this latest setup I'm trying to get running is just taxing my patience and level of knowledge.

On the success side, I have a setup with three dual core machines connected via a gigabit switch that gives me the slightest advantage (time-wise) over a quad core machine running alone.

I am trying to get the quad core to link up with a second quad core via mpi. I have everything working with the two machines connected directly via their gigabit NICs, and both report 1 Gigabit connectivity.

Running msieve LA on the two machines via mpi works with no errors.

But, for a recent c129, the single quad core machine did the LA in about 1hr 45m. When I task the two machines via mpi to run the same LA, I can't get them to do it in under 4 hours! I've tried all the various options of threads, grids, mpi processes, etc. Nothing will come back at less than 4 hours. I've also never gotten near, let alone over, 200% on either CPU as displayed by top. If I use 2 threads and 2 processes per machine, I can get around 150% on each of the 2 processes per machine. If I go to 1 process and 4 threads per machine, I still only get around 150%. If I go to a full 4 processes on both machines with 1 thread each, I get 4 processes at ~100%. But none of these calls gets me under 4 hours, and some are nearer to 5.
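For reference, the general form of what I've been trying looks roughly like this (the host file name, grid shape, and thread counts are just examples of the combinations I've cycled through, and the option syntax is from memory):
Code:
# 4 MPI processes total (2 per machine, listed in the hosts file),
# 2 threads per process, laid out as a 2x2 grid for the LA
mpirun -np 4 --hostfile mpi_hosts ./msieve -v -t 2 -nc2 2,2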

All thoughts welcome.

2016-12-09, 11:53   #2
fivemack ("(loop (#_fork))", Feb 2006, Cambridge, England)

Quote:
Originally Posted by EdH
But, for a recent c129, the single quad core machine did the LA in about 1hr 45m. When I task the two machines via mpi to run the same LA, I can't get them to do it in under 4 hours!
My suspicion is that they're spending all the time waiting for data to move over the proportionally very slow gigabit ethernet. If you have £120 to spend at ebay on second-hand Infiniband QDR cards and second-hand Infiniband cable, you can make the network forty times faster and see if that helps.

I've had situations where adding more processors via MPI *on the same motherboard* slows the task down.

Probably worth checking with iperf that you are actually getting 900Mbps or so over the gigabit.
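Something along these lines, with iperf running as a server on one box and pointed at it from the other (the hostname is just a placeholder):
Code:
# on machine A
iperf -s
# on machine B
iperf -c machine-a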

2016-12-09, 22:27   #3
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

Quote:
Originally Posted by fivemack
My suspicion is that they're spending all the time waiting for data to move over the proportionally very slow gigabit ethernet. If you have £120 to spend at ebay on second-hand Infiniband QDR cards and second-hand Infiniband cable, you can make the network forty times faster and see if that helps.

I've had situations where adding more processors via MPI *on the same motherboard* slows the task down.

Probably worth checking with iperf that you are actually getting 900Mbps or so over the gigabit.
Well, I went to install iperf and discovered that even though the second machine appears to be communicating fine within the LAN, it can't see any further, and the network-tool packages it does have are written in a manner to make them as useless as possible. They don't do anything. Debian is looking more and more like a useless POS OS. Too bad I have so many machines running it. And, right there, thinking about it: the working cluster is using Ubuntu and the one that doesn't work is trying to use two Debian systems.

Sorry for the rant!

On the brighter side, pings between the machines over the gigabit connection are much quicker than via the 10/100 switch/router connection. I also found one possible setting that may have been in error. After the current c127 finishes NFS, I'll test again.

Meanwhile, I think I can use a stand-alone version of iperf and see if that works. Otherwise, I'll beat on it a bit here and there to see if I can find out why it won't talk to the web.

I might end up just moving to Ubuntu and seeing if that takes care of all the troubles. The only reason I'm using Debian on so many machines is because, at the time I was installing the earlier systems, I couldn't get Ubuntu to run headless. Now I can.

Sorry I got long winded (fingered). Thanks for all the help...

2016-12-09, 23:00   #4
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

Just a short follow-on:

iperf was a no-go so far, but there is a definite transfer difference for normal file movement. Between the two Debian machines across 10/100, I get 11.2 MB/s. Via the gigabit, I get 33.7 MB/s. But, between the Ubuntu machines, via a switch, I get 45.2 MB/s.

And, there I have it...

2016-12-09, 23:34   #5
frmky (Jul 2003, So Cal)

For msieve LA, gigabit ethernet is slow and has very high latency. This will bottleneck the calculation. As Tom suggests, QDR Infiniband will work much better, but it's relatively expensive.
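A quick ping between the two LA nodes gives a feel for the latency side of it; gigabit ethernet round trips are typically on the order of a hundred microseconds or more, while QDR Infiniband is down in the low single-digit microseconds.
Code:
# round-trip latency between the two nodes (hostname is a placeholder)
ping -c 10 other-node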

2016-12-10, 02:14   #6
retina ("Undefined", "The unspeakable one", Jun 2006, My evil lair)

Quote:
Originally Posted by frmky
... high latency.
I suspect this is the real killer.

To the OP: You could try removing the switch and directly connecting the two machines. You might need a crossover cable, but most Ethernet chips nowadays can auto-switch (auto MDI-X), so a normal straight cable might also work. You might also need to assign static IPs, or run a DHCP server on one of the boxes.
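On Debian or Ubuntu, a static address on the direct link can be as simple as a stanza like this in /etc/network/interfaces on each box (interface name and addresses are only examples; use a different address on each end):
Code:
# second NIC, used only for the direct machine-to-machine link
auto eth1
iface eth1 inet static
    address 10.0.0.1
    netmask 255.255.255.0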

2016-12-10, 15:05   #7
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

Quote:
Originally Posted by retina
I suspect this is the real killer.

To the OP: You could try removing the switch and directly connecting the two machines. You might need a crossover cable, but most Ethernet chips nowadays can auto-switch (auto MDI-X), so a normal straight cable might also work. You might also need to assign static IPs, or run a DHCP server on one of the boxes.
Thanks, Retina. That was all done for the dual-machine setup, as explained earlier, including the direct connection. But the three-machine setup is using a switch. All my machines have static IPs, so I can easily run them headless. The three machines through a switch have a considerably faster connection than the two connected directly. Unless there's a speed loss using a standard cable instead of a crossover, I don't understand that one (except as ranted above - Debian vs. Ubuntu?).

I'm stumped about the one machine not talking to "mommy" anymore, so I can't get repository packages. Because of dependencies, I can't easily install iperf, and I haven't found a stand-alone version, but normal file transfer is slower across these two, so I don't know that I need to prove that with iperf. I think the above transfer at 33.7 MB/s is less than 1/3 of the gigabit rating, if my math is correct.

Thanks for all suggestions.

2016-12-10, 18:38   #8
fivemack ("(loop (#_fork))", Feb 2006, Cambridge, England)

Quote:
Originally Posted by EdH
I'm stumped about the one machine not talking to "mommy" anymore, so I can't get repository packages. Because of dependencies, I can't easily install iperf, and I haven't found a stand-alone version, but normal file transfer is slower across these two, so I don't know that I need to prove that with iperf. I think the above transfer at 33.7 MB/s is less than 1/3 of the gigabit rating, if my math is correct.
I find file transfer is often surprisingly slow, even between SSDs:

Code:
% scp wheat@wheat:pingle .
pingle                                        100% 1024MB  85.3MB/s   00:12    
[  4]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
so the scp file transfer runs at something like 73% of the iperf-measured link rate (85.3 MB/s ≈ 682 Mbit/s, out of 935 Mbit/s).

I too have a problem with machines not talking to the net-at-large; I ended up using 'apt-get download iperf' on one that worked and 'dpkg -i iperf_2.0.5-3_amd64.deb' after scp'ing the file over. I'm wondering whether a domestic edge-router is unhappy to have twenty distinct machines behind it.
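In other words, roughly this (the .deb filename is whatever apt-get download fetches, 'stuck-box' is a placeholder, and you may or may not need sudo):
Code:
apt-get download iperf                              # on a machine that can still reach the mirrors
scp iperf_2.0.5-3_amd64.deb user@stuck-box:         # push it across the LAN
ssh user@stuck-box sudo dpkg -i iperf_2.0.5-3_amd64.deb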

2016-12-10, 18:41   #9
xilman ("Bamboozled!", May 2003, Down not across)

Quote:
Originally Posted by EdH
I think the above transfer at 33.7 MB/s is less than 1/3 of the gigabit rating, if my math is correct.
Your math is correct.

There are communication overheads such that you will never reach 1000 megabits per second for application data. Two of my systems connected through a gigabit switch usually transfer data at around 75 MB/s over sftp. That's 60% of the notional 1 Gbps and is a fair indication of the sort of overheads that you can expect. More efficient protocols than sftp will do better but I doubt you'll ever get much more than 80% (100 MB/s) in practice.
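Spelling out the arithmetic:
Code:
1 Gbit/s / 8 bits per byte  = 125 MB/s   theoretical payload ceiling
75 MB/s   / 125 MB/s        = 60%        sftp through my gigabit switch
33.7 MB/s / 125 MB/s        ~ 27%        your Debian-to-Debian gigabit link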

2016-12-10, 20:53   #10
EdH ("Ed Hall", Dec 2009, Adirondack Mtns)

Quote:
Originally Posted by fivemack
...
I too have a problem with machines not talking to the net-at-large; I ended up using 'apt-get download iperf' on one that worked and 'dpkg -i iperf_2.0.5-3_amd64.deb' after scp'ing the file over. I'm wondering whether a domestic edge-router is unhappy to have twenty distinct machines behind it.
Thank you for the new info (to me) about downloading from the repo and copying it to the distant end. I had gotten an iperf...deb from iperf.fr, but I couldn't handle the dependencies.

iperf appears to show all is well with the connection:
via the 10/100 path:
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   116 MBytes  96.0 Mbits/sec
via the 1Gb path:
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec
These are two machines on the Gb switch (two of the ones that work so much better):
Code:
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec
I'm leaning even more toward blaming Debian. And, yes, I too have over twenty distinct physical machines, several with extra accounts, like the ones for mpi.

I do have one limiting issue I'm aware of, in that I haven't been able to figure out network sharing for the mpi drive without using sshfs, which I'm sure is digging into the bandwidth. But that's in use with the other, three-machine setup, too. Is there something about network drive sharing that I could maybe get working that would do better than my sshfs?
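For reference, what I have now is just a plain sshfs mount of the shared working directory on the second box, roughly like this (user name and paths are made up):
Code:
# mount the master's LA working directory on the second quad core
sshfs mpi@quad1:/home/mpi/la-run /home/mpi/la-run -o reconnect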

Thanks to all for the help. I'm learning more stuff...