mersenneforum.org  

Old 2018-04-21, 13:23   #1
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

1100011101111₂ Posts
Much MPI confusion

So: I have created a new key pair, added the public half of it to .ssh/authorized_keys on the compute nodes, and run 'ssh-add {private key}' on the head node.

Code:
oak@oak:/scratch/stoat$ ssh birch@birch1 hostname
birch1
oak@oak:/scratch/stoat$ mpirun -H birch@birch1,birch@birch2 hostname
birch2
birch1
oak@oak:/scratch/stoat$ mpirun -H birch@birch4,birch@birch3 hostname
birch4
birch3
oak@oak:/scratch/stoat$ mpirun -H birch@birch4,birch@birch3,birch@birch1,birch@birch2 hostname
Host key verification failed.

I can launch jobs on any pair of hosts but not on any set of more than two, and the error message is not exactly helpful.

If I strace the mpirun job, it only ever tries to connect to the first n-1 of the hosts, but the ssh command it uses works fine when I reconstruct it and run it from the command line.
Old 2018-04-21, 15:59   #2
chris2be8
Sep 2009

2·1,021 Posts

Looking in syslog on the destination systems should show some sshd messages if oak got as far as connecting to them. Their absence would suggest it's trying to connect to another system (or other systems).

Can you tell mpirun to use ssh -v to connect? That should get some more info out of ssh. Or update .ssh/config to request more diagnostics?

Chris
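
Both suggestions can be sketched as follows, assuming the MPI in use is Open MPI (the thread does not say which implementation it is). `plm_rsh_agent` is the Open MPI MCA parameter that sets the remote-launch command, and `LogLevel` is a standard `ssh_config` keyword. The `Host birch*` pattern and the three-host list are guesses based on the hostnames earlier in the thread. The script only prints its suggestions; it does not edit `~/.ssh/config` or start a job.

```shell
#!/bin/sh
# Sketch only: assumes Open MPI; the "birch*" pattern is inferred from the thread.

# 1. An ssh_config stanza that asks for debug output on the compute nodes.
#    Held in a variable and printed, so nothing under ~/.ssh is modified.
stanza='Host birch*
    LogLevel DEBUG'
echo "Stanza to append to ~/.ssh/config on the head node:"
printf '%s\n' "$stanza"

# 2. Alternatively, make mpirun itself launch through "ssh -v".
#    plm_rsh_agent is specific to Open MPI; other MPI implementations differ.
verbose_cmd='mpirun --mca plm_rsh_agent "ssh -v" -H birch@birch1,birch@birch2,birch@birch3 hostname'
echo "Command to try:"
echo "$verbose_cmd"
```

The second form has the advantage that the extra verbosity applies to one run only and leaves the ssh configuration untouched.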
Old 2018-04-21, 16:16   #3
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England

13·491 Posts

Looking in /var/log/auth.log on the four machines, I get

Code:
Apr 21 17:10:42 birch1 sshd[25767]: Connection closed by 172.26.200.103 port 45598 [preauth]
when I try submitting three jobs; and when I try submitting four I get
Code:
Apr 21 17:12:08 birch4 sshd[7969]: Failed password for birch from 172.26.200.103 port 39336 ssh2
Apr 21 17:12:08 birch4 sshd[7969]: Failed password for birch from 172.26.200.103 port 39336 ssh2
Apr 21 17:12:08 birch4 sshd[7969]: Connection closed by 172.26.200.103 port 39336 [preauth]
even though I'm attempting to use public-key authentication, and even though I get an 'Accepted publickey' message when I submit one job to birch@birch4.
Old 2018-04-21, 16:23   #4
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados

2·4,787 Posts

Quote:
Originally Posted by fivemack View Post
...even though I'm attempting to use public-key authentication...
Check that the permissions on ~/.ssh/authorized_keys are "-rw-------"; if they are not, run "chmod go-rwx ~/.ssh/authorized_keys" on the problematic nodes.
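
A minimal sketch of that check, run against a scratch directory rather than the real `~/.ssh` so it is safe to try anywhere. The scratch layout and the variable names are illustrative. The extra `chmod 700` on the directory is an addition not mentioned in the post: with sshd's default `StrictModes yes`, the permissions of the directories leading to the key file are checked as well.

```shell
#!/bin/sh
# Sketch: demonstrates the permission fix on a scratch copy, not on the real ~/.ssh.
demo=$(mktemp -d)
mkdir "$demo/.ssh"
touch "$demo/.ssh/authorized_keys"
chmod 664 "$demo/.ssh/authorized_keys"   # deliberately too permissive

# Same command as in the post, pointed at the scratch file.
chmod go-rwx "$demo/.ssh/authorized_keys"
# Tighten the directory as well (assumption: sshd runs with StrictModes enabled).
chmod 700 "$demo/.ssh"

# The first ten characters of "ls -l" output are the mode string.
file_mode=$(ls -l "$demo/.ssh/authorized_keys" | cut -c1-10)
dir_mode=$(ls -ld "$demo/.ssh" | cut -c1-10)
echo "authorized_keys: $file_mode"
echo ".ssh:            $dir_mode"
```

On a real node the same two `ls` commands, pointed at `~/.ssh`, show whether the fix is needed before anything is changed.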

Last fiddled with by chalsall on 2018-04-21 at 16:24
Old 2018-04-21, 16:23   #5
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England

13·491 Posts

Oh! It's doing something complicated and hierarchical, where some of the jobs are started from other slaves rather than from the master - I'm getting connection-refused from IP addresses which aren't the IP address of oak.

When I put the id_ed25519 file in the .ssh directory on all the nodes as well as on the master it starts dispatching properly.
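
If the launcher is Open MPI (an assumption; the thread never names the implementation), this hierarchical start-up matches its tree-based spawn, in which some per-node daemons are started over ssh from other compute nodes. There are then two ways to deal with it: give every node the credentials to reach the others, as described here, or ask mpirun to start everything directly from the head node with the `plm_rsh_no_tree_spawn` MCA parameter. The sketch below only prints the commands for both options; the host list is taken from the first post.

```shell
#!/bin/sh
# Sketch only: assumes Open MPI. Commands are printed, not executed.
HOSTS="birch@birch1 birch@birch2 birch@birch3 birch@birch4"

# Join the host list with commas for mpirun -H.
host_list=$(echo $HOSTS | tr ' ' ',')

# Option 1: launch every daemon directly from the head node,
# so no compute node ever needs to ssh to another one.
flat_cmd="mpirun --mca plm_rsh_no_tree_spawn 1 -H $host_list hostname"
echo "$flat_cmd"

# Option 2: keep the tree launch and copy the private key to each node,
# which is what was done by hand in this post.
for h in $HOSTS; do
    echo "scp ~/.ssh/id_ed25519 ${h}:.ssh/id_ed25519"
done
```

Option 1 keeps the private key on the head node only; option 2 leaves the launch topology alone at the cost of having the key on every machine.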
Old 2018-04-21, 16:28   #6
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

10100110111010₂ Posts

Quote:
Originally Posted by fivemack View Post
Oh! It's doing something complicated and hierarchical, where some of the jobs are started from other slaves rather than from the master - I'm getting connection-refused from IP addresses which aren't the IP address of oak.

When I put the id_ed25519 file in the .ssh directory on all the nodes as well as on the master it starts dispatching properly.
I'd wondered if that might be the case but your first report suggested otherwise.

When setting up BackupPC I had a lot of hassle until I ran "ssh host ls" from each of the machines in an interactive session and confirmed, at the host-key prompt, that all the targets were legit. You might try the same. It's an n^2 process, but at least it only needs doing once.
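
The n^2 walk can be sketched as a nested loop that lists every ordered (source, target) pair. The machine names come from earlier in the thread; user names are left out for brevity, and the commands are printed rather than run, because each one has to be typed in an interactive session on the source machine so that the host-key prompt can be answered.

```shell
#!/bin/sh
# Sketch: enumerate every ordered (source, target) pair; nothing is executed.
NODES="oak birch1 birch2 birch3 birch4"

pairs=0
for src in $NODES; do
    for dst in $NODES; do
        # A machine does not need to log in to itself.
        [ "$src" = "$dst" ] && continue
        # To be run by hand on $src, answering the host-key prompt.
        echo "on $src: ssh $dst ls"
        pairs=$((pairs + 1))
    done
done
echo "$pairs logins to confirm"
```

For the five machines named in the thread this lists 20 logins, which is tedious but, as noted, only needs doing once per pair.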
Old 2018-05-12, 15:45   #7
pinhodecarlos
"Carlos Pinho"
Oct 2011
Milton Keynes, UK

2×23×107 Posts

I'm not sure where best to post this, but TACC is offering the following training:

Introduction To MPI Using The Interactive Parallelization Tool (IPT)


MPI (Message Passing Interface) is the principal way data is communicated between the nodes of a compute cluster. Sign up to learn about MPI and the TACC-developed Interactive Parallelization Tool (IPT), designed to parallelize serial C/C++ programs semi-automatically.

https://learn.tacc.utexas.edu/enrol/index.php?id=31
Old 2018-05-31, 13:46   #8
VictordeHolland
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

How can I tell Msieve to use only one NUMA node on my Ubuntu box with 2S Xeon E5-2650? I've compiled it without MPI (at least I think so).

I've used 'taskset' until now, but that doesn't seem to work so well.

Any suggestions?

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.