Re: Slurm set up issues on CentOS


Adrian Reber
 

What is the value of ControlMachine= in your slurm.conf?

For the OSS EU workshop I set it to whatever 'hostname' returned.
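
A minimal sketch of that check, assuming slurm.conf uses the ControlMachine= spelling exactly as written above (parameter names in slurm.conf are not case sensitive, and newer files may use SlurmctldHost= instead; neither case is handled here). The function names configured_controller and controller_matches are hypothetical helpers, not part of Slurm.

```shell
#!/bin/sh
# Print the controller host name configured in a slurm.conf file.
# Only uncommented "ControlMachine=" lines are considered; if there are
# several, the last one is printed.
configured_controller() {
    conf_file="$1"
    sed -n 's/^[[:space:]]*ControlMachine=\([^[:space:]#]*\).*/\1/p' "$conf_file" | tail -n 1
}

# Compare the configured controller with a given host name.
# Returns 0 when they are identical, 1 otherwise.
controller_matches() {
    conf_file="$1"
    host="$2"
    configured=$(configured_controller "$conf_file")
    if [ "$configured" = "$host" ]; then
        echo "OK: ControlMachine=$configured matches $host"
        return 0
    fi
    echo "MISMATCH: ControlMachine=$configured, host is $host"
    return 1
}
```

On the head node this could be called as `controller_matches /etc/slurm/slurm.conf "$(hostname)"`, using the slurm.conf path that appears elsewhere in this thread. A mismatch would be consistent with the "not a valid controller" line in the log.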

Adrian

On Thu, Nov 07, 2019 at 03:10:36PM +0100, David Brayford wrote:
[2019-11-07T12:13:02.344] error: chdir(/var/log): Permission denied
[2019-11-07T12:13:02.344] error: Configured MailProg is invalid
[2019-11-07T12:13:02.345] Job accounting information stored, but details not gathered
[2019-11-07T12:13:02.347] slurmctld version 18.08.8 started on cluster linux
[2019-11-07T12:13:02.350] error: This host (ip-10-0-0-37/ip-10-0-0-37.us-west-2.compute.internal) not a valid controller
[2019-11-07T12:38:06.760] error: Possible corrupt pidfile `/var/run/slurmctld.pid'
[2019-11-07T12:38:06.761] error: chdir(/var/log): Permission denied
[2019-11-07T12:38:06.761] error: Configured MailProg is invalid
[2019-11-07T12:38:06.761] Job accounting information stored, but details not gathered
[2019-11-07T12:38:06.761] slurmctld version 18.08.8 started on cluster linux
[2019-11-07T12:38:06.761] error: This host (ip-10-0-0-37/ip-10-0-0-37.us-west-2.compute.internal) not a valid controller
[2019-11-07T13:10:31.735] error: chdir(/var/log): Permission denied
[2019-11-07T13:10:31.735] error: Configured MailProg is invalid
[2019-11-07T13:10:31.735] Job accounting information stored, but details not gathered
[2019-11-07T13:10:31.735] slurmctld version 18.08.8 started on cluster linux
[2019-11-07T13:10:31.735] error: This host (ip-10-0-0-37/ip-10-0-0-37.us-west-2.compute.internal) not a valid controller
[2019-11-07T13:17:07.065] error: chdir(/var/log): Permission denied
[2019-11-07T13:17:07.065] error: Configured MailProg is invalid
[2019-11-07T13:17:07.065] Job accounting information stored, but details not gathered
[2019-11-07T13:17:07.065] slurmctld version 18.08.8 started on cluster linux
[2019-11-07T13:17:07.065] error: This host (ip-10-0-0-37/ip-10-0-0-37.us-west-2.compute.internal) not a valid controller
[2019-11-07T13:34:47.603] error: Possible corrupt pidfile `/var/run/slurmctld.pid'
[2019-11-07T13:34:47.604] error: chdir(/var/log): Permission denied
[2019-11-07T13:34:47.604] error: Configured MailProg is invalid
[2019-11-07T13:34:47.604] Job accounting information stored, but details not gathered
[2019-11-07T13:34:47.604] slurmctld version 18.08.8 started on cluster linux
[2019-11-07T13:34:47.604] error: This host (ip-10-0-0-37/ip-10-0-0-37.us-west-2.compute.internal) not a valid controller


On 11/7/19 3:07 PM, jose_d wrote:

Hi, try checking (and perhaps pasting here) the contents of the slurmctld
log file configured in your slurm.conf:

# cat /etc/slurm/slurm.conf | grep SlurmctldLog
SlurmctldLogFile=/var/log/slurmctld.log
#

Many errors are described quite clearly in this file.
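
The lookup above can be sketched as a pair of small shell functions, assuming the log path is set with an uncommented SlurmctldLogFile= line and that error entries contain the literal text "error:" as in the log excerpt quoted in this thread. The names slurmctld_log_path and recent_slurmctld_errors are hypothetical helpers, not Slurm commands.

```shell
#!/bin/sh
# Print the path configured as SlurmctldLogFile in a slurm.conf file.
# Commented-out lines are ignored; the last matching line is printed.
slurmctld_log_path() {
    sed -n 's/^[[:space:]]*SlurmctldLogFile=\([^[:space:]#]*\).*/\1/p' "$1" | tail -n 1
}

# Print the last N (default 20) log lines that contain "error:".
# Returns 1 if no log file is configured or it cannot be read.
recent_slurmctld_errors() {
    conf_file="$1"
    count="${2:-20}"
    log_file=$(slurmctld_log_path "$conf_file")
    if [ -z "$log_file" ] || [ ! -r "$log_file" ]; then
        echo "cannot read slurmctld log: ${log_file:-not configured}" >&2
        return 1
    fi
    grep 'error:' "$log_file" | tail -n "$count"
}
```

For example, `recent_slurmctld_errors /etc/slurm/slurm.conf 10` would print the ten most recent error lines, provided the calling user is allowed to read the log.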

cheers

josef


On 07. 11. 19 14:47, David Brayford wrote:
I am experiencing a problem when trying to set up Slurm on the
head/master node on CentOS.

I execute the commands:
systemctl enable munge
systemctl enable slurmctld

systemctl start munge
systemctl start slurmctld

systemctl status munge
systemctl status slurmctld

but get the error message:

● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Thu 2019-11-07 13:34:47 UTC; 2s ago
  Process: 10532 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 10534 (code=exited, status=1/FAILURE)

Nov 07 13:34:47 ip-10-0-0-37.us-west-2.compute.internal systemd[1]: Starting Slurm controller daemon...
Nov 07 13:34:47 ip-10-0-0-37.us-west-2.compute.internal systemd[1]: Started Slurm controller daemon.
Nov 07 13:34:47 ip-10-0-0-37.us-west-2.compute.internal systemd[1]: slurmctld.service: main process exited, code=exited, status=1/FAILURE
Nov 07 13:34:47 ip-10-0-0-37.us-west-2.compute.internal systemd[1]: Unit slurmctld.service entered failed state.
Nov 07 13:34:47 ip-10-0-0-37.us-west-2.compute.internal systemd[1]: slurmctld.service failed.


Any suggestions on how I can resolve this issue?

David
--
Josef Dvoracek
Institute of Physics | Czech Academy of Sciences
cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669

