
Re: CentOS aarch64 on Raspberry PI 3

DARDO ARIEL VIÑAS VISCARDI
 

Adrian!! I'm trying to follow your guide: https://opensource.com/article/18/1/how-build-hpc-system-raspberry-pi-and-openhpc#comment-152896

How did you do this?


Re: CentOS aarch64 on Raspberry PI 3

Adrian Reber
 

On Wed, Mar 28, 2018 at 03:04:36PM +0200, Alexandre Strube wrote:
I'm not sure you can use the aarch64 image for it, but you can use the
minimal image made specifically for the raspberry pi 3:

http://mirror.centos.org/altarch/7/isos/armhfp/

That is 32-bit only. What I did for a demo was to use a CentOS 64-bit
aarch64 user-space with a Fedora 64-bit kernel, which works on the
Raspberry Pi 3.
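A very rough sketch of that approach (the file names and mount point below are illustrative assumptions, not the exact steps used):

  # CentOS 7 aarch64 root filesystem unpacked at /mnt/rootfs, then install a
  # Fedora aarch64 kernel build that supports the Pi 3 into it:
  rpm --root /mnt/rootfs -ivh kernel-core-4.*.aarch64.rpm   # hypothetical Fedora kernel RPM

  # finally copy the kernel, device trees and Pi firmware onto the boot
  # partition that the Raspberry Pi firmware reads.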

Adrian

2018-03-28 14:58 GMT+02:00 DARDO ARIEL VIÑAS VISCARDI <
dardo.vinas@...>:

Has anyone found a way to run CentOS 7 on an RPi 3?

I want to build a small cluster based on them and can't find a way to load
CentOS for aarch64.


Re: CentOS aarch64 on Raspberry PI 3

Alexandre Strube
 

I'm not sure you can use the aarch64 image for it, but you can use the minimal image made specifically for the raspberry pi 3:

http://mirror.centos.org/altarch/7/isos/armhfp/

2018-03-28 14:58 GMT+02:00 DARDO ARIEL VIÑAS VISCARDI <dardo.vinas@...>:

Has anyone found a way to run CentOS 7 on an RPi 3?

I want to build a small cluster based on them and can't find a way to load CentOS for aarch64.




--
Alexandre Strube
surak@...


CentOS aarch64 on Raspberry PI 3

DARDO ARIEL VIÑAS VISCARDI
 

Has anyone found a way to run CentOS 7 on an RPi 3?

I want to build a small cluster based on them and can't find a way to load CentOS for aarch64.


Spack not generating modules

Irek Porebski
 

Hi All,

I have installed Spack version 0.10.0 from OpenHPC. After that I installed Boost with "spack install boost", which installed boost@1.63.0 and added it to lmod:

$ ml spider boost/1.63.0
 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  boost: boost/1.63.0
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      Boost free peer-reviewed portable C++ source libraries
 
 
    You will need to load all module(s) on any one of the lines below before the "boost/1.63.0" module is available to load.
 
      gnu/5.4.0  mpich/3.2
      gnu/5.4.0  mvapich2/2.2
      gnu/5.4.0  openmpi/1.10.6
      gnu7/7.2.0  openmpi/1.10.7

However, if I install a different version of Boost via Spack, the software is installed only under the admin tree (/opt/ohpc/admin/spack) and not under pub (/opt/ohpc/pub). The lmod files are not created either. What am I missing? I was expecting Spack to create the modules and install the software under the pub folder, as it did for the other version, so that users can access it.
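For reference, this is roughly where the install and module destinations are configured in Spack of that vintage (the exact paths set by the OpenHPC packaging are an assumption on my part):

  # <spack prefix>/etc/spack/config.yaml
  config:
    install_tree: /opt/ohpc/pub/...          # where built packages should land
    module_roots:
      lmod: /opt/ohpc/pub/modulefiles/...    # where the lmod module files are written

  # <spack prefix>/etc/spack/modules.yaml
  modules:
    enable:
      - lmod                                 # lmod generation has to be enabled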

Could you help me troubleshoot this issue? 

Thanks,
Irek


Re: Cant run ICC on my compute nodes

Patrick Goetz
 

I'm pretty sure you're not supposed to have 2 exports with the same fsid. Check to see if /opt/intel is even being mounted.
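For example, on a compute node:

  mount | grep /opt/intel
  df -h /opt/intel

will show whether anything is actually mounted there and from where.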

On 03/22/2018 02:44 PM, DARDO ARIEL VIÑAS VISCARDI wrote:
You know, I realized that, so I added the folder to /etc/exports on the master along with the others:
/home *(rw,no_subtree_check,fsid=10,no_root_squash)
/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
/opt/intel *(ro,no_subtree_check,fsid=11)
And on the provisioning image, in $CHROOT/etc/fstab:
[root@n2 ~]# cat /etc/fstab
tmpfs / tmpfs rw,relatime,mode=555 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.0.1.1:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0
10.0.1.1:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0
10.0.1.1:/opt/intel /opt/intel nfs nfsvers=3,nodev,noatime 0 0
I ran the command "exportfs -a"
But still, after rebuilding everything, rebooting nodes, everything... the folder /opt/intel is showing the content of /opt/ohpc/pub...
Any ideas why this could be happening?


Re: wwgetfiles SLOW and flaky

 

Hi Mark,

On a node, can you verify in /warewulf/bin/wwgetfiles that the "rm -f
${LOCKF}" appears in 2 places? It should be above the 'exit 2' for a
failed download, and also above the 'exit 0'.
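Roughly this shape, in other words (a sketch of the placement only; apart from ${LOCKF} the names below are made up):

  # ...after the downloads have been attempted (sketch, not the verbatim script)
  if [ $download_failed -ne 0 ]; then
      rm -f ${LOCKF}      # drop the lock before the failure exit
      exit 2
  fi
  rm -f ${LOCKF}          # ...and again before the normal exit
  exit 0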

If that's the case, then deleting the lock files by hand means you're
removing the lock file of a currently running wwgetfiles ... which ya
probably don't want to do. ;)

-J

On Fri, Mar 23, 2018 at 2:47 PM, Mark Moorcroft <plaktau@...> wrote:
[Edited Message Follows]


I find I have to first delete all the timestamp files before I can run
wwgetfiles on the nodes, and it takes several minutes or more to run. Is
there a particular thing I should look at first to explain this?

Sorry, I actually meant that the lock files have to be deleted. But I just
run pdsh -w c[1-88] rm /tmp/wwgetfile*


Re: wwgetfiles SLOW and flaky

Mark Moorcroft
 

Well, what I find with my 87 nodes is that any time I try to get them all to run wwgetfiles, the majority say it's already running. If it runs every 5 minutes and the random delay is up to 3 minutes, I guess I can see why this happens. I can understand why they do it this way, but it makes it nearly useless to run manually.

Oh yeah, and it's Warewulf ;-)


Re: wwgetfiles SLOW and flaky

 

You can run into an issue with the timestamp file when you update a
file but the timestamp entry on a node happens to be later than that
update, so the file doesn't get pulled down because the import wasn't
complete. The timestamp should be the last time the node pulled a file;
wwgetfiles just checks whether any of its files have been updated since
then. Removing the timestamp causes the node to pull everything down
again.

If you don't want the delay, you can do something like:

pdsh -w n0[00-99] WWGETFILES_INTERVAL=0 /warewulf/bin/wwgetfiles

The delay is there to stagger the nodes when running from a cronjob so
every node in the cluster isn't hitting the http server at the same
time.
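For example, if the provisioning cron entry looks something like this (the exact stock entry may differ):

  # /etc/cron.d/wwgetfiles (illustrative)
  */5 * * * * root /warewulf/bin/wwgetfiles

then each node sleeps a random 0-180 seconds (per the default WWGETFILES_INTERVAL) before pulling files, so the whole cluster doesn't hit the master at once.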

-J

On Fri, Mar 23, 2018 at 2:53 PM, Ryan Novosielski <novosirj@...> wrote:
I believe there’s a random delay before the download is attempted. I’ve seen rare cases where I had to delete the timestamp files, but it’s not the norm. It generally takes a few minutes. You can set that to something definite and small to run it manually — take a look at /warewulf/bin/wwgetfiles:

WWGETFILES_INTERVAL=${WWGETFILES_INTERVAL:-180}

if [ -n "$WWGETFILES_INTERVAL" -a $WWGETFILES_INTERVAL -gt 0 ]; then
    if [ -n "$RANDOM" -a ! -f "/init" ]; then
        SLEEPTIME=`expr $RANDOM % $WWGETFILES_INTERVAL`
        sleep $SLEEPTIME
    fi
fi

There is probably a way that you can set it for all nodes too; I just don’t know what it is off the top of my head.

PS: you should probably specify that you’re talking about Warewulf — OpenHPC can be provisioned with other provisioners as well.

On Mar 23, 2018, at 3:47 PM, Mark Moorcroft <plaktau@...> wrote:


I find I have to first delete all the timestamp files before I can run wwgetfiles on the nodes, and it takes several minutes or more to run. Is there a particular thing I should look at first to explain this?
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novosirj@...
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'




Re: wwgetfiles SLOW and flaky

Ryan Novosielski
 

I believe there’s a random delay before the download is attempted. I’ve seen rare cases where I had to delete the timestamp files, but it’s not the norm. It generally takes a few minutes. You can set that to something definite and small to run it manually — take a look at /warewulf/bin/wwgetfiles:

WWGETFILES_INTERVAL=${WWGETFILES_INTERVAL:-180}

if [ -n "$WWGETFILES_INTERVAL" -a $WWGETFILES_INTERVAL -gt 0 ]; then
    if [ -n "$RANDOM" -a ! -f "/init" ]; then
        SLEEPTIME=`expr $RANDOM % $WWGETFILES_INTERVAL`
        sleep $SLEEPTIME
    fi
fi

There is probably a way that you can set it for all nodes too; I just don’t know what it is off the top of my head.

PS: you should probably specify that you’re talking about Warewulf — OpenHPC can be provisioned with other provisioners as well.

On Mar 23, 2018, at 3:47 PM, Mark Moorcroft <plaktau@...> wrote:


I find I have to first delete all the timestamp files before I can run wwgetfiles on the nodes, and it takes several minutes or more to run. Is there a particular thing I should look at first to explain this?
--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novosirj@...
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'


wwgetfiles SLOW and flaky

Mark Moorcroft
 
Edited


I find I have to first delete all the timestamp files before I can run wwgetfiles on the nodes, and it takes several minutes or more to run. Is there a particular thing I should look at first to explain this?

Sorry, I actually meant that the lock files have to be deleted. But I just run pdsh -w c[1-88] rm /tmp/wwgetfile*


Re: Cant run ICC on my compute nodes

DARDO ARIEL VIÑAS VISCARDI
 

Yup! You were right! Thank you very much for all your help, Karl.

2018-03-23 14:01 GMT-03:00 Karl W. Schulz <karl@...>:



> On Mar 22, 2018, at 2:44 PM, DARDO ARIEL VIÑAS VISCARDI <dardo.vinas@....ar> wrote:
>
> You know, I realized that, so I added the folder to /etc/exports on the master along with the others:
>
> /home *(rw,no_subtree_check,fsid=10,no_root_squash)
> /opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
> /opt/intel *(ro,no_subtree_check,fsid=11)
>
> And on the provisioning image, in $CHROOT/etc/fstab:
>
> [root@n2 ~]# cat /etc/fstab
> tmpfs / tmpfs rw,relatime,mode=555 0 0
> tmpfs /dev/shm tmpfs defaults 0 0
> devpts /dev/pts devpts gid=5,mode=620 0 0
> sysfs /sys sysfs defaults 0 0
> proc /proc proc defaults 0 0
> 10.0.1.1:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0
> 10.0.1.1:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0
> 10.0.1.1:/opt/intel /opt/intel nfs nfsvers=3,nodev,noatime 0 0
>
> I ran the command "exportfs -a"
>
> But still, after rebuilding everything, rebooting nodes, everything... the folder /opt/intel is showing the content of /opt/ohpc/pub...
>
> Any ideas why this could be happening?

It might be due to the fact that you are using the same fsid in the /etc/exports file.  Can you try making them unique (e.g. change the last line to have fsid=12) and see if that helps?

-k







Re: Cant run ICC on my compute nodes

Karl W. Schulz
 

On Mar 22, 2018, at 2:44 PM, DARDO ARIEL VIÑAS VISCARDI <dardo.vinas@...> wrote:

You know, I realized that, so I added the folder to /etc/exports on the master along with the others:

/home *(rw,no_subtree_check,fsid=10,no_root_squash)
/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
/opt/intel *(ro,no_subtree_check,fsid=11)

And on the provisioning image, in $CHROOT/etc/fstab:

[root@n2 ~]# cat /etc/fstab
tmpfs / tmpfs rw,relatime,mode=555 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.0.1.1:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0
10.0.1.1:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0
10.0.1.1:/opt/intel /opt/intel nfs nfsvers=3,nodev,noatime 0 0

I ran the command "exportfs -a"

But still, after rebuilding everything, rebooting nodes, everything... the folder /opt/intel is showing the content of /opt/ohpc/pub...

Any ideas why this could be happening?
It might be due to the fact that you are using the same fsid in the /etc/exports file. Can you try making them unique (e.g. change the last line to have fsid=12) and see if that helps?
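In other words, something along these lines in /etc/exports:

  /home         *(rw,no_subtree_check,fsid=10,no_root_squash)
  /opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
  /opt/intel    *(ro,no_subtree_check,fsid=12)

then re-export (e.g. "exportfs -ra" on the head node) and remount on a node to confirm that the two mounts now show different content.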

-k


Re: Cant run ICC on my compute nodes

DARDO ARIEL VIÑAS VISCARDI
 

You know, I realized that, so I added the folder to /etc/exports on the master along with the others:

/home *(rw,no_subtree_check,fsid=10,no_root_squash)
/opt/ohpc/pub *(ro,no_subtree_check,fsid=11)
/opt/intel *(ro,no_subtree_check,fsid=11)

And on the provisioning image, in $CHROOT/etc/fstab:

[root@n2 ~]# cat /etc/fstab  
tmpfs / tmpfs rw,relatime,mode=555 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.0.1.1:/home /home nfs nfsvers=3,nodev,nosuid,noatime 0 0
10.0.1.1:/opt/ohpc/pub /opt/ohpc/pub nfs nfsvers=3,nodev,noatime 0 0
10.0.1.1:/opt/intel /opt/intel nfs nfsvers=3,nodev,noatime 0 0

I ran the command "exportfs -a"

But still, after rebuilding everything, rebooting nodes, everything... the folder /opt/intel is showing the content of /opt/ohpc/pub...

Any ideas why this could be happening?
 


Re: shine-ohpc doesn't work

Reese Baird
 

Hi Götz -
The shine issue has languished on the backlog for too long, and for that I apologize. Our biggest challenge is testing it: it is currently difficult for our CI jobs to have elevated privileges on our Lustre system. If we could come up with a meaningful integration test (your input would be greatly appreciated) that didn't require root on our Lustre system, I think we could resolve this quickly. Otherwise, perhaps deprecating the component should be considered.

Cheers,
Reese

On 3/21/18, 5:51 AM, Götz Waschk <goetz.waschk@...> wrote:

Hi everyone,

I reported the issue of shine not working on GitHub six months ago,
and it hasn't even been acknowledged:
https://github.com/openhpc/ohpc/issues/541

Has this module been abandoned? If so, shouldn't it be removed?

Regards, Götz


Re: Cant run ICC on my compute nodes

Karl W. Schulz
 

On Mar 22, 2018, at 7:58 AM, DARDO ARIEL VIÑAS VISCARDI <dardo.vinas@...> wrote:

I had a problem when I tried to run a test domain in WRF on my cluster.

[prun] Error: Expected Job launcher mpiexec.hydra not found for impi

So I ssh to my node and try to run the commands myself (after loading the intel and impi modules):

icc
-bash: icc: command not found
mpiexec.hydra
-bash: mpiexec.hydra: command not found
mpirun
-bash: mpirun: command not found

Any idea why this happens? On my master I can't find the commands after loading the module (my master isn't acting as a compute node).
Did you install the parallel studio package on the head node in the default path, or put it in a path that is already visible to the compute nodes (like /opt/ohpc/pub/intel)? If you chose the default (which is likely /opt/intel), you will want to make sure to export that path to your compute nodes (so, update /etc/exports on head node and /etc/fstab on computes) if you haven’t already.
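For example, assuming the default /opt/intel install location and the same conventions used elsewhere in this thread, the relevant lines would look something like:

  # head node /etc/exports (the fsid just needs to be unique among your exports)
  /opt/intel *(ro,no_subtree_check,fsid=12)

  # compute image $CHROOT/etc/fstab (10.0.1.1 being the head node's cluster-internal address)
  10.0.1.1:/opt/intel /opt/intel nfs nfsvers=3,nodev,noatime 0 0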

-k


Re: Cant run ICC on my compute nodes

Simba Nyamudzanga
 

The error you are getting might be because the path to the executables is not set. To check whether the path is set, use:

which mpirun
which icc
which mpiexec.hydra

If this does not show the path to the respective executable, try configuring the path by using:

export PATH=$PATH:/path/to/icc/executable/directory

Do the same for mpirun and mpiexec.hydra

On Thu, Mar 22, 2018 at 2:58 PM, DARDO ARIEL VIÑAS VISCARDI <dardo.vinas@...> wrote:
I had a problem when I tried to run a test domain in WRF on my cluster. 

 [prun] Error: Expected Job launcher mpiexec.hydra not found for impi

So I ssh to my node and try to run the commands myself (after loading the intel and impi modules):
 
icc
-bash: icc: command not found
mpiexec.hydra
-bash: mpiexec.hydra: command not found
mpirun
-bash: mpirun: command not found

Any idea why this happens? On my master I can't find the commands after loading the module (my master isn't acting as a compute node).

This is my slurm.conf config for the nodes:

# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
NodeName=yaku04 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku03 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku02 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku01 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku Weight=10 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
PartitionName=normal Nodes=yaku0[1-4] Default=YES MaxTime=24:00:00 State=UP PriorityTier=1
PartitionName=mono Nodes=yaku01 Default=NO MaxTime=4:00:00 State=UP PriorityTier=1
PartitionName=intensiva Nodes=yaku0[1-4] Default=NO MaxTime=UNLIMITED State=UP PriorityTier=1 PreemptMode=requeue
PartitionName=hipri Nodes=yaku0[1-4] Default=NO MaxTime=UNLIMITED State=UP PriorityTier=2 PreemptMode=off
PartitionName=Infiniband Nodes=yaku0[2-4] Default=NO MaxTime=UNLIMITED State=UP PriorityTier=2 PreemptMode=off
ReturnToService=1


 
 



Cant run ICC on my compute nodes

DARDO ARIEL VIÑAS VISCARDI
 

I had a problem when I tried to run a test domain in WRF on my cluster. 

 [prun] Error: Expected Job launcher mpiexec.hydra not found for impi

So I ssh to my node and try to run the commands myself (after loading the intel and impi modules):
 
icc
-bash: icc: command not found
mpiexec.hydra
-bash: mpiexec.hydra: command not found
mpirun
-bash: mpirun: command not found

Any idea why this happens? On my master I can't find the commands after loading the module (my master isn't acting as a compute node).

This is my slurm.conf config for the nodes:

# COMPUTE NODES
# OpenHPC default configuration
PropagateResourceLimitsExcept=MEMLOCK
AccountingStorageType=accounting_storage/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
NodeName=yaku04 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku03 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku02 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku01 Weight=100 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
NodeName=yaku Weight=10 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
PartitionName=normal Nodes=yaku0[1-4] Default=YES MaxTime=24:00:00 State=UP PriorityTier=1
PartitionName=mono Nodes=yaku01 Default=NO MaxTime=4:00:00 State=UP PriorityTier=1
PartitionName=intensiva Nodes=yaku0[1-4] Default=NO MaxTime=UNLIMITED State=UP PriorityTier=1 PreemptMode=requeue
PartitionName=hipri Nodes=yaku0[1-4] Default=NO MaxTime=UNLIMITED State=UP PriorityTier=2 PreemptMode=off
PartitionName=Infiniband Nodes=yaku0[2-4] Default=NO MaxTime=UNLIMITED State=UP PriorityTier=2 PreemptMode=off
ReturnToService=1


 
 


Re: Login nodes?

Mark Moorcroft
 
Edited

Disabling NM solved the DNS issue. Adding DEFROUTE to eth1 did nothing to solve the hanging salloc command, and seemed to add no value. The route command looks the same. Enabling NM again screws everything up all over again.

Edit: The salloc issue was the firewall on the login nodes. The routing and DNS issues were resolved with NM_CONTROLLED=no. Now I can get back to locking things down again and breaking things. :-\

It also appears I can't even have SELinux in permissive mode on the login nodes without it killing Slurm.
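(For anyone hitting the same salloc problem: a minimal sketch of the firewall workaround, assuming firewalld on the login nodes, is simply

  systemctl stop firewalld
  systemctl disable firewalld

on the login nodes. A more surgical rule set is possible, but interactive srun/salloc jobs also need connections back to the submitting host on ephemeral ports, which is why a default firewall breaks them.)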


Re: Login nodes?

Meij, Henk
 

GATEWAY=xxx.xxx.xxx.0
DEFROUTE=yes



add those lines to the ifcfg-eth1 file, then restart the network.
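i.e. an ifcfg-eth1 along these lines (the addresses below are placeholders; NM_CONTROLLED=no matches what ended up working elsewhere in this thread):

  # /etc/sysconfig/network-scripts/ifcfg-eth1 (sketch)
  DEVICE=eth1
  ONBOOT=yes
  BOOTPROTO=static
  IPADDR=xxx.xxx.xxx.10       # the login node's public address
  NETMASK=255.255.254.0
  GATEWAY=xxx.xxx.xxx.1       # your real gateway on that subnet
  DEFROUTE=yes
  NM_CONTROLLED=no

then "systemctl restart network".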


-Henk


From: OpenHPC-users@groups.io <OpenHPC-users@groups.io> on behalf of Mark Moorcroft <plaktau@...>
Sent: Wednesday, March 21, 2018 2:02:33 AM
To: OpenHPC-users@groups.io
Subject: Re: [openhpc-users] Login nodes?
 

[Edited Message Follows]
[Reason: Killing NetworkManager seems to help some of the issues.]

As it turns out, despite the files appearing to be written correctly, the routing gets messed up at boot time.

To begin with, I can't ping the DNS server. If I restart the network without making any changes, I can ping the DNS server, but "route" still looks all wrong.

[root@l1 ~]# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         gateway         0.0.0.0         UG    100    0        0 eth1
10.1.0.0        0.0.0.0         255.255.0.0     U     100    0        0 eth0
xxx.xxx.xxx.0   0.0.0.0         255.255.254.0   U     100    0        0 eth1

The default line should show the host name of the DNS server.

Also, the salloc command works on the head node but not on the login nodes. The job is allocated, but you never see the prompt on the compute node, and the terminal hangs until Slurm eventually gives up on the job. This has been a network routing problem every time I have seen it before.

edit: I am currently experimenting with disabling NetworkManager in ifcfg-ethx. So far that allows me to ping the DNS server, but salloc is still broken.
