wwgetfiles SLOW and flaky


Mark Moorcroft
 
Edited


I find I have to first delete all the timestamp files before I can run wwgetfiles on the nodes, and it takes several minutes or more to run. Is there anything in particular I should look at first to explain this?

Sorry, I actually meant that the lock files have to be deleted. But I just run: pdsh -w c[1-88] rm /tmp/wwgetfile*


Ryan Novosielski
 

I believe there’s a random delay before the download is attempted. I’ve seen rare cases where I had to delete the timestamp files, but it’s not the norm. It generally takes a few minutes. You can set that to something definite and small to run it manually — take a look at /warewulf/bin/wwgetfiles:

WWGETFILES_INTERVAL=${WWGETFILES_INTERVAL:-180}

if [ -n "$WWGETFILES_INTERVAL" -a $WWGETFILES_INTERVAL -gt 0 ]; then
    if [ -n "$RANDOM" -a ! -f "/init" ]; then
        SLEEPTIME=`expr $RANDOM % $WWGETFILES_INTERVAL`
        sleep $SLEEPTIME
    fi
fi
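To see how big that delay can get, here's a minimal, self-contained sketch of the same arithmetic (plain bash, no Warewulf required; the echo is just for illustration, the real script sleeps instead):

```shell
# Mimic the stagger arithmetic from wwgetfiles (a sketch, not the real script).
# With the default interval of 180s, a node may sleep up to ~3 minutes before
# it even contacts the master; export a small WWGETFILES_INTERVAL to bound it.
WWGETFILES_INTERVAL=${WWGETFILES_INTERVAL:-180}

if [ "$WWGETFILES_INTERVAL" -gt 0 ]; then
    SLEEPTIME=$(( RANDOM % WWGETFILES_INTERVAL ))
else
    SLEEPTIME=0
fi
echo "would sleep for ${SLEEPTIME}s before downloading"
```

So setting WWGETFILES_INTERVAL=1 (or 0, as shown further down the thread) effectively eliminates the sleep for a manual run.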

There is probably a way that you can set it for all nodes too, I just don’t know what it is off the top of my head.

PS: you should probably specify that you’re talking about Warewulf — OpenHPC can be provisioned with other provisioners as well.

On Mar 23, 2018, at 3:47 PM, Mark Moorcroft <plaktau@...> wrote:


--
____
|| \\UTGERS, |---------------------------*O*---------------------------
||_// the State | Ryan Novosielski - novosirj@...
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'


 

You can run into an issue with the timestamp file when you update a
file: the timestamp entry on a node may happen to be later than the
point when that file was updated, even though the node never pulled the
file down because the import wasn't complete. The timestamp should be
the last time the node pulled a file; the script checks whether any of
its files have been updated since then. Removing the timestamp just
causes the node to pull everything down again.

If you want to avoid the delay, you can do something like:

pdsh -w n0[00-99] WWGETFILES_INTERVAL=0 /warewulf/bin/wwgetfiles

The delay is there to stagger the nodes when running from a cron job, so
that every node in the cluster isn't hitting the HTTP server at the same
time.
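If random staggering is a nuisance for manual runs, one alternative (a sketch, not something wwgetfiles does) is a deterministic per-node offset derived from the hostname, so each node gets the same bounded delay on every cron run:

```shell
# Deterministic stagger sketch: hash the hostname into a stable offset
# within the interval. Every run of a given node sleeps the same amount,
# so the cluster load is still spread out but behavior is predictable.
INTERVAL=180
HASH=$(hostname | cksum | cut -d' ' -f1)   # cksum gives a stable integer
OFFSET=$(( HASH % INTERVAL ))
echo "this node would sleep ${OFFSET}s"
```

The trade-off is that a hash can cluster several nodes onto the same offset, whereas $RANDOM re-rolls each run.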

-J

On Fri, Mar 23, 2018 at 2:53 PM, Ryan Novosielski <novosirj@...> wrote:




Mark Moorcroft
 

Well, what I find with my 87 nodes is that any time I try to run wwgetfiles on all of them, the majority report that it's already running. If it runs every 5 minutes and the delay interval is up to 3 minutes, I guess I can see why this happens. I can understand why they do it this way, but it makes attempting to run it manually nearly useless.

Oh yeah, and it's Warewulf ;-)


 

Hi Mark,

On a node, can you verify in /warewulf/bin/wwgetfiles that the 'rm -f
${LOCKF}' appears in two places? It should be above the 'exit 2' for a
failed download, and also above the 'exit 0'.

If that's the case, then by deleting /tmp/wwgetfile* you're removing the
lock file for an instance of wwgetfiles that is actually still
running ... which ya probably don't want to do. ;)
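The lock-file discipline being described can be sketched like this (the lock path and surrounding logic here are illustrative, not copied from wwgetfiles):

```shell
# Lock-file pattern sketch: the cleanup must run on every exit path.
LOCKF=/tmp/wwgetfiles.demo.lock   # illustrative path, not the real one

if [ -f "$LOCKF" ]; then
    echo "wwgetfiles already running" >&2
else
    touch "$LOCKF"
    # ... fetch files from the master here ...
    rm -f "$LOCKF"   # must happen on BOTH the failure (exit 2) and
                     # success (exit 0) paths, or every later run will
                     # report "already running" until someone deletes
                     # the stale lock by hand
fi
```

If either exit path skips the rm, stale locks accumulate and you end up in exactly the situation described above, where a cluster-wide manual run mostly reports "already running".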

-J

On Fri, Mar 23, 2018 at 2:47 PM, Mark Moorcroft <plaktau@...> wrote: