Possible workaround for WD 100% CPU bug


Rob Robinett
 

In yesterday's SD call, Bret suggested that this WD problem may be related to a problem many Linux programs have experienced since a Linux OS upgrade last June: programs which use ssh, or some library used by ssh, die if there is not at least one active ssh session. For WD, the bug is encountered inside the kiwirecorder Python program, which terminates silently after a fraction of a second when there are no ssh sessions. During our call we verified that having the WD server run a persistent ssh session to itself seems to suppress the WD 100% CPU bug.

To create such a session, I have just tested these two lines, which run a detached ssh session in the background. Run the first 'cat ...' line only once on a WD machine; the second 'ssh ...' line needs to be run after every reboot of the WD machine.

wd_client@KFS-WD3:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
wd_client@KFS-WD3:~$ ssh -o StrictHostKeyChecking=no -f localhost  "while true; do date ; sleep 3600; done >> ssh_date.log"

You can verify that such a background session is active by running "ps auxf | grep 'sleep 3600' | grep -v grep":

wd_client@KFS-WD3:~$ ps auxf | grep 'sleep 3600' | grep -v grep
wd_clie+ 2716048  0.0  0.0   9500  3292 ?        Ss   14:50   0:00          \_ bash -c while true; do date ; sleep 3600; done >> ssh_date.log
wd_clie+ 2716052  0.0  0.0   8084   576 ?        S    14:50   0:00              \_ sleep 3600
wd_clie+ 2715775  0.0  0.0  14620  2508 ?        Ss   14:50   0:00 ssh -o StrictHostKeyChecking=no -f localhost while true; do date ; sleep 3600; done >> ssh_date.log
wd_client@KFS-WD3:~$
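
The start and verify steps above could be combined so that a second session is not started by accident. This is only a minimal sketch which I have not built into WD: the function names keepalive_running and start_keepalive are hypothetical, and it assumes that pgrep (from procps) is installed and that the 'cat ...' line above has already been run once.

```shell
#!/bin/bash
# Hypothetical helper, not part of WD: start the keepalive ssh session
# only if one is not already running for the current user.

# Same remote command as in the 'ssh ...' line above
KEEPALIVE_CMD='while true; do date ; sleep 3600; done >> ssh_date.log'

# Return 0 if an ssh client to localhost running the keepalive loop exists.
# The pattern is anchored on 'ssh' so the remote bash and sleep are ignored.
keepalive_running() {
    pgrep -u "$(id -u)" -f '^ssh .*localhost .*sleep 3600' > /dev/null
}

# Start the detached session unless it is already active
start_keepalive() {
    if keepalive_running; then
        echo "keepalive ssh session already running"
    else
        ssh -o StrictHostKeyChecking=no -f localhost "${KEEPALIVE_CMD}"
        echo "keepalive ssh session started"
    fi
}
```

After sourcing those lines, running 'start_keepalive' after each reboot should start the session once and do nothing on later calls.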


You can terminate such a remote session by executing a 'kill' on those PIDs:

wd_client@KFS-WD3:~$ ps auxf | grep 'sleep 3600' | grep -v grep
wd_clie+ 2716048  0.0  0.0   9500  3292 ?        Ss   14:50   0:00          \_ bash -c while true; do date ; sleep 3600; done >> ssh_date.log
wd_clie+ 2716052  0.0  0.0   8084   576 ?        S    14:50   0:00              \_ sleep 3600
wd_clie+ 2715775  0.0  0.0  14620  2508 ?        Ss   14:50   0:00 ssh -o StrictHostKeyChecking=no -f localhost while true; do date ; sleep 3600; done >> ssh_date.log
wd_client@KFS-WD3:~$ ^C
wd_client@KFS-WD3:~$ kill 2716048 2716052 2715775
wd_client@KFS-WD3:~$ ps auxf | grep 'sleep 3600' | grep -v grep
wd_client@KFS-WD3:~$
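
To avoid copying the PIDs by hand, the second column of the ps output can be extracted with awk. Again this is just a sketch under assumptions: the function name keepalive_pids is hypothetical, and it treats every line mentioning 'sleep 3600' as part of the keepalive session, so check the PID list before killing anything if the machine runs other one-hour sleeps.

```shell
#!/bin/bash
# Hypothetical helper, not part of WD: read 'ps aux'-style lines on stdin
# and print the PID (column 2) of each line that mentions 'sleep 3600'.
# Lines containing 'awk' are skipped so this filter does not list itself,
# in the same way 'grep -v grep' is used above.
keepalive_pids() {
    awk '/sleep 3600/ && !/awk/ {print $2}'
}
```

For example, 'ps auxf | keepalive_pids' should print the three PIDs shown above, and 'ps auxf | keepalive_pids | xargs -r kill' would then terminate them (the -r option is GNU xargs and skips the kill when no PIDs are found).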


Incorporating that code into WD so that it starts cleanly at power-up would take several hours to implement, and probably a lot more time to debug on the few sites which are experiencing this problem. I am also loath to invest development and debug time in creating a band-aid for a Linux OS problem which seems likely to be fixed in the near future.

All of this should be a warning to WD users to be very cautious about upgrading the OS on their Linux servers.

--
Rob Robinett
AI6VN
mobile: +1 650 218 8896