Looking for WD 3.0 sites reporting crash and 100% CPU when there are no ssh terminal sessions


Rob Robinett
 

I have just received reports that two sites running Ubuntu 20.04 on Intel CPUs find that unless they maintain at least one ssh session, their system falls into a state where the CPU is 100% busy and connections to the Kiwi(s) restart every few seconds.
That problem has not appeared on the WD sites I manage, so I am looking for reports from other WD users whose systems exhibit it, so that I might find what the failing sites have in common.


Jim Lill
 

FWIW, all the APi x86 boxes that I am aware of run 20.04 and have seen nothing like that, except for Wayne's when he started out with some misconfiguration.

(Replying to Rob Robinett's message of 8/13/22 19:50.)


wayne roth
 

That's essentially what's been happening on the Atomic Pis that you took a look at. It still happens on occasion. I've been more careful about ssh-ing in, executing wdln or wdle, and then exiting, which has reduced how often it happens. It can take some time after the last ssh exit for the failure to appear. Only once was I able to capture the beginning of the failure, where it appears that wsprd generates a segmentation fault after the ssh user session terminates. It happened on both APis, though, within a couple of minutes, since I went in, ran some wsprdaemon.sh commands, then exited on each of them.

Usually the journal file gets so filled with "can't find" logs, touch errors, etc. that it's hard to find the beginning of the sequence where the loop goes nuts and hammers the journal file with errors. I think that's why the CPU usage goes up to 100%.

I set up a third APi that you don't currently have access to, and that's the last one that got into the error loop. That third APi's conf file was scp'd from the others that you fixed, with minimal changes to the receiver name and the bands it operates on (40, 30, 20 m). I did a git pull just yesterday to get your latest WD code, and am running the 20.04 OS.
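Since the hard part described above is finding where the error sequence starts in a flooded journal, here is a minimal Python sketch of one way to look for it. It is not part of WD. The search strings in `DEFAULT_PATTERNS` are assumptions about what the journal contains (on x86 Linux the kernel normally logs a "segfault at" line when a process crashes), and `first_match` and `journal_lines` are hypothetical helper names.

```python
import re
import subprocess
from typing import Iterable, List, Optional, Sequence, Tuple

# Assumed search strings; adjust them to whatever your journal actually shows.
DEFAULT_PATTERNS = ("segfault", "segmentation fault", "core dumped")


def first_match(
    lines: Iterable[str], patterns: Sequence[str] = DEFAULT_PATTERNS
) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line) of the first line containing any
    of the patterns (case-insensitive), or None if nothing matches."""
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            return number, line.rstrip("\n")
    return None


def journal_lines(since: str = "today") -> List[str]:
    """Read the systemd journal as text lines with ISO timestamps."""
    result = subprocess.run(
        ["journalctl", "--no-pager", "-o", "short-iso", "--since", since],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `first_match(journal_lines("today"))` on an affected machine would return the earliest matching line, whose timestamp can then be compared with the time of the last ssh exit.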

The passwords are now the same as the last time you logged in. I can add you to the third one if you give me a new channel ID.

Wayne


kk6pr
 

That sounds exactly like the problem I had with my wsprdaemon Think Center a couple of days ago: CPU at 100% and Kiwi connections restarting every 2-3 seconds. It happened once after I had rebooted and once again after a momentary power outage.

I usually have an SSH session to it from my other Think Center which runs OpenWebRx for a VHF/UHF AirSpy R2 - and which also runs a btop monitor on all of my online SDRs.  After a reboot however, I have to be there to set that up manually.

To test this, I just rebooted. I was able to start the btop/SSH session and everything worked normally, but after dropping the SSH session (and restarting it again) my wsprdaemon CPU load is now 100% and it is currently trying to connect to all my Kiwis, with the connections dropping after a couple of seconds.

I will leave it running in this situation so you can look at it.

73 / Rick
KK6PR

 

 


wayne roth
 

Note that if I set WD up to start at boot with the -Z switch and restart the Atomic Pi with shutdown -r or just cycle the power, it runs flawlessly. Only when I ssh in, cause the wsprdaemon.sh script to run, terminate it with ^C, and then exit the ssh session does the error loop happen, where it cycles through Kiwi receiver channels and fills the journal with errors. And it's not consistent: sometimes there's no issue, and sometimes it takes hours for the error state to appear.
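Because the error state can take hours to appear after the ssh exit, it may help to timestamp the moment the CPU saturates so it can be lined up with the journal. The following is only a sketch under assumptions: the 0.9 threshold and the 30 second polling interval are arbitrary choices, and `is_saturated` and `watch` are hypothetical names, not WD functions.

```python
import os
import time
from datetime import datetime, timezone


def is_saturated(load1: float, cpu_count: int, factor: float = 0.9) -> bool:
    """True when the 1-minute load average is at or above factor * cpu_count."""
    return load1 >= factor * cpu_count


def watch(interval_s: float = 30.0) -> None:
    """Poll the load average; print a UTC timestamp at the first crossing, then stop."""
    cpus = os.cpu_count() or 1
    while True:
        load1, _, _ = os.getloadavg()  # 1, 5 and 15 minute load averages
        if is_saturated(load1, cpus):
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} load1={load1:.2f} cpus={cpus}")
            return
        time.sleep(interval_s)
```

Running `watch()` with its output redirected to a file leaves a single line marking when the load first crossed the threshold.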


Gwyn Griffiths
 

Hi Rob
Happened here twice yesterday. I initially thought it was high temperature, but no, the TC went to 77˚C because of the problem, rather than it being the cause.

Whatever the trigger, many wd processes are spawned when the kiwirecorder sessions fail after a few seconds, and a wd -z does not kill them all; in fact it kills only a few of the several hundred that keep the CPUs at 100%. I took a screen grab of htop to show the wd processes that remain even after several -z runs.
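To put a number on how many processes survive a wd -z, one option is to scan /proc for command lines that mention the script. This is a rough Python sketch, not a WD tool: the search string "wsprdaemon" is an assumption about how the stray processes appear in their command lines, and `matching_pids` is a hypothetical helper name.

```python
import os
from typing import List


def matching_pids(needle: str, proc_root: str = "/proc") -> List[int]:
    """Return sorted PIDs whose command line contains `needle`,
    skipping the process running this function."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        if int(entry) == os.getpid():
            continue  # do not count ourselves
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated in /proc/<pid>/cmdline.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if needle in cmdline:
            pids.append(int(entry))
    return sorted(pids)
```

Comparing `len(matching_pids("wsprdaemon"))` before and after each wd -z would show how many processes each run actually removes.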

There was likely no long-term build up to this situation, screengrab of CPU temperature attached, the rise to 77˚C was pretty quick.

Gwyn G3ZIL


Andrew Cowan
 

Rob
I did a fresh install on 22.04 with an M2 SSD yesterday on my Ryzen 7.
The CPU hits 100% when the wav files are active, using task manager as a monitor. I don't remember this before.
The 3 Kiwi logs look normal, but I am concerned about this CPU load as I intend to run more on this machine.
Thanks
Andrew GM0UDL