Snap auto refresh kills cluster

@stgraber looks like snap refresh to 3.5 killed the cluster again.

Well, that’s annoying given that we did land the code to have LXD self-refresh as needed when that happens.
What happened exactly? Had one node upgrade to 3.5 and the rest stay on 3.4?

It’d be nice to get /var/snap/lxd/common/lxd/logs/lxd.log for the various nodes; they’re supposed to log something when that situation is detected and a self-refresh is attempted.
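
If it helps, something like this should pull the logs and installed versions from each member (hypothetical node names, adjust to your cluster):

# Hypothetical member names; substitute the real cluster nodes.
NODES="node1 node2 node3"
for n in $NODES; do
    # Record which snap version/revision each node ended up on.
    ssh "$n" snap list lxd > "snap-list-$n.txt"
    # Grab the daemon log so the self-refresh attempts can be compared.
    ssh "$n" sudo cat /var/snap/lxd/common/lxd/logs/lxd.log > "lxd-$n.log"
done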

@stgraber
Unfortunately, I just ran
pkill -9 -f "lxd --logfile"
and
sudo snap restart lxd.daemon
and things started working.
The only thing I can see in one of the old logs is:

lvl=info msg="Updating images" t=2018-09-17T05:50:55+0000
lvl=info msg="Done updating images" t=2018-09-17T05:50:55+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:14+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:15+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:16+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:17+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:18+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:19+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:20+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:21+0000
lvl=info msg="Node is out-of-date with respect to other cluster nodes" t=2018-09-17T10:26:00+0000
lvl=info msg="Triggering cluster update using: /snap/lxd/current/commands/refresh" t=2018-09-17T10:26:00+0000
lvl=info msg="Received 'terminated signal', exiting" t=2018-09-17T10:26:04+0000
lvl=info msg="Starting shutdown sequence" t=2018-09-17T10:26:04+0000
lvl=info msg="Stopping REST API handler:" t=2018-09-17T10:26:04+0000

OK, so it does show that one node upgraded to 3.5 and that this other node then noticed and triggered the update, which in turn triggered a restart. But that should have been followed shortly afterwards by the daemon restarting and coming back online; instead it looks like it got stuck on its way down…
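
One thing worth checking on the node that got stuck is whether the snap side of the refresh actually completed; roughly (snapd’s own view of the change, plus the resulting version):

# Did the refresh change complete, error out, or is it still in flight?
sudo snap changes lxd
# What version is the node actually on now?
snap list lxd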

And there was nothing in that log after "Stopping REST API handler:"?

@freeekanayaka and I are scratching our heads at the code because it seems impossible that this message would get printed but the one immediately afterwards wouldn’t.
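
If this happens again, a goroutine dump from the stuck daemon before killing it would show where the shutdown is blocked. LXD is a Go binary, so SIGQUIT makes the runtime print all goroutine stacks before exiting; a rough sketch, assuming the snap service’s stderr ends up in the journal:

# Find the stuck daemon and how long it has been running.
ps -eo pid,etime,cmd | grep "lxd --logfile"
# SIGQUIT makes the Go runtime dump every goroutine stack to stderr
# before the process exits.
sudo pkill -QUIT -f "lxd --logfile"
# Then pull the dump out of the journal.
sudo journalctl -u snap.lxd.daemon -n 200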

No, that’s all there was in the log. Maybe the other log file only gets generated once the new version is launched?
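
Next time I’ll copy the log aside before restarting, in case it gets rotated or truncated when the new version starts; e.g.:

# Preserve the current log before restarting the daemon.
sudo cp /var/snap/lxd/common/lxd/logs/lxd.log "/root/lxd-$(hostname)-$(date +%s).log"
sudo snap restart lxd.daemon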