Snap auto refresh kills cluster

@stgraber looks like snap refresh to 3.5 killed the cluster again.

Well, that’s annoying given that we did land the code to have LXD self-refresh as needed when that happens.
What happened exactly? Had one node upgrade to 3.5 and the rest stay on 3.4?

It’d be nice to get /var/snap/lxd/common/lxd/logs/lxd.log for the various nodes; they’re supposed to log something when that situation is detected and a self-refresh is attempted.
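
If it helps, something like this should pull the logs and installed versions from each member (hypothetical node names, adjust to your cluster):

# Hypothetical member names; substitute the real cluster nodes.
NODES="node1 node2 node3"
for n in $NODES; do
    # Record which snap version/revision each node ended up on.
    ssh "$n" snap list lxd > "snap-list-$n.txt"
    # Grab the daemon log so the self-refresh attempts can be compared.
    ssh "$n" sudo cat /var/snap/lxd/common/lxd/logs/lxd.log > "lxd-$n.log"
done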

@stgraber
Unfortunately, I just ran
pkill -9 -f "lxd --logfile"
and
sudo snap restart lxd.daemon
and things started working.
The only thing I can see in one of the old logs is:

lvl=info msg="Updating images" t=2018-09-17T05:50:55+0000
lvl=info msg="Done updating images" t=2018-09-17T05:50:55+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:14+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:15+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:16+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:17+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:18+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:19+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:20+0000
lvl=warn msg="Failed to get events from node 10.187.21.32:8443: Unable to connect to: 10.187.21.32:8443" t=2018-09-17T10:25:21+0000
lvl=info msg="Node is out-of-date with respect to other cluster nodes" t=2018-09-17T10:26:00+0000
lvl=info msg="Triggering cluster update using: /snap/lxd/current/commands/refresh" t=2018-09-17T10:26:00+0000
lvl=info msg="Received 'terminated signal', exiting" t=2018-09-17T10:26:04+0000
lvl=info msg="Starting shutdown sequence" t=2018-09-17T10:26:04+0000
lvl=info msg="Stopping REST API handler:" t=2018-09-17T10:26:04+0000

OK, so it does show that one node upgraded to 3.5 and that this other node then noticed and triggered the update, which in turn triggered a restart. But that should have been followed shortly afterwards by the daemon restarting and coming back online; instead it looks like it got stuck on its way down…
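
One thing worth checking on the node that got stuck is whether the snap side of the refresh actually completed; roughly (snapd’s own view of the change, plus the resulting version):

# Did the refresh change complete, error out, or is it still in flight?
sudo snap changes lxd
# What version is the node actually on now?
snap list lxd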

And there was nothing in that log after "Stopping REST API handler:"?

@freeekanayaka and I are scratching our heads at the code because it seems impossible that this message would get printed but the one immediately afterwards wouldn’t.
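
If this happens again, a goroutine dump from the stuck daemon before killing it would show where the shutdown is blocked. LXD is a Go binary, so SIGQUIT makes the runtime print all goroutine stacks before exiting; a rough sketch, assuming the snap service’s stderr ends up in the journal:

# Find the stuck daemon and how long it has been running.
ps -eo pid,etime,cmd | grep "lxd --logfile"
# SIGQUIT makes the Go runtime dump every goroutine stack to stderr
# before the process exits.
sudo pkill -QUIT -f "lxd --logfile"
# Then pull the dump out of the journal.
sudo journalctl -u snap.lxd.daemon -n 200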

No, that’s all there was in the log. Maybe the other log file only gets generated once the new version is launched?
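
Next time I’ll copy the log aside before restarting, in case it gets rotated or truncated when the new version starts; e.g.:

# Preserve the current log before restarting the daemon.
sudo cp /var/snap/lxd/common/lxd/logs/lxd.log "/root/lxd-$(hostname)-$(date +%s).log"
sudo snap restart lxd.daemon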