Here we go again... Upgrade to 3.17 causing Error: Get http://unix.socket/1.0: EOF

I’ve taken a look at this and as far as I can see it is not getting as far as doing a schema update during start. Instead it is having trouble reconnecting to the cluster when starting up.

I would appreciate @freeekanayaka taking a look at this, as it seems to be hanging here: https://github.com/lxc/lxd/blob/master/lxd/daemon.go#L704-L712
in this part: https://github.com/lxc/lxd/blob/master/lxd/db/db.go#L178-L214

This might be useful: in my case, there seem to be several lxd processes trying to start?!

Sorry about that. It was about 2am, I got woken up by a page, the issues looked very similar, and I didn’t want to overrun the team with redundant threads.

I’ll start another one now.

I understand, I’ve been there myself. It’s hard enough to get these simple disasters fixed.

Indeed, you have far too many LXD processes for my liking :slight_smile:
You should do:

  • rm /var/snap/lxd/common/lxd/unix.socket
  • systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
  • Kill any leftover lxd process you notice

Once that’s all done, run:

  • lxd --debug --group lxd

That should give you a single clean LXD running in debug mode directly attached to your terminal.
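For anyone who would rather script the cleanup steps above, here is a minimal sketch. It assumes the snap paths and unit names quoted in this thread. The `run` and `lxd_cleanup` helpers and the `DRY_RUN` and `SOCKET` variables are made up for illustration and are not part of LXD. `DRY_RUN` defaults to 1, so the script only prints what it would do; `rm -f` is used instead of plain `rm` so a missing socket is not treated as an error, and leftover processes are listed with `pgrep` rather than killed automatically.

```shell
#!/bin/sh
# Sketch of the cleanup sequence from this thread (not an official LXD tool).
# DRY_RUN defaults to 1: commands are only printed. Set DRY_RUN=0 and run as
# root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
# Socket path as quoted in the thread (snap install); override if yours differs.
SOCKET="${SOCKET:-/var/snap/lxd/common/lxd/unix.socket}"

run() {
    # Print the command, then execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

lxd_cleanup() {
    # Same order as the steps in the post above.
    run rm -f "$SOCKET"
    run systemctl stop snap.lxd.daemon snap.lxd.daemon.unix.socket
    # List leftover lxd processes for manual review; kill them by hand.
    run pgrep -a -x lxd || true
}

lxd_cleanup
echo "Once no lxd process is left, run: lxd --debug --group lxd"
```

Killing the leftover processes is left as a manual step on purpose, since the right signal depends on what state they are stuck in.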

Tried it on two servers. MOE (the bottom screen) is one of the main DB servers; notice it reports batch has zero errors and then does not go any further.
Curlyjoe is trying to find someone to talk to, so I guess he is ok.

Ok, I believe I’ve seen this once before. That’s definitely a DB bug that @freeekanayaka will need to look into next week (so we’ll need that DB dump to be able to reproduce it). There is a very good chance, though, that undoing the last transaction on all 3 servers will just unblock things. As that transaction is most likely the failed upgrade, this shouldn’t cause any data loss.

Thanks Again… Looks like it is working great again