FIXED after DYING after UPGRADE - LXC 4.1 auto installed itself and is not letting my cluster members talk to each other

The answer is probably to reboot them. However, is there anything I should do before I reboot them tomorrow? Are there any logs or other information you would like to look at before I do?

lxc version
Client version: 4.1
Server version: 4.1
lxc cluster list
+------+--------------------------+----------+---------+---------------------------------------+--------------+
| NAME |           URL            | DATABASE |  STATE  |                MESSAGE                | ARCHITECTURE |
+------+--------------------------+----------+---------+---------------------------------------+--------------+
| Q1   | https://84…18:8443       | YES      | OFFLINE | no heartbeat since 32h48m0.565177638s | x86_64       |
+------+--------------------------+----------+---------+---------------------------------------+--------------+
| Q2   | https://84…19:8443       | YES      | OFFLINE | no heartbeat since 32h48m0.564705158s | x86_64       |
+------+--------------------------+----------+---------+---------------------------------------+--------------+
| Q3   | https://84…20:8443       | YES      | OFFLINE | no heartbeat since 32h48m0.564869057s | x86_64       |
+------+--------------------------+----------+---------+---------------------------------------+--------------+
| Q4   | https://84…21:8443       | NO       | OFFLINE | no heartbeat since 32h48m0.564564881s | x86_64       |
+------+--------------------------+----------+---------+---------------------------------------+--------------+

Interestingly, 2 of the 4 servers let me run lxc list; the other two get stuck. The containers are still working on all of them. My guess is that snap refreshed LXD to the new version and caused an inconsistency somewhere. I hope it is not in my database.
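For anyone scripting a check on this, here is a minimal sketch that pulls the names of OFFLINE members out of the table that lxc cluster list prints. The function name offline_members is made up for illustration, and it assumes the six-column table layout shown above.

```shell
# Print the NAME of every member whose STATE column reads OFFLINE.
# Reads the table printed by `lxc cluster list` on stdin.
offline_members() {
    awk -F'|' '
        # Data rows split into 8 fields on "|"; border rows contain no "|".
        NF >= 7 {
            name = $2
            state = $5
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (state == "OFFLINE") print name
        }
    '
}
```

Usage would be something like lxc cluster list | offline_members, run on a member where lxc still answers.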

If you were upgrading from 4.0, there was a bug that happened when restarting the daemon that was fixed in 4.1, so perhaps you got hit by it during the upgrade.

So what do you suggest? Just reboot? Should I back up the database?
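On the backup question, a cautious sketch: copy the database directory to a timestamped tarball before touching anything. The helper name backup_dir and the destination directory are hypothetical. The source path below assumes the snap layout, where the log later in this thread shows LXD running from /var/snap/lxd/common/lxd, with the database in a database subdirectory of it.

```shell
# Archive a directory into a timestamped tarball and print the archive path.
# Usage: backup_dir SRC_DIR DEST_DIR
backup_dir() {
    local src="$1" dest="$2"
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    local archive
    archive="$dest/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths inside the archive relative to the parent directory.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Example (assumed paths, run as root on each member):
#   backup_dir /var/snap/lxd/common/lxd/database /root/lxd-db-backup
```

A copy taken while the daemon is writing may not be consistent, so it is probably safest to take it on every member with the daemon stopped.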

Please try to reboot. We’ll release LXD 4.2 shortly, which has further fixes that were identified around cluster reboot. Once you upgrade to 4.2, please let us know if it still happens.

Of course it did not work. Now I am down. Two machines look like they could work; the other 2 are stuck with no unix.socket.

systemctl restart snap.lxd.daemon.unix.socket
Job for snap.lxd.daemon.unix.socket failed.
See "systemctl status snap.lxd.daemon.unix.socket" and "journalctl -xe" for details.
root@Q2:/home/ic2000# systemctl stop snap.lxd.daemon

lxc list --fast --debug
DBUG[05-27|13:27:36] Connecting to a local LXD over a Unix socket
DBUG[05-27|13:27:36] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
Error: Get "http://unix.socket/1.0": EOF

/snap/bin/lxd --debug --group lxd
DBUG[05-27|14:08:41] Connecting to a local LXD over a Unix socket
DBUG[05-27|14:08:41] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
INFO[05-27|14:08:41] LXD 4.1 is starting in normal mode path=/var/snap/lxd/common/lxd
INFO[05-27|14:08:41] Kernel uid/gid map:
INFO[05-27|14:08:41] - u 0 0 4294967295
INFO[05-27|14:08:41] - g 0 0 4294967295
INFO[05-27|14:08:41] Configured LXD uid/gid map:
INFO[05-27|14:08:41] - u 0 1000000 1000000000
INFO[05-27|14:08:41] - g 0 1000000 1000000000
INFO[05-27|14:08:41] Kernel features:
INFO[05-27|14:08:41] - netnsid-based network retrieval: no
INFO[05-27|14:08:41] - uevent injection: no
INFO[05-27|14:08:41] - seccomp listener: no
INFO[05-27|14:08:41] - seccomp listener continue syscalls: no
INFO[05-27|14:08:41] - unprivileged file capabilities: yes
INFO[05-27|14:08:41] - cgroup layout: hybrid
WARN[05-27|14:08:41] - Couldn’t find the CGroup memory swap accounting, swap limits will be ignored
INFO[05-27|14:08:41] - shiftfs support: no
INFO[05-27|14:08:41] Initializing local database
DBUG[05-27|14:08:41] Initializing database gateway
DBUG[05-27|14:08:41] Start database node id=3 address=84.17.40.19:8443 role=voter
DBUG[05-27|14:08:42] Connecting to a local LXD over a Unix socket
DBUG[05-27|14:08:42] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
DBUG[05-27|14:08:42] Detected stale unix socket, deleting
DBUG[05-27|14:08:42] Detected stale unix socket, deleting
INFO[05-27|14:08:42] Starting cluster handler:
INFO[05-27|14:08:42] Starting /dev/lxd handler:
INFO[05-27|14:08:42] - binding devlxd socket socket=/var/snap/lxd/common/lxd/devlxd/sock
INFO[05-27|14:08:42] REST API daemon:
INFO[05-27|14:08:42] - binding Unix socket socket=/var/snap/lxd/common/lxd/unix.socket
INFO[05-27|14:08:42] - binding TCP socket socket=[::]:8443
INFO[05-27|14:08:42] Initializing global database
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.18:8443 attempt=0
DBUG[05-27|14:08:42] Found cert name=0
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.19:8443 attempt=0
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.20:8443 attempt=0
DBUG[05-27|14:08:42] Found cert name=0
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node’s version address=84.17.40.21:8443 attempt=0
DBUG[05-27|14:08:42] Found cert name=0
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.18:8443 attempt=1
DBUG[05-27|14:08:42] Found cert name=0
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.19:8443 attempt=1
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.20:8443 attempt=1
WARN[05-27|14:08:42] Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node’s version address=84.17.40.21:8443 attempt=1
DBUG[05-27|14:08:43] Found cert name=0
DBUG[05-27|14:08:43] Found cert name=0
WARN[05-27|14:08:43] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.18:8443 attempt=2
DBUG[05-27|14:08:43] Found cert name=0
WARN[05-27|14:08:43] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.19:8443 attempt=2
DBUG[05-27|14:08:43] Found cert name=0
WARN[05-27|14:08:43] Dqlite: server unavailable err=failed to establish network connection: 503 Service Unavailable address=84.17.40.20:8443 attempt=2
WARN[05-27|14:08:43] Dqlite: server unavailable err=failed to establish network connection: some nodes are behind this node’s version address=84.17.40.21:8443 attempt=2
DBUG[05-27|14:08:43] Found cert
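The log above repeats the same Dqlite warning many times, so a per-address summary is easier to read. This is a minimal sketch; the function name dqlite_failures is made up, and it assumes the key=value line format shown in the log. On this log it would show 84.17.40.18, .19 and .20 answering 503 Service Unavailable, while 84.17.40.21 reports that some nodes are behind this node’s version, which suggests the members were not all on the same version at that moment.

```shell
# Summarize "Dqlite: server unavailable" warnings from an LXD debug log.
# Reads the log on stdin; prints "address<TAB>count<TAB>last error" per address.
dqlite_failures() {
    awk '
        /Dqlite: server unavailable/ {
            addr = ""
            for (i = 1; i <= NF; i++)
                if ($i ~ /^address=/) addr = substr($i, 9)
            if (addr == "") next
            # The error text sits between "err=" and " address=".
            err = $0
            sub(/^.*err=/, "", err)
            sub(/ address=.*$/, "", err)
            count[addr]++
            last[addr] = err
        }
        END {
            for (a in count) printf "%s\t%d\t%s\n", a, count[a], last[a]
        }
    ' | sort
}
```

It can be fed a saved copy of the debug output, for example dqlite_failures < lxd-debug.log (file name hypothetical).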

Update:
After several reboots and removing the firewall, I was able to get the cluster puttering along.
Apparently 4.1 is looking for another port beyond 8443. One cluster member, Q3, is not doing well: I can't get containers running on it, and lxc commands are not working even though it is the database master.
I can do an lxc list across the cluster, but most other functions give Error: Missing event connection with target cluster member.
Doing more work on Q3 to unstick it.
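Since removing the firewall helped, a quick reachability check between members may be worth running before and after any firewall change. This is only a sketch: port_open and check_members are made-up names, it relies on bash's /dev/tcp and the coreutils timeout command, and the only port confirmed in this thread is 8443. Which other port 4.1 might want is not known here.

```shell
# Return 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
port_open() {
    local host="$1" port="$2"
    # Inside bash -c, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Print one "host:port open" or "host:port closed" line per host.
# Usage: check_members PORT HOST...
check_members() {
    local port="$1"
    shift
    local host
    for host in "$@"; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

For the addresses in the log above this would be check_members 8443 84.17.40.18 84.17.40.19 84.17.40.20 84.17.40.21, run from each member in turn.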

After many runs of pkill -9 -f "lxd --logfile"

and a final reboot of the Q3 member, it looks like everything is back to normal.

NOW, I am going to figure out how to stop these auto updates. I can’t afford for my servers to be down for a day because they decided to snap update and crash. If I wanted that I would be running Windows.
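For reference, snapd has a few knobs for this. The sketch below wraps two of them so they can be previewed with DRY_RUN=1 first; the wrapper names are made up. It assumes a snapd that understands the refresh.timer system option, and for the hold, one recent enough to support snap refresh --hold, which older releases do not have.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restrict automatic refreshes to a fixed window (snapd refresh.timer syntax).
# Usage: limit_refresh_window 'fri,23:00-01:00'
limit_refresh_window() {
    run snap set system refresh.timer="$1"
}

# Hold automatic refreshes of one snap until the hold is released.
# Usage: hold_snap lxd
hold_snap() {
    run snap refresh --hold "$1"
}
```

With a hold in place, the idea would be to refresh all cluster members by hand in one maintenance window, since the log earlier in the thread suggests members on different versions refuse to talk to each other.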

We’ve been fixing reboot and snap refresh issues reported by users for a while now. I believe things are now in much better shape than they used to be. LXD 4.1 has some improvements and 4.2 has more refinements on top of that. We’ve not seen issues with 4.2 so far. Again, please let us know if you find anything problematic with 4.2 once it is out (in the next few days). Thanks.

I hope the automatic refresh to 4.2 goes well. The problem goes back quite a bit. I have been screaming for help since the beginning of version 3. It has cost me hundreds of hours of work. And very simply, until you fix this problem, LXD is really not ready for production. It needs to run with one member, two members or 100 members. It should not matter if I unplug the router or pull the electric plug; when it reboots or reconnects it should be fine. It should be easy to mount or dismount a cluster member. It should also be easy to promote a server to voting member or demote it.
None of these cluster concepts are new; I worked on VAXclusters 30 years ago. We would never go down, even for software upgrades; with one command we would just shift the load from one group of machines to another. They would work almost like a RAID. Look at these numbers:

[Image: Screenshot from 2020-05-28 18-31-31]

LXD has come a long way. But managing a cluster should be as easy as managing containers.
And if one or five cluster members go down, you should be able to keep running.
Now something weird, which is either a new feature or a bug in 4.1: I was able to start some containers on some machines even though the cluster itself was broken.

Believe me guys, not trying to rattle your chains. I believe in LXD. We have made it part of our business stack. I just know what it can be, and it still needs a bit to get there. Will try to contribute some more concrete ideas later. And thank you for your continued work.