LXC cluster: failing to start the first node after reboot

MirtoBusico · April 23, 2019, 4:19pm

I have a LXC cluster with three nodes that are all KVM virtual machines with Kubuntu 18.04 with LXD 3.0.3 installed via apt

Also the host machine is a Kubuntu 18.04

I installed LXD on the first node (kvmnode1) and then I joined to the cluster the other two nodes (kvmnode2, kvmnode3)

Then I verified that starting the nodes from the first to the last (waiting for every node to complete the startup) and closing the nodes from the last to the first always gives a fully oerational cluster

Today I forgot to wait for the complete startup of every node and I ended up with a cluster with the first node non starting LXD end the other 2 nodes working correctly

On the failing first node I see

sysop@kvmnode1:~/Scaricati$ sudo systemctl status lxd
[sudo] password di sysop: 
● lxd.service - LXD - main daemon
   Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
   Active: activating (start-post) (Result: exit-code) since Mon 2019-04-22 10:36:10 CEST; 1min 47s ago
     Docs: man:lxd(1)
  Process: 1592 ExecStart=/usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log (code=exited, status=1/FAILURE)
  Process: 1560 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
 Main PID: 1592 (code=exited, status=1/FAILURE); Control PID: 1593 (lxd)
    Tasks: 8
   CGroup: /system.slice/lxd.service
           └─1593 /usr/lib/lxd/lxd waitready --timeout=600

apr 22 10:36:10 kvmnode1 systemd[1]: Starting LXD - main daemon...
apr 22 10:36:10 kvmnode1 lxd[1592]: t=2019-04-22T10:36:10+0200 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
apr 22 10:36:19 kvmnode1 lxd[1592]: t=2019-04-22T10:36:19+0200 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to update node version info: upda
apr 22 10:36:19 kvmnode1 lxd[1592]: Error: failed to open cluster database: failed to ensure schema: failed to update node version info: updated 0 rows instead of 1
apr 22 10:36:19 kvmnode1 systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
sysop@kvmnode1:~/Scaricati$

And on the second node I see

sysop@kvmnode2:~$ lxc cluster list
+----------+-----------------------------+----------+---------+----------------------------------------+
|   NAME   |             URL             | DATABASE |  STATE  |                MESSAGE                 |
+----------+-----------------------------+----------+---------+----------------------------------------+
| kvmnode1 | https://192.168.201.11:8443 | YES      | OFFLINE | no heartbeat since 134h36m2.926365228s |
+----------+-----------------------------+----------+---------+----------------------------------------+
| kvmnode2 | https://192.168.201.12:8443 | YES      | ONLINE  | fully operational                      |
+----------+-----------------------------+----------+---------+----------------------------------------+
| kvmnode3 | https://192.168.201.13:8443 | YES      | ONLINE  | fully operational                      |
+----------+-----------------------------+----------+---------+----------------------------------------+
sysop@kvmnode2:~$

What can I do to have kvmnode1 starting correctly and join the cluster?

I tried to start/stop the three nodes in every sequence but I always end in the same status

I tried to start lxd in debug mode and it report the same failure

sysop@kvmnode1:~$ sudo lxd --debug --group lxd
DBUG[04-22|12:19:42] Connecting to a local LXD over a Unix socket 
DBUG[04-22|12:19:42] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[04-22|12:19:42] LXD 3.0.3 is starting in normal mode     path=/var/lib/lxd
INFO[04-22|12:19:42] Kernel uid/gid map: 
INFO[04-22|12:19:42]  - u 0 0 4294967295 
INFO[04-22|12:19:42]  - g 0 0 4294967295 
INFO[04-22|12:19:42] Configured LXD uid/gid map: 
INFO[04-22|12:19:42]  - u 0 165536 65536 
INFO[04-22|12:19:42]  - g 0 165536 65536 
WARN[04-22|12:19:42] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[04-22|12:19:42] Kernel features: 
INFO[04-22|12:19:42]  - netnsid-based network retrieval: no 
INFO[04-22|12:19:42]  - unprivileged file capabilities: yes 
INFO[04-22|12:19:42] Initializing local database 
DBUG[04-22|12:19:42] Initializing database gateway 
DBUG[04-22|12:19:42] Connecting to a local LXD over a Unix socket 
DBUG[04-22|12:19:42] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
DBUG[04-22|12:19:42] Detected stale unix socket, deleting 
INFO[04-22|12:19:42] Starting /dev/lxd handler: 
INFO[04-22|12:19:42]  - binding devlxd socket                 socket=/var/lib/lxd/devlxd/sock
INFO[04-22|12:19:42] REST API daemon: 
INFO[04-22|12:19:42]  - binding Unix socket                   socket=/var/lib/lxd/unix.socket
INFO[04-22|12:19:42]  - binding TCP socket                    socket=[::]:8443
INFO[04-22|12:19:42] Initializing global database 
DBUG[04-22|12:19:42] Found cert                               k=0
DBUG[04-22|12:19:42] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=192.168.201.11:8443 attempt=0 
DBUG[04-22|12:19:42] Dqlite: connected address=192.168.201.12:8443 attempt=0 
DBUG[04-22|12:19:42] Database error: failed to update node version info: updated 0 rows instead of 1 
EROR[04-22|12:19:42] Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to update node version info: updated 0 rows instead of 1 
INFO[04-22|12:19:42] Starting shutdown sequence 
INFO[04-22|12:19:42] Stopping REST API handler: 
INFO[04-22|12:19:42]  - closing socket                        socket=[::]:8443
INFO[04-22|12:19:42]  - closing socket                        socket=/var/lib/lxd/unix.socket
INFO[04-22|12:19:42] Stopping /dev/lxd handler 
INFO[04-22|12:19:42]  - closing socket                        socket=/var/lib/lxd/devlxd/sock
DBUG[04-22|12:19:42] Stop database gateway 
INFO[04-22|12:19:42] Stopping REST API handler: 
INFO[04-22|12:19:42] Stopping /dev/lxd handler 
INFO[04-22|12:19:42] Stopping REST API handler: 
INFO[04-22|12:19:42] Stopping /dev/lxd handler 
DBUG[04-22|12:19:42] Not unmounting temporary filesystems (containers are still running) 
INFO[04-22|12:19:42] Saving simplestreams cache 
INFO[04-22|12:19:42] Saved simplestreams cache 
Error: failed to open cluster database: failed to ensure schema: failed to update node version info: updated 0 rows instead of 1
sysop@kvmnode1:~$

Any hint?

(P.S: cross-posted in ask-ubuntu)

freeekanayaka · April 23, 2019, 6:09pm

You should be able to start the nodes in any order with no problem.

This is a bug that was fixed here:

https://github.com/lxc/lxd/pull/5532

but probably not backported to 3.0.x series yet.

freeekanayaka · April 23, 2019, 6:13pm

You should be able to fix your cluster by running a SQL query. Can you post here the output of

echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db

?

You should run the command above on the node that is not starting. I will then tell you which exact query to run and how to run it.

MirtoBusico · April 24, 2019, 7:52am

Thanks.
Do you know if there is a released version (or snap) containing the patch?

MirtoBusico · April 24, 2019, 8:21am

Well, seems that only root can issue tha command. On the failing node (IP 192.168.201.11) I have

sysop@kvmnode1:~/Scaricati$ echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db
Error: unable to open database "/var/lib/lxd/database/local.db": unable to open database file
sysop@kvmnode1:~/Scaricati$ sudo ls -lh /var/lib/lxd/database/local.db
-rw-r--r-- 1 root root 40K apr 22 12:09 /var/lib/lxd/database/local.db
sysop@kvmnode1:~/Scaricati$ sudo -i
root@kvmnode1:~# echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db
2|core.https_address|[::]
root@kvmnode1:~#

On a working node I have

sysop@kvmnode2:~$ sudo -i
root@kvmnode2:~# echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db
1|core.https_address|192.168.201.12:8443
root@kvmnode2:~#

How to recover?

freeekanayaka · April 24, 2019, 8:29am

LXD 3.12 has the fix. You can get it with the snap (snap install lxd).

freeekanayaka · April 24, 2019, 8:33am

Okay, you should be able to fix this with:

echo UPDATE config SET value='192.168.201.11:8443' WHERE key='core.https_address'| sqlite3 /var/lib/lxd/database/local.db

(as root).

This means that you will listen to a specific interface, but at least it should make the cluster work again. To have a full fix you’ll probably have to upgrade or wait for the next 3.0.x stable release.

MirtoBusico · April 26, 2019, 2:56pm

Thanks a lot it worked.
But I’ll restart from scratch using the snap version of LXD

MirtoBusico · May 5, 2019, 4:25pm

Well I restarted from scratch using lxd 3.12 installed using snap; but there is the same error: after a cluster startup the first node fails to start lxd

On the failing node I have

sysop@kvmnode1:/var/zdata$ sudo lxd --version
[sudo] password di sysop:
3.12
sysop@kvmnode1:/var/zdata$ sudo lxc cluster list
^C
sysop@kvmnode1:/var/zdata$ sudo systemctl status snap.lxd.daemon
● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2019-05-05 18:05:02 CEST; 5min ago
Process: 7275 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
Main PID: 7275 (code=exited, status=1/FAILURE)
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 47.
mag 05 18:05:02 kvmnode1 systemd[1]: Stopped Service for snap application lxd.daemon.
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Failed with result ‘exit-code’.
mag 05 18:05:02 kvmnode1 systemd[1]: Failed to start Service for snap application lxd.daemon.
sysop@kvmnode1:/var/zdata$ sudo -i
root@kvmnode1:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.11:8443
3|core.https_address|[::]
root@kvmnode1:~#

Seems core.https_address is lost

On a working node I see:

sysop@kvmnode2:/var/zdata$ sudo -i
[sudo] password for sysop:
root@kvmnode2:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.12:8443
2|core.https_address|192.168.201.12:8443
root@kvmnode2:~# lxc cluster list
±---------±----------------------------±---------±--------±------------------------------------+
| NAME | URL | DATABASE | STATE | MESSAGE |
±---------±----------------------------±---------±--------±------------------------------------+
| kvmnode1 | https://192.168.201.11:8443 | YES | OFFLINE | no heartbeat since 19m15.295952309s |
±---------±----------------------------±---------±--------±------------------------------------+
| kvmnode2 | https://192.168.201.12:8443 | YES | ONLINE | fully operational |
±---------±----------------------------±---------±--------±------------------------------------+
| kvmnode3 | https://192.168.201.13:8443 | YES | ONLINE | fully operational |
±---------±----------------------------±---------±--------±------------------------------------+
root@kvmnode2:~# lxd --version
3.12
root@kvmnode2:~#

At least lxd version remains 3.12
What I’m missing to have an lxd cluster starting correctly?

UPDATE

Also resetting core.https_addrress seems to ha ve no effect

sysop@kvmnode1:/var/zdata$ sudo systemctl status snap.lxd.daemon
[sudo] password di sysop:
● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: active (running) since Sun 2019-05-05 19:13:37 CEST; 4s ago
Main PID: 13143 (daemon.start)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/snap.lxd.daemon.service
‣ 13143 /bin/sh /snap/lxd/10601/commands/daemon.start
mag 05 19:13:37 kvmnode1 lxd.daemon[13143]: ==> Setting up ceph configuration
mag 05 19:13:37 kvmnode1 lxd.daemon[13143]: ==> Setting up LVM configuration
mag 05 19:13:37 kvmnode1 lxd.daemon[13143]: ==> Rotating logs
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Setting up ZFS (0.7)
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Escaping the systemd cgroups
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Escaping the systemd process resource limits
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Disabling shiftfs on this kernel (auto)
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: => Re-using existing LXCFS
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: => Starting LXD
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: t=2019-05-05T19:13:39+0200 lvl=warn msg=“CGroup memory swap accounting is disabled, swap limits will be ignored.”
sysop@kvmnode1:/var/zdata$ sudo -i
root@kvmnode1:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.11:8443
3|core.https_address|[::]
root@kvmnode1:~# echo “UPDATE config SET value=‘192.168.201.11:8443’ WHERE key=‘core.https_address’”| sqlite3 /var/snap/lxd/common/lxd/database/local.db
root@kvmnode1:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.11:8443
3|core.https_address|192.168.201.11:8443
root@kvmnode1:~# systemctl status snap.lxd.daemon
● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2019-05-05 19:15:04 CEST; 53s ago
Process: 16229 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
Main PID: 16229 (code=exited, status=1/FAILURE)
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 118.
mag 05 19:15:04 kvmnode1 systemd[1]: Stopped Service for snap application lxd.daemon.
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Failed with result ‘exit-code’.
mag 05 19:15:04 kvmnode1 systemd[1]: Failed to start Service for snap application lxd.daemon.

gpatel-fr · May 5, 2019, 6:40pm

can I ask why you are doing this, updating a system table directly, using a system utility to up the ante - snap lxd packages its own sqlite3 - while it has been specifically said many times by lxd devs that this should only be done under instruction ? lxc api has an instruction to setup the remote address
lxc config set core.https_address
so what’s the specific reasoning behind this direct access ? Mind, I don’t know lxd clustering.

MirtoBusico · May 6, 2019, 8:25am

Obviously you can ask.
I’m doing this because it was suggested in this thread.

Happy to know that there is an api for this.

Can you share the link to this api documentation?

Thanks

Ahks · April 15, 2022, 9:52pm

Sorry to revive a 3 year old thread but this is in the top 3 google results for a problem I had and the best pointer to what my fix was so I’ll drop this here in case someone else needs it…

After pulling some nodes from the cluster I got the following error when trying to rejoin them:
Error: Failed to join cluster: cannot set 'cluster.https_address' to '[::]:8443': Wildcard addresses aren't allowed

on a whim I tried

lxc config edit since I remember seeing edit as an available command in both lxd and lxc bins…

…and up pops the .yaml for the config

config: core.https_address: '[::]'

Since I was re-initializing lxd on this node and didn’t need any prior config, I just deleted both lines, saved it, and lxd init ran through with no more error.

Now look out! This zombie thread wants to eat your braaaaaaiiiinnnnssss

<3