LXC cluster: failing to start the first node after reboot

cluster

(Mirto Busico) #1

I have a LXC cluster with three nodes that are all KVM virtual machines with Kubuntu 18.04 with LXD 3.0.3 installed via apt

Also the host machine is a Kubuntu 18.04

I installed LXD on the first node (kvmnode1) and then I joined to the cluster the other two nodes (kvmnode2, kvmnode3)

Then I verified that starting the nodes from the first to the last (waiting for every node to complete the startup) and closing the nodes from the last to the first always gives a fully oerational cluster

Today I forgot to wait for the complete startup of every node and I ended up with a cluster with the first node non starting LXD end the other 2 nodes working correctly

On the failing first node I see

sysop@kvmnode1:~/Scaricati$ sudo systemctl status lxd
[sudo] password di sysop: 
● lxd.service - LXD - main daemon
   Loaded: loaded (/lib/systemd/system/lxd.service; indirect; vendor preset: enabled)
   Active: activating (start-post) (Result: exit-code) since Mon 2019-04-22 10:36:10 CEST; 1min 47s ago
     Docs: man:lxd(1)
  Process: 1592 ExecStart=/usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log (code=exited, status=1/FAILURE)
  Process: 1560 ExecStartPre=/usr/lib/x86_64-linux-gnu/lxc/lxc-apparmor-load (code=exited, status=0/SUCCESS)
 Main PID: 1592 (code=exited, status=1/FAILURE); Control PID: 1593 (lxd)
    Tasks: 8
   CGroup: /system.slice/lxd.service
           └─1593 /usr/lib/lxd/lxd waitready --timeout=600

apr 22 10:36:10 kvmnode1 systemd[1]: Starting LXD - main daemon...
apr 22 10:36:10 kvmnode1 lxd[1592]: t=2019-04-22T10:36:10+0200 lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored."
apr 22 10:36:19 kvmnode1 lxd[1592]: t=2019-04-22T10:36:19+0200 lvl=eror msg="Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to update node version info: upda
apr 22 10:36:19 kvmnode1 lxd[1592]: Error: failed to open cluster database: failed to ensure schema: failed to update node version info: updated 0 rows instead of 1
apr 22 10:36:19 kvmnode1 systemd[1]: lxd.service: Main process exited, code=exited, status=1/FAILURE
sysop@kvmnode1:~/Scaricati$ 

And on the second node I see

sysop@kvmnode2:~$ lxc cluster list
+----------+-----------------------------+----------+---------+----------------------------------------+
|   NAME   |             URL             | DATABASE |  STATE  |                MESSAGE                 |
+----------+-----------------------------+----------+---------+----------------------------------------+
| kvmnode1 | https://192.168.201.11:8443 | YES      | OFFLINE | no heartbeat since 134h36m2.926365228s |
+----------+-----------------------------+----------+---------+----------------------------------------+
| kvmnode2 | https://192.168.201.12:8443 | YES      | ONLINE  | fully operational                      |
+----------+-----------------------------+----------+---------+----------------------------------------+
| kvmnode3 | https://192.168.201.13:8443 | YES      | ONLINE  | fully operational                      |
+----------+-----------------------------+----------+---------+----------------------------------------+
sysop@kvmnode2:~$

What can I do to have kvmnode1 starting correctly and join the cluster?

I tried to start/stop the three nodes in every sequence but I always end in the same status

I tried to start lxd in debug mode and it report the same failure

sysop@kvmnode1:~$ sudo lxd --debug --group lxd
DBUG[04-22|12:19:42] Connecting to a local LXD over a Unix socket 
DBUG[04-22|12:19:42] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
INFO[04-22|12:19:42] LXD 3.0.3 is starting in normal mode     path=/var/lib/lxd
INFO[04-22|12:19:42] Kernel uid/gid map: 
INFO[04-22|12:19:42]  - u 0 0 4294967295 
INFO[04-22|12:19:42]  - g 0 0 4294967295 
INFO[04-22|12:19:42] Configured LXD uid/gid map: 
INFO[04-22|12:19:42]  - u 0 165536 65536 
INFO[04-22|12:19:42]  - g 0 165536 65536 
WARN[04-22|12:19:42] CGroup memory swap accounting is disabled, swap limits will be ignored. 
INFO[04-22|12:19:42] Kernel features: 
INFO[04-22|12:19:42]  - netnsid-based network retrieval: no 
INFO[04-22|12:19:42]  - unprivileged file capabilities: yes 
INFO[04-22|12:19:42] Initializing local database 
DBUG[04-22|12:19:42] Initializing database gateway 
DBUG[04-22|12:19:42] Connecting to a local LXD over a Unix socket 
DBUG[04-22|12:19:42] Sending request to LXD                   method=GET url=http://unix.socket/1.0 etag=
DBUG[04-22|12:19:42] Detected stale unix socket, deleting 
INFO[04-22|12:19:42] Starting /dev/lxd handler: 
INFO[04-22|12:19:42]  - binding devlxd socket                 socket=/var/lib/lxd/devlxd/sock
INFO[04-22|12:19:42] REST API daemon: 
INFO[04-22|12:19:42]  - binding Unix socket                   socket=/var/lib/lxd/unix.socket
INFO[04-22|12:19:42]  - binding TCP socket                    socket=[::]:8443
INFO[04-22|12:19:42] Initializing global database 
DBUG[04-22|12:19:42] Found cert                               k=0
DBUG[04-22|12:19:42] Dqlite: server connection failed err=failed to establish network connection: some nodes are behind this node's version address=192.168.201.11:8443 attempt=0 
DBUG[04-22|12:19:42] Dqlite: connected address=192.168.201.12:8443 attempt=0 
DBUG[04-22|12:19:42] Database error: failed to update node version info: updated 0 rows instead of 1 
EROR[04-22|12:19:42] Failed to start the daemon: failed to open cluster database: failed to ensure schema: failed to update node version info: updated 0 rows instead of 1 
INFO[04-22|12:19:42] Starting shutdown sequence 
INFO[04-22|12:19:42] Stopping REST API handler: 
INFO[04-22|12:19:42]  - closing socket                        socket=[::]:8443
INFO[04-22|12:19:42]  - closing socket                        socket=/var/lib/lxd/unix.socket
INFO[04-22|12:19:42] Stopping /dev/lxd handler 
INFO[04-22|12:19:42]  - closing socket                        socket=/var/lib/lxd/devlxd/sock
DBUG[04-22|12:19:42] Stop database gateway 
INFO[04-22|12:19:42] Stopping REST API handler: 
INFO[04-22|12:19:42] Stopping /dev/lxd handler 
INFO[04-22|12:19:42] Stopping REST API handler: 
INFO[04-22|12:19:42] Stopping /dev/lxd handler 
DBUG[04-22|12:19:42] Not unmounting temporary filesystems (containers are still running) 
INFO[04-22|12:19:42] Saving simplestreams cache 
INFO[04-22|12:19:42] Saved simplestreams cache 
Error: failed to open cluster database: failed to ensure schema: failed to update node version info: updated 0 rows instead of 1
sysop@kvmnode1:~$ 

Any hint?

(P.S: cross-posted in ask-ubuntu)


LXD downgraded after installing JUJU
(Free Ekanayaka) #2

You should be able to start the nodes in any order with no problem.

This is a bug that was fixed here:

but probably not backported to 3.0.x series yet.


(Free Ekanayaka) #3

You should be able to fix your cluster by running a SQL query. Can you post here the output of

echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db

?

You should run the command above on the node that is not starting. I will then tell you which exact query to run and how to run it.


(Mirto Busico) #4

Thanks.
Do you know if there is a released version (or snap) containing the patch?


(Mirto Busico) #5

Well, seems that only root can issue tha command. On the failing node (IP 192.168.201.11) I have

sysop@kvmnode1:~/Scaricati$ echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db
Error: unable to open database "/var/lib/lxd/database/local.db": unable to open database file
sysop@kvmnode1:~/Scaricati$ sudo ls -lh /var/lib/lxd/database/local.db
-rw-r--r-- 1 root root 40K apr 22 12:09 /var/lib/lxd/database/local.db
sysop@kvmnode1:~/Scaricati$ sudo -i
root@kvmnode1:~# echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db
2|core.https_address|[::]
root@kvmnode1:~# 

On a working node I have

sysop@kvmnode2:~$ sudo -i
root@kvmnode2:~# echo "SELECT * FROM config" | sqlite3 /var/lib/lxd/database/local.db
1|core.https_address|192.168.201.12:8443
root@kvmnode2:~# 

How to recover?


(Free Ekanayaka) #6

LXD 3.12 has the fix. You can get it with the snap (snap install lxd).


(Free Ekanayaka) #7

Okay, you should be able to fix this with:

echo UPDATE config SET value='192.168.201.11:8443' WHERE key='core.https_address'| sqlite3 /var/lib/lxd/database/local.db

(as root).

This means that you will listen to a specific interface, but at least it should make the cluster work again. To have a full fix you’ll probably have to upgrade or wait for the next 3.0.x stable release.


(Mirto Busico) #8

Thanks a lot it worked.
But I’ll restart from scratch using the snap version of LXD


(Mirto Busico) #9

Well I restarted from scratch using lxd 3.12 installed using snap; but there is the same error: after a cluster startup the first node fails to start lxd

On the failing node I have

sysop@kvmnode1:/var/zdata$ sudo lxd --version
[sudo] password di sysop:
3.12
sysop@kvmnode1:/var/zdata$ sudo lxc cluster list
^C
sysop@kvmnode1:/var/zdata$ sudo systemctl status snap.lxd.daemon
● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2019-05-05 18:05:02 CEST; 5min ago
Process: 7275 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
Main PID: 7275 (code=exited, status=1/FAILURE)
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 47.
mag 05 18:05:02 kvmnode1 systemd[1]: Stopped Service for snap application lxd.daemon.
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
mag 05 18:05:02 kvmnode1 systemd[1]: snap.lxd.daemon.service: Failed with result ‘exit-code’.
mag 05 18:05:02 kvmnode1 systemd[1]: Failed to start Service for snap application lxd.daemon.
sysop@kvmnode1:/var/zdata$ sudo -i
root@kvmnode1:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.11:8443
3|core.https_address|[::]
root@kvmnode1:~#

Seems core.https_address is lost

On a working node I see:

sysop@kvmnode2:/var/zdata$ sudo -i
[sudo] password for sysop:
root@kvmnode2:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.12:8443
2|core.https_address|192.168.201.12:8443
root@kvmnode2:~# lxc cluster list
±---------±----------------------------±---------±--------±------------------------------------+
| NAME | URL | DATABASE | STATE | MESSAGE |
±---------±----------------------------±---------±--------±------------------------------------+
| kvmnode1 | https://192.168.201.11:8443 | YES | OFFLINE | no heartbeat since 19m15.295952309s |
±---------±----------------------------±---------±--------±------------------------------------+
| kvmnode2 | https://192.168.201.12:8443 | YES | ONLINE | fully operational |
±---------±----------------------------±---------±--------±------------------------------------+
| kvmnode3 | https://192.168.201.13:8443 | YES | ONLINE | fully operational |
±---------±----------------------------±---------±--------±------------------------------------+
root@kvmnode2:~# lxd --version
3.12
root@kvmnode2:~#

At least lxd version remains 3.12
What I’m missing to have an lxd cluster starting correctly?

UPDATE

Also resetting core.https_addrress seems to ha ve no effect

sysop@kvmnode1:/var/zdata$ sudo systemctl status snap.lxd.daemon
[sudo] password di sysop:
● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: active (running) since Sun 2019-05-05 19:13:37 CEST; 4s ago
Main PID: 13143 (daemon.start)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/snap.lxd.daemon.service
‣ 13143 /bin/sh /snap/lxd/10601/commands/daemon.start
mag 05 19:13:37 kvmnode1 lxd.daemon[13143]: ==> Setting up ceph configuration
mag 05 19:13:37 kvmnode1 lxd.daemon[13143]: ==> Setting up LVM configuration
mag 05 19:13:37 kvmnode1 lxd.daemon[13143]: ==> Rotating logs
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Setting up ZFS (0.7)
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Escaping the systemd cgroups
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Escaping the systemd process resource limits
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: ==> Disabling shiftfs on this kernel (auto)
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: => Re-using existing LXCFS
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: => Starting LXD
mag 05 19:13:39 kvmnode1 lxd.daemon[13143]: t=2019-05-05T19:13:39+0200 lvl=warn msg=“CGroup memory swap accounting is disabled, swap limits will be ignored.”
sysop@kvmnode1:/var/zdata$ sudo -i
root@kvmnode1:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.11:8443
3|core.https_address|[::]
root@kvmnode1:~# echo “UPDATE config SET value=‘192.168.201.11:8443’ WHERE key=‘core.https_address’”| sqlite3 /var/snap/lxd/common/lxd/database/local.db
root@kvmnode1:~# echo “SELECT * FROM config” | sqlite3 /var/snap/lxd/common/lxd/database/local.db
1|cluster.https_address|192.168.201.11:8443
3|core.https_address|192.168.201.11:8443
root@kvmnode1:~# systemctl status snap.lxd.daemon
● snap.lxd.daemon.service - Service for snap application lxd.daemon
Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2019-05-05 19:15:04 CEST; 53s ago
Process: 16229 ExecStart=/usr/bin/snap run lxd.daemon (code=exited, status=1/FAILURE)
Main PID: 16229 (code=exited, status=1/FAILURE)
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Scheduled restart job, restart counter is at 118.
mag 05 19:15:04 kvmnode1 systemd[1]: Stopped Service for snap application lxd.daemon.
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Start request repeated too quickly.
mag 05 19:15:04 kvmnode1 systemd[1]: snap.lxd.daemon.service: Failed with result ‘exit-code’.
mag 05 19:15:04 kvmnode1 systemd[1]: Failed to start Service for snap application lxd.daemon.


#10

can I ask why you are doing this, updating a system table directly, using a system utility to up the ante - snap lxd packages its own sqlite3 - while it has been specifically said many times by lxd devs that this should only be done under instruction ? lxc api has an instruction to setup the remote address
lxc config set core.https_address
so what’s the specific reasoning behind this direct access ? Mind, I don’t know lxd clustering.


(Mirto Busico) #11

Obviously you can ask.
I’m doing this because it was suggested in this thread.

Happy to know that there is an api for this.

Can you share the link to this api documentation?

Thanks