LXD Cluster hangs

Hi there,

This is the second time I’ve set this up (completely removing the LXD snap and reinstalling it in between), and it does the exact same thing when I’m finished:

Those three 17.10 VMs have been processing the lxc cluster list command for about 40 minutes at this point. Load is minimal, reboot does nothing. As you see on LXD01 it briefly showed output for lxc cluster list before adding more nodes, but after adding the nodes it’s no longer responsive.

If I try a snap refresh at this point it hangs forever trying to stop the LXD service.
The same happens if I try to reboot: it gets stuck shutting down the LXD service.

I did not use the preseed yaml, just popped in identical answers on every node (because the three VMs are identical in all but IP address).

So I’m not getting too far with LXD clustering… or even adding containers since any lxc command I type just hangs.

And now, following a reboot - all three VMs are playing up:

@freeekanayaka

@jasonbayton dmesg output during the hang might have been useful, please post that if it happens again. As for the current error, can you paste journalctl -u snap.lxd.daemon for at least one of the affected machines?
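
For anyone gathering the same information, here is a minimal sketch of saving both logs to files instead of copying them out of the pager, which cuts long lines at the terminal width. The function name and the default output directory are made up for illustration; `--no-pager` is a standard journalctl option.

```shell
#!/bin/sh
# Hypothetical helper: save the LXD snap's journal and the kernel log to files.
# The default output directory is an illustrative choice, not an LXD convention.
collect_lxd_logs() {
    out_dir="${1:-./lxd-logs}"
    mkdir -p "$out_dir" || return 1
    # --no-pager writes full, untruncated lines instead of paging through less.
    journalctl -u snap.lxd.daemon --no-pager > "$out_dir/journal.txt" 2>&1 \
        || echo "warning: journalctl failed" >&2
    # dmesg may need root on systems that restrict access to the kernel log.
    dmesg > "$out_dir/dmesg.txt" 2>&1 \
        || echo "warning: dmesg failed" >&2
    # Print the directory so the caller knows where the files ended up.
    echo "$out_dir"
}
```

Calling `collect_lxd_logs /tmp/lxd03-logs` on an affected machine should leave `journal.txt` and `dmesg.txt` in that directory, ready to attach.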

-- Logs begin at Fri 2018-04-06 16:00:45 BST, end at Fri 2018-04-06 22:17:45 BST. --
Apr 06 20:00:55 ubuntu-lxd02 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3463]: => Preparing the system
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3463]: ==> Loading snap configuration
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3463]: /snap/lxd/6492/commands/daemon.start: 21: .: Can't open /var/snap/lxd/common/config
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Unit entered failed state.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3521]: => Preparing the system
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3521]: ==> Loading snap configuration
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3521]: /snap/lxd/6492/commands/daemon.start: 21: .: Can't open /var/snap/lxd/common/config
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Unit entered failed state.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3584]: => Preparing the system
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3584]: ==> Loading snap configuration
Apr 06 20:00:56 ubuntu-lxd02 lxd.daemon[3584]: /snap/lxd/6492/commands/daemon.start: 21: .: Can't open /var/snap/lxd/common/config
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Unit entered failed state.
Apr 06 20:00:56 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-code'.
Apr 06 20:00:57 ubuntu-lxd02 systemd[1]: snap.lxd.daemon.service: Service hold-off time over, scheduling restart.
Apr 06 20:00:57 ubuntu-lxd02 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 20:00:57 ubuntu-lxd02 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: => Preparing the system
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Loading snap configuration
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Setting up mntns symlink
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Setting up kmod wrapper
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Preparing /boot
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Preparing a clean copy of /run
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Preparing a clean copy of /etc
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Setting up bash completion
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Setting up ceph configuration
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Setting up LVM configuration
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Setting up ZFS (0.6)
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Escaping the systemd cgroups
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Escaping the systemd process resource limits
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: ==> Enabling unprivileged containers kernel support
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: => Starting LXCFS
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: mount namespace: 5
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]: hierarchies:
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]:   0: fd:   6: rdma
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]:   1: fd:   7: perf_event
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]:   2: fd:   8: memory
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]:   3: fd:   9: blkio
Apr 06 20:00:57 ubuntu-lxd02 lxd.daemon[3650]:   4: fd:  10: cpu,cpuacct
...skipping...
Apr 06 22:14:33 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:14:33+0000
Apr 06 22:14:35 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:14:35 http: multiple response.WriteHeader calls
Apr 06 22:14:39 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:14:39+0000
Apr 06 22:14:42 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:14:42+0000
Apr 06 22:14:46 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:14:46+0000
Apr 06 22:14:51 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:14:51+0000
Apr 06 22:14:55 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:14:55 http: multiple response.WriteHeader calls
Apr 06 22:14:57 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:14:57+0000
Apr 06 22:15:02 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:02+0000
Apr 06 22:15:07 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:07+0000
Apr 06 22:15:11 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:11+0000
Apr 06 22:15:15 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:15:15 http: multiple response.WriteHeader calls
Apr 06 22:15:16 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:16+0000
Apr 06 22:15:19 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:19+0000
Apr 06 22:15:25 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:25+0000
Apr 06 22:15:29 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:29+0000
Apr 06 22:15:33 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:33+0000
Apr 06 22:15:35 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:15:35 http: multiple response.WriteHeader calls
Apr 06 22:15:40 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:40+0000
Apr 06 22:15:44 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:44+0000
Apr 06 22:15:49 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:49+0000
Apr 06 22:15:54 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:15:54+0000
Apr 06 22:15:55 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: gRPC grpcConnection failed: context deadline exceeded" t=2018-04-06
Apr 06 22:16:01 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:01+0000
Apr 06 22:16:05 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:16:05 http: multiple response.WriteHeader calls
Apr 06 22:16:07 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:07+0000
Apr 06 22:16:10 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:10+0000
Apr 06 22:16:14 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:14+0000
Apr 06 22:16:18 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:18+0000
Apr 06 22:16:22 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:22+0000
Apr 06 22:16:25 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:16:25 http: multiple response.WriteHeader calls
Apr 06 22:16:28 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:28+0000
Apr 06 22:16:32 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:32+0000
Apr 06 22:16:36 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:36+0000
Apr 06 22:16:42 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:42+0000
Apr 06 22:16:45 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:16:45 http: multiple response.WriteHeader calls
Apr 06 22:16:48 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:48+0000
Apr 06 22:16:53 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:53+0000
Apr 06 22:16:59 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:16:59+0000
Apr 06 22:17:04 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:04+0000
Apr 06 22:17:05 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:17:05 http: multiple response.WriteHeader calls
Apr 06 22:17:09 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:09+0000
Apr 06 22:17:14 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:14+0000
Apr 06 22:17:19 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:19+0000
Apr 06 22:17:22 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:22+0000
Apr 06 22:17:25 ubuntu-lxd02 lxd.daemon[2906]: 2018/04/06 21:17:25 http: multiple response.WriteHeader calls
Apr 06 22:17:26 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:26+0000
Apr 06 22:17:33 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:33+0000
Apr 06 22:17:38 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:38+0000
Apr 06 22:17:41 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Raft: Election timeout reached, restarting election" t=2018-04-06T21:17:41+0000
Apr 06 22:17:45 ubuntu-lxd02 lxd.daemon[2906]: lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: gRPC grpcConnection failed: context deadline exceeded" t=2018-04-06
-- Logs begin at Fri 2018-04-06 15:50:11 BST, end at Fri 2018-04-06 22:18:05 BST. --
Apr 06 20:44:13 ubuntu-lxd03 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:44:14 ubuntu-lxd03 lxd.daemon[4248]: => Preparing the system
Apr 06 20:44:14 ubuntu-lxd03 lxd.daemon[4248]: ==> Loading snap configuration
Apr 06 20:44:14 ubuntu-lxd03 lxd.daemon[4248]: /snap/lxd/6578/commands/daemon.start: 22: .:
Apr 06 20:44:14 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Main process exited, code
Apr 06 20:44:14 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Unit entered failed state
Apr 06 20:44:14 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Service hold-off time ove
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:44:15 ubuntu-lxd03 lxd.daemon[4292]: => Preparing the system
Apr 06 20:44:15 ubuntu-lxd03 lxd.daemon[4292]: ==> Loading snap configuration
Apr 06 20:44:15 ubuntu-lxd03 lxd.daemon[4292]: /snap/lxd/6578/commands/daemon.start: 22: .:
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Main process exited, code
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Unit entered failed state
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Service hold-off time ove
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 20:44:15 ubuntu-lxd03 lxd.daemon[4311]: => Preparing the system
Apr 06 20:44:15 ubuntu-lxd03 lxd.daemon[4311]: ==> Loading snap configuration
Apr 06 20:44:15 ubuntu-lxd03 lxd.daemon[4311]: /snap/lxd/6578/commands/daemon.start: 22: .:
Apr 06 20:44:15 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Main process exited, code
...skipping...
Apr 06 21:57:09 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Unit entered failed state
Apr 06 21:57:09 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Failed with result 'timeo
-- Reboot --
Apr 06 21:58:29 ubuntu-lxd03 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: => Preparing the system
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: ==> Loading snap configuration
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: ==> Setting up mntns symlink
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: ==> Setting up kmod wrapper
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: ==> Preparing /boot
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: ==> Preparing a clean copy of /run
Apr 06 21:58:30 ubuntu-lxd03 lxd.daemon[1664]: ==> Preparing a clean copy of /etc
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Setting up ceph configuration
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Setting up LVM configuration
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Rotating logs
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Setting up ZFS (0.6)
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Escaping the systemd cgroups
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Escaping the systemd process resource li
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: ==> Enabling unprivileged containers kernel
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: => Starting LXCFS
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: => Starting LXD
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: mount namespace: 5
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]: hierarchies:
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   0: fd:   6: perf_event
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   1: fd:   7: freezer
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   2: fd:   8: devices
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   3: fd:   9: blkio
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   4: fd:  10: cpu,cpuacct
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   5: fd:  11: memory
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   6: fd:  12: hugetlb
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   7: fd:  13: rdma
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   8: fd:  14: pids
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:   9: fd:  15: net_cls,net_prio
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:  10: fd:  16: cpuset
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:  11: fd:  17: name=systemd
Apr 06 21:58:32 ubuntu-lxd03 lxd.daemon[1664]:  12: fd:  18: unified
Apr 06 21:58:33 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="CGroup memory swap accounting
Apr 06 21:58:38 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Heartbeat timeout from \
Apr 06 21:58:41 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Election timeout reached
Apr 06 21:58:45 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Election timeout reached
Apr 06 21:58:13 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Election timeout reached
Apr 06 21:58:18 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Election timeout reached
Apr 06 21:58:24 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Election timeout reached
Apr 06 21:58:28 ubuntu-lxd03 lxd.daemon[1664]: lvl=warn msg="Raft: Election timeout reached
Apr 06 21:59:01 ubuntu-lxd03 lxd.daemon[1664]: Error: no "source" property found for the st
Apr 06 22:07:55 ubuntu-lxd03 lxd.daemon[1664]: Error: LXD still not running after 600s time
Apr 06 22:07:55 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Main process exited, code
Apr 06 22:07:55 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Unit entered failed state
Apr 06 22:07:55 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-
Apr 06 22:07:56 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Service hold-off time ove
Apr 06 22:07:56 ubuntu-lxd03 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 22:07:56 ubuntu-lxd03 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: => Preparing the system
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Loading snap configuration
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Setting up mntns symlink
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Setting up kmod wrapper
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Preparing /boot
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Preparing a clean copy of /run
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Preparing a clean copy of /etc
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Setting up ceph configuration
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Setting up LVM configuration
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Rotating logs
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Setting up ZFS (0.6)
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Escaping the systemd cgroups
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Escaping the systemd process resource li
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: ==> Enabling unprivileged containers kernel
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: => Starting LXCFS
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: mount namespace: 5
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: hierarchies:
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   0: fd:   6: perf_event
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   1: fd:   7: freezer
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   2: fd:   8: devices
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   3: fd:   9: blkio
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   4: fd:  10: cpu,cpuacct
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   5: fd:  11: memory
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   6: fd:  12: hugetlb
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   7: fd:  13: rdma
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   8: fd:  14: pids
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:   9: fd:  15: net_cls,net_prio
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:  10: fd:  16: cpuset
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:  11: fd:  17: name=systemd
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]:  12: fd:  18: unified
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: => Starting LXD
Apr 06 22:07:56 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="CGroup memory swap accounting
Apr 06 22:07:59 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Heartbeat timeout from \
Apr 06 22:08:05 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Election timeout reached
Apr 06 22:08:09 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Election timeout reached
Apr 06 22:08:13 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Election timeout reached
Apr 06 22:08:17 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Election timeout reached
Apr 06 22:08:22 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Election timeout reached
Apr 06 22:08:26 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Election timeout reached
Apr 06 22:08:26 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: AppendEntries to {Voter
Apr 06 22:08:28 ubuntu-lxd03 lxd.daemon[2261]: lvl=warn msg="Raft: Failed to contact 1 in 2
Apr 06 22:08:51 ubuntu-lxd03 lxd.daemon[2261]: Error: no "source" property found for the st
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2261]: Error: LXD still not running after 600s time
Apr 06 22:17:56 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Main process exited, code
Apr 06 22:17:56 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Unit entered failed state
Apr 06 22:17:56 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Failed with result 'exit-
Apr 06 22:17:56 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Service hold-off time ove
Apr 06 22:17:56 ubuntu-lxd03 systemd[1]: Stopped Service for snap application lxd.daemon.
Apr 06 22:17:56 ubuntu-lxd03 systemd[1]: Started Service for snap application lxd.daemon.
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: => Preparing the system
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Loading snap configuration
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Setting up mntns symlink
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Setting up kmod wrapper
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Preparing /boot
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Preparing a clean copy of /run
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Preparing a clean copy of /etc
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Setting up ceph configuration
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Setting up LVM configuration
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Rotating logs
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Setting up ZFS (0.6)
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Escaping the systemd cgroups
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Escaping the systemd process resource li
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: ==> Enabling unprivileged containers kernel
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: => Starting LXCFS
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: mount namespace: 5
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: hierarchies:
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: => Starting LXD
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   0: fd:   6: perf_event
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   1: fd:   7: freezer
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   2: fd:   8: devices
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   3: fd:   9: blkio
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   4: fd:  10: cpu,cpuacct
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   5: fd:  11: memory
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   6: fd:  12: hugetlb
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   7: fd:  13: rdma
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   8: fd:  14: pids
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:   9: fd:  15: net_cls,net_prio
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:  10: fd:  16: cpuset
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:  11: fd:  17: name=systemd
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]:  12: fd:  18: unified
Apr 06 22:17:56 ubuntu-lxd03 lxd.daemon[2386]: lvl=warn msg="CGroup memory swap accounting
Apr 06 22:17:58 ubuntu-lxd03 lxd.daemon[2386]: lvl=warn msg="Raft: Failed to get previous l
Apr 06 22:18:05 ubuntu-lxd03 lxd.daemon[2386]: Error: no "source" property found for the st
jason@ubuntu-lxd03:~$
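
Both logs show the same pattern: repeated Raft election timeouts, and on lxd03 a storage pool error followed by the 600s startup timeout. Election timeouts generally mean a node cannot reach a majority of its peers; assuming all three nodes are voting database members, that would be two out of three. A rough sketch for tallying these messages from journal text follows. The function name is hypothetical, and the patterns are just the message strings from the output above.

```shell
#!/bin/sh
# Hypothetical helper: tally the recurring messages in journal text read from stdin.
# The patterns are substrings of the log lines quoted in this thread.
summarize_lxd_journal() {
    awk '
        /Raft: Election timeout reached/ { elections++ }
        /Raft: Heartbeat timeout/        { heartbeats++ }
        /no "source" property found/     { source_errors++ }
        /LXD still not running after/    { start_timeouts++ }
        END {
            # Unset awk counters print as 0 with %d.
            printf "election_timeouts=%d\n", elections
            printf "heartbeat_timeouts=%d\n", heartbeats
            printf "source_errors=%d\n", source_errors
            printf "start_timeouts=%d\n", start_timeouts
        }
    '
}
```

It can be fed directly from the journal, for example `journalctl -u snap.lxd.daemon --no-pager | summarize_lxd_journal`, to compare the three machines at a glance.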

Interestingly, LXD01 (master) has recovered by itself, as has LXD02. LXD03 is still down, however, with the same error as in the screenshot above. Here’s dmesg on LXD03:

[   12.461819] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[   12.526951] vmw_vmci 0000:00:07.7: Found VMCI PCI device at 0x11080, irq 16
[   12.527079] vmw_vmci 0000:00:07.7: Using capabilities 0xc
[   12.530150] Guest personality initialized and is active
[   12.531332] VMCI host device registered (name=vmci, major=10, minor=55)
[   12.531334] Initialized host personality
[   13.378102] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[   13.378104] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[   13.378105] RAPL PMU: hw unit of domain package 2^-0 Joules
[   13.378106] RAPL PMU: hw unit of domain pp1-gpu 2^-0 Joules
[   15.585098] floppy0: no floppy controllers found
[   15.874035] ppdev: user-space parallel port driver
[   16.174908] spl: loading out-of-tree module taints kernel.
[   16.182201] SPL: Loaded module v0.6.5.11-1ubuntu1
[   16.189683] znvpair: module license 'CDDL' taints kernel.
[   16.189685] Disabling lock debugging due to kernel taint
[   16.301267] ZFS: Loaded module v0.6.5.11-1ubuntu3, ZFS pool version 5000, ZFS filesystem version 5
[   16.524702] SPL: using hostid 0x00000000
[   17.967131] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   18.151851] IPv6: ADDRCONF(NETDEV_UP): br0: link is not ready
[   18.153907] br0: port 1(ens160) entered blocking state
[   18.153911] br0: port 1(ens160) entered disabled state
[   18.154116] device ens160 entered promiscuous mode
[   18.158992] vmxnet3 0000:03:00.0 ens160: intr type 3, mode 0, 3 vectors allocated
[   18.160284] vmxnet3 0000:03:00.0 ens160: NIC Link is Up 10000 Mbps
[   18.160842] br0: port 1(ens160) entered blocking state
[   18.160845] br0: port 1(ens160) entered forwarding state
[   18.160904] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready
[   18.815020] audit: type=1400 audit(1523048258.342:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=1526 comm="apparmor_parser"
[   18.815028] audit: type=1400 audit(1523048258.342:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=1526 comm="apparmor_parser"
[   18.815032] audit: type=1400 audit(1523048258.342:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=1526 comm="apparmor_parser"
[   18.815035] audit: type=1400 audit(1523048258.342:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=1526 comm="apparmor_parser"
[   18.821403] audit: type=1400 audit(1523048258.349:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default" pid=1525 comm="apparmor_parser"
[   18.821410] audit: type=1400 audit(1523048258.349:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-cgns" pid=1525 comm="apparmor_parser"
[   18.821413] audit: type=1400 audit(1523048258.349:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-mounting" pid=1525 comm="apparmor_parser"
[   18.821417] audit: type=1400 audit(1523048258.349:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxc-container-default-with-nesting" pid=1525 comm="apparmor_parser"
[   18.854113] NET: Registered protocol family 40
[   19.023434] audit: type=1400 audit(1523048258.551:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" pid=1528 comm="apparmor_parser"
[   19.033020] audit: type=1400 audit(1523048258.560:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/core/4206/usr/lib/snapd/snap-confine" pid=1527 comm="apparmor_parser"
[   23.891254] kauditd_printk_skb: 19 callbacks suppressed
[   23.891257] audit: type=1400 audit(1523048301.163:31): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.benchmark" pid=1614 comm="apparmor_parser"
[   23.901168] audit: type=1400 audit(1523048301.172:32): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.check-kernel" pid=1616 comm="apparmor_parser"
[   23.910694] audit: type=1400 audit(1523048301.182:33): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.daemon" pid=1618 comm="apparmor_parser"
[   23.919448] audit: type=1400 audit(1523048301.191:34): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.database" pid=1620 comm="apparmor_parser"
[   23.928459] audit: type=1400 audit(1523048301.200:35): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.hook.configure" pid=1622 comm="apparmor_parser"
[   23.936468] audit: type=1400 audit(1523048301.208:36): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.lxc" pid=1624 comm="apparmor_parser"
[   23.945001] audit: type=1400 audit(1523048301.216:37): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.lxd" pid=1626 comm="apparmor_parser"
[   23.953415] audit: type=1400 audit(1523048301.225:38): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.lxd.migrate" pid=1628 comm="apparmor_parser"
[   35.466868] new mount options do not match the existing superblock, will be ignored
[  637.065940] new mount options do not match the existing superblock, will be ignored
[ 1237.577428] new mount options do not match the existing superblock, will be ignored
jason@ubuntu-lxd03:~$

Does systemctl reload snap.lxd.daemon help with lxd03?

That error about storage source property on lxd03 is probably the source of the problem though, @freeekanayaka should be able to help with that.

My guess is that he’s going to ask for lxc storage list on a working node, as well as lxc storage show POOL --target NAME for each of the pools listed, with NAME replaced by each of the cluster nodes in turn, and for the output of sqlite3 /var/snap/lxd/common/lxd/lxd.db .dump on lxd03.

So assuming you have 3 nodes (lxd01, lxd02 and lxd03) and only one storage pool called local, that’d be a total of:

  • lxc storage list (on lxd01)
  • lxc storage show local --target lxd01 (on lxd01)
  • lxc storage show local --target lxd02 (on lxd01)
  • lxc storage show local --target lxd03 (on lxd01)
  • sqlite3 /var/snap/lxd/common/lxd/lxd.db .dump (on lxd03)

That should help figure out if there’s indeed some missing config there.
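For anyone repeating this on a larger cluster, the checklist above could be scripted roughly like this. It is only a sketch: the pool name and node names are the ones assumed in this post, print_storage_checks is a made-up helper name, and it prints the commands rather than running them so they can be reviewed first.

```shell
#!/bin/sh
# Pool and node names as assumed in the post above; substitute the names
# that "lxc cluster list" reports for your own cluster.
POOL="local"
NODES="lxd01 lxd02 lxd03"

# Hypothetical helper: prints the diagnostic commands instead of running them.
print_storage_checks() {
    echo "lxc storage list"
    for node in $NODES; do
        echo "lxc storage show $POOL --target $node"
    done
}

print_storage_checks
```

Piping the output to sh on a working node would run the checks. The sqlite3 dump still has to be taken separately on the broken node.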

OK, I’ll break this down :)

jason@ubuntu-lxd03:~$ sudo systemctl reload snap.lxd.daemon
Job for snap.lxd.daemon.service failed because the control process exited with error code.
See "systemctl  status snap.lxd.daemon.service" and "journalctl  -xe" for details.
jason@ubuntu-lxd03:~$ systemctl  status snap.lxd.daemon.service
● snap.lxd.daemon.service - Service for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; enabled; vendor preset: enabled)
   Active: active (running) (Result: exit-code) since Fri 2018-04-06 22:41:57 BST; 1min 54s ago
  Process: 2076 ExecReload=/usr/bin/snap run --command=reload lxd.daemon (code=exited, status=1/FAILURE)
 Main PID: 1943 (daemon.start)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/snap.lxd.daemon.service
           ‣ 1943 /bin/sh /snap/lxd/6578/commands/daemon.start

Apr 06 22:41:57 ubuntu-lxd03 lxd.daemon[1943]:   8: fd:  14: freezer
Apr 06 22:41:57 ubuntu-lxd03 lxd.daemon[1943]:   9: fd:  15: memory
Apr 06 22:41:57 ubuntu-lxd03 lxd.daemon[1943]:  10: fd:  16: blkio
Apr 06 22:41:57 ubuntu-lxd03 lxd.daemon[1943]:  11: fd:  17: name=systemd
Apr 06 22:41:57 ubuntu-lxd03 lxd.daemon[1943]:  12: fd:  18: unified
Apr 06 22:41:57 ubuntu-lxd03 lxd.daemon[1943]: lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2018-04-06T21:41:57+0000
Apr 06 22:41:58 ubuntu-lxd03 lxd.daemon[1943]: Error: no "source" property found for the storage pool
Apr 06 22:43:41 ubuntu-lxd03 systemd[1]: Reloading Service for snap application lxd.daemon.
Apr 06 22:43:41 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=1
Apr 06 22:43:41 ubuntu-lxd03 systemd[1]: Reload failed for Service for snap application lxd.daemon.
jason@ubuntu-lxd03:~$
jason@ubuntu-lxd01:~$ lxc storage list
+-------+-------------+--------+---------+---------+
| NAME  | DESCRIPTION | DRIVER |  STATE  | USED BY |
+-------+-------------+--------+---------+---------+
| local |             | zfs    | CREATED | 1       |
+-------+-------------+--------+---------+---------+
jason@ubuntu-lxd01:~$ lxc storage show local --target ubuntu-lxd01
config:
  source: local
  volatile.initial_source: /dev/sdb
  zfs.pool_name: local
description: ""
name: local
driver: zfs
used_by:
- /1.0/profiles/default
status: Created
locations:
- ubuntu-lxd01
- ubuntu-lxd02
- ubuntu-lxd03
jason@ubuntu-lxd01:~$ lxc storage show local --target ubuntu-lxd02
config:
  source: local
  volatile.initial_source: /dev/sdb
  zfs.pool_name: local
description: ""
name: local
driver: zfs
used_by:
- /1.0/profiles/default
status: Created
locations:
- ubuntu-lxd01
- ubuntu-lxd02
- ubuntu-lxd03
jason@ubuntu-lxd01:~$ lxc storage show local --target ubuntu-lxd03
Error: Get https://10.10.30.13:8443/1.0/storage-pools/local?target=ubuntu-lxd03: Unable to connect to: 10.10.30.13:8443
jason@ubuntu-lxd03:~$ sqlite3 /var/snap/lxd/common/lxd/lxd.db .dump
The program 'sqlite3' is currently not installed. You can install it by typing:
sudo apt install sqlite3

To be perfectly clear, I’ve done absolutely nothing different on lxd03 compared to lxd02. lxd03 showed up as a cluster member on lxd01 and lxd02 successfully, and has now died.

jason@ubuntu-lxd01:~$ lxc cluster list
+--------------+--------------------------+----------+---------+-----------------------------------+
|     NAME     |           URL            | DATABASE |  STATE  |              MESSAGE              |
+--------------+--------------------------+----------+---------+-----------------------------------+
| ubuntu-lxd01 | https://10.10.30.11:8443 | YES      | ONLINE  | fully operational                 |
+--------------+--------------------------+----------+---------+-----------------------------------+
| ubuntu-lxd02 | https://10.10.30.12:8443 | YES      | ONLINE  | fully operational                 |
+--------------+--------------------------+----------+---------+-----------------------------------+
| ubuntu-lxd03 | https://10.10.30.13:8443 | YES      | OFFLINE | no heartbeat since 32m2.86096704s |
+--------------+--------------------------+----------+---------+-----------------------------------+

Also interestingly, zpool list now shows no pools on lxd03. I can’t re-run lxd init on the box though:

jason@ubuntu-lxd03:~$ sudo lxd init
Error: Failed to connect to local LXD: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory

Can you install sqlite3 and do that .dump on ubuntu-lxd03?

And on 01 or 02, can you run lxd sql "SELECT * FROM storage_pools_config;"?

jason@ubuntu-lxd01:~$ lxd sql "SELECT * FROM storage_pools_config;"
+------------+-----------------+------------+-------------------------+----------+
| id         | storage_pool_id | node_id    | key                     | value    |
+------------+-----------------+------------+-------------------------+----------+
| 3          | 1               | 1          | source                  | local    |
| 4          | 1               | 1          | volatile.initial_source | /dev/sdb |
| 5          | 1               | 1          | zfs.pool_name           | local    |
| 6          | 1               | 2          | source                  | local    |
| 7          | 1               | 2          | volatile.initial_source | /dev/sdb |
| 8          | 1               | 2          | zfs.pool_name           | local    |
+------------+-----------------+------------+-------------------------+----------+
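Reading that table by eye is fine for three nodes. As a rough sketch, the same check can be done by filtering the output for rows with a source key. nodes_with_source is a hypothetical helper name, and it assumes the ASCII table layout that lxd sql printed above.

```shell
#!/bin/sh
# Hypothetical helper: reads the ASCII table printed by
#   lxd sql "SELECT * FROM storage_pools_config;"
# on stdin and prints each node_id that has a "source" key.
nodes_with_source() {
    awk -F'|' '
        NF >= 6 {
            node = $4; key = $5
            gsub(/ /, "", node)
            gsub(/ /, "", key)
            if (key == "source") print node
        }
    ' | sort -n | uniq
}

# Sample rows copied from the output above.
nodes_with_source <<'EOF'
| id | storage_pool_id | node_id | key                     | value    |
| 3  | 1               | 1       | source                  | local    |
| 4  | 1               | 1       | volatile.initial_source | /dev/sdb |
| 6  | 1               | 2       | source                  | local    |
EOF
```

Any node_id that is missing from the printed list has no source row in the global database.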

Oh sure, sorry - not sure why I assumed it’d just work…

jason@ubuntu-lxd03:~$ sqlite3 /var/snap/lxd/common/lxd/lxd.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE schema (
    id         INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    version    INTEGER NOT NULL,
    updated_at DATETIME NOT NULL,
    UNIQUE (version)
);
INSERT INTO schema VALUES(1,37,1523052307);
CREATE TABLE config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    key VARCHAR(255) NOT NULL,
    value TEXT,
    UNIQUE (key)
);
INSERT INTO config VALUES(2,'core.https_address','10.10.30.13:8443');
CREATE TABLE patches (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    name VARCHAR(255) NOT NULL,
    applied_at DATETIME NOT NULL,
    UNIQUE (name)
);
INSERT INTO patches VALUES(1,'invalid_profile_names',1523052307);
INSERT INTO patches VALUES(2,'leftover_profile_config',1523052307);
INSERT INTO patches VALUES(3,'network_permissions',1523052307);
INSERT INTO patches VALUES(4,'storage_api',1523052307);
INSERT INTO patches VALUES(5,'storage_api_v1',1523052307);
INSERT INTO patches VALUES(6,'storage_api_dir_cleanup',1523052307);
INSERT INTO patches VALUES(7,'storage_api_lvm_keys',1523052307);
INSERT INTO patches VALUES(8,'storage_api_keys',1523052307);
INSERT INTO patches VALUES(9,'storage_api_update_storage_configs',1523052307);
INSERT INTO patches VALUES(10,'storage_api_lxd_on_btrfs',1523052307);
INSERT INTO patches VALUES(11,'storage_api_lvm_detect_lv_size',1523052307);
INSERT INTO patches VALUES(12,'storage_api_insert_zfs_driver',1523052308);
INSERT INTO patches VALUES(13,'storage_zfs_noauto',1523052308);
INSERT INTO patches VALUES(14,'storage_zfs_volume_size',1523052308);
INSERT INTO patches VALUES(15,'network_dnsmasq_hosts',1523052308);
INSERT INTO patches VALUES(16,'storage_api_dir_bind_mount',1523052308);
INSERT INTO patches VALUES(17,'fix_uploaded_at',1523052308);
INSERT INTO patches VALUES(18,'storage_api_ceph_size_remove',1523052308);
INSERT INTO patches VALUES(19,'devices_new_naming_scheme',1523052308);
CREATE TABLE raft_nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    address TEXT NOT NULL,
    UNIQUE (address)
);
INSERT INTO raft_nodes VALUES(1,'10.10.30.11:8443');
INSERT INTO raft_nodes VALUES(2,'10.10.30.12:8443');
INSERT INTO raft_nodes VALUES(4,'10.10.30.13:8443');
DELETE FROM sqlite_sequence;
INSERT INTO sqlite_sequence VALUES('schema',1);
INSERT INTO sqlite_sequence VALUES('patches',19);
INSERT INTO sqlite_sequence VALUES('config',2);
INSERT INTO sqlite_sequence VALUES('raft_nodes',4);
COMMIT;
jason@ubuntu-lxd03:~$

Ok, cool. I’m not sure why 03 is not properly joined (missing storage config), but we can fix that in the global database with:

lxd sql "INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (1, 3, 'source', 'local');"
lxd sql "INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (1, 3, 'volatile.initial_source', '/dev/sdb');"
lxd sql "INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (1, 3, 'zfs.pool_name', 'local');"

Run the above on 01 or 02, that should fix the database. Then try a systemctl reload snap.lxd.daemon on 03 to see if that gets you past that error.
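Since the three statements differ only in key and value, they can also be generated. This is a sketch under the same assumptions as the statements above (pool id 1, node id 3); print_pool_config_inserts is a hypothetical helper, and it only prints the commands without touching the database.

```shell
#!/bin/sh
# Hypothetical helper: prints one "lxd sql" command per key=value pair.
# Arguments: storage_pool_id, node_id, then any number of key=value pairs.
# Values are not escaped, so this is only suitable for simple values
# without quote characters.
print_pool_config_inserts() {
    pool_id="$1"
    node_id="$2"
    shift 2
    fmt="lxd sql \"INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (%s, %s, '%s', '%s');\"\n"
    for pair in "$@"; do
        key="${pair%%=*}"    # text before the first "="
        value="${pair#*=}"   # text after the first "="
        printf "$fmt" "$pool_id" "$node_id" "$key" "$value"
    done
}

# Keys and values copied from the working nodes' rows.
print_pool_config_inserts 1 3 \
    source=local \
    volatile.initial_source=/dev/sdb \
    zfs.pool_name=local
```

It is worth confirming the node id and the values against a working node’s rows before running anything that gets printed.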

Unfortunately that didn’t resolve the issue.

jason@ubuntu-lxd01:~$ lxd sql "INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (1, 3, 'source', 'local');"
Rows affected: 1
jason@ubuntu-lxd01:~$ lxd sql "INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (1, 3, 'volatile.initial_source', '/dev/sdb');"
Rows affected: 1
jason@ubuntu-lxd01:~$ lxd sql "INSERT INTO storage_pools_config (storage_pool_id, node_id, key, value) VALUES (1, 3, 'zfs.pool_name', 'local');"
Rows affected: 1
jason@ubuntu-lxd03:~$ sudo systemctl reload snap.lxd.daemon
Job for snap.lxd.daemon.service failed because the control process exited with error code.
See "systemctl  status snap.lxd.daemon.service" and "journalctl  -xe" for details.
jason@ubuntu-lxd03:~$ systemctl  status snap.lxd.daemon.service
● snap.lxd.daemon.service - Service for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; enabled; vendor preset: enabled)
   Active: active (running) (Result: exit-code) since Fri 2018-04-06 23:19:18 BST; 2min 55s ago
  Process: 2293 ExecReload=/usr/bin/snap run --command=reload lxd.daemon (code=exited, status=1/FAILURE)
 Main PID: 2146 (daemon.start)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/snap.lxd.daemon.service
           ‣ 2146 /bin/sh /snap/lxd/6578/commands/daemon.start

Apr 06 23:19:19 ubuntu-lxd03 lxd.daemon[2146]: Error: no "source" property found for the storage pool
Apr 06 23:22:00 ubuntu-lxd03 systemd[1]: Reloading Service for snap application lxd.daemon.
Apr 06 23:22:00 ubuntu-lxd03 systemd[1]: snap.lxd.daemon.service: Control process exited, code=exited status=1
Apr 06 23:22:00 ubuntu-lxd03 systemd[1]: Reload failed for Service for snap application lxd.daemon.

I’ve also just forcibly removed the node from the cluster, removed the snap, reinstalled it and re-run lxd init. It came online, rejoined and looked OK, but after a reboot I’m back in the same situation.

That’s bizarre, it doesn’t make much sense that 01 and 02 would be fine but 03 is somehow lacking that config…

Do you still have the output of lxd init for when you joined 03 again? I’d like to see if it asked you for the source property for your local pool or if that got skipped somehow.

I certainly do, here you go:

jason@ubuntu-lxd03:~$ sudo lxd init
Would you like to use LXD clustering? (yes/no) [default=no]: yes
What name should be used to identify this node in the cluster? [default=ubuntu-lxd03]:
What IP address or DNS name should be used to reach this node? [default=10.10.30.13]:
Are you joining an existing cluster? (yes/no) [default=no]: yes
IP address or FQDN of an existing cluster node: 10.10.30.11
Cluster fingerprint: aef77f96b392da746ab3a54fdd878adceeb701f02e362b7ca909fcb17cfdc0b5
You can validate this fingerpring by running "lxc info" locally on an existing node.
Is this the correct fingerprint? (yes/no) [default=no]: yes
Cluster trust password:
All existing data is lost when joining a cluster, continue? (yes/no) [default=no] yes
Choose the local disk or dataset for storage pool "local" (empty for loop disk): /dev/sdb
IP address or FQDN of an existing cluster node: 10.10.30.11
Cluster fingerprint: aef77f96b392da746ab3a54fdd878adceeb701f02e362b7ca909fcb17cfdc0b5
You can validate this fingerpring by running "lxc info" locally on an existing node.
Is this the correct fingerprint? (yes/no) [default=no]: yes
Cluster trust password:
All existing data is lost when joining a cluster, continue? (yes/no) [default=no] yes
Choose the local disk or dataset for storage pool "local" (empty for loop disk): /dev/sdb
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:

Oh, lxd03 is now up and running OK.
It seems it did what lxd02 did initially and gave this error:

jason@ubuntu-lxd03:~$ lxc cluster list
Error: Get http://unix.socket/1.0: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: no such file or directory

Then after a delay, came online.

Hmm, ok, that’s very odd.
What does lxd sql "SELECT * FROM storage_pools_config;" look like now?

I can understand a node taking a while to come back, especially when it’s a database node, as a new election must happen before it can bring its DB back online, but that should take a few seconds, not longer…
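To measure how long a node really takes to come back after a restart, a small retry loop is one option. This is a sketch only: wait_until is a hypothetical helper, and the 60 second limit in the example is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: retries a command once per second until it succeeds
# or the limit (in seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
    limit="$1"
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example (not run here): wait up to 60 seconds for the local daemon to answer.
# wait_until 60 lxc cluster list || echo "LXD still not responding after 60s"
```

Note that the loop only helps when the command fails quickly, as with the missing unix.socket error. If the command itself hangs, it would need to be wrapped in something like coreutils timeout.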


jason@ubuntu-lxd01:~$ lxd sql "SELECT * FROM storage_pools_config;"
+------------+-----------------+------------+-------------------------+----------+
| id         | storage_pool_id | node_id    | key                     | value    |
+------------+-----------------+------------+-------------------------+----------+
| 3          | 1               | 1          | source                  | local    |
| 4          | 1               | 1          | volatile.initial_source | /dev/sdb |
| 5          | 1               | 1          | zfs.pool_name           | local    |
| 6          | 1               | 2          | source                  | local    |
| 7          | 1               | 2          | volatile.initial_source | /dev/sdb |
| 8          | 1               | 2          | zfs.pool_name           | local    |
| 9          | 1               | 3          | source                  | local    |
| 10         | 1               | 3          | volatile.initial_source | /dev/sdb |
| 11         | 1               | 3          | zfs.pool_name           | local    |
+------------+-----------------+------------+-------------------------+----------+

LXD03 is on a different physical VM host, but I’ve got plenty of machines spread across different hosts that talk to one another without any issues, so I can’t really say whether that’s a factor…

Can you check that the time on all 3 systems is very similar? I’m not sure if Raft cares about that, but it’s not uncommon for clustering-related tools to expect the clocks to match.
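One rough way to put a number on the clock difference is to collect epoch timestamps from each node and compare them. This is a sketch, assuming SSH access to every node; max_skew is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: given epoch timestamps (in seconds) as arguments,
# prints the difference between the largest and the smallest.
max_skew() {
    min="$1"
    max="$1"
    for t in "$@"; do
        if [ "$t" -lt "$min" ]; then min="$t"; fi
        if [ "$t" -gt "$max" ]; then max="$t"; fi
    done
    echo $((max - min))
}

# Example (not run here), assuming SSH access to each node:
# max_skew $(for n in ubuntu-lxd01 ubuntu-lxd02 ubuntu-lxd03; do ssh "$n" date +%s; done)
```

The timestamps are collected one after another, so the SSH round-trips inflate the result a little. A value of a second or two does not necessarily mean the clocks disagree.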

All synced via NTP and identical :(

I don’t see any obvious issues that’d cause this, so I’ll just have to keep an eye on it. It doesn’t feel stable enough to migrate my 20 containers to it yet :(

On a loosely related note, does 3.0 introduce a way of manually moving containers between nodes in a cluster while they’re online? That would be handy should I need to bounce a host.

I need to figure out how Ceph can be set up so that won’t be an issue, but until then :)…