Snap.lxd.activate.service hanging after system reboot

laney · July 7, 2020, 2:15pm

I restarted my machine earlier, and lxd is failing to come up unfortunately. snap.lxd.activate.service seems to hang indefinitely. Hoping someone can help me get it back on its feet. Here’s some logs, let me know if anything else would be helpful:

laney@raleigh> systemctl list-jobs
JOB   UNIT                      TYPE  STATE  
12950 snap.lxd.activate.service start running

1 jobs listed.
laney@raleigh> systemctl status snap.lxd.activate.service 
● snap.lxd.activate.service - Service for snap application lxd.activate
     Loaded: loaded (/etc/systemd/system/snap.lxd.activate.service; enabled; vendor preset: enabled)
     Active: activating (start) since Tue 2020-07-07 14:57:29 BST; 2min 25s ago
   Main PID: 46558 (daemon.activate)
      Tasks: 0 (limit: 38370)
     Memory: 53.3M
     CGroup: /system.slice/snap.lxd.activate.service
             ├─46558 /bin/sh /snap/lxd/15913/commands/daemon.activate
             └─46594 lxd activateifneeded

Jul 07 14:57:29 raleigh systemd[1]: Starting Service for snap application lxd.activate...
Jul 07 14:57:29 raleigh lxd.activate[46558]: => Starting LXD activation
Jul 07 14:57:29 raleigh lxd.activate[46558]: ==> Loading snap configuration
Jul 07 14:57:29 raleigh lxd.activate[46558]: ==> Checking for socket activation support
Jul 07 14:57:29 raleigh lxd.activate[46558]: ==> Setting LXD socket ownership
Jul 07 14:57:29 raleigh lxd.activate[46558]: ==> Checking if LXD needs to be activated

and (just because this one is long)

laney@raleigh> snap info lxd
name:      lxd
summary:   System container manager and API
publisher: Canonical✓
store-url: https://snapcraft.io/lxd
contact:   https://github.com/lxc/lxd/issues
license:   unset
description: |
  **LXD is a system container manager**
  
  With LXD you can run hundreds of containers of a variety of Linux
  distributions, apply resource limits, pass in directories, USB devices
  or GPUs and setup any network and storage you want.
  
  LXD containers are lightweight, secure by default and a great
  alternative to running Linux virtual machines.
  
  
  **Run any Linux distribution you want**
  
  Pre-made images are available for Ubuntu, Alpine Linux, ArchLinux,
  CentOS, Debian, Fedora, Gentoo, OpenSUSE and more.
  
  A full list of available images can be found here: https://images.linuxcontainers.org
  
  Can't find the distribution you want? It's easy to make your own images too, either using our
  `distrobuilder` tool or by assembling your own image tarball by hand.
  
  
  **Containers at scale**
  
  LXD is network aware and all interactions go through a simple REST API,
  making it possible to remotely interact with containers on remote
  systems, copying and moving them as you wish.
  
  Want to go big? LXD also has built-in clustering support,
  letting you turn dozens of servers into one big LXD server.
  
  
  **Configuration options**
  
  Supported options for the LXD snap (`snap set lxd KEY=VALUE`):
     - ceph.builtin: Use snap-specific ceph configuration [default=false]
     - criu.enable: Enable experimental live-migration support [default=false]
     - daemon.debug: Increases logging to debug level [default=false]
     - daemon.group: Group of users that can interact with LXD [default=lxd]
     - daemon.preseed: A YAML configuration to feed `lxd init` on initial start
     - lxcfs.pidfd: Start per-container process tracking [default=false]
     - lxcfs.loadavg: Start tracking per-container load average [default=false]
     - lxcfs.cfs: Consider CPU shares for CPU usage [default=false]
     - openvswitch.builtin: Run a snap-specific OVS daemon [default=false]
     - shiftfs.enable: Enable shiftfs support [default=auto]
  
  Documentation: https://lxd.readthedocs.io
commands:
  - lxd.benchmark
  - lxd.buginfo
  - lxd.check-kernel
  - lxd.lxc
  - lxd.lxc-to-lxd
  - lxd
  - lxd.migrate
services:
  lxd.activate: oneshot, enabled, inactive
  lxd.daemon:   simple, enabled, active
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable/ubuntu-18.10
refresh-date: today at 14:57 BST
channels:
  latest/stable:    4.3         2020-07-06 (16044) 74MB -
  latest/candidate: 4.3         2020-07-06 (16044) 74MB -
  latest/beta:      ↑                                   
  latest/edge:      git-814c96f 2020-07-06 (16067) 74MB -
  4.3/stable:       4.3         2020-07-06 (16044) 74MB -
  4.3/candidate:    4.3         2020-07-06 (16044) 74MB -
  4.3/beta:         ↑                                   
  4.3/edge:         ↑                                   
  4.2/stable:       4.2         2020-07-01 (15896) 74MB -
  4.2/candidate:    4.2         2020-07-01 (15896) 74MB -
  4.2/beta:         ↑                                   
  4.2/edge:         ↑                                   
  4.1/stable:       4.1         2020-06-05 (15359) 72MB -
  4.1/candidate:    4.1         2020-06-05 (15359) 72MB -
  4.1/beta:         ↑                                   
  4.1/edge:         ↑                                   
  4.0/stable:       4.0.2       2020-07-06 (16048) 74MB -
  4.0/candidate:    4.0.2       2020-07-06 (16048) 74MB -
  4.0/beta:         ↑                                   
  4.0/edge:         git-f3ffd6c 2020-07-06 (16035) 74MB -
  3.0/stable:       3.0.4       2019-10-10 (11348) 55MB -
  3.0/candidate:    3.0.4       2019-10-10 (11348) 55MB -
  3.0/beta:         ↑                                   
  3.0/edge:         git-81b81b9 2019-10-10 (11362) 55MB -
  2.0/stable:       2.0.11      2019-10-10  (8023) 28MB -
  2.0/candidate:    2.0.11      2019-10-10  (8023) 28MB -
  2.0/beta:         ↑                                   
  2.0/edge:         git-160221d 2020-01-13 (12854) 27MB -
installed:          4.3                    (15913) 74MB -

freeekanayaka · July 7, 2020, 2:22pm

Can you paste the content of /var/snap/lxd/common/lxd/logs/lxd.log ?

laney · July 7, 2020, 2:30pm

Yes, this seems to be kind of interesting, thanks!

laney@raleigh> sudo cat /var/snap/lxd/common/lxd/logs/lxd.log                                                                         ~
[sudo] password for laney: 
t=2020-07-07T14:57:30+0100 lvl=info msg="LXD 4.3 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-07-07T14:57:30+0100 lvl=info msg="Kernel uid/gid map:" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - u 0 0 4294967295" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - g 0 0 4294967295" 
t=2020-07-07T14:57:30+0100 lvl=info msg="Configured LXD uid/gid map:" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - u 0 1000000 1000000000" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - g 0 1000000 1000000000" 
t=2020-07-07T14:57:30+0100 lvl=info msg="Kernel features:" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - netnsid-based network retrieval: yes" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - pidfds: no" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - uevent injection: yes" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - seccomp listener: yes" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - seccomp listener continue syscalls: yes" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - unprivileged file capabilities: yes" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - cgroup layout: hybrid" 
t=2020-07-07T14:57:30+0100 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored" 
t=2020-07-07T14:57:30+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored" 
t=2020-07-07T14:57:30+0100 lvl=info msg=" - shiftfs support: disabled" 
t=2020-07-07T14:57:30+0100 lvl=info msg="Initializing local database" 
t=2020-07-07T14:57:31+0100 lvl=info msg="Starting cluster handler:" 
t=2020-07-07T14:57:31+0100 lvl=info msg="Starting /dev/lxd handler:" 
t=2020-07-07T14:57:31+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-07-07T14:57:31+0100 lvl=info msg="REST API daemon:" 
t=2020-07-07T14:57:31+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-07-07T14:57:31+0100 lvl=info msg=" - binding TCP socket" socket=[::]:8443
t=2020-07-07T14:57:31+0100 lvl=info msg="Initializing global database" 
t=2020-07-07T14:57:31+0100 lvl=warn msg="Dqlite: attempt 0: server 0: dial: Failed to connect to HTTP endpoint: dial tcp: address 0: missing port in address" 
t=2020-07-07T14:57:31+0100 lvl=warn msg="Dqlite: attempt 1: server 0: dial: Failed to connect to HTTP endpoint: dial tcp: address 0: missing port in address"

and then the last line repeats up to attempt 11.

I had a look back in those logs and this message was repeating before the restart which broke things, but maybe it’s a problem that would have waited until I restarted lxd? (Or perhaps lxd was broken but the containers were up - not sure I used it directly in this period). Here’s where it started:

t=2020-07-02T21:28:23+0100 lvl=eror msg="Unaccounted raft node(s) not found in 'nodes' table for heartbeat: map[0:{ID:1 Address:0 Role:voter}]"
t=2020-07-02T21:28:32+0100 lvl=info msg="Received 'terminated signal', exiting" 
t=2020-07-02T21:28:32+0100 lvl=info msg="Transfer leadership"
t=2020-07-02T21:28:32+0100 lvl=warn msg="Could not handover member's responsibilities: Failed to transfer leadership: No online voter found" 
t=2020-07-02T21:28:32+0100 lvl=info msg="Starting shutdown sequence"  
t=2020-07-02T21:28:32+0100 lvl=info msg="Closing the database"
t=2020-07-02T21:28:32+0100 lvl=info msg="Stop database gateway" 
t=2020-07-02T21:28:32+0100 lvl=info msg="Stopping REST API handler:"
t=2020-07-02T21:28:32+0100 lvl=info msg=" - closing socket" socket=[::]:8443
t=2020-07-02T21:28:32+0100 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-07-02T21:28:32+0100 lvl=info msg="Stopping /dev/lxd handler:"  
t=2020-07-02T21:28:32+0100 lvl=info msg=" - closing socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
[ the log rotated here ]
t=2020-07-02T21:28:38+0100 lvl=info msg="LXD 4.3 is starting in normal mode" path=/var/snap/lxd/common/lxd
t=2020-07-02T21:28:38+0100 lvl=info msg="Kernel uid/gid map:"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - u 0 0 4294967295"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - g 0 0 4294967295"
t=2020-07-02T21:28:38+0100 lvl=info msg="Configured LXD uid/gid map:"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - u 0 1000000 1000000000"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - g 0 1000000 1000000000"
t=2020-07-02T21:28:38+0100 lvl=info msg="Kernel features:"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - netnsid-based network retrieval: yes"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - pidfds: no"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - uevent injection: yes"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - seccomp listener: yes"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - seccomp listener continue syscalls: yes"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - unprivileged file capabilities: yes"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - cgroup layout: hybrid"
t=2020-07-02T21:28:38+0100 lvl=warn msg=" - Couldn't find the CGroup blkio.weight, I/O weight limits will be ignored"
t=2020-07-02T21:28:38+0100 lvl=warn msg=" - Couldn't find the CGroup memory swap accounting, swap limits will be ignored"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - shiftfs support: disabled"
t=2020-07-02T21:28:38+0100 lvl=info msg="Initializing local database"
t=2020-07-02T21:28:38+0100 lvl=info msg="Starting cluster handler:"
t=2020-07-02T21:28:38+0100 lvl=info msg="Starting /dev/lxd handler:"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - binding devlxd socket" socket=/var/snap/lxd/common/lxd/devlxd/sock
t=2020-07-02T21:28:38+0100 lvl=info msg="REST API daemon:"
t=2020-07-02T21:28:38+0100 lvl=info msg=" - binding Unix socket" inherited=true socket=/var/snap/lxd/common/lxd/unix.socket
t=2020-07-02T21:28:38+0100 lvl=info msg=" - binding TCP socket" socket=[::]:8443
t=2020-07-02T21:28:38+0100 lvl=info msg="Initializing global database"
t=2020-07-02T21:28:38+0100 lvl=warn msg="Dqlite: attempt 0: server 0: dial: Failed to connect to HTTP endpoint: dial tcp: address 0: missing port in address"

Hmm, interesting, I just looked in snap changes lxd to try to find out if this was an update, and all it says is 801 Undoing yesterday at 23:38 BST - Auto-refresh snap "lxd" (no idea what that means).

stgraber · July 7, 2020, 2:31pm

Also systemctl -a | grep lxd and journalctl -u snap.lxd.daemon may be useful.

lxd activateifneeded will kick snap.lxd.daemon if there are containers that need to start or if it’s configured to listen on the network, so that’s where I’d expect things to be interesting.

ps fauxww may also be useful to see if something is stuck.

stgraber · July 7, 2020, 2:32pm

Ah, ok, so you have clustering enabled somehow but with an invalid address of 0.

We’ve had another user hit this recently, I really wonder what’s the source of this.

@freeekanayaka second time isn’t a coincidence anymore so we need to sort this out.

stgraber · July 7, 2020, 2:33pm

@laney Would you mind making a tarball of /var/snap/lxd/common/lxd/database and send that to @freeekanayaka?

laney · July 7, 2020, 2:39pm

I think I tried to play with that (I wanted to be able to transparently start arm64 instances on my pi) but then gave up once I saw some message that all the nodes have to be configured identically, so backed off. I / LXD probably left something stray.

Yes, ok, done! Happy to help supply info and then hopefully one of you can tell me how to manually undo the damage (disable clustering) ;-).

freeekanayaka · July 7, 2020, 2:47pm

It might be a regression indeed.

freeekanayaka · July 7, 2020, 3:19pm

The local.db file has these settings:

sqlite> select * from config;
5|core.https_address|[::]:8443
6|cluster.https_address|[::]:8443

which I believe are wrong and we probably don’t allow them any more these days. Just for the record, is this kind a old deployment or a relatively recent one if you remember?

I’ll try to come up with a fix.

laney · July 7, 2020, 3:24pm

I’ve had lxd on this system for many years.

I don’t remember when I set those particular values. BUT it’s still in my shell history: lxc config set core.https_address '[::]:8443' so probably not that long ago.

FTR on another machine (also 4.3) I could set it to that value still and it seems to work as a remote.

On one of my rpis (set up in the last month) I have

laney@rutland:~$ lxc config show
config:
  core.https_address: '[::]'

is that one ok?

stgraber · July 7, 2020, 5:52pm

Yeah, that address is perfectly valid for core.https_address, it’s just cluster.https_address which must be a specific, static address that should always be present on your system.

laney · July 7, 2020, 7:22pm

ah ok. For fun I tried deleting that configuration item from the database but it didn’t allow lxd to come back - I guess there’s something more required.

freeekanayaka · July 7, 2020, 7:45pm

Yeah, I tried that too and had issues indeed, I’ll need some more time to dig into that, will do it tomorrow.

freeekanayaka · July 8, 2020, 7:33am

Okay, I still don’t quite understand how your LXD ended up in this state, but in order to recover please stop the LXD process and create these two files under /var/snap/lxd/common/lxd/database:

patch.local.sql:

DELETE FROM config WHERE key='cluster.https_address';
DELETE FROM raft_nodes;

patch.global.sql:

UPDATE nodes SET name='none', address='0.0.0.0' WHERE id=1;

Then start again LXD.

freeekanayaka · July 8, 2020, 7:37am

My best guess is that for some reason at some point you have enabled clustering (either via lxd init for via lxd cluster enable) and before or after then the core.https_address config was set to '[::]:8443', and perhaps that was done at a time where we still had a bug around setting '[::]:8443' as core address when clustered.

laney · July 8, 2020, 8:17am

Great, thanks @freeekanayaka, this did indeed recover it!

If you initialise (or did) the cluster.https_address config item from core.https_address that sounds likely. I guess there’s potential for a defensive programming fixup to be applied here, rather than hanging forever.

freeekanayaka · July 8, 2020, 8:20am

Yeah, however I’m not entirely sure how that should look like, since there are subtle details that might vary from instance to instance and that might require small changes to the fix.

freeekanayaka · July 8, 2020, 8:21am

Or perhaps you mean bailing out if we detect [::] being used as cluster address.

laney · July 8, 2020, 9:02am

Yeah, since Stéphane said that one was illegal. Or since I was getitng an error from dqlite apparently, maybe that kind of thing could be detected and disable clustering in that case.