Upgraded LXD to the snap version, mounted the disks, and ended up using lxd recover; now two containers won't start

Hi!

Long story short: I updated to the snap version of LXD, mounted the ZFS pool holding the containers, had to use lxd recover, managed to negotiate all that, and imported all the containers. However, two of them now refuse to start.
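For context, the recovery went roughly like this (a sketch from memory; it assumes the ZFS pool and the LXD storage pool both go by the name "zomg", which is the pool name you'll see in the configs below):

sudo zpool import zomg    # re-import the existing ZFS pool on the host
sudo lxd recover          # interactive; told it about the existing "zomg" pool so it could scan for instances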

I can't figure out why the rest of them run fine but these two do not.

Here’s my snap info lxd:

jack@kdsapp0:~$ snap info lxd
name:      lxd
summary:   LXD - container and VM manager
publisher: Canonical✓
store-url: https://snapcraft.io/lxd
contact:   https://github.com/lxc/lxd/issues
license:   unset
description: |
  LXD is a system container and virtual machine manager.
  
  It offers a simple CLI and REST API to manage local or remote instances,
  uses an image based workflow and support for a variety of advanced features.
  
  Images are available for all Ubuntu releases and architectures as well
  as for a wide number of other Linux distributions. Existing
  integrations with many deployment and operation tools, makes it work
  just like a public cloud, except everything is under your control.
  
  LXD containers are lightweight, secure by default and a great
  alternative to virtual machines when running Linux on Linux.
  
  LXD virtual machines are modern and secure, using UEFI and secure-boot
  by default and a great choice when a different kernel or operating
  system is needed.
  
  With clustering, up to 50 LXD servers can be easily joined and managed
  together with the same tools and APIs and without needing any external
  dependencies.
  
  
  Supported configuration options for the snap (snap set lxd [<key>=<value>...]):
  
    - ceph.builtin: Use snap-specific Ceph configuration [default=false]
    - ceph.external: Use the system's ceph tools (ignores ceph.builtin) [default=false]
    - criu.enable: Enable experimental live-migration support [default=false]
    - daemon.debug: Increase logging to debug level [default=false]
    - daemon.group: Set group of users that have full control over LXD [default=lxd]
    - daemon.user.group: Set group of users that have restricted LXD access [default=lxd]
    - daemon.preseed: Pass a YAML configuration to `lxd init` on initial start
    - daemon.syslog: Send LXD log events to syslog [default=false]
    - daemon.verbose: Increase logging to verbose level [default=false]
    - lvm.external: Use the system's LVM tools [default=false]
    - lxcfs.pidfd: Start per-container process tracking [default=false]
    - lxcfs.loadavg: Start tracking per-container load average [default=false]
    - lxcfs.cfs: Consider CPU shares for CPU usage [default=false]
    - openvswitch.builtin: Run a snap-specific OVS daemon [default=false]
    - openvswitch.external: Use the system's OVS tools (ignores openvswitch.builtin) [default=false]
    - ovn.builtin: Use snap-specific OVN configuration [default=false]
    - shiftfs.enable: Enable shiftfs support [default=auto]
  
  For system-wide configuration of the CLI, place your configuration in
  /var/snap/lxd/common/global-conf/ (config.yml and servercerts)
commands:
  - lxd.benchmark
  - lxd.buginfo
  - lxd.check-kernel
  - lxd.lxc
  - lxd.lxc-to-lxd
  - lxd
  - lxd.migrate
services:
  lxd.activate:    oneshot, enabled, inactive
  lxd.daemon:      simple, enabled, active
  lxd.user-daemon: simple, enabled, inactive
snap-id:      J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking:     latest/stable
refresh-date: yesterday at 12:30 EDT
channels:
  latest/stable:    5.6-794016a   2022-09-27 (23680) 142MB -
  latest/candidate: 5.6-794016a   2022-09-23 (23680) 142MB -
  latest/beta:      ↑                                      
  latest/edge:      git-c2aa857   2022-10-08 (23760) 142MB -
  5.6/stable:       5.6-794016a   2022-09-27 (23680) 142MB -
  5.6/candidate:    5.6-794016a   2022-09-23 (23680) 142MB -
  5.6/beta:         ↑                                      
  5.6/edge:         ↑                                      
  5.5/stable:       5.5-37534be   2022-08-27 (23537) 113MB -
  5.5/candidate:    5.5-37534be   2022-08-19 (23537) 113MB -
  5.5/beta:         ↑                                      
  5.5/edge:         ↑                                      
  5.4/stable:       5.4-1ff8d34   2022-08-13 (23367) 108MB -
  5.4/candidate:    5.4-3bf11b7   2022-08-12 (23453) 108MB -
  5.4/beta:         ↑                                      
  5.4/edge:         ↑                                      
  5.3/stable:       5.3-91e042b   2022-07-06 (23270) 107MB -
  5.3/candidate:    5.3-b403e7f   2022-07-05 (23282) 107MB -
  5.3/beta:         ↑                                      
  5.3/edge:         ↑                                      
  5.0/stable:       5.0.1-9dcf35b 2022-08-24 (23541) 107MB -
  5.0/candidate:    5.0.1-9dcf35b 2022-08-19 (23541) 107MB -
  5.0/beta:         ↑                                      
  5.0/edge:         git-13e1e53   2022-08-21 (23564) 113MB -
  4.0/stable:       4.0.9-8e2046b 2022-03-26 (22753)  71MB -
  4.0/candidate:    4.0.9-dea944b 2022-09-27 (23694)  72MB -
  4.0/beta:         ↑                                      
  4.0/edge:         git-407205d   2022-03-31 (22797)  73MB -
  3.0/stable:       3.0.4         2019-10-10 (11348)  55MB -
  3.0/candidate:    3.0.4         2019-10-10 (11348)  55MB -
  3.0/beta:         ↑                                      
  3.0/edge:         git-81b81b9   2019-10-10 (11362)  55MB -
installed:          5.6-794016a              (23680) 142MB -

I suspect some kind of network problem, though I really have no idea. What makes me think so is that the configs for the broken containers don't show a volatile.eth0.host_name entry. Here's the config show of a working container:

jack@kdsapp0:~$ lxc config show apps -e
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20181206)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20181206"
  image.version: "18.04"
  volatile.base_image: 84a71299044bc3c3563396bef153c0da83d494f6bf3d38fecc55d776b1e19bf9
  volatile.cloud-init.instance-id: 91f5603b-7d99-458e-a80a-336062d512a2
  volatile.eth0.host_name: veth4aa4cc02
  volatile.eth0.hwaddr: 00:16:3e:fd:e9:cf
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 3dbf843c-bbd6-4b0b-80b3-226e523557c8
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: zomg
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

and here's the config of one of the broken containers:

architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 18.04 LTS amd64 (release) (20181206)
  image.label: release
  image.os: ubuntu
  image.release: bionic
  image.serial: "20181206"
  image.version: "18.04"
  volatile.base_image: 84a71299044bc3c3563396bef153c0da83d494f6bf3d38fecc55d776b1e19bf9
  volatile.cloud-init.instance-id: bb566fca-6a45-41d5-bb71-df1e0c69e993
  volatile.eth0.hwaddr: 00:16:3e:df:ea:84
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: f322beb3-4e0c-4dad-aca1-f916b97b7d97
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: zomg
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

As mentioned above, the broken containers don't have a volatile.eth0.host_name entry.

At any rate, any help would be appreciated! I’ve been working on this all weekend and I’m stumped.

Thank you!

Jack

In the interests of completeness, here’s the config show of the other non-starting container:

jack@kdsapp0:~$ lxc config show services4 -e
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 20.04 LTS amd64 (release) (20210105)
  image.label: release
  image.os: ubuntu
  image.release: focal
  image.serial: "20210105"
  image.type: squashfs
  image.version: "20.04"
  volatile.base_image: 21da67063730fc446ca7fe090a7cf90ad9397ff4001f69907d7db690a30897c3
  volatile.cloud-init.instance-id: a3de452c-f59d-4ebf-adef-972f2ce3711e
  volatile.eth0.hwaddr: 00:16:3e:73:8c:47
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":100000,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":100000,"Nsid":0,"Maprange":65536}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 9c2a4266-e873-49d2-9eae-f8b5fde61bc4
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: zomg
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

What is the start error you get?

I don't get one. Instead, the commands hang; if I wait long enough I get a timeout message.

This also happens when I try to start or stop other containers. Basically, the machine boots with a selection of containers running, and that's it: I cannot stop them or start the ones that aren't running. I also tried lxc launch; it created a container but was unable to start it, and I haven't been able to get it to run even once.
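
If it would help the diagnosis, I can run something like the following while a start hangs and post the output (services4 is just one of the affected containers from above):

lxc monitor --type=logging --pretty        # in a second terminal while the start command hangs
lxc info services4 --show-log              # per-instance LXC log after the failed start
sudo journalctl -u snap.lxd.daemon -n 100  # recent log lines from the snap's LXD daemon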

What is the output of uname -a and what OS version are you using on the host?

jack@kdsapp0:~$ uname -a
Linux kdsapp0 5.4.0-128-generic #144-Ubuntu SMP Tue Sep 20 11:00:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
jack@kdsapp0:~$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"

On reflection, that kernel version ain’t right. <sigh>

5.4 is the version I’d expect for Focal 20.04.

What storage driver are you using?

The zfs driver.
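
In case it's relevant, the pool itself can be inspected with something like:

lxc storage list         # lists pools with their driver and state
lxc storage show zomg    # full configuration of the "zomg" pool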

Can you try running:

sudo snap unset lxd shiftfs.enable
sudo systemctl reload snap.lxd.daemon
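
You can check what the key is currently set to, before and after, with something like:

sudo snap get lxd shiftfs.enable    # prints the current value (errors if the key isn't set)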

Hi Thomas! I'm going to try that once we finish getting the rest of the data off the pool and see whether I can revive these containers. We had to do a full recovery/rebuild cycle to get services back online.

Thank you for your help so far; once I get a chance to bring that pool up, I'll let you know what happens.