LXD 3.22: can't list or create new containers

Containers can no longer be created or listed. Execing into a container and starting existing containers still work, and listing the cluster works. Listing containers sometimes returns the following, or just hangs indefinitely:

2020-03-16 04:45:40.051840 7f0ed123c700  0 -- 10.224.1.24:0/4191326730 >> 10.224.0.83:6789/0 pipe(0x7f0ec40008c0 sd=3 :46724 s=1 pgs=0 cs=0 l=1 c=0x7f0ec400d080).connect protocol feature mismatch, my 27ffffffefdfbfff < peer 27fddff8efacbfff missing 200000

2020-03-16 04:45:41.623505 7f0eee7a1100  0 monclient(hunting): authenticate timed out after 300

2020-03-16 04:45:41.623546 7f0eee7a1100  0 librados: client.lxd authentication error (110) Connection timed out
rbd: couldn't connect to the cluster!
fp=88fc31ecbfb1696960f389cf2e3f275cca020a6ad7f46ee7ca87af88666a2948

+-----------------+--------------------------+----------+--------+-------------------+--------------+
|      NAME       |           URL            | DATABASE | STATE  |      MESSAGE      | ARCHITECTURE |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n1 | https://10.224.1.11:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n2 | https://10.224.1.12:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n3 | https://10.224.1.13:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef101-n4 | https://10.224.1.14:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n1 | https://10.224.1.21:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n2 | https://10.224.1.22:8443 | NO       | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n3 | https://10.224.1.23:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+
| aa1-cptef102-n4 | https://10.224.1.24:8443 | YES      | ONLINE | fully operational | x86_64       |
+-----------------+--------------------------+----------+--------+-------------------+--------------+

Rebooting the entire cluster has not fixed the issue. Can provide further information as needed.

Those are ceph errors, not LXD errors.

It sounds like you may have upgraded your ceph cluster to a version that isn’t compatible with your kernel’s rbd client or with the version of the ceph client shipped in the LXD snap.

That or your ceph cluster is actually timing out.

The only change to Ceph is that we set:

ceph osd set-require-min-compat-client luminous

which is older than what should be packaged in the snap, as far as I know. We didn't upgrade anything Ceph-related beyond NIC firmware. The missing feature it's complaining about is 200000, which seems to be CEPH_FEATURE_MON_GV.
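For what it's worth, the missing bits can be read straight off the two hex masks in the log line: any feature bit the peer advertises that our client lacks shows up in peer AND NOT mine. A quick sanity check with shell arithmetic (values copied from the log above):

```shell
#!/bin/bash
# Decode the "protocol feature mismatch" line from the log:
# any feature bit set on the peer but not on our client is "missing".
mine=0x27ffffffefdfbfff
peer=0x27fddff8efacbfff
missing=$(( peer & ~mine ))
printf 'missing feature bits: %x\n' "$missing"   # prints 200000
```

0x200000 is bit 21, which matches the `missing 200000` in the log.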

The snap is built on an Ubuntu 16.04 base, which means it's using Jewel; your requiring Luminous is likely causing the problem.

We did the work to port the snap to an 18.04 base (with Ceph Mimic) but this is currently stuck waiting for other server snaps to make the jump…

Ceph is not letting us change the required version - is there any way to use those snaps, or do we need to build lxd ourselves?

The edge snap is built on top of core18, but that's the current upstream snapshot of LXD, so it changes several times a day and may not be working at times. If you have to use it, you'll certainly want some way to prevent automated updates.
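If it helps while on a fast-moving channel: snapd has a system-wide refresh.hold option (assumption: your snapd is recent enough to support it; note it holds all snap refreshes, not just LXD's). The date arithmetic below builds the RFC3339 timestamp it expects, ~60 days out:

```shell
#!/bin/bash
# Build an RFC3339 timestamp 60 days from now (assumes GNU date),
# then hand it to snapd's system-wide refresh.hold setting.
hold_until=$(date --date='60 days' +%Y-%m-%dT%H:%M:%S%:z)
echo "holding refreshes until $hold_until"
# then, on each cluster node:
#   sudo snap set system refresh.hold="$hold_until"
```

This only delays refreshes; snapd caps the hold, so it needs re-running periodically.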

One thing we may be able to do on our side is have the beta channel populated with the same build of LXD as is in candidate but applying the core18 patch to it.

You would then be able to run using the beta channel and get effectively a preview of how things will be once switched over to core18.

@mar I’ve put automation in place now, the beta channel should start populating soon.

Done: switched to the beta snap channel (via snap refresh lxd --channel=beta).
lxc list now returns a result much faster than before (seconds vs. hours).
However, upon container creation we get: Error: Failed instance creation: Create instance: Create instance: Invalid devices: Failed detecting root disk device: No root device could be found

That would suggest no root device defined in the default profile.

Can you show lxc profile show default?

config: {}
description: Default LXD profile
devices:
  root:
    path: /
    pool: default
    type: disk
name: default
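For context, the "No root device could be found" error means no device of type disk with path / turned up in the instance's expanded device list. A rough stand-alone check against the profile YAML above (a sketch, not LXD's actual detection code; sample input copied from the profile):

```shell
#!/bin/bash
# Look for a disk device with path "/" in profile YAML.
# Simplistic: it only greps for the path line, while LXD inspects the
# full expanded device list of the instance.
profile_yaml='
devices:
  root:
    path: /
    pool: default
    type: disk
'
if echo "$profile_yaml" | grep -Eq '^[[:space:]]*path: /$'; then
  echo "root disk device present"
else
  echo "no root device could be found"
fi
```

Here the default profile clearly has one, so the profile itself looks fine.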

lxc storage list

And what’s the command you’re running to create the container?

+---------+-------------+--------+---------+---------+
|  NAME   | DESCRIPTION | DRIVER |  STATE  | USED BY |
+---------+-------------+--------+---------+---------+
| default |             | ceph   | CREATED | 35      |
+---------+-------------+--------+---------+---------+
ubuntu@aa1-cptef101-n2:~$ lxc launch ubuntu:18.04 ubuntu-test
Creating ubuntu-test
The local image 'ubuntu' couldn't be found, trying 'ubuntu:' instead.
Error: Failed instance creation: Create instance: Create instance: Invalid devices: Failed detecting root disk device: No root device could be found

There’s something very odd going on there, the message:

The local image 'ubuntu' couldn't be found, trying 'ubuntu:' instead.

Would only make sense if you instead use lxc launch ubuntu ubuntu-test.

Can you show lxc image list too?

+-------+--------------+--------+---------------------------------------------+--------------+-----------+------------+-------------------------------+
| ALIAS | FINGERPRINT  | PUBLIC |                 DESCRIPTION                 | ARCHITECTURE |   TYPE    |    SIZE    |          UPLOAD DATE          |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+------------+-------------------------------+
|       | 2fb7e1e169b7 | no     | Ubuntu 18.04 LTS server (20200129.1)        | x86_64       | CONTAINER | 13147.53MB | Feb 15, 2020 at 2:29am (UTC)  |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+------------+-------------------------------+
|       | 98e43d99d83e | no     | ubuntu 18.04 LTS amd64 (release) (20200317) | x86_64       | CONTAINER | 178.92MB   | Mar 18, 2020 at 2:12am (UTC)  |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+------------+-------------------------------+
|       | ab51411547ec | no     | Debian stretch amd64 (20200318_05:24)       | x86_64       | CONTAINER | 65.38MB    | Mar 18, 2020 at 11:47am (UTC) |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+------------+-------------------------------+
|       | c6f89d6b65f3 | no     | Debian stretch amd64 (20200318_05:24)       | x86_64       | CONTAINER | 79.46MB    | Mar 18, 2020 at 11:47am (UTC) |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+------------+-------------------------------+

Can you show lxc image info 98e43d99d83e?

Fingerprint: 98e43d99d83ef1e4d0b28a31fc98e01dd98a2dbace3870e51c5cb03ce908144b
Size: 178.92MB
Architecture: x86_64
Type: container
Public: no
Timestamps:
    Created: 2020/03/17 00:00 UTC
    Uploaded: 2020/03/18 02:12 UTC
    Expires: 2023/04/26 00:00 UTC
    Last used: 2020/03/16 17:23 UTC
Properties:
    version: 18.04
    architecture: amd64
    description: ubuntu 18.04 LTS amd64 (release) (20200317)
    label: release
    os: ubuntu
    release: bionic
    serial: 20200317
    type: squashfs
Aliases:
Cached: yes
Auto update: enabled
Source:
    Server: https://cloud-images.ubuntu.com/releases
    Protocol: simplestreams
    Alias: 18.04
Profiles: []

OK, so no idea why that is, but it explains the problem at least.
Your cached image has no profiles associated with it, so when you create a container, no profiles are applied, causing the error.

Unless you’ve manually run lxc image edit on that image, this isn’t supposed to happen…
An easy way to recover is to run lxc image delete 98e43d99d83ef1e4d0b28a31fc98e01dd98a2dbace3870e51c5cb03ce908144b and then do your lxc launch again.

Coworker deleted all images, and yet:

+-------+--------------+--------+--------------------------------------+--------------+-----------+------------+------------------------------+
| ALIAS | FINGERPRINT  | PUBLIC |             DESCRIPTION              | ARCHITECTURE |   TYPE    |    SIZE    |         UPLOAD DATE          |
+-------+--------------+--------+--------------------------------------+--------------+-----------+------------+------------------------------+
|       | 2fb7e1e169b7 | no     | Ubuntu 18.04 LTS server (20200129.1) | x86_64       | CONTAINER | 13147.53MB | Feb 15, 2020 at 2:29am (UTC) |
+-------+--------------+--------+--------------------------------------+--------------+-----------+------------+------------------------------+

ubuntu@aa1-cptef101-n2:~$ lxc image delete 2fb7e1e169b7
Error: failed to notify peer 10.224.1.13:8443: Failed to delete image from peer node: Failed to run: rbd --id lxd --cluster ceph --pool rbd-lxc-aa0.a1f children --image image_2fb7e1e169b77d7eec26e22839215d23a551ab749ab97b77cd710314f4e56d51_ext4 --snap readonly: rbd: error opening image image_2fb7e1e169b77d7eec26e22839215d23a551ab749ab97b77cd710314f4e56d51_ext4: (2) No such file or directory
ubuntu@aa1-cptef101-n2:~$ lxc launch ubuntu:18.04 ubuntu
Creating ubuntu
Error: Failed instance creation: Create instance from image: Failed to run: rbd --id lxd --cluster ceph --image-feature layering clone rbd-lxc-aa0.a1f/image_98e43d99d83ef1e4d0b28a31fc98e01dd98a2dbace3870e51c5cb03ce908144b_ext4@readonly rbd-lxc-aa0.a1f/container_ubuntu: 2020-03-19 10:46:59.406177 7f3dfdffb700 -1 librbd::image::OpenRequest: failed to set image snapshot: (2) No such file or directory
rbd: clone error: (2) No such file or directory
2020-03-19 10:46:59.406520 7f3e1e13e0c0 -1 librbd: error opening parent image: (2) No such file or directory

However, I was able to provision 19.10:
ubuntu@aa1-cptef101-n2:~$ lxc launch ubuntu:19.10 ubuntu
Creating ubuntu

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach

Starting ubuntu

This issue seems extremely familiar

Yeah, it does, doesn’t it…

Can you show rbd ls --pool rbd-lxc-aa0.a1f to see if that image is present under a different name or something?

container_ubuntu1804
container_virt-machine-test-ceph
container_virt-machine-test-ceph2
image_2fb7e1e169b77d7eec26e22839215d23a551ab749ab97b77cd710314f4e56d51
image_301b4443fb10544f7f1c367a55160e5bf84d76d5e9a007162cbd1123982ff467_ext4
image_4c508847e12fe24adf4bf05d5610d8c5b1d9216a47b3c8b5a416a32e8dca73f9
image_4c508847e12fe24adf4bf05d5610d8c5b1d9216a47b3c8b5a416a32e8dca73f9.block
image_5cba5a1288273a5c49056c8966595516a2d874ec6fa9bf7ca796399d0daeee9d
image_5cba5a1288273a5c49056c8966595516a2d874ec6fa9bf7ca796399d0daeee9d.block
image_77866cd160e953f29064754b487e9a6a06e857d1a3882cc7996f6e5c659e3d37
image_796acff102fac15f08c5cecf9e767c55720e6c99fc8e1fcf32fb9c01ab046516
image_7d64315079bc4af8abeb50dfbe1e600a89852abc893b9be752e5b7c9bf77ffe7
image_7d64315079bc4af8abeb50dfbe1e600a89852abc893b9be752e5b7c9bf77ffe7.bak
image_7d64315079bc4af8abeb50dfbe1e600a89852abc893b9be752e5b7c9bf77ffe7.bak2
image_80cf568d09bb9a4b46c15f8721f6cb454966ccfb2bd6083da5009c6e49dd978d_ext4
image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601
image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block
image_8c4e87e53c024e0449003350f0b0626b124b68060b73c0a7ad9547670e00d4b3
image_8c6b98199f45cf67548efa795e4f40fe20d1e16438253fed411bc83d905b19c3
image_8c6b98199f45cf67548efa795e4f40fe20d1e16438253fed411bc83d905b19c3.block
image_8d1e0577b1d1ad9f37518931f809802fed96ff060eee97401b91974240bd41bd
image_98e43d99d83ef1e4d0b28a31fc98e01dd98a2dbace3870e51c5cb03ce908144b_ext4
image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f.bak
image_d30a815a9ba01dee728c0a853489c7275eea89836639e6f499822b82726122f0_ext4
lxd_rbd-lxc-aa0.a1f
virtual-machine_debian9
virtual-machine_debian9.block
virtual-machine_teaching-lobster
virtual-machine_teaching-lobster.block
zombie_image_31adb27ee6c93d3f956604ff5f40fcc84ddba954b23367ff382e371e9b609320_ext4
zombie_image_856a4d0fe97326ca4f6df89b734cd55d2361e14f10cac6c072272387b73d6525_ext4
zombie_image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4
zombie_image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4.block
zombie_image_8c4e87e53c024e0449003350f0b0626b124b68060b73c0a7ad9547670e00d4b3_ext4
zombie_image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f_ext4