LXD 3.21: more Ceph issues

(the cluster from "Issues with Ceph cluster and lxd" has been upgraded to 3.21)

Currently experiencing a new issue: we are now unable to create a VM, which worked as of last week.

The command is:
lxc launch ubuntu some-vm-name --vm

Output:
Creating the instance
The local image 'ubuntu' couldn't be found, trying 'ubuntu:' instead.
Error: Failed instance creation: Create instance from image: Failed to run:
rbd --id lxd --cluster ceph --image-feature layering clone
rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block@readonly
rbd-lxc-aa0.a1f/virtual-machine_perfect-goldfish.block:
2020-02-20 20:52:14.788395 7f79f77fe700 -1 librbd::image::OpenRequest: failed to set image snapshot: (2) No such file or directory
2020-02-20 20:52:14.788599 7f7a224f4100 -1 librbd: error opening parent image: (2) No such file or directory
rbd: clone error: (2) No such file or directory

Poking ceph directly (rbd du) shows it exists:
image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block            23 GiB  558 MiB
image_8c4e87e53c024e0449003350f0b0626b124b68060b73c0a7ad9547670e00d4b3@readonly         23 GiB  1.2 GiB
image_8c4e87e53c024e0449003350f0b0626b124b68060b73c0a7ad9547670e00d4b3                  23 GiB      0 B
image_8c6b98199f45cf67548efa795e4f40fe20d1e16438253fed411bc83d905b19c3                  48 MiB   28 MiB
image_8c6b98199f45cf67548efa795e4f40fe20d1e16438253fed411bc83d905b19c3.block

SQL desync again, maybe?

Nope, that’s a different one and we fixed that one already.

snap refresh lxd should get you past this one.

No change after refresh - it said it had no updates available.

Hmm, what does snap info lxd show you?
And can you show lxd sql global "SELECT * FROM storage_pools_config;"

3.21 has the new Ceph implementation, so I suspect it's a different issue.
I first suspected a bug related to Ceph credentials, but looking at your output again, that seems unlikely to be the case.

snap-id: J60k4JY0HppjwOjW8dZdYc8obXKxujRu
tracking: latest/stable
refresh-date: today at 03:38 UTC

(let me know if you need more of the output)

+----+-----------------+---------+-------------------------+-----------------+
| id | storage_pool_id | node_id |           key           |      value      |
+----+-----------------+---------+-------------------------+-----------------+
| 6  | 4               | 2       | source                  | rbd-lxc-aa0.a1f |
| 7  | 4               | 3       | source                  | rbd-lxc-aa0.a1f |
| 8  | 4               | 4       | source                  | rbd-lxc-aa0.a1f |
| 9  | 4               | 5       | source                  | rbd-lxc-aa0.a1f |
| 10 | 4               | 6       | source                  | rbd-lxc-aa0.a1f |
| 11 | 4               | 7       | source                  | rbd-lxc-aa0.a1f |
| 12 | 4               | 8       | source                  | rbd-lxc-aa0.a1f |
| 43 | 4               | 9       | source                  | rbd-lxc-aa0.a1f |
| 44 | 4               | 9       | volatile.initial_source | rbd-lxc-aa0.a1f |
| 45 | 4               | <nil>   | ceph.user.name          | lxd             |
| 46 | 4               | <nil>   | ceph.cluster_name       | ceph            |
| 47 | 4               | <nil>   | volatile.pool.pristine  | false           |
| 48 | 4               | <nil>   | volume.size             | 25GB            |
| 49 | 4               | <nil>   | ceph.osd.force_reuse    | true            |
| 50 | 4               | <nil>   | ceph.osd.pg_num         | 32              |
| 51 | 4               | <nil>   | ceph.osd.pool_name      | rbd-lxc-aa0.a1f |
+----+-----------------+---------+-------------------------+-----------------+

Do you see more in ceph?

In your output above, I only see image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block, which is missing its snapshot (@readonly) and also appears to be missing the non-block part of the image.

For a VM image, I would have expected to see:

  • image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601
  • image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601@readonly
  • image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block
  • image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block@readonly
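
If you want to double-check what's actually there, something like this (re-using the pool and credentials from the error above) should list the matching images and any snapshots on the .block one:

rbd --id lxd --cluster ceph ls rbd-lxc-aa0.a1f | grep 8bac6546
rbd --id lxd --cluster ceph snap ls rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block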

These are all we’re seeing in ceph:

image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601         48 MiB   28 MiB
image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block   23 GiB  558 MiB

Ok, so you have the images but not their associated readonly snapshots.
Manually creating them with rbd will likely fix the situation.
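
Something along these lines should recreate them (names taken straight from your rbd du output, so double-check them first; I believe LXD protects those snapshots before cloning from them, so protecting them here as well should be safe):

rbd --id lxd --cluster ceph snap create rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601@readonly
rbd --id lxd --cluster ceph snap protect rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601@readonly
rbd --id lxd --cluster ceph snap create rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block@readonly
rbd --id lxd --cluster ceph snap protect rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block@readonly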

I’m provisioning a test system to stress-test the image management part of the Ceph driver, as there’s clearly something odd going on here.

I’ve found a number of bugs in the Ceph handling code, including what may be your issue (I hope).

Alright. I’ll need to think of a way to test a newer build; we have semi-production workloads running on the cluster, so we generally have to wait for a new release to hit the snap.

Ah - looks like the snap refreshed and pulled in the fixes. Seems happier now.

Excellent! Anything still behaving weird?

Seems mostly fixed aside from this desync:

root@aa1-cptef101-n2:/home/ubuntu# lxc launch ubuntu ubuntu-vm --vm
Creating ubuntu-vm
The local image 'ubuntu' couldn't be found, trying 'ubuntu:' instead.
Error: Failed instance creation: Locate image 8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601
in the cluster: image not available on any online node

The image is in lxc image ls, however:

Error: failed to notify peer 10.224.1.13:8443: Failed to delete image from peer node: Failed to run:
rbd --id lxd --cluster ceph --pool rbd-lxc-aa0.a1f children --image
image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4.block --snap readonly:
rbd: error opening image image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4.block: (2) No such file or directory

Can you show rbd du --pool rbd-lxc-aa0.a1f? It may be a pre-fix image and so be missing the _ext4 part, or have it all the way at the end, so it may just need a quick rename to line things up.

Only two files matching that hash exist:

image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601         48 MiB   28 MiB
image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block   23 GiB  558 MiB

Ok, so to make things consistent with the pattern expected by my fix, you’d want to rename:

  • image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601 to image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4
  • image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block to image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4.block

This should then fix the no such file or directory error.
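
For reference, a rough sketch of those two renames with rbd rename, using the same pool and credentials as above (each rename is a single long command; make sure nothing is operating on these images while you run them):

rbd --id lxd --cluster ceph rename rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601 rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4
rbd --id lxd --cluster ceph rename rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601.block rbd-lxc-aa0.a1f/image_8bac6546bbc5cfbe5c490f2c991cff8cff1428b57fb9a74d33a64cb6dff66601_ext4.block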