Issues with Ceph cluster and LXD

[apologies if my markdown is wrong]

I am using LXD 3.20 from the snap on Ubuntu 19.10.

I had Ceph working with a standalone node; however, attempting to replicate the config to a cluster isn’t working (even when specifying the same node that worked standalone with --target):

(I have copied the ceph keys and config to all nodes)

lxc storage create default ceph --target aa1-cptef101-n1 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef101-n2 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef101-n3 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef101-n4 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef102-n1 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef102-n2 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef102-n3 source=rbd-lxc-aa0.a1f
lxc storage create default ceph --target aa1-cptef102-n4 source=rbd-lxc-aa0.a1f
lxc storage create default ceph ceph.user.name=lxd ceph.cluster_name=ceph
lxc storage create default ceph ceph.user.name=lxd ceph.cluster_name=ceph ceph.osd.force_reuse=true
lxc profile device add default root disk path=/ pool=default
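
A quick sanity check for the copied keys and config, assuming the client keyring was installed as /etc/ceph/ceph.client.lxd.keyring on every node, is to confirm each node can reach the cluster as the lxd user:

# run on each cluster node; --id lxd selects the client.lxd keyring
ceph --cluster ceph --id lxd status
ceph --cluster ceph --id lxd osd pool ls | grep rbd-lxc-aa0.a1f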

And the storage config as it appears:

config:
  ceph.cluster_name: ceph
  ceph.osd.force_reuse: "true"
  ceph.osd.pg_num: "32"
  ceph.osd.pool_name: rbd-lxc-aa0.a1f
  ceph.user.name: lxd
  volatile.pool.pristine: "false"
description: ""
name: default
driver: ceph
used_by:
- /1.0/containers/lasting-bengal
- /1.0/profiles/default
status: Created
locations:
- aa1-cptef101-n2
- aa1-cptef101-n3
- aa1-cptef101-n4
- aa1-cptef102-n1
- aa1-cptef102-n2
- aa1-cptef102-n3
- aa1-cptef102-n4
- aa1-cptef101-n1

Creating a container will either hang forever or return Error: Failed instance creation: Create instance from image: No such object

Is there a way to get debug info on what it’s doing to Ceph? lxd -d says it’s already running, despite an lxd shutdown.

Running lxc monitor --type=logging --pretty in a separate shell on the same system you’re trying to create the container on should help.
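
If that turns out to be too quiet, the snap-packaged daemon can also be switched into debug logging (a sketch, assuming the snap install mentioned above):

# enable debug logging on the snap daemon and reload it
snap set lxd daemon.debug=true
systemctl reload snap.lxd.daemon
# then follow the daemon log
tail -f /var/snap/lxd/common/lxd/logs/lxd.log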

DBUG[02-12|22:57:47] Handling                                 ip=10.224.1.11:49072 method=GET url="/1.0/operations/b0f9931b-3c49-4c59-a98a-eceb703a184c?target=aa1-cptef101-n2" user=
DBUG[02-12|22:57:47] Image already exists in the db           image=9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f
INFO[02-12|22:57:47] Creating container                       ephemeral=false name=fond-lioness project=default
INFO[02-12|22:57:47] Created container                        ephemeral=false name=fond-lioness project=default
DBUG[02-12|22:57:47] Creating RBD storage volume for container "fond-lioness" on storage pool "default"
INFO[02-12|22:57:47] Deleting container                       name=fond-lioness project=default used="1970-01-01 00:00:00 +0000 UTC" created="2020-02-12 22:57:47.593203723 +0000 UTC" ephemeral=false
DBUG[02-12|22:57:47] Failure for task operation: b0f9931b-3c49-4c59-a98a-eceb703a184c: Create instance from image: No such object
INFO[02-12|22:57:47] Deleted container                        project=default used="1970-01-01 00:00:00 +0000 UTC" created="2020-02-12 22:57:47.593203723 +0000 UTC" ephemeral=false name=fond-lioness

Hmm. Not as much debug info as I would’ve wanted.

Yeah, this is a bit light on debug information :slight_smile:

Can you show:

  • lxd sql global "SELECT * FROM nodes;"
  • lxd sql global "SELECT * FROM images;"
  • lxd sql global "SELECT * FROM images_nodes;"
  • lxd sql global "SELECT * FROM storage_volumes WHERE type=1;"

Absolutely! Happy to provide as much information as possible.

+----+-----------------+-------------+------------------+--------+----------------+-------------------------------------+---------+------+
| id |      name       | description |     address      | schema | api_extensions |              heartbeat              | pending | arch |
+----+-----------------+-------------+------------------+--------+----------------+-------------------------------------+---------+------+
| 2  | aa1-cptef101-n2 |             | 10.224.1.12:8443 | 24     | 165            | 2020-02-12T15:55:40.063634275-08:00 | 0       | 2    |
| 3  | aa1-cptef101-n3 |             | 10.224.1.13:8443 | 24     | 165            | 2020-02-12T15:55:40.063043196-08:00 | 0       | 2    |
| 4  | aa1-cptef101-n4 |             | 10.224.1.14:8443 | 24     | 165            | 2020-02-12T15:55:40.063180046-08:00 | 0       | 2    |
| 5  | aa1-cptef102-n1 |             | 10.224.1.21:8443 | 24     | 165            | 2020-02-12T15:55:40.063274336-08:00 | 0       | 2    |
| 6  | aa1-cptef102-n2 |             | 10.224.1.22:8443 | 24     | 165            | 2020-02-12T15:55:40.063348426-08:00 | 0       | 2    |
| 7  | aa1-cptef102-n3 |             | 10.224.1.23:8443 | 24     | 165            | 2020-02-12T15:55:40.063420615-08:00 | 0       | 2    |
| 8  | aa1-cptef102-n4 |             | 10.224.1.24:8443 | 24     | 165            | 2020-02-12T15:55:40.063491265-08:00 | 0       | 2    |
| 9  | aa1-cptef101-n1 |             | 10.224.1.11:8443 | 24     | 165            | 2020-02-12T15:55:40.063562885-08:00 | 0       | 2    |
+----+-----------------+-------------+------------------+--------+----------------+-------------------------------------+---------+------+

images:

+----+------------------------------------------------------------------+-----------------------------------------------+----------------+--------+--------------+---------------------------+---------------------------+-------------------------------------+--------+-------------------------------------+-------------+------------+------+
| id |                           fingerprint                            |                   filename                    |      size      | public | architecture |       creation_date       |        expiry_date        |             upload_date             | cached |            last_use_date            | auto_update | project_id | type |
+----+------------------------------------------------------------------+-----------------------------------------------+----------------+--------+--------------+---------------------------+---------------------------+-------------------------------------+--------+-------------------------------------+-------------+------------+------+
| 1  | 9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f | ubuntu-18.04-server-cloudimg-amd64-lxd.tar.xz | 1.87413264e+08 | 0      | 2            | 2020-01-28T16:00:00-08:00 | 2023-04-25T17:00:00-07:00 | 2020-02-12T12:13:40.562369879-08:00 | 1      | 2020-02-12T15:05:22.031966555-08:00 | 1           | 1          | 0    |
+----+------------------------------------------------------------------+-----------------------------------------------+----------------+--------+--------------+---------------------------+---------------------------+-------------------------------------+--------+-------------------------------------+-------------+------------+------+

images_nodes:

+----+----------+---------+
| id | image_id | node_id |
+----+----------+---------+
| 1  | 1        | 2       |
| 2  | 1        | 3       |
| 3  | 1        | 4       |
| 4  | 1        | 9       |
+----+----------+---------+

storage_volumes: (I have a feeling this shouldn’t be blank?)

+----+------+-----------------+---------+------+-------------+----------+------------+
| id | name | storage_pool_id | node_id | type | description | snapshot | project_id |
+----+------+-----------------+---------+------+-------------+----------+------------+
+----+------+-----------------+---------+------+-------------+----------+------------+

Ok, so the above suggests that:

  • You have a single image in the image store
  • The image hasn’t been loaded onto Ceph yet (rbd ls --pool RBD-POOL would confirm)
  • The image is physically present (in /var/snap/lxd/common/lxd/images/) on 4 of the cluster nodes (a quick check is sketched after this list):
    • 101-n2
    • 101-n3
    • 101-n4
    • 101-n1
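
A quick way to confirm those on-disk copies, using the snap image path from above (the node name is just one of the four listed):

# e.g. on aa1-cptef101-n1
ls -lh /var/snap/lxd/common/lxd/images/ | grep 9e7158fc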

The error seems to suggest that LXD thinks the image is already available on Ceph, despite the database indicating that it shouldn’t be yet, so that’s all a bit confusing :slight_smile:

Here’s the weird bit:

root@ceph-operator-101:~# rbd ls --pool rbd-lxc-aa0.a1f
container_pure-malamute
image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f
lxd_rbd-lxc-aa0.a1f

It does appear to be there, unless I am reading the ID wrong

Indeed, it sure looks like it’s there. This is a bit confusing; I wonder if that’s part of the issue.

Can you try moving it aside, see if that helps?

rbd mv --pool rbd-lxc-aa0.a1f --image image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f.bak
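
If it needs to come back afterwards, the rename should be reversible with the two names swapped (a sketch, same pool assumed):

rbd mv rbd-lxc-aa0.a1f/image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f.bak rbd-lxc-aa0.a1f/image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f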

Done - time to try launching again, or is there another intermediate thing to do for testing first?

Nope, I’d just try again now and see if you get a different result :slight_smile:

No change - the command seems to be hanging. I let it run for about 30 minutes with no output past “Creating the instance”.
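
While it sits there, the stuck operation can be inspected from another shell (a sketch; the UUID placeholder comes from the list output):

lxc operation list
lxc operation show <operation-uuid>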

HOWEVER, if I choose a different image:
root@aa1-cptef101-n4:/home/ubuntu# time lxc launch ubuntu:19.10
Creating the instance
Instance name is: viable-hermit

The instance you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to an instance, use: lxc network attach
Starting viable-hermit
real	0m30.300s

It works

Repeatable, too:

Deleting the image with lxc image rm fixes things, and I can launch 18.04 now. Weird.
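
Pulling the fix together, the cleanup boils down to something like this (fingerprint and pool name from above; the .bak image only exists if it was renamed in the earlier step):

# drop the cached image record so LXD re-imports it cleanly
lxc image delete 9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f
# optionally clear out the renamed RBD image left behind in the pool
rbd rm rbd-lxc-aa0.a1f/image_9e7158fc0683d41f7f692ce8b17598716d7eee925c6a593432df59488bf4131f.bak
# relaunching then re-creates the image volume on Ceph
lxc launch ubuntu:18.04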