I’d guess it’s one of two things. It may be that your Ceph cluster is running a very recent release and our older client is getting confused; if that’s the case, this may help:
snap set lxd ceph.external=true
systemctl reload snap.lxd.daemon
Or it may just be that Ceph isn’t configured to allow fewer than the standard 3 replicas, in which case, since you only have a single OSD, no write can ever complete.
I believe there is a ceph.conf config key to set the default number of replicas, which in your case should probably be set to 1 to avoid issues.
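For reference, the key in question is most likely `osd_pool_default_size` (together with `osd_pool_default_min_size`); something along these lines in ceph.conf, assuming a throwaway single-OSD test setup:

```ini
# ceph.conf on the monitor nodes -- let new pools be created with a
# single replica (test setups only, a single replica is not safe for real data)
[global]
osd_pool_default_size = 1
osd_pool_default_min_size = 1
```

For a pool that already exists, the replica count can be changed at runtime with `ceph osd pool set <pool> size 1`.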
So I have tried setting up a cluster again, this time with 3 nodes. I have set ceph.external=true and reloaded the daemon on all 3 nodes. The Ceph cluster looks as follows:
$ ceph -s
  cluster:
    id:     92031cf6-bf96-11eb-a07c-5b3f8f9b90b4
    health: HEALTH_WARN
            Degraded data redundancy: 10 pgs undersized

  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 6m)
    mgr: node1.yvpsor(active, since 32m), standbys: node2.zsvbgi
    osd: 3 osds: 3 up (since 6m), 3 in (since 6m)

  data:
    pools:   2 pools, 33 pgs
    objects: 0 objects, 0 B
    usage:   3.0 GiB used, 11 TiB / 11 TiB avail
    pgs:     23 active+clean
             10 active+undersized
$ sudo ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 7.17670 1.00000 7.2 TiB 1.0 GiB 3 MiB 0 B 1 GiB 7.2 TiB 0.01 0.51 33 up
1 hdd 3.53419 1.00000 3.5 TiB 1.0 GiB 3 MiB 0 B 1 GiB 3.5 TiB 0.03 1.04 33 up
2 ssd 0.33800 1.00000 346 GiB 1.0 GiB 192 KiB 0 B 1 GiB 345 GiB 0.29 10.88 23 up
TOTAL 11 TiB 3.0 GiB 6.2 MiB 0 B 3 GiB 11 TiB 0.03
MIN/MAX VAR: 0.51/10.88 STDDEV: 0.15
$ sudo ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 11 TiB 11 TiB 6.1 MiB 2.0 GiB 0.02
ssd 346 GiB 345 GiB 200 KiB 1.0 GiB 0.29
TOTAL 11 TiB 11 TiB 6.3 MiB 3.0 GiB 0.03
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 0 B 0 0 B 0 3.5 TiB
lxd-storage 2 32 0 B 0 0 B 0 3.5 TiB
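In case it helps with debugging: undersized PGs usually mean CRUSH cannot find enough OSDs to hold the pool's configured number of replicas, which is plausible here with only 3 OSDs of very different weights. A sketch of how one might inspect this (pool names taken from the output above):

```
# Show the replication factor configured on each pool
ceph osd pool get lxd-storage size
ceph osd pool get device_health_metrics size

# List the PGs that are stuck undersized and where they map
ceph pg dump_stuck undersized

# If 2 replicas are acceptable on this 3-OSD cluster, lowering the
# pool size should let the undersized PGs become active+clean
ceph osd pool set lxd-storage size 2
```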
I have tried running lxd init both through a preseed file and through the CLI prompts, and both seem to get stuck when attempting to initialize the cluster.
Following is the input given on lxd init:
Would you like to use LXD clustering? (yes/no) [default=no]: yes
What name should be used to identify this node in the cluster? [default=node1]:
What IP address or DNS name should be used to reach this node? [default=192.168.6.10]: 192.168.1.110
Are you joining an existing cluster? (yes/no) [default=no]:
Setup password authentication on the cluster? (yes/no) [default=yes]:
Trust password for new clients:
Again:
Do you want to configure a new local storage pool? (yes/no) [default=yes]: no
Do you want to configure a new remote storage pool? (yes/no) [default=no]: yes
Name of the storage backend to use (ceph, cephfs) [default=ceph]: ceph
Create a new CEPH pool? (yes/no) [default=yes]: no
Name of the existing CEPH cluster [default=ceph]:
Name of the existing OSD storage pool [default=lxd]: lxd-storage
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]:
Would you like to create a new Fan overlay network? (yes/no) [default=yes]: no
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: yes
config:
  core.https_address: 192.168.1.110:8443
  core.trust_password: secret
networks: []
storage_pools:
- config:
    ceph.cluster_name: ceph
    ceph.osd.pool_name: lxd-storage
    source: lxd-storage
  description: ""
  name: remote
  driver: ceph
profiles:
- config: {}
  description: ""
  devices:
    root:
      path: /
      pool: remote
      type: disk
  name: default
projects: []
cluster:
  server_name: node1
  enabled: true
  member_config: []
  cluster_address: ""
  cluster_certificate: ""
  server_address: ""
  cluster_password: ""
### lxd blocks here and does not return ###
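As an aside, a preseed like the one printed above can be fed back in non-interactively, which makes it easier to retry the bootstrap after cleaning up (this is the standard lxd init preseed usage; the filename here is just an example):

```
# Re-run the bootstrap without the interactive prompts
cat preseed.yaml | lxd init --preseed

# While it blocks, the snap daemon log may show what it is waiting on
snap logs lxd
```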
I have tried reloading the daemon yet again on the bootstrap node and noticed the following error message when executing lxd init:
Error: Failed to create storage pool 'default': Storage pool directory "/var/snap/lxd/common/lxd/storage-pools/default" already exists
Removing the directory was not sufficient, as the init process would still get stuck. I then realized LXD was already listening on port 8443, so I had to unset core.https_address:
$ sudo netstat -atulpen | grep lxd
tcp 0 0 192.168.1.110:8443 0.0.0.0:* LISTEN 0 210086 20984/lxd
$ lxd init
### init process blocks like the previous attempt ###
$ lxc config unset core.https_address
$ lxd init
### init runs fine this time around ###
$ lxc cluster list
+-------+----------------------------+----------+--------+-------------------+--------------+----------------+
| NAME | URL | DATABASE | STATE | MESSAGE | ARCHITECTURE | FAILURE DOMAIN |
+-------+----------------------------+----------+--------+-------------------+--------------+----------------+
| node1 | https://192.168.1.110:8443 | YES | ONLINE | Fully operational | x86_64 | default |
+-------+----------------------------+----------+--------+-------------------+--------------+----------------+
This, however, does not seem like an optimal solution. Any ideas as to what could cause lxd init to block on the first init sequence are appreciated.
It seems the successful cluster creation might have been a false positive. While the storage pool is created on LXD, it seems to be unusable, as an instance creation command has now been stuck for about 3 hours.
Hmm, can you try snap set lxd ceph.external=true followed by systemctl reload snap.lxd.daemon to restart the snap (or if completely stuck, reboot the system maybe?)
This will make the snap use the same version of the ceph tools as your system. We’ve found this to be required in some environments depending on the version of ceph running on the server side.
After setting ceph.external=true, reloading the daemon and rebooting all nodes, I tried to create a new instance, which failed:
$ lxc init ubuntu:20.04 container-test -p container-private -s default
Creating container-test
Error: Failed instance creation: Failed creating instance from image: Failed to create mount directory "/var/snap/lxd/common/lxd/storage-pools/default/images/52c9bf12cbd3b06d591c5f56f8d9a185aca4a9a7da4d6e9f26f0ba44f68867b7": mkdir /var/snap/lxd/common/lxd/storage-pools/default/images/52c9bf12cbd3b06d591c5f56f8d9a185aca4a9a7da4d6e9f26f0ba44f68867b7: no such file or directory
The image however seems to be present:
$ lxc image list
+-------+--------------+--------+---------------------------------------------+--------------+-----------+----------+-------------------------------+
| ALIAS | FINGERPRINT | PUBLIC | DESCRIPTION | ARCHITECTURE | TYPE | SIZE | UPLOAD DATE |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+----------+-------------------------------+
| | 52c9bf12cbd3 | no | ubuntu 20.04 LTS amd64 (release) (20210510) | x86_64 | CONTAINER | 358.34MB | May 28, 2021 at 10:37am (UTC) |
+-------+--------------+--------+---------------------------------------------+--------------+-----------+----------+-------------------------------+
The weird thing is that I set up both the LXD and Ceph clusters through Puppet, and there have been times where the setup works and times where the LXD cluster bootstrap phase blocks my Puppet run, the latter being more common.
This seems to suggest that the previous failure prevented /var/snap/lxd/common/lxd/storage-pools/default or its sub-directories (images, containers, containers-snapshots, virtual-machines, virtual-machines-snapshots, custom and custom-snapshots) from being created.
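If that is the case, recreating the skeleton by hand (using the sub-directory names listed above) might unblock image unpacking, though reloading the daemon so that LXD recreates them itself would be cleaner. A hedged sketch:

```
# Recreate the storage pool mount skeleton that LXD expects
sudo mkdir -p /var/snap/lxd/common/lxd/storage-pools/default/{images,containers,containers-snapshots,virtual-machines,virtual-machines-snapshots,custom,custom-snapshots}

# Then reload the daemon and retry the instance creation
sudo systemctl reload snap.lxd.daemon
```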