Hello,
I am currently facing an issue with my Ceph cluster, running on Ubuntu 20.04 plus the OpenStack Victoria Ubuntu Cloud Archive (UCA).
The Ceph OSDs are hosted within LXD containers, and everything was functioning correctly when both the hosts and containers were running Ubuntu 20.04.
I am currently using LXD version 5.13 (I also tried version 5.14 from the latest/candidate SNAP channel), installed via SNAP, which worked fine on Ubuntu 20.04.
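For reference, this is roughly how I switched snap channels when testing 5.14 (a sketch; the exact channel/revision may differ on your system):

snap list lxd                                   # check the currently installed LXD snap version/channel
snap refresh lxd --channel=latest/candidate     # try LXD 5.14 from the candidate channel
snap refresh lxd --channel=latest/stable        # switch back to the stable channel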
However, after upgrading the host from Ubuntu 20.04 to 22.04 using the do-release-upgrade
command, the Ceph OSD daemon (still unchanged on Ubuntu 20.04 + UCA Victoria) within the LXD container fails to start.
Here is a portion of the LXD Profile (which works in Ubuntu 20.04):
config:
  raw.lxc: |-
    lxc.apparmor.profile = unconfined
    lxc.cgroup.devices.allow = b 253:* rwm
    lxc.mount.entry = /proc/sys/vm proc/sys/vm proc bind,rw 0 0
    lxc.mount.entry = /proc/sys/fs proc/sys/fs proc bind,rw 0 0
  security.privileged: "true"
description: osds
devices:
...
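In case it helps reproduce this, the profile is applied to the OSD containers roughly like this (a sketch; the profile and container names are just examples):

lxc profile create osds                       # create the empty profile
lxc profile edit osds < osds-profile.yaml     # paste the config shown above
lxc profile add osd-1 osds                    # attach the profile to the OSD container
lxc profile show osds                         # verify the applied settings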
Here is a portion of the LXD Container for Ceph OSD (which works in Ubuntu 20.04):
...
devices:
  mapper-control:
    path: /dev/mapper/control
    type: unix-char
  sda:
    path: /dev/sda
    source: /dev/disk/by-id/ata-Kingston_SSD_XYZ
    type: unix-block
  sdc:
    path: /dev/sdc
    source: /dev/disk/by-id/ata-Seagate_HDD_XYSA
    type: unix-block
  sdd:
    path: /dev/sdd
    source: /dev/disk/by-id/ata-Seagate_HDD_XYCZ
    type: unix-block
  sys-fs:
    path: /proc/sys/fs
    source: /proc/sys/fs
    type: disk
  sys-vm:
    path: /proc/sys/vm
    source: /proc/sys/vm
    type: disk
...
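I don't have the exact commands at hand any more, but the block devices were originally added with lxc config device add, roughly like this (a sketch, reusing the example names above):

lxc config device add osd-1 sda unix-block source=/dev/disk/by-id/ata-Kingston_SSD_XYZ path=/dev/sda
lxc config device add osd-1 mapper-control unix-char path=/dev/mapper/control
lxc config device add osd-1 sys-fs disk source=/proc/sys/fs path=/proc/sys/fs
lxc config show osd-1 --expanded     # confirm the devices and profile settings as the container sees them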
Since the host was upgraded, the Ceph OSDs inside the container (Ubuntu 20.04 + UCA) no longer start. The following errors appear:
[ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 1-<REMOVED>
[ceph_volume.process][INFO ] Running command: /usr/sbin/ceph-volume lvm trigger 4-<REMOVED>
[ceph_volume.process][INFO ] stderr Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-999
/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-<REMOVED>/osd-block-<REMOVED> --path /var/lib/ceph/osd/ceph-999 --no-mon-config
abel for /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
400 <STRING> -1 bluestore(/dev/ceph-block-<REMOVED>/osd-block-<REMOVED>) _read_bdev_label failed to open /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
d returned non-zero exit status: 1
[ceph_volume.process][INFO ] stderr Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9999
/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-block-<REMOVED>/osd-block-<REMOVED> --path /var/lib/ceph/osd/ceph-9999 --no-mon-config
abel for /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
400 <STRING> -1 bluestore(/dev/ceph-block-<REMOVED>/osd-block-<REMOVED>) _read_bdev_label failed to open /dev/ceph-block-<REMOVED>/osd-block-<REMOVED>: (1) Operation not permitted
d returned non-zero exit status: 1
[systemd][WARNING] command returned non-zero exit status: 1
[systemd][WARNING] failed activating OSD, retries left: 1
[systemd][WARNING] command returned non-zero exit status: 1
[systemd][WARNING] failed activating OSD, retries left: 1
As a result, the /var/lib/ceph/osd/ceph-XYZ directories are no longer mounted inside the LXD container, as they were before the host was upgraded to Ubuntu 22.04, and the OSDs don't show up as online on the Ceph MONs.
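For completeness, these are the standard checks I use to see whether the OSDs come back (nothing specific to this setup; the OSD id is just the one from the logs above):

ceph osd tree                     # the affected OSDs stay "down" after the host upgrade
ceph -s                           # overall cluster health, shows the OSDs as down
systemctl status ceph-osd@999     # inside the container: the OSD unit fails to start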
To debug, I ran:
root@osd-1:~# dd if=/dev/ceph-block-<REMOVED>/osd-block-<REMOVED> of=/tmpdata bs=1024 count=1000
dd: failed to open '/dev/ceph-block-<REMOVED>/osd-block-<REMOVED>': Operation not permitted
That is the same "Operation not permitted" error as in the ceph-volume logs. The same dd command works on the other 20.04-based hosts/containers.
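In case it helps to narrow this down, the same read can also be tried from the host side (assuming the VG is active there too, which it may not be on every setup), and the kernel log can be checked for denials when the container-side dd fails:

# on the host, not the container; only works if the host also activates the VG
dd if=/dev/ceph-block-<REMOVED>/osd-block-<REMOVED> of=/dev/null bs=1024 count=1000
# check whether the kernel logs a device-cgroup or AppArmor denial when the dd inside the container fails
dmesg | grep -iE 'denied|apparmor'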
It's worth mentioning that the lvdisplay command works inside the Ceph OSD container (20.04), and I can also see the LVs with ls /dev/mapper. So I believe the problem lies elsewhere.
NOTE: I'm also running /sbin/lvm vgmknodes --refresh as a systemd service in the Ceph OSD container; otherwise the LVM utilities don't work and ceph-ansible doesn't even deploy the OSDs.
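The service itself is nothing special; it's roughly the following oneshot unit (the unit name and ordering are my own choices, adjust as needed):

# inside the OSD container: refresh the LVM device nodes before the OSDs start
cat > /etc/systemd/system/lvm-vgmknodes.service <<'EOF'
[Unit]
Description=Recreate LVM device nodes inside the container
Before=ceph-osd.target

[Service]
Type=oneshot
ExecStart=/sbin/lvm vgmknodes --refresh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable --now lvm-vgmknodes.service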
I consulted ChatGPT, which suggested that the issue might be related to changes in the LXD security model and the introduction of "LXD Security Denials" in Ubuntu 22.04. However, I'm skeptical of this suggestion; I think it's hallucinating. Enabling nesting, also recommended by ChatGPT, did not resolve the issue, nor did the other tips it provided.
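(For the record, nesting was enabled with the usual setting; this is just the standard command, nothing exotic:)

lxc config set osd-1 security.nesting true   # suggestion from ChatGPT; did not help
lxc restart osd-1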
I intend to continue running my Ceph OSDs as LXD containers on Ubuntu 22.04 while ensuring they function correctly. Currently, the other nodes in the cluster (all hosts/containers running 20.04) are working as expected (Ceph OSDs inside LXD containers).
How can I resolve this problem on Ubuntu 22.04?
Given that LXD is the same SNAP package on Ubuntu 20.04 and 22.04, I expected no issues since the SNAP package itself was not modified.
If I export the Ceph OSD container with lxc export osd-1 osd-1.tar.gz, reinstall the host OS back to Ubuntu 20.04, and then run lxc import osd-1.tar.gz, everything runs again! This means the data on the storage devices is intact; the OSD just fails to start on 22.04. This is the third time within the past year that I've tried to upgrade to Ubuntu 22.04.
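The full rollback sequence, for completeness (the archive path is just an example):

lxc export osd-1 osd-1.tar.gz     # on the 22.04 host, before reinstalling
# ... reinstall the host with Ubuntu 20.04 and the LXD snap ...
lxc import osd-1.tar.gz           # on the freshly installed 20.04 host
lxc start osd-1                   # the OSD activates and rejoins the cluster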
I kindly request your advice, as this issue prevents me from upgrading my entire infrastructure to Ubuntu 22.04 or newer.
Thank you for any assistance you can provide.
NOTE: Also tried ideas from this post: https://chris-sanders.github.io/2018-05-11-block-device-in-containers/
Cheers!
Thiago