Introducing MicroCeph

As of this morning, two nodes of my three-node MicroCeph cluster are experiencing constant restarts of the microceph daemon. The error message is “Daemon failed to start: Failed to re-establish cluster connection: context deadline exceeded”. The Ceph cluster itself reports healthy and the LXD cluster shows all nodes as online. It started when my monitoring notified me that the Ceph cluster was in WARN because some OSDs appeared to bounce. This looks like the start of the trouble:

Apr 14 09:21:29 lxd01 audit[428130]: AVC apparmor="DENIED" operation="ptrace" namespace="root//lxd-nextcloud_<var-snap-lxd-common-lxd>" profile="snap.nextcloud.nextcloud-cron" >
Apr 14 09:21:29 lxd01 audit[428130]: AVC apparmor="DENIED" operation="ptrace" namespace="root//lxd-nextcloud_<var-snap-lxd-common-lxd>" profile="snap.nextcloud.nextcloud-cron" >
Apr 14 09:29:10 lxd01 snapd[3408143]: storehelpers.go:769: cannot refresh: snap has no updates available: "bpytop", "btop", "core20", "core22", "lxd", "microcloud", "snapd"
Apr 14 09:29:18 lxd01 systemd[1]: Reloading.
Apr 14 09:29:23 lxd01 systemd[1]: Mounting Mount unit for microceph, revision 318...
Apr 14 09:29:23 lxd01 kernel: loop16: detected capacity change from 0 to 181128
Apr 14 09:29:23 lxd01 systemd[1]: Mounted Mount unit for microceph, revision 318.
Apr 14 09:29:24 lxd01 systemd[1]: Stopping Service for snap application microceph.mon...
Apr 14 09:29:24 lxd01 microceph.mon[1812]: 2023-04-14T09:29:24.068-0500 7f4c37fff640 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
Apr 14 09:29:24 lxd01 microceph.mon[1812]: 2023-04-14T09:29:24.068-0500 7f4c37fff640 -1 mon.lxd01@2(peon) e4 *** Got Signal Terminated ***
Apr 14 09:29:24 lxd01 systemd[1]: snap.microceph.mon.service: Deactivated successfully.
Apr 14 09:29:24 lxd01 systemd[1]: Stopped Service for snap application microceph.mon.
Apr 14 09:29:24 lxd01 systemd[1]: snap.microceph.mon.service: Consumed 12h 53min 20.848s CPU time.
Apr 14 09:29:24 lxd01 systemd[1]: Stopping Service for snap application microceph.daemon...
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: State 'stop-sigterm' timed out. Killing.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Killing process 1808 (microcephd) with signal SIGKILL.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Killing process 3714 (microcephd) with signal SIGKILL.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Killing process 3743 (microcephd) with signal SIGKILL.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Killing process 4693 (n/a) with signal SIGKILL.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Killing process 8394 (n/a) with signal SIGKILL.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Killing process 12558 (n/a) with signal SIGKILL.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Main process exited, code=killed, status=9/KILL
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Failed with result 'timeout'.
Apr 14 09:29:54 lxd01 systemd[1]: Stopped Service for snap application microceph.daemon.
Apr 14 09:29:54 lxd01 systemd[1]: snap.microceph.daemon.service: Consumed 1h 38min 5.330s CPU time.
Apr 14 09:29:54 lxd01 systemd[1]: Stopping Service for snap application microceph.mds...
Apr 14 09:29:55 lxd01 microceph.mds[1809]: 2023-04-14T09:29:54.996-0500 7f2a175e7640 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
Apr 14 09:29:55 lxd01 microceph.mds[1809]: 2023-04-14T09:29:54.996-0500 7f2a175e7640 -1 mds.lxd01 *** got signal Terminated ***
Apr 14 09:30:03 lxd01 systemd[1]: snap.microceph.mds.service: Deactivated successfully.
Apr 14 09:30:03 lxd01 systemd[1]: Stopped Service for snap application microceph.mds.
Apr 14 09:30:03 lxd01 systemd[1]: snap.microceph.mds.service: Consumed 1h 27min 59.255s CPU time.
Apr 14 09:30:03 lxd01 systemd[1]: Stopping Service for snap application microceph.mgr...
Apr 14 09:30:03 lxd01 systemd[1]: snap.microceph.mgr.service: Deactivated successfully.
Apr 14 09:30:03 lxd01 systemd[1]: Stopped Service for snap application microceph.mgr.
Apr 14 09:30:03 lxd01 systemd[1]: snap.microceph.mgr.service: Consumed 1h 1min 22.952s CPU time.
Apr 14 09:30:03 lxd01 systemd[1]: Stopping Service for snap application microceph.osd...
Apr 14 09:30:04 lxd01 kernel: libceph: osd18 (1)192.168.86.27:6827 socket closed (con state OPEN)
Apr 14 09:30:04 lxd01 kernel: libceph: osd20 (1)192.168.86.27:6843 socket closed (con state OPEN)
Apr 14 09:30:04 lxd01 kernel: libceph: osd15 (1)192.168.86.27:6803 socket closed (con state OPEN)
Apr 14 09:30:04 lxd01 kernel: libceph: osd22 (1)192.168.86.27:6859 socket closed (con state OPEN)
Apr 14 09:30:04 lxd01 kernel: libceph: osd21 (1)192.168.86.27:6851 socket closed (con state OPEN)
Apr 14 09:30:04 lxd01 kernel: libceph: osd18 (1)192.168.86.27:6827 socket closed (con state V1_BANNER)
Apr 14 09:30:04 lxd01 kernel: libceph: osd17 (1)192.168.86.27:6819 socket closed (con state OPEN)
Apr 14 09:30:04 lxd01 kernel: libceph: osd18 (1)192.168.86.27:6827 socket error on write
Apr 14 09:30:05 lxd01 kernel: libceph: osd18 (1)192.168.86.27:6827 socket error on write
Apr 14 09:30:05 lxd01 kernel: libceph: osd15 down
Apr 14 09:30:05 lxd01 kernel: libceph: osd16 down
Apr 14 09:30:05 lxd01 kernel: libceph: osd18 down
Apr 14 09:30:05 lxd01 kernel: libceph: osd19 down
Apr 14 09:30:05 lxd01 kernel: libceph: osd22 (1)192.168.86.27:6859 socket closed (con state V1_BANNER)
Apr 14 09:30:05 lxd01 kernel: libceph: osd21 (1)192.168.86.27:6851 socket closed (con state V1_BANNER)
Apr 14 09:30:05 lxd01 kernel: libceph: osd22 (1)192.168.86.27:6859 socket error on write
Apr 14 09:30:05 lxd01 kernel: libceph: osd21 (1)192.168.86.27:6851 socket error on write
Apr 14 09:30:06 lxd01 kernel: libceph: osd22 (1)192.168.86.27:6859 socket error on write
Apr 14 09:30:06 lxd01 kernel: libceph: osd20 down
Apr 14 09:30:06 lxd01 kernel: libceph: osd21 down
Apr 14 09:30:06 lxd01 kernel: libceph: osd22 down
Apr 14 09:30:07 lxd01 systemd[1]: snap.microceph.osd.service: Deactivated successfully.
Apr 14 09:30:07 lxd01 systemd[1]: Stopped Service for snap application microceph.osd.
Apr 14 09:30:07 lxd01 systemd[1]: snap.microceph.osd.service: Consumed 5d 8h 50min 32.310s CPU time.
Apr 14 09:30:07 lxd01 kernel: libceph: osd17 (1)192.168.86.27:6819 socket closed (con state V1_BANNER)
Apr 14 09:30:07 lxd01 snapd[3408143]: services.go:1090: RemoveSnapServices - disabling snap.microceph.daemon.service
Apr 14 09:30:07 lxd01 snapd[3408143]: services.go:1090: RemoveSnapServices - disabling snap.microceph.mgr.service
Apr 14 09:30:07 lxd01 snapd[3408143]: services.go:1090: RemoveSnapServices - disabling snap.microceph.mds.service
Apr 14 09:30:07 lxd01 snapd[3408143]: services.go:1090: RemoveSnapServices - disabling snap.microceph.rgw.service
Apr 14 09:30:07 lxd01 snapd[3408143]: services.go:1090: RemoveSnapServices - disabling snap.microceph.mon.service
Apr 14 09:30:07 lxd01 snapd[3408143]: services.go:1090: RemoveSnapServices - disabling snap.microceph.osd.service
Apr 14 09:30:07 lxd01 systemd[1]: Reloading.
Apr 14 09:30:07 lxd01 kernel: libceph: osd17 (1)192.168.86.27:6819 socket error on write
Apr 14 09:30:08 lxd01 kernel: libceph: osd17 (1)192.168.86.27:6819 socket error on write
Apr 14 09:30:08 lxd01 kernel: libceph: osd17 down
Apr 14 09:30:16 lxd01 audit[430964]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/snap/snapd/18596/usr>
Apr 14 09:30:16 lxd01 audit[430964]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/snap/snapd/18596/usr>
Apr 14 09:30:16 lxd01 kernel: kauditd_printk_skb: 33 callbacks suppressed
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.051:131897): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unc>
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.051:131898): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unc>
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.051:131898): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unc>
Apr 14 09:30:16 lxd01 audit[430968]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mds" pid=430968 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.243:131899): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mds" pid=43>
Apr 14 09:30:16 lxd01 audit[430974]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.rbd" pid=430974 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.247:131900): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.rbd" pid=43>
Apr 14 09:30:16 lxd01 audit[430973]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="snap.microceph.radosgw-admin" pid=430973 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 audit[430971]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mon" pid=430971 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.251:131901): apparmor="STATUS" operation="profile_load" profile="unconfined" name="snap.microceph.radosgw-admin">
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.251:131902): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mon" pid=43>
Apr 14 09:30:16 lxd01 audit[430969]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mgr" pid=430969 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.271:131903): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.mgr" pid=43>
Apr 14 09:30:16 lxd01 audit[430970]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.microceph" pid=430970 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 audit[430967]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.daemon" pid=430967 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.279:131904): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.microceph" >
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.279:131905): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.daemon" pid>
Apr 14 09:30:16 lxd01 audit[430966]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.ceph" pid=430966 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 kernel: audit: type=1400 audit(1681482616.287:131906): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.ceph" pid=4>
Apr 14 09:30:16 lxd01 audit[430975]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.rgw" pid=430975 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 audit[430972]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.microceph.osd" pid=430972 comm="apparmor_parser"
Apr 14 09:30:16 lxd01 audit[430978]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.microc>
Apr 14 09:30:16 lxd01 audit[430977]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.microc>
Apr 14 09:30:16 lxd01 audit[430979]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.microcloud.daemo>
Apr 14 09:30:16 lxd01 audit[430980]: AVC apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.microcloud.micro>
Apr 14 09:30:16 lxd01 systemd[1]: Reloading.
Apr 14 09:30:17 lxd01 systemd[1]: Reloading.
Apr 14 09:30:19 lxd01 systemd[1]: Started Service for snap application microceph.daemon.
Apr 14 09:30:19 lxd01 systemd[1]: Started Service for snap application microceph.osd.
Apr 14 09:30:19 lxd01 systemd[1]: Started Service for snap application microceph.mon.
Apr 14 09:30:19 lxd01 systemd[1]: Started Service for snap application microceph.mds.
Apr 14 09:30:19 lxd01 systemd[1]: Started Service for snap application microceph.mgr.
Apr 14 09:30:19 lxd01 audit[431054]: AVC apparmor="DENIED" operation="capable" profile="/snap/snapd/18596/usr/lib/snapd/snap-confine" pid=431054 comm="snap-confine" capability=>
Apr 14 09:30:19 lxd01 audit[431054]: AVC apparmor="DENIED" operation="capable" profile="/snap/snapd/18596/usr/lib/snapd/snap-confine" pid=431054 comm="snap-confine" capability=>
Apr 14 09:30:19 lxd01 audit[431061]: AVC apparmor="DENIED" operation="capable" profile="/snap/snapd/18596/usr/lib/snapd/snap-confine" pid=431061 comm="snap-confine" capability=>
Apr 14 09:30:19 lxd01 audit[431061]: AVC apparmor="DENIED" operation="capable" profile="/snap/snapd/18596/usr/lib/snapd/snap-confine" pid=431061 comm="snap-confine" capability=>
Apr 14 09:30:19 lxd01 snapd[3408143]: storehelpers.go:769: cannot refresh snap "microceph": snap has no updates available
Apr 14 09:30:29 lxd01 audit[431092]: AVC apparmor="DENIED" operation="unlink" profile="snap.microceph.mgr" name="/var/snap/microceph/220/run/ceph-mgr.lxd01.asok" pid=431092 com>
Apr 14 09:30:29 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:29.035-0500 7f49f57dcdc0 -1 asok(0x561a692799c0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen:>
Apr 14 09:30:29 lxd01 kernel: kauditd_printk_skb: 10 callbacks suppressed
Apr 14 09:30:29 lxd01 kernel: audit: type=1400 audit(1681482629.035:131917): apparmor="DENIED" operation="unlink" profile="snap.microceph.mgr" name="/var/snap/microceph/220/run>
Apr 14 09:30:29 lxd01 audit[431075]: AVC apparmor="DENIED" operation="mknod" profile="snap.microceph.mds" name="/var/snap/microceph/220/run/ceph-mds.lxd01.asok" pid=431075 comm>
Apr 14 09:30:29 lxd01 microceph.mds[431075]: 2023-04-14T09:30:29.651-0500 7f7a2be0f6c0 -1 asok(0x55a4db036f20) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen:>
Apr 14 09:30:29 lxd01 microceph.mds[431075]: starting mds.lxd01 at
Apr 14 09:30:29 lxd01 kernel: audit: type=1400 audit(1681482629.651:131918): apparmor="DENIED" operation="mknod" profile="snap.microceph.mds" name="/var/snap/microceph/220/run/>
Apr 14 09:30:31 lxd01 audit[431068]: AVC apparmor="DENIED" operation="mknod" profile="snap.microceph.mon" name="/var/snap/microceph/220/run/ceph-mon.lxd01.asok" pid=431068 comm>
Apr 14 09:30:31 lxd01 microceph.mon[431068]: 2023-04-14T09:30:31.271-0500 7f1212758980 -1 asok(0x56275fc3c8f0) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen:>
Apr 14 09:30:31 lxd01 kernel: audit: type=1400 audit(1681482631.271:131919): apparmor="DENIED" operation="mknod" profile="snap.microceph.mon" name="/var/snap/microceph/220/run/>
Apr 14 09:30:33 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:33.435-0500 7f49f57dcdc0 -1 mgr[py] Module alerts has missing NOTIFY_TYPES member
Apr 14 09:30:33 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:33.623-0500 7f49f57dcdc0 -1 mgr[py] Module balancer has missing NOTIFY_TYPES member
Apr 14 09:30:33 lxd01 audit[431330]: AVC apparmor="DENIED" operation="unlink" profile="snap.microceph.osd" name="/var/snap/microceph/220/run/ceph-osd.15.asok" pid=431330 comm=">
Apr 14 09:30:33 lxd01 microceph.osd[431330]: 2023-04-14T09:30:33.719-0500 7f3f7550b5c0 -1 asok(0x561e8a509e20) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen:>
Apr 14 09:30:33 lxd01 kernel: audit: type=1400 audit(1681482633.719:131920): apparmor="DENIED" operation="unlink" profile="snap.microceph.osd" name="/var/snap/microceph/220/run>
Apr 14 09:30:33 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:33.927-0500 7f49f57dcdc0 -1 mgr[py] Module crash has missing NOTIFY_TYPES member
Apr 14 09:30:34 lxd01 microceph.osd[431330]: 2023-04-14T09:30:34.615-0500 7f3f7550b5c0 -1 Falling back to public interface
Apr 14 09:30:35 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:35.487-0500 7f49f57dcdc0 -1 mgr[py] Module devicehealth has missing NOTIFY_TYPES member
Apr 14 09:30:35 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:35.595-0500 7f49f57dcdc0 -1 mgr[py] Module influx has missing NOTIFY_TYPES member
Apr 14 09:30:35 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:35.807-0500 7f49f57dcdc0 -1 mgr[py] Module iostat has missing NOTIFY_TYPES member
Apr 14 09:30:36 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:36.371-0500 7f49f57dcdc0 -1 mgr[py] Module orchestrator has missing NOTIFY_TYPES member
Apr 14 09:30:36 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:36.587-0500 7f49f57dcdc0 -1 mgr[py] Module osd_perf_query has missing NOTIFY_TYPES member
Apr 14 09:30:36 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:36.691-0500 7f49f57dcdc0 -1 mgr[py] Module osd_support has missing NOTIFY_TYPES member
Apr 14 09:30:36 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:36.903-0500 7f49f57dcdc0 -1 mgr[py] Module pg_autoscaler has missing NOTIFY_TYPES member
Apr 14 09:30:37 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:37.011-0500 7f49f57dcdc0 -1 mgr[py] Module progress has missing NOTIFY_TYPES member
Apr 14 09:30:37 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:37.503-0500 7f49f57dcdc0 -1 mgr[py] Module prometheus has missing NOTIFY_TYPES member
Apr 14 09:30:37 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:37.695-0500 7f49f57dcdc0 -1 mgr[py] Module rbd_support has missing NOTIFY_TYPES member
Apr 14 09:30:38 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:38.551-0500 7f49f57dcdc0 -1 mgr[py] Module selftest has missing NOTIFY_TYPES member
Apr 14 09:30:38 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:38.675-0500 7f49f57dcdc0 -1 mgr[py] Module snap_schedule has missing NOTIFY_TYPES member
Apr 14 09:30:39 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:39.003-0500 7f49f57dcdc0 -1 mgr[py] Module status has missing NOTIFY_TYPES member
Apr 14 09:30:39 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:39.111-0500 7f49f57dcdc0 -1 mgr[py] Module telegraf has missing NOTIFY_TYPES member
Apr 14 09:30:39 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:39.435-0500 7f49f57dcdc0 -1 mgr[py] Module telemetry has missing NOTIFY_TYPES member
Apr 14 09:30:39 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:39.751-0500 7f49f57dcdc0 -1 mgr[py] Module test_orchestrator has missing NOTIFY_TYPES member
Apr 14 09:30:40 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:40.147-0500 7f49f57dcdc0 -1 mgr[py] Module volumes has missing NOTIFY_TYPES member
Apr 14 09:30:40 lxd01 microceph.mgr[431092]: 2023-04-14T09:30:40.255-0500 7f49f57dcdc0 -1 mgr[py] Module zabbix has missing NOTIFY_TYPES member
Apr 14 09:31:01 lxd01 microceph.daemon[431054]: Error: Unable to start daemon: Daemon failed to start: Failed to re-establish cluster connection: context deadline exceeded
Apr 14 09:31:01 lxd01 systemd[1]: snap.microceph.daemon.service: Main process exited, code=exited, status=1/FAILURE
Apr 14 09:31:01 lxd01 systemd[1]: snap.microceph.daemon.service: Failed with result 'exit-code'.
Apr 14 09:31:01 lxd01 systemd[1]: snap.microceph.daemon.service: Consumed 4.850s CPU time.
Apr 14 09:31:02 lxd01 systemd[1]: snap.microceph.daemon.service: Scheduled restart job, restart counter is at 1.

Sorry for the large log dump, but I wanted to capture everything from the original kill of the daemon by init to the error message “context deadline exceeded”.

I’m not super familiar with snaps, but could this be due to a snap package update? The log shows revision 318 being mounted at 09:29:23. Both affected hosts are on “microceph 0+git.ec95dcb 318”, while the unaffected host is still on “microceph 0+git.6208776 220”.
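
The revision mismatch described above can be checked mechanically. This is only a minimal sketch, assuming each node reports a string in the `name version revision` form quoted in the post; `snap_rev` and `revs_match` are hypothetical helper names, and `snap refresh --hold` needs a reasonably recent snapd.

```shell
#!/bin/sh
# Hypothetical helper: extract the revision (third field) from a
# "name version revision" string such as "microceph 0+git.ec95dcb 318".
snap_rev() {
    printf '%s\n' "$1" | awk '{print $3}'
}

# Succeeds (exit 0) only when every argument carries the same revision.
revs_match() {
    first=$(snap_rev "$1")
    for entry in "$@"; do
        [ "$(snap_rev "$entry")" = "$first" ] || return 1
    done
    return 0
}

# Example with the two values from the post; warns on a mismatch.
if ! revs_match "microceph 0+git.ec95dcb 318" "microceph 0+git.6208776 220"; then
    echo "microceph revisions differ across nodes"
fi

# On a live node (not run here):
#   snap list microceph                  # installed version and revision
#   snap changes                         # recent (auto-)refresh activity
#   sudo snap refresh --hold microceph   # pause auto-refresh for this snap
```

Holding refreshes does not repair the nodes that have already moved to the new revision; it only keeps the remaining node from being refreshed before the cause is understood.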

Hi, I am new to MicroCeph. I have installed it on an Ubuntu machine. All went well up to the file system creation, which I will be using for my clients. How do I configure MicroCeph storage as a kernel mount on my client machines?

I have been a Ceph storage user for 2 years with Ceph Octopus. There I was able to use it as a kernel mount, as well as with Kubernetes as a StorageClass for the file system created from Ceph storage.

Thanks in advance.

FYI, here is how to enable the web dashboard:

microceph.ceph config set mgr mgr/dashboard/ssl false
microceph.ceph mgr module enable dashboard
echo -n "p@ssw0rd" > /var/snap/microceph/current/conf/password.txt
microceph.ceph dashboard ac-user-create -i /etc/ceph/password.txt admin administrator
rm /var/snap/microceph/current/conf/password.txt

Voilà! You now have the dashboard on HTTP port 8080 :sunny:

Thanks, I was searching for this.

How do I mount CephFS at /mnt/ceph on Ubuntu 22.04? Which commands do I need after the initial setup?

MicroCeph is a Canonical product which is no longer related to Linux Containers since Canonical’s decision to take away LXD from the project. You should probably reach out to Canonical, likely through their forum for support on MicroCeph.

If you are running MicroCeph with partitions (rather than whole disks), AppArmor needs to be put in complain mode for OSD creation to succeed. The symptom is errors like this in dmesg:

[ 1549.859092] audit: type=1400 audit(1699830490.855:88): apparmor="DENIED" operation="open" profile="snap.microceph.daemon" name="/dev/sr0" pid=28283 comm="microcephd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 1550.766299] audit: type=1400 audit(1699830491.755:89): apparmor="DENIED" operation="open" profile="snap.microceph.daemon" name="/dev/vda3" pid=29327 comm="ceph-osd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 1550.767410] audit: type=1400 audit(1699830491.755:90): apparmor="DENIED" operation="open" profile="snap.microceph.daemon" name="/dev/vda3" pid=29327 comm="ceph-osd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 1550.767420] audit: type=1400 audit(1699830491.755:91): apparmor="DENIED" operation="open" profile="snap.microceph.daemon" name="/dev/vda3" pid=29327 comm="ceph-osd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[ 1550.773590] audit: type=1400 audit(1699830491.763:92): apparmor="DENIED" operation="open" profile="snap.microceph.daemon" name="/dev/vda3" pid=29327 comm="ceph-osd" requested_mask="wrc" denied_mask="wrc" fsuid=0 ouid=0
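
To see at a glance which devices are being refused, the denials can be filtered down to the blocked paths. A minimal sketch, assuming the log lines have the same format as the ones above; `denied_paths` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the unique
# paths that AppArmor denied for any snap.microceph.* profile.
denied_paths() {
    grep 'apparmor="DENIED"' \
        | grep 'profile="snap\.microceph\.' \
        | sed -n 's/.* name="\([^"]*\)".*/\1/p' \
        | sort -u
}

# On a live node (not run here):
#   sudo dmesg | denied_paths
```

For the excerpt above this would list /dev/sr0 and /dev/vda3, the latter being the partition the OSD needs.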

To temporarily enable complain mode:

echo -n complain > /sys/module/apparmor/parameters/mode
  • This allowed microceph init to succeed.

  • The following outbound destination ports need opening in the firewall: tcp 3300, 6789 and 7443. For the OSDs, opening destination ports tcp 6800-6810 as well is recommended.

  • Having the ceph daemon listening on a wireguard interface works ok

  • microceph.ceph health detail is a useful command

  • It took about 20 minutes for ceph to become happy - this would probably have been quicker if I restarted snap.microceph.osd.service after setting the correct firewall rules.

  • Adding partitions from /dev/disk/by-path works:

[root@host4 ~]# microceph.ceph status

  cluster:
    id:     61ee0596-5913-48c2-92dd-7d24d74bd979
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum host1,host3,host4 (age 90m)
    mgr: host1(active, since 2h), standbys: host3, host4
    osd: 4 osds: 4 up (since 53m), 4 in (since 54m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   84 MiB used, 152 GiB / 152 GiB avail
    pgs:     1 active+clean

  • The above healthy cluster is with 4 x 38 GB NVMe disk partitions & 1 Gbps ports connected with WireGuard using wg-meshconf (forked to add preshared keys)
  • Enabling wg-quick@interface_name services was less problematic than configuring WireGuard with systemd-networkd (which worked & then stopped setting the routes & giving wg0 an IP address after a reboot)
  • If you are running Ceph from a partition on a single disk, copy line 619 in /var/lib/snapd/apparmor/profiles/snap.microceph.osd to allow your partition (e.g. add a line of /dev/vda3 rwk, so Ceph still works after a reboot) & perhaps make the file immutable with chattr +i until this is fixed.

If you are running Incus, you should also snap remove lxd.

I did some more reading & network testing of WireGuard (680 Mbps) & a normal 1 Gbps interface (935 Mbps), so it makes more sense to run MicroCeph on the main interface, firewalled to trusted IPs, on a small connection (unless you really want double encryption).

For the firewall, allow tcp 3300, 6800-6810 and 7443 & not tcp 6789 (the legacy v1 protocol). Port 3300 is the newer v2 messenger protocol, which supports on-the-wire encryption when run in secure mode.
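
The port list above could be turned into per-peer rules along these lines. This is a sketch only, assuming ufw is the firewall in use (the post does not say which one is) and using placeholder addresses from the documentation range; `ceph_ufw_rules` is a hypothetical helper that prints the rules rather than applying them.

```shell
#!/bin/sh
# Hypothetical helper: print (not execute) ufw rules that admit the TCP
# ports listed in the post from each trusted peer address given as an
# argument. ufw writes port ranges with a colon, hence 6800:6810.
ceph_ufw_rules() {
    for peer in "$@"; do
        for port in 3300 6800:6810 7443; do
            echo "ufw allow proto tcp from $peer to any port $port"
        done
    done
}

# Example with placeholder peer addresses (assumptions, not from the post).
ceph_ufw_rules 192.0.2.11 192.0.2.12

# To apply after reviewing the output (not run here):
#   ceph_ufw_rules 192.0.2.11 192.0.2.12 | sudo sh
```

Printing first and applying second keeps a mistyped address from locking a node out of its own cluster.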

The dashboard doesn’t work at the moment, either over https with a self-signed cert or over http, & probably needs normal Ceph configured to work properly. Again, AppArmor is the most likely cause:

Nov 14 22:52:59 host1 kernel: [  196.137689] audit: type=1400 audit(1700002379.659:34): apparmor="DENIED" operation="capable" profile="snap.microceph.mgr" pid=627 comm="dashboard" capability=12  capname="net_admin"
  • Disabling apparmor to ‘fix’ the dashboard is not an option as microceph refuses to start without it
  • Enabling prometheus alerts / metrics is probably sufficient

To enable end to end encrypted connections with compression - on each node run:

ceph mon enable-msgr2  # (enable msgr v2 protocol)
ceph config set mon ms_bind_msgr1 false
# enable secure only mode + compression
ceph config set mon ms_cluster_mode secure
ceph config set mon ms_service_mode secure
ceph config set mon ms_client_mode secure
ceph config set mon ms_mon_cluster_mode secure
ceph config set mon ms_mon_service_mode secure
ceph config set mon ms_mon_client_mode secure
ceph config set mon ms_compress_secure true
ceph config set osd ms_osd_compress_mode force
# restart services (or reboot)
systemctl restart snap.microceph.mon
systemctl restart snap.microceph.osd
  • This seems to stop traffic on port 6789 (the old v1 monitor protocol that is susceptible to man in the middle attacks) - the port is still open but does not send anything.

  • Your node monitors are now communicating securely on port 3300 & your osd’s too.

  • End to end encryption + compression uses around 200mb more ram ( 1.1gb total on monitor nodes). On a 4 node cluster the 4th node not running the monitor uses approx 650mb less ram (so would be a good choice for scheduler.instance options)
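
After the restart it may be worth confirming that the settings above actually took effect. A minimal sketch, assuming the cluster is reached through `microceph.ceph` (the post uses plain `ceph`, so adjust `CEPH` to match); `insecure_modes` is a hypothetical helper name.

```shell
#!/bin/sh
# Command used to talk to the cluster. Assumption: the snap's alias;
# set CEPH=ceph if a plain ceph client is configured instead.
CEPH=${CEPH:-microceph.ceph}

# Hypothetical helper: print every messenger mode option set in the post
# whose current value is not "secure". No output means all six are secure.
insecure_modes() {
    for opt in ms_cluster_mode ms_service_mode ms_client_mode \
               ms_mon_cluster_mode ms_mon_service_mode ms_mon_client_mode; do
        val=$($CEPH config get mon "$opt")
        [ "$val" = "secure" ] || echo "$opt=$val"
    done
}

# On a live node (not run here):
#   insecure_modes
#   $CEPH config get mon ms_bind_msgr1   # expected to report false
```

An empty result only shows that the configuration database holds the intended values; the daemons still need the restart from the post before they use them.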

It seems sensible to configure microovn next - open tcp ports 6443 / 6641-6643 (ovsdb-server) & 6081 (geneve tunnel) limited to your node ip’s.

@stgraber Really nice piece of software. What is the best way to start over? For example, say we test MicroCeph and then decide to remove the cluster from the host.

If I try to remove the host with:

microceph cluster remove micro01

then I get an error:

Error: Cannot remove cluster members, there are no remaining non-pending members

Do I need to uninstall microceph with snap?

Will the disks then be usable to bootstrap a new Ceph cluster, for example with a different public network?

Thanks.

Hey there,

I no longer work at Canonical and so haven’t touched any of the microXYZ stuff since I left back in July.

For your situation, I suspect doing snap remove --purge microceph followed by a system reboot should give you a proper clean slate.

I built a cluster using only the microceph snap, without LXD, but after two weeks I see this:

ceph -s

  cluster:
    id:     3285f2c6-f7e2-4cf8-868a-528b4ec24448
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            197 pgs not deep-scrubbed in time
            232 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum storage01.mfa.local,storage02.mfa.local,storage03.mfa.local (age 18h)
    mgr: storage01.mfa.local(active, since 18h), standbys: storage03.mfa.local, storage02.mfa.local
    mds: 3/3 daemons up
    osd: 6 osds: 6 up (since 18h), 6 in (since 21h); 18 remapped pgs

  data:
    volumes: 3/3 healthy
    pools:   7 pools, 241 pgs
    objects: 41.67M objects, 2.8 TiB
    usage:   9.1 TiB used, 15 TiB / 24 TiB avail
    pgs:     6937112/125021568 objects misplaced (5.549%)
             223 active+clean
             17  active+remapped+backfill_wait
             1   active+remapped+backfilling

  io:
    recovery: 1.8 MiB/s, 21 objects/s

How can I solve this?