VM does not start on ceph storage (microceph)

I’m trying to deploy a VM on my RPi 4 cluster with Incus 6.3 and I’m getting the following error:

$ incus start test-vm --project test --console
Error: Failed setting up device via monitor: Failed adding block device for disk device "root": Failed adding block device: error reading conf file /etc/ceph/ceph.conf: No such file or directory
Try `incus info --show-log test-vm` for more info

Below is the configuration.
Instance config:
$ incus config show --expanded test-vm --project test

architecture: aarch64
config:
  image.architecture: arm64
  image.description: Alpine edge arm64 (20240715_13:01)
  image.os: Alpine
  image.release: edge
  image.requirements.secureboot: "false"
  image.serial: "20240715_13:01"
  image.type: disk-kvm.img
  image.variant: default
  security.secureboot: "false"
  volatile.apply_template: create
  volatile.base_image: 2f8163c0b5f6b63cf28d2048991ae055aed222ae5c29d0ce1645aa7bf647c41f
  volatile.cloud-init.instance-id: 34c6de9a-2b4b-42f6-b5f8-a8edce2744b1
  volatile.uuid: 791f8fad-460a-46e6-9ae8-1ccdcf1e74f2
  volatile.uuid.generation: 791f8fad-460a-46e6-9ae8-1ccdcf1e74f2
devices:
  root:
    path: /
    pool: test_remote
    type: disk
ephemeral: false
profiles:
- stor.remote
stateful: false
description: ""

Profile config:

$ incus profile show stor.remote --project test

config: {}
description: ""
devices:
  root:
    path: /
    pool: test_remote
    type: disk
name: stor.remote
used_by:
- /1.0/instances/test-pki?project=test
- /1.0/instances/test-vault?project=test
- /1.0/instances/test-vm?project=test
project: test

Storage config:
$ incus storage show test_remote

config:
  ceph.cluster_name: ceph
  ceph.osd.pg_num: "32"
  ceph.osd.pool_name: test_remote
  ceph.user.name: admin
  volatile.pool.pristine: "true"
description: ""
name: test_remote
driver: ceph
used_by:
- /1.0/images/28fcfdfbd2a76019a1b9db93c4aeeafda091cf6610dc31b1815b48b9e38adcb5
- /1.0/images/2f8163c0b5f6b63cf28d2048991ae055aed222ae5c29d0ce1645aa7bf647c41f
- /1.0/images/65e7ee6e5f32f92f1d2acc0921cab309bce39f2400cce1580267323e66c3f4ac
- /1.0/images/cff1686997a3f1313e910f402992cc6c6b421a7931fa4465e29e68da807785ac
- /1.0/instances/test-pki?project=test
- /1.0/instances/test-vault?project=test
- /1.0/instances/test-vm?project=test
- /1.0/profiles/app.ad-home-dc?project=test
- /1.0/profiles/app.dhcp?project=test
- /1.0/profiles/app.dns?project=test
- /1.0/profiles/app.svk-tun?project=test
- /1.0/profiles/default?project=test
- /1.0/profiles/stor.remote?project=test
status: Created
locations:
- cl-05
- cl-06
- cl-07
- cl-01
- cl-02
- cl-03
- cl-04

The ceph-common package is installed in addition to the microceph snap:
$ apt list --installed ceph*

Listing... Done
ceph-common/noble,now 19.2.0~git20240301.4c76c50-0ubuntu6 arm64 [installed]

The /etc/ceph/ceph.conf file is a symlink to MicroCeph’s config file and is readable by every user:

ls -l /etc/ceph/ceph.conf
lrwxrwxrwx 1 root root 42 Jun 19 23:08 /etc/ceph/ceph.conf -> /var/snap/microceph/current/conf/ceph.conf

/etc/ceph/ceph.conf content:

# # Generated by MicroCeph, DO NOT EDIT.
[global]
run dir = /var/snap/microceph/982/run
fsid = 8aa2cf2d-582d-42f1-aed1-c7d92b945cec
mon host = xxx.xxx.xxx.xx1,xxx.xxx.xxx.xx2,xxx.xxx.xxx.xx3,xxx.xxx.xxx.xx4,xxx.xxx.xxx.xx5,xxx.xxx.xxx.xx6,xxx.xxx.xxx.xx7
public_network = xxx.xxx.xxx.xx7/23
auth allow insecure global id reclaim = false
ms bind ipv4 = true
ms bind ipv6 = false

[client]

Containers have been deployed and run on that storage successfully. However, the VM does not start.

For containers, the handling of ceph.conf is done directly by Incus.

For VMs, it’s QEMU that directly accesses it instead. QEMU runs in a very restrictive sandbox to avoid security issues and I suspect that this sandbox cannot see /var/snap/microceph.

You could check dmesg to see if you’re getting AppArmor DENIED entries for /var/snap/microceph.
If that’s the case, you can work around it with a raw.apparmor rule on the instance itself; something like /var/snap/microceph/** r, would do the trick.
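
For example, something along these lines (a sketch using the instance and project names from this thread; adjust the rule to your paths):

$ sudo dmesg | grep DENIED
$ incus config set test-vm --project test raw.apparmor "/var/snap/microceph/** r,"
$ incus start test-vm --project test --console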

The alternative is to have /etc/ceph contain copies rather than symlinks; this should avoid the issue.

The only two files you really need are /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring.
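
A rough sketch of doing that, assuming the default MicroCeph snap paths shown above:

$ sudo rm /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
$ sudo cp /var/snap/microceph/current/conf/ceph.conf /etc/ceph/
$ sudo cp /var/snap/microceph/current/conf/ceph.client.admin.keyring /etc/ceph/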

You are right, AppArmor policies block access:

[56984.360628] audit: type=1400 audit(1721077764.181:193): apparmor="STATUS" operation="profile_load" profile="unconfined" name="incus-test_test-vm_</var/lib/incus>" pid=27720 comm="apparmor_parser"
[56985.697164] audit: type=1400 audit(1721077765.517:194): apparmor="DENIED" operation="open" class="file" profile="incus-test_test-vm_</var/lib/incus>" name="/var/snap/microceph/982/conf/ceph.conf" pid=27732 comm="qemu-system-aar" requested_mask="r" denied_mask="r" fsuid=993 ouid=0

I will try playing with copied files, but it’s not a good workaround for me.
I need to think about whether I can safely switch from MicroCeph to a full Ceph installation.

Copying ceph.conf and ceph.client.admin.keyring from /var/snap/microceph/current/conf/ to /etc/ceph/ didn’t help. The VM start fails with the same message. However, the AppArmor DENIED message is different:

[ 3367.516392] audit: type=1400 audit(1721091687.993:180): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="incus-test_test-vm_</var/lib/incus>" pid=5356 comm="apparmor_parser"
[ 3368.766962] audit: type=1400 audit(1721091689.244:181): apparmor="DENIED" operation="open" class="file" profile="incus-test_test-vm_</var/lib/incus>" name="/proc/5368/task/5385/comm" pid=5368 comm="qemu-system-aar" requested_mask="wr" denied_mask="wr" fsuid=993 ouid=0

Okay, so /etc/ceph itself is a real directory and both ceph.conf and ceph.client.admin.keyring are real files, with no symlinks anywhere in there?

You may need to trim the ceph.conf; I don’t know if the run dir line specifically is causing issues here.
Most of the content of your ceph.conf is also not relevant for a simple client config.

You only really need:

[global]
fsid = XYZ
mon_host = foo,bar,baz

I commented out all lines except fsid and mon_host in /etc/ceph/ceph.conf (which is a real file now), deleted and initialized the VM again, and restarted the Incus daemon… without success.

Here is the full dmesg log related to starting the VM:

[16201.308017]  rbd4: p1 p2
[16201.308581] rbd: rbd4: capacity 10737418240 features 0x1
[16201.774502] rbd: rbd5: capacity 524288000 features 0x1
[16201.946501] EXT4-fs (rbd5): mounted filesystem f80aa2a7-d938-4a79-a772-22913fa9ba54 r/w with ordered data mode. Quota mode: none.
[16202.057642] audit: type=1400 audit(1721104522.429:193): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="incus-test_test-vm_</var/lib/incus>" pid=8396 comm="apparmor_parser"
[16203.572858] audit: type=1400 audit(1721104523.944:194): apparmor="DENIED" operation="open" class="file" profile="incus-test_test-vm_</var/lib/incus>" name="/proc/8408/task/8425/comm" pid=8408 comm="qemu-system-aar" requested_mask="wr" denied_mask="wr" fsuid=993 ouid=0
[16204.349456] EXT4-fs (rbd5): unmounting filesystem f80aa2a7-d938-4a79-a772-22913fa9ba54.

And the error you get during incus start is still identical?

Sorry, I did not point that out. It’s a more generic error now:

Error: Failed setting up device via monitor: Failed adding block device for disk device "root": Failed adding block device: Monitor is disconnected
Try `incus info --show-log test-vm` for more info

Is it possible to get a more detailed log?

Anything useful in that incus info --show-log test-vm output?

$ incus info --show-log test-vm --project test
Name: test-vm
Status: STOPPED
Type: virtual-machine
Architecture: aarch64
Location: cl-01
Created: 2024/07/15 21:33 PDT
Last Used: 1969/12/31 16:00 PST

Log:

terminate called after throwing an instance of 'std::system_error'
  what():  Permission denied

Can you try setting raw.apparmor to /proc/*/task/*/comm rw, ?

I don’t know why it would need write access to that, or why I’m not running into this on my own systems here, but since you’re seeing a failure for that and then an AppArmor denial, it’s probably worth allowing that operation.

After applying:
$ incus config set test-vm --project test raw.apparmor "/proc/*/task/*/comm rw,"
the VM started.

Can you show the full incus info output? That should give us enough info about the environment to reproduce that AppArmor denial and try to work out a good way to handle it in Incus itself.

OK, I will provide additional details:

  • 7x RPi 4B node cluster (mix of 8 GB and 4 GB)
  • Ubuntu 24.04 (upgraded from 22.04 using do-release-upgrade)
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:        24.04
Codename:       noble
  • Incus 6.3

  • MicroCeph installed via snap: version 18.2.0+snap71f71782c5, revision 982

  • ceph-common/noble,now 19.2.0~git20240301.4c76c50-0ubuntu6 package installed to get access to the Ceph storage provided by MicroCeph.

  • Only Ceph storage is used

  • Only unmanaged networks are used (OS-controlled VLANs and bridges)

  • incus info:

config:
  cluster.https_address: xxx.xxx.xxx.xx1:8443
  cluster.max_voters: "5"
  core.https_address: xxx.xxx.xxx.xx1:8443
  core.metrics_address: xxx.xxx.xxx.xx1:8444
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- network_sriov
- console
- restrict_dev_incus
- migration_pre_copy
- infiniband
- dev_incus_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- dev_incus_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- backup_compression
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- images_all_projects
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
- custom_volume_iso
- network_allocations
- zfs_delegate
- storage_api_remote_volume_snapshot_copy
- operations_get_query_all_projects
- metadata_configuration
- syslog_socket
- event_lifecycle_name_and_project
- instances_nic_limits_priority
- disk_initial_volume_configuration
- operation_wait
- image_restriction_privileged
- cluster_internal_custom_volume_copy
- disk_io_bus
- storage_cephfs_create_missing
- instance_move_config
- ovn_ssl_config
- certificate_description
- disk_io_bus_virtio_blk
- loki_config_instance
- instance_create_start
- clustering_evacuation_stop_options
- boot_host_shutdown_action
- agent_config_drive
- network_state_ovn_lr
- image_template_permissions
- storage_bucket_backup
- storage_lvm_cluster
- shared_custom_block_volumes
- auth_tls_jwt
- oidc_claim
- device_usb_serial
- numa_cpu_balanced
- image_restriction_nesting
- network_integrations
- instance_memory_swap_bytes
- network_bridge_external_create
- network_zones_all_projects
- storage_zfs_vdev
- container_migration_stateful
- profiles_all_projects
- instances_scriptlet_get_instances
- instances_scriptlet_get_cluster_members
- instances_scriptlet_get_project
- network_acl_stateless
- instance_state_started_at
- networks_all_projects
- network_acls_all_projects
- storage_buckets_all_projects
- resources_load
- instance_access
- project_access
- projects_force_delete
- resources_cpu_flags
- disk_io_bus_cache_filesystem
- instance_oci
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: sysadmin
auth_user_method: unix
environment:
  addresses:
  - xxx.xxx.xxx.xx1:8443
  architectures:
  - aarch64
  - armv6l
  - armv7l
  - armv8l
  certificate: |
    -----BEGIN CERTIFICATE-----
    <---------SKIPPED--------->
    -----END CERTIFICATE-----
  certificate_fingerprint: <---------SKIPPED--------->
  driver: lxc | qemu
  driver_version: 6.0.1 | 9.0.1
  firewall: nftables
  kernel: Linux
  kernel_architecture: aarch64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    uevent_injection: "true"
    unpriv_binfmt: "true"
    unpriv_fscaps: "true"
  kernel_version: 6.8.0-1007-raspi
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "24.04"
  project: default
  server: incus
  server_clustered: true
  server_event_mode: full-mesh
  server_name: cl-01
  server_pid: 8126
  server_version: "6.3"
  storage: ceph | cephfs
  storage_version: 19.2.0~git20240301.4c76c50 | 19.2.0~git20240301.4c76c50
  storage_supported_drivers:
  - name: btrfs
    version: 6.6.3
    remote: false
  - name: ceph
    version: 19.2.0~git20240301.4c76c50
    remote: true
  - name: cephfs
    version: 19.2.0~git20240301.4c76c50
    remote: true
  - name: cephobject
    version: 19.2.0~git20240301.4c76c50
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: false
  - name: lvmcluster
    version: 2.03.16(2) (2022-05-18) / 1.02.185 (2022-05-18) / 4.48.0
    remote: true

Okay, we’ll have to see if we can make an apparmor rule that’s PID specific so it doesn’t end up allowing QEMU write access to all processes on the system.

It’s interesting that I’m not seeing this on a 6.9 kernel running the same Incus on arm64, but the Ubuntu 24.04 kernel is known for having a LOT of apparmor patches which have not been pushed upstream and have been causing some issues… So I guess this is another case of that where the Ubuntu kernel is triggering denials which we’re not seeing on mainline.

Hmm, actually, this has got to be an apparmor bug of some kind because we have the following rule already…

  owner @{PROC}/@{pid}/task/@{tid}/comm     rw,

It may be that priv dropping is causing the owner part to fail, so maybe we should relax that a bit.
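
For instance, a relaxed variant might simply drop the owner condition (just a sketch of the idea, not necessarily the final rule Incus would ship):

  @{PROC}/@{pid}/task/@{tid}/comm     rw,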

Hopefully that helps with this issue.


Looks like Ubuntu pushed the AppArmor fixes to the main archive.
Today multiple packages were updated, including incus, incus-base, incus-client, apparmor and libapparmor1. After those fixes I can start the VM without any config adjustment.

P.S.
Setting raw.apparmor to /var/snap/microceph/<microceph-snap-rev>/conf/ceph.conf r, in the VM config allows it to run when /etc/ceph/ceph.conf is linked to the MicroCeph snap’s config file like below:

ls -l /etc/ceph/
total 16
lrwxrwxrwx 1 root root  58 Jul 18 15:54 ceph.client.admin.keyring -> /var/snap/microceph/current/conf/ceph.client.admin.keyring
lrwxrwxrwx 1 root root  42 Jul 18 15:54 ceph.conf -> /var/snap/microceph/current/conf/ceph.conf
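
For example, using the snap revision seen earlier in this thread (982). AppArmor mediates on the resolved path, so the rule has to name the real target under /var/snap, and it needs updating whenever the snap revision changes:

$ incus config set test-vm --project test raw.apparmor "/var/snap/microceph/982/conf/ceph.conf r,"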