(Snap) LXD cannot Resize default BTRFS storage pool

This is the output of nsenter --mount=/run/snapd/ns/lxd.mnt /snap/lxd/current/bin/btrfs scrub start /var/snap/lxd/common/lxd/storage-pools/default

WARNING: cannot create scrub data file, mkdir /var/lib/btrfs failed: Read-only file system. Status recording disabled
WARNING: failed to open the progress status socket at /var/lib/btrfs/scrub.progress.37ab66ba-6522-43ce-adcc-024792370708: No such file or directory. Progress cannot be queried
scrub started on /var/snap/lxd/common/lxd/storage-pools/default, fsid 37ab66ba-6522-43ce-adcc-024792370708 (pid=15417)

Ok, so the entire filesystem went offline and is now read-only because of this issue.
The easiest approach at this stage is most likely to stop LXD and any related processes, unmap the loop device, then run btrfsck against the loop file to figure out what’s going on.

I stopped LXD, killed the related processes, and unmapped the loop device.
Then I ran:
btrfsck /dev/loop15
Opening filesystem to check...
ERROR: cannot open device '/dev/loop15': Device or resource busy
ERROR: cannot open file system

It still says device or resource busy.

Run grep loop15 /proc/*/mountinfo to see if anything still uses it somewhere.

No output after that.
root@hostname:~# grep loop15 /proc/*/mountinfo
root@hostname:~#
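
An empty result from that grep means no process's mount table lists the device any more, although the kernel itself can still be holding it. Below is a minimal sketch of the same check as a reusable function that prints PIDs rather than raw mountinfo lines. The name loop_users and the optional proc_root argument are invented here for illustration; the argument exists only so the function can be pointed at a test directory.

```shell
# Print the PID of every process whose mount namespace still lists the
# given device name, e.g. loop15. proc_root defaults to /proc.
loop_users() {
    dev=$1
    proc_root=${2:-/proc}
    for f in "$proc_root"/[0-9]*/mountinfo; do
        [ -r "$f" ] || continue
        # -w avoids matching loop150 when asked about loop15
        if grep -qw -- "$dev" "$f" 2>/dev/null; then
            pid=${f%/mountinfo}
            echo "${pid##*/}"
        fi
    done
}
```

Calling loop_users loop15 as root and getting no output corresponds to the empty grep result shown above.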

Okay, then you probably want to use the big hammer: run snap disable lxd, reboot the system, and then run btrfsck on a clean kernel, before anything on that boot has tried to mount or use the btrfs volume.

I ran btrfsck /dev/loop15:
Opening filesystem to check...
ERROR: could not check mount status: No such device or address

Yep, that’s normal, the loop device isn’t set up yet.

Does btrfsck /var/snap/lxd/common/lxd/disks/default.img work?

That worked!

btrfsck /var/snap/lxd/common/lxd/disks/default.img
Opening filesystem to check...
Checking filesystem on /var/snap/lxd/common/lxd/disks/default.img
UUID: 37ab66ba-6522-43ce-adcc-024792370708
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 55864983552
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 59086209024
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 80561045504
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 98413187072 bytes used, no error found
total csum bytes: 95433292
total tree bytes: 607059968
total fs tree bytes: 475201536
total extent tree bytes: 23707648
btree space waste bytes: 92449963
file data blocks allocated: 128041459712
referenced 114545954816

Ok, so you’ll want to run it again as btrfsck --repair /var/snap/lxd/common/lxd/disks/default.img

btrfsck --repair /var/snap/lxd/common/lxd/disks/default.img

enabling repair mode
WARNING:

Do not use --repair unless you are advised to do so by a developer
or an experienced user, and then only after having accepted that no
fsck can successfully repair all types of filesystem corruption. Eg.
some software or hardware bugs can fatally damage a volume.
The operation will start in 10 seconds.
Use Ctrl-C to stop it.

10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /var/snap/lxd/common/lxd/disks/default.img
UUID: 37ab66ba-6522-43ce-adcc-024792370708
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 98413187072 bytes used, no error found
total csum bytes: 95433292
total tree bytes: 607059968
total fs tree bytes: 475201536
total extent tree bytes: 23707648
btree space waste bytes: 92449963
file data blocks allocated: 128041459712
referenced 114545954816

Ok, that’s encouraging.

Try snap enable lxd and lxc info to see if LXD comes back up.

Successfully enabled!

lxc info
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.0.126:8443
  - 10.0.3.1:8443
  - 10.112.7.1:8443
  - '[fd42:c2bb:440f:2b97::1]:8443'
  architectures:
  - x86_64
  - i686
  certificate: |
      -----BEGIN CERTIFICATE-----
      MIICADCCAYagAwIBAgIQDvQAvOYhJ4lS3iW/1Mga3zAKBggqhkjOPQQDAzAzMRww
      GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRMwEQYDVQQDDApyb290QGNoaWNv
      MB4XDTIwMTExMTE2NDUzMFoXDTMwMTEwOTE2NDUzMFowMzEcMBoGA1UEChMTbGlu
      dXhjb250YWluZXJzLm9yZzETMBEGA1UEAwwKcm9vdEBjaGljbzB2MBAGByqGSM49
      AgEGBSuBBAAiA2IABG2XBRUONBDaXlhGzoA7802xlZEY2z8hzx/XeRyOywxbaItb
      f8iKu3Ixvfx0TS0t/6BcaivfQOzcwumZYkX796yp5AopRQtUVuxfjlYyYCOxayud
      Qc+WPp4YqIgxVpeY9qNfMF0wDgYDVR0PAQH/BAQDAgWgMBMGA1UdJQQMMAoGCCsG
      AQUFBwMBMAwGA1UdEwEB/wQCMAAwKAYDVR0RBCEwH4IFY2hpY2+HBH8AAAGHEAAA
      AAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIxAJFr209OzEGrzfYCuafw
      veeWUpfx8pn+sfLBB4+tfA/b25hKctTbtEfMaaWznHlagQIwe4uMLDbxz4Ll2Cet
      s9WwjyXaISptN1ryD54IaZBMihgZQVaNvAePj5+YkTnYuvtk
      -----END CERTIFICATE-----
  certificate_fingerprint: 04090c253d2e71917cfdbfdaf9a36d2702276d4fe2c38362d5f841d2f267e626
  driver: lxc
  driver_version: 4.0.5
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-54-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: chico
  server_pid: 7689
  server_version: "4.8"
  storage: btrfs
  storage_version: 4.15.1

Ok, so far so good, you can see if your container feels like starting now.

If this blows up again, then we’ll need to disable+reboot again, do the btrfsck repair again, manually mount the pool and run a full scrub this time.
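
The fallback sequence described here can be sketched as a small dry-run script. It only uses commands that already appear in this thread; the helper names run, before_reboot and after_reboot, and the DRY_RUN switch, are made up for illustration. By default nothing is executed and each command is only printed, so the plan can be reviewed first.

```shell
IMG=/var/snap/lxd/common/lxd/disks/default.img
MNT=/mnt

# Echo each command; execute it only when DRY_RUN=0 (default is a dry run).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

# Step 1, before rebooting: keep LXD from starting on the next boot.
before_reboot() {
    run snap disable lxd
}

# Step 2, after rebooting: repair the image, mount it, start a full scrub.
# The chain stops at the first command that fails.
after_reboot() {
    run btrfsck --repair "$IMG" &&
    run mount "$IMG" "$MNT" &&
    run btrfs scrub start "$MNT"
}
```

Running after_reboot prints the three commands in order; setting DRY_RUN=0 as root would actually execute them.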

It did not work. I’m trying to mount the storage pool manually with:

mount /var/snap/lxd/common/lxd/storage-pools/default /mnt
mount: /mnt: /var/snap/lxd/common/lxd/storage-pools/default is not a block device

mount /var/snap/lxd/common/lxd/disks/default.img /mnt

That will mount the pool on /mnt at which point you can run btrfs scrub start /mnt
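
The first attempt failed because storage-pools/default is the directory where LXD mounts the pool, not the device or image that holds the filesystem. As a rough sketch, a helper like the one below (mount_source_kind is a hypothetical name, not part of any tool) shows what kind of path is about to be handed to mount as its source.

```shell
# Classify a path the way it matters for "mount SOURCE TARGET".
mount_source_kind() {
    if [ -b "$1" ]; then
        echo "block device"
    elif [ -f "$1" ]; then
        echo "regular file"
    elif [ -d "$1" ]; then
        echo "directory"
    else
        echo "missing or other"
    fi
}
```

On the system in this thread the pool path would presumably come back as a directory and default.img as a regular file, which mount can attach to a loop device on its own.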

btrfs scrub start /mnt
scrub started on /mnt, fsid 37ab66ba-6522-43ce-adcc-024792370708 (pid=7570).

That worked. Should I enable LXD now?

Now, you need to monitor it with btrfs scrub status /mnt

The output:

btrfs scrub status /mnt
UUID: 37ab66ba-6522-43ce-adcc-024792370708
Scrub started: Mon Nov 30 17:45:15 2020
Status: running
Duration: 0:01:40
Time left: 0:00:32
ETA: Mon Nov 30 17:47:28 2020
Total to scrub: 92.22GiB
Bytes scrubbed: 69.49GiB
Rate: 711.58MiB/s
Error summary: csum=15787872
Corrected: 0
Uncorrectable: 15787872
Unverified: 0
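
The line to watch in this output is Uncorrectable. Here it equals the csum error count while Corrected stays at 0, which would be consistent with a single-device pool that keeps only one copy of the data, so scrub has nothing to repair from; that reading is an assumption about this pool, not something the output states. A minimal sketch for pulling the number out of the status text follows; the function name scrub_uncorrectable is invented for illustration.

```shell
# Read "btrfs scrub status" output on stdin and print the Uncorrectable count.
# Returns non-zero if the field is not present in the input.
scrub_uncorrectable() {
    awk -F': *' '
        /^[[:space:]]*Uncorrectable:/ { print $2 + 0; found = 1 }
        END { exit !found }
    '
}
```

Usage would be btrfs scrub status /mnt | scrub_uncorrectable, which for the output above prints 15787872.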