(Snap) LXD cannot Resize default BTRFS storage pool

cesar-ayuuk · November 30, 2020, 9:46pm

This is the output of nsenter --mount=/run/snapd/ns/lxd.mnt /snap/lxd/current/bin/btrfs scrub start /var/snap/lxd/common/lxd/storage-pools/default

WARNING: cannot create scrub data file, mkdir /var/lib/btrfs failed: Read-only file system. Status recording disabled
WARNING: failed to open the progress status socket at /var/lib/btrfs/scrub.progress.37ab66ba-6522-43ce-adcc-024792370708: No such file or directory. Progress cannot be queried
scrub started on /var/snap/lxd/common/lxd/storage-pools/default, fsid 37ab66ba-6522-43ce-adcc-024792370708 (pid=15417)

stgraber · November 30, 2020, 9:54pm

Ok, so the entire filesystem went offline and is now read-only due to this issue.
The easiest approach at this stage is most likely to stop LXD and any related processes, unmap the loop device, then run btrfsck against the loop file to figure out what’s going on.

cesar-ayuuk · November 30, 2020, 10:10pm

I stopped LXD, killed the related processes, and unmapped the loop device.
Then I ran:
btrfsck /dev/loop15
Opening filesystem to check...
ERROR: cannot open device '/dev/loop15': Device or resource busy
ERROR: cannot open file system

It still says device or resource busy.

stgraber · November 30, 2020, 10:11pm

Run grep loop15 /proc/*/mountinfo to see if anything still uses it somewhere.
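For anyone who wants to script this check, here is a minimal Python sketch of what the grep does: it looks for a device name in the mount table of every process. The function names `find_mount_users` and `read_all_mountinfo` are hypothetical. The sketch only scans numeric PID directories, whereas the shell glob also matches entries such as /proc/self.

```python
from pathlib import Path


def find_mount_users(mountinfo_by_pid, needle):
    """Return {pid: [matching lines]} for every mountinfo text containing needle."""
    hits = {}
    for pid, text in mountinfo_by_pid.items():
        lines = [line for line in text.splitlines() if needle in line]
        if lines:
            hits[pid] = lines
    return hits


def read_all_mountinfo(proc_root="/proc"):
    """Collect the mountinfo text of every process that is still readable."""
    tables = {}
    for path in Path(proc_root).glob("[0-9]*/mountinfo"):
        try:
            tables[path.parent.name] = path.read_text()
        except OSError:
            # The process exited or cannot be read; skip it.
            continue
    return tables
```

Calling `find_mount_users(read_all_mountinfo(), "loop15")` as root returns an empty dict when no mount namespace still references the device, which corresponds to grep printing nothing.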

cesar-ayuuk · November 30, 2020, 10:13pm

No output after that.
root@hostname:~# grep loop15 /proc/*/mountinfo
root@hostname:~#

stgraber · November 30, 2020, 10:14pm

Okay, then you probably want to use the big hammer: run snap disable lxd, reboot the system, and then run btrfsck while the kernel is clean and nothing has tried to mount or use the btrfs volume on that boot.

cesar-ayuuk · November 30, 2020, 10:19pm

I ran btrfsck /dev/loop15
Opening filesystem to check...
ERROR: could not check mount status: No such device or address

stgraber · November 30, 2020, 10:20pm

Yep, that’s normal, the loop device isn’t set up yet.

Does btrfsck /var/snap/lxd/common/lxd/disks/default.img work?

cesar-ayuuk · November 30, 2020, 10:21pm

That worked!

btrfsck /var/snap/lxd/common/lxd/disks/default.img
Opening filesystem to check...
Checking filesystem on /var/snap/lxd/common/lxd/disks/default.img
UUID: 37ab66ba-6522-43ce-adcc-024792370708
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 55864983552
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 59086209024
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 80561045504
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 98413187072 bytes used, no error found
total csum bytes: 95433292
total tree bytes: 607059968
total fs tree bytes: 475201536
total extent tree bytes: 23707648
btree space waste bytes: 92449963
file data blocks allocated: 128041459712
referenced 114545954816
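The only complaints in this check are about the free space cache, and the run still ends with "no error found". As a rough sketch, the affected block groups can be pulled out of the output like this. The function name `failed_cache_block_groups` is hypothetical, and the regular expression simply mirrors the message text shown above.

```python
import re

# Matches the message printed for each block group whose cache failed to load.
_FAILED_CACHE = re.compile(
    r"failed to load free space cache for block group (\d+)"
)


def failed_cache_block_groups(check_output):
    """Return the block group offsets whose free space cache failed to load."""
    return [int(m.group(1)) for m in _FAILED_CACHE.finditer(check_output)]


sample = """\
[3/7] checking free space cache
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 55864983552
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 59086209024
btrfs: csum mismatch on free space cache
failed to load free space cache for block group 80561045504
[4/7] checking fs roots
"""

print(failed_cache_block_groups(sample))
```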

stgraber · November 30, 2020, 10:23pm

Ok, so you’ll want to run it again as btrfsck --repair /var/snap/lxd/common/lxd/disks/default.img

cesar-ayuuk · November 30, 2020, 10:24pm

btrfsck --repair /var/snap/lxd/common/lxd/disks/default.img

enabling repair mode
WARNING:

Do not use --repair unless you are advised to do so by a developer
or an experienced user, and then only after having accepted that no
fsck can successfully repair all types of filesystem corruption. Eg.
some software or hardware bugs can fatally damage a volume.
The operation will start in 10 seconds.
Use Ctrl-C to stop it.

10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /var/snap/lxd/common/lxd/disks/default.img
UUID: 37ab66ba-6522-43ce-adcc-024792370708
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 98413187072 bytes used, no error found
total csum bytes: 95433292
total tree bytes: 607059968
total fs tree bytes: 475201536
total extent tree bytes: 23707648
btree space waste bytes: 92449963
file data blocks allocated: 128041459712
referenced 114545954816

stgraber · November 30, 2020, 10:25pm

Ok, that’s encouraging.

Try snap enable lxd and lxc info to see if LXD comes back up.

cesar-ayuuk · November 30, 2020, 10:28pm

Successfully enabled!

lxc info
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.0.126:8443
  - 10.0.3.1:8443
  - 10.112.7.1:8443
  - '[fd42:c2bb:440f:2b97::1]:8443'
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICADCCAYagAwIBAgIQDvQAvOYhJ4lS3iW/1Mga3zAKBggqhkjOPQQDAzAzMRww
    GgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMRMwEQYDVQQDDApyb290QGNoaWNv
    MB4XDTIwMTExMTE2NDUzMFoXDTMwMTEwOTE2NDUzMFowMzEcMBoGA1UEChMTbGlu
    dXhjb250YWluZXJzLm9yZzETMBEGA1UEAwwKcm9vdEBjaGljbzB2MBAGByqGSM49
    AgEGBSuBBAAiA2IABG2XBRUONBDaXlhGzoA7802xlZEY2z8hzx/XeRyOywxbaItb
    f8iKu3Ixvfx0TS0t/6BcaivfQOzcwumZYkX796yp5AopRQtUVuxfjlYyYCOxayud
    Qc+WPp4YqIgxVpeY9qNfMF0wDgYDVR0PAQH/BAQDAgWgMBMGA1UdJQQMMAoGCCsG
    AQUFBwMBMAwGA1UdEwEB/wQCMAAwKAYDVR0RBCEwH4IFY2hpY2+HBH8AAAGHEAAA
    AAAAAAAAAAAAAAAAAAEwCgYIKoZIzj0EAwMDaAAwZQIxAJFr209OzEGrzfYCuafw
    veeWUpfx8pn+sfLBB4+tfA/b25hKctTbtEfMaaWznHlagQIwe4uMLDbxz4Ll2Cet
    s9WwjyXaISptN1ryD54IaZBMihgZQVaNvAePj5+YkTnYuvtk
    -----END CERTIFICATE-----
  certificate_fingerprint: 04090c253d2e71917cfdbfdaf9a36d2702276d4fe2c38362d5f841d2f267e626
  driver: lxc
  driver_version: 4.0.5
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.4.0-54-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: chico
  server_pid: 7689
  server_version: "4.8"
  storage: btrfs
  storage_version: 4.15.1

stgraber · November 30, 2020, 10:29pm

Ok, so far so good. You can now see if your container feels like starting.

stgraber · November 30, 2020, 10:30pm

If this blows up again, then we’ll need to disable+reboot again, do the btrfsck repair again, manually mount the pool and run a full scrub this time.
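To keep the order of that fallback straight, here is a small Python sketch that only lists the commands as a checklist and executes nothing. The function name `recovery_plan` is hypothetical. Mounting the pool from its backing image file on /mnt is an assumption, based on the image path that btrfsck accepted earlier in this thread. The steps after the reboot obviously have to be run by hand on the next boot.

```python
# Backing file of the default pool, as used with btrfsck earlier in this thread.
IMAGE = "/var/snap/lxd/common/lxd/disks/default.img"


def recovery_plan(image=IMAGE, mountpoint="/mnt"):
    """Return the fallback steps, in order, as argv lists (nothing is executed)."""
    return [
        ["snap", "disable", "lxd"],               # keep LXD from touching the pool
        ["reboot"],                               # start from a clean kernel
        ["btrfsck", "--repair", image],           # repair the unmounted image
        ["mount", image, mountpoint],             # assumed: mount the image file
        ["btrfs", "scrub", "start", mountpoint],  # full scrub of the mounted pool
    ]


for argv in recovery_plan():
    print(" ".join(argv))
```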

cesar-ayuuk · November 30, 2020, 10:43pm

It did not work. I’m trying to mount the storage pool manually with:

mount /var/snap/lxd/common/lxd/storage-pools/default /mnt
mount: /mnt: /var/snap/lxd/common/lxd/storage-pools/default is not a block device

stgraber · November 30, 2020, 10:44pm

mount /var/snap/lxd/common/lxd/disks/default.img /mnt

That will mount the pool on /mnt, at which point you can run btrfs scrub start /mnt.

cesar-ayuuk · November 30, 2020, 10:46pm

btrfs scrub start /mnt
scrub started on /mnt, fsid 37ab66ba-6522-43ce-adcc-024792370708 (pid=7570).

That worked. Should I enable LXD now?

stgraber · November 30, 2020, 10:46pm

Now you need to monitor it with btrfs scrub status /mnt.

cesar-ayuuk · November 30, 2020, 10:47pm

The output:

btrfs scrub status /mnt
UUID: 37ab66ba-6522-43ce-adcc-024792370708
Scrub started: Mon Nov 30 17:45:15 2020
Status: running
Duration: 0:01:40
Time left: 0:00:32
ETA: Mon Nov 30 17:47:28 2020
Total to scrub: 92.22GiB
Bytes scrubbed: 69.49GiB
Rate: 711.58MiB/s
Error summary: csum=15787872
Corrected: 0
Uncorrectable: 15787872
Unverified: 0
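For watching the scrub without reading the whole report each time, here is a minimal Python sketch that turns the status text into a dict and pulls out the uncorrectable count. The function names `parse_scrub_status` and `uncorrectable_errors` are hypothetical, and the parser assumes the "Key: value" layout shown in the output above, which may differ between btrfs-progs versions.

```python
def parse_scrub_status(text):
    """Parse the 'Key: value' lines of scrub status output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps and durations stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def uncorrectable_errors(fields):
    """Return the uncorrectable error count, or 0 if the field is absent."""
    return int(fields.get("Uncorrectable", "0"))


sample = """\
UUID: 37ab66ba-6522-43ce-adcc-024792370708
Scrub started: Mon Nov 30 17:45:15 2020
Status: running
Duration: 0:01:40
Total to scrub: 92.22GiB
Bytes scrubbed: 69.49GiB
Error summary: csum=15787872
Corrected: 0
Uncorrectable: 15787872
Unverified: 0
"""

fields = parse_scrub_status(sample)
print(fields["Status"], uncorrectable_errors(fields))  # running 15787872
```

In the report above, every checksum error found so far is counted as uncorrectable and none as corrected.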