Cannot delete a snapshot after its container has been erased

Clem · May 21, 2019, 11:52am

Hello,

I’m trying to delete the default storage pool so that I can use another disk, separate from /var, for my servers (a cluster of two servers called server and server2). The first problem is that some volumes are still present on the storage, although all containers and images have been erased.

% lxc storage delete default
Error: storage pool "default" has volumes attached to it

Indeed, something went wrong while deleting all containers and their snapshots

% lxc storage volume list default
+----------------------+---------------------+-------------+---------+----------+
|         TYPE         |        NAME         | DESCRIPTION | USED BY | LOCATION |
+----------------------+---------------------+-------------+---------+----------+
| container (snapshot) | ntp-backup/working  |             | 1       | server   |
+----------------------+---------------------+-------------+---------+----------+
| container (snapshot) | template/2019051201 |             | 1       | server2  |
+----------------------+---------------------+-------------+---------+----------+

But I cannot delete these snapshots.

% lxc storage volume delete default ntp-backup/working
Error: No such object
% lxc storage volume delete default template/2019051201
Error: No such object

I’m using lxd 3.13

% lxc --version
3.13

Back-end storage is BTRFS.
Thank you in advance for your help.

gpatel-fr · May 21, 2019, 12:34pm

It happened to me once, and I used the btrfs tool to delete the phantom snapshot.
You should be able to see your snapshots with btrfs using

sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs subvolume list /var/snap/lxd/common/lxd/storage-pools/default

It works for me at least. I’m not going to run more advanced tests of this kind on my own disk, but given that you want to get rid of the pool anyway, I think that trying out ‘subvolume delete’ should do what you want.

Clem · May 21, 2019, 2:03pm

Strange thing

root@server:/var/snap/lxd/common/lxd/storage-pools/default# btrfs subvolume delete containers-snapshots/ntp-backup
ERROR: not a subvolume: containers-snapshots/ntp-backup

Although it is listed as a subvolume

% sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs subvolume list /var/snap/lxd/common/lxd/storage-pools/default
ID 420 gen 2712846 top level 5 path snap/lxd/common/lxd/storage-pools/default
ID 421 gen 2712846 top level 420 path containers
ID 422 gen 2712961 top level 420 path containers-snapshots
ID 423 gen 2712846 top level 420 path images
ID 424 gen 2712846 top level 420 path custom
ID 425 gen 2712846 top level 420 path custom-snapshots
ID 510 gen 2643133 top level 422 path containers-snapshots/ntp-backup/working
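
For longer listings, the leftover snapshot subvolumes can be picked out mechanically. This is only a minimal sketch: the function name `list_snapshot_subvolumes` is made up for illustration, and it assumes the path is the last field of each line and contains no spaces.

```shell
# Hypothetical helper: read `btrfs subvolume list` output on stdin and print
# only the paths nested under containers-snapshots/.
# Assumption: the path is the last whitespace-separated field on each line.
list_snapshot_subvolumes() {
    awk '$NF ~ /^containers-snapshots\/./ { print $NF }'
}

# Two sample lines copied from the listing in this post.
printf '%s\n' \
    'ID 422 gen 2712961 top level 420 path containers-snapshots' \
    'ID 510 gen 2643133 top level 422 path containers-snapshots/ntp-backup/working' |
    list_snapshot_subvolumes
```

The `containers-snapshots` directory itself is skipped because the pattern requires at least one character after the slash.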

And the subvolume is in readonly mode

% sudo btrfs subvolume show /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
        Name:                   working
        UUID:                   6e3f2cbf-14fc-2742-8ff5-a24ce137966a
        Parent UUID:            a2159e66-f131-6c46-907f-7958cce550a2
        Received UUID:          -
        Creation time:          2019-05-15 09:37:15 +0200
        Subvolume ID:           510
        Generation:             2643133
        Gen at creation:        2643133
        Parent ID:              422
        Top level ID:           422
        Flags:                  readonly

I guess something terribly wrong happened when some snapshots were deleted.

gpatel-fr · May 21, 2019, 2:29pm

No, the problem is that I said ‘btrfs subvolume delete’ and assumed that you would take the command I gave you and replace list with delete. Instead you ran btrfs subvolume delete directly and omitted the nsenter part. The nsenter part is essential with the snap, since in this case the storage is mounted only in the mount namespace of the lxd process, not in that of your user process. So rerun btrfs subvolume delete with the whole nsenter incantation and it should work better.
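
To avoid retyping the prefix, the full command line can be assembled by a small function. This is a rough sketch rather than anything shipped with LXD: the name `lxd_btrfs_cmd` is invented here, 12345 is a placeholder pid, and the function only prints the command so that it can be checked before being run by hand with sudo.

```shell
# Hypothetical helper: print the nsenter command line that would run the
# snap-bundled btrfs inside the mount namespace of the process with pid $1.
# It prints only; nothing is executed. Arguments are not quoted, so this is
# meant for review, not for paths containing spaces.
lxd_btrfs_cmd() {
    pid="$1"
    shift
    printf 'nsenter -t %s -m -- /snap/lxd/current/bin/btrfs' "$pid"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# 12345 is a placeholder; on a real system use "$(pgrep daemon.start)".
lxd_btrfs_cmd 12345 subvolume list /var/snap/lxd/common/lxd/storage-pools/default
```

Swapping `list` for `delete` and appending the subvolume path gives the delete command in the same form.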

Clem · May 21, 2019, 2:36pm

Oki, sorry, I didn’t understand. Unfortunately, it is still not working

% sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs subvolume delete /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
Delete subvolume (no-commit): '/var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working'
ERROR: cannot delete '/var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working': Operation not permitted

gpatel-fr · May 21, 2019, 2:46pm

I’m not sure what is happening here. I’d have expected that by running sudo nsenter you would inherit the root powers of the lxd process. Maybe try deleting ntp-backup directly? Or even adding another sudo before the btrfs command?

Clem · May 21, 2019, 2:55pm

Neither works.

The system doesn’t really like adding the sudo ^^

% sudo nsenter -t $(pgrep daemon.start) -m -- sudo /snap/lxd/current/bin/btrfs subvolume delete /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
[sudo] password for clement:
sudo: unable to stat /etc/sudoers: No such file or directory
sudo: no valid sudoers sources found, quitting
sudo: unable to initialize policy plugin

Removing data directly returns a bunch of error messages

# rm -r /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working/
rm: cannot remove '/var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working/backup.yaml': Read-only file system

gpatel-fr · May 21, 2019, 3:14pm

So much for sudo. But I wasn’t thinking of rm; I meant using btrfs subvolume delete on containers-snapshots/ntp-backup.

Clem · May 21, 2019, 3:26pm

Still no luck

% sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs subvolume delete /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup
ERROR: not a subvolume: /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup

gpatel-fr · May 21, 2019, 3:55pm

Got it, I think.
sudo nsenter -t $(pgrep daemon.start) -m -- ls -l /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup
should show you the snapshot with a ‘+’ displayed, meaning that it has an ACL set. I think that using getfacl and setfacl -b should lead you to the light (do not forget to use nsenter).

Clem · May 21, 2019, 4:11pm

I’m not sure I understand, but here are the results

% sudo nsenter -t $(pgrep daemon.start) -m -- ls -l /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup
total 0
drwx--x--x 1 root root 78 May 15 00:25 working

I don’t see anything unusual. I don’t know the two other tools

% sudo nsenter -t $(pgrep daemon.start) -m -- getfacl /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup
getfacl: Removing leading '/' from absolute path names
# file: var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup
# owner: root
# group: root
user::rwx
group::--x
other::--x

% sudo nsenter -t $(pgrep daemon.start) -m -- getfacl /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
getfacl: Removing leading '/' from absolute path names
# file: var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
# owner: root
# group: root
user::rwx
group::--x
other::--x

The setfacl -b command didn’t return any error, and getfacl returned the same output on both directories afterwards. I still cannot remove the subvolume.

gpatel-fr · May 21, 2019, 5:38pm

Baffling. Can you try

sudo nsenter -t $(pgrep daemon.start) -m -- chmod g+rw,o+rw /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working

Does it work? If yes, is btrfs subvolume delete still returning a permission error? If yes, the storage is probably read-only and the permission error is a misleading message.
Maybe try btrfs scrub then (still with nsenter, of course).
Or possibly restart lxd with sudo snap restart lxd. Maybe it’s as simple as that (it would be a bug, of course). Try this first.

Clem · May 22, 2019, 7:52am

The first command didn’t work

% sudo nsenter -t $(pgrep daemon.start) -m -- chmod g+rw,o+rw /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
chmod: changing permissions of '/var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working': Read-only file system

btrfs scrub didn’t return any error

% sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs scrub start -B /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
WARNING: cannot create scrub data file, mkdir /var/lib/btrfs failed: Read-only file system. Status recording disabled
WARNING: failed to open the progress status socket at /var/lib/btrfs/scrub.progress.de67eea8-b6fc-40c8-bc0e-55f293f9577e: No such file or directory. Progress cannot be queried
scrub done for de67eea8-b6fc-40c8-bc0e-55f293f9577e
        scrub started at Wed May 22 09:31:01 2019 and finished after 00:00:30
        total bytes scrubbed: 4.05GiB with 0 errors

So, I restarted lxd

% sudo snap restart lxd
Restarted.

I tried to delete the volume again from lxd, without success

% lxc storage volume delete default ntp-backup/working
Error: No such object

However, btrfs commands are working again

% sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs subvolume delete /var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working
Delete subvolume (no-commit): '/var/snap/lxd/common/lxd/storage-pools/default/containers-snapshots/ntp-backup/working'

BUT, it is still listed in storage

% lxc storage volume list default
+----------------------+---------------------+-------------+---------+----------+
|         TYPE         |        NAME         | DESCRIPTION | USED BY | LOCATION |
+----------------------+---------------------+-------------+---------+----------+
| container (snapshot) | ntp-backup/working  |             | 1       | server   |
+----------------------+---------------------+-------------+---------+----------+
| container (snapshot) | template/2019051201 |             | 1       | server2  |
+----------------------+---------------------+-------------+---------+----------+

I did the same on server2 for the other snapshot. The subvolumes are indeed gone, but they are still listed in the lxd storage. So I cannot delete the default storage pool

% lxc storage delete default
Error: storage pool "default" has volumes attached to it

gpatel-fr · May 22, 2019, 8:32am

Did you try restarting lxd again after deleting the snapshots with btrfs? Maybe lxd needs to be informed that you deleted stuff.

Clem · May 22, 2019, 8:37am

I tried but it doesn’t update the storage status. I also tried to stop lxd on both servers at the same time, then start again, but it’s not working either.

gpatel-fr · May 22, 2019, 9:13am

so if you run

sudo nsenter -t $(pgrep daemon.start) -m -- /snap/lxd/current/bin/btrfs subvolume list /var/snap/lxd/common/lxd/storage-pools/default

you do not see your snapshots anymore, but they can still be seen with lxc storage volume list default?

Clem · May 22, 2019, 9:17am

yes, exactly

gpatel-fr · May 22, 2019, 9:40am

Oh, yuck. I’m pretty sure that it worked for me.
At this point, I am at a loss for rational answers. Maybe restart the computers? Or try being a bit Conan-the-Barbarian with lxc storage edit default?

Clem · May 22, 2019, 12:01pm

Conan-the-Barbarian it is. I exported all my containers, removed lxd on both servers and removed all subvolumes and /var/snap/lxd folders. I also removed the partition corresponding to my second storage. I reinstalled lxd without defining a storage and added my own afterward. It is ok now. I don’t know what happened.

gpatel-fr · May 22, 2019, 1:48pm

For the record, looking at the LXD code, I think that what you did was not bad but not sufficient:

        // Delete the mountpoint.
        if shared.PathExists(customSubvolumeName) {
                err = os.Remove(customSubvolumeName)
                if err != nil {
                        return err
                }
        }

        err = s.s.Cluster.StoragePoolVolumeDelete(
                "default",
                s.volume.Name,
                storagePoolVolumeTypeCustom,
                s.poolID)
        if err != nil {
                logger.Errorf(`Failed to delete database entry for BTRFS storage volume "%s" on storage pool "%s"`, s.volume.Name, s.pool.Name)
        }

Deleting the mountpoint and the sqlite database entry (when using a cluster, as is the case for you) was necessary as well. It is more difficult than I thought; maybe it worked for me because I’m not using clusters.