Incus copy failed, likely due to race condition

chatziko · November 14, 2025, 8:16am

Hello,

I have a script that runs every 10 minutes. It first creates a snapshot and then runs incus copy --refresh. Occasionally it fails with the following error:

Nov 14 03:05:14 Error: Failed migration on source: Failed to run: btrfs property set -f -ts /var/lib/incus/storage-pools/default/containers-snapshots/docker-pi-living/auto-20251113-030513 ro true: exit status 1 (ERROR: Could not open: No such file or directory)

I think (I can't be sure) the problem is that, because the task runs on a fixed schedule, the snapshot expiration time falls very close to an invocation of the script. In the example above the snapshot was created exactly one day before the error and had a 1-day expiration time. This creates a race condition between incus copy and snapshot deletion: incus copy sees the snapshot, but by the time it tries to run btrfs property set, the snapshot is already gone.
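To make the timing concrete, here is a small sketch of the arithmetic. It assumes the snapshot name auto-20251113-030513 encodes the creation time and that runs land on a regular 10-minute grid relative to it. The function name expiry_collides and the 30-second window are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def expiry_collides(expiry, run_interval, window):
    """Return True if a snapshot's expiry lands within `window` of a scheduled run.

    Assumes runs happen at creation time + k * run_interval (a simplification).
    """
    remainder = expiry % run_interval
    # Distance from the expiry moment to the nearest run on the grid.
    distance = min(remainder, run_interval - remainder)
    return distance <= window

# Timestamps taken from the log line and the snapshot name above.
created = datetime(2025, 11, 13, 3, 5, 13)
expires = created + timedelta(days=1)          # 1-day expiration
error_logged = datetime(2025, 11, 14, 3, 5, 14)
gap = error_logged - expires                   # 1 second

# A 1-day expiry is an exact multiple of 10 minutes, so it always collides.
collides = expiry_collides(timedelta(days=1), timedelta(minutes=10), timedelta(seconds=30))
# Shifting the expiry by 5 minutes moves it to the midpoint between two runs.
shifted = expiry_collides(timedelta(days=1, minutes=5), timedelta(minutes=10), timedelta(seconds=30))
```

Under these assumptions, any expiration time that is a whole multiple of the run interval puts the deletion right next to a scheduled copy.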

I’ve changed the expiration time to avoid clashing with future executions, but a clean solution would be to lock snapshots while copy is running (or take into account sudden deletions).

Am I interpreting this correctly? Should I open an issue?

Kostas

stgraber · November 14, 2025, 5:42pm

Ah, that’s interesting, so you’re saying that running the incus copy --refresh just a few seconds later then worked fine?

If that’s the case, please indeed raise an issue and we’ll be looking at adding some extra locking in the storage driver to try and prevent this situation.

I think we’d probably want to block volume and snapshot deletion or renaming while a migration is ongoing. Snapshot creation should generally be safe, as should most other operations. We basically want to make sure we don’t overly lock things up, as migrations can sometimes be very, very slow.

chatziko · November 14, 2025, 7:30pm

The next execution was 10 minutes later and worked fine. My guess is that running it a few seconds later would also have worked.
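If a retry a few seconds later does work, a script could paper over the failure in the meantime with a generic retry wrapper. The sketch below is only that: run_with_retry, its parameters, and the 5-second delay are hypothetical, and the exact incus copy --refresh arguments are left to the caller. It hides the race rather than fixing it.

```python
import subprocess
import time

def run_with_retry(cmd, attempts=3, delay=5.0, runner=subprocess.run, sleep=time.sleep):
    """Run `cmd` (a list of arguments) until it exits with status 0.

    Returns the number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. `runner` and `sleep` are
    injectable so the function can be exercised without running anything.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give the snapshot deletion time to finish before trying again.
            sleep(delay)
    raise RuntimeError(
        f"command failed after {attempts} attempts: {result.stderr.strip()}"
    )
```

The caller would pass the same argument list the script already uses for the copy step.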

As a workaround, I came up with a hack to “disable” snapshot deletion during migrations: by convention I add 1 year to all expiration times (e.g. 368 days instead of 3). That way Incus won't delete anything, and I use my own deletion script, which I run before the migrations.
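The selection logic of such a deletion script could look roughly like the sketch below. It assumes "1 year" means 365 days (consistent with 368 instead of 3) and that the snapshot names and padded expiry times have already been collected by some other means. The names PADDING, real_expiry and select_expired are hypothetical, and the actual listing and deletion of snapshots is deliberately left out.

```python
from datetime import datetime, timedelta

# The extra year added to every expiration time by convention (assumed 365 days).
PADDING = timedelta(days=365)

def real_expiry(padded_expires_at):
    """Recover the intended expiry from the padded one stored on the snapshot."""
    return padded_expires_at - PADDING

def select_expired(snapshots, now):
    """Return names of snapshots whose intended expiry has passed.

    `snapshots` is an iterable of (name, padded_expires_at) pairs.
    """
    return [name for name, padded in snapshots if real_expiry(padded) <= now]

# Example: both snapshots were meant to live 3 days but carry a 368-day expiry.
now = datetime(2025, 11, 14, 3, 0, 0)
snapshots = [
    ("auto-old", datetime(2025, 11, 10, 3, 5, 13) + timedelta(days=368)),
    ("auto-new", datetime(2025, 11, 13, 3, 5, 13) + timedelta(days=368)),
]
to_delete = select_expired(snapshots, now)
```

Running the selection and the deletions before the copy step, in the same script, means the two can no longer overlap.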

So far with my workaround I haven’t seen the error, which reinforces my hypothesis about the race condition.

Sounds quite reasonable to me, I’ll open an issue.

By the way, it would also be useful to have a configuration option to disable automatic deletions (which is what the hack above achieves), so that users can schedule them whenever convenient. And maybe a command to run the deletion manually, say incus snapshot delete-expired.