gqdc
(Olivier M.)
December 18, 2025, 6:54pm
1
Hi,
Last Sunday, the Linstor pool managed by Incus crashed after the disk pool suddenly filled up.
To restore the services, we deleted all of the volumes’ snapshots.
While we were cleaning up the mess, we saw that one volume existed on 2 satellites.
This was preventing Incus from generating snapshots for this volume, so I removed it after a node reboot.
But the volume is now in a bad state inside linstor, and I don’t know what to do to get it back to a healthy status.
I tried many things, and now this is the situation:
root@node4:/# linstor resource-definition list-properties incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
╭────────────────────────────────────────────────────────────────────────────────────────╮
┊ Key ┊ Value ┊
╞════════════════════════════════════════════════════════════════════════════════════════╡
┊ Aux/Incus/content-type ┊ filesystem ┊
┊ Aux/Incus/name ┊ incus-volume-webserver-otypo-prod ┊
┊ Aux/Incus/snapshot-name/test-otypo-0 ┊ incus-volume-8c0d39a8059b4547881a7824930e7295 ┊
┊ Aux/Incus/type ┊ containers ┊
┊ DrbdOptions/Net/allow-two-primaries ┊ yes ┊
┊ DrbdOptions/Net/ping-timeout ┊ 10 ┊
┊ DrbdOptions/Resource/quorum ┊ majority ┊
┊ DrbdOptions/auto-add-quorum-tiebreaker ┊ true ┊
┊ DrbdOptions/auto-verify-alg ┊ sha256 ┊
┊ DrbdPrimarySetOn ┊ NODE4 ┊
┊ Internal/Drbd/QuorumSetBy ┊ user ┊
┊ cloned-from ┊ incus-volume-d4926874a385431f9657b894c43b91c2 ┊
╰────────────────────────────────────────────────────────────────────────────────────────╯
root@node4:/# drbdadm status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 --verbose --statistics
drbdsetup status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 --verbose --statistics
incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 node-id:0 role:Secondary suspended:no force-io-failures:no
write-ordering:flush
volume:0 minor:1031 disk:Inconsistent backing_dev:/dev/zvol/nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000 quorum:yes open:no
size:244140628 read:758107489 written:1811295512 al-writes:288109 bm-writes:112402 upper-pending:0 lower-pending:0 al-suspended:no blocked:no
node5.nakweb.agency node-id:2 connection:Connected role:Primary tls:no congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:SyncTarget peer-disk:UpToDate done:100.00 resync-suspended:no
received:85377336 sent:0 out-of-sync:172 pending:0 unacked:0 dbdt1:0.00 eta:nan
node6.nakweb.agency node-id:1 connection:StandAlone role:Unknown tls:no congested:no ap-in-flight:0 rs-in-flight:0
volume:0 replication:Off peer-disk:DUnknown resync-suspended:dependency
received:0 sent:0 out-of-sync:0 pending:0 unacked:0
The `out-of-sync:172` value has been stuck for about 24 hours now. The `received:` value is growing, slowly.
Also, the ext4 filesystem on this volume has been repaired with fsck, and many things were corrected or deleted.
And, of course, a very critical app is running on the instance associated with this volume.
Is there anything that can be done on the Incus side? Or is it better to ask on the LINBIT forum?
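As an aside, the stuck counter can be tracked without eyeballing the whole status dump. A minimal sketch, run here against the statistics line captured above so it is self-contained; in real use you would pipe `drbdsetup status <resource> --statistics` into it instead of the `sample` variable:

```shell
# Extract the out-of-sync counter from a DRBD statistics line.
# `sample` is the line captured in the status output above; real usage would
# feed live `drbdsetup status <resource> --statistics` output instead.
sample='received:85377336 sent:0 out-of-sync:172 pending:0 unacked:0'
printf '%s\n' "$sample" | grep -o 'out-of-sync:[0-9]*' | cut -d: -f2
# → 172
```

A value that stays above 0 across repeated checks, as here, means the resync is not actually completing.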
bensmrs
(Benjamin Somers)
December 18, 2025, 8:23pm
2
Isn’t that supposed to be the case for all your volumes, as per your place count policy? Or do you mean mounted (“available” in LINSTOR’s vocabulary) on two satellites?
How so? Which messages did you get?
What did you do to remove it?
gqdc
(Olivier M.)
December 18, 2025, 9:24pm
3
root@node4:/# linstor v l --resources incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource ┊ Node ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊ Repl ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 ┊ node4 ┊ nvme_pool ┊ 0 ┊ 1031 ┊ /dev/drbd1031 ┊ 59.34 GiB ┊ Unused ┊ SyncTarget(100.00%) ┊ node5: SyncTarget(100.00%) ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 ┊ node5 ┊ nvme_pool ┊ 0 ┊ 1031 ┊ /dev/drbd1031 ┊ 59.34 GiB ┊ InUse ┊ UpToDate ┊ node4: SyncSource ┊
┊ ┊ ┊ ┊ ┊ ┊ ┊ ┊ ┊ ┊ node6: Established ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 ┊ node6 ┊ DfltDisklessStorPool ┊ 0 ┊ 1031 ┊ /dev/drbd1031 ┊ ┊ Unused ┊ TieBreaker ┊ Established(1) ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
This is the state now ^
The error message during automatic snapshotting was:
time="2025-12-16T00:59:20+01:00" level=error msg="Error creating snapshot" err="Create instance snapshot: Could not create resource snapshot: Message: 'New snapshot 'incus-volume-d1707ad91dfe477abee8993730e9b98b' of resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' registered.'; Details: 'Snapshot 'incus-volume-d1707ad91dfe477abee8993730e9b98b' of resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' UUID is: 09d084d8-43be-4a0e-bc85-88b696b8aa3b' next error: Message: '(node4) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Suspended IO of '[incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5]' on 'node4' for snapshot' next error: Message: '(node5) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Suspended IO of '[incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5]' on 'node5' for snapshot' next error: Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000007]' next error: Message: '(node4) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Aborted snapshot and resumed IO of 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' on 'node4'' next error: Message: '(node5) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Aborted snapshot and resumed IO of 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' on 'node5'' next error: Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. 
\n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000008]'" instance=webserver-otypo-prod project=default snapshot=snap0
To remove incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000 from node6, I did:
root@node6:/# linstor resource delete node6 incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
root@node6:/# zfs destroy -f nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000
cannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy
root@node6:/# wipefs -a /dev/zd496
wipefs: error: /dev/zd496: probing initialization failed: Device or resource busy
root@node6:/# zfs destroy -f nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000
cannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy
root@node6:/# rm /sys/block/zd496/holders/drbd1031
rm: cannot remove '/sys/block/zd496/holders/drbd1031': Operation not permitted
And then I rebooted. The zvol was removed automatically, but there was this:
root@node6:/# drbdadm status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 role:Secondary
disk:Diskless open:no
node4.nakweb.agency connection:StandAlone
node5.nakweb.agency role:Primary
peer-disk:UpToDate
So, I did:
root@node6:/# drbdadm secondary incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
root@node6:/# drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
root@node6:/# drbdadm connect --discard-my-data incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
And now, I’m here
bensmrs
(Benjamin Somers)
December 19, 2025, 7:21am
4
Looks fine to me.
gqdc:
The error message while automatic snapshoting was :
time="2025-12-16T00:59:20+01:00" level=error msg="Error creating snapshot" err="Create instance snapshot: Could not create resource snapshot: Message: 'New snapshot 'incus-volume-d1707ad91dfe477abee8993730e9b98b' of resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' registered.'; Details: 'Snapshot 'incus-volume-d1707ad91dfe477abee8993730e9b98b' of resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' UUID is: 09d084d8-43be-4a0e-bc85-88b696b8aa3b' next error: Message: '(node4) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Suspended IO of '[incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5]' on 'node4' for snapshot' next error: Message: '(node5) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Suspended IO of '[incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5]' on 'node5' for snapshot' next error: Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000007]' next error: Message: '(node4) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Aborted snapshot and resumed IO of 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' on 'node4'' next error: Message: '(node5) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Aborted snapshot and resumed IO of 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' on 'node5'' next error: Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. 
\n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000008]'" instance=webserver-otypo-prod project=default snapshot=snap0
The important part is
Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000007]'
So the real error is that the dataset was busy, not necessarily a LINSTOR thing, but the investigation should have started here.
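For future readers: when ZFS reports `dataset is busy` and `lsof` shows nothing, the claim is usually at the kernel block layer (here, DRBD holding the zvol) rather than a userspace open. A hedged sketch of where to look; the device name `zd496` is taken from this thread and is an assumption, so substitute your own:

```shell
# Find what keeps a zvol busy. zd496 is the device from this thread (an
# assumption; map your zvol with `ls -l /dev/zvol/<pool>/<dataset>`).
dev=zd496
# Block-layer holders: a drbdNNNN entry here means DRBD still claims the zvol,
# and only bringing the DRBD resource down releases that claim.
ls "/sys/block/$dev/holders" 2>/dev/null || echo "no block device $dev on this host"
```

This also explains why the later `rm /sys/block/zd496/holders/drbd1031` attempt could not succeed: the `holders` directory is a read-only kernel view, and the entry disappears on its own once the DRBD resource is taken down.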
Not sure what this was intended to solve, but it’s generally a bad idea to manually tinker with Incus-managed LINSTOR resources.
There, same error as what LINSTOR reported when trying to snapshot.
gqdc:
And then, rebooted. The zvol was removed automatically, but there was this :
root@node6:/# drbdadm status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 role:Secondary
disk:Diskless open:no
node4.nakweb.agency connection:StandAlone
node5.nakweb.agency role:Primary
peer-disk:UpToDate
That looks fine to me, what’s the problem?
Again, what are you intending to solve / do with that?
gqdc
(Olivier M.)
December 19, 2025, 9:02am
5
Looks fine to me.
If I compare with another volume, this is not looking fine, at least because it shows “SyncTarget(100.00%)”. To compare with a clean one:
root@node4:/# linstor v l --resources incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource ┊ Node ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊ Repl ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f ┊ node4 ┊ nvme_pool ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ 204.35 GiB ┊ Unused ┊ UpToDate ┊ Established(2) ┊
┊ incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f ┊ node5 ┊ nvme_pool ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ 204.35 GiB ┊ InUse ┊ UpToDate ┊ Established(2) ┊
┊ incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f ┊ node6 ┊ DfltDisklessStorPool ┊ 0 ┊ 1001 ┊ /dev/drbd1001 ┊ ┊ Unused ┊ TieBreaker ┊ Established(2) ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
So the real error is that the dataset was busy, not necessarily a LINSTOR thing, but the investigation should have started here
Yes, I did some research, asked an AI, and “we” concluded that it was a bug, because neither DRBD, nor ZFS, nor LINSTOR was doing anything with the volume. As I remember, even `lsof` was not showing anything.
That looks fine to me, what’s the problem?
The problem is the StandAlone state, which I couldn’t see on other, clean volumes:
root@node5:/# drbdadm status incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f
incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f role:Primary
disk:UpToDate open:yes
node4 role:Secondary
peer-disk:UpToDate
node6 role:Secondary
peer-disk:Diskless
Again, what are you intending to solve / do with that?
I wanted to “change” the StandAlone state of node4 to a clean Secondary state.
gqdc
(Olivier M.)
December 19, 2025, 9:45am
6
So, my colleague just revived the thing with:
on node6:
drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm connect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
on node4:
drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm down incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm up incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm connect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
Thank you @bensmrs for your time.
bensmrs
(Benjamin Somers)
December 19, 2025, 9:46am
7
You’ll probably want to cut and restart the sync between node4 and node5.
On node4, you can run:
drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5:node5
drbdadm connect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5:node5
That should reset and restart the sync.
gqdc
(Olivier M.)
December 19, 2025, 9:52am
9
Yes, everything is back to normal, thank you again!