Incus cluster, a volume exists on 2 nodes

Hi,

Last Sunday, the LINSTOR pool managed by Incus crashed after a sudden disk pool saturation.
To restore the services, we deleted all the volumes’ snapshots.
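
(Essentially with commands of this shape, run per resource; the exact names varied:)

linstor snapshot list
linstor snapshot delete <resource-name> <snapshot-name>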

While we were cleaning all the mess, we saw that one volume existed on 2 satellites.

This was preventing Incus from generating snapshots for this volume, so I removed it after a node reboot.

But the volume is actually in a bad state inside LINSTOR, and I don’t know what to do to get it back to a healthy status.

I tried many things, and this is the situation now:

root@node4:/# linstor resource-definition list-properties incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
╭────────────────────────────────────────────────────────────────────────────────────────╮
┊ Key                                    ┊ Value                                         ┊
╞════════════════════════════════════════════════════════════════════════════════════════╡
┊ Aux/Incus/content-type                 ┊ filesystem                                    ┊
┊ Aux/Incus/name                         ┊ incus-volume-webserver-otypo-prod             ┊
┊ Aux/Incus/snapshot-name/test-otypo-0   ┊ incus-volume-8c0d39a8059b4547881a7824930e7295 ┊
┊ Aux/Incus/type                         ┊ containers                                    ┊
┊ DrbdOptions/Net/allow-two-primaries    ┊ yes                                           ┊
┊ DrbdOptions/Net/ping-timeout           ┊ 10                                            ┊
┊ DrbdOptions/Resource/quorum            ┊ majority                                      ┊
┊ DrbdOptions/auto-add-quorum-tiebreaker ┊ true                                          ┊
┊ DrbdOptions/auto-verify-alg            ┊ sha256                                        ┊
┊ DrbdPrimarySetOn                       ┊ NODE4                                         ┊
┊ Internal/Drbd/QuorumSetBy              ┊ user                                          ┊
┊ cloned-from                            ┊ incus-volume-d4926874a385431f9657b894c43b91c2 ┊
╰────────────────────────────────────────────────────────────────────────────────────────╯
root@node4:/# drbdadm status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 --verbose --statistics
drbdsetup status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 --verbose --statistics
incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 node-id:0 role:Secondary suspended:no force-io-failures:no
    write-ordering:flush
  volume:0 minor:1031 disk:Inconsistent backing_dev:/dev/zvol/nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000 quorum:yes open:no
      size:244140628 read:758107489 written:1811295512 al-writes:288109 bm-writes:112402 upper-pending:0 lower-pending:0 al-suspended:no blocked:no
  node5.nakweb.agency node-id:2 connection:Connected role:Primary tls:no congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:SyncTarget peer-disk:UpToDate done:100.00 resync-suspended:no
        received:85377336 sent:0 out-of-sync:172 pending:0 unacked:0 dbdt1:0.00 eta:nan
  node6.nakweb.agency node-id:1 connection:StandAlone role:Unknown tls:no congested:no ap-in-flight:0 rs-in-flight:0
    volume:0 replication:Off peer-disk:DUnknown resync-suspended:dependency
        received:0 sent:0 out-of-sync:0 pending:0 unacked:0

The “out-of-sync:172” has been sitting there for about 24 hours now. The `received:` value is growing, slowly.
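
(To watch it live, polling the status plus DRBD’s event stream does the job:)

watch -n 5 'drbdsetup status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 --statistics'
drbdsetup events2 incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5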

Also, the ext4 fs on this volume has been repaired with fsck, and many things were corrected or deleted.

And, of course, a very critical app is running on the instance associated with the volume.

Is there anything that can be done on the Incus side? Or is it better to ask on the LINBIT forum?
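
(If it helps, the volume can also be inspected from the Incus side; “linstor-pool” stands in for the actual pool name here:)

incus storage volume show linstor-pool container/webserver-otypo-prod
incus storage volume info linstor-pool container/webserver-otypo-prod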

Isn’t that supposed to be the case for all your volumes, as per your place count policy? Or do you mean mounted (“available” in LINSTOR’s vocabulary) on two satellites?

How so? Which messages did you get?

What did you do to remove it?

root@node4:/# linstor v l --resources incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource                                      ┊ Node  ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊               State ┊ Repl                       ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 ┊ node4 ┊ nvme_pool            ┊     0 ┊    1031 ┊ /dev/drbd1031 ┊ 59.34 GiB ┊ Unused ┊ SyncTarget(100.00%) ┊ node5: SyncTarget(100.00%) ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 ┊ node5 ┊ nvme_pool            ┊     0 ┊    1031 ┊ /dev/drbd1031 ┊ 59.34 GiB ┊ InUse  ┊            UpToDate ┊ node4: SyncSource          ┊
┊                                               ┊       ┊                      ┊       ┊         ┊               ┊           ┊        ┊                     ┊ node6: Established         ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 ┊ node6 ┊ DfltDisklessStorPool ┊     0 ┊    1031 ┊ /dev/drbd1031 ┊           ┊ Unused ┊          TieBreaker ┊ Established(1)             ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

This is the state now ^

The error message during automatic snapshotting was:

time="2025-12-16T00:59:20+01:00" level=error msg="Error creating snapshot" err="Create instance snapshot: Could not create resource snapshot: Message: 'New snapshot 'incus-volume-d1707ad91dfe477abee8993730e9b98b' of resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' registered.'; Details: 'Snapshot 'incus-volume-d1707ad91dfe477abee8993730e9b98b' of resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' UUID is: 09d084d8-43be-4a0e-bc85-88b696b8aa3b' next error: Message: '(node4) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Suspended IO of '[incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5]' on 'node4' for snapshot' next error: Message: '(node5) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Suspended IO of '[incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5]' on 'node5' for snapshot' next error: Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000007]' next error: Message: '(node4) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Aborted snapshot and resumed IO of 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' on 'node4'' next error: Message: '(node5) Resource 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' [DRBD] adjusted.' next error: Message: 'Aborted snapshot and resumed IO of 'incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5' on 'node5'' next error: Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000008]'" instance=webserver-otypo-prod project=default snapshot=snap0

To remove incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000 from node6, I did:

root@node6:/# linstor resource delete node6 incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5

root@node6:/# zfs destroy -f nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000
cannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy

root@node6:/# wipefs -a /dev/zd496
wipefs: error: /dev/zd496: probing initialization failed: Device or resource busy

root@node6:/# zfs destroy -f nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000
cannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy

root@node6:/# rm /sys/block/zd496/holders/drbd1031
rm: cannot remove '/sys/block/zd496/holders/drbd1031': Operation not permitted

And then I rebooted. The zvol was removed automatically, but there was this:

root@node6:/# drbdadm status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 role:Secondary
  disk:Diskless open:no
  node4.nakweb.agency connection:StandAlone
  node5.nakweb.agency role:Primary
    peer-disk:UpToDate

So, I did:

root@node6:/# drbdadm secondary incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5

root@node6:/# drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5

root@node6:/# drbdadm connect --discard-my-data incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
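
(The --discard-my-data flag is DRBD’s split-brain resolution step: the connecting node throws away its own version of the data and resyncs from the peer.)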

And now, I’m here :melting_face:

Looks fine to me.

The important part is

Message: '(node6) Failed to delete zfs volume'; Details: 'Command 'zfs destroy nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000' returned with exitcode 1. \n\nStandard out: \n\n\nError message: \ncannot destroy 'nvme_pool/incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5_00000': dataset is busy\n\n'; Reports: '[693F56FA-C4E60-000007]'

So the real error is that the dataset was busy, not necessarily a LINSTOR thing, but the investigation should have started here.
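
For a “dataset is busy” zvol, the first place to look is the kernel rather than userspace: `lsof` only sees processes holding the device open, while a block-layer holder (such as DRBD attached to the zvol) shows up in sysfs. Roughly, with the zd496 device from your output:

ls -l /sys/block/zd496/holders/
fuser -v /dev/zd496

A drbd1031 entry under holders/ means the DRBD device still had the backing disk attached, which is exactly why zfs destroy kept failing.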

Not sure what this was intended to solve, but it’s generally a bad idea to manually tinker with Incus-managed LINSTOR resources.

There, same error as what LINSTOR reported when trying to snapshot.

That looks fine to me, what’s the problem?

Again, what are you intending to solve / do with that?

Looks fine to me.

If I compare with another volume, this is not looking fine, at least because it shows “SyncTarget(100.00%)”. To compare with a clean one:

root@node4:/# linstor v l --resources incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Resource                                      ┊ Node  ┊ StoragePool          ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊  Allocated ┊ InUse  ┊      State ┊ Repl           ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f ┊ node4 ┊ nvme_pool            ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 204.35 GiB ┊ Unused ┊   UpToDate ┊ Established(2) ┊
┊ incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f ┊ node5 ┊ nvme_pool            ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊ 204.35 GiB ┊ InUse  ┊   UpToDate ┊ Established(2) ┊
┊ incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f ┊ node6 ┊ DfltDisklessStorPool ┊     0 ┊    1001 ┊ /dev/drbd1001 ┊            ┊ Unused ┊ TieBreaker ┊ Established(2) ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

So the real error is that the dataset was busy, not necessarily a LINSTOR thing, but the investigation should have started here

Yes, I did some research, asked an AI, and “we” concluded that it was a bug, because neither DRBD, nor ZFS, nor LINSTOR was doing anything on the volume. As I remember, even `lsof` was not showing anything.

That looks fine to me, what’s the problem?

The problem is the StandAlone state, which I couldn’t see on other, clean volumes:

root@node5:/# drbdadm status incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f
incus-volume-c88dcd8ac6fb415fb5f5ebabd9d8722f role:Primary
  disk:UpToDate open:yes
  node4 role:Secondary
    peer-disk:UpToDate
  node6 role:Secondary
    peer-disk:Diskless

Again, what are you intending to solve / do with that?

I wanted to “change” the StandAlone state of node4 to a clean Secondary state.

So, my colleague just revived the thing with:

On node6:

drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5 
drbdadm connect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5

On node4:

drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm down incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm up incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
drbdadm connect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
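
(The down/up pair tears the DRBD device down completely and recreates it from its config and on-disk metadata, which is what cleared the stuck StandAlone connection.)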

Thank you @bensmrs for your time.

You’ll probably want to cut and restart the sync between node4 and node5.
On node4, you can run:

drbdadm disconnect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5:node5
drbdadm connect incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5:node5

That should reset and restart the sync.
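
Once both sides reconnect, the resync should run to completion and the peers settle back to UpToDate/Established; the same commands as before will confirm it:

drbdadm status incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5
linstor v l --resources incus-volume-912ccf5d8a6d4868bc3c2885a958c4f5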

So, all good with that?

Yes, everything is right, thank you again :slight_smile: