How do I remove a degraded_device?

GlitchedAxiom · April 3, 2026, 2:47pm

One of my drives recently failed completely. I replaced it, but now I can’t remove the degraded_device.

I am running the pool as zfs-raid1. Here is the storage configuration:

config:
  scrub_schedule: 0 4 * * 0
state:
  drives:
    - id: /dev/disk/by-id/nvme-KINGSTON_SNVS1000G_50026B7685313920
      model_family: ''
      model_name: KINGSTON SNVS1000G
      serial_number: 50026B7685313920
      bus: nvme
      capacity_in_bytes: 1000204886016
      boot: true
      multipath: false
      removable: false
      remote: false
      smart:
        enabled: true
        passed: true
        power_on_hours: 29216
        data_units_read: 101877488
        data_units_written: 60105895
        available_spare: 100
        percentage_used: 12
      member_pool: local
    - id: /dev/disk/by-id/nvme-WD_BLACK_SN8100_1000GB_25413E800328
      model_family: ''
      model_name: WD_BLACK SN8100 1000GB
      serial_number: '25413E800328'
      bus: nvme
      capacity_in_bytes: 1000204886016
      boot: false
      multipath: false
      removable: false
      remote: false
      smart:
        enabled: true
        passed: true
        power_on_hours: 46
        data_units_read: 7410
        data_units_written: 580290
        available_spare: 100
      member_pool: local
  pools:
    - name: local
      type: zfs-raid1
      devices:
        - /dev/disk/by-id/nvme-KINGSTON_SNVS1000G_50026B7685313920-part11
        - /dev/disk/by-id/nvme-WD_BLACK_SN8100_1000GB_25413E800328
      state: DEGRADED
      last_scrub:
        state: FINISHED
        start_time: '2026-04-01T21:07:12Z'
        end_time: '2026-04-01T21:10:31Z'
        progress: 100.00%
        errors: 0
      encryption_key_status: available
      devices_degraded:
        - /dev/disk/by-id/nvme-KINGSTON_SNVS1000G_50026B7685313913-part11
      raw_pool_size_in_bytes: 962072674304
      usable_pool_size_in_bytes: 962072674304
      pool_allocated_space_in_bytes: 202256719872
      volumes:
        - name: incus
          usage_in_bytes: 202175729664
          quota_in_bytes: 0
          use: incus

I couldn’t find a way to remove the degraded device. Is there one? As long as that device is listed, the pool state stays DEGRADED.

stgraber · April 3, 2026, 3:22pm

@gibmat

gibmat · April 3, 2026, 10:17pm

@GlitchedAxiom can you share some additional details about how you replaced the failed drive:

Was the failed drive the main (root) drive, necessitating a re-install of IncusOS, or was it a regular data drive?
What command(s) did you use to replace the drive?
Did you follow the tutorial for expanding your “local” storage pool (Expanding the “local” storage pool - IncusOS documentation)?

For the “local” pool, I’m confused why the WD Black device is listed without -part11 in your pool state. I’ve just re-run through the tutorial referenced above and both my pool devices have the expected -part11 ending. The “local” pool is a bit special, and if you’ve managed to somehow get to your current state there’s probably some edge case that IncusOS needs to properly handle so the old, degraded device is automatically cleaned up.

GlitchedAxiom · April 3, 2026, 11:07pm

The failed drive wasn’t the main drive. Back in February, I expanded the pool to a raid1 by adding the now-degraded KINGSTON_SNVS1000G_50026B7685313913. No reinstall was needed, since the main drive (the other, working Kingston drive) is fine!

I used incus admin os system storage edit with:

config:
  scrub_schedule: 0 4 * * 0
  pools:
    - name: local
      type: zfs-raid1
      devices:
        - /dev/disk/by-id/nvme-KINGSTON_SNVS1000G_50026B7685313920-part11
        - /dev/disk/by-id/nvme-WD_BLACK_SN8100_1000GB_25413E800328

I used the linked tutorial back when I installed this host. For the replacement, I just ran this single edit command with a completely new WD Black. After that, the scrub started and finished successfully, as seen in the state.

I could try to reproduce this on a virtual IncusOS tomorrow and write down all the steps I took.

GlitchedAxiom · April 4, 2026, 9:11am

@gibmat I recreated this in a VM environment: https://www.youtube.com/watch?v=7uN7hSeF3bk

Expanded the pool to a raid1 with a second drive
Removed the second drive to simulate a complete drive failure → degraded pool
Attached a new drive and added it to the pool
The pool stays degraded because the old device cannot be removed

(The actual hardware does have stable IDs; the problem is the same as in this test environment.)

My guess is that the -part11 suffix is only generated on ONLINE pools, not on DEGRADED ones.

Hope that helps in figuring out what I did.
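
A quick way to spot which pool members are listed without the suffix, as a rough shell sketch. The helper name is made up for illustration, and the two paths are the ones from the pool state posted above:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a device path ends in the -part11 suffix.
has_part11_suffix() {
    case "$1" in
        *-part11) return 0 ;;
        *)        return 1 ;;
    esac
}

# Device paths copied from the "devices" list in the pool state.
for dev in \
    /dev/disk/by-id/nvme-KINGSTON_SNVS1000G_50026B7685313920-part11 \
    /dev/disk/by-id/nvme-WD_BLACK_SN8100_1000GB_25413E800328
do
    if has_part11_suffix "$dev"; then
        echo "partition member:  $dev"
    else
        echo "whole-disk member: $dev"
    fi
done
```

This only classifies the names; it says nothing about why the suffix is missing.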

gibmat · April 7, 2026, 7:27pm

Got it figured out – IncusOS wasn’t properly handling degraded devices that were physically missing, and was attempting to extend the existing storage pool instead of replacing the degraded device. I’ve got a fix here: Handle missing degraded storage devices by gibmat · Pull Request #1025 · lxc/incus-os · GitHub. Once that lands in an IncusOS release, this particular issue shouldn’t happen again.

To fix up your existing IncusOS system, the easiest way will be to temporarily boot a live Linux environment with ZFS support; you’ll likely need to disable SecureBoot temporarily first, but don’t wipe the SecureBoot keys. Then, you’ll want to remove the second drive from the local zpool by hand. Because the zpool is already a RAID1 mirror, it should be able to tolerate the removal without data loss.

zpool offline local nvme-WD_BLACK_SN8100_1000GB_25413E800328
zpool detach local nvme-WD_BLACK_SN8100_1000GB_25413E800328
zpool status local
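
For anyone who wants to rehearse these steps before touching the pool, here is a rough sketch that only prints the commands by default. The `zpool import` and `zpool export` lines are an assumption about what a live environment typically needs and are not part of the steps above; `run` and `remove_replacement` are made-up helper names.

```shell
#!/bin/sh
# Sketch only: prints each command unless DRY_RUN=0 is set.
POOL=local
NEW_DEV=nvme-WD_BLACK_SN8100_1000GB_25413E800328

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remove_replacement() {
    # Assumption: the pool must be imported in the live environment first.
    # -N skips mounting datasets; -f forces import of a pool that was
    # last used by another system.
    run zpool import -N -f "$POOL" || return 1
    # The three commands from the post above.
    run zpool offline "$POOL" "$NEW_DEV" || return 1
    run zpool detach "$POOL" "$NEW_DEV" || return 1
    run zpool status "$POOL" || return 1
    # Assumption: export the pool again before rebooting into IncusOS.
    run zpool export "$POOL"
}

remove_replacement
```

Running it as-is lists the planned commands; setting `DRY_RUN=0` would execute them, stopping at the first failure.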

After confirming ZFS isn’t warning about data loss, wipe the replacement drive.

sgdisk -Z /dev/disk/by-id/nvme-WD_BLACK_SN8100_1000GB_25413E800328
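
A more cautious variant of the wipe step, as a sketch: it refuses to wipe while the replacement drive still shows up in `zpool status`, and only prints the `sgdisk` command unless `DRY_RUN=0` is set. The helper names are hypothetical, and matching on the device name in the status text is a simplification.

```shell
#!/bin/sh
POOL=local
NEW_DEV=nvme-WD_BLACK_SN8100_1000GB_25413E800328

# Hypothetical helper: reads `zpool status` text on stdin and succeeds
# if the given device name still appears in it.
still_in_pool() {
    grep -qF -- "$1"
}

# Wipe the replacement drive only once it is no longer listed in the pool.
wipe_if_detached() {
    # Abort if the status cannot be read, rather than assuming it is safe.
    status=$(zpool status "$POOL") || return 1
    if printf '%s\n' "$status" | still_in_pool "$NEW_DEV"; then
        echo "refusing to wipe: $NEW_DEV is still listed in pool $POOL" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ sgdisk -Z /dev/disk/by-id/$NEW_DEV"
    else
        sgdisk -Z "/dev/disk/by-id/$NEW_DEV"
    fi
}
```

Calling `wipe_if_detached` after the detach step would then either print the wipe command or explain why it was skipped.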

Re-enable SecureBoot and boot back into IncusOS. You should then essentially be back to being ready to replace the failed drive with the new one. Wait for an IncusOS release that includes the fix, apply the update, reboot, and then replace your failed drive. Hopefully it all works as expected.

gibmat · April 8, 2026, 7:27pm

Stable release 202604080235 of IncusOS contains the fix for handling missing degraded devices.
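
To check whether a given system is new enough before retrying the drive replacement, a comparison along these lines may help. It is a sketch that assumes release identifiers are plain numbers that increase over time; the helper name is made up, and how you obtain the installed release identifier is left to you.

```shell
#!/bin/sh
# First stable release that contains the fix, per the post above.
FIXED_RELEASE=202604080235

# Hypothetical helper: succeeds if the given release identifier is at or
# after FIXED_RELEASE. Returns 2 for anything that is not a plain number.
has_degraded_device_fix() {
    case "$1" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$1" -ge "$FIXED_RELEASE" ]
}

# Example with the fixed release itself.
if has_degraded_device_fix 202604080235; then
    echo "release includes the fix"
fi
```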

GlitchedAxiom · April 11, 2026, 6:05pm

Thank you for this really fast fix. Just followed your guide - works now!