I’ve been running lxd for some time, and now it has suddenly ground to a halt. It behaves more broken than working, although it still responds.
My setup:
- lxd from snap, version 4.0.7 stable.
- zfs backend on a raid block device.
zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
lxdhosts   556G   539G  17.4G        -         -    63%    96%  1.00x    ONLINE  -
(Note I have 17.4G free)
I have rebooted the server, all containers are STOPPED, and I can communicate with lxd.
Stuff that I see:
lxc list
It works and lists my containers just fine.
lxc launch
Gets stuck
root@iceberg:~# lxc launch ubuntu:18.04 u1
Creating u1
Retrieving image: Unpack: 100% (773.23MB/s)
… but the container is listed as “STOPPED” with lxc list.
lxc shell
Gets stuck, but after a long wait (3-4 minutes) I seem to be in it after some Ctrl-C…
I’m really stuck here and I hoped not to have to nuke my whole server to get back from this.
I do note that my “iowait” gets very high at times, and that has previously made me reduce the number of containers on the host. But it’s really not that many.
lxc list --format csv
caspians-dator,RUNNING,CONTAINER,0
juju-75f241-0,STOPPED,CONTAINER,0
juju-75f241-1,STOPPED,CONTAINER,0
juju-530f52-0,STOPPED,CONTAINER,0
juju-554e9d-4,STOPPED,CONTAINER,0
juju-554e9d-5,STOPPED,CONTAINER,0
juju-554e9d-6,STOPPED,CONTAINER,0
juju-a00094-0,STOPPED,CONTAINER,0
juju-f4cf5f-4,STOPPED,CONTAINER,0
juju-f4cf5f-5,STOPPED,CONTAINER,0
juju-fe5353-0,STOPPED,CONTAINER,0
I’m suspecting my zfs is the culprit, but since I do have space left and my containers aren’t even up - I can’t see why this should be the issue. But how can I test?
Any advice or help here is welcome.
I’m currently running “zpool trim; zpool scrub” to see if this helps, but I’m not sure this is the right way to go about it. The problem was there even before I ran trim and scrub on the zpool.
zpool status
pool: lxdhosts
state: ONLINE
scan: scrub in progress since Thu Sep 2 12:14:33 2021
539G scanned at 589M/s, 117G issued at 128M/s, 539G total
0B repaired, 21.76% done, 0 days 00:56:04 to go
config:
NAME        STATE     READ WRITE CKSUM
lxdhosts    ONLINE       0     0     0
  sdb       ONLINE       0     0     0
errors: No known data errors
UPDATE #1: 2021-09-01T22:00:00Z
So, I managed to figure out which containers allocated the most disk via:
zfs list | grep containers
In my case, the container: juju-a00094-0 … was using way too much disk which I also knew could be deleted.
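For anyone repeating this step, here is a minimal sketch of ranking the datasets by size instead of reading through the grep output by eye. The function name largest_datasets is made up for illustration, and it assumes the containers live under lxdhosts/containers as in my pool. Note that "used" also counts snapshots and child datasets.

```shell
#!/bin/sh
# largest_datasets N (hypothetical helper): read "name<TAB>bytes" lines on
# stdin and print the N largest entries with sizes converted to GiB.
largest_datasets() {
    sort -t "$(printf '\t')" -k2,2nr |
        head -n "${1:-5}" |
        awk -F '\t' '{ printf "%s %.1fG\n", $1, $2 / 1073741824 }'
}

# Usage on a live system (-H drops headers, -p prints exact byte counts):
#   zfs list -H -p -o name,used -r lxdhosts/containers | largest_datasets 5
```

Keeping the parsing in a function that reads stdin means it can be tried on canned input before pointing it at a real pool.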
So I mounted the zfs disk like below (taking the path from zfs list):
zfs mount lxdhosts/containers/juju-a00094-0
The disk mounted at /var/snap/lxd/common/lxd/storage-pools/lxdhosts/containers/juju-a00094-0 and allowed me to cd into the filesystem and I could remove the unwanted data with rm.
Finally, I unmounted the disk again:
zfs umount /var/snap/lxd/common/lxd/storage-pools/lxdhosts/containers/juju-a00094-0
At this point, the lxc commands became much more responsive again.
What I’m wondering now is whether this slowdown sets in BEFORE the pool actually runs out of space. Is this known zfs behavior, and if so, at what point should one start mitigating it on large disks, before it’s too late?
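In the meantime, a small check like the one below could warn before the pool gets this full. It is only a sketch: check_pools is a hypothetical helper, and the 80% default is a commonly cited rule of thumb that I am treating as an assumption, not a confirmed limit for this setup.

```shell
#!/bin/sh
# check_pools THRESHOLD (hypothetical helper): read "name<TAB>capacity" lines
# on stdin, as printed by `zpool list -H -o name,capacity`, and flag pools
# at or above THRESHOLD percent. The default of 80 is an assumed rule of thumb.
check_pools() {
    threshold="${1:-80}"
    while read -r name cap; do
        cap="${cap%\%}"                      # strip the trailing "%"
        case "$cap" in
            ''|*[!0-9]*)                     # e.g. "-" when no value is reported
                echo "SKIP $name"
                continue
                ;;
        esac
        if [ "$cap" -ge "$threshold" ]; then
            echo "WARN $name ${cap}% full (threshold ${threshold}%)"
        else
            echo "OK $name ${cap}% full"
        fi
    done
}

# Usage on a live system:
#   zpool list -H -o name,capacity | check_pools 80
```

With the numbers from my zpool list output (CAP at 96%), this would have printed a WARN line for lxdhosts.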