Lxc shell - blocking or extremely slow

I’ve been running LXD for some time, and now it has suddenly ground to a halt. It behaves more broken than working, although it still responds.

My setup:

  • LXD from snap, version 4.0.7 (stable).
  • ZFS backend on a RAID block device.

zpool list
NAME      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG   CAP  DEDUP  HEALTH  ALTROOT
lxdhosts  556G   539G  17.4G        -         -   63%   96%  1.00x  ONLINE  -

(Note I have 17.4G free)

The server has been rebooted, all containers are STOPPED, and I can still communicate with LXD.

Stuff that I see:

lxc list

It works and lists my containers just fine.

lxc launch

Gets stuck

root@iceberg:~# lxc launch ubuntu:18.04 u1
Creating u1
Retrieving image: Unpack: 100% (773.23MB/s)

… but the container is listed as “STOPPED” with lxc list.

lxc shell

Gets stuck, but after a long wait (3-4 minutes) and a few Ctrl-C presses I seem to end up inside the container…

I’m really stuck here, and I hope I don’t have to nuke my whole server to recover from this.

I do note that my iowait gets very high at times, which has previously made me reduce the number of containers on the host. But it’s really not that many.
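One way I know of to confirm whether the disk is actually the bottleneck (assuming the sysstat package is installed) is to watch per-device utilisation and the iowait column for a while:

iostat -xz 5   # extended per-device stats (await, %util) every 5 seconds
vmstat 5       # the "wa" column is the iowait percentage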

lxc list --format csv
caspians-dator,RUNNING,CONTAINER,0
juju-75f241-0,STOPPED,CONTAINER,0
juju-75f241-1,STOPPED,CONTAINER,0
juju-530f52-0,STOPPED,CONTAINER,0
juju-554e9d-4,STOPPED,CONTAINER,0
juju-554e9d-5,STOPPED,CONTAINER,0
juju-554e9d-6,STOPPED,CONTAINER,0
juju-a00094-0,STOPPED,CONTAINER,0
juju-f4cf5f-4,STOPPED,CONTAINER,0
juju-f4cf5f-5,STOPPED,CONTAINER,0
juju-fe5353-0,STOPPED,CONTAINER,0

I suspect ZFS is the culprit, but since I do have space left and my containers aren’t even running, I can’t see why this should be the issue. How can I test it?

Any advice or help here is welcome.

I’m currently running zpool trim lxdhosts and zpool scrub lxdhosts to see if this helps, but I’m not sure this is the right approach. The problem was there even before I ran trim and scrub on the zpool.

zpool status
  pool: lxdhosts
 state: ONLINE
  scan: scrub in progress since Thu Sep 2 12:14:33 2021
        539G scanned at 589M/s, 117G issued at 128M/s, 539G total
        0B repaired, 21.76% done, 0 days 00:56:04 to go
config:

        NAME        STATE     READ WRITE CKSUM
        lxdhosts    ONLINE       0     0     0
          sdb       ONLINE       0     0     0

errors: No known data errors

UPDATE #1: 2021-09-01T22:00:00Z
So, I managed to figure out which containers were using the most disk via:

zfs list | grep containers

In my case the container juju-a00094-0 was using way too much disk, and I knew its data could safely be deleted.
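For reference, a more direct way to rank the container datasets by usage (if I’ve got the flags right) would be:

zfs list -r -o name,used,avail,refer -S used lxdhosts/containers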

So I mounted the container’s ZFS dataset as below (taking the dataset name from zfs list):

zfs mount lxdhosts/containers/juju-a00094-0

The dataset was mounted at /var/snap/lxd/common/lxd/storage-pools/lxdhosts/containers/juju-a00094-0, which let me cd into the filesystem and remove the unwanted data with rm.

Finally, I unmounted the dataset again:

zfs umount /var/snap/lxd/common/lxd/storage-pools/lxdhosts/containers/juju-a00094-0

At this point the lxc commands became a lot more responsive again.

My question at the moment is whether this situation can occur BEFORE actually running out of ZFS disk space. Is this a known behaviour of ZFS, and if so, at what point should one start mitigating with larger disks, before it’s too late?

Hi,
Have you checked lxc operation list or dmesg for disk error messages? Maybe that gives you a clue about the system bottleneck?
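For example (the exact grep pattern here is only a suggestion):

lxc operation list
dmesg -T | grep -iE 'i/o error|blk_update|zfs'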

They didn’t show anything. But as you might have seen from my UPDATE above, my suspicion about the ZFS disk utilization approaching 95% was right.

After removing a significant amount of data from some containers, everything came back to life. This was a good lesson for me: letting ZFS approach maximum capacity is a dangerous situation.

I would love to discuss how others manage this, to avoid this kind of global impact on the LXD host itself:

  • Capping storage with some kind of default lxc profile? (Roughly what I have in mind is sketched after this list.)
  • Monitoring? (What tools and how?)
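On the capping idea, what I have in mind is roughly the following (an untested sketch; the exact lxc syntax may differ between LXD versions, and 50GB is just an example value):

# limit the root disk of new instances via the default profile
lxc profile device set default root size 50GB

# or put a hard quota on an individual container's ZFS dataset
zfs set quota=50G lxdhosts/containers/juju-a00094-0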

I’m glad that was the problem and that you got past it for now. You can tune one or two ZFS settings to gain some space: as you know, checking zfs get dedup and zfs get compression is a good starting point for ZFS tuning. But be aware that data written before changing these settings is not deduplicated (or compressed) retroactively. As for your second question, you can check out the tool Zabbix.
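For example, something like:

zfs get compression,dedup lxdhosts
zfs set compression=lz4 lxdhosts   # only affects data written after this point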
Regards.

UPDATE: Apparently there is some “rule” out there to never fill a ZFS pool beyond 80%. I’ve also read 70%, so this feels like some dark-magic ops thing.
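If the rule really is “stay under 80% used”, then a simple check I could drop into cron (just a sketch; the threshold and the echo are placeholders) would be something like:

CAP=$(zpool list -H -o capacity lxdhosts | tr -d '%')
[ "$CAP" -gt 80 ] && echo "lxdhosts is at ${CAP}% - time to free space or grow the pool"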

I’d love to learn more about this, since I’ve now run into a similar situation again, but this time with a slightly different problem profile…

Yes, I hit that the other day: ZFS grinds to a halt when it has nearly filled up, which means you tend not to get the more useful “no disk space” error and instead just get operations blocking. :frowning:


Yeah, I’m fighting performance issues, and since I’m not sure where to look for ZFS performance problems, I’m still in the dark.

Do you know anyone competent who could help us understand whether we have a ZFS performance problem? We will pay.