LXD Containers Hog Mounted NFS Storage Available Space Until Restart

Hi, I would like some help from everybody looking into a strange issue that I think might be related to LXD's interaction with NFS and tmpfs.

Some background:

  1. we are running a Java application (Tomcat)

  2. this application has multiple instances running on multiple hosts/nodes

  3. some of these nodes are VMs (not LXD) and some are LXD containers (which exhibit the strange behavior, and are the reason I’m writing this)

  4. the application requires some NFS-exported storage

  5. NFS exports are mounted within each container; i.e. we have:

    "raw.apparmor: mount fstype=nfs*, mount fstype=rpc_pipefs,"
    

    configuration for these containers (a sketch of one way to apply this follows the list below)

  6. one of the NFS exports to be mounted is tmpfs-backed (I call it temp1); it stores temporary, small (1 KiB to ~10 MiB), generated files

  7. we have a cron job on the NFS server to delete files in this temp1 storage every hour, and yes, I can confirm that it is working

  8. the NFS server that exports this temp1 storage is a VM running on a dedicated VM-only server

  9. the LXD containers run on a physical server with Ubuntu 18.04 and LXD version 3.0.3
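
As referenced in point 5, here is a minimal sketch of one way that AppArmor exception can be applied (the container name app1 is a placeholder; raw.apparmor is the standard LXD config key):

    # Hypothetical container name "app1"; extends its AppArmor profile so that
    # NFS and rpc_pipefs filesystems can be mounted inside the container.
    lxc config set app1 raw.apparmor 'mount fstype=nfs*, mount fstype=rpc_pipefs,'
    lxc restart app1   # the regenerated profile takes effect on restart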

The problem that caught my attention was that the NFS export from point 6 somehow went nearly full, while the application shouldn’t be able to fill it unless the cron job were broken or had not executed for many hours.

Looking at the NFS server that exports this temp1 storage: when I run du -csh on the directory, the total usage is far lower than what df reports, both on the server itself and on every NFS client that mounts this export.
I inspected the contents of temp1 manually at low-traffic hours, and it confirms the du output.
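
Concretely, the mismatch looks like this (the paths /srv/temp1 and /mnt/temp1 are placeholders for our real export and mount point):

    # On the NFS server: what the visible files add up to vs. what the
    # filesystem claims is used.
    du -csh /srv/temp1    # total of files actually present; low
    df -h /srv/temp1      # reported usage; far higher than the du total

    # On any NFS client mounting the export, df agrees with the server's
    # inflated number, not with du.
    df -h /mnt/temp1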

I know that when a file is being used by (say) process A, and another process B then deletes that file from the filesystem, the deletion is not visible to process A because it still holds a handle to that file; until process A is done with the file (closes it), the space occupied by that file is not recovered for anything outside process A.
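
For anyone unfamiliar, a minimal illustration of that behavior on a local filesystem (paths and sizes are made up):

    # Keep a file descriptor (fd 3) open on a file while writing 100 MiB to it.
    exec 3>/tmp/held.tmp
    dd if=/dev/zero bs=1M count=100 >&3

    rm /tmp/held.tmp   # unlink the file; the name disappears...
    df -h /tmp         # ...but df still counts the 100 MiB while fd 3 is open

    exec 3>&-          # close the descriptor
    df -h /tmp         # now the space is actually reclaimed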

With that tidbit in mind, I checked whether the application instances using the temp1 storage were holding files open by running lsof | grep temp1 on the LXD container hosts and within each application container.
It came up empty, so at least the application seems to close temp1 files after opening them (i.e. there is no long-lived open()).
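
For completeness, two other generic ways to hunt for unlinked-but-still-open files, in case the plain lsof | grep misses something (standard Linux tooling, nothing specific to our setup):

    # Open files whose link count has dropped below 1, i.e. deleted but held open.
    sudo lsof +L1

    # Or scan /proc directly for descriptors pointing at deleted files.
    sudo find /proc/[0-9]*/fd -lname '*(deleted)*' 2>/dev/null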

With that out of the way, I suspected some of these containers were in need of a restart; this is the second time I’m having this issue, and I “solved” it last time by restarting containers (i.e. running lxc restart container-name on the container host).
After the suspected containers successfully restarted, we got back our missing space on the NFS server’s temp1 export:

(graph: NFS server available space recovering after the containers restart)

Two containers were restarted at times matching when the missing space was recovered. Since this is NFS storage, I looked at the combination of network usage and space usage.
I found high Tx network traffic (indicating writes to the NFS server) at night, which is our low-traffic period, and the disk usage remained high until both containers were restarted.
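
To watch that correlation live from a client, per-mount NFS traffic can be observed with nfsiostat from nfs-common (the mount point /mnt/temp1 is a placeholder):

    # Per-mount NFS read/write throughput, refreshed every 5 seconds.
    nfsiostat 5 /mnt/temp1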

Some questions:

  1. Is there anything that I need to provide to shed more light on this issue?
  2. Has anyone seen something like this?
  3. Does anyone have any idea what could cause what seems to be an LXD container “stealing” disk space on an NFS export mounted inside that container, space which can only be recovered by restarting the container?
  4. I know that this is a rather old version of LXD; is this a known issue?

The biggest neon arrow sign pointing me at LXD is that by restarting both containers I recover all the missing space, while I haven’t seen temp1 disk space recovered by restarting VM instances; it is only the LXD containers.

My suspicions are either a) the LXD container holding some kind of handle to the NFS mounts, b) some uncommitted NFS state that is only “committed” on shutdown/restart, or c) something related to systemd.

We have (Prometheus) monitoring in place, both on the LXD container hosts and within each container. I’m looking for some /sys/something or /proc/whatever that I’m not aware of and should be monitoring.
Basically, any pointer that can explain this.
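
The client-side NFS counters I already know about are these (generic kernel/nfs-utils interfaces), in case someone can point me at something better:

    # Detailed per-mount NFS client statistics: ops, bytes, retransmits.
    cat /proc/self/mountstats

    # Aggregate NFS client RPC counters.
    nfsstat -c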

I’m concurrently trying to rebuild a similar setup in a lab environment; hopefully that can shed some light on this weird behavior.

Your environment seems so virtualized already; why not try running the snap stable version of LXD so that you can at least rule out (or implicate) the version of LXD you are using? I’m curious to know what’s causing this as well.
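
If I remember correctly, the move from the deb to the snap is roughly this (assuming the existing deb-based LXD is still installed):

    sudo snap install lxd
    sudo lxd.migrate   # migrates containers and config from the deb LXD to the snap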

Thank you for the response.

I am currently busy with work, so I will be slow to respond. I’m working on creating a mock-up environment in the lab.

and… to hijack my own thread: I cannot install LXD:

	user@test-lxd-mockup:~$ sudo snap install lxd
	error: cannot install "lxd": Post https://api.snapcraft.io/v2/snaps/refresh: CONNECT denied (ask
		   the admin to allow HTTPS tunnels)
	user@test-lxd-mockup:~$ curl -XPOST https://api.snapcraft.io/v2/snaps/refresh
	{"error-list":[{"code":"api-error","message":"Error decoding JSON request body: Expecting value: line 1 column 1 (char 0)"}]}
	user@test-lxd-mockup:~$ cat /etc/os-release 
	NAME="Ubuntu"
	VERSION="18.04.6 LTS (Bionic Beaver)"
	ID=ubuntu
	ID_LIKE=debian
	PRETTY_NAME="Ubuntu 18.04.6 LTS"
	VERSION_ID="18.04"
	HOME_URL="https://www.ubuntu.com/"
	SUPPORT_URL="https://help.ubuntu.com/"
	BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
	PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
	VERSION_CODENAME=bionic
	UBUNTU_CODENAME=bionic
	user@test-lxd-mockup:~$ sudo snap set system proxy.https=
	user@test-lxd-mockup:~$ sudo snap set system proxy.http=
	user@test-lxd-mockup:~$ sudo snap install lxd
	error: cannot install "lxd": Post https://api.snapcraft.io/v2/snaps/refresh: CONNECT denied (ask
		   the admin to allow HTTPS tunnels)
	user@test-lxd-mockup:~$ uptime
	 03:02:58 up 6 min,  1 user,  load average: 0.01, 0.02, 0.00

just great… yet another hurdle :smiley:

This is on a fresh Ubuntu 18.04 install; any ideas, guys?

Sounds like you’ve got snapd configured to use an HTTP(S) proxy, and that proxy doesn’t allow CONNECT for HTTPS connections.
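
A couple of quick ways to check what proxy snapd is actually using (standard snap and systemd commands):

    # Proxy settings held in snapd's own configuration, if any.
    snap get system proxy.https

    # Environment variables injected into snapd by systemd drop-in files.
    systemctl show snapd --property=Environment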

I got it working later on, after my last post, by looking around a bit.

As usual, we have a local HTTP proxy for apt caching, for quicker VM provisioning.
It seems the Ubuntu installer, trying to be helpful as ever, applies this configuration to snapd as well, and probably to anything else that might contact the internet. I knew apt was affected, but snapd being affected too is new to me.

Obviously, I have to:

  • sudo truncate --size=0 /etc/systemd/system/snapd.service.d/snap_proxy.conf

so that snapd forgets about the proxy environment values coming from the installation, and then

  • sudo systemctl daemon-reload
  • sudo systemctl restart snapd

so that systemd is happy about the changes

Then I can run snap search lxd:

$ snap search lxd
Name               Version        Publisher       Notes    Summary
lxd                4.20           canonical✓      -        LXD - container and VM manager
lxd-bgp            0+git.6e804b1  stgraber        -        BGP server that exposes LXD routes
lxd-demo-server    0+git.6d54658  stgraber        -        Online software demo sessions using LXD
lxdmosaic          0+git.2124524  turtle0x1       -        A web interface to manage multiple instances of LXD
lxd-gitlab-runner  0.1            alexclewontin   -        GitLab CI/CD runner with built in LXD executor
fabrica            1.1.0          ogra            -        Build snaps by simply pointing a web form to a git tree
nova               ocata          james-page      -        OpenStack Compute Service (nova)
nova-hypervisor    ocata          james-page      -        OpenStack Compute Service - KVM Hypervisor (nova)
distrobuilder      2.0            stgraber        classic  Image builder for LXC and LXD
satellite          0.1.2          alanzanattadev  -        Advanced scalable Open source intelligence platform

Snarkiness aside, I’ll update when I have more information. Thank you for the response.