I noticed that semi-randomly one of my nodes was restarting/stopping all its VMs and containers. I finally did an incus admin os debug log to try and diagnose it, and discovered that systemd was saying stuff in the incus unit was being OOM killed. While it’s possible I had been over-provisioning while moving stuff from Debian + Incus to IncusOS, I figured I’d ask if there’s any better way to diagnose memory pressure on a physical install.
I didn’t think to save the logs, but one thing I noticed was that the incus-osd process was listed in the kernel OOM report with a very large number of pages allocated in RSS (to the tune of 200,000), more than everything else in the log except maybe a qemu process, which makes me think there may be a memory leak or similar. This host had been up since the 25th, so I could see it being a very slow leak, if anything.
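In case it helps next time: one generic way to snapshot per-process RSS on a minimal host, without assuming what tooling IncusOS ships beyond a shell and awk, is to read /proc directly. A rough sketch:

```shell
# Rough sketch: list the top processes by resident set size (VmRSS),
# reading /proc directly so it works even without a full ps.
for s in /proc/[0-9]*/status; do
  awk -v p="$s" '
    /^Name:/  { name = $2 }
    /^VmRSS:/ { rss = $2 }
    # extract the PID out of the /proc/<pid>/status path
    END { if (rss) printf "%10d kB  %s  %s\n", rss, substr(p, 7, length(p) - 13), name }
  ' "$s" 2>/dev/null
done | sort -rn | head -n 10
```

Running that periodically (or right after an OOM event) should make it obvious whether incus-osd’s RSS is actually growing over time.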
One thing I definitely want to get set up is some kind of metrics gathering for my now-IncusOS cluster, so hopefully I can get pretty graphs for memory and maybe other node stats. I vaguely recall a live stream adding something to that effect, but I can’t find any relevant documentation in the IncusOS docs on how I’d scrape that.
Sorry that the post got a bit rambly, getting to be bed time soon and wanted to dump this all out before I forgot again.
The Incus /1.0/metrics endpoint when running on IncusOS includes the host OS metrics as gathered by node-exporter.
So your best bet is to set up Prometheus and Grafana, then have Prometheus scrape /1.0/metrics, at which point you should have data suitable for both the Incus dashboard and the Node Exporter dashboard.
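A hedged sketch of what that can look like on the Prometheus side. The host name incus01, port 8443, and file names are placeholders; the trust command is the standard Incus way to register a metrics-only client certificate:

```shell
# Sketch: create a metrics client certificate and a minimal Prometheus
# scrape config for /1.0/metrics. Names and paths are placeholders.
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 \
  -keyout metrics.key -out metrics.crt -nodes -days 3650 \
  -subj '/CN=metrics.local'

# Then, on the Incus side, trust it for metrics only:
#   incus config trust add-certificate metrics.crt --type=metrics

cat > prometheus-incus.yml <<'EOF'
scrape_configs:
  - job_name: incus
    metrics_path: /1.0/metrics
    scheme: https
    static_configs:
      - targets: ['incus01:8443']
    tls_config:
      cert_file: metrics.crt
      key_file: metrics.key
      ca_file: server.crt   # the Incus host's server certificate
EOF
```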
It appears that my system is suffering from similar issues now that I’ve moved all workloads to the physical host and more scheduled tasks are running, like backups and snapshots. Seemingly at random, some instances are killed, sometimes including Incus itself. The KVM shows no errors, yet the Incus API no longer responds.
I’ll set up some external system for observability to see if that gives some pointers.
Yeah, that’s normal. Even on a regular system, the .incus network only gets resolved by the host if you specifically configured resolved to do so.
In the case of IncusOS, you may have hundreds of networks on the system, all using the .incus default domain, so we wouldn’t really know which one resolved should send DNS queries to.
I think we can rule out OOM for my system. Yet, I still wanted to share my investigation.
As the metrics still didn’t reveal any real (OOM) problem, I noticed that the system keeps responding to non-network input. E.g. Zigbee automations within Home Assistant still ran as expected.
Finally found the spoons to spin up a monitoring setup now that I accidentally burnt down my k3s cluster, and I’m running into a problem: where do I get the CA cert for IncusOS? The Incus monitoring docs say to use a file from /var/lib/incus/…, which I assume I can’t just pull from the IncusOS hosts?
Hrm. I’m seeing a sudden spike of like 2x memory usage today around 0650 ET, and it doesn’t correspond with any of the guests on that node using more memory according to the Incus Grafana dashboard. The only clue is a bunch of this from the prometheus-node-exporter.service unit:
Hmm, right, so that seems to relate to when Incus queries the local node-exporter prior to returning metrics.
Can you get the full incus query /1.0/metrics output?
That should let us extract some memory data both from Incus and from node-exporter (assuming Incus can get a response from it at all).
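For a quick sanity check on saved output, the usual “in use” figure can be derived from the node-exporter gauges as MemTotal minus MemAvailable. A small sketch, fed with sample values here instead of the live incus query /1.0/metrics output:

```shell
# Sketch: derive "in use" memory from node-exporter style gauges,
# fed with sample values in bytes instead of live metrics.
printf '%s\n' \
  'node_memory_MemTotal_bytes 1.6e+10' \
  'node_memory_MemAvailable_bytes 8e+09' |
awk '
  /^node_memory_MemTotal_bytes/     { total = $2 }
  /^node_memory_MemAvailable_bytes/ { avail = $2 }
  END { printf "%.1f GiB in use\n", (total - avail) / 1024 ^ 3 }
'
# prints: 7.5 GiB in use
```

Comparing that number against what Incus reports per instance should show whether the jump is inside the guests or purely host-side.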
I’ve rebooted all of my nodes since this morning, applying the re-release of yesterday’s update, but I saw a similar, though not as massive, bump just 45m ago.
In this case, the jump was from ~4GiB in use on the host to ~8GiB.
I’m using a shell one-liner to turn incus ls location=<node> --all-projects output into a string of &var-name=<instance> values to paste into the Grafana Incus dashboard, and the “Project Memory Usage” graph shows no bump in the timespan where the host’s reported memory usage jumped.
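For reference, the one-liner is roughly this. The printf stands in for the real instance list; on a host it would be fed from incus ls location=&lt;node&gt; --all-projects -f csv -c n (assuming the csv format and name column behave as on plain Incus):

```shell
# Sketch: turn instance names into Grafana URL variable parameters.
# Sample names here; replace the printf with the real incus ls call.
printf '%s\n' web01 db01 cache01 |
awk '{ printf "&var-name=%s", $1 } END { print "" }'
# prints: &var-name=web01&var-name=db01&var-name=cache01
```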