Possible to investigate OOM problems on a physical host?

I noticed that semi-randomly one of my nodes was restarting/stopping all its VMs and containers. I finally ran incus admin os debug log to try to diagnose it and discovered that systemd was reporting that processes in the Incus unit were being OOM-killed. While it’s possible I had been over-provisioning while moving things from Debian + Incus to IncusOS, I figured I’d ask if there’s a better way to diagnose memory pressure on a physical install.

I didn’t think to save the logs, but one thing I noticed was that the incus-osd process was listed in the kernel OOM report with a very large number of pages allocated in RSS (to the tune of 200,000), more than everything else listed in the log other than maybe a qemu process, which makes me think there might be a memory leak or similar? This host had been up since the 25th, so I could see it being a very slow leak, if it’s a leak at all.

One thing I definitely want to set up is some kind of metrics gathering for my now-IncusOS cluster, so hopefully I can get pretty graphs for memory and maybe other per-node stats. I vaguely recall a live stream adding something to that effect, but I can’t find any relevant documentation in the IncusOS docs on how I’d scrape that.

Sorry the post got a bit rambly; it’s getting to be bedtime soon and I wanted to dump this all out before I forgot again.


The Incus /1.0/metrics endpoint when running on IncusOS includes the host OS metrics as gathered by node-exporter.

So your best bet is to set up Prometheus and Grafana, then have Prometheus scrape /1.0/metrics, at which point you should have data suitable for both the Incus dashboard and the Node Exporter dashboard.
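For reference, a minimal Prometheus scrape job for this might look like the sketch below. The target address, certificate paths, and server_name are all placeholders you’d need to adjust for your own host; the CA file is the Incus server certificate and the cert/key pair is a client certificate that Incus trusts for metrics access:

```yaml
scrape_configs:
  - job_name: incus
    metrics_path: /1.0/metrics
    scheme: https
    static_configs:
      # assumed address/port of the IncusOS node
      - targets: ["incusos-host.example:8443"]
    tls_config:
      # assumed paths on the Prometheus host
      ca_file: /etc/prometheus/tls/incusos-server.crt
      cert_file: /etc/prometheus/tls/metrics.crt
      key_file: /etc/prometheus/tls/metrics.key
      server_name: incusos-host
```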

It appears that my system is suffering from similar issues now that I’ve moved all workloads to the physical host and more scheduled tasks such as backups and snapshots are running. Seemingly at random, some instances are killed, sometimes including Incus itself. The KVM is showing no errors, yet the Incus API no longer responds.

I’ll set up some external system for observability to see if that gives some pointers.

We support remote syslog logging, which can be pretty useful in those kinds of situations to track down what’s going on.

Here’s a first result, just before it goes OOM:

Can you load the Node Exporter dashboard too? That would get you the host CPU and memory usage metrics.

I enabled remote syslog now.

Is Node Exporter part of Grafana or Prometheus? It appears to be missing in Grafana Cloud.

Are instances not found by hostname on IncusOS? It appears that:

config:
  syslog:
    address: 10.146.200.99:514
    log_format: rfc5424
    protocol: udp

is working, but with:

config:
  syslog:
    address: alloy.incus:514
    log_format: rfc5424
    protocol: udp

it isn’t :man_shrugging:

Yeah, that’s normal. Even on a regular system, the .incus domain only gets resolved by the host if you specifically configured systemd-resolved to do so.

In the case of IncusOS, you may have hundreds of networks on the system, all using the .incus default domain, so we wouldn’t really know which one resolved should send DNS queries to.
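On a regular system with a single bridge, the usual pattern is a small oneshot unit that points resolved at the bridge’s dnsmasq. A sketch, where the bridge name incusbr0 and its address are assumptions you’d replace with your own:

```ini
# /etc/systemd/system/incus-dns-incusbr0.service (assumed path/name)
[Unit]
Description=Per-link DNS configuration for incusbr0
BindsTo=sys-subsystem-net-devices-incusbr0.device
After=sys-subsystem-net-devices-incusbr0.device

[Service]
Type=oneshot
# assumed bridge address; check with: incus network get incusbr0 ipv4.address
ExecStart=/usr/bin/resolvectl dns incusbr0 10.146.200.1
ExecStart=/usr/bin/resolvectl domain incusbr0 '~incus'

[Install]
WantedBy=sys-subsystem-net-devices-incusbr0.device
```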


For node exporter, the data will already be in Prometheus if you’re scraping Incus’ /1.0/metrics. You just need a dashboard to see it:

Here it is :slight_smile: Grafana

Nothing obviously wrong there, system seems to have plenty of free memory and even the CPU doesn’t really show any bad spikes.

I think we can rule out OOM for my system. Yet, I still wanted to share my investigation.

As the metrics still didn’t reveal any real (OOM) problem, I noticed that the system keeps responding to non-network input; e.g. Zigbee automations within Home Assistant were still running as expected.

I just remembered that previously I had to disable offloading for the onboard I219-LM (which uses the e1000e driver) using this script. I saw that Add support for disabling tcp segmentation offloading for an interface · Issue #723 · lxc/incus-os · GitHub was recently merged, so I disabled all offloading again. Hopefully that resolves my issues :crossed_fingers:

Finally found the spoons to spin up a monitoring setup now that I accidentally burnt down my k3s cluster, and I’m running into a problem: where do I get the CA cert from IncusOS? The Incus monitoring docs say to use a file from /var/lib/incus/… which I assume I can’t just pull from the IncusOS hosts?

incus info shows the server’s public certificate.
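One way to capture that into a file Prometheus can use as ca_file — a sketch assuming the default local remote, and that the PEM block may come out YAML-indented in the incus info output:

```shell
# Extract just the PEM certificate block from `incus info`,
# stripping any YAML indentation in front of it
incus info \
  | sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p' \
  | sed 's/^[[:space:]]*//' > incusos-server.crt
```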

ahha, danke, got it all scraping now.

Hrm. I’m seeing a sudden spike of roughly 2x memory usage today around 06:50 ET, and it doesn’t correspond with any of the guests on that node using more memory according to the Incus Grafana dashboard. The only clue is a bunch of this from the prometheus-node-exporter.service unit:

time=2026-01-18T11:59:37.940Z level=ERROR source=http.go:225 msg="error encoding and sending metric family: write tcp 127.0.0.1:9100->127.0.0.1:40468: write: broken pipe"

so I wonder if something is going on that’s causing prometheus-node-exporter to balloon in memory?

The node went from ~6GiB in use to 14GiB.

Looking further in the logs, it seems that same write tcp error is continuing to happen even up until now (08:18).

Hmm, right, so that seems to relate to when Incus queries the local node-exporter prior to returning metrics.

Can you get the full incus query /1.0/metrics output?
That should let us extract some memory data both from Incus and from node-exporter (assuming Incus can get a response from it at all).
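For example, something along these lines should pull out just the memory series (the metric name patterns here are assumptions; adjust after eyeballing the raw output):

```shell
# Host memory as seen by node-exporter vs. by Incus itself
incus query /1.0/metrics | grep -E '^(node_memory_Mem|incus_memory_)'
```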

I’ve rebooted all of my nodes since this morning, applying the re-release of yesterday’s update, but I see a similar, though not as massive, bump just 45 minutes ago.

In this case, the jump was from ~4GiB in-use on the host, to ~8GiB

Here’s the /1.0/metrics query output from the relevant node: https://0x0.st/PKeB.prom

I’m using a shell one-liner to turn incus ls location=&lt;node&gt; --all-projects into a bunch of &amp;var-name=&lt;instance&gt; values to paste into the Grafana Incus dashboard, and the “Project Memory Usage” graph shows no bump over the same timespan as the host’s reported memory usage jump.
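In case it’s useful to anyone, the kind of one-liner I mean looks roughly like this (a sketch; node1 is a placeholder, and I’m assuming incus ls accepts -c n -f csv to print bare instance names):

```shell
# Turn each instance name on the node into a &var-name=<instance> URL fragment
incus ls location=node1 --all-projects -c n -f csv \
  | sed 's/^/\&var-name=/' | tr -d '\n'
```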