The metrics in there show both incusd and node-exporter at below 100MiB of memory usage, so that’s looking good.
If you want to try tracking this, the metrics to keep an eye on would be:
go_memstats_alloc_bytes
go_memstats_sys_bytes
incus_go_alloc_bytes
incus_go_sys_bytes
The first two come from node-exporter via the merged metrics endpoint, while the latter two are the incusd equivalents. So if the memory issue originates in either of those processes, you should be able to catch it that way.
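If you want to alert on these rather than just watch graphs, a minimal Prometheus rule sketch could look like the following. This is an assumption on my part: the 256 MiB threshold and the `job="incus"` label are made up, so adjust them to match your scrape config.

```yaml
groups:
  - name: incus-memory
    rules:
      # Hypothetical: fire if incusd's Go runtime has claimed >256 MiB from the OS
      - alert: IncusdMemoryHigh
        expr: incus_go_sys_bytes > 256 * 1024 * 1024
        for: 10m
      # Same check for node-exporter, via the merged metrics (job label assumed)
      - alert: NodeExporterMemoryHigh
        expr: go_memstats_sys_bytes{job="incus"} > 256 * 1024 * 1024
        for: 10m
```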
Are the memory statistics of incus-osd available as well? I recall it being listed explicitly in the OOM killer logs, but that was after maybe two weeks of uptime, when I first opened the thread.
I now suspect the “incus processes” killed by the OOM killer were actually one or more of the guests, which is what clued me in to the issue in the first place: some of my guests were “rebooting” for seemingly no reason.
So, I may have gotten a bit distracted from the original topic here, but I have a theory as to why node-exporter suddenly starts spewing “write on closed pipe” errors. I wonder if the incusd proxying is timing out in this dial; I can’t be sure, because incusd never seems to report any errors in that code path. I’m running IncusOS on a trio of 8th-gen Intel NUCs, so I could see spikes in load causing it to miss deadlines there. Especially since I’ve had to disable TCP hardware offloading on the NUCs, spikes in network traffic could bog down the rest of the networking stack.
Actually, wait. Looking at node_scrape_collector_duration_seconds, I see the hwmon collector regularly spiking above 2.5s. Could those timeouts be causing the scrape to fail entirely? As I understand it, node-exporter pulls metrics at scrape time, so a slow hwmon collector could cause the entire request to be cancelled on the incusd side.