Monitoring lxd zfs backend with prometheus node exporter

erik_lonroth · May 5, 2022, 6:44pm

I’m looking for some help monitoring my zfs backend for lxd.

I’ve deployed prometheus node exporter:

/usr/bin/node_exporter --version
node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7b940ddcff26873612bccdf7cd4c42b6b6)
  build user:       root@243aafa5525c
  build date:       20211205-11:09:49
  go version:       go1.17.3
  platform:         linux/amd64

… which produces some metrics:

curl -k http://192.168.111.2:9100/metrics  | grep zfs | grep size

# HELP node_zfs_abd_linear_data_size kstat.zfs.misc.abdstats.linear_data_size
# TYPE node_zfs_abd_linear_data_size untyped
node_zfs_abd_linear_data_size 5.2704768e+07
# HELP node_zfs_abd_scatter_data_size kstat.zfs.misc.abdstats.scatter_data_size
# TYPE node_zfs_abd_scatter_data_size untyped
node_zfs_abd_scatter_data_size 6.0842836992e+10
# HELP node_zfs_abd_struct_size kstat.zfs.misc.abdstats.struct_size
# TYPE node_zfs_abd_struct_size untyped
node_zfs_abd_struct_size 3.069084e+07

… But this doesn’t make much sense to me, and I was looking for some advice on:

How and what should I monitor for disk utilization (space left etc.)?
Does anyone have a Grafana dashboard that I could use?
Does anyone have general advice on alerting strategies for my zfs pools?

This is a followup on: Question on LXD dashboard values - #3 by erik_lonroth

sdeziel · May 5, 2022, 7:08pm

ZFS’ ABD stands for ARC buffer data, so those metrics describe the in-memory cache. That is interesting to monitor but not directly related to disk usage. I’d check which other disk-related metrics node exporter might give you.
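As a rough sketch of what checking those other disk-related metrics could look like: the snippet below reads node exporter’s text output and computes the fraction of free space per mountpoint. It assumes the filesystem collector is enabled and reports your ZFS mounts through `node_filesystem_avail_bytes` and `node_filesystem_size_bytes` with `fstype="zfs"`, which may not hold on every setup. The function name `free_space_ratio` is made up for this example.

```python
import re

# One labelled sample line, e.g.: name{k="v",...} 1.5e+09
SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
# One label pair inside the braces, e.g.: fstype="zfs"
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def free_space_ratio(metrics_text, fstype="zfs"):
    """Return {mountpoint: fraction of space still available}.

    Hypothetical helper; assumes node exporter's filesystem collector
    exposes the mounts of interest with the given fstype label.
    """
    avail, size = {}, {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips "# HELP"/"# TYPE" lines and unlabelled samples
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        if labels.get("fstype") != fstype or "mountpoint" not in labels:
            continue
        if name == "node_filesystem_avail_bytes":
            avail[labels["mountpoint"]] = float(value)
        elif name == "node_filesystem_size_bytes":
            size[labels["mountpoint"]] = float(value)
    # Only report mountpoints that have both values and a non-zero size.
    return {mp: avail[mp] / size[mp] for mp in avail if size.get(mp)}
```

You could feed it the body returned by the same `/metrics` endpoint queried with curl above and alert when a ratio drops below a threshold of your choosing.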

As for what to monitor, the node exporter dashboard exposes nice basic stats, which is what we tried to mimic with the LXD dashboard. If you want something fancier, the node exporter dashboard has advanced sections IIRC.

P.S.: too bad our disk metrics no longer work on ZFS… they used to, but got broken at some point, maybe by a kernel/zfs update

erik_lonroth · May 5, 2022, 7:15pm

Thanks for the attention. I’m using the LXD dashboard, but it’s really a pity that the zfs components seem to have got lost.

Is there anything we can do to get that back?

I’ll let you know how far we get with this…

We are using juju to deploy most of it. Absolutely fantastic!

sdeziel · May 5, 2022, 7:25pm

Oh believe me, I’d absolutely love to get those ZFS metrics back as I have ZFS deployed everywhere.
I’ll be upgrading my hosts to a newer kernel/zfs module soon, so maybe that will work, who knows. I’ll report back if it does.

sdeziel · March 3, 2023, 12:45am

I’ve moved to the 22.04 HWE kernel (5.19) and it seems that fixed the ZFS metrics! @erik_lonroth you might want to give that a try.

erik_lonroth · March 7, 2023, 1:32pm

@sdeziel I’m not too keen on upgrading all of the hosts we have to get that kernel in. But I guess we can get there eventually.

Would perhaps btrfs be a better option?

sdeziel · March 7, 2023, 2:10pm

I hear you regarding the (early) HWE kernel. As for btrfs, I don’t know as all my deployments are using zpools to back instances.

tomp · March 17, 2023, 9:58am

For what it’s worth, we don’t recommend BTRFS for use with VMs:

https://linuxcontainers.org/lxd/docs/master/reference/storage_btrfs/#quotas

erik_lonroth · March 17, 2023, 7:05pm

Good to know.

What we are juggling is how to provide insight into performance of zfs containers.

kamzar1 · March 18, 2023, 12:06am

A while ago I used zfs_exporter instead, with the following config:


OPTIONS="--web.listen-address=localhost:9134 \
         --collector.dataset-snapshot \
         --no-collector.pool \
         --web.disable-exporter-metrics \
         --exclude=deleted \
         --pool=zp2 \
         --pool=zp3"

You can then define the desired zfs properties for filesystems, snapshots, volumes …
--properties.dataset-filesystem='available,logicalused,referenced,used,usedbydataset,usedbysnapshots,quota,compressratio,volsize,written' \
--properties.dataset-snapshot='logicalused,referenced,used,written' \

You don’t need to wait for prometheus or grafana to see the result; you can check it immediately. For prettier output, pipe it through prom2json and then jq.

curl http://localhost:9134/metrics
curl http://localhost:9134/metrics | prom2json | jq .

With JSON output, you can define your alerting strategies based on selected JSON keys/values.
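To illustrate the idea, here is a minimal sketch that reads prom2json output and lists datasets with little space left. It assumes zfs_exporter publishes the `available` property as a metric family named `zfs_dataset_available_bytes` with a `name` label holding the dataset name; check your own `/metrics` output, since names can differ between versions. The function name `low_space_datasets` and the threshold are made up for this example.

```python
import json


def low_space_datasets(prom2json_text, min_free_bytes,
                       family_name="zfs_dataset_available_bytes"):
    """Return [(dataset, free_bytes)] for datasets below the threshold.

    Hypothetical helper; family_name and the "name" label are assumptions
    about what zfs_exporter emits, not confirmed by this thread.
    """
    alerts = []
    # prom2json emits a JSON array of metric families, each with a
    # "name" and a list of "metrics" carrying "labels" and a string "value".
    for family in json.loads(prom2json_text):
        if family.get("name") != family_name:
            continue
        for sample in family.get("metrics", []):
            free = float(sample["value"])
            if free < min_free_bytes:
                dataset = sample.get("labels", {}).get("name", "unknown")
                alerts.append((dataset, free))
    # Most urgent (least free space) first.
    return sorted(alerts, key=lambda item: item[1])
```

A cron job or small script could call this on the output of the `curl ... | prom2json` command above and send a notification whenever the returned list is non-empty.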