lxc query not showing disk stats for all containers

Any thoughts as to why lxc query would show lxd_disk reads and writes on some containers and not others?
lxc query /1.0/metrics

I have two different storage pools, and lxc query /1.0/metrics only shows lxd_disk reads and writes for one container in each storage pool.
The query shows other metrics, such as lxd_cpu_*, lxd_network_*, lxd_procs_*, lxd_filesystem_*, etc., for all containers in each storage pool, but not lxd_disk_*.
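
In case it helps anyone reproduce, here is a rough sketch of how I’m listing which instances appear under a given metric family (it assumes the default project, and that lxd_cpu_seconds_total is one of the lxd_cpu_* metrics mentioned above):

```bash
# List the instances reporting each metric family, so missing ones stand out.
# lxd_cpu_seconds_total is assumed present (part of the lxd_cpu_* family).
for fam in lxd_cpu_seconds_total lxd_disk_read_bytes_total; do
  echo "== $fam =="
  lxc query /1.0/metrics --raw \
    | grep "^$fam" \
    | sed -n 's/.*name="\([^"]*\)".*/\1/p' \
    | sort -u
done
```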

Thank you for any ideas you can share.

Are the storage pools using the same driver type?

Can you also show me the specific commands/queries/results you are getting so I can try to reproduce?
Please also show the lxc storage show <pool> output for each pool. Thanks.

Yes, both my storage pools are using the same driver: zfs

root@chips:/usr/local/bin# lxc storage list
+------------------+--------+------------------+-------------+---------+---------+
|       NAME       | DRIVER |      SOURCE      | DESCRIPTION | USED BY |  STATE  |
+------------------+--------+------------------+-------------+---------+---------+
| containers       | zfs    | containers       |             | 39      | CREATED |
+------------------+--------+------------------+-------------+---------+---------+
| grandecontainers | zfs    | grandecontainers |             | 6       | CREATED |
+------------------+--------+------------------+-------------+---------+---------+
root@chips:/usr/local/bin# lxc storage show containers
config:
  source: containers
  volatile.initial_source: /dev/nvme0n1
  zfs.pool_name: containers
description: ""
name: containers
driver: zfs
used_by:
- /1.0/instances/acochrane-lxc
- /1.0/instances/bralston-lxc
- /1.0/instances/ccannon-lxc
- /1.0/instances/ckotfila-lxc
- /1.0/instances/cnewton-lxc
- /1.0/instances/cwilson-lxc
- /1.0/instances/danderson-lxc
- /1.0/instances/dchu-lxc
- /1.0/instances/drude-lxc
- /1.0/instances/dsappington-lxc
- /1.0/instances/golden-template-chips
- /1.0/instances/hpennington-lxc
- /1.0/instances/jfeatherston-lxc
- /1.0/instances/jfrankel-lxc
- /1.0/instances/jliddle-lxc
- /1.0/instances/jmadsen-lxc
- /1.0/instances/jthomas-lxc
- /1.0/instances/jyoung-lxc
- /1.0/instances/kcraig-lxc
- /1.0/instances/kgrego-lxc
- /1.0/instances/ksappington-lxc
- /1.0/instances/kstamper-lxc
- /1.0/instances/kwindham-lxc
- /1.0/instances/mkhan-lxc
- /1.0/instances/mronquest-lxc
- /1.0/instances/ncooke-lxc
- /1.0/instances/ngross-lxc
- /1.0/instances/nhellmuth-lxc
- /1.0/instances/ntschohl-lxc
- /1.0/instances/ozabalaferrera-lxc
- /1.0/instances/rculhane-lxc
- /1.0/instances/relstad-lxc
- /1.0/instances/vdhand-lxc
- /1.0/instances/wjianping-lxc
- /1.0/instances/wschultz-lxc
- /1.0/instances/zBackup-golden-template-chips
- /1.0/profiles/bootstrap-k8s
- /1.0/profiles/default
- /1.0/profiles/k8s
status: Created
locations:
- none
root@chips:/usr/local/bin# lxc storage show grandecontainers
config:
  source: grandecontainers
  volatile.initial_source: /dev/nvme2n1
  zfs.pool_name: grandecontainers
description: ""
name: grandecontainers
driver: zfs
used_by:
- /1.0/instances/afox-lxc
- /1.0/instances/aheyne-lxc
- /1.0/instances/aszabo-lxc
- /1.0/instances/elahrvivaz-lxc
- /1.0/instances/elahrvivaz-lxc/snapshots/elahrvivaz-lxc-bkjan272023
- /1.0/instances/mdailey-lxc
status: Created
locations:
- none

Basically, I’m trying to use lxc query to look at disk metrics.
When I run this command, it shows various metrics for all containers, including the disk metrics I’m after:

# lxc query /1.0/metrics
# TYPE lxd_disk_read_bytes_total counter
# TYPE lxd_disk_reads_completed_total counter
# TYPE lxd_disk_written_bytes_total counter
# TYPE lxd_disk_writes_completed_total counter

However, it only shows this information for two containers, one in each pool: acochrane-lxc in “containers” and elahrvivaz-lxc in “grandecontainers”.

# lxc query /1.0/metrics --raw | grep lxd_disk_*
# HELP lxd_disk_read_bytes_total The total number of bytes read.
# TYPE lxd_disk_read_bytes_total counter
lxd_disk_read_bytes_total{device="loop1",name="acochrane-lxc",project="default",type="container"} 37888
lxd_disk_read_bytes_total{device="sda",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_read_bytes_total{device="nvme1n1",name="acochrane-lxc",project="default",type="container"} 53248
lxd_disk_read_bytes_total{device="dm-0",name="acochrane-lxc",project="default",type="container"} 53248
lxd_disk_read_bytes_total{device="loop1",name="elahrvivaz-lxc",project="default",type="container"} 166912
lxd_disk_read_bytes_total{device="loop3",name="elahrvivaz-lxc",project="default",type="container"} 2048
lxd_disk_read_bytes_total{device="sda",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_read_bytes_total{device="nvme1n1",name="elahrvivaz-lxc",project="default",type="container"} 344064
lxd_disk_read_bytes_total{device="dm-0",name="elahrvivaz-lxc",project="default",type="container"} 344064
# HELP lxd_disk_reads_completed_total The total number of completed reads.
# TYPE lxd_disk_reads_completed_total counter
lxd_disk_reads_completed_total{device="loop1",name="acochrane-lxc",project="default",type="container"} 1
lxd_disk_reads_completed_total{device="sda",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_reads_completed_total{device="nvme1n1",name="acochrane-lxc",project="default",type="container"} 5
lxd_disk_reads_completed_total{device="dm-0",name="acochrane-lxc",project="default",type="container"} 5
lxd_disk_reads_completed_total{device="loop1",name="elahrvivaz-lxc",project="default",type="container"} 10
lxd_disk_reads_completed_total{device="loop3",name="elahrvivaz-lxc",project="default",type="container"} 1
lxd_disk_reads_completed_total{device="sda",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_reads_completed_total{device="nvme1n1",name="elahrvivaz-lxc",project="default",type="container"} 14
lxd_disk_reads_completed_total{device="dm-0",name="elahrvivaz-lxc",project="default",type="container"} 14
# HELP lxd_disk_written_bytes_total The total number of bytes written.
# TYPE lxd_disk_written_bytes_total counter
lxd_disk_written_bytes_total{device="loop1",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="sda",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="nvme1n1",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="dm-0",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="loop1",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="loop3",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="sda",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="nvme1n1",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_written_bytes_total{device="dm-0",name="elahrvivaz-lxc",project="default",type="container"} 0
# HELP lxd_disk_writes_completed_total The total number of completed writes.
# TYPE lxd_disk_writes_completed_total counter
lxd_disk_writes_completed_total{device="loop1",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_writes_completed_total{device="sda",name="acochrane-lxc",project="default",type="container"} 11
lxd_disk_writes_completed_total{device="nvme1n1",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_writes_completed_total{device="dm-0",name="acochrane-lxc",project="default",type="container"} 0
lxd_disk_writes_completed_total{device="loop1",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_writes_completed_total{device="loop3",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_writes_completed_total{device="sda",name="elahrvivaz-lxc",project="default",type="container"} 12
lxd_disk_writes_completed_total{device="nvme1n1",name="elahrvivaz-lxc",project="default",type="container"} 0
lxd_disk_writes_completed_total{device="dm-0",name="elahrvivaz-lxc",project="default",type="container"} 0
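
A small aside on that grep: unquoted, lxd_disk_* is first exposed to shell globbing, and grep then reads it as the regex “lxd_disk followed by zero or more underscores”, so a quoted, anchored pattern is more predictable:

```bash
# Anchored and quoted, so the shell cannot glob-expand the pattern
lxc query /1.0/metrics --raw | grep '^lxd_disk_'
```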

As you can see, the “containers” pool has 39 containers and the “grandecontainers” pool has 6.

I need to get the disk metrics for the other 38 and the other 5.

Thanks.

What version of LXD is this?

Also, just to confirm: the other containers are running, right?

Can you show me the lxc config show <instance> --expanded for one of the problem containers?

root@chips:/usr/local/bin# lxc --version
5.0.2

Yes all containers in the list are running.

Also, I have a second LXD server with the same setup (50 containers in one pool and 1 container in the second pool), and I get the same result with lxc query: only one container in each pool shows disk metrics.

From a container in storage pool “grandecontainers” (shows disk metrics):

root@chips:/usr/local/bin# lxc config show elahrvivaz-lxc --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230107)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230107"
  image.type: squashfs
  image.version: "22.04"
  limits.cpu: "16"
  limits.memory: 64GB
  limits.memory.swap: "false"
  linux.kernel_modules: ip_tables,ip6_tables,nf_nat,overlay,br_netfilter
  raw.lxc: "lxc.cap.drop= \nlxc.cgroup.devices.allow=a\nlxc.mount.auto=proc:rw sys:rw\nlxc.mount.entry
    = /dev/kmsg dev/kmsg none defaults,bind,create=file"
  security.nesting: "true"
  security.privileged: "true"
  volatile.base_image: ed7509d7e83f29104ff6caa207140619a8b235f66b5997f1ed6c5e462617fb71
  volatile.cloud-init.instance-id: 7577872b-b25e-41ce-a342-db1fca343716
  volatile.eth0.host_name: vethf4b6d85f
  volatile.eth0.hwaddr: 00:16:3e:8c:b1:0b
  volatile.idmap.base: "0"
  volatile.idmap.current: '[]'
  volatile.idmap.next: '[]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 3275df8d-db06-42bc-96ab-01fcd7881ba6
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxcbr0
    type: nic
  kmsg:
    path: /dev/kmsg
    source: /dev/kmsg
    type: unix-char
  root:
    path: /
    pool: grandecontainers
    size: 250GB
    type: disk
ephemeral: false
profiles:
- k8s
stateful: false
description: ""

From a container in storage pool “grandecontainers” (does NOT show disk metrics):

root@chips:/usr/local/bin# lxc config show afox-lxc --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230107)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230107"
  image.type: squashfs
  image.version: "22.04"
  limits.cpu: "16"
  limits.memory: 64GB
  security.nesting: "true"
  volatile.base_image: ed7509d7e83f29104ff6caa207140619a8b235f66b5997f1ed6c5e462617fb71
  volatile.cloud-init.instance-id: cb28886c-9815-45d9-8fa5-0f047927341c
  volatile.eth0.host_name: vethde7ae084
  volatile.eth0.hwaddr: 00:16:3e:2a:b7:28
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: c7d78707-480e-4ba3-92db-65b3b2db8c74
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxcbr0
    type: nic
  root:
    path: /
    pool: grandecontainers
    size: 250GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

From a container in storage pool “containers” (shows disk metrics):

root@chips:/etc# lxc config show acochrane-lxc --expanded 
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230107)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230107"
  image.type: squashfs
  image.version: "22.04"
  limits.cpu: "16"
  limits.memory: 64GB
  security.nesting: "true"
  volatile.base_image: ed7509d7e83f29104ff6caa207140619a8b235f66b5997f1ed6c5e462617fb71
  volatile.cloud-init.instance-id: 0682f865-791b-4571-863f-2109f09c2ad8
  volatile.eth0.host_name: vethf52c3ef6
  volatile.eth0.hwaddr: 00:16:3e:28:75:c0
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: 4ea62a76-27b5-4ad8-bd2e-ebc479ba5558
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxcbr0
    type: nic
  root:
    path: /
    pool: containers
    size: 75GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
From a container in storage pool “containers” (does NOT show disk metrics):

root@chips:/etc# lxc config show bralston-lxc --expanded 
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230107)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230107"
  image.type: squashfs
  image.version: "22.04"
  limits.cpu: "16"
  limits.memory: 64GB
  security.nesting: "true"
  volatile.base_image: ed7509d7e83f29104ff6caa207140619a8b235f66b5997f1ed6c5e462617fb71
  volatile.cloud-init.instance-id: a7c2a4b7-16a2-48f0-b4b3-bca8531532e5
  volatile.eth0.host_name: vethcea55a14
  volatile.eth0.hwaddr: 00:16:3e:49:72:64
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.power: RUNNING
  volatile.uuid: e8c08341-9fd9-410e-ba0b-457dfdefa7c5
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: lxcbr0
    type: nic
  root:
    path: /
    pool: containers
    size: 75GB
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
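
Since acochrane-lxc (reports disk metrics) and bralston-lxc (does not) sit in the same pool, diffing their expanded configs is a quick sanity check; a minimal sketch using the instance names above:

```bash
# Compare a reporting and a non-reporting container from the same pool;
# process substitution feeds both configs to diff without temp files.
diff <(lxc config show acochrane-lxc --expanded) \
     <(lxc config show bralston-lxc --expanded)
```

In the outputs above, the two differ only in volatile.* keys, which points away from instance configuration as the cause.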

I am seeing this message in lxd.log, in case it sheds any light:

level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0")" instance=jliddle-lxc instanceType=container project=default

and

level=error msg="Failed writing error for HTTP response" err="write unix /var/snap/lxd/common/lxd/unix.socket->@: write: broken pipe" url=/1.0/metrics writeErr="write unix /var/snap/lxd/common/lxd/unix.socket->@: write: broken pipe"

Checking the snap.lxd.daemon service shows the same errors:

root@chips:/var/snap/lxd/common/lxd# systemctl status snap.lxd.daemon

● snap.lxd.daemon.service - Service for snap application lxd.daemon
     Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.service; static)
     Active: active (running) since Sat 2023-01-28 16:11:34 UTC; 3 weeks 4 days ago
TriggeredBy: ● snap.lxd.daemon.unix.socket
   Main PID: 4437 (daemon.start)
      Tasks: 0 (limit: 618852)
     Memory: 8.5M
        CPU: 594ms
     CGroup: /system.slice/snap.lxd.daemon.service
             ‣ 4437 /bin/sh /snap/lxd/24322/commands/daemon.start

Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0>
Feb 22 13:43:22 chips lxd.daemon[4623]: time="2023-02-22T13:43:22Z" level=error msg="Failed writing error for HTTP response" err="write unix /var/snap/lxd/common/lxd/unix.soc>

It looks like those log lines are incomplete and are being truncated.

Can you confirm whether you see those errors for the instance(s) affected by the lack of metrics?

@sdeziel @amikhalitsyn do you remember seeing something like this recently?
Something about a kernel change missing a new line?
Is this the same problem?

Does restarting one of the problem instances fix it (temporarily)?
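
One way to check the raw data behind those warnings is to read the instance’s cgroup io.stat file directly. The path below is an assumption based on LXD’s usual cgroup2 lxc.payload.<instance> naming, so adjust it to your layout:

```bash
# Inspect the kernel's io.stat for one affected instance (path is an
# assumption based on LXD's typical cgroup2 layout).
cat /sys/fs/cgroup/lxc.payload.jliddle-lxc/io.stat
# A healthy file has one "MAJ:MIN key=value ..." entry per line; the warnings
# above suggest a line carrying two device IDs, e.g. "259:3 253:0 rbytes=..."
```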

@tomp, I think you are right. Looks related to

Yes, sorry, the lines got truncated.

Yes, I see the errors in the logs for each instance that doesn’t show the disk metrics:
time="2023-02-22T21:36:04Z" level=warning msg="Failed to get disk stats" err="Failed extracting io.stat "253:0" (from "259:3 253:0")" instance=bralston-lxc instanceType=container project=default

I restarted a few of my containers, and yes, the disk metrics do show with the command:
lxc query /1.0/metrics --raw | grep lxd_disk_*

Not sure if that’s temporary or permanent.

Thanks, I’ll take a look at what we can do to parse the duff output from the kernel.
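
Not to preempt the actual fix, but to illustrate what parsing the duff output could mean: a pass that starts a new entry at every MAJ:MIN token tolerates the missing newline. A minimal awk sketch, assuming the cgroup v2 io.stat key=value format and the hypothetical path from earlier:

```bash
# Re-split io.stat on device-ID boundaries, so a malformed line such as
# "259:3 253:0 rbytes=..." is treated as two entries: a bare "259:3" and
# the "253:0 rbytes=..." stats that follow it.
awk '{
  out = ""
  for (i = 1; i <= NF; i++) {
    if ($i ~ /^[0-9]+:[0-9]+$/ && out != "")
      out = out "\n" $i          # a MAJ:MIN token starts a new entry
    else if (out == "")
      out = $i
    else
      out = out " " $i           # key=value stays with its device
  }
  print out
}' /sys/fs/cgroup/lxc.payload.jliddle-lxc/io.stat
```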

Thank you!

This PR, which will be in LXD 5.14, should fix this issue: