Error: mkdir /var/snap/lxd/common/lxd/shmounts/test: read-only file system when launching a container

Suddenly, after 7 days of uptime, I started getting “Error: mkdir /var/snap/lxd/common/lxd/shmounts/test: read-only file system” when starting a container.

~$ lxc launch c8 test --target lxd11
Creating test
Starting test
Error: mkdir /var/snap/lxd/common/lxd/shmounts/test: read-only file system
Try `lxc info --show-log lxd:test` for more info

The container log is empty, which is kind of logical, since the container never got far enough to be set up.

~$  lxc info --show-log lxd:test
Name: test
Status: STOPPED
Type: container
Architecture: x86_64
Location: lxd11
Created: 2023/02/09 16:00 UTC

Log:

The lxd log is empty too (i.e. nothing in there that hasn’t happened regularly for months). I didn’t have lxcfs.debug or daemon.debug set to true, though; I do have them now.
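(Side note, not from the thread: on the snap package those two debug switches can be set like this; `daemon.debug` and `lxcfs.debug` are the standard LXD snap option keys, and the daemon needs a restart for them to take effect.)

```shell
# Enable debug logging for the LXD daemon and lxcfs (snap-packaged LXD).
snap set lxd daemon.debug=true
snap set lxd lxcfs.debug=true
# Apply the new configuration (restarts the daemon).
systemctl restart snap.lxd.daemon
```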

So, I had a look around and indeed, something has remounted /var/snap/lxd/common/lxd/shmounts and lxcfs read-only:

[root@lxd11 ~]# nsenter -a -t $(pgrep daemon.start)
-bash-5.0# mount | fgrep shmounts
tmpfs on /var/snap/lxd/common/shmounts type tmpfs (rw,relatime,size=1024k,mode=711)
lxcfs on /var/snap/lxd/common/shmounts/lxcfs type fuse.lxcfs (ro,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
tmpfs on /var/snap/lxd/common/shmounts/instances type tmpfs (ro,relatime,size=100k,mode=711)
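(A side note of mine, not from the thread: the same check can be scripted by parsing the options field of a mountinfo-format table instead of eyeballing `mount` output — handy for monitoring. `list_ro_mounts` is a hypothetical helper name.)

```shell
# Print mount points whose per-mount options include "ro".
# mountinfo format: field 5 is the mount point, field 6 the mount options.
list_ro_mounts() {
    awk '$6 ~ /(^|,)ro(,|$)/ { print $5 }'
}

# Against the live table (run inside the snap mount namespace):
#   list_ro_mounts < /proc/self/mountinfo
# Against a sample line:
printf '618 24 0:52 / /var/snap/lxd/common/shmounts/lxcfs ro,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw\n' \
    | list_ro_mounts
# → /var/snap/lxd/common/shmounts/lxcfs
```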

if I remount rw, all is good (as far as I can tell so far):

-bash-5.0# mount /var/snap/lxd/common/shmounts/lxcfs -oremount,rw
-bash-5.0# mount /var/snap/lxd/common/shmounts/instances -oremount,rw

This has happened twice already, on two different servers from the same cluster.

Any ideas on how to debug this further?

lxd is the latest, 5.10-b392610.

Anyone with any ideas? On the same servers I also see lxc publish fail with Error: mkdir /tmp/lxd_lxd_metadata_133785845: no such file or directory, but it might be a coincidence.

Have you seen this before @amikhalitsyn ?

thanks for the ping (-;

No, unfortunately I haven’t seen such an effect.

@Aleks, if you are ready to run it in your environment, I can prepare a tracing script for you, and we can try to catch this “playful” process that remounts the tmpfs with the read-only flag.

@Aleks, have you updated the kernel recently, or just the lxd snap?

sure, I’ll be happy to run a tracing script

There was no recent kernel upgrade; the last kernel was installed in August:

[root@lxd11 ~]# rpm -qi kernel | fgrep Install
Install Date: Thu 04 Aug 2022 01:04:53 PM CEST

Great!

# setup perf probes (one time)
perf probe 'do_mount dev_name:string dir_name:string flags:u64'
perf probe 'reconfigure_super fc->fs_type->name:string fc->root->d_sb->s_dev:u32'

# start tracing; runs until Ctrl+C (SIGINT)
perf record -e 'probe:*' --call-graph dwarf -aR
# you probably need nohup + & to make it run like a daemon:
# nohup perf record -e 'probe:*' --call-graph dwarf -aR &
# and then kill -INT $(pidof perf)

# check the collected traces
perf script

# remove all probes (after tracing finishes)
perf probe -d 'probe:*'

You’ll need to set up Linux kernel debug symbols to use perf.
On RHEL-like systems: debuginfo-install kernel-debuginfo-$(uname -r)
On Ubuntu: apt install linux-image-$(uname -r)-dbgsym

Tracing using perf is generally safe, even in production environments.

eh, it generates 150 MB of data per minute; apparently stuff gets mounted all the time. That would be 200+ GB per day, and since this happens randomly it can be several days until I catch it, but I don’t have that much disk space. Is there any way to limit it to only the namespace of the lxd daemon, where those mounts that get remounted actually live?
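For what it’s worth, a perf-free fallback sketch (my own, not from the thread; the mount point and log path below are placeholders): poll the mount table and log the moment the mount flips read-only. It costs almost no disk space, though unlike perf it won’t identify the culprit process.

```shell
# Return success iff the given mount point is mounted read-only, judging by
# the options field (field 6) of a mountinfo-format table read from stdin.
mount_is_ro() {
    awk -v m="$1" '$5 == m && $6 ~ /(^|,)ro(,|$)/ { found = 1 } END { exit !found }'
}

# Poll every 5 seconds; append a timestamp to the log when the flip happens.
watch_ro() {
    local mnt=$1 log=$2
    until mount_is_ro "$mnt" < /proc/self/mountinfo; do
        sleep 5
    done
    date "+%F %T $mnt went read-only" >> "$log"
}

# Usage (placeholder paths; run inside the daemon's mount namespace):
#   watch_ro /var/snap/lxd/common/shmounts/lxcfs /var/log/shmounts-watch.log
```

Like the nsenter check above, this only sees the right mount table if it runs inside the snap’s mount namespace.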

[root@lxd10 lxd10]# timeout 60s perf record -e probe:* --call-graph dwarf -aR
[ perf record: Woken up 588 times to write data ]
[ perf record: Captured and wrote 164.050 MB perf.data ]

ah, yep, that’s my bad. We can try to optimize the size of the trace file:

# let's use only one probe
perf probe 'reconfigure_super fstype=fc->fs_type->name:string fc->root->d_sb->s_dev:u32'

# ... and enable ring-buffer overwrite and filtering by filesystem type :-)
perf record -e 'probe:*' --overwrite --filter 'fstype == "tmpfs"' --call-graph dwarf -aR

eh, it generates 150 MB of data per minute; apparently stuff gets mounted all the time

It’s because of under-the-hood implementation details of perf. Perf collects PERF_RECORD_MMAP and PERF_RECORD_MMAP2 events to do userspace instruction-pointer resolution (stack unwinding and so on).

If it still produces too much trace data, then I’ll prepare another tracing script for you (-:

just an update:

Unfortunately, the read-only remount has not happened since, but the trace files are now suspiciously empty. How sure are you that this is tracing the right thing? I tried this:

[root@lxd13 ~]# mkdir /tmp/aa
[root@lxd13 ~]# mount -t tmpfs /dev/shm /tmp/aa
[root@lxd13 ~]# mount -o remount,ro /tmp/aa
[root@lxd13 ~]# touch /tmp/aa/aa
touch: cannot touch '/tmp/aa/aa': Read-only file system
[root@lxd13 ~]# mount -o remount,rw /tmp/aa

but the trace file is still empty, i.e. “perf script” outputs nothing

Have you used the latest version of the trace script? I’ve just rechecked it:

mount 287786 [005] 13963.671694: probe:reconfigure_super: (ffffffffaddf4fc0) fstype="tmpfs" s_dev=29
        ffffffffaddf4fc1 reconfigure_super+0x1 (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
        ffffffffade23347 __x64_sys_mount+0x117 (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
        ffffffffae90d179 do_syscall_64+0x59 (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
        ffffffffaea0009b entry_SYSCALL_64+0x9b (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
                  126eae __GI___mount+0x6ee (inlined)
                   23d5b [unknown] (/usr/lib/x86_64-linux-gnu/libmount.so.1.1.0)
                   246c1 mnt_context_do_mount+0x2a1 (/usr/lib/x86_64-linux-gnu/libmount.so.1.1.0)
                   29ee3 mnt_context_mount+0x1e3 (/usr/lib/x86_64-linux-gnu/libmount.so.1.1.0)
                    5eac [unknown] (/usr/bin/mount)
                   29d8f __libc_start_call_main+0x4df (inlined)
                   29e3f __libc_start_main_impl+0x58f (inlined)
                    6e94 [unknown] (/usr/bin/mount)

mount 287809 [005] 13965.808335: probe:reconfigure_super: (ffffffffaddf4fc0) fstype="tmpfs" s_dev=29
        ffffffffaddf4fc1 reconfigure_super+0x1 (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
        ffffffffade23347 __x64_sys_mount+0x117 (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
        ffffffffae90d179 do_syscall_64+0x59 (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
        ffffffffaea0009b entry_SYSCALL_64+0x9b (/usr/lib/debug/boot/vmlinux-5.19.0-32-generic)
                  126eae __GI___mount+0x6ee (inlined)
                   23d5b [unknown] (/usr/lib/x86_64-linux-gnu/libmount.so.1.1.0)
                   246c1 mnt_context_do_mount+0x2a1 (/usr/lib/x86_64-linux-gnu/libmount.so.1.1.0)
                   29ee3 mnt_context_mount+0x1e3 (/usr/lib/x86_64-linux-gnu/libmount.so.1.1.0)
                    5eac [unknown] (/usr/bin/mount)
                   29d8f __libc_start_call_main+0x4df (inlined)
                   29e3f __libc_start_main_impl+0x58f (inlined)
                    6e94 [unknown] (/usr/bin/mount)

Which kernel version do you have?

I’m running this from a systemd unit; it exits and is restarted every hour:

[root@lxd10 ~]# systemctl cat perf
# /etc/systemd/system/perf.service
[Unit]
Description=sets up probes and runs perf collector to /perf

[Service]
Type=simple
ExecStart=/root/bin/perf.sh
KillSignal=INT
RestartSec=5s
Restart=on-failure

[Install]
WantedBy=multi-user.target

#!/bin/bash

# perf service, for tracking stuff like https://discuss.linuxcontainers.org/t/error-mkdir-var-snap-lxd-common-lxd-shmounts-test-read-only-file-system-when-launching-a-container/16355/6

set -o errexit

TS=$(date +%Y%m%d%H%M)

perf probe -d 'probe:*' || true
perf probe 'reconfigure_super fstype=fc->fs_type->name:string fc->root->d_sb->s_dev:u32'
timeout -s INT 3600s perf record -e 'probe:*' --overwrite --filter 'fstype == "tmpfs"' --call-graph dwarf -aR -o /perf/perf.$TS.dat
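One addition that may be worth making to a setup like this (my suggestion, not from the thread; the directory and retention period are placeholders): prune old capture files so the hourly dumps don’t eventually fill the disk.

```shell
# Delete perf capture files older than a given number of days.
prune_old_traces() {
    local dir=$1 days=$2
    find "$dir" -name 'perf.*.dat' -type f -mtime +"$days" -delete
}

# e.g. keep three days of traces; run from cron or at the top of the unit script:
#   prune_old_traces /perf 3
```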

kernel is

[root@lxd10 ~]# uname -a
Linux lxd10.2e-systems.com 4.18.0-408.el8.x86_64 #1 SMP Mon Jul 18 17:42:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I can suggest trying without filtering. But it’s strange: I’ve just checked the CentOS 8 kernel, and it should work perfectly fine, because the reconfigure_super function has the same arguments and the structures also contain the required fields. Have you only run this tracing script from the systemd unit context, or have you also tried running it “directly” from bash?