Latest CentOS 8 image doesn't work as a VM? Can't launch agent

I had an unpleasant surprise trying to launch a CentOS 8 image on my existing LXD machine. It seems that in the latest build (20200909_07) the LXD agent fails to start in VM mode, making the image unusable for me since I can't exec in. I can use the console, but I don't know the initial root password.

For various reasons I'd strongly prefer to run it in VM mode rather than as a container.

Using a very vanilla default profile and options, the LXD agent fails to start in VM mode. If I try CentOS 7 or CentOS 8 in container mode, it works fine and I can lxc exec.

The only output in the console of note was:

[FAILED] Failed to start LXD - agent - 9p mount.
See 'systemctl status lxd-agent-9p.service' for details.
[DEPEND] Dependency failed for LXD - agent.

$ lxc exec reportpool1 bash
Error: Failed to connect to lxd-agent

I have other images working fine - Ubuntu, and CentOS 8 on an earlier build (20200518_07) - so it seems like it might just be the latest builds.

Does anyone know how to solve this? Without the root password or lxc exec I can't even get in to look at any logs to begin troubleshooting. It's a big locked black box.

Run command:

lxc launch images:centos/8 reportpool1 --vm

Host OS: Ubuntu 18.04 with HWE

lxc --version
4.0.3

Any idea how to solve this? Any ideas on how to "break in" to the VM (or set something at launch) to set a root password and troubleshoot?

For breaking in, see here:

And here (Section: Extra steps for official Ubuntu images):

You need to set the root password via cloud-init and then you can log in via lxc console.
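As a minimal sketch of that approach, assuming the cloud-init enabled variant of the image (images:centos/8/cloud) and using a placeholder password:

```shell
# Sketch only: launch the cloud variant of the image and seed a root
# password through cloud-init user-data ("changeme" is a placeholder).
lxc launch images:centos/8/cloud reportpool1 --vm --config=user.user-data='#cloud-config
ssh_pwauth: true
chpasswd:
  list: |
    root:changeme
  expire: false'

# Once the VM has booted, log in on the text console with that password.
lxc console reportpool1
```

This works even without the agent, because cloud-init reads its seed data independently of lxd-agent on first boot.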

Thanks, toby. I was able to launch a cloud instance and inject a user.

It looks like the LXD agent isn’t starting because of a failure to mount?

mount: /run/lxd_config/9p: unknown filesystem type '9p'

systemctl status lxd-agent-9p.service
lxd-agent-9p.service - LXD - agent - 9p mount
   Loaded: loaded (/usr/lib/systemd/system/lxd-agent-9p.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2020-09-09 18:48:41 UTC; 4min 10s ago
     Docs: https://linuxcontainers.org/lxd
  Process: 505 ExecStart=/bin/mount -t 9p config /run/lxd_config/9p -o access=0,trans=virtio (code=exited, status=32)
  Process: 503 ExecStartPre=/bin/chmod 0700 /run/lxd_config/ (code=exited, status=0/SUCCESS)
  Process: 501 ExecStartPre=/bin/mkdir -p /run/lxd_config/9p (code=exited, status=0/SUCCESS)
  Process: 499 ExecStartPre=/sbin/modprobe 9pnet_virtio (code=exited, status=0/SUCCESS)
 Main PID: 505 (code=exited, status=32)

Sep 09 18:48:41 localhost systemd[1]: Starting LXD - agent - 9p mount...
Sep 09 18:48:41 localhost mount[505]: mount: /run/lxd_config/9p: unknown filesystem type '9p'.
Sep 09 18:48:41 localhost systemd[1]: lxd-agent-9p.service: Main process exited, code=exited, status=32/n/a
Sep 09 18:48:41 localhost systemd[1]: lxd-agent-9p.service: Failed with result 'exit-code'.

Can you check for the 9p kernel module in the VM?
Run:

 # modinfo 9p

This is a CentOS 8 regression, they had issues building the 9p kernel module in a recent kernel update and so decided to just not build it anymore, breaking everyone using it in the process…

We immediately filed a bug which was promptly ignored by the CentOS team…
https://bugs.centos.org/view.php?id=17552

https://bugs.centos.org/view.php?id=16196 shows when the module got disabled.

So I’d recommend heading over there and complaining about it some more, maybe it will get re-enabled?

Until now, I was under the impression that Red Hat promised a very stable kernel where no such thing would ever happen, but apparently I was wrong…


Thanks. I've been debugging along those lines. It certainly seems like 9p support is busted. I've been trying to find instructions for getting 9p running on CentOS, but despite loading the correct modules and using the centos.plus kernel, it doesn't seem to want to enable 9p support. I admit I'm rather new to CentOS and 9p :thinking:

$ modinfo 9p
modinfo: ERROR: Module 9p not found.
$ lsmod |grep -i 9p
9pnet_virtio 20480 0
9pnet 90112 1 9pnet_virtio
$ cat /proc/filesystems
nodev sysfs
nodev rootfs
nodev ramfs
nodev bdev
nodev proc
nodev cpuset
nodev cgroup
nodev cgroup2
nodev tmpfs
nodev devtmpfs
nodev configfs
nodev debugfs
nodev tracefs
nodev securityfs
nodev sockfs
nodev bpf
nodev pipefs
nodev hugetlbfs
nodev devpts
nodev autofs
nodev pstore
nodev efivarfs
nodev mqueue
nodev selinuxfs
ext3
ext2
ext4
vfat
iso9660
$ uname -a
Linux reportpool1 4.18.0-193.14.2.el8_2.centos.plus.x86_64 #1 SMP Mon Aug 3 15:15:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
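For what it's worth, one way to confirm the 9p filesystem module really is absent from the kernel itself (a sketch; module file names and paths can vary by distribution):

```shell
# Sketch: list the 9p-related modules shipped with the running kernel.
# A kernel with 9p filesystem support should show 9p.ko(.xz) alongside
# the 9pnet* modules; the affected CentOS 8 kernels ship only 9pnet*.
find /lib/modules/"$(uname -r)" -name '9p*'
```

That would explain the output above: 9pnet and 9pnet_virtio load fine, but without the 9p module itself, the 9p filesystem never appears in /proc/filesystems and the mount fails.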

Is there a simple way to revert to a previous LXD CentOS 8 image? My older one seems to work fine, but I don't see a way to launch that exact image build.

lxc config show backupweb2
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Centos 8 amd64 (20200518_07:08)
  image.os: Centos
  image.release: "8"
  image.serial: "20200518_07:08"
  image.type: disk-kvm.img

Sadly, no, we only keep the past 3 images around as we would otherwise very quickly run out of disk space on the image servers…

On our own test systems, the workaround I used was to manually download the old CentOS 8 kernel, forcefully install it, and then prevent the system from ever updating it.
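Roughly, that pinning approach could look like the following sketch. The versionlock plugin package name and the availability of the older kernel in the configured repos are assumptions; check vault.centos.org for the versions that actually still ship 9p.

```shell
# Sketch only: fall back to an older CentOS 8 kernel that still shipped
# the 9p module, then lock it so dnf never replaces it.
dnf downgrade kernel-core kernel-modules
dnf install python3-dnf-plugin-versionlock   # provides "dnf versionlock"
dnf versionlock add kernel-core kernel-modules
reboot

# After rebooting into the older kernel, verify 9p is available again:
grep 9p /proc/filesystems
```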

This isn’t really a solution I’d recommend though as that means you’re not getting security updates either. In our case, those test VMs are completely disconnected from the internet, so it’s not a problem, but for anyone else, that would likely be an issue…

We are looking at using virtiofs in the future, though it's unclear whether that would be supported any better on the CentOS 8 kernel…

Sadly, no, we only keep the past 3 images around

Understood. Thank you anyway for your help! I'll try to downgrade just the kernel and see how it goes. I'll also look harder for other workarounds. Maybe we can actually use a container, or maybe we can just use an Ubuntu VM instead of CentOS.

One option may also be to copy the LXD agent binary from 9p or from the host and run it manually in the VM. You’ll also need the various certificates that come with it.

The downside to this is that the binary may get out of date as LXD gets updated.
It also will not be getting some of the dynamic files we generate for disk passthrough and other bits, but this may not be an issue for your use.
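A rough sketch of what copying the agent over could look like, assuming a snap-based LXD host; all paths and the VM address below are assumptions, so adjust them to your install:

```shell
# Sketch only: host-side paths are assumptions for a snap-installed LXD.
# Copy the agent binary plus its TLS certificates into the VM over SSH,
# then start the agent from the directory containing those certificates.
CFG=/var/snap/lxd/common/lxd/virtual-machines/reportpool1/config
scp /snap/lxd/current/bin/lxd-agent \
    "$CFG/agent.crt" "$CFG/agent.key" "$CFG/server.crt" \
    root@VM_IP:/root/lxd-agent/

# Then, inside the VM:
cd /root/lxd-agent && ./lxd-agent
```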

Would a dkms module be possible for 9p?

I also once read about only “partly recompiling the kernel”, but I never tried that.

Compiling the whole kernel takes hours…

I think torecat (the maintainer from here: https://bugs.centos.org/view.php?id=17552) should at least show what errors occurred exactly.

Yeah, dkms could be an option, though if you figure out what’s needed to make it build through dkms, the same could be applied to the kernel, restoring the module for everyone.

That's true :slightly_frowning_face:.
Assuming that it works in other kernels, CentOS must have components or changes that interfere with 9p :thinking:.
But just closing your bug report as won't-fix is really bad handling of the situation.

Let me spin the problem around another way…

Given I've figured out how to create users via cloud-init, what is the impact of NOT having lxd-agent running? Obviously lxc exec is affected. Is the agent critical, or just really nice to have?
Assuming the console is fine for use (plus SSH to the VM itself), could I still use the VM successfully without the agent?

:thinking: @stgraber will know best, but I guess it's mostly "nice-to-have", since it enables some of the usual lxc commands, like lxc exec and lxc file; see for example Simos' blog for some details: https://blog.simos.info/how-to-use-virtual-machines-in-lxd/
(Section: Using a LXD virtual machine).

Update:
lxc info also seems to be affected.

If you can read some code, here is the source code for the lxd-agent:

Yeah, the agent effectively provides:

  • Exec
  • File operations
  • State information (displayed in lxc info and lxc list)
  • Seeding & triggering of cloud-init on first boot
  • Automated mounting of disk devices that pass a host path into the VM (though that similarly relies on 9p, so a manual agent will fail for that part)

We are planning on adding quite a few more features including handling for CPU/memory hotplug, network device renaming and hotplug, virtiofs filesystems, …

But in general we never expect the agent to be a requirement and so long as you’ve configured some kind of user in the VM, you can access it using lxc console or over SSH.
