Issue rebooting container with physical NIC device passthrough

By the way, the best way to get this issue fixed is to be able to reproduce it reliably. I can reproduce it on one particular machine of mine, but I don’t know what’s causing it, and you can do the same on one of your machines. If we can figure out a way to reproduce the issue using incus VMs, then there is some hope of fixing it; if it can’t be reproduced reliably, it will be very hard to fix. Since I don’t have the time to attempt this, I can’t really help right now.


When I stop my VyOS VM, I see journal entries like this:

incusd[542]: time="2024-09-13T19:38:43-05:00" level=error msg="Failed to stop device" device=enp1s0f3 err="Failed probing device \"0000:01:00.3\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=VyOS instanceType=virtual-machine project=default
incusd[542]: time="2024-09-13T19:38:43-05:00" level=error msg="Failed to stop device" device=enp1s0f2 err="Failed probing device \"0000:01:00.2\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=VyOS instanceType=virtual-machine project=default
incusd[542]: time="2024-09-13T19:38:43-05:00" level=error msg="Failed to stop device" device=enp1s0f1 err="Failed probing device \"0000:01:00.1\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=VyOS instanceType=virtual-machine project=default
incusd[542]: time="2024-09-13T19:38:43-05:00" level=error msg="Failed to stop device" device=enp1s0f0 err="Failed probing device \"0000:01:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=VyOS instanceType=virtual-machine project=default

With the PCI IDs from the log above, the following commands make it possible to start the VyOS VM again without rebooting the whole host machine:

echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/remove 
echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.1/remove 
echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.2/remove 
echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.3/remove 
echo 1 | sudo tee /sys/bus/pci/rescan 
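
If this has to be done regularly, the remove-and-rescan steps can be wrapped in a small script. This is only a sketch, assuming the passed-through NIC exposes its four functions at 0000:01:00.0 through 0000:01:00.3; adjust the address and function list for your card:

#!/bin/sh
# Detach each function of the passed-through NIC from the PCI bus.
for fn in 0 1 2 3; do
    echo 1 | sudo tee "/sys/bus/pci/devices/0000:01:00.${fn}/remove"
done
# Rescan the bus so the kernel re-enumerates the functions and rebinds their drivers.
echo 1 | sudo tee /sys/bus/pci/rescan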

What is your host OS? I have a feeling this is an issue with the host OS not allowing incus to stop the network devices. It could also be a combination of the host OS and the particular hardware; otherwise it would presumably be easier to reproduce.

Just a recap of what I had figured out: it seems to be an old bug that was reported back in 2022 against lxd in this post. The best explanation for the issue came from @brauner, who said:

When the container is stopped LXC will move the network device back to the host. In order to do that it will use a “transient” name which it has used during interface creation. It’s basically a low-effort way to avoid name collisions on the host when moving back a network device that usually has a high-collision-probability name such as “eth0” in the container.

In the final step it is renamed from the transient name to its original name on the host. Since the rename step fails after the device has been moved back, it seems somewhat likely that it’s a naming collision, i.e. its original host name has been taken by another device.

So lxc renames the interface to the transient names we see and then tries to restore the original name. The error occurs at that last step: the rename back to the original name fails for some unknown reason.

This is why I believe it’s an issue between lxc (which incus uses) and the host OS, and perhaps also something specific to the particular hardware interface.
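
As a side note, when this happens to a container and the interface is left on the host under one of those transient names, it can usually be renamed back by hand instead of rebooting. A rough sketch with ip link, assuming the original name (enp1s0f0 here, purely as an example) is currently unused and the transient name shows up in the link list:

# Look for an auto-generated transient name instead of the expected interface name.
ip -br link
# The interface typically needs to be down before it can be renamed.
sudo ip link set dev <transient-name> down
# Rename it back to its original host name.
sudo ip link set dev <transient-name> name enp1s0f0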

I’m seeing the same kind of problem with a VyOS guest VM on an Arch Linux host. The same issue has occurred with an Intel-based 4-port gigabit Ethernet NIC and with a Mellanox ConnectX-5 dual-SFP28 NIC. The remove and rescan trick works with either of the two NICs.