Virtual machine bridge network fails after about 10 seconds on LXD 5.14

maxwell-k · June 2, 2023, 6:46pm

When I create a new virtual machine with LXD 5.14 connected to a bridge network, after about 10 seconds the network stops functioning.

If I restart the guest it, the network sometimes comes back for about 10 seconds.
Disabling the firewall on the host has no effect.
I tested with Fedora 38 as the host and with Ubuntu and Alpine Linux guests.
I configured LXD with lxd init --auto using 5.13; then I configure systemd-resolved
I don’t experience the problem with containers only virtual machines.

Installing

I install with snap, for 5.13:

sudo snap refresh --channel=5.13/stable lxd

For 5.14:

sudo snap refresh --channel=latest/stable lxd

lxc version

Output on 5.13:

Client version: 5.13
Server version: 5.13

Output on 5.14:

Client version: 5.14
Server version: 5.14

Testing

Launch a new VM, wait ten seconds and trying apt update

lxc launch ubuntu:22.04 vm1 --vm \
&& sleep 10 \
&& lxc exec vm1 -- apt update

Result on 5.13: Success

Result on 5.14: Fails - unable to connect or failure resolving

Trying to connect to port 22:

nc -z -v vm1.lxd 22

Result on 5.13: Success

Result on 5.14: Fails no route to host or name or service not known

Clean up

lxc stop vm1 \
&& lxc delete vm1

Checking ip neigh show dev lxdbr0, I can see the entry goes from DELAY →
REACHABLE → INCOMPLETE or FAILED → … → FAILED

tcpdump

Checking with:

sudo tcpdump --interface lxdbr0 -nn | tee log.txt

I can see DHCP requests and replies

Initially when the network is functioning I can see ARP request and replies
like:

18:57:25.787012 ARP, Request who-has 10.55.190.221 tell 10.55.190.1, length 28
18:57:25.787781 ARP, Reply 10.55.190.221 is-at 00:16:3e:8c:64:8d, length 28

When the network fails I can see requests like the following without reply:

18:58:40.802081 ARP, Request who-has 10.55.190.221 tell 10.55.190.1, length 28

From the guest

Testing from inside the guest there is no route to the host (10.55.190.1) e.g. ping gives Destination host unreachable

Details

lxc info

``` config: {} api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache - amd_sev - storage_pool_loop_resize - migration_vm_live - ovn_nic_nesting - oidc - network_ovn_l3only - ovn_nic_acceleration_vdpa - cluster_healing - instances_state_total api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls environment: addresses: [] architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- MIIB/TCCAYSgAwIBAgIRANRLZbjXp/Rw043ojX5yNjwwCgYIKoZIzj0EAwMwMjEc MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzESMBAGA1UEAwwJcm9vdEBkZWxs MB4XDTIzMDYwMjEwMDkyMVoXDTMzMDUzMDEwMDkyMVowMjEcMBoGA1UEChMTbGlu dXhjb250YWluZXJzLm9yZzESMBAGA1UEAwwJcm9vdEBkZWxsMHYwEAYHKoZIzj0C AQYFK4EEACIDYgAEJgZjVuEOSGAC6O8A47kX61R7k5eoJEGDyTzYNrYHM7jeLM8S 23ow+a7N/wYxQ4FBSnvw1hsQaCEvGBkvKfCRwz3RPNqTEaqlXyaSQCkzFNwhpa7Q VCGPjBp3N382wMMko14wXDAOBgNVHQ8BAf8EBAMCBaAwEwYDVR0lBAwwCgYIKwYB BQUHAwEwDAYDVR0TAQH/BAIwADAnBgNVHREEIDAeggRkZWxshwR/AAABhxAAAAAA AAAAAAAAAAAAAAABMAoGCCqGSM49BAMDA2cAMGQCMBhAKrKmXzwa0qvVcFNG4Ucw TEk9kWGkvncGZW+lX7bPtTEsdQVnV+mxtdKiUK8qnQIwFMqW2SKtMU3NzuOVbxES dnw0VslSQG/cqXgXpRsOkn5hE+JpX7tco1BaSb3/KMWi -----END CERTIFICATE----- certificate_fingerprint: bf835c29966c2a75efadac1c55965b7570c2d867d9e4642da48fa57e646f8774 driver: lxc | qemu driver_version: 5.0.2 | 8.0.0 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" shiftfs: "false" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 6.3.4-201.fc38.x86_64 lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Fedora Linux os_version: "38" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: dell server_pid: 226797 server_version: "5.14" storage: btrfs storage_version: 5.16.2 storage_supported_drivers: - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.47.0 remote: false - name: btrfs version: 5.16.2 remote: false - name: ceph version: 17.2.5 remote: true - name: cephfs version: 17.2.5 remote: true - name: cephobject version: 17.2.5 remote: true ```

/etc/os-release

NAME="Fedora Linux"
VERSION="38 (Workstation Edition)"
ID=fedora
VERSION_ID=38
VERSION_CODENAME=""
PLATFORM_ID="platform:f38"
PRETTY_NAME="Fedora Linux 38 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:38"
DEFAULT_HOSTNAME="fedora"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f38/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=38
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=38
SUPPORT_END=2024-05-14
VARIANT="Workstation Edition"
VARIANT_ID=workstation

$ uname -a
Linux dell 6.3.4-201.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Sat May 27 15:08:36 UTC 2023 x86_64 GNU/Linux

I’m not sure how to proceed. Do I stick to 5.13? Should I report a bug or issue? Does more information help?

tomp · June 2, 2023, 9:27pm

I’ve confirmed the issue, it appears to be a problem with the vhost_net kernel module.
I can see kernel processes on the host pegging the cpu at 100%. And packet loss.

LXD 5.14 enabled vhost-net CPU offloading for VM NICs, so that likely explains why its not affecting LXD 5.13, although its strange that we are apparently only seeing this issue on Fedora hosts.

I’ll investigate more.

You could try blacklisting the vhost_net kernel module and then restarting your host and see if that helps.

midnight_caller · June 4, 2023, 7:34am

Hello ! It seems that it doesn’t only happen on Fedora, as I’m on Debian 11 and face the same issue.
Not sure about the 10 seconds, but my VM’s are started with the network adapter joined in an unmanaged bridge. I can SSH in the VM but then it loses connectivity.

+-------+----------+---------+------+------+-------------+---------+-------+
| NAME  |   TYPE   | MANAGED | IPV4 | IPV6 | DESCRIPTION | USED BY | STATE |
+-------+----------+---------+------+------+-------------+---------+-------+
| bond0 | bond     | NO      |      |      |             | 0       |       |
+-------+----------+---------+------+------+-------------+---------+-------+
| br0   | bridge   | NO      |      |      |             | 27      |       |
+-------+----------+---------+------+------+-------------+---------+-------+

When upgrading to 5.14, connectivity of the VM is lost (no ping, arp -n doesn’t find the MAC address, etc). Also the kernel “vhost” process is taking 100% of a CPU all the time.
Reverting to 5.13 fixes the issue.

tomp · June 4, 2023, 8:05am

Is this with alpine guests still?

midnight_caller · June 4, 2023, 8:20am

Didn’t try with Alpine. All my VM’s are debian/11/cloud based.

tomp · June 4, 2023, 8:45am

Ive got a potential fix, but as it takes time to manifest itself I run out of time Friday evening to confirm. But I did observe that there are some differences in how qemu would have configured the tun device (had vhost-net been working) compared to how lxd is setting it up. I wonder if that is revealing a bug in the vhost-net driver.

But I’ve not seen it on ubuntu jammy hwe 5.19 hosts so looks like the issue is affecting newer or different kernels.

Hristo_Netov · June 4, 2023, 10:19am

same issue with my server - VM can’t get IP with 5.14 and is working fine with 5.13
Unfortunately LXD 5.13 edge channel is closed (for UI)

[root@lxd~]# sudo snap install --channel=5.13/edge lxd
lxd (5.13/stable) 5.13-8e2d7eb from Canonical✓ installed
Channel 5.13/edge for lxd is closed; temporarily forwarding to 5.13/stable.
[root@lxd ~]#

OS info:

[root@lxd ~]# uname -a
Linux lxd 6.3.5-1.el8.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 30 15:48:02 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@lxd ~]# cat /etc/redhat-release 
Rocky Linux release 8.8 (Green Obsidian)
[root@lxd ~]#

maxwell-k · June 4, 2023, 11:23pm

Thanks @tomp ; I am delighted that you can reproduce this. The intermittent / time element of this meant I spent a lot of time looking elsewhere for the problem before I suspected LXD 5.14.

For now I can use 5.13 as a workaround so I hesitate a little to go on. I did follow your suggestion and try tried to blacklist vhost-net but hit an error.

Error blacklisting vhost-net

Same Fedora 38 host
printf 'blacklist vhost_net\n' | sudo tee /etc/modprobe.d/testing.conf
Reboot and check vhost net is not loaded with lsmod
lxc launch ubuntu:22.04 vm1 --vm

Error message:

Error: Failed setting up device via monitor: Error opening /dev/vhost-net for queue 0: open /dev/vhost-net: no such device
Try `lxc info --show-log local:vm1` for more info

There was no useful output in the lxd info command.

I appreciate your help! Thank you

(I can also see that LXD 5.13 doesn’t load vhost_net; which you’ve already confirmed above)

tomp · June 5, 2023, 10:14am

It looks like this fix works:

github.com/lxc/lxd

VM: Fix addNetDevConfig to match the tun interface settings that QEMU uses

lxc:master ← tomponline:tp-vm-vhost-net-tuntap

opened 10:11AM - 05 Jun 23 UTC

tomponline

+3 -3

When QEMU is passed a tun interface to configure itself it sets unix.IFF_ONE_QUE…UE and unix.IFF_VNET_HDR. LXD was not setting these on the tun interface it was configuring before passing via fd. This was causing some problems with packet loss and high CPU usage caused by the vhost-net kernel process. Reported from https://discuss.linuxcontainers.org/t/virtual-machine-bridge-network-fails-after-about-10-seconds-on-lxd-5-14/17324

RandomUser · June 10, 2023, 3:34am

how this bug sneaked into the stable version? it’s reproducable on my all ubuntu LTS machines.

tomp · June 12, 2023, 7:24am

Interesting, it did not show up in our daily automated tests, nor did I ever reproduce it on Ubuntu Jammy HWE and LTS kernels.

What version of Ubuntu are you running, can you also confirm that the latest/edge channels resolves it?

sudo snap refresh lxd --channel=latest/edge

RandomUser · June 23, 2023, 12:20am

I noticed this bug only affected AMD machines. It fully affected Jammy HWE and LTS kernels. Thanks for fixing it in edge.

tomp · June 23, 2023, 6:15am

Ah that may explain why I didn’t experience it earlier and why our tests were fine. Its fixed in lxd 5.15.

maxwell-k · July 4, 2023, 8:14pm

I have one example of an Intel machine that was affected, my original report was on an 12th Gen Intel(R) Core™ i5-1245U

I have retested using 5.15 and I’m delighted that this is fixed

Thank you for your help!