When a particular VM is running, `lxc ls` is slow

I’ve launched a VM with LXD to test out a k3s Kubernetes cluster. While this VM is running, `lxc ls` is very slow.
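For a sense of scale, I’ve been judging “very slow” with nothing fancier than the shell’s `time` builtin, comparing the listing with the VM up versus stopped (the instance name is mine; the commands are stock LXD CLI):

```
# With the k3s VM running
time lxc ls

# Force-stop it and compare, then bring it back up
lxc stop k3s-master --force
time lxc ls
lxc start k3s-master
```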

`lxc stop` actually times out, and `lxc info --show-log` shows no log output (not sure if that’s expected, since the VM is stopped).

❯ lxc info --show-log k3s-master
Name: k3s-master
Location: none
Remote: https://caeser:8443
Architecture: x86_64
Created: 2020/11/18 02:30 UTC
Status: Stopped
Type: virtual-machine
Profiles: default

Log:

Verbose LXD info (`lxc info` output from the host):
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.1.174:8443
  - 10.0.3.1:8443
  - 192.168.122.1:8443
  - 10.205.94.1:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
redacted
    -----END CERTIFICATE-----
  certificate_fingerprint: redacted
  driver: lxc
  driver_version: 4.0.5
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.8.0-7630-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Pop!_OS
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: caeser
  server_pid: 1728
  server_version: "4.8"
  storage: btrfs | zfs
  storage_version: 4.15.1 | 0.8.4-1ubuntu11

I have run several VMs on this host before, and this is the first time this has happened. Also, once the VM is up, Kubernetes API calls feel responsive.

I am running LXD 4.8.

One thing that is different about this VM compared to the others I’ve run: k3s sets up several new network interfaces.
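For illustration, the extra interfaces are easy to see from inside the guest with iproute2 (the cni0/flannel names in the comment are just what k3s typically creates, not something I’ve captured here):

```
# Inside the VM: brief view of links and addresses.
# A k3s node usually adds cni0, flannel.1 and one veth per pod,
# so this list is much longer than on a plain VM.
ip -br link show
ip -br addr show
```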

Any advice on debugging the slowness?

If you can get into your VM by some means other than the lxc tool (e.g. console or SSH), then you can stop the lxd-agent and restart it in debug mode:

systemctl stop lxd-agent
cd /run/lxd_config/drive
./lxd-agent --debug

Then, when running `lxc ls` with just that VM running, it would be interesting to see what is output in the lxd-agent window and the timestamps involved.

E.g. I get:

INFO[11-23|09:34:41] Starting 
DBUG[11-23|09:34:41] Detected stale unix socket, deleting 
DBUG[11-23|09:35:16] Found cert                               name=0
DBUG[11-23|09:35:16] Found cert                               name=0
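(For completeness, if SSH isn’t set up, the VM’s text console is the other way in, assuming the image gives you a login prompt there:)

```
# Attach to the VM's console from the host; Ctrl+a then q detaches
lxc console k3s-master
```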

Here is the output from `./lxd-agent --debug` while running `lxc ls` twice:

root@k3s-master:/run/lxd_config/drive# ./lxd-agent --debug
INFO[11-29|02:13:19] Starting 
DBUG[11-29|02:13:19] Detected stale unix socket, deleting 
DBUG[11-29|02:13:44] Found cert                               name=0
DBUG[11-29|02:13:44] Found cert                               name=0
DBUG[11-29|02:14:15] Found cert                               name=0
DBUG[11-29|02:14:15] Found cert                               name=0

I tried disabling the k3s process and restarting the VM. `lxc ls` still feels slow once the VM is started.
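For the record, the client side can be made verbose too (this is just the stock global `--debug` flag on the lxc command), which should show which request is taking the time:

```
# Run the listing with client-side debug logging to see which
# API request is the slow one
lxc ls --debug
```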

Are you able to stop services inside the VM to check which one is causing the slowdown?
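Something along these lines should narrow it down; the unit name here is an assumption on my part, so substitute whatever `systemctl list-units --type=service` reports inside the VM:

```
# Inside the VM: stop the main suspect first
systemctl stop k3s

# From the host: time the listing again
time lxc ls

# If it's still slow, keep stopping candidates (containerd, etc.)
# one at a time and re-time lxc ls after each stop
```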

I’m in the middle of moving this week, so the desktop env where I’ve been doing these experiments is in a box :slight_smile:

My suspicion is k3s, but I can’t confirm it right now. When I get back to this, I’ll want to spin up more VMs to compare. What’s running inside might be a red herring.