When a particular VM is running, `lxc ls` is slow

I’ve launched a VM with LXD to test out a k3s Kubernetes cluster. While this VM is running, `lxc ls` is very slow.

`lxc stop` actually times out. The `lxc info` command shows no logs (not sure if that’s supposed to be the case, since it’s stopped).
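For reference, a hard stop can still be requested with the standard `--force` flag while debugging (nothing specific to this issue, just the usual lxc option):

# skip the clean-shutdown wait and power the VM off immediately
lxc stop k3s-master --force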

❯ lxc info --show-log k3s-master
Name: k3s-master
Location: none
Remote: https://caeser:8443
Architecture: x86_64
Created: 2020/11/18 02:30 UTC
Status: Stopped
Type: virtual-machine
Profiles: default

Log:
Verbose server info (output of plain `lxc info`):
config:
  core.https_address: '[::]:8443'
  core.trust_password: true
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses:
  - 192.168.1.174:8443
  - 10.0.3.1:8443
  - 192.168.122.1:8443
  - 10.205.94.1:8443
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
redacted
    -----END CERTIFICATE-----
  certificate_fingerprint: redacted
  driver: lxc
  driver_version: 4.0.5
  firewall: xtables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.8.0-7630-generic
  lxc_features:
    cgroup2: "true"
    devpts_fd: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Pop!_OS
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: false
  server_name: caeser
  server_pid: 1728
  server_version: "4.8"
  storage: btrfs | zfs
  storage_version: 4.15.1 | 0.8.4-1ubuntu11

I have run several VMs on this host before, and this is the first time this has happened. Also, once the VM is up, Kubernetes API calls feel responsive.
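(For a rough number, timing a request through the bundled kubectl from inside the guest is enough; `time` is just the shell builtin and `k3s kubectl` is the client k3s ships with:)

time k3s kubectl get nodes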

I am running LXD 4.8.

One thing that’s different about this VM from the other VMs I’ve run: k3s sets up several new network interfaces.
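For reference, those extra interfaces can be listed from inside the guest with plain iproute2 (the exact names depend on the CNI k3s is using, flannel by default):

# brief one-line-per-interface listing inside the guest
ip -br link show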

Any advice on debugging the slowness?

If you can get into your VM by some means other than the lxc tool (e.g. console or SSH), then you can stop the lxd-agent and restart it in debug mode:

systemctl stop lxd-agent
cd /run/lxd_config/drive
./lxd-agent --debug

Then, when running `lxc ls` with just that VM running, it would be interesting to see what is output in the lxd-agent window and the timestamps involved.

E.g. I get:

INFO[11-23|09:34:41] Starting 
DBUG[11-23|09:34:41] Detected stale unix socket, deleting 
DBUG[11-23|09:35:16] Found cert                               name=0
DBUG[11-23|09:35:16] Found cert                               name=0
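For comparison, timing the same command on the host in a second terminal while the agent runs in debug mode gives the client-side delay to line up against those timestamps (`time` is just the shell builtin):

time lxc ls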

Here is the output of running lxc ls twice (from ./lxd-agent --debug)

root@k3s-master:/run/lxd_config/drive# ./lxd-agent --debug
INFO[11-29|02:13:19] Starting 
DBUG[11-29|02:13:19] Detected stale unix socket, deleting 
DBUG[11-29|02:13:44] Found cert                               name=0
DBUG[11-29|02:13:44] Found cert                               name=0
DBUG[11-29|02:14:15] Found cert                               name=0
DBUG[11-29|02:14:15] Found cert                               name=0

I tried disabling the k3s process and restarting the VM. `lxc ls` still feels slow once the VM is started.

Are you able to stop services inside the VM to check which one is causing the slowdown?
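A minimal way to bisect, assuming a stock k3s install (which ships a single `k3s` systemd unit on the server node; the unit name is an assumption about this particular setup):

# inside the VM: stop k3s, then see which services are still running
systemctl stop k3s
systemctl list-units --type=service --state=running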

I’m in the middle of moving this week, so the desktop env where I’ve been doing these experiments is in a box :slight_smile:

My suspicion is k3s, but I can’t confirm it right now. When I get back to this, I’ll want to spin up more VMs to compare. What’s running inside might be a red herring.

You may be hitting this issue:

Are you using ZFS with the snap package?

Most of my instances are, I believe. Is this the way to determine that? I need to check which one is btrfs…

❯ lxc storage list  
+---------+-------------+--------+----------------------------------------+---------+
|  NAME   | DESCRIPTION | DRIVER |                 SOURCE                 | USED BY |
+---------+-------------+--------+----------------------------------------+---------+
| default |             | zfs    | default                                | 14      |
+---------+-------------+--------+----------------------------------------+---------+
| k3s     |             | btrfs  | /var/snap/lxd/common/lxd/disks/k3s.img | 1       |
+---------+-------------+--------+----------------------------------------+---------+

Run `lxc config show <instance> --expanded` for each one and look for the `pool` property on the root disk device.
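A quick way to check that across all instances, sketched for a bash shell (the grep just pulls the `pool:` line out of each expanded root device; only commands already shown above are used):

for i in $(lxc ls --format csv -c n); do
  echo "== $i"
  lxc config show "$i" --expanded | grep -A3 ' root:' | grep 'pool:'
done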

The VM in question is k3s-master, and it appears to use the default profile, whose root disk is on the default pool (zfs).

❯ lxc profile ls
+------------+---------+
|    NAME    | USED BY |
+------------+---------+
| centos-lab | 1       |
+------------+---------+
| default    | 7       |
+------------+---------+
| k3s        | 0       |
+------------+---------+
| k3s-vm     | 0       |
+------------+---------+

❯ lxc profile show default
config: {}
description: Default LXD profile
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
used_by:
- /1.0/instances/chemist
- /1.0/instances/chemist-agent
- /1.0/instances/buildbot
- /1.0/instances/puppet
- /1.0/instances/k3s-master
- /1.0/instances/keycloak
- /1.0/instances/buildy

❯ lxc config show k3s-master --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu focal amd64 (20201117_07:42)
  image.os: Ubuntu
  image.release: focal
  image.serial: "20201117_07:42"
  image.type: disk-kvm.img
  image.variant: cloud
  limits.cpu: "4"
  limits.memory: 8GB
  volatile.base_image: e8a59b415ea01bffde17fe2eeb1eb01164b41dfd6fc39a82de0210286394db26
  volatile.eth0.hwaddr: 00:16:3e:8c:2a:4e
  volatile.last_state.power: STOPPED
  volatile.uuid: 71643c4c-3f56-49cc-aa58-2e8329cc8a7e
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""

So I may be encountering the issue you posted. This feels like a recent development, but I haven’t launched too many VMs, to be honest. I’ve only been using LXD for about six months.

Yes, that will most likely be it then. It was a fairly recent change to return the block device path when activating the VM’s root volume, but with ZFS this caused the 10s delay when run from inside the snap.

The fix is already in master and LXD 4.9 should be out in a few days.


It would have been introduced by this change, as the call to `getInstanceDisk` was pausing for 10s on ZFS with the snap.

That would correspond with the launch of this VM, roughly. It was the first VM I’d launched in a while. I’ll look for your change to come through, and report back.
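(Checking the installed snap revision and pulling the update once it lands is just the usual snap workflow:)

# show the currently installed LXD snap version/revision
snap list lxd
# refresh to the newest revision on the tracked channel once 4.9 is published
sudo snap refresh lxd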

After updating to LXD 4.9, the CLI feels very snappy again. Marking this as resolved for me.
