I rebooted with the -50 kernel, but the issue reappeared within seconds, both in our build process and with the reproducer script.
Okay, so, that’s not related to recent kernel changes. Good news for us.
Interestingly, an older machine with the same setup is not affected.
Does it have the same processor (128 threads), or one with fewer threads?
Does the data in the container /proc/stat file come straight from the host kernel or does lxcfs massage it in any way?
No, it comes through the lxcfs FUSE layer, because we hook the CPU count and related values.
Thanks a lot for your test with the older kernel, it’s really helpful. I’ll try to work out what’s happening here. On my 6-core / 12-thread machine it’s not reproducible )-:
Can you confirm that the issue appeared after a software upgrade on your host? That is, the hardware, the number of containers on the node, and other things were not changed?
The other machines I tested had 64 and 16 threads. While testing them I also oversubscribed the CPUs and generated load with the “stress” tool to see if the issue is load-related.
We did add Mellanox NICs to the machines and installed their DKMS driver. I can’t rule out that being related, but only the 128-thread machine is affected. The number of containers and the types of load did not change significantly, if at all.
If there are any checks you’d like me to run on the machine, let me know. The help is appreciated!
You can try putting some threads on your 128-thread EPYC into offline mode using the CPU hotplug feature, like this: echo 0 > /sys/devices/system/cpu/cpu65/online (then turn it back on after the experiment by writing 1 to the same sysfs file). You can try disabling all threads from 65 to 127 and check whether the issue is still reproducible (or even disable all threads except 32). That may give us a hint.
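A small loop makes the experiment less tedious. This is only a sketch that assumes the threads are numbered cpu0..cpu127; it prints what it would do, with the actual sysfs write left commented out (run it as root and uncomment the write on the test machine):

```shell
#!/bin/sh
# Offline threads 65..127 via CPU hotplug (assumes cpu0..cpu127 exist).
# Requires root for the actual write; re-enable afterwards by writing 1
# back to each file.
for cpu in $(seq 65 127); do
    f="/sys/devices/system/cpu/cpu${cpu}/online"
    echo "would offline cpu${cpu} via ${f}"
    # echo 0 > "$f"
done
```

Re-enabling is the same loop with `echo 1 > "$f"`.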
So I was able to reproduce the issue on the 64-thread server too; it just took longer.
Looking at the temp.txt generated by the reproducer when the loop breaks, the pattern is that the number of CPUs reported by /proc/stat was either 4 or the total number of host CPUs. During the looping I also got occasional “cat: /proc/stat: Invalid argument” errors, which seems very wrong.
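For context, a minimal loop in this spirit (a hypothetical sketch, not the actual reproducer script) snapshots /proc/stat, counts the per-CPU lines, and stops on a mismatch or a read error:

```shell
#!/bin/sh
# Snapshot /proc/stat repeatedly and compare the number of cpuN lines
# against the first reading; a failing cat is where the
# "Invalid argument" error would surface.
expected=$(grep -c '^cpu[0-9]' /proc/stat)
for i in $(seq 1 1000); do
    cat /proc/stat > temp.txt || { echo "read failed on iteration $i"; break; }
    n=$(grep -c '^cpu[0-9]' temp.txt)
    if [ "$n" -ne "$expected" ]; then
        echo "CPU count changed: got $n, expected $expected"
        break
    fi
done
```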
@zrav LXD 5.9 was released yesterday; you can try updating your snap, as it contains this fix for LXCFS. Hope it helps in your case. If not, we’ll continue the investigation.
@zrav this change was picked up in the last build. Please try snap refresh lxd and check which revision you get (it should be higher than 24164). And yes, you’ll need to reboot.
Yes, that seems to have done the trick; the issue can’t be reproduced anymore.
Thank you for getting in a fix so quickly. Once again I’m very impressed by the LXD team!
After running for a few days, /proc/cpuinfo and the other lxcfs mounts became unreadable:
df -h
df: /proc/cpuinfo: Transport endpoint is not connected
df: /proc/diskstats: Transport endpoint is not connected
df: /proc/loadavg: Transport endpoint is not connected
df: /proc/meminfo: Transport endpoint is not connected
df: /proc/slabinfo: Transport endpoint is not connected
df: /proc/stat: Transport endpoint is not connected
df: /proc/swaps: Transport endpoint is not connected
df: /proc/uptime: Transport endpoint is not connected
df: /sys/devices/system/cpu/online: Transport endpoint is not connected
df: /var/snap/lxd/common/var/lib/lxcfs: Transport endpoint is not connected
LXCFS crashed:
show_signal_msg: 14 callbacks suppressed
lxcfs[3219179]: segfault at 0 ip 00007f8084afdf81 sp 00007f8084a2e780 error 6 in libc-2.31.so[7f8084a94000+178000]
Code: 00 00 4c 89 ef 4c 89 4c 24 08 e8 3a 68 00 00 48 89 e9 4c 89 e2 48 89 ee 48 8d 05 2a d2 15 00 4c 89 ef 48 89 84 24 e8 00 00 00 <c6> 45 00 00 e8 06 7e 00 00 89 d9 4c 89 fa 4c 89 f6 4c 89 ef e8 c6